[LU-5407] Failover failure on test suite replay-single test_58c: test_58c failed with 2 Created: 24/Jul/14  Updated: 01/Nov/16  Resolved: 26/May/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.5.3, Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: zfs
Environment:

lustre-b2_6-rc2 client is SLES11 SP3


Issue Links:
Related
is related to LU-5420 Failure on test suite sanity test_17m... Resolved
is related to LU-6573 multiple tests: client evicted, Input... Resolved
Severity: 3
Rank (Obsolete): 15044

Description

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/79386184-0f6e-11e4-aee3-5254006e85c2.

The sub-test test_58c failed with the following error:

test_58c failed with 2

May be related to LU-3625.
Test log:

== replay-single test 58c: resend/reconstruct setxattr op ============================================ 09:15:20 (1405786520)
CMD: onyx-63vm3 dumpe2fs -h /dev/lvm-Role_MDS/P1 2>&1 |
		grep -E -q '(ea_inode|large_xattr)'
Starting client: onyx-63vm1: -o user_xattr,flock onyx-63vm3:onyx-63vm7:/lustre /mnt/lustre2
CMD: onyx-63vm1 mkdir -p /mnt/lustre2
CMD: onyx-63vm1 mount -t lustre -o user_xattr,flock onyx-63vm3:onyx-63vm7:/lustre /mnt/lustre2
mount.lustre: mount onyx-63vm3:onyx-63vm7:/lustre at /mnt/lustre2 failed: Input/output error
Is the MGS running?
CMD: onyx-63vm3 lctl set_param fail_loc=0x123
fail_loc=0x123
CMD: onyx-63vm1 setfattr -n trusted.foo -v bar /mnt/lustre/d58c.replay-single/f58c.replay-single
CMD: onyx-63vm3 lctl set_param fail_loc=0
fail_loc=0
getfattr: /mnt/lustre2/d58c.replay-single/f58c.replay-single: No such file or directory
 replay-single test_58c: @@@@@@ FAIL: test_58c failed with 2 
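
For orientation, here is a minimal sketch of the flow the log above captures, reconstructed from the test output rather than quoted from replay-single.sh (helper names such as mount_client and do_facet follow Lustre's test-framework.sh conventions; the assumption here is that fail_loc 0x123 makes the MDS drop a request so the client must resend the setxattr):

# sketch of the test_58c flow; not the verbatim test
mount_client $MOUNT2                              # second mount at /mnt/lustre2;
                                                  # this is the step failing with EIO above
do_facet mds1 "lctl set_param fail_loc=0x123"     # force the resend/reconstruct path
setfattr -n trusted.foo -v bar $DIR/$tdir/$tfile  # setxattr is resent after the drop
do_facet mds1 "lctl set_param fail_loc=0"
# verify through the second mount point; because $MOUNT2 was never mounted,
# the path does not exist and the test fails with 2 (ENOENT)
getfattr -n trusted.foo $MOUNT2/$tdir/$tfile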

Client dmesg:

[129081.475975] Lustre: DEBUG MARKER: == replay-single test 58c: resend/reconstruct setxattr op ============================================ 09:15:20 (1405786520)
[129081.758649] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre2
[129081.771133] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock onyx-63vm3:onyx-63vm7:/lustre /mnt/lustre2
[129081.784961] LustreError: 15c-8: MGC10.2.5.138@tcp: The configuration from log 'lustre-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
[129081.785213] Lustre: Unmounted lustre-client
[129081.785422] LustreError: 22063:0:(obd_mount.c:1342:lustre_fill_super()) Unable to mount  (-5)
[129082.023875] Lustre: DEBUG MARKER: setfattr -n trusted.foo -v bar /mnt/lustre/d58c.replay-single/f58c.replay-single
[129134.337350] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_58c: @@@@@@ FAIL: test_58c failed with 2 
[129134.445678] Lustre: DEBUG MARKER: replay-single test_58c: @@@@@@ FAIL: test_58c failed with 2


Comments
Comment by Jian Yu [ 20/Aug/14 ]

Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/80/
Distro/Arch: RHEL6.5/x86_64
Test Group: failover
FSTYPE=zfs

The same failure occurred:
https://testing.hpdd.intel.com/test_sets/21ecf6d8-26a7-11e4-84f2-5254006e85c2

Comment by Jian Yu [ 21/Aug/14 ]

One more instance on Lustre b2_5 branch:
https://testing.hpdd.intel.com/test_sets/c237917a-2904-11e4-9362-5254006e85c2

Comment by Jian Yu [ 31/Aug/14 ]

Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/86/ (2.5.3 RC1)
Distro/Arch: RHEL6.5/x86_64
Test Group: failover
FSTYPE=zfs

The same failure occurred: https://testing.hpdd.intel.com/test_sets/d3d551f2-3123-11e4-b503-5254006e85c2

Comment by James Nunez (Inactive) [ 10/Feb/15 ]

replay-single test 58b fails with the same (similar?) getfattr error:

== replay-single test 58b: test replay of setxattr op == 01:19:17 (1423041557)
pdsh@c13: mds01: ssh exited with exit code 1
Starting client: c13:  -o user_xattr,flock mds01@o2ib:/scratch /lustre/scratch2
mount.lustre: mount mds01@o2ib:/scratch at /lustre/scratch2 failed: Input/output error
Is the MGS running?
Filesystem           1K-blocks    Used  Available Use% Mounted on
mds01@o2ib:/scratch 1253608724 2831236 1185541688   1% /lustre/scratch
Failing mds1 on mds01
Stopping /lustre/scratch/mdt0 (opts:) on mds01
pdsh@c13: mds01: ssh exited with exit code 1
reboot facets: mds1
Failover mds1 to mds01
01:19:37 (1423041577) waiting for mds01 network 900 secs ...
01:19:37 (1423041577) network interface is UP
mount facets: mds1
Starting mds1:   /dev/lvm-sdc/MDT0 /lustre/scratch/mdt0
Started scratch-MDT0000
c13: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
getfattr: /lustre/scratch2/d58b.replay-single/f58b.replay-single: No such file or directory
 replay-single test_58b: @@@@@@ FAIL: test_58b failed with 1 

Results for the lustre-master 2.6.93 tag are at https://testing.hpdd.intel.com/test_sets/c7685e64-ade2-11e4-a0b6-5254006e85c2

Comment by nasf (Inactive) [ 09/Mar/15 ]

Hit it on master:
https://testing.hpdd.intel.com/test_sets/8c1f6af0-c5a2-11e4-9ec4-5254006e85c2

Comment by Bruno Faccini (Inactive) [ 09/Mar/15 ]

+1 at https://testing.hpdd.intel.com/test_sets/495484d0-c3cc-11e4-869d-5254006e85c2.

Comment by John Hammond [ 08/Apr/15 ]

I think this is likely the same issue as LU-5420.

Comment by James Nunez (Inactive) [ 08/May/15 ]

Replay-single test 58c used to fail with the error

test_58c failed with 2

After the patch for LU-2524 (Change-Id Ib09102e50f855550db801180be3f7fc42911191a), the failure message was changed to the unfortunately long

xattr set (yyyyyyyyyyyyyyyyyyyyyyyyy … yyy) is not what was returned ()

or

xattr set (bar) is not what was returned ()
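
A short sketch of one way to keep that message readable (illustrative only, not necessarily what the eventual patch does; check_xattr is a hypothetical helper, and error is the standard test-framework.sh failure hook):

# hypothetical helper: compare the full values, report only a prefix
check_xattr() {
    local file=$1 expected=$2 got
    got=$(getfattr --absolute-names --only-values -n trusted.foo "$file")
    [ "$got" == "$expected" ] ||
        error "xattr set (${expected:0:16}...) is not what was returned (${got:0:16}...)"
}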

Here are all the test 58c failures since 01 April 2015. Comparing the number of failures in April with those so far in May, this test appears to be failing more often in the past week, and lately it impacts ZFS servers more than ldiskfs servers (ldiskfs runs are marked "ldiskfs" below).

ldiskfs - 2015-04-06 08:08:06 - https://testing.hpdd.intel.com/test_sets/32c396f8-dc6c-11e4-a69a-5254006e85c2
ldiskfs - 2015-04-06 10:21:25 - https://testing.hpdd.intel.com/test_sets/5525c23c-dcd2-11e4-9d45-5254006e85c2
ldiskfs - 2015-04-06 12:29:29 - https://testing.hpdd.intel.com/test_sets/56cae55c-dc8e-11e4-a69a-5254006e85c2
ldiskfs - 2015-04-06 14:48:52 - https://testing.hpdd.intel.com/test_sets/d090287c-dceb-11e4-babc-5254006e85c2
ldiskfs - 2015-04-07 11:42:00 - https://testing.hpdd.intel.com/test_sets/1d3d0418-dd57-11e4-babc-5254006e85c2
ldiskfs - 2015-04-07 18:42:37 - https://testing.hpdd.intel.com/test_sets/5b4c0db4-dd87-11e4-a807-5254006e85c2
2015-04-08 12:53:15 - https://testing.hpdd.intel.com/test_sets/04435bac-de37-11e4-90b9-5254006e85c2
ldiskfs - 2015-04-08 18:56:27 - https://testing.hpdd.intel.com/test_sets/9bebb710-de64-11e4-90b9-5254006e85c2
ldiskfs - 2015-04-09 23:09:15 - https://testing.hpdd.intel.com/test_sets/74066ef2-df9a-11e4-b5b0-5254006e85c2
ldiskfs - 2015-04-11 03:57:25 - https://testing.hpdd.intel.com/test_sets/ccd71cf2-e08f-11e4-93dc-5254006e85c2
ldiskfs - 2015-04-20 14:21:18 - https://testing.hpdd.intel.com/test_sets/de50f082-e7a3-11e4-bdb1-5254006e85c2
2015-04-26 11:36:43 - https://testing.hpdd.intel.com/test_sets/6c1a6f4c-ec31-11e4-8eb7-5254006e85c2
2015-05-01 19:16:59 - https://testing.hpdd.intel.com/test_sets/722e18e4-f04a-11e4-9bb2-5254006e85c2
2015-05-02 08:13:07 - https://testing.hpdd.intel.com/test_sets/03f97924-f11a-11e4-9e14-5254006e85c2
2015-05-02 13:40:16 - https://testing.hpdd.intel.com/test_sets/75a610ae-f143-11e4-bb65-5254006e85c2
2015-05-03 23:34:14 - https://testing.hpdd.intel.com/test_sets/2244f05c-f200-11e4-aa2e-5254006e85c2
2015-05-04 13:41:15 - https://testing.hpdd.intel.com/test_sets/0d145610-f2d1-11e4-aad2-5254006e85c2
2015-05-04 20:22:35 - https://testing.hpdd.intel.com/test_sets/4787eeca-f307-11e4-9186-5254006e85c2
2015-05-05 08:11:46 - https://testing.hpdd.intel.com/test_sets/7a76c3c6-f382-11e4-a51d-5254006e85c2
2015-05-05 18:14:34 - https://testing.hpdd.intel.com/test_sets/29879778-f3b0-11e4-8b3b-5254006e85c2
2015-05-05 23:56:04 - https://testing.hpdd.intel.com/test_sets/9febf6d6-f3e0-11e4-b108-5254006e85c2
2015-05-06 04:48:54 - https://testing.hpdd.intel.com/test_sets/a6e591e2-f401-11e4-a594-5254006e85c2
2015-05-06 09:08:40 - https://testing.hpdd.intel.com/test_sets/08da0d34-f442-11e4-a594-5254006e85c2
2015-05-06 16:55:59 - https://testing.hpdd.intel.com/test_sets/ff57f6e2-f476-11e4-ac9e-5254006e85c2
ldiskfs - 2015-05-06 17:03:23 - https://testing.hpdd.intel.com/test_sets/9825fe38-f4a3-11e4-b783-5254006e85c2
ldiskfs - 2015-05-07 06:25:06 - https://testing.hpdd.intel.com/test_sets/62441088-f535-11e4-ac9e-5254006e85c2
2015-05-07 20:17:28 - https://testing.hpdd.intel.com/test_sets/b4ccb126-f552-11e4-91fd-5254006e85c2
2015-05-07 20:58:18 - https://testing.hpdd.intel.com/test_sets/b89ca102-f54e-11e4-91fd-5254006e85c2
2015-05-07 22:18:20 - https://testing.hpdd.intel.com/test_sets/04cf6b6a-f566-11e4-90f4-5254006e85c2
2015-05-07 23:29:52 - https://testing.hpdd.intel.com/test_sets/23893fd8-f573-11e4-90f4-5254006e85c2
2015-05-08 00:38:48 - https://testing.hpdd.intel.com/test_sets/375e2178-f586-11e4-af5c-5254006e85c2
2015-05-08 02:55:30 - https://testing.hpdd.intel.com/test_sets/8647e13c-f588-11e4-90f4-5254006e85c2
2015-05-08 04:02:37 - https://testing.hpdd.intel.com/test_sets/80073dac-f5ad-11e4-91fd-5254006e85c2

Comment by Andreas Dilger [ 08/May/15 ]

I think the recent cases marked as LU-5407 are actually LU-6573.

Comment by Andreas Dilger [ 09/May/15 ]

It looks to me like this failure has become much more common since May 2. Likely candidates are two patches that landed on May 1:
http://review.whamcloud.com/9286 "LU-3266 test: regression tests for nrs policies"
http://review.whamcloud.com/14505 "LU-6477 build: Update SPL/ZFS to 0.6.4.1"

Comment by Andreas Dilger [ 09/May/15 ]

James, would it be possible for you to fix test_58c so that the error message isn't so long?

Comment by John Hammond [ 10/May/15 ]

> James, would it be possible for you to fix test_58c so that the error message isn't so long?

And to check for errors from mount?

Comment by Peter Jones [ 11/May/15 ]

Hongchao

Could you please look into this issue?

Thanks

Peter

Comment by Gerrit Updater [ 11/May/15 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/14766
Subject: LU-5407 tests: Error message for replay-single 58b and 58c
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: da8b8fe156ec0fcb06a096c1f6cddb85beec3fe9

Comment by James Nunez (Inactive) [ 11/May/15 ]

Please note that patch http://review.whamcloud.com/14766 only modifies the error message and checks whether the client mount succeeds.

This patch does not fix the test failure.
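
A sketch of the added mount check (mount_client and error are existing test-framework.sh helpers; treat this as illustrative rather than the exact change):

# fail fast if the second client mount fails, instead of letting the
# later getfattr fail with a confusing ENOENT
mount_client $MOUNT2 || error "mount of $MOUNT2 failed"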

Comment by Andreas Dilger [ 12/May/15 ]

This problem is causing about 1/3 of all test failures in review-zfs, so I am increasing this to blocker status.

Hong Chao, can you please treat this as a priority?

Comment by Hongchao Zhang [ 13/May/15 ]

Hi Andreas. Okay, I'll try to analyze this and create a patch ASAP, thanks!

Comment by Hongchao Zhang [ 13/May/15 ]

This problem is caused by the disconnected import state of the MGC on the client, which has not yet recovered after the MDS failover in test_58b.

The log from the MDT:

00000100:00100000:0.0:1431060944.791714:0:20964:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.1.5.31@tcp, seq: 74
00000100:00100000:0.0:1431060944.791720:0:20964:0:(service.c:2075:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0002:0+-99:2640:x1500570864992708:12345-10.1.5.31@tcp:101
00000020:00080000:0.0:1431060944.791729:0:20964:0:(tgt_handler.c:622:tgt_request_handle()) operation 101 on unconnected OST from 12345-10.1.5.31@tcp
00000100:00100000:0.0:1431060944.791779:0:20964:0:(service.c:2125:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0002:0+-99:2640:x1500570864992708:12345-10.1.5.31@tcp:101 Request procesed in 59us (193us total) trans 0 rc -107/-107
00000100:00100000:0.0:1431060944.791787:0:20964:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.1.5.31@tcp, seq: 74
00000100:00100000:0.0:1431060944.792290:0:3344:0:(events.c:349:request_in_callback()) peer: 12345-10.1.5.31@tcp
00000100:00100000:0.0:1431060944.792296:0:20964:0:(service.c:1927:ptlrpc_server_handle_req_in()) got req x1500570864992712
00000100:00100000:0.0:1431060944.792313:0:20964:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.1.5.31@tcp, seq: 75
00000100:00100000:0.0:1431060944.792314:0:20964:0:(service.c:2075:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0002:0+-99:539:x1500570864992712:12345-10.1.5.31@tcp:250
00010000:00080000:0.0:1431060944.792327:0:20964:0:(ldlm_lib.c:1045:target_handle_connect()) MGS: connection from 5108cdc4-cb46-6929-784d-073b1dfbc568@10.1.5.31@tcp t0 exp (null) cur 1431060944 last 0
00000020:00000080:0.0:1431060944.792353:0:20964:0:(genops.c:1146:class_connect()) connect: client 5108cdc4-cb46-6929-784d-073b1dfbc568, cookie 0x96ce524a05765d59
00000020:01000000:0.0:1431060944.792358:0:20964:0:(lprocfs_status_server.c:307:lprocfs_exp_setup()) using hash ffff8800791d5180
00000100:00100000:0.0:1431060944.792411:0:20964:0:(service.c:2125:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0002:5108cdc4-cb46-6929-784d-073b1dfbc568+4:539:x1500570864992712:12345-10.1.5.31@tcp:250 Request procesed in 96us (122us total) trans 0 rc 0/0
00000100:00100000:0.0:1431060944.792413:0:20964:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.1.5.31@tcp, seq: 75

The config lock is enqueued by the MGC at 1431060944.791714 and fails in tgt_request_handle with "operation 101 on unconnected OST from 12345-10.1.5.31@tcp"; the MGC then sends an MGS_CONNECT (opcode 250) to the MGS a little later (1431060944.792313).
In test_58c, the Lustre mount at /mnt/lustre2 failed for the above reason, so getfattr failed with -2 (ENOENT).
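
A sketch of the idea behind the fix (the parameter path and helper name are assumptions, not quotes from the patch): after the failover, poll the client's MGC import until it reports FULL before attempting the second mount.

# wait for the MGC import to reconnect after MDS/MGS failover;
# hypothetical helper modeled on test-framework.sh wait_import_state
wait_mgc_import_state() {
    local expected=$1
    local maxtime=${2:-60} i state
    for ((i = 0; i < maxtime; i++)); do
        state=$($LCTL get_param -n mgc.*.import 2>/dev/null |
                awk '/state:/ { print $2; exit }')
        [ "$state" == "$expected" ] && return 0
        sleep 1
    done
    return 1
}

wait_mgc_import_state FULL || error "MGC import did not reach FULL state"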

Comment by Gerrit Updater [ 13/May/15 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/14792
Subject: LU-5407 test: wait MGC import to reconnect
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3e529aea4d9e52e433084287c40792700c9bdd63

Comment by Gerrit Updater [ 19/May/15 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/14864
Subject: LU-5407 tests: Disable replay-single test 58c
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0dbceb9e6d2835c00733a2fd76b950ff77976305

Comment by Gerrit Updater [ 20/May/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14864/
Subject: LU-5407 tests: Disable replay-single test 58c
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5b0ce8303e4033b3c7b09fda50f013e6d9d002b0

Comment by Gerrit Updater [ 26/May/15 ]

Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/14792/
Subject: LU-5407 test: wait MGC import to finish recovery
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c8602de66d24be2e4cf4750ce79a95e51ef5676d

Comment by Gerrit Updater [ 09/Jun/15 ]

Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/14766/
Subject: LU-5407 tests: Error message for replay-single 58b and 58c
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 88e555dbabfc35521345851ff41516156217b1ec
