[LU-5407] Failover failure on test suite replay-single test_58c: test_58c failed with 2
| Created: | 24/Jul/14 | Updated: | 01/Nov/16 | Resolved: | 26/May/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.7.0, Lustre 2.5.3, Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | zfs |
| Environment: |
lustre-b2_6-rc2 client is SLES11 SP3 |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 15044 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/79386184-0f6e-11e4-aee3-5254006e85c2.

The sub-test test_58c failed with the following error: test_58c failed with 2

May be related with:

== replay-single test 58c: resend/reconstruct setxattr op ============================================ 09:15:20 (1405786520)
CMD: onyx-63vm3 dumpe2fs -h /dev/lvm-Role_MDS/P1 2>&1 | grep -E -q '(ea_inode|large_xattr)'
Starting client: onyx-63vm1: -o user_xattr,flock onyx-63vm3:onyx-63vm7:/lustre /mnt/lustre2
CMD: onyx-63vm1 mkdir -p /mnt/lustre2
CMD: onyx-63vm1 mount -t lustre -o user_xattr,flock onyx-63vm3:onyx-63vm7:/lustre /mnt/lustre2
mount.lustre: mount onyx-63vm3:onyx-63vm7:/lustre at /mnt/lustre2 failed: Input/output error
Is the MGS running?
CMD: onyx-63vm3 lctl set_param fail_loc=0x123
fail_loc=0x123
CMD: onyx-63vm1 setfattr -n trusted.foo -v bar /mnt/lustre/d58c.replay-single/f58c.replay-single
CMD: onyx-63vm3 lctl set_param fail_loc=0
fail_loc=0
getfattr: /mnt/lustre2/d58c.replay-single/f58c.replay-single: No such file or directory
replay-single test_58c: @@@@@@ FAIL: test_58c failed with 2

client dmesg:

[129081.475975] Lustre: DEBUG MARKER: == replay-single test 58c: resend/reconstruct setxattr op ============================================ 09:15:20 (1405786520)
[129081.758649] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre2
[129081.771133] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock onyx-63vm3:onyx-63vm7:/lustre /mnt/lustre2
[129081.784961] LustreError: 15c-8: MGC10.2.5.138@tcp: The configuration from log 'lustre-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
[129081.785213] Lustre: Unmounted lustre-client
[129081.785422] LustreError: 22063:0:(obd_mount.c:1342:lustre_fill_super()) Unable to mount (-5)
[129082.023875] Lustre: DEBUG MARKER: setfattr -n trusted.foo -v bar /mnt/lustre/d58c.replay-single/f58c.replay-single
[129134.337350] Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-single test_58c: @@@@@@ FAIL: test_58c failed with 2
[129134.445678] Lustre: DEBUG MARKER: replay-single test_58c: @@@@@@ FAIL: test_58c failed with 2 |
| Comments |
| Comment by Jian Yu [ 20/Aug/14 ] |
|
Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/80/ The same failure occurred: |
| Comment by Jian Yu [ 21/Aug/14 ] |
|
One more instance on Lustre b2_5 branch: |
| Comment by Jian Yu [ 31/Aug/14 ] |
|
Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/86/ (2.5.3 RC1) The same failure occurred: https://testing.hpdd.intel.com/test_sets/d3d551f2-3123-11e4-b503-5254006e85c2 |
| Comment by James Nunez (Inactive) [ 10/Feb/15 ] |
|
replay-single test 58b fails with the same (similar?) getfattr error:

== replay-single test 58b: test replay of setxattr op == 01:19:17 (1423041557)
pdsh@c13: mds01: ssh exited with exit code 1
Starting client: c13: -o user_xattr,flock mds01@o2ib:/scratch /lustre/scratch2
mount.lustre: mount mds01@o2ib:/scratch at /lustre/scratch2 failed: Input/output error
Is the MGS running?
Filesystem           1K-blocks    Used  Available Use% Mounted on
mds01@o2ib:/scratch 1253608724 2831236 1185541688   1% /lustre/scratch
Failing mds1 on mds01
Stopping /lustre/scratch/mdt0 (opts:) on mds01
pdsh@c13: mds01: ssh exited with exit code 1
reboot facets: mds1
Failover mds1 to mds01
01:19:37 (1423041577) waiting for mds01 network 900 secs ...
01:19:37 (1423041577) network interface is UP
mount facets: mds1
Starting mds1: /dev/lvm-sdc/MDT0 /lustre/scratch/mdt0
Started scratch-MDT0000
c13: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
getfattr: /lustre/scratch2/d58b.replay-single/f58b.replay-single: No such file or directory
replay-single test_58b: @@@@@@ FAIL: test_58b failed with 1

Results for lustre-master 2.6.93 tag at https://testing.hpdd.intel.com/test_sets/c7685e64-ade2-11e4-a0b6-5254006e85c2 |
| Comment by nasf (Inactive) [ 09/Mar/15 ] |
|
Hit it on master: |
| Comment by Bruno Faccini (Inactive) [ 09/Mar/15 ] |
|
+1 at https://testing.hpdd.intel.com/test_sets/495484d0-c3cc-11e4-869d-5254006e85c2. |
| Comment by John Hammond [ 08/Apr/15 ] |
|
I think this is likely the same issue as |
| Comment by James Nunez (Inactive) [ 08/May/15 ] |
|
Replay-single test 58c used to fail with the error "test_58c failed with 2". After the patch for , it now fails with "xattr set (yyyyyyyyyyyyyyyyyyyyyyyyy … yyy) is not what was returned ()" or "xattr set (bar) is not what was returned ()".

Here are all the test 58c failures since 01 April 2015. Comparing the number of failures in April against the failures so far in May, this test appears to be failing more often in the past week. It also looks to be impacting ZFS servers more than ldiskfs servers (marked with "ldiskfs") lately.

ldiskfs - 2015-04-06 08:08:06 - https://testing.hpdd.intel.com/test_sets/32c396f8-dc6c-11e4-a69a-5254006e85c2 |
| Comment by Andreas Dilger [ 08/May/15 ] |
| Comment by Andreas Dilger [ 09/May/15 ] |
|
It looks to me like this failure has become much more common since May 2. Likely candidates are two patches that landed on May 1: |
| Comment by Andreas Dilger [ 09/May/15 ] |
|
James, would it be possible for you to fix test_58c so that the error message isn't so long? |
| Comment by John Hammond [ 10/May/15 ] |
|
> James, would it be possible for you to fix test_58c so that the error message isn't so long? And to check for errors from mount? |
| Comment by Peter Jones [ 11/May/15 ] |
|
Hongchao, could you please look into this issue? Thanks, Peter |
| Comment by Gerrit Updater [ 11/May/15 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/14766 |
| Comment by James Nunez (Inactive) [ 11/May/15 ] |
|
Please note that patch http://review.whamcloud.com/14766 only modifies the error message and checks if the client mount succeeds. This patch does not fix the test failure. |
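The mount check described here can be sketched roughly as follows. This is an illustrative assumption, not the actual code of patch http://review.whamcloud.com/14766; the helper name and messages are made up, and the demo uses stand-in commands so it can run anywhere:

```shell
#!/bin/bash
# Illustrative sketch only: mount_client_checked and its messages are
# assumptions, not the real test-framework code from patch 14766.

# Run a mount command and surface its exit status instead of silently
# continuing (the logs in this ticket show the test proceeding past
# "mount.lustre: ... failed: Input/output error").
mount_client_checked() {
    "$@" || { echo "FAIL: client mount failed with rc=$?" >&2; return 2; }
}

# Demo with stand-in commands; the real test would run something like:
#   mount -t lustre -o user_xattr,flock onyx-63vm3:onyx-63vm7:/lustre /mnt/lustre2
mount_client_checked true && echo "mount ok"
mount_client_checked false || echo "mount failure detected"
```

With such a check the test aborts at the mount step instead of failing later with the misleading getfattr "No such file or directory" error.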
| Comment by Andreas Dilger [ 12/May/15 ] |
|
This problem is causing about 1/3 of all test failures in review-zfs, so I am raising this to Blocker status. Hongchao, can you please treat this as a priority? |
| Comment by Hongchao Zhang [ 13/May/15 ] |
|
Hi Andreas, okay, I'll try to analyze it and create a patch ASAP, thanks! |
| Comment by Hongchao Zhang [ 13/May/15 ] |
|
This problem is caused by the disconnected import state of the MGC at the client, which has not yet recovered after the MDS failover in test_58b.

The log from the MDT:

00000100:00100000:0.0:1431060944.791714:0:20964:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.1.5.31@tcp, seq: 74
00000100:00100000:0.0:1431060944.791720:0:20964:0:(service.c:2075:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0002:0+-99:2640:x1500570864992708:12345-10.1.5.31@tcp:101
00000020:00080000:0.0:1431060944.791729:0:20964:0:(tgt_handler.c:622:tgt_request_handle()) operation 101 on unconnected OST from 12345-10.1.5.31@tcp
00000100:00100000:0.0:1431060944.791779:0:20964:0:(service.c:2125:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0002:0+-99:2640:x1500570864992708:12345-10.1.5.31@tcp:101 Request procesed in 59us (193us total) trans 0 rc -107/-107
00000100:00100000:0.0:1431060944.791787:0:20964:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.1.5.31@tcp, seq: 74
00000100:00100000:0.0:1431060944.792290:0:3344:0:(events.c:349:request_in_callback()) peer: 12345-10.1.5.31@tcp
00000100:00100000:0.0:1431060944.792296:0:20964:0:(service.c:1927:ptlrpc_server_handle_req_in()) got req x1500570864992712
00000100:00100000:0.0:1431060944.792313:0:20964:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.1.5.31@tcp, seq: 75
00000100:00100000:0.0:1431060944.792314:0:20964:0:(service.c:2075:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0002:0+-99:539:x1500570864992712:12345-10.1.5.31@tcp:250
00010000:00080000:0.0:1431060944.792327:0:20964:0:(ldlm_lib.c:1045:target_handle_connect()) MGS: connection from 5108cdc4-cb46-6929-784d-073b1dfbc568@10.1.5.31@tcp t0 exp (null) cur 1431060944 last 0
00000020:00000080:0.0:1431060944.792353:0:20964:0:(genops.c:1146:class_connect()) connect: client 5108cdc4-cb46-6929-784d-073b1dfbc568, cookie 0x96ce524a05765d59
00000020:01000000:0.0:1431060944.792358:0:20964:0:(lprocfs_status_server.c:307:lprocfs_exp_setup()) using hash ffff8800791d5180
00000100:00100000:0.0:1431060944.792411:0:20964:0:(service.c:2125:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0002:5108cdc4-cb46-6929-784d-073b1dfbc568+4:539:x1500570864992712:12345-10.1.5.31@tcp:250 Request procesed in 96us (122us total) trans 0 rc 0/0
00000100:00100000:0.0:1431060944.792413:0:20964:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.1.5.31@tcp, seq: 75

The config lock enqueue from the MGC at 1431060944.791714 failed in tgt_request_handle() with "operation 101 on unconnected OST from 12345-10.1.5.31@tcp"; the MGC only reconnected afterwards, at 1431060944.792327. |
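Given this diagnosis, one test-side mitigation would be to wait until the client's MGC import is back in FULL state before attempting the second mount. The sketch below is an assumption about how such a wait could parse the import state, not the actual fix (patch 14792 addresses the MGC itself); it runs against a canned sample here, whereas a live test would read the dump with `lctl get_param -n mgc.*.import`:

```shell
#!/bin/bash
# Hypothetical sketch: check whether the MGC import has reconnected
# before remounting. A real test would poll and retry with a timeout.

# Extract the "state:" field from an import dump.
import_state() {
    printf '%s\n' "$1" | awk '/state:/ { print $2; exit }'
}

# Canned sample resembling an import dump while the MGC is still
# disconnected after the MDS failover in test_58b.
sample_import='import:
    name: MGC10.2.5.138@tcp
    state: DISCONN'

state=$(import_state "$sample_import")
echo "MGC import state: $state"
if [ "$state" != "FULL" ]; then
    echo "MGC not connected yet; a real test would sleep and re-poll here"
fi
```

This mirrors the check the suite already does for MDC imports ("mds_server_uuid in FULL state after 0 sec" in the test 58b log above), applied to the MGC instead.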
| Comment by Gerrit Updater [ 13/May/15 ] |
|
Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/14792 |
| Comment by Gerrit Updater [ 19/May/15 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/14864 |
| Comment by Gerrit Updater [ 20/May/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14864/ |
| Comment by Gerrit Updater [ 26/May/15 ] |
|
Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/14792/ |
| Comment by Gerrit Updater [ 09/Jun/15 ] |
|
Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/14766/ |