[LU-11762] replay-single test 0d fails with 'post-failover df failed' Created: 11/Dec/18 Updated: 17/Oct/20 Resolved: 17/Oct/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0, Lustre 2.12.1, Lustre 2.12.2 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | James A Simmons |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
replay-single test_0d fails with 'post-failover df failed' due to all clients being evicted and not recovering.

Looking at the logs from a recent failure, https://testing.whamcloud.com/test_sets/d34a9c44-fd82-11e8-b970-52540065bddc , in the client test_log we see there is a problem mounting the file system on the second client (vm4):

Started lustre-MDT0000
Starting client: trevis-26vm3: -o user_xattr,flock trevis-26vm6@tcp:/lustre /mnt/lustre
CMD: trevis-26vm3 mkdir -p /mnt/lustre
CMD: trevis-26vm3 mount -t lustre -o user_xattr,flock trevis-26vm6@tcp:/lustre /mnt/lustre
trevis-26vm4: error: invalid path '/mnt/lustre': Input/output error
replay-single test_0d: @@@@@@ FAIL: post-failover df failed

Looking at the dmesg log from client 2 (vm4), we see the following errors:

[44229.221245] LustreError: 166-1: MGC10.9.5.67@tcp: Connection to MGS (at 10.9.5.67@tcp) was lost; in progress operations using this service will fail
[44254.268743] Lustre: Evicted from MGS (at 10.9.5.67@tcp) after server handle changed from 0x306f28dc59d36b9 to 0x306f28dc59d3cc4
[44425.483787] LustreError: 11-0: lustre-MDT0000-mdc-ffff88007a5ac800: operation mds_reint to node 10.9.5.67@tcp failed: rc = -107
[44429.540695] LustreError: 167-0: lustre-MDT0000-mdc-ffff88007a5ac800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[44429.542381] LustreError: 29222:0:(file.c:4393:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -5
[44429.542384] LustreError: 29222:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 15 previous similar messages
[44429.547526] Lustre: lustre-MDT0000-mdc-ffff88007a5ac800: Connection restored to 10.9.5.67@tcp (at 10.9.5.67@tcp)
[44429.547533] Lustre: Skipped 1 previous similar message
[44429.613758] Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-single test_0d: @@@@@@ FAIL: post-failover df failed

In the dmesg log for the MDS (vm6), we see:

[44131.617072] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
[44135.460894] Lustre: 2440:0:(ldlm_lib.c:1771:extend_recovery_timer()) lustre-MDT0000: extended recovery timer reaching hard limit: 180, extend: 0
[44196.726935] Lustre: lustre-MDT0000: Denying connection for new client f33a3fe0-b38c-7f20-7b19-3c32e6a1bff3(at 10.9.5.64@tcp), waiting for 2 known clients (0 recovered, 1 in progress, and 0 evicted) already passed deadline 3:05
[44196.728849] Lustre: Skipped 21 previous similar messages
[44311.673038] Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
[44311.673797] Lustre: lustre-MDT0000: disconnecting 1 stale clients
[44311.674391] Lustre: Skipped 1 previous similar message
[44311.675031] Lustre: 2500:0:(ldlm_lib.c:1771:extend_recovery_timer()) lustre-MDT0000: extended recovery timer reaching hard limit: 180, extend: 1
[44311.676331] Lustre: 2500:0:(ldlm_lib.c:2048:target_recovery_overseer()) lustre-MDT0000 recovery is aborted by hard timeout
[44311.677355] Lustre: 2500:0:(ldlm_lib.c:2048:target_recovery_overseer()) Skipped 2 previous similar messages
[44311.678318] Lustre: 2500:0:(ldlm_lib.c:2058:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[44311.679301] Lustre: 2500:0:(ldlm_lib.c:2058:target_recovery_overseer()) Skipped 2 previous similar messages
[44311.680369] LustreError: 2500:0:(tgt_grant.c:248:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 0 != fo_tot_granted 2097152
[44311.681531] Lustre: 2500:0:(ldlm_lib.c:1617:abort_req_replay_queue()) @@@ aborted: req@ffff922b644d6400 x1619523210909360/t0(12884901890) o36->94cd1843-54cb-a4d4-a0d3-b3519f2b7d2a@10.9.5.65@tcp:356/0 lens 512/0 e 3 to 0 dl 1544506121 ref 1 fl Complete:/4/ffffffff rc 0/-1
[44311.739670] Lustre: lustre-MDT0000: Recovery over after 3:00, of 2 clients 0 recovered and 2 were evicted.
[44311.930592] Lustre: lustre-MDT0000: Connection restored to e9848982-35c6-9607-086a-2eb07fd9bf44 (at 10.9.5.64@tcp)
[44311.931571] Lustre: Skipped 46 previous similar messages
[44315.952804] Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-single test_0d: @@@@@@ FAIL: post-failover df failed

We also see replay-single test 0c fail with similar messages in the logs: https://testing.whamcloud.com/test_sets/d20239e0-fd79-11e8-a97c-52540065bddc . More logs for these failures are at |
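For reference, a minimal sketch of re-running just this subtest and inspecting the recovery state is shown below. It assumes a standard lustre/tests installation and ssh access to the MDS; the path and the node name 'mds1' are only placeholders.

```bash
# Sketch only: re-run replay-single subtest 0d with the Lustre test framework.
# Assumes the usual test configuration has already been set up; the tests
# directory and node name below are placeholders for the local environment.
cd /usr/lib64/lustre/tests
ONLY=0d bash replay-single.sh

# While recovery is in progress (or after it aborts), check the MDT recovery
# state on the MDS node; this reports status, connected_clients,
# completed_clients and evicted_clients.
ssh mds1 "lctl get_param mdt.*.recovery_status"

# On the affected nodes, the messages quoted in the description can be
# pulled out of the kernel log directly.
dmesg | grep -E "evicted|Denying connection|recovery is aborted"
```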
| Comments |
| Comment by Peter Jones [ 12/Dec/18 ] |
|
Hongchao, can you please assess this issue? Thanks, Peter |
| Comment by Jian Yu [ 14/Dec/18 ] |
|
replay-single test 62 hit a similar issue on the master branch. Note that this appears to be LU-12304. |
| Comment by Gerrit Updater [ 21/Dec/18 ] |
|
Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33907 |
| Comment by Gerrit Updater [ 12/Mar/19 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/34408 |
| Comment by Sarah Liu [ 09/Apr/19 ] |
|
SOAK hit a similar error when running on b2_12-next #6, which caused clients to be evicted.

application error 259372-mdtestfpp.out

04/07/2019 10:44:51: Process 1(soak-29.spirit.whamcloud.com): FAILED in create_remove_directory_tree, Unable to create directory: Input/output error
[soak-28][[62764,1],0][btl_tcp_frag.c:238:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[soak-43][[62764,1],6][btl_tcp_frag.c:238:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
slurmstepd: error: *** STEP 259372.0 ON soak-28 CANCELLED AT 2019-04-07T10:45:00 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: soak-29: task 1: Exited with exit code 1
srun: Terminating job step 259372.0
srun: error: soak-28: task 0: Killed
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.
--------------------------------------------------------------------------
srun: error: soak-30: task 2: Killed
srun: error: soak-31: task 3: Killed
srun: error: soak-42: task 5: Killed
srun: error: soak-41: task 4: Killed
srun: error: soak-43: task 6: Killed

client console (soak-28)

Apr 7 10:43:47 soak-28 systemd-logind: Removed session 5255.
Apr 7 10:43:47 soak-28 systemd: Removed slice User Slice of root.
Apr 7 10:44:08 soak-28 kernel: Lustre: soaked-MDT0003-mdc-ffffa001df2d3800: Connection to soaked-MDT0003 (at 192.168.1.110@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Apr 7 10:44:08 soak-28 kernel: LustreError: 167-0: soaked-MDT0003-mdc-ffffa001df2d3800: This client was evicted by soaked-MDT0003; in progress operations using this service will fail.
Apr 7 10:44:08 soak-28 kernel: Lustre: soaked-MDT0003-mdc-ffffa001df2d3800: Connection restored to 192.168.1.111@o2ib (at 192.168.1.111@o2ib)
Apr 7 10:44:48 soak-28 kernel: LustreError: 98728:0:(llite_lib.c:2412:ll_prep_inode()) new_inode -fatal: rc -2
Apr 7 10:44:49 soak-28 kernel: LustreError: 98764:0:(llite_lib.c:2412:ll_prep_inode()) new_inode -fatal: rc -2
Apr 7 10:44:49 soak-28 kernel: LustreError: 98764:0:(llite_lib.c:2412:ll_prep_inode()) Skipped 13 previous similar messages

MDS console (soak-11)

Apr 7 10:43:59 soak-11 systemd: Removed slice User Slice of root.
Apr 7 10:43:59 soak-11 kernel: LustreError: 13326:0:(lod_dev.c:434:lod_sub_recovery_thread()) soaked-MDT0003-osd get update log failed: rc = -108
Apr 7 10:43:59 soak-11 kernel: LustreError: 13326:0:(lod_dev.c:434:lod_sub_recovery_thread()) Skipped 2 previous similar messages
Apr 7 10:44:01 soak-11 multipathd: 360080e50001fedb80000015952012962: sdp - rdac checker reports path is up
Apr 7 10:44:01 soak-11 multipathd: 8:240: reinstated
Apr 7 10:44:01 soak-11 multipathd: 360080e50001fedb80000015952012962: load table [0 31247843328 multipath 2 queue_if_no_path retain_attached_hw_handler 1 rdac 2 1 round-robin 0 1 1 8:240 1 round-robin 0 1 1 8:128 1]
Apr 7 10:44:05 soak-11 kernel: Lustre: soaked-MDT0003: Connection restored to 18fc806b-f2ae-c23f-8207-c2b41766066a (at 192.168.1.129@o2ib)
Apr 7 10:44:05 soak-11 kernel: Lustre: 13330:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Apr 7 10:44:05 soak-11 kernel: Lustre: soaked-MDT0003: disconnecting 23 stale clients
Apr 7 10:44:05 soak-11 kernel: LustreError: 13330:0:(tgt_grant.c:248:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 2097152 != fo_tot_granted 8388608
Apr 7 10:44:05 soak-11 kernel: Lustre: soaked-MDT0003: Denying connection for new client 0a90ab6f-d87f-2ae2-08eb-fb39e165047b(at 192.168.1.136@o2ib), waiting for 24 known clients (1 recovered, 0 in progress, and 23 evicted) already passed deadline 9:50
Apr 7 10:44:05 soak-11 kernel: Lustre: Skipped 3 previous similar messages
Apr 7 10:44:05 soak-11 kernel: Lustre: soaked-MDT0003: Connection restored to b662025f-1a48-f603-425e-9a1b0fe94ac8 (at 192.168.1.125@o2ib)
Apr 7 10:44:05 soak-11 kernel: Lustre: Skipped 1 previous similar message
Apr 7 10:44:07 soak-11 kernel: Lustre: soaked-MDT0003: Connection restored to f2e27319-0d77-c9dc-e387-bbb2fb32fd1e (at 192.168.1.130@o2ib)
Apr 7 10:44:07 soak-11 kernel: Lustre: Skipped 5 previous similar messages
Apr 7 10:44:09 soak-11 kernel: Lustre: soaked-MDT0003: Connection restored to ea0c92cc-ca60-a13b-ca50-b173ec24752c (at 192.168.1.116@o2ib)
Apr 7 10:44:09 soak-11 kernel: Lustre: Skipped 10 previous similar messages |
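As a quick client-side check during soak runs like this, dumping the MDC import state shows whether each metadata connection came back after the failover. This is only a sketch; on these nodes the pattern would match the soaked-MDT000N-mdc-* devices.

```bash
# Sketch: on an affected client, show each MDC import's target and current
# state (e.g. FULL, CONNECTING, DISCONN, EVICTED) to confirm whether the
# connection to the failed-over MDT was re-established.
lctl get_param mdc.*.import | grep -E "target:|state:"
```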
| Comment by James A Simmons [ 09/Apr/19 ] |
|
We have a potential fix for this as well as another fix that is needed for recovery. Both will need to be landed. |
| Comment by James Nunez (Inactive) [ 14/May/19 ] |
|
We're seeing something similar with replay-single test 62 for ldiskfs/DNE for 2.12.2 RC1 at https://testing.whamcloud.com/test_sets/78994818-753c-11e9-a6f9-52540065bddc . We see the following in the client 2 dmesg:

[64633.042303] Lustre: DEBUG MARKER: == replay-single test 0d: expired recovery with no clients =========================================== 09:24:46 (1557653086)
[64633.892025] Lustre: DEBUG MARKER: mcreate /mnt/lustre/fsa-$(hostname); rm /mnt/lustre/fsa-$(hostname)
[64634.219536] Lustre: DEBUG MARKER: if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-$(hostname); rm /mnt/lustre2/fsa-$(hostname); fi
[64647.402309] LustreError: 166-1: MGC10.2.4.96@tcp: Connection to MGS (at 10.2.4.96@tcp) was lost; in progress operations using this service will fail
[64655.316668] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-30vm4.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[64655.543605] Lustre: DEBUG MARKER: onyx-30vm4.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[64657.419063] Lustre: Evicted from MGS (at 10.2.4.96@tcp) after server handle changed from 0x4e283717d3799726 to 0x4e283717d3799d54
[64657.421383] LustreError: 17190:0:(import.c:1267:ptlrpc_connect_interpret()) lustre-MDT0000_UUID went back in time (transno 4295093706 was previously committed, server now claims 4295093699)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
[64837.930298] LustreError: 11-0: lustre-MDT0000-mdc-ffff91839c315000: operation mds_reint to node 10.2.4.96@tcp failed: rc = -107
[64842.704776] LustreError: 167-0: lustre-MDT0000-mdc-ffff91839c315000: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[64843.836846] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null

I'm not sure if this error has the same cause as the others in this ticket; the "went back in time" error message was not mentioned here before. And why in the world are we referring to bugzilla.lustre.org? |
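To tell whether a given run matches this signature rather than the plain recovery-timeout one, a simple grep over the collected client console logs can help; the log path below is only a placeholder.

```bash
# Sketch: look for the eviction and "went back in time" markers in a saved
# client console/dmesg log; /tmp/client2-dmesg.log is a placeholder path.
grep -E "Evicted from MGS|was evicted by|went back in time" /tmp/client2-dmesg.log
```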
| Comment by James A Simmons [ 20/Jun/19 ] |
|
So I just talked to James Nunez and he reported that this bug has not been seen in the last 4 weeks on master. I'm thinking it's possible that patch https://review.whamcloud.com/34710 resolved this. I would like to do a cleanup patch since that code is not easy to understand. |
| Comment by Gerrit Updater [ 12/Jul/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34408/ |
| Comment by James A Simmons [ 12/Jul/19 ] |
|
The patch that landed was a cleanup patch, but Oleg still sees the failure in his test harness, so I need to work with him to reproduce this problem. |
| Comment by Gerrit Updater [ 26/Jul/19 ] |
|
Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35627 |
| Comment by James A Simmons [ 23/Sep/19 ] |
|
|
| Comment by Gerrit Updater [ 05/Dec/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36936 |
| Comment by Gerrit Updater [ 06/Dec/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35627/ |
| Comment by Gerrit Updater [ 03/Jan/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36936/ |
| Comment by Gerrit Updater [ 04/Jan/20 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37141 |
| Comment by Mikhail Pershin [ 20/Aug/20 ] |
|
There are several reports pointing to this ticket as the reason for failures; check the linked tickets: |
| Comment by James A Simmons [ 20/Aug/20 ] |
|
Hongchao Zhang, if we revert this patch do you see replay-single 0d start to fail again? |
| Comment by James A Simmons [ 17/Oct/20 ] |
|
A new patch landed to fix this problem: https://review.whamcloud.com/#/c/39532/ |