[LU-11762] replay-single test 0d fails with 'post-failover df failed' Created: 11/Dec/18  Updated: 17/Oct/20  Resolved: 17/Oct/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.12.1, Lustre 2.12.2
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: James A Simmons
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10950 replay-single test_0c: post-failover ... Reopened
is related to LU-12340 replay-dual test 0b timeouts Resolved
is related to LU-12769 replay-dual test 0b hangs in client m... Resolved
is related to LU-13614 replay-single test_117: LBUG: ASSERTI... Resolved
is related to LU-11771 bad output in target_handle_reconnect... Resolved
is related to LU-13339 patch for LU-11762 causes an assertio... Resolved
is related to LU-9019 Migrate lustre to standard 64 bit tim... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-single test_0d fails with 'post-failover df failed' due to all clients being evicted and not recovering. Looking at the logs from a recent failure, https://testing.whamcloud.com/test_sets/d34a9c44-fd82-11e8-b970-52540065bddc , in the client test_log, we see there is a problem mounting the file system on the second client (vm4)

Started lustre-MDT0000
Starting client: trevis-26vm3:  -o user_xattr,flock trevis-26vm6@tcp:/lustre /mnt/lustre
CMD: trevis-26vm3 mkdir -p /mnt/lustre
CMD: trevis-26vm3 mount -t lustre -o user_xattr,flock trevis-26vm6@tcp:/lustre /mnt/lustre
trevis-26vm4: error: invalid path '/mnt/lustre': Input/output error
 replay-single test_0d: @@@@@@ FAIL: post-failover df failed 

Looking at the dmesg log from client 2 (vm4), we see the following errors

[44229.221245] LustreError: 166-1: MGC10.9.5.67@tcp: Connection to MGS (at 10.9.5.67@tcp) was lost; in progress operations using this service will fail
[44254.268743] Lustre: Evicted from MGS (at 10.9.5.67@tcp) after server handle changed from 0x306f28dc59d36b9 to 0x306f28dc59d3cc4
[44425.483787] LustreError: 11-0: lustre-MDT0000-mdc-ffff88007a5ac800: operation mds_reint to node 10.9.5.67@tcp failed: rc = -107
[44429.540695] LustreError: 167-0: lustre-MDT0000-mdc-ffff88007a5ac800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[44429.542381] LustreError: 29222:0:(file.c:4393:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -5
[44429.542384] LustreError: 29222:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 15 previous similar messages
[44429.547526] Lustre: lustre-MDT0000-mdc-ffff88007a5ac800: Connection restored to 10.9.5.67@tcp (at 10.9.5.67@tcp)
[44429.547533] Lustre: Skipped 1 previous similar message
[44429.613758] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_0d: @@@@@@ FAIL: post-failover df failed 

In the dmesg log for the MDS (vm6), we see

[44131.617072] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
[44135.460894] Lustre: 2440:0:(ldlm_lib.c:1771:extend_recovery_timer()) lustre-MDT0000: extended recovery timer reaching hard limit: 180, extend: 0
[44196.726935] Lustre: lustre-MDT0000: Denying connection for new client f33a3fe0-b38c-7f20-7b19-3c32e6a1bff3(at 10.9.5.64@tcp), waiting for 2 known clients (0 recovered, 1 in progress, and 0 evicted) already passed deadline 3:05
[44196.728849] Lustre: Skipped 21 previous similar messages
[44311.673038] Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
[44311.673797] Lustre: lustre-MDT0000: disconnecting 1 stale clients
[44311.674391] Lustre: Skipped 1 previous similar message
[44311.675031] Lustre: 2500:0:(ldlm_lib.c:1771:extend_recovery_timer()) lustre-MDT0000: extended recovery timer reaching hard limit: 180, extend: 1
[44311.676331] Lustre: 2500:0:(ldlm_lib.c:2048:target_recovery_overseer()) lustre-MDT0000 recovery is aborted by hard timeout
[44311.677355] Lustre: 2500:0:(ldlm_lib.c:2048:target_recovery_overseer()) Skipped 2 previous similar messages
[44311.678318] Lustre: 2500:0:(ldlm_lib.c:2058:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[44311.679301] Lustre: 2500:0:(ldlm_lib.c:2058:target_recovery_overseer()) Skipped 2 previous similar messages
[44311.680369] LustreError: 2500:0:(tgt_grant.c:248:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 0 != fo_tot_granted 2097152
[44311.681531] Lustre: 2500:0:(ldlm_lib.c:1617:abort_req_replay_queue()) @@@ aborted:  req@ffff922b644d6400 x1619523210909360/t0(12884901890) o36->94cd1843-54cb-a4d4-a0d3-b3519f2b7d2a@10.9.5.65@tcp:356/0 lens 512/0 e 3 to 0 dl 1544506121 ref 1 fl Complete:/4/ffffffff rc 0/-1
[44311.739670] Lustre: lustre-MDT0000: Recovery over after 3:00, of 2 clients 0 recovered and 2 were evicted.
[44311.930592] Lustre: lustre-MDT0000: Connection restored to e9848982-35c6-9607-086a-2eb07fd9bf44 (at 10.9.5.64@tcp)
[44311.931571] Lustre: Skipped 46 previous similar messages
[44315.952804] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_0d: @@@@@@ FAIL: post-failover df failed 

We see replay-single test 0c also fail with similar messages in the logs; https://testing.whamcloud.com/test_sets/d20239e0-fd79-11e8-a97c-52540065bddc .

More logs for these failures are at
https://testing.whamcloud.com/test_sets/ea4338ea-fd67-11e8-8a18-52540065bddc
https://testing.whamcloud.com/test_sets/9efcb22c-f712-11e8-815b-52540065bddc



 Comments   
Comment by Peter Jones [ 12/Dec/18 ]

Hongchao

Can you please assess this issue?

Thanks

Peter

Comment by Jian Yu [ 14/Dec/18 ]

replay-single test 62 hit a similar issue on the master branch:
https://testing.whamcloud.com/test_sets/81a8b6dc-fdf0-11e8-b837-52540065bddc

Note this appears to be LU-12304.

Comment by Gerrit Updater [ 21/Dec/18 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33907
Subject: LU-11762 target: don't exceed hard timeout
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c87c2948c5ef142375338b396c4e1d7eb70b5dea

Comment by Gerrit Updater [ 12/Mar/19 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/34408
Subject: LU-11762 ldlm: don't exceed hard timeout
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 888c8f18c25924ab24d893cf4a1811718679847d

Comment by Sarah Liu [ 09/Apr/19 ]

SOAK hit a similar error when running on b2_12-next #6, which caused clients to be evicted.

application error 259372-mdtestfpp.out

04/07/2019 10:44:51: Process 1(soak-29.spirit.whamcloud.com): FAILED in create_remove_directory_tree, Unable to create directory: Input/output error
[soak-28][[62764,1],0][btl_tcp_frag.c:238:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[soak-43][[62764,1],6][btl_tcp_frag.c:238:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
slurmstepd: error: *** STEP 259372.0 ON soak-28 CANCELLED AT 2019-04-07T10:45:00 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: soak-29: task 1: Exited with exit code 1
srun: Terminating job step 259372.0
srun: error: soak-28: task 0: Killed
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
srun: error: soak-30: task 2: Killed
srun: error: soak-31: task 3: Killed
srun: error: soak-42: task 5: Killed
srun: error: soak-41: task 4: Killed
srun: error: soak-43: task 6: Killed

client console (soak-28)

Apr  7 10:43:47 soak-28 systemd-logind: Removed session 5255.
Apr  7 10:43:47 soak-28 systemd: Removed slice User Slice of root.
Apr  7 10:44:08 soak-28 kernel: Lustre: soaked-MDT0003-mdc-ffffa001df2d3800: Connection to soaked-MDT0003 (at 192.168.1.110@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Apr  7 10:44:08 soak-28 kernel: LustreError: 167-0: soaked-MDT0003-mdc-ffffa001df2d3800: This client was evicted by soaked-MDT0003; in progress operations using this service will fail.
Apr  7 10:44:08 soak-28 kernel: Lustre: soaked-MDT0003-mdc-ffffa001df2d3800: Connection restored to 192.168.1.111@o2ib (at 192.168.1.111@o2ib)
Apr  7 10:44:48 soak-28 kernel: LustreError: 98728:0:(llite_lib.c:2412:ll_prep_inode()) new_inode -fatal: rc -2
Apr  7 10:44:49 soak-28 kernel: LustreError: 98764:0:(llite_lib.c:2412:ll_prep_inode()) new_inode -fatal: rc -2
Apr  7 10:44:49 soak-28 kernel: LustreError: 98764:0:(llite_lib.c:2412:ll_prep_inode()) Skipped 13 previous similar messages

MDS console (soak-11)

Apr  7 10:43:59 soak-11 systemd: Removed slice User Slice of root.
Apr  7 10:43:59 soak-11 kernel: LustreError: 13326:0:(lod_dev.c:434:lod_sub_recovery_thread()) soaked-MDT0003-osd get update log failed: rc = -108
Apr  7 10:43:59 soak-11 kernel: LustreError: 13326:0:(lod_dev.c:434:lod_sub_recovery_thread()) Skipped 2 previous similar messages
Apr  7 10:44:01 soak-11 multipathd: 360080e50001fedb80000015952012962: sdp - rdac checker reports path is up
Apr  7 10:44:01 soak-11 multipathd: 8:240: reinstated
Apr  7 10:44:01 soak-11 multipathd: 360080e50001fedb80000015952012962: load table [0 31247843328 multipath 2 queue_if_no_path retain_attached_hw_handler 1 rdac 2 1 round-robin 0 1 1 8:240 1 round-robin 0 1 1 8:128 1]
Apr  7 10:44:05 soak-11 kernel: Lustre: soaked-MDT0003: Connection restored to 18fc806b-f2ae-c23f-8207-c2b41766066a (at 192.168.1.129@o2ib)
Apr  7 10:44:05 soak-11 kernel: Lustre: 13330:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Apr  7 10:44:05 soak-11 kernel: Lustre: soaked-MDT0003: disconnecting 23 stale clients
Apr  7 10:44:05 soak-11 kernel: LustreError: 13330:0:(tgt_grant.c:248:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 2097152 != fo_tot_granted 8388608
Apr  7 10:44:05 soak-11 kernel: Lustre: soaked-MDT0003: Denying connection for new client 0a90ab6f-d87f-2ae2-08eb-fb39e165047b(at 192.168.1.136@o2ib), waiting for 24 known clients (1 recovered, 0 in progress, and 23 evicted) already passed deadline 9:50
Apr  7 10:44:05 soak-11 kernel: Lustre: Skipped 3 previous similar messages
Apr  7 10:44:05 soak-11 kernel: Lustre: soaked-MDT0003: Connection restored to b662025f-1a48-f603-425e-9a1b0fe94ac8 (at 192.168.1.125@o2ib)
Apr  7 10:44:05 soak-11 kernel: Lustre: Skipped 1 previous similar message
Apr  7 10:44:07 soak-11 kernel: Lustre: soaked-MDT0003: Connection restored to f2e27319-0d77-c9dc-e387-bbb2fb32fd1e (at 192.168.1.130@o2ib)
Apr  7 10:44:07 soak-11 kernel: Lustre: Skipped 5 previous similar messages
Apr  7 10:44:09 soak-11 kernel: Lustre: soaked-MDT0003: Connection restored to ea0c92cc-ca60-a13b-ca50-b173ec24752c (at 192.168.1.116@o2ib)
Apr  7 10:44:09 soak-11 kernel: Lustre: Skipped 10 previous similar messages
Comment by James A Simmons [ 09/Apr/19 ]

We have a potential fix for this as well as another fix that is needed for recovery. Both will need to be landed.

Comment by James Nunez (Inactive) [ 14/May/19 ]

We're seeing something similar with replay-single test 62 for ldiskfs/DNE for 2.12.2 RC1 at https://testing.whamcloud.com/test_sets/78994818-753c-11e9-a6f9-52540065bddc . We see the following in the client 2 dmesg

[64633.042303] Lustre: DEBUG MARKER: == replay-single test 0d: expired recovery with no clients =========================================== 09:24:46 (1557653086)
[64633.892025] Lustre: DEBUG MARKER: mcreate /mnt/lustre/fsa-$(hostname); rm /mnt/lustre/fsa-$(hostname)
[64634.219536] Lustre: DEBUG MARKER: if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-$(hostname); rm /mnt/lustre2/fsa-$(hostname); fi
[64647.402309] LustreError: 166-1: MGC10.2.4.96@tcp: Connection to MGS (at 10.2.4.96@tcp) was lost; in progress operations using this service will fail
[64655.316668] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-30vm4.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[64655.543605] Lustre: DEBUG MARKER: onyx-30vm4.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[64657.419063] Lustre: Evicted from MGS (at 10.2.4.96@tcp) after server handle changed from 0x4e283717d3799726 to 0x4e283717d3799d54
[64657.421383] LustreError: 17190:0:(import.c:1267:ptlrpc_connect_interpret()) lustre-MDT0000_UUID went back in time (transno 4295093706 was previously committed, server now claims 4295093699)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
[64837.930298] LustreError: 11-0: lustre-MDT0000-mdc-ffff91839c315000: operation mds_reint to node 10.2.4.96@tcp failed: rc = -107
[64842.704776] LustreError: 167-0: lustre-MDT0000-mdc-ffff91839c315000: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[64843.836846] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null

I'm not sure if this error has the same cause as the others in this ticket. The "went back in time" error message has not been mentioned here before.

and why in the world are we referring to bugzilla.lustre.org?

Comment by James A Simmons [ 20/Jun/19 ]

So I just talked to James Nunez and he reported that this bug has not been seen in the last 4 weeks on master. I'm thinking it's possible that patch https://review.whamcloud.com/34710 resolved this. I would like to do a cleanup patch since that code is not easy to understand.

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34408/
Subject: LU-11762 ldlm: don't exceed hard timeout
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8bfe8939d810f5ac16484d3d4b81f829c7d7d0d7

Comment by James A Simmons [ 12/Jul/19 ]

The patch that landed was a cleanup patch, but Oleg still sees this issue in his test harness, so I need to work with him to reproduce the problem.

Comment by Gerrit Updater [ 26/Jul/19 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35627
Subject: LU-11762 ldlm: ensure the recovery timer is armed
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2ae2c2cc41f3c82393e8a19eb4da1c3644846e93

Comment by James A Simmons [ 23/Sep/19 ]

LU-12769 has a real fix.

Comment by Gerrit Updater [ 05/Dec/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36936
Subject: LU-11762 ldlm: don't exceed hard timeout
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 373b1cb9232caa4457d63dd04bf6d16af83f493f

Comment by Gerrit Updater [ 06/Dec/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35627/
Subject: LU-11762 ldlm: ensure the recovery timer is armed
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fe5c801657f9ddb5e148bb6076e476df6ba31bba

Comment by Gerrit Updater [ 03/Jan/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36936/
Subject: LU-11762 ldlm: don't exceed hard timeout
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: bc27e0f6efbdbd256c6459d15391754ce1b36d32

Comment by Gerrit Updater [ 04/Jan/20 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37141
Subject: LU-11762 ldlm: ensure the recovery timer is armed
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 8e9dae2b1aa53b3be114922c825742271be08a0b

Comment by Mikhail Pershin [ 20/Aug/20 ]

There are several reports pointing to this ticket as the reason for failures; check the linked tickets LU-13614 and LU-13339.

Comment by James A Simmons [ 20/Aug/20 ]

Hongchao Zhang, if we revert this patch, do you see replay-single 0d start to fail again?

Comment by James A Simmons [ 17/Oct/20 ]

A new patch landed to fix this problem: https://review.whamcloud.com/#/c/39532/

Generated at Sat Feb 10 02:46:42 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.