[LU-13608] MDT stuck in WAITING, abort_recov stuck too Created: 28/May/20  Updated: 10/Jan/23  Resolved: 23/Oct/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3
Fix Version/s: Lustre 2.14.0, Lustre 2.12.6, Lustre 2.12.7

Type: Bug Priority: Minor
Reporter: Hongchao Zhang Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None
Environment:

Servers:
Centos 7.7
Kernel 3.10.0-1062.4.1.el7_lustre.x86_64
Lustre 2.12.3
OFED-internal-4.7-1.0.0


Issue Links:
Related
is related to LU-14318 Add the option to limit the overall r... Open
is related to LU-14119 FID-in-LMA [fid1] does not match the ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A network switch went down, taking a number of compute nodes with it. The affected
compute nodes got stuck, and the event also caused some OSSes to hit an
LBUG (LU-12906) and crash.

The MDS went into soft lockups before crashing. When it came back, 3 out of 4 MDTs
mounted and recovered, but one MDT went into WAITING and stayed there.
lctl abort_recov had no effect on its status, so the MDS was rebooted.

When the MDT again went into the WAITING state, a pre-emptive abort_recov was issued
before recovery could time out, but it did not help and the MDT continued trying to recover.
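
For reference, the recovery state was checked and the abort issued through the standard lctl interfaces; a minimal example, with the device index purely illustrative (it comes from "lctl dl" on the MDS):

# check the recovery state of the stuck target
lctl get_param mdt.*.recovery_status
# abort recovery on that target, addressed by its device index
lctl --device <index> abort_recovery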

2020-05-08 18:52:15 [ 1640.880984] Pid: 15922, comm: mdt02_020 3.10.0-1062.4.1.el7_lustre.x86_64 #1 SMP Mon Oct 28 01:39:05 UTC 2019
2020-05-08 18:52:15 [ 1640.892505] Call Trace:
2020-05-08 18:52:15 [ 1640.895708]  [<ffffffffc169bdc0>] ptlrpc_set_wait+0x480/0x790 [ptlrpc]
2020-05-08 18:52:15 [ 1640.903498]  [<ffffffffc169c153>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
2020-05-08 18:52:15 [ 1640.911383]  [<ffffffffc1a9eaf3>] osp_remote_sync+0xd3/0x200 [osp]
2020-05-08 18:52:15 [ 1640.918760]  [<ffffffffc1a84c63>] osp_attr_get+0x463/0x730 [osp]
2020-05-08 18:52:15 [ 1640.925917]  [<ffffffffc1a818cd>] osp_object_init+0x16d/0x2d0 [osp]
2020-05-08 18:52:15 [ 1640.933361]  [<ffffffffc141c59b>] lu_object_start.isra.35+0x8b/0x120 [obdclass]
2020-05-08 18:52:15 [ 1640.941977]  [<ffffffffc1420471>] lu_object_find_at+0x1e1/0xa60 [obdclass]
2020-05-08 18:52:15 [ 1640.950100]  [<ffffffffc1420d06>] lu_object_find+0x16/0x20 [obdclass]
2020-05-08 18:52:15 [ 1640.957737]  [<ffffffffc194a01b>] mdt_object_find+0x4b/0x170 [mdt]
2020-05-08 18:52:15 [ 1640.965074]  [<ffffffffc194cc38>] mdt_getattr_name_lock+0x848/0x1c30 [mdt]
2020-05-08 18:52:15 [ 1640.973194]  [<ffffffffc1954d25>] mdt_intent_getattr+0x2b5/0x480 [mdt]
2020-05-08 18:52:15 [ 1640.980928]  [<ffffffffc1951bb5>] mdt_intent_policy+0x435/0xd80 [mdt]
2020-05-08 18:52:15 [ 1640.988552]  [<ffffffffc1659d56>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
2020-05-08 18:52:15 [ 1640.996483]  [<ffffffffc1682366>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
2020-05-08 18:52:15 [ 1641.004811]  [<ffffffffc170ab02>] tgt_enqueue+0x62/0x210 [ptlrpc]
2020-05-08 18:52:15 [ 1641.012076]  [<ffffffffc17112ea>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
2020-05-08 18:52:15 [ 1641.020213]  [<ffffffffc16b629b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
2020-05-08 18:52:15 [ 1641.029217]  [<ffffffffc16b9bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]

2020-05-08 18:52:15 [ 1641.056176] Pid: 14912, comm: mdt03_006 3.10.0-1062.4.1.el7_lustre.x86_64 #1 SMP Mon Oct 28 01:39:05 UTC 2019
2020-05-08 18:52:15 [ 1641.067667] Call Trace:
2020-05-08 18:52:15 [ 1641.070833]  [<ffffffffc1672b96>] ldlm_completion_ast+0x4e6/0x860 [ptlrpc]
2020-05-08 18:52:15 [ 1641.078951]  [<ffffffffc167492f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
2020-05-08 18:52:15 [ 1641.087272]  [<ffffffffc167751e>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
2020-05-08 18:52:15 [ 1641.095108]  [<ffffffffc1a997f2>] osp_md_object_lock+0x162/0x2d0 [osp]
2020-05-08 18:52:15 [ 1641.102832]  [<ffffffffc10cb193>] lod_object_lock+0xf3/0x7b0 [lod]
2020-05-08 18:52:15 [ 1641.110179]  [<ffffffffc1a2eeee>] mdd_object_lock+0x3e/0xe0 [mdd]
2020-05-08 18:52:15 [ 1641.117429]  [<ffffffffc194a341>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
2020-05-08 18:52:15 [ 1641.125926]  [<ffffffffc194a8da>] mdt_remote_object_lock+0x2a/0x30 [mdt]
2020-05-08 18:52:15 [ 1641.133847]  [<ffffffffc195f2ae>] mdt_rename_lock+0xbe/0x4b0 [mdt]
2020-05-08 18:52:15 [ 1641.141189]  [<ffffffffc1961605>] mdt_reint_rename+0x2c5/0x2b90 [mdt]
2020-05-08 18:52:15 [ 1641.148819]  [<ffffffffc196a693>] mdt_reint_rec+0x83/0x210 [mdt]
2020-05-08 18:52:15 [ 1641.155965]  [<ffffffffc19471b3>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
2020-05-08 18:52:15 [ 1641.163689]  [<ffffffffc1952567>] mdt_reint+0x67/0x140 [mdt]
2020-05-08 18:52:15 [ 1641.170469]  [<ffffffffc17112ea>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
2020-05-08 18:52:15 [ 1641.178601]  [<ffffffffc16b629b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
2020-05-08 18:52:15 [ 1641.187639]  [<ffffffffc16b9bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]


 Comments   
Comment by Alex Zhuravlev [ 28/May/20 ]

what version did you use? any logs?

Comment by Gerrit Updater [ 28/May/20 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38746
Subject: LU-13608 tgt: abort recovery while reading update llog
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8c4b7bf0bf5321a937d39ee1f5e7e907b2c20903

Comment by Alexander Boyko [ 02/Jul/20 ]

I've seen a similar MDS hang and want to add my thoughts.
In my case, MDT02 recovery did not finish because of lod_sub_recovery_thread(): it tried to read update log files that did not exist.

PID: 113444  TASK: ffff8ed602b4a140  CPU: 0   COMMAND: "lod0002_rec0000"
 #0 [ffff8ed5197635c8] __schedule at ffffffff9416d0bc
 #1 [ffff8ed519763658] schedule at ffffffff9416d6e9
 #2 [ffff8ed519763668] schedule_timeout at ffffffff9416b01e
 #3 [ffff8ed519763710] ptlrpc_set_wait at ffffffffc10d42f8 [ptlrpc]
 #4 [ffff8ed5197637b8] ptlrpc_queue_wait at ffffffffc10d4653 [ptlrpc]
 #5 [ffff8ed5197637d8] osp_remote_sync at ffffffffc168a783 [osp]
 #6 [ffff8ed519763828] osp_attr_get at ffffffffc16704d7 [osp]
 #7 [ffff8ed5197638a8] osp_object_init at ffffffffc166d00d [osp]
 #8 [ffff8ed5197638e0] lu_object_start at ffffffffc0dff50b [obdclass]
 #9 [ffff8ed519763938] lu_object_find_at at ffffffffc0e033f1 [obdclass]
#10 [ffff8ed519763a00] dt_locate_at at ffffffffc0e049bd [obdclass]
#11 [ffff8ed519763a20] llog_osd_open at ffffffffc0dc517f [obdclass]
#12 [ffff8ed519763a98] llog_open at ffffffffc0db1f8a [obdclass]
#13 [ffff8ed519763ae0] llog_cat_id2handle at ffffffffc0dbad45 [obdclass]
#14 [ffff8ed519763b50] llog_cat_process_common at ffffffffc0dbb1b9 [obdclass]
#15 [ffff8ed519763ba8] llog_cat_process_cb at ffffffffc0dbc191 [obdclass]
#16 [ffff8ed519763bf8] llog_process_thread at ffffffffc0db586f [obdclass]
#17 [ffff8ed519763d08] llog_process_or_fork at ffffffffc0db6b3c [obdclass]
#18 [ffff8ed519763d70] llog_cat_process_or_fork at ffffffffc0db89e9 [obdclass]
#19 [ffff8ed519763de8] llog_cat_process at ffffffffc0db8b9e [obdclass]
#20 [ffff8ed519763e08] lod_sub_recovery_thread at ffffffffc142b526 [lod]
#21 [ffff8ed519763ea8] kthread at ffffffff93ac2016
#22 [ffff8ed519763f50] ret_from_fork_nospec_begin at ffffffff9417abdd
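
One way to confirm that the update llog files are really missing is to check the backing filesystem offline, much like the reproducer posted below does. A rough sketch, assuming an ldiskfs backend; the device and mountpoint names are only illustrative:

# with the MDT stopped, mount its backing device read-only as ldiskfs
mkdir -p /mnt/mdt_ro
mount -t ldiskfs -o ro /dev/mapper/mdt_dev /mnt/mdt_ro   # device path is an example
ls -l /mnt/mdt_ro/update_log_dir                         # update llog objects live here
umount /mnt/mdt_ro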

MDT02 sent a request to MDT0. On MDT0, osd_fid_lookup() triggered an OI scrub because the file did not exist. MDT0 replied with EINPROGRESS, so MDT02 resent the request; osd_fid_lookup() then triggered the scrub again, and so on.
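
The loop can also be watched from the MDT0 side. A minimal sketch, assuming the ldiskfs osd and that the relevant debug messages are enabled; the grep pattern simply matches the lines quoted below:

# dump the Lustre kernel debug log and look for the repeating lookup/scrub messages
lctl dk /tmp/lustre-debug.txt
grep -E 'osd_fid_lookup|osd_scrub' /tmp/lustre-debug.txt
# current OI scrub state of the backing osd
lctl get_param osd-ldiskfs.*.oi_scrub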

00080000:02000400:11.0:1593417562.046666:0:224060:0:(osd_handler.c:1221:osd_fid_lookup()) fs1-MDT0000: trigger OI scrub by RPC for the [0x200032c9a:0x1:0x0] with flags 0x4a, rc = 0
00100000:10000000:17.0:1593417562.411181:0:2919:0:(osd_scrub.c:536:osd_scrub_post()) fs1-MDT0000: OI scrub post with result = 1
00100000:10000000:17.0:1593417562.411207:0:2919:0:(scrub.c:239:scrub_file_store()) fs1-MDT0000: store scrub file: rc = 0
00100000:10000000:17.0:1593417562.411208:0:2919:0:(osd_scrub.c:1322:osd_scrub_main()) fs1-MDT0000: OI scrub: stop, pos = 3041940257: rc = 1
00100000:10000000:17.0:1593417570.677557:0:2950:0:(osd_scrub.c:459:osd_scrub_prep()) fs1-MDT0000: OI scrub prep, flags = 0x4e
00100000:10000000:17.0:1593417570.677574:0:2950:0:(scrub.c:132:scrub_file_reset()) fs1-MDT0000: reset OI scrub file, old flags = 0x0, add flags = 0x0
00100000:10000000:17.0:1593417570.677600:0:2950:0:(scrub.c:239:scrub_file_store()) fs1-MDT0000: store scrub file: rc = 0
00100000:10000000:17.0:1593417570.677603:0:2950:0:(osd_scrub.c:1307:osd_scrub_main()) fs1-MDT0000: OI scrub start, flags = 0x4e, pos = 12
00080000:02000400:7.0:1593417570.677625:0:225112:0:(osd_handler.c:1221:osd_fid_lookup()) fs1-MDT0000: trigger OI scrub by RPC for the [0x200032c9b:0x2:0x0] with flags 0x4a, rc = 0
00100000:10000000:17.0:1593417571.031439:0:2950:0:(osd_scrub.c:536:osd_scrub_post()) fs1-MDT0000: OI scrub post with result = 1
00100000:10000000:17.0:1593417571.031464:0:2950:0:(scrub.c:239:scrub_file_store()) fs1-MDT0000: store scrub file: rc = 0
00100000:10000000:17.0:1593417571.031465:0:2950:0:(osd_scrub.c:1322:osd_scrub_main()) fs1-MDT0000: OI scrub: stop, pos = 3041940257: rc = 1
00100000:10000000:17.0:1593417573.046732:0:2951:0:(osd_scrub.c:459:osd_scrub_prep()) fs1-MDT0000: OI scrub prep, flags = 0x4e
00100000:10000000:17.0:1593417573.046735:0:2951:0:(scrub.c:132:scrub_file_reset()) fs1-MDT0000: reset OI scrub file, old flags = 0x0, add flags = 0x0
00100000:10000000:17.0:1593417573.046744:0:2951:0:(scrub.c:239:scrub_file_store()) fs1-MDT0000: store scrub file: rc = 0
00100000:10000000:17.0:1593417573.046747:0:2951:0:(osd_scrub.c:1307:osd_scrub_main()) fs1-MDT0000: OI scrub start, flags = 0x4e, pos = 12
00080000:02000400:8.0:1593417573.046823:0:224060:0:(osd_handler.c:1221:osd_fid_lookup()) fs1-MDT0000: trigger OI scrub by RPC for the [0x200032c9a:0x1:0x0] with flags 0x4a, rc = 0
00100000:10000000:17.0:1593417573.403695:0:2951:0:(osd_scrub.c:536:osd_scrub_post()) fs1-MDT0000: OI scrub post with result = 1
00100000:10000000:17.0:1593417573.403720:0:2951:0:(scrub.c:239:scrub_file_store()) fs1-MDT0000: store scrub file: rc = 0
00100000:10000000:17.0:1593417573.403721:0:2951:0:(osd_scrub.c:1322:osd_scrub_main()) fs1-MDT0000: OI scrub: stop, pos = 3041940257: rc = 1
00100000:10000000:17.0:1593417581.677557:0:3081:0:(osd_scrub.c:459:osd_scrub_prep()) fs1-MDT0000: OI scrub prep, flags = 0x4e
00100000:10000000:17.0:1593417581.677574:0:3081:0:(scrub.c:132:scrub_file_reset()) fs1-MDT0000: reset OI scrub file, old flags = 0x0, add flags = 0x0
00100000:10000000:17.0:1593417581.677600:0:3081:0:(scrub.c:239:scrub_file_store()) fs1-MDT0000: store scrub file: rc = 0
00100000:10000000:17.0:1593417581.677603:0:3081:0:(osd_scrub.c:1307:osd_scrub_main()) fs1-MDT0000: OI scrub start, flags = 0x4e, pos = 12
00080000:02000400:7.0:1593417581.677628:0:225112:0:(osd_handler.c:1221:osd_fid_lookup()) fs1-MDT0000: trigger OI scrub by RPC for the [0x200032c9b:0x2:0x0] with flags 0x4a, rc = 0
00100000:10000000:17.0:1593417582.027254:0:3081:0:(osd_scrub.c:536:osd_scrub_post()) fs1-MDT0000: OI scrub post with result = 1
00100000:10000000:17.0:1593417582.027278:0:3081:0:(scrub.c:239:scrub_file_store()) fs1-MDT0000: store scrub file: rc = 0
00100000:10000000:17.0:1593417582.027279:0:3081:0:(osd_scrub.c:1322:osd_scrub_main()) fs1-MDT0000: OI scrub: stop, pos = 3041940257: rc = 1
00100000:10000000:17.0:1593417584.046608:0:3083:0:(osd_scrub.c:459:osd_scrub_prep()) fs1-MDT0000: OI scrub prep, flags = 0x4e
00100000:10000000:17.0:1593417584.046611:0:3083:0:(scrub.c:132:scrub_file_reset()) fs1-MDT0000: reset OI scrub file, old flags = 0x0, add flags = 0x0
00100000:10000000:17.0:1593417584.046622:0:3083:0:(scrub.c:239:scrub_file_store()) fs1-MDT0000: store scrub file: rc = 0
00100000:10000000:17.0:1593417584.046624:0:3083:0:(osd_scrub.c:1307:osd_scrub_main()) fs1-MDT0000: OI scrub start, flags = 0x4e, pos = 12
00080000:02000400:11.0:1593417584.046686:0:224060:0:(osd_handler.c:1221:osd_fid_lookup()) fs1-MDT0000: trigger OI scrub by RPC for the [0x200032c9a:0x1:0x0] with flags 0x4a, rc = 0
 

For me it looks strange that the OI scrub does not return an error after a full scan and a repeated FID.

"LU-13608 tgt: abort recovery while reading update llog" would help the recovery process on the MDT, but I'm not sure about lod_sub_recovery_thread(): it loops at the ptlrpc layer, resending the request on every EINPROGRESS reply.

Comment by Alexander Boyko [ 02/Jul/20 ]

Here is a test that reproduces the update log deletion. The fix doesn't help; I checked abort_recovery, etc.

diff --git a/lustre/tests/sanity.sh b/lustre/tests/sanity.sh
index 414d95e..b2f02f0 100755
--- a/lustre/tests/sanity.sh
+++ b/lustre/tests/sanity.sh
@@ -24177,6 +24177,29 @@ test_902() {
 }
 run_test 902 "test short write doesn't hang lustre"

+test_903() {
+       [ $MDSCOUNT -lt 2 ] && skip "needs >= 2 MDTs"
+
+       local timeout
+
+       timeout=$(do_facet mds2 "$LCTL get_param -n mdt.$FSNAME-MDT0001.recovery_time_hard")
+       for idx in $(seq $MDSCOUNT); do
+               stop mds${idx}
+       done
+
+       do_facet mds1 "mkdir -p /tmp/test_903 && mount -t ldiskfs -o loop $(mdsdevname 1) /tmp/test_903 &&
+               rm -f /tmp/test_903/update_log_dir/* && umount /tmp/test_903 && rm -rf /tmp/test_903"
+
+       for idx in $(seq $MDSCOUNT); do
+               start mds${idx} $(mdsdevname $idx) $MDS_MOUNT_OPTS ||
+                       error "mount mds$idx failed"
+       done
+
+       wait_recovery_complete mds2 $timeout
+
+}
+run_test 903 "don't hang MDS recovery when failed to get update log"
+

Maybe set rq_no_retry_einprogress = 1? @Alex Zhuravlev

lctl get_param mdt.*.recovery_status
mdt.lustre-MDT0000.recovery_status=
status: COMPLETE
recovery_start: 1593460707
recovery_duration: 246400
completed_clients: 1/2
replayed_requests: 0
last_transno: 4294967310
VBR: DISABLED
IR: DISABLED
mdt.lustre-MDT0001.recovery_status=
status: WAITING
non-ready MDTs:  0000
recovery_start: 1593707106
time_waited: 1429

root     16048  0.0  0.0  11232   708 pts/2    D+   12:30   0:00 lctl --device 14 abort_recovery

 sudo cat /proc/16048/stack

[<ffffffffc0e9474d>] target_stop_recovery_thread.part.20+0x3d/0xd0 [ptlrpc]
[<ffffffffc0e947f8>] target_stop_recovery_thread+0x18/0x20 [ptlrpc]
[<ffffffffc11bcf68>] mdt_iocontrol+0x558/0xb00 [mdt]
[<ffffffffc0a17eeb>] class_handle_ioctl+0x16bb/0x1cc0 [obdclass]
[<ffffffffc0a18565>] obd_class_ioctl+0x75/0x170 [obdclass]
[<ffffffff9d256490>] do_vfs_ioctl+0x3a0/0x5a0
[<ffffffff9d256731>] SyS_ioctl+0xa1/0xc0
[<ffffffff9d776ddb>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff

Comment by Gerrit Updater [ 04/Jul/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38746/
Subject: LU-13608 tgt: abort recovery while reading update llog
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0496cdf20451f07befebd1cb8a770544ec0f57df

Comment by Gerrit Updater [ 06/Jul/20 ]

Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/39284
Subject: LU-13608 tgt: abort recovery while reading update llog
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: be2b297c3edd91d7caab97408bf7f539612a47e6

Comment by Peter Jones [ 21/Jul/20 ]

hongchao.zhang, is the issue reported by aboyko a different issue?

Comment by Hongchao Zhang [ 22/Jul/20 ]

pjones
Hi, this could be another case of the same issue; I will create a new patch to fix it.

Comment by Alexander Boyko [ 22/Jul/20 ]

Hongchao Zhang, could you please reproduce the original issue and show that abort recovery fixes it? It would be better to have a regression test or some reproducer for it.

Comment by Gerrit Updater [ 30/Jul/20 ]

Alexander Boyko (alexander.boyko@hpe.com) uploaded a new patch: https://review.whamcloud.com/39538
Subject: LU-13608 out: don't return einprogress error
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 48cdb839e17fa99a2cbfe442b4ad86c6a6746e2b

Comment by Gerrit Updater [ 01/Sep/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39538/
Subject: LU-13608 out: don't return einprogress error
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 865aa3f692bccdd9cf7ff6cafeee350e06bb8d76

Comment by Gerrit Updater [ 06/Oct/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39284/
Subject: LU-13608 tgt: abort recovery while reading update llog
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 4142f05811b019df453ce52f6b690ec81fa5897f

Comment by Gerrit Updater [ 08/Jan/21 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41183
Subject: LU-13608 out: don't return einprogress error
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 2c6b286c1e596b850cabe0b185c1552b0133496d

Comment by Andreas Dilger [ 29/Jan/21 ]

Alexander Boyko (alexander.boyko@hpe.com) uploaded a new patch: https://review.whamcloud.com/39539
Subject: LU-13608 tests: check MDS recovery hang
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2329127abf2ace5333a10352b77202c55f8da0aa

Comment by Gerrit Updater [ 16/Feb/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41183/
Subject: LU-13608 out: don't return einprogress error
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 7817acc39ee1d6859c2737f75619748dc8e37f95
