[LU-15761] cannot finish MDS recovery Created: 19/Apr/22  Updated: 21/Dec/22  Resolved: 05/May/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Blocker
Reporter: Sarah Liu Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: soak
Environment:

2.15.0-rc3


Attachments: HTML File lu-14495-soak-8debug     HTML File lu-14495-soak-8dmesg     HTML File lu-14495-soak-8trace     File lustre-log.1651263727.22864     File lustre-log.1651732904.9736     HTML File soak-10-debuglog     HTML File soak-10-umount-hang-trace     HTML File soak-8dmesg     HTML File soak-8trace    
Issue Links:
Related
is related to LU-15769 llog_osd_next_block()): invalid llog ... In Progress
is related to LU-11418 hung threads on MDT and MDT won't umount Resolved
is related to LU-15937 lctl llog commands do not work for DN... Open
is related to LU-14495 (llog_osd.c:622:llog_osd_write_rec())... Resolved
is related to LU-13974 Wrong behavior for out operations cre... Resolved
is related to LU-15645 gap in recovery llog should not be a ... Resolved
is related to LU-15938 MDT recovery did not finish due to co... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hit the same issue twice during soak testing on the same MDS node

trying to connect to soak-10 ...
2022-04-15 00:35:34,865:fsmgmt.fsmgmt:INFO     soak-10 is up!!!
2022-04-15 00:35:45,877:fsmgmt.fsmgmt:INFO     Failing over soaked-MDT0002 ...
2022-04-15 00:35:45,877:fsmgmt.fsmgmt:INFO     Mounting soaked-MDT0002 on soak-11 ...
2022-04-15 00:36:42,914:fsmgmt.fsmgmt:INFO     ... soaked-MDT0002 mounted successfully on soak-11
2022-04-15 00:36:42,914:fsmgmt.fsmgmt:INFO     ... soaked-MDT0002 failed over
2022-04-15 00:36:42,914:fsmgmt.fsmgmt:INFO     Wait for recovery to complete
2022-04-15 00:36:43,502:fsmgmt.fsmgmt:DEBUG    Recovery Result Record: {'soak-11': {'soaked-MDT0003': 'COMPLETE', 'soaked-MDT0002': 'WAITING'}}
2022-04-15 00:36:43,503:fsmgmt.fsmgmt:INFO     soaked-MDT0002 in status 'WAITING'.

on soak-11

[136090.466173] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
[136538.516253] Lustre: soaked-MDT0003: Client ec0f61fb-4e34-4a0b-b633-b84f3e77a68b (at 192.168.1.138@o2ib) reconnecting
[136538.528122] Lustre: Skipped 8 previous similar messages
[136570.792064] LustreError: 43760:0:(llog.c:781:llog_process_thread()) soaked-MDT0001-osp-MDT0002 retry remote llog process
[136570.804339] LustreError: 43760:0:(llog.c:781:llog_process_thread()) Skipped 36771 previous similar messages
[136690.448382] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reached hard limit: 900, extend: 1
[136690.462972] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
[137139.504191] Lustre: soaked-MDT0003: Client 6734fd00-3e57-422d-a9ca-e8a57bcfd8f1 (at 192.168.1.135@o2ib) reconnecting
[137139.516055] Lustre: Skipped 10 previous similar messages
[137170.777941] LustreError: 43760:0:(llog.c:781:llog_process_thread()) soaked-MDT0001-osp-MDT0002 retry remote llog process
[137170.790212] LustreError: 43760:0:(llog.c:781:llog_process_thread()) Skipped 36789 previous similar messages
[137180.761466] Lustre: mdt00_016: service thread pid 12073 was inactive for 200.281 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[137180.782540] Pid: 12073, comm: mdt00_016 3.10.0-1160.49.1.el7_lustre.x86_64 #1 SMP Mon Dec 13 22:00:59 UTC 2021
[137180.793815] Call Trace:
[137180.796783] [<0>] top_trans_stop+0x882/0xfa0 [ptlrpc]
[137180.802560] [<0>] lod_trans_stop+0x25c/0x340 [lod]
[137180.808065] [<0>] mdd_trans_stop+0x2e/0x174 [mdd]
[137180.813433] [<0>] mdd_create+0x154a/0x1cd0 [mdd]
[137180.818740] [<0>] mdo_create+0x46/0x48 [mdt]
[137180.823632] [<0>] mdt_create+0xab1/0xe40 [mdt]
[137180.828720] [<0>] mdt_reint_create+0x3a0/0x460 [mdt]
[137180.834388] [<0>] mdt_reint_rec+0x8a/0x240 [mdt]
[137180.839662] [<0>] mdt_reint_internal+0x76c/0xb50 [mdt]
[137180.845516] [<0>] mdt_reint+0x67/0x150 [mdt]
[137180.850456] [<0>] tgt_request_handle+0x92f/0x19c0 [ptlrpc]
[137180.856729] [<0>] ptlrpc_server_handle_request+0x253/0xc30 [ptlrpc]
[137180.863875] [<0>] ptlrpc_main+0xbf4/0x15e0 [ptlrpc]
[137180.869420] [<0>] kthread+0xd1/0xe0
[137180.873423] [<0>] ret_from_fork_nospec_begin+0x21/0x21
[137180.879286] [<0>] 0xfffffffffffffffe
[137290.446156] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reached hard limit: 900, extend: 1
[137290.460730] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
[137575.430810] Lustre: 7111:0:(service.c:1436:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/5), not sending early reply  req@ffff8f5f6cb94380 x1729527267995904/t176094400030(0) o36->f5a20605-7e7a-45f1-bf01-daf0664944ec@192.168.1.118@o2ib:514/0 lens 552/448 e 24 to 0 dl 1649907299 ref 2 fl Interpret:/0/0 rc 0/0 job:''
[137740.477658] Lustre: soaked-MDT0003: Client ec0f61fb-4e34-4a0b-b633-b84f3e77a68b (at 192.168.1.138@o2ib) reconnecting
[137740.489550] Lustre: Skipped 9 previous similar messages
[137770.755234] LustreError: 43760:0:(llog.c:781:llog_process_thread()) soaked-MDT0001-osp-MDT0002 retry remote llog process
[137770.767489] LustreError: 43760:0:(llog.c:781:llog_process_thread()) Skipped 36738 previous similar messages
[137890.442958] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reached hard limit: 900, extend: 1
[137890.457533] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
[138341.465763] Lustre: soaked-MDT0003: Client 6734fd00-3e57-422d-a9ca-e8a57bcfd8f1 (at 192.168.1.135@o2ib) reconnecting
[138341.477619] Lustre: Skipped 11 previous similar messages

Had to run abort_recovery:

[188290.202475] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
[188327.367445] LustreError: 37182:0:(mdt_handler.c:7419:mdt_iocontrol()) soaked-MDT0002: Aborting recovery for device
[188327.379129] LustreError: 37182:0:(ldlm_lib.c:2876:target_stop_recovery_thread()) soaked-MDT0002: Aborting recovery
[188327.390814] Lustre: 43762:0:(ldlm_lib.c:2283:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[188327.392289] LustreError: 43760:0:(lod_dev.c:424:lod_sub_recovery_thread()) soaked-MDT0001-osp-MDT0002 get update log failed: rc = -11
[188327.418009] Lustre: 43762:0:(ldlm_lib.c:2273:target_recovery_overseer()) soaked-MDT0002 recovery is aborted by hard timeout
[188327.486462] Lustre: soaked-MDT0002-osp-MDT0003: Connection restored to 192.168.1.111@o2ib (at 0@lo)
[188327.487906] Lustre: soaked-MDT0002: Recovery over after 2739:44, of 22 clients 22 recovered and 0 were evicted.
[188345.146528] Lustre: Failing over soaked-MDT0002
[188345.224943] LustreError: 3633:0:(mdt_reint.c:2741:mdt_reint_rename()) soaked-MDT0002: cannot lock for rename: rc = -5
[188345.236924] LustreError: 3633:0:(mdt_reint.c:2741:mdt_reint_rename()) Skipped 1 previous similar message
[188345.238945] Lustre: soaked-MDT0002: Not available for connect from 192.168.1.126@o2ib (stopping)
[188345.238946] Lustre: Skipped 2 previous similar messages


 Comments   
Comment by Peter Jones [ 19/Apr/22 ]

Mike

Could you please investigate?

Peter

Comment by Andreas Dilger [ 19/Apr/22 ]

Sarah, it would be useful to know which was the last 2.14.5x version where this test was still passing? Have you restarted soak with the current build to see if this is repeatable, or if it was a one-time failure?

According to https://wiki.whamcloud.com/display/Releases/Soak+Testing+on+Soak it looks like 2.14.56 was run on 20211223, and 2.15.0-RC2-2-g94f4e1f was run on 20220302. Did those builds work without error?

The number of patches that landed since 2.15.0-RC2-2-g94f4e1f is quite small (excluding test patches):

03e4b451b0dc New RC 2.15.0-RC3
649d638467c0 LU-15670 clio: Disable lockless for DIO with O_APPEND
22de0bd145b6 LU-15616 lnet: ln_api_mutex deadlocks
3e3f70eb1ec9 LU-13714 lnet: only update gateway NI status on discovery
0090b6f6f6cf LU-15692 lmv: change default hash back to fnv_1a_64
42a6d1fdb681 LU-15702 lov: remove lo_trunc_stripeno
ef826db1f43b LU-15548 osd-ldiskfs: hide virtual projid xattr
9884f37985c1 LU-15637 llite: Fix use of uninitialized fields
d8012811cc6f LU-15435 ptlrpc: unregister reply buffer on rq_err
4d93fd79e8b0 LU-15661 nodemap: fix map mode value for 'both'
2f496148c31d LU-15551 ofd: Return EOPNOTSUPP instead of EPROTO
9dcbf8b3d44f LU-15634 ptlrpc: Use after free of 'conn' in rhashtable retry
f14090e56c9d LU-15546 mdt: mdt_reint_open lookup before locking
966ca46e4aa2 LU-15608 sec: fix DIO for encrypted files
1dbcbd70f87b LU-15021 quota: protect lqe_glbl_data in lqe
5da859e262dd LU-15576 osp: Interop skip sanity test 823 for MDS < 2.14.56
06134ff9adde LU-15601 osd-ldiskfs: handle read_inode_bitmap() error
591a990c617f LU-15513 lod: skip uninit component in lod_fill_mirrors

Comment by Andreas Dilger [ 19/Apr/22 ]

Sarah, I didn't notice that you have already hit this problem twice. Could you please test with patch https://review.whamcloud.com/46837 "LU-15645 obdclass: llog to handle gaps" (already based on 2.15.0-RC3) to see if this solves the recovery problem?

Comment by Sarah Liu [ 20/Apr/22 ]

Hi Andreas, I didn't notice this problem on the previous master tags. I will reload soak tomorrow to try the patch, thank you.

Comment by Colin Faber [ 21/Apr/22 ]

Hi Sarah,

Any luck reproducing?

-cf


Comment by Sarah Liu [ 25/Apr/22 ]

Hi Colin,

Soak doesn't hit the same issue; instead it hit another issue on the other pair of MDSs (soak-8 and soak-9) during failback.

On soak-8, umount hung when it tried to fail back MDT0001:

[13201.893886] Lustre: Skipped 3 previous similar messages
[13226.408821] LDISKFS-fs warning (device dm-1): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.

[13241.225041] LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 192.168.1.128@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[13241.244882] LustreError: Skipped 929 previous similar messages
[13272.213200] LDISKFS-fs (dm-1): recovery complete
[13272.218869] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
[13274.329038] Lustre: soaked-MDT0001: Not available for connect from 192.168.1.111@o2ib (not set up)
[13274.339066] Lustre: Skipped 1 previous similar message
[13274.402654] Lustre: soaked-MDT0000: Received MDS connection from 0@lo, removing former export from 192.168.1.109@o2ib
[13274.615826] Lustre: soaked-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[13274.633101] Lustre: soaked-MDT0001: in recovery but waiting for the first client to connect
[13274.686553] Lustre: soaked-MDT0001: Will be in recovery for at least 2:30, or until 22 clients reconnect
[13283.173712] Lustre: soaked-MDT0001-osp-MDT0000: Connection restored to 192.168.1.108@o2ib (at 0@lo)
[13283.174024] Lustre: soaked-MDT0001: Recovery over after 0:08, of 22 clients 22 recovered and 0 were evicted.
[13283.187022] Lustre: 3818:0:(mdt_recovery.c:200:mdt_req_from_lrd()) @@@ restoring transno  req@ffff9849b4361680 x1730744898968000/t38655174014(0) o36->88b0180e-0530-4e8d-8af5-9a48dbccd57d@192.168.1.119@o2ib:271/0 lens 496/2888 e 0 to 0 dl 1650719436 ref 1 fl Interpret:/2/0 rc 0/0 job:''
[13283.187026] Lustre: 3818:0:(mdt_recovery.c:200:mdt_req_from_lrd()) Skipped 3 previous similar messages
[13291.941059] Lustre: Failing over soaked-MDT0001
[13291.949360] LustreError: 5845:0:(mdt_reint.c:2741:mdt_reint_rename()) soaked-MDT0001: cannot lock for rename: rc = -5
[13291.957834] LustreError: 3666:0:(ldlm_lockd.c:2500:ldlm_cancel_handler()) ldlm_cancel from 0@lo arrived at 1650719399 with bad export cookie 17448559810846504576
[13291.957883] Lustre: soaked-MDT0001-osp-MDT0000: Connection to soaked-MDT0001 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[13291.957978] LustreError: 11-0: soaked-MDT0001-osp-MDT0000: operation out_update to node 0@lo failed: rc = -107
[13291.957984] Lustre: soaked-MDT0001: Not available for connect from 0@lo (stopping)
[13291.996099] LustreError: 3684:0:(osp_object.c:629:osp_attr_get()) soaked-MDT0002-osp-MDT0001: osp_attr_get update error [0x280004a85:0x3:0x0]: rc = -108
[13292.029469] LustreError: 5845:0:(mdt_reint.c:2741:mdt_reint_rename()) Skipped 2 previous similar messages
[13293.222589] Lustre: soaked-MDT0001: Not available for connect from 192.168.1.111@o2ib (stopping)
[13293.232426] Lustre: Skipped 7 previous similar messages
[13294.448007] LustreError: 3645:0:(client.c:1256:ptlrpc_import_delay_req()) @@@ IMP_CLOSED  req@ffff9849b28df500 x1730891251304896/t0(0) o41->soaked-MDT0002-osp-MDT0001@192.168.1.110@o2ib:24/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:QU/0/ffffffff rc 0/-1 job:''
[13294.473664] LustreError: 3645:0:(client.c:1256:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
[13296.155550] Lustre: soaked-MDT0001: Not available for connect from 192.168.1.129@o2ib (stopping)
[13296.165402] Lustre: Skipped 1 previous similar message
[13300.253260] Lustre: soaked-MDT0001: Not available for connect from 192.168.1.121@o2ib (stopping)
[13300.263095] Lustre: Skipped 13 previous similar messages
[13308.262410] Lustre: soaked-MDT0001: Not available for connect from 192.168.1.111@o2ib (stopping)
[13308.272238] Lustre: Skipped 36 previous similar messages
[13324.545237] Lustre: soaked-MDT0001: Not available for connect from 192.168.1.127@o2ib (stopping)
[13324.555060] Lustre: Skipped 122 previous similar messages
[13357.071204] Lustre: soaked-MDT0001: Not available for connect from 0@lo (stopping)
[13357.079677] Lustre: Skipped 226 previous similar messages
[13421.084164] Lustre: soaked-MDT0001: Not available for connect from 192.168.1.123@o2ib (stopping)
[13421.094016] Lustre: Skipped 458 previous similar messages
[13441.068783] INFO: task umount:25201 blocked for more than 120 seconds.
[13441.076096] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13441.084855] umount          D ffff9845eaf8d280     0 25201  25197 0x00000080
[13441.092782] Call Trace:
[13441.095539]  [<ffffffff91b89179>] schedule+0x29/0x70
[13441.101100]  [<ffffffff91b8a8b5>] rwsem_down_write_failed+0x215/0x3c0
[13441.108344]  [<ffffffffc1ac50e0>] ? osp_trans_update_request_create+0x30/0x50 [osp]
[13441.116905]  [<ffffffff91798327>] call_rwsem_down_write_failed+0x17/0x30
[13441.124415]  [<ffffffff91b8868d>] down_write+0x2d/0x3d
[13441.130225]  [<ffffffffc0df0bbe>] llog_cancel_arr_rec+0x14e/0xa40 [obdclass]
[13441.138142]  [<ffffffffc0df14ce>] llog_cancel_rec+0x1e/0x20 [obdclass]
[13441.145474]  [<ffffffffc0df813e>] llog_cat_cleanup+0xee/0x3f0 [obdclass]
[13441.153015]  [<ffffffffc0df8556>] llog_cat_close+0x116/0x240 [obdclass]
[13441.160452]  [<ffffffffc19979e6>] lod_sub_fini_llog+0xa6/0x2e0 [lod]
[13441.167568]  [<ffffffffc1998ccc>] lod_process_config+0x76c/0x1340 [lod]
[13441.174981]  [<ffffffffc1a292fe>] mdd_process_config+0x8e/0x640 [mdd]
[13441.182214]  [<ffffffffc1870921>] mdt_stack_pre_fini+0x271/0x6f0 [mdt]
[13441.189538]  [<ffffffffc1871b37>] mdt_device_fini+0x87/0x990 [mdt]
[13441.196489]  [<ffffffffc0e2de91>] class_cleanup+0xa61/0xd20 [obdclass]
[13441.203849]  [<ffffffffc0e2ee97>] class_process_config+0x6a7/0x2b20 [obdclass]
[13441.211958]  [<ffffffffc0e314d6>] class_manual_cleanup+0x1c6/0x760 [obdclass]
[13441.219967]  [<ffffffffc0e6b6f5>] server_put_super+0xa25/0xf80 [obdclass]
[13441.227555]  [<ffffffff916977f6>] ? fsnotify_unmount_inodes+0x46/0x1d0
[13441.234856]  [<ffffffff91650f5d>] generic_shutdown_super+0x6d/0x100
[13441.241866]  [<ffffffff91651362>] kill_anon_super+0x12/0x20
[13441.248134]  [<ffffffffc14d2d7b>] lustre_kill_super+0x2b/0x30 [lustre]
[13441.255455]  [<ffffffff9165173e>] deactivate_locked_super+0x4e/0x70
[13441.262458]  [<ffffffff91651ec6>] deactivate_super+0x46/0x60
[13441.268792]  [<ffffffff9167130f>] cleanup_mnt+0x3f/0x80
[13441.274634]  [<ffffffff916713a2>] __cleanup_mnt+0x12/0x20
[13441.280692]  [<ffffffff914c299b>] task_work_run+0xbb/0xe0
[13441.286737]  [<ffffffff9142cc65>] do_notify_resume+0xa5/0xc0
[13441.293085]  [<ffffffff91b962ef>] int_signal+0x12/0x17
[13549.118275] Lustre: soaked-MDT0001: Not available for connect from 192.168.1.102@o2ib (stopping)
[13549.128106] Lustre: Skipped 917 previous similar messages
[13561.296954] INFO: task umount:25201 blocked for more than 120 seconds.
[13561.304262] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13561.313021] umount          D ffff9845eaf8d280     0 25201  25197 0x00000080
[13561.320948] Call Trace:
[13561.323722]  [<ffffffff91b89179>] schedule+0x29/0x70
[13561.329295]  [<ffffffff91b8a8b5>] rwsem_down_write_failed+0x215/0x3c0
[13561.336541]  [<ffffffffc1ac50e0>] ? osp_trans_update_request_create+0x30/0x50 [osp]
[13561.345118]  [<ffffffff91798327>] call_rwsem_down_write_failed+0x17/0x30
[13561.352614]  [<ffffffff91b8868d>] down_write+0x2d/0x3d
[13561.358425]  [<ffffffffc0df0bbe>] llog_cancel_arr_rec+0x14e/0xa40 [obdclass]
[13561.366349]  [<ffffffffc0df14ce>] llog_cancel_rec+0x1e/0x20 [obdclass]
[13561.373688]  [<ffffffffc0df813e>] llog_cat_cleanup+0xee/0x3f0 [obdclass]
[13561.381198]  [<ffffffffc0df8556>] llog_cat_close+0x116/0x240 [obdclass]
[13561.388628]  [<ffffffffc19979e6>] lod_sub_fini_llog+0xa6/0x2e0 [lod]
[13561.395764]  [<ffffffffc1998ccc>] lod_process_config+0x76c/0x1340 [lod]
[13561.403174]  [<ffffffffc1a292fe>] mdd_process_config+0x8e/0x640 [mdd]
[13561.410429]  [<ffffffffc1870921>] mdt_stack_pre_fini+0x271/0x6f0 [mdt]
[13561.417740]  [<ffffffffc1871b37>] mdt_device_fini+0x87/0x990 [mdt]
[13561.424679]  [<ffffffffc0e2de91>] class_cleanup+0xa61/0xd20 [obdclass]
[13561.432019]  [<ffffffffc0e2ee97>] class_process_config+0x6a7/0x2b20 [obdclass]
[13561.440115]  [<ffffffffc0e314d6>] class_manual_cleanup+0x1c6/0x760 [obdclass]
[13561.448130]  [<ffffffffc0e6b6f5>] server_put_super+0xa25/0xf80 [obdclass]
[13561.455738]  [<ffffffff916977f6>] ? fsnotify_unmount_inodes+0x46/0x1d0
[13561.463042]  [<ffffffff91650f5d>] generic_shutdown_super+0x6d/0x100
[13561.470059]  [<ffffffff91651362>] kill_anon_super+0x12/0x20
[13561.476333]  [<ffffffffc14d2d7b>] lustre_kill_super+0x2b/0x30 [lustre]
[13561.483635]  [<ffffffff9165173e>] deactivate_locked_super+0x4e/0x70
[13561.490656]  [<ffffffff91651ec6>] deactivate_super+0x46/0x60
[13561.497002]  [<ffffffff9167130f>] cleanup_mnt+0x3f/0x80
[13561.502856]  [<ffffffff916713a2>] __cleanup_mnt+0x12/0x20
[13561.508890]  [<ffffffff914c299b>] task_work_run+0xbb/0xe0
[13561.514946]  [<ffffffff9142cc65>] do_notify_resume+0xa5/0xc0
[13561.521284]  [<ffffffff91b962ef>] int_signal+0x12/0x17
[13563.147828] Lustre: mdt01_009: service thread pid 3882 was inactive for 236.694 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[13563.168704] Pid: 3882, comm: mdt01_009 3.10.0-1160.49.1.el7_lustre.x86_64 #1 SMP Mon Dec 13 22:00:59 UTC 2021
[13563.179794] Call Trace:
[13563.182673] [<0>] top_trans_stop+0x882/0xfa0 [ptlrpc]
[13563.188353] [<0>] lod_trans_stop+0x25c/0x340 [lod]
[13563.193761] [<0>] mdd_trans_stop+0x2e/0x174 [mdd]
[13563.199052] [<0>] mdd_create+0x154a/0x1cd0 [mdd]
[13563.204259] [<0>] mdo_create+0x46/0x48 [mdt]
[13563.209061] [<0>] mdt_create+0xab1/0xe40 [mdt]
[13563.214062] [<0>] mdt_reint_create+0x3a0/0x460 [mdt]
[13563.219660] [<0>] mdt_reint_rec+0x8a/0x240 [mdt]
[13563.224847] [<0>] mdt_reint_internal+0x76c/0xb50 [mdt]
[13563.230632] [<0>] mdt_reint+0x67/0x150 [mdt]
[13563.235465] [<0>] tgt_request_handle+0x92f/0x19c0 [ptlrpc]
[13563.241663] [<0>] ptlrpc_server_handle_request+0x253/0xc30 [ptlrpc]
[13563.248722] [<0>] ptlrpc_main+0xbf4/0x15e0 [ptlrpc]
[13563.254185] [<0>] kthread+0xd1/0xe0
[13563.258089] [<0>] ret_from_fork_nospec_begin+0x21/0x21
[13563.263866] [<0>] 0xfffffffffffffffe
[13591.927826] Lustre: soaked-MDT0000: Received new MDS connection from 192.168.1.110@o2ib, keep former export from same NID
[13591.940130] Lustre: Skipped 1 previous similar message
[13681.525125] INFO: task umount:25201 blocked for more than 120 seconds.
[13681.532427] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13681.541183] umount          D ffff9845eaf8d280     0 25201  25197 0x00000080
Comment by Alex Zhuravlev [ 25/Apr/22 ]

sarah would it be possible to get all traces for the case?

Comment by Sarah Liu [ 25/Apr/22 ]

bzzz, please check the attached.

Comment by Alex Zhuravlev [ 29/Apr/22 ]

Still looking at the logs, but I haven't got a good idea yet. Is it reproducible?

Comment by Sarah Liu [ 02/May/22 ]

bzzz, the same umount hang issue happened again on a different node; if you want to check the status, please let me know. I will also upload the trace from soak-10, where the hang happened.

Comment by Alex Zhuravlev [ 02/May/22 ]

can you grab all the traces with echo t >/proc/sysrq-trigger please?
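
For completeness, the usual way to capture those traces on a server (a generic sysrq/ops sketch, not commands taken from this ticket; the output path is arbitrary) is:

```shell
# Enable the sysrq interface, dump all task stack traces into the
# kernel ring buffer, then save the ring buffer for upload.
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger
dmesg > /tmp/all-task-traces.$(hostname).log
```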

Comment by Alex Zhuravlev [ 02/May/22 ]

Sarah, thank you, checking...

Comment by Alex Zhuravlev [ 02/May/22 ]

and the kernel logs from the servers please

Comment by Gerrit Updater [ 02/May/22 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47185
Subject: LU-15761 obdclass: fix locking in llog_cat_refresh()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f161c9dd6fcc11dbb044225a05d6d283d665424f

Comment by Andreas Dilger [ 02/May/22 ]

It looks like this locking imbalance was introduced by patch https://review.whamcloud.com/33401 "LU-11418 llog: refresh remote llog upon -ESTALE".

The very strange thing is that the patch landed in 2.12.0-RC1 but was not noticed until now. It is possible that some other recent landing has made this codepath more likely to be hit, because otherwise it seems like the code would always have been broken.

Comment by Andreas Dilger [ 02/May/22 ]

This may be related to several llog-related test failures that were previously reported, all of them with the "retry remote llog process" message printed.

Looking at the callers of llog_cat_refresh(), it looks like this could happen in combination with the patch https://review.whamcloud.com/40742 "LU-13974 llog: check stale osp object" landed in 2.15.0 that returns -ESTALE more frequently:

int llog_cat_declare_add_rec()
{
        :
start:
        down_read_nested(&cathandle->lgh_lock, LLOGH_CAT);
        :
        rc = llog_declare_write_rec(env, cathandle->u.chd.chd_current_log,
                                    rec, -1, th);
        if (rc == -ESTALE && dt_object_remote(cathandle->lgh_obj)) {
                up_read(&cathandle->lgh_lock);
                rc = llog_cat_refresh(env, cathandle);
                if (rc)
                        RETURN(rc);
                goto start;
        }
Comment by Andreas Dilger [ 03/May/22 ]

Sarah, are you able to restart Soak with Alex's patch on master? You were previously able to hit this issue easily, so hopefully it will tell us quickly if the patch fixes it.

Comment by Alex Zhuravlev [ 03/May/22 ]

Sarah replied on Skype: I have loaded soak with it yesterday, will check the status a little later and let you know

Comment by Sarah Liu [ 03/May/22 ]

Hi Andreas, last time the failure occurred after 6/7 hours running, so far soak has been up over 10 hours and it is doing well.

Comment by Sarah Liu [ 04/May/22 ]

A quick update on soak: it doesn't hit the umount issue during failover, but it did hit LU-14495 on MDS0. The test is still ongoing since panic-on-lbug is disabled. I will upload the trace and debug log here since LU-14495 is closed.

Comment by Alex Zhuravlev [ 04/May/22 ]

Thanks for the update. I'm going to check the logs. In the meantime, can you please restart?

Comment by Alex Zhuravlev [ 04/May/22 ]

I don't understand where this problem comes from:

[ 5814.198337] LustreError: 12981:0:(out_handler.c:910:out_tx_end()) soaked-MDT0000-osd: undo for /tmp/rpmbuild-lustre-jenkins-ZEMyr9OQ/BUILD/lustre-2.15.0_RC3_3_gf161c9d/lustre/ptlrpc/../../lustre/target/out_handler.c:445: rc = -524

I want to debug this. Would it be possible to run the test with full debug enabled and dump the Lustre debug log upon this message?
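
A full-debug capture on the MDS typically looks something like the following (standard `lctl` debug controls; the buffer size and output path are assumptions, not values from this ticket):

```shell
# Enable all Lustre debug flags and enlarge the per-CPU debug buffer (MB).
lctl set_param debug=-1
lctl set_param debug_mb=1024

# ... reproduce the failure ...

# Dump the in-kernel Lustre debug buffer to a file for upload.
lctl debug_kernel /tmp/lustre-debug.$(hostname).log
```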

Comment by Sarah Liu [ 04/May/22 ]

okay, I will restart with full debug

Comment by Sarah Liu [ 05/May/22 ]

bzzz, please check the log lustre-log.1651732904.9736; this is with full debug, hope it helps.

[33368.708107] LustreError: 8542:0:(llog_cat.c:789:llog_cat_cancel_records()) Skipped 343 previous similar messages
[33384.336498] Lustre: soaked-MDT0003: haven't heard from client soaked-MDT0003-lwp-OST0005_UUID (at 192.168.1.105@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff916a71261400, cur 1651732900 expire 1651732750 last 1651732673
[33384.360772] Lustre: Skipped 3 previous similar messages
[33388.209816] LustreError: 9736:0:(llog_osd.c:626:llog_osd_write_rec()) soaked-MDT0001-osp-MDT0003: index 40453 already set in llog bitmap [0x240000403:0x3:0x0]
[33388.225651] LustreError: 9736:0:(llog_osd.c:628:llog_osd_write_rec()) LBUG
[33388.233370] Pid: 9736, comm: mdt00_011 3.10.0-1160.49.1.el7_lustre.x86_64 #1 SMP Sun Apr 3 16:20:30 UTC 2022
[33388.244363] Call Trace:
[33388.247123] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
[33388.252872] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
[33388.258268] [<0>] llog_osd_write_rec+0x1a3a/0x1a70 [obdclass]
[33388.264714] [<0>] llog_write_rec+0x293/0x590 [obdclass]
[33388.270573] [<0>] llog_cat_add_rec+0x1d9/0x980 [obdclass]
[33388.276617] [<0>] llog_add+0x182/0x1f0 [obdclass]
[33388.281964] [<0>] sub_updates_write+0x302/0xe3b [ptlrpc]
[33388.287970] [<0>] top_trans_stop+0x4a2/0xfa0 [ptlrpc]
[33388.293652] [<0>] lod_trans_stop+0x25c/0x340 [lod]
[33388.299032] [<0>] mdd_trans_stop+0x2e/0x174 [mdd]
[33388.304308] [<0>] mdd_create+0x154a/0x1cd0 [mdd]
[33388.309491] [<0>] mdo_create+0x46/0x48 [mdt]
[33388.314301] [<0>] mdt_create+0xab1/0xe40 [mdt]
[33388.319269] [<0>] mdt_reint_create+0x3a0/0x460 [mdt]
[33388.324829] [<0>] mdt_reint_rec+0x8a/0x240 [mdt]
[33388.329993] [<0>] mdt_reint_internal+0x76c/0xb50 [mdt]
[33388.335733] [<0>] mdt_reint+0x67/0x150 [mdt]
[33388.340562] [<0>] tgt_request_handle+0x92f/0x19c0 [ptlrpc]
[33388.346728] [<0>] ptlrpc_server_handle_request+0x253/0xc30 [ptlrpc]
[33388.353774] [<0>] ptlrpc_main+0xbf4/0x15e0 [ptlrpc]
[33388.359226] [<0>] kthread+0xd1/0xe0
[33388.363125] [<0>] ret_from_fork_nospec_begin+0x21/0x21
[33388.368893] [<0>] 0xfffffffffffffffe
[33388.373074] LustreError: dumping log to /tmp/lustre-log.1651732904.9736
Comment by Gerrit Updater [ 05/May/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47185/
Subject: LU-15761 obdclass: fix locking in llog_cat_refresh()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 69cdcbe3cf732124b3b5be1a1f235aa1b78f6c85

Comment by Peter Jones [ 05/May/22 ]

LU-15761 fix has landed for 2.15. I have reopened LU-14495 to track the remaining issue

Comment by Alex Zhuravlev [ 06/May/22 ]

Alex Zhuravlev, please check the log lustre-log.1651732904.9736, this is with full debug, hope it helps

Yes, it did, thanks a lot. I've got a rough understanding and am trying to construct a reproducer.

Comment by Gerrit Updater [ 21/Dec/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49470
Subject: LU-15761 obdclass: fix locking in llog_cat_refresh()
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: cb2086c2f9268db6f65ed2d31384806057a0ee30

Generated at Sat Feb 10 03:21:04 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.