[LU-16941] MDT deadlock involving lquota_wb Created: 04/Jul/23 Updated: 06/Jul/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Stephane Thiell | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Attachments: | fir-md1-s2_fir-MDT0001_foreach_bt_lquota_wb_20230704.txt |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hello! Two days ago, we upgraded the Lustre servers of Sherlock's scratch (Fir) from 2.12.9 to 2.15.3. The clients were already running 2.15.x. The server upgrade went smoothly, but I noticed one issue today: we got an alert that a Robinhood changelog reader (also running 2.15.3) was stuck, and indeed:

Jul  4 11:39:40 fir-rbh01 kernel: INFO: task robinhood:8863 blocked for more than 120 seconds.
Jul  4 11:39:40 fir-rbh01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  4 11:39:40 fir-rbh01 kernel: robinhood       D ffff9bacf6375280     0  8863      1 0x00000080
Jul  4 11:39:40 fir-rbh01 kernel: Call Trace:
Jul  4 11:39:40 fir-rbh01 kernel:  [<ffffffffac7b8bf9>] schedule_preempt_disabled+0x29/0x70
Jul  4 11:39:40 fir-rbh01 kernel:  [<ffffffffac7b6ca7>] __mutex_lock_slowpath+0xc7/0x1e0
Jul  4 11:39:40 fir-rbh01 kernel:  [<ffffffffac7b602f>] mutex_lock+0x1f/0x33
Jul  4 11:39:40 fir-rbh01 kernel:  [<ffffffffac7ae6ed>] lookup_slow+0x33/0xab
Jul  4 11:39:40 fir-rbh01 kernel:  [<ffffffffac26bf8e>] path_lookupat+0x89e/0x8d0
Jul  4 11:39:40 fir-rbh01 kernel:  [<ffffffffac235c6b>] ? kmem_cache_alloc+0x19b/0x1f0
Jul  4 11:39:40 fir-rbh01 kernel:  [<ffffffffac26e16f>] ? getname_flags+0x4f/0x1a0
Jul  4 11:39:40 fir-rbh01 kernel:  [<ffffffffac26bfeb>] filename_lookup+0x2b/0xd0
Jul  4 11:39:40 fir-rbh01 kernel:  [<ffffffffac26f337>] user_path_at_empty+0x67/0xc0
Jul  4 11:39:40 fir-rbh01 kernel:  [<ffffffffac25e493>] ? fput+0x13/0x20
Jul  4 11:39:40 fir-rbh01 kernel:  [<ffffffffac26f3a1>] user_path_at+0x11/0x20
Jul  4 11:39:40 fir-rbh01 kernel:  [<ffffffffac2618a3>] vfs_fstatat+0x63/0xd0
Jul  4 11:39:40 fir-rbh01 kernel:  [<ffffffffac261d01>] SYSC_newlstat+0x31/0x70
Jul  4 11:39:40 fir-rbh01 kernel:  [<ffffffffac1468f6>] ? __audit_syscall_exit+0x1f6/0x2b0
Jul  4 11:39:40 fir-rbh01 kernel:  [<ffffffffac2621ce>] SyS_newlstat+0xe/0x20
Jul  4 11:39:40 fir-rbh01 kernel:  [<ffffffffac7c539a>] system_call_fastpath+0x25/0x2a
Jul  4 11:39:40 fir-rbh01 kernel: INFO: task robinhood:8864 blocked for more than 120 seconds.

I tried restarting the client and Robinhood multiple times without success; the same issue came back every time.
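As a side note, the hung-task messages above come from the periodic watchdog; a full snapshot of every blocked (D-state) task on the client can be dumped on demand with sysrq-w, assuming sysrq is enabled on the node:

[root@fir-rbh01 ~]# echo w > /proc/sysrq-trigger    # dump all blocked (D-state) tasks to the kernel log
[root@fir-rbh01 ~]# dmesg | tail -n 200             # the robinhood stacks appear at the end of the log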
When I look at the MDT (fir-MDT0001), which is currently still running, I see a constant CPU load from this process:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
12875 root      20   0       0      0      0 S  13.6  0.0  39:41.42 lquota_wb_fir-M

Looking further with crash, there seem to be two lquota_wb threads (12871 and 12875):

crash> bt 12871
PID: 12871  TASK: ffff8ccd6c9b0000  CPU: 6  COMMAND: "lquota_wb_fir-M"
 #0 [ffff8ccd6ca93cf0] __schedule at ffffffffb07b78d8
 #1 [ffff8ccd6ca93d58] schedule at ffffffffb07b7ca9
 #2 [ffff8ccd6ca93d68] schedule_timeout at ffffffffb07b5778
 #3 [ffff8ccd6ca93e10] qsd_upd_thread at ffffffffc175a665 [lquota]
 #4 [ffff8ccd6ca93ec8] kthread at ffffffffb00cb621
 #5 [ffff8ccd6ca93f50] ret_from_fork_nospec_begin at ffffffffb07c51dd

crash> bt 12875
PID: 12875  TASK: ffff8ccd6cba1080  CPU: 2  COMMAND: "lquota_wb_fir-M"
 #0 [ffff8ccd6cbabcf0] __schedule at ffffffffb07b78d8
 #1 [ffff8ccd6cbabd58] schedule at ffffffffb07b7ca9
 #2 [ffff8ccd6cbabd68] schedule_timeout at ffffffffb07b5778
 #3 [ffff8ccd6cbabe10] qsd_upd_thread at ffffffffc175a665 [lquota]
 #4 [ffff8ccd6cbabec8] kthread at ffffffffb00cb621
 #5 [ffff8ccd6cbabf50] ret_from_fork_nospec_begin at ffffffffb07c51dd

At first glance, as this involves lquota_wb, this looks similar to

However, the following thread looks suspicious to me, as it involves lquota and seems stuck:

crash> bt 32263
PID: 32263  TASK: ffff8cdd78a21080  CPU: 5  COMMAND: "mdt01_011"
 #0 [ffff8ccd6b5a3640] __schedule at ffffffffb07b78d8
 #1 [ffff8ccd6b5a36a8] schedule at ffffffffb07b7ca9
 #2 [ffff8ccd6b5a36b8] schedule_timeout at ffffffffb07b5778
 #3 [ffff8ccd6b5a3760] schedule_timeout_interruptible at ffffffffb07b58fe
 #4 [ffff8ccd6b5a3770] qsd_acquire at ffffffffc175e915 [lquota]
 #5 [ffff8ccd6b5a3810] qsd_op_begin0 at ffffffffc175f6bf [lquota]
 #6 [ffff8ccd6b5a38b0] qsd_op_begin at ffffffffc1760142 [lquota]
 #7 [ffff8ccd6b5a38f8] osd_declare_qid at ffffffffc17ffcb3 [osd_ldiskfs]
 #8 [ffff8ccd6b5a3950] osd_declare_attr_qid at ffffffffc17bbdfc [osd_ldiskfs]
 #9 [ffff8ccd6b5a39a8] osd_declare_attr_set at ffffffffc17bffbe [osd_ldiskfs]
#10 [ffff8ccd6b5a39f8] lod_sub_declare_attr_set at ffffffffc1a7752c [lod]
#11 [ffff8ccd6b5a3a48] lod_declare_attr_set at ffffffffc1a5854b [lod]
#12 [ffff8ccd6b5a3ac8] mdd_attr_set at ffffffffc1d1ab32 [mdd]
#13 [ffff8ccd6b5a3b48] mdt_attr_set at ffffffffc193ac36 [mdt]
#14 [ffff8ccd6b5a3b98] mdt_reint_setattr at ffffffffc193bd3a [mdt]
#15 [ffff8ccd6b5a3c10] mdt_reint_rec at ffffffffc193e69a [mdt]
#16 [ffff8ccd6b5a3c38] mdt_reint_internal at ffffffffc1913f3c [mdt]
#17 [ffff8ccd6b5a3c78] mdt_reint at ffffffffc191e807 [mdt]
#18 [ffff8ccd6b5a3ca8] tgt_request_handle at ffffffffc13b725f [ptlrpc]
#19 [ffff8ccd6b5a3d38] ptlrpc_server_handle_request at ffffffffc1360aa3 [ptlrpc]
#20 [ffff8ccd6b5a3df0] ptlrpc_main at ffffffffc1362734 [ptlrpc]
#21 [ffff8ccd6b5a3ec8] kthread at ffffffffb00cb621
#22 [ffff8ccd6b5a3f50] ret_from_fork_nospec_begin at ffffffffb07c51dd

For now, I am attaching a live "foreach bt" taken from this MDS as fir-md1-s2_fir-MDT0001_foreach_bt_lquota_wb_20230704.txt.
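For reference, a dump like this can be collected with crash against the live kernel on the MDS, roughly as follows; the vmlinux/debuginfo path is an assumption and depends on the installed kernel-debuginfo package:

[root@fir-md1-s2 ~]# crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /proc/kcore
crash> foreach bt > fir-md1-s2_fir-MDT0001_foreach_bt_lquota_wb_20230704.txt    # back-traces of all tasks, redirected to a file
crash> bt 12875                                                                 # just the busy lquota_wb thread
|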
| Comments |
| Comment by Stephane Thiell [ 05/Jul/23 ] |
|
So it looks like we have a user doing parallel chgrp operations on the same set of files, which might be triggering this issue. From fir-MDT0001:

00010000:00020000:8.0:1688572070.004396:0:11568:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1688571770, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0001_UUID lock: ffff891c70dc2400/0x7f8123703e0ca85 lrc: 3/0,1 mode: --/PW res: [0x24007e29d:0xb9ce:0x0].0x0 bits 0x13/0x0 rrc: 5 type: IBT gid 0 flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 11568 timeout: 0 lvb_type: 0
00010000:00020000:19.0:1688572076.226395:0:11683:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1688571776, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0001_UUID lock: ffff8942c4abb600/0x7f81237041bc52a lrc: 3/1,0 mode: --/PR res: [0x24007e29d:0xb9ce:0x0].0x0 bits 0x13/0x48 rrc: 5 type: IBT gid 0 flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 11683 timeout: 0 lvb_type: 0

[root@fir-rbh01 ~]# lfs fid2path /fir 0x24007e29d:0xb9ce:0x0
/fir/users/sschulz/porteus_all_cells/01_Combined_Germline_Sigprofiler.pdf

# it's on MDT0001:
[root@fir-rbh01 ~]# lfs getdirstripe /fir/users/sschulz/porteus_all_cells
lmv_stripe_count: 0
lmv_stripe_offset: 1
lmv_hash_type: none

Looking on the cluster, at least three different jobs doing chgrp are stuck on the same file. If I restart fir-MDT0000, the situation resolves itself for a few minutes after recovery. Debug logs when fir-MDT0000 was restarted:

00010000:02000000:0.0:1688571753.052615:0:16567:0:(ldlm_lib.c:1760:target_finish_recovery()) fir-MDT0000: Recovery over after 3:38, of 1691 clients 1690 recovered and 1 was evicted.
...
00040000:00000400:17.0:1688571753.142247:0:16540:0:(qsd_writeback.c:539:qsd_upd_thread()) fir-MDT0000: The reintegration thread [2] blocked more than 42000 seconds
00040000:00000400:9.0:1688571753.146880:0:16544:0:(qsd_writeback.c:539:qsd_upd_thread()) fir-MDT0000: The reintegration thread [2] blocked more than 42000 seconds
00040000:00000400:33.0:1688571753.186435:0:16544:0:(qsd_writeback.c:539:qsd_upd_thread()) fir-MDT0000: The reintegration thread [2] blocked more than 84000 seconds
00040000:00000400:33.0:1688571753.512182:0:16544:0:(qsd_writeback.c:539:qsd_upd_thread()) fir-MDT0000: The reintegration thread [2] blocked more than 126000 seconds
00040000:00000400:17.0:1688571753.613086:0:16544:0:(qsd_writeback.c:539:qsd_upd_thread()) fir-MDT0000: The reintegration thread [2] blocked more than 168000 seconds
00040000:00000400:41.0:1688571753.697687:0:16544:0:(qsd_writeback.c:539:qsd_upd_thread()) fir-MDT0000: The reintegration thread [2] blocked more than 210000 seconds
00040000:00000400:33.0:1688571753.929210:0:16544:0:(qsd_writeback.c:539:qsd_upd_thread()) fir-MDT0000: The reintegration thread [2] blocked more than 252000 seconds
00040000:00000400:25.0:1688571754.073319:0:16544:0:(qsd_writeback.c:539:qsd_upd_thread()) fir-MDT0000: The reintegration thread [2] blocked more than 294000 seconds
00040000:00000400:33.0:1688571754.161063:0:16544:0:(qsd_writeback.c:539:qsd_upd_thread()) fir-MDT0000: The reintegration thread [2] blocked more than 336000 seconds
00040000:00000400:25.0:1688571754.161428:0:16540:0:(qsd_writeback.c:539:qsd_upd_thread()) fir-MDT0000: The reintegration thread [2] blocked more than 84000 seconds

But then, within minutes, the lquota_wb thread starts consuming CPU again on fir-MDT0001 and Robinhood gets stuck again, this time on another of this user's files involved in a chgrp.
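A way to keep an eye on the quota slave (qsd) state on the MDTs while this is happening is to dump the quota_slave info with lctl; the exact parameter path below is to the best of my knowledge and may differ slightly between Lustre versions:

[root@fir-md1-s2 ~]# lctl get_param osd-*.fir-MDT*.quota_slave.info    # shows whether user/group reintegration is up to date
[root@fir-md1-s2 ~]# top -b -n 1 | grep lquota_wb                      # CPU time accumulated by the writeback threads
|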
| Comment by Stephane Thiell [ 05/Jul/23 ] |
|
I confirmed that the user is simply doing a chgrp at the end of their job script (using a glob, which is not great, but nothing really unusual). We also verified that operations on one of the files matched by this glob are hanging.

...
rm -rf $RESULTS_DIR/tmp*
echo "results directory is ${RESULTS_DIR}"
echo "final directory is ${FINAL_DIR}"
rm -rf $RESULTS_DIR/*output
mkdir -p $FINAL_DIR
chgrp oak_cgawad $RESULTS_DIR/* # <<<
chmod g+rwx $RESULTS_DIR/*
rsync -a $RESULTS_DIR/* $FINAL_DIR
echo "current directory is `pwd`"
#rm -rf $RESULTS_DIR
echo "### Manta SV done ###"
This was running fine with Lustre 2.12.9 servers before the upgrade to 2.15.3.
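For what it's worth, the access pattern boils down to several jobs ending around the same time and each running chgrp over the same glob. A minimal sketch of that pattern is below (the paths are placeholders and I have not tried this as a standalone reproducer):

# Hypothetical sketch only: several concurrent chgrp/chmod passes over the
# same set of files, mimicking multiple job epilogues finishing together.
RESULTS_DIR=/fir/users/someuser/results    # placeholder path on fir-MDT0001
for i in 1 2 3 4; do
    ( chgrp oak_cgawad "$RESULTS_DIR"/* ; chmod g+rwx "$RESULTS_DIR"/* ) &
done
wait
|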
| Comment by Stephane Thiell [ 06/Jul/23 ] |
|
This might be related to |