-
Bug
-
Resolution: Unresolved
-
Critical
-
None
-
Lustre 2.15.3
-
None
-
3
-
9223372036854775807
Hello! Two days ago, we upgraded the Lustre servers of Sherlock's scratch (Fir) from 2.12.9 to 2.15.3. The clients were already running 2.15.x. The server upgrade went smoothly but I noticed one issue today: we got an alert that a Robinhood changelog readerĀ (also running 2.15.3) was stuck, and indeed:
Jul 4 11:39:40 fir-rbh01 kernel: INFO: task robinhood:8863 blocked for more than 120 seconds. Jul 4 11:39:40 fir-rbh01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jul 4 11:39:40 fir-rbh01 kernel: robinhood D ffff9bacf6375280 0 8863 1 0x00000080 Jul 4 11:39:40 fir-rbh01 kernel: Call Trace: Jul 4 11:39:40 fir-rbh01 kernel: [<ffffffffac7b8bf9>] schedule_preempt_disabled+0x29/0x70 Jul 4 11:39:40 fir-rbh01 kernel: [<ffffffffac7b6ca7>] __mutex_lock_slowpath+0xc7/0x1e0 Jul 4 11:39:40 fir-rbh01 kernel: [<ffffffffac7b602f>] mutex_lock+0x1f/0x33 Jul 4 11:39:40 fir-rbh01 kernel: [<ffffffffac7ae6ed>] lookup_slow+0x33/0xab Jul 4 11:39:40 fir-rbh01 kernel: [<ffffffffac26bf8e>] path_lookupat+0x89e/0x8d0 Jul 4 11:39:40 fir-rbh01 kernel: [<ffffffffac235c6b>] ? kmem_cache_alloc+0x19b/0x1f0 Jul 4 11:39:40 fir-rbh01 kernel: [<ffffffffac26e16f>] ? getname_flags+0x4f/0x1a0 Jul 4 11:39:40 fir-rbh01 kernel: [<ffffffffac26bfeb>] filename_lookup+0x2b/0xd0 Jul 4 11:39:40 fir-rbh01 kernel: [<ffffffffac26f337>] user_path_at_empty+0x67/0xc0 Jul 4 11:39:40 fir-rbh01 kernel: [<ffffffffac25e493>] ? fput+0x13/0x20 Jul 4 11:39:40 fir-rbh01 kernel: [<ffffffffac26f3a1>] user_path_at+0x11/0x20 Jul 4 11:39:40 fir-rbh01 kernel: [<ffffffffac2618a3>] vfs_fstatat+0x63/0xd0 Jul 4 11:39:40 fir-rbh01 kernel: [<ffffffffac261d01>] SYSC_newlstat+0x31/0x70 Jul 4 11:39:40 fir-rbh01 kernel: [<ffffffffac1468f6>] ? __audit_syscall_exit+0x1f6/0x2b0 Jul 4 11:39:40 fir-rbh01 kernel: [<ffffffffac2621ce>] SyS_newlstat+0xe/0x20 Jul 4 11:39:40 fir-rbh01 kernel: [<ffffffffac7c539a>] system_call_fastpath+0x25/0x2a Jul 4 11:39:40 fir-rbh01 kernel: INFO: task robinhood:8864 blocked for more than 120 seconds.
I tried to restart the client and Robinhood multiple times without success, same issue. When I look at the MDT (fir-MDT0001), which is currently still running, I can notice a constant CPU load from this process:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 12875 root 20 0 0 0 0 S 13.6 0.0 39:41.42 lquota_wb_fir-M
Looking further with crash, there seems to be two lquota_wb threads (12871 and 12875):
crash> bt 12871 PID: 12871 TASK: ffff8ccd6c9b0000 CPU: 6 COMMAND: "lquota_wb_fir-M" #0 [ffff8ccd6ca93cf0] __schedule at ffffffffb07b78d8 #1 [ffff8ccd6ca93d58] schedule at ffffffffb07b7ca9 #2 [ffff8ccd6ca93d68] schedule_timeout at ffffffffb07b5778 #3 [ffff8ccd6ca93e10] qsd_upd_thread at ffffffffc175a665 [lquota] #4 [ffff8ccd6ca93ec8] kthread at ffffffffb00cb621 #5 [ffff8ccd6ca93f50] ret_from_fork_nospec_begin at ffffffffb07c51dd crash> bt 12875 PID: 12875 TASK: ffff8ccd6cba1080 CPU: 2 COMMAND: "lquota_wb_fir-M" #0 [ffff8ccd6cbabcf0] __schedule at ffffffffb07b78d8 #1 [ffff8ccd6cbabd58] schedule at ffffffffb07b7ca9 #2 [ffff8ccd6cbabd68] schedule_timeout at ffffffffb07b5778 #3 [ffff8ccd6cbabe10] qsd_upd_thread at ffffffffc175a665 [lquota] #4 [ffff8ccd6cbabec8] kthread at ffffffffb00cb621 #5 [ffff8ccd6cbabf50] ret_from_fork_nospec_begin at ffffffffb07c51dd
At first glance, as this involves lquota_wb, this looks similar to LU-15283, but I can't find the same backtrace signature. And the issue should be fixed in 2.15.3.
However, the following thread is suspect for me, as it involves lquota and seems stuck:
crash> bt 32263 PID: 32263 TASK: ffff8cdd78a21080 CPU: 5 COMMAND: "mdt01_011" #0 [ffff8ccd6b5a3640] __schedule at ffffffffb07b78d8 #1 [ffff8ccd6b5a36a8] schedule at ffffffffb07b7ca9 #2 [ffff8ccd6b5a36b8] schedule_timeout at ffffffffb07b5778 #3 [ffff8ccd6b5a3760] schedule_timeout_interruptible at ffffffffb07b58fe #4 [ffff8ccd6b5a3770] qsd_acquire at ffffffffc175e915 [lquota] #5 [ffff8ccd6b5a3810] qsd_op_begin0 at ffffffffc175f6bf [lquota] #6 [ffff8ccd6b5a38b0] qsd_op_begin at ffffffffc1760142 [lquota] #7 [ffff8ccd6b5a38f8] osd_declare_qid at ffffffffc17ffcb3 [osd_ldiskfs] #8 [ffff8ccd6b5a3950] osd_declare_attr_qid at ffffffffc17bbdfc [osd_ldiskfs] #9 [ffff8ccd6b5a39a8] osd_declare_attr_set at ffffffffc17bffbe [osd_ldiskfs] #10 [ffff8ccd6b5a39f8] lod_sub_declare_attr_set at ffffffffc1a7752c [lod] #11 [ffff8ccd6b5a3a48] lod_declare_attr_set at ffffffffc1a5854b [lod] #12 [ffff8ccd6b5a3ac8] mdd_attr_set at ffffffffc1d1ab32 [mdd] #13 [ffff8ccd6b5a3b48] mdt_attr_set at ffffffffc193ac36 [mdt] #14 [ffff8ccd6b5a3b98] mdt_reint_setattr at ffffffffc193bd3a [mdt] #15 [ffff8ccd6b5a3c10] mdt_reint_rec at ffffffffc193e69a [mdt] #16 [ffff8ccd6b5a3c38] mdt_reint_internal at ffffffffc1913f3c [mdt] #17 [ffff8ccd6b5a3c78] mdt_reint at ffffffffc191e807 [mdt] #18 [ffff8ccd6b5a3ca8] tgt_request_handle at ffffffffc13b725f [ptlrpc] #19 [ffff8ccd6b5a3d38] ptlrpc_server_handle_request at ffffffffc1360aa3 [ptlrpc] #20 [ffff8ccd6b5a3df0] ptlrpc_main at ffffffffc1362734 [ptlrpc] #21 [ffff8ccd6b5a3ec8] kthread at ffffffffb00cb621 #22 [ffff8ccd6b5a3f50] ret_from_fork_nospec_begin at ffffffffb07c51dd
For now, I am attaching a live "foreach bt" taken from this MDS as fir-md1-s2_fir-MDT0001_foreach_bt_lquota_wb_20230704.txt
. Let me know if you have any idea on how to fix this. Thanks much.