[LU-16941] MDT deadlock involving lquota_wb Created: 04/Jul/23  Updated: 06/Jul/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.3
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Stephane Thiell Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Attachments: Text File fir-md1-s2_fir-MDT0001_foreach_bt_lquota_wb_20230704.txt    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hello! Two days ago, we upgraded the Lustre servers of Sherlock's scratch (Fir) from 2.12.9 to 2.15.3. The clients were already running 2.15.x. The server upgrade went smoothly, but I noticed one issue today: we got an alert that a Robinhood changelog reader (also running 2.15.3) was stuck, and indeed:

Jul  4 11:39:40 fir-rbh01 kernel: INFO: task robinhood:8863 blocked for more than 120 seconds.
Jul  4 11:39:40 fir-rbh01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  4 11:39:40 fir-rbh01 kernel: robinhood       D ffff9bacf6375280     0  8863      1 0x00000080
Jul  4 11:39:40 fir-rbh01 kernel: Call Trace:
Jul  4 11:39:40 fir-rbh01 kernel: [<ffffffffac7b8bf9>] schedule_preempt_disabled+0x29/0x70
Jul  4 11:39:40 fir-rbh01 kernel: [<ffffffffac7b6ca7>] __mutex_lock_slowpath+0xc7/0x1e0
Jul  4 11:39:40 fir-rbh01 kernel: [<ffffffffac7b602f>] mutex_lock+0x1f/0x33
Jul  4 11:39:40 fir-rbh01 kernel: [<ffffffffac7ae6ed>] lookup_slow+0x33/0xab
Jul  4 11:39:40 fir-rbh01 kernel: [<ffffffffac26bf8e>] path_lookupat+0x89e/0x8d0
Jul  4 11:39:40 fir-rbh01 kernel: [<ffffffffac235c6b>] ? kmem_cache_alloc+0x19b/0x1f0
Jul  4 11:39:40 fir-rbh01 kernel: [<ffffffffac26e16f>] ? getname_flags+0x4f/0x1a0
Jul  4 11:39:40 fir-rbh01 kernel: [<ffffffffac26bfeb>] filename_lookup+0x2b/0xd0
Jul  4 11:39:40 fir-rbh01 kernel: [<ffffffffac26f337>] user_path_at_empty+0x67/0xc0
Jul  4 11:39:40 fir-rbh01 kernel: [<ffffffffac25e493>] ? fput+0x13/0x20
Jul  4 11:39:40 fir-rbh01 kernel: [<ffffffffac26f3a1>] user_path_at+0x11/0x20
Jul  4 11:39:40 fir-rbh01 kernel: [<ffffffffac2618a3>] vfs_fstatat+0x63/0xd0
Jul  4 11:39:40 fir-rbh01 kernel: [<ffffffffac261d01>] SYSC_newlstat+0x31/0x70
Jul  4 11:39:40 fir-rbh01 kernel: [<ffffffffac1468f6>] ? __audit_syscall_exit+0x1f6/0x2b0
Jul  4 11:39:40 fir-rbh01 kernel: [<ffffffffac2621ce>] SyS_newlstat+0xe/0x20
Jul  4 11:39:40 fir-rbh01 kernel: [<ffffffffac7c539a>] system_call_fastpath+0x25/0x2a
Jul  4 11:39:40 fir-rbh01 kernel: INFO: task robinhood:8864 blocked for more than 120 seconds.

I tried restarting the client and Robinhood multiple times without success: same issue every time. When I look at the MDT (fir-MDT0001), which is currently still running, I notice a constant CPU load from this process:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                         
 12875 root      20   0       0      0      0 S  13.6  0.0  39:41.42 lquota_wb_fir-M   
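
(For reference, a quick way to sample what this busy thread is doing on the live MDS, without crash, is to read its kernel stack from procfs. Sketch below; the PID is the one from the top output above, and the stack only shows something useful while the thread is sleeping.)

[root@fir-md1-s2 ~]# for i in 1 2 3; do cat /proc/12875/stack; echo ---; sleep 1; done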

Looking further with crash, there seem to be two lquota_wb threads (12871 and 12875):

crash> bt 12871
PID: 12871  TASK: ffff8ccd6c9b0000  CPU: 6   COMMAND: "lquota_wb_fir-M"
 #0 [ffff8ccd6ca93cf0] __schedule at ffffffffb07b78d8
 #1 [ffff8ccd6ca93d58] schedule at ffffffffb07b7ca9
 #2 [ffff8ccd6ca93d68] schedule_timeout at ffffffffb07b5778
 #3 [ffff8ccd6ca93e10] qsd_upd_thread at ffffffffc175a665 [lquota]
 #4 [ffff8ccd6ca93ec8] kthread at ffffffffb00cb621
 #5 [ffff8ccd6ca93f50] ret_from_fork_nospec_begin at ffffffffb07c51dd
crash> bt 12875
PID: 12875  TASK: ffff8ccd6cba1080  CPU: 2   COMMAND: "lquota_wb_fir-M"
 #0 [ffff8ccd6cbabcf0] __schedule at ffffffffb07b78d8
 #1 [ffff8ccd6cbabd58] schedule at ffffffffb07b7ca9
 #2 [ffff8ccd6cbabd68] schedule_timeout at ffffffffb07b5778
 #3 [ffff8ccd6cbabe10] qsd_upd_thread at ffffffffc175a665 [lquota]
 #4 [ffff8ccd6cbabec8] kthread at ffffffffb00cb621
 #5 [ffff8ccd6cbabf50] ret_from_fork_nospec_begin at ffffffffb07c51dd
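
(For reference, crash's "ps -m" can also show how long each of these two threads has been in its current state; same PIDs as above.)

crash> ps -m 12871 12875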

At first glance, since lquota_wb is involved, this looks similar to LU-15283, but I can't find the same backtrace signature, and that issue should already be fixed in 2.15.3.

However, the following thread looks suspicious to me, as it involves lquota and seems stuck:

crash> bt 32263
PID: 32263  TASK: ffff8cdd78a21080  CPU: 5   COMMAND: "mdt01_011"
 #0 [ffff8ccd6b5a3640] __schedule at ffffffffb07b78d8
 #1 [ffff8ccd6b5a36a8] schedule at ffffffffb07b7ca9
 #2 [ffff8ccd6b5a36b8] schedule_timeout at ffffffffb07b5778
 #3 [ffff8ccd6b5a3760] schedule_timeout_interruptible at ffffffffb07b58fe
 #4 [ffff8ccd6b5a3770] qsd_acquire at ffffffffc175e915 [lquota]
 #5 [ffff8ccd6b5a3810] qsd_op_begin0 at ffffffffc175f6bf [lquota]
 #6 [ffff8ccd6b5a38b0] qsd_op_begin at ffffffffc1760142 [lquota]
 #7 [ffff8ccd6b5a38f8] osd_declare_qid at ffffffffc17ffcb3 [osd_ldiskfs]
 #8 [ffff8ccd6b5a3950] osd_declare_attr_qid at ffffffffc17bbdfc [osd_ldiskfs]
 #9 [ffff8ccd6b5a39a8] osd_declare_attr_set at ffffffffc17bffbe [osd_ldiskfs]
#10 [ffff8ccd6b5a39f8] lod_sub_declare_attr_set at ffffffffc1a7752c [lod]
#11 [ffff8ccd6b5a3a48] lod_declare_attr_set at ffffffffc1a5854b [lod]
#12 [ffff8ccd6b5a3ac8] mdd_attr_set at ffffffffc1d1ab32 [mdd]
#13 [ffff8ccd6b5a3b48] mdt_attr_set at ffffffffc193ac36 [mdt]
#14 [ffff8ccd6b5a3b98] mdt_reint_setattr at ffffffffc193bd3a [mdt]
#15 [ffff8ccd6b5a3c10] mdt_reint_rec at ffffffffc193e69a [mdt]
#16 [ffff8ccd6b5a3c38] mdt_reint_internal at ffffffffc1913f3c [mdt]
#17 [ffff8ccd6b5a3c78] mdt_reint at ffffffffc191e807 [mdt]
#18 [ffff8ccd6b5a3ca8] tgt_request_handle at ffffffffc13b725f [ptlrpc]
#19 [ffff8ccd6b5a3d38] ptlrpc_server_handle_request at ffffffffc1360aa3 [ptlrpc]
#20 [ffff8ccd6b5a3df0] ptlrpc_main at ffffffffc1362734 [ptlrpc]
#21 [ffff8ccd6b5a3ec8] kthread at ffffffffb00cb621
#22 [ffff8ccd6b5a3f50] ret_from_fork_nospec_begin at ffffffffb07c51dd

For now, I am attaching a live "foreach bt" taken on this MDS as fir-md1-s2_fir-MDT0001_foreach_bt_lquota_wb_20230704.txt. Let me know if you have any idea how to fix this. Thanks much.
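
(For reference, a live "foreach bt" like the attached one can be collected roughly as sketched below, running crash against the live kernel on the MDS; the vmlinux debuginfo path is the usual RHEL location and may differ on other setups.)

[root@fir-md1-s2 ~]# crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux
crash> foreach bt > fir-md1-s2_fir-MDT0001_foreach_bt_lquota_wb_20230704.txt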



 Comments   
Comment by Stephane Thiell [ 05/Jul/23 ]

So it looks like we have a user running parallel chgrp operations on the same set of files, which might be triggering this issue.

From fir-MDT0001:

00010000:00020000:8.0:1688572070.004396:0:11568:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1688571770, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0001_UUID lock: ffff891c70dc2400/0x7f8123703e0ca85 lrc: 3/0,1 mode: --/PW res: [0x24007e29d:0xb9ce:0x0].0x0 bits 0x13/0x0 rrc: 5 type: IBT gid 0 flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 11568 timeout: 0 lvb_type: 0
00010000:00020000:19.0:1688572076.226395:0:11683:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1688571776, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0001_UUID lock: ffff8942c4abb600/0x7f81237041bc52a lrc: 3/1,0 mode: --/PR res: [0x24007e29d:0xb9ce:0x0].0x0 bits 0x13/0x48 rrc: 5 type: IBT gid 0 flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 11683 timeout: 0 lvb_type: 0
[root@fir-rbh01 ~]# lfs fid2path /fir 0x24007e29d:0xb9ce:0x0
/fir/users/sschulz/porteus_all_cells/01_Combined_Germline_Sigprofiler.pdf

# it's on MDT0001:
[root@fir-rbh01 ~]# lfs getdirstripe /fir/users/sschulz/porteus_all_cells
lmv_stripe_count: 0 lmv_stripe_offset: 1 lmv_hash_type: none

Looking across the cluster, at least three different jobs doing chgrp are stuck on the same file.
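
(For reference, a rough way to pull every FID mentioned in these "lock timed out" messages out of a saved debug log, so each one can then be resolved with lfs fid2path on a client as above; the dump file path is just an example.)

[root@fir-md1-s2 ~]# lctl dk /tmp/fir-MDT0001.dk
[root@fir-md1-s2 ~]# grep 'lock timed out' /tmp/fir-MDT0001.dk | \
    sed -n 's/.*res: \[\(0x[0-9a-f]*:0x[0-9a-f]*:0x[0-9a-f]*\)\].*/\1/p' | sort -u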

If I restart fir-MDT0000, the situation resolves itself for a few minutes after recovery completes. Debug logs from when fir-MDT0000 was restarted:

00010000:02000000:0.0:1688571753.052615:0:16567:0:(ldlm_lib.c:1760:target_finish_recovery()) fir-MDT0000: Recovery over after 3:38, of 1691 clients 1690 recovered and 1 was evicted.
...
00040000:00000400:17.0:1688571753.142247:0:16540:0:(qsd_writeback.c:539:qsd_upd_thread()) fir-MDT0000: The reintegration thread [2] blocked more than 42000 seconds
00040000:00000400:9.0:1688571753.146880:0:16544:0:(qsd_writeback.c:539:qsd_upd_thread()) fir-MDT0000: The reintegration thread [2] blocked more than 42000 seconds
00040000:00000400:33.0:1688571753.186435:0:16544:0:(qsd_writeback.c:539:qsd_upd_thread()) fir-MDT0000: The reintegration thread [2] blocked more than 84000 seconds
00040000:00000400:33.0:1688571753.512182:0:16544:0:(qsd_writeback.c:539:qsd_upd_thread()) fir-MDT0000: The reintegration thread [2] blocked more than 126000 seconds
00040000:00000400:17.0:1688571753.613086:0:16544:0:(qsd_writeback.c:539:qsd_upd_thread()) fir-MDT0000: The reintegration thread [2] blocked more than 168000 seconds
00040000:00000400:41.0:1688571753.697687:0:16544:0:(qsd_writeback.c:539:qsd_upd_thread()) fir-MDT0000: The reintegration thread [2] blocked more than 210000 seconds
00040000:00000400:33.0:1688571753.929210:0:16544:0:(qsd_writeback.c:539:qsd_upd_thread()) fir-MDT0000: The reintegration thread [2] blocked more than 252000 seconds
00040000:00000400:25.0:1688571754.073319:0:16544:0:(qsd_writeback.c:539:qsd_upd_thread()) fir-MDT0000: The reintegration thread [2] blocked more than 294000 seconds
00040000:00000400:33.0:1688571754.161063:0:16544:0:(qsd_writeback.c:539:qsd_upd_thread()) fir-MDT0000: The reintegration thread [2] blocked more than 336000 seconds
00040000:00000400:25.0:1688571754.161428:0:16540:0:(qsd_writeback.c:539:qsd_upd_thread()) fir-MDT0000: The reintegration thread [2] blocked more than 84000 seconds

But then, quickly, the lquota_wb thread starts consuming CPU again on fir-MDT0001 and Robinhood gets stuck again, this time on another file targeted by this user's chgrp.
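
(For reference, a way to check whether the quota slave on this MDT has caught up with its reintegration, which is what these qsd_upd_thread messages are about; the parameter path below is as I understand it for the ldiskfs OSD, and the interesting parts of the output are the "conn to master" and "user/group uptodate" lines.)

[root@fir-md1-s2 ~]# lctl get_param osd-*.*.quota_slave.info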

Comment by Stephane Thiell [ 05/Jul/23 ]

I confirmed that the user is simply running a chgrp at the end of their job script (using a glob, which is not great, but nothing really crazy). We also verified that one of the files matched by this glob is the one hanging.

...
rm -rf $RESULTS_DIR/tmp*
echo "results directory is ${RESULTS_DIR}"
echo "final directory is ${FINAL_DIR}"

rm -rf $RESULTS_DIR/*output
mkdir -p $FINAL_DIR


chgrp oak_cgawad $RESULTS_DIR/*    # <<<
chmod g+rwx $RESULTS_DIR/*
rsync -a $RESULTS_DIR/* $FINAL_DIR

echo "current directory is `pwd`"
#rm -rf $RESULTS_DIR

echo "### Manta SV done  ###"

This was running fine with Lustre 2.12.9 servers before the upgrade to 2.15.3.
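
(In case it helps reproduce, the pattern boils down to several concurrent jobs running chgrp over the same set of files. A minimal sketch is below; the directory, group name and file count are made up, not taken from the user's job.)

#!/bin/bash
# several concurrent "jobs" chgrp'ing the same glob of files
DIR=/fir/users/someuser/results    # placeholder path
GRP=somegroup                      # placeholder group
mkdir -p "$DIR"
touch "$DIR"/file_{1..1000}
for i in 1 2 3; do
    chgrp "$GRP" "$DIR"/* &
done
wait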

Comment by Stephane Thiell [ 06/Jul/23 ]

This might be related to LU-15880. I backported a patch ("quota: fix issues in reserving quota") that is missing from 2.15.3 and that seems to fix quota issues with chgrp. We're now trying it in production.
