[LU-15605] OSS deadlock when running many lfs_migrate processes Created: 28/Feb/22  Updated: 01/Mar/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.7
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Stephane Thiell Assignee: Alex Zhuravlev
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 7.9 (3.10.0-1160.6.1.el7_lustre.pl1.x86_64); clients running Lustre 2.12.8


Attachments: Text File dmesg_oak-io6-s2_2022-02-25-18-13-45.log, Text File foreach_bt_oak-io6-s2_2022-02-25-18-13-45.log
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Last Friday, we hit a deadlock on an OSS of Oak (running Lustre 2.12.7), where many kernel threads became blocked in start_this_handle(). This is the first time we have seen this, and I believe it is due to us running 24 parallel lfs_migrate processes for OST-to-OST migration. The load on the OSS increased dramatically to 500+, but almost no I/O was being performed on the disks anymore. We had to reboot the server to get out of this situation. We have also paused our OST-to-OST data migration until we know more.

I captured a vmcore to further investigate the issue.

Checking other Lustre tickets in this Jira, this problem looks similar to LU-15117, and I wonder if it is also related to EDU-190 (note, however, that in our case only CentOS 7 is used).

A typical blocked thread is:

PID: 229507  TASK: ffff918339656300  CPU: 2   COMMAND: "ll_ost00_090"
 #0 [ffff9182803d7948] __schedule at ffffffffa2f86d07
 #1 [ffff9182803d79d0] schedule at ffffffffa2f87229
 #2 [ffff9182803d79e0] wait_transaction_locked at ffffffffc0275085 [jbd2]
 #3 [ffff9182803d7a38] add_transaction_credits at ffffffffc0275378 [jbd2]
 #4 [ffff9182803d7a98] start_this_handle at ffffffffc0275601 [jbd2]
 #5 [ffff9182803d7b30] jbd2__journal_start at ffffffffc0275ab3 [jbd2]
 #6 [ffff9182803d7b78] __ldiskfs_journal_start_sb at ffffffffc14425b9 [ldiskfs]
 #7 [ffff9182803d7bb8] osd_trans_start at ffffffffc1516e8e [osd_ldiskfs]
 #8 [ffff9182803d7bf0] ofd_trans_start at ffffffffc1679425 [ofd]
 #9 [ffff9182803d7c20] ofd_attr_set at ffffffffc167ca73 [ofd]
#10 [ffff9182803d7c78] ofd_setattr_hdl at ffffffffc1668dcd [ofd]
#11 [ffff9182803d7cd0] tgt_request_handle at ffffffffc10e789a [ptlrpc]
#12 [ffff9182803d7d58] ptlrpc_server_handle_request at ffffffffc108c73b [ptlrpc]
#13 [ffff9182803d7df8] ptlrpc_main at ffffffffc10900a4 [ptlrpc]
#14 [ffff9182803d7ec8] kthread at ffffffffa28c5c21
#15 [ffff9182803d7f50] ret_from_fork_nospec_begin at ffffffffa2f94ddd

I don't know which thread is holding the transaction, and since we have 40 OSTs on this OSS, that probably makes it harder to find out. However, some threads are in osd_write_lock(), as reported in LU-15117:

PID: 143327  TASK: ffff9169e5f84200  CPU: 3   COMMAND: "ll_ost_out01_00"
 #0 [ffff91a1b3e4f8f8] __schedule at ffffffffa2f86d07
 #1 [ffff91a1b3e4f980] schedule at ffffffffa2f87229
 #2 [ffff91a1b3e4f990] rwsem_down_write_failed at ffffffffa2f88965
 #3 [ffff91a1b3e4fa28] call_rwsem_down_write_failed at ffffffffa2b97767
 #4 [ffff91a1b3e4fa70] down_write at ffffffffa2f8655d
 #5 [ffff91a1b3e4fa88] osd_write_lock at ffffffffc150ae8c [osd_ldiskfs]
 #6 [ffff91a1b3e4fab0] out_tx_xattr_set_exec at ffffffffc10f3a8b [ptlrpc]
 #7 [ffff91a1b3e4fb50] out_tx_end at ffffffffc10ec741 [ptlrpc]
 #8 [ffff91a1b3e4fb90] out_handle at ffffffffc10f0962 [ptlrpc]
 #9 [ffff91a1b3e4fcd0] tgt_request_handle at ffffffffc10e789a [ptlrpc]
#10 [ffff91a1b3e4fd58] ptlrpc_server_handle_request at ffffffffc108c73b [ptlrpc]
#11 [ffff91a1b3e4fdf8] ptlrpc_main at ffffffffc10900a4 [ptlrpc]
#12 [ffff91a1b3e4fec8] kthread at ffffffffa28c5c21
#13 [ffff91a1b3e4ff50] ret_from_fork_nospec_begin at ffffffffa2f94ddd

We had not seen this problem when running a smaller number of parallel lfs_migrate processes, such as 8, which we had been doing for days without issue. On Friday, however, we ramped up the OST-to-OST data migration by increasing the number of parallel workers to 16 and finally to 24 concurrent lfs_migrate processes (each working on a different file list generated by lfs find + fpart). This last, more aggressive configuration ran for a few hours before we hit this deadlock on the OSS oak-io6-s2.
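For context, the workflow was along these lines (a minimal sketch only; the mount point, OST name, and option choices are illustrative, not our exact commands):

  # build a list of files with objects on a source OST (OST name illustrative)
  lfs find /oak -type f --ost oak-OST0010 > /tmp/migrate.list

  # split the list into 24 roughly equal partitions with fpart
  fpart -n 24 -i /tmp/migrate.list -o /tmp/migrate.part

  # one lfs_migrate worker per partition, each reading file names on stdin
  for list in /tmp/migrate.part.*; do
      lfs_migrate -y < "$list" &
  done
  wait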

lfs_migrate, which uses lfs migrate underneath, performs the data transfer into a volatile file as root and, when this is done, seems to do an explicit chown which, if I am not mistaken, propagates to the OSTs as ofd_attr_set. In the crash dump, we can see many of those.
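The relative numbers can be pulled out of the attached "foreach bt" log with a simple grep (a rough sketch only; a single thread may match more than one of these symbols, so the counts are just a ballpark):

  # rough breakdown of blocked threads in the attached backtrace log
  grep -c 'start_this_handle' foreach_bt_oak-io6-s2_2022-02-25-18-13-45.log
  grep -c 'ofd_attr_set'      foreach_bt_oak-io6-s2_2022-02-25-18-13-45.log
  grep -c 'osd_write_lock'    foreach_bt_oak-io6-s2_2022-02-25-18-13-45.log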

I'm attaching the output of "foreach bt" from the vmcore we took, and I have just uploaded the vmcore itself to the Whamcloud FTP site as vmcore-oak-io6-s2_2022-02-25-18-13-45 (md5: a850b3d9af0ac70732023b6e756ab188).
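For reference, the same output can be reproduced from the vmcore with crash(8) roughly as follows (a sketch; the debuginfo vmlinux path is an assumption and must match the 3.10.0-1160.6.1.el7_lustre.pl1 kernel):

  # open the dump with the matching debuginfo kernel (path is an assumption)
  crash /usr/lib/debug/lib/modules/3.10.0-1160.6.1.el7_lustre.pl1.x86_64/vmlinux \
        vmcore-oak-io6-s2_2022-02-25-18-13-45

  # inside crash: dump the kernel log and all task backtraces to files
  crash> log > dmesg_oak-io6-s2_2022-02-25-18-13-45.log
  crash> foreach bt > foreach_bt_oak-io6-s2_2022-02-25-18-13-45.log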



 Comments   
Comment by Peter Jones [ 01/Mar/22 ]

Alex

Could you please advise?

Thanks

Peter

Comment by Alex Zhuravlev [ 01/Mar/22 ]

I'm looking at the traces. Can you please attach the dmesg from the vmcore?

Comment by Stephane Thiell [ 01/Mar/22 ]

Thanks, Alex! I just attached it as dmesg_oak-io6-s2_2022-02-25-18-13-45.log.
I think the problem started around this line:

[14611341.653001] INFO: task ll_ost_io00_001:199270 blocked for more than 120 seconds.
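
The surrounding context can be pulled from the attached log with something like this (the context window size is illustrative):

  # show the first hung-task report and the stack dumps that follow it
  grep -n -m 1 -A 60 'blocked for more than 120 seconds' \
      dmesg_oak-io6-s2_2022-02-25-18-13-45.log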
Comment by Alex Zhuravlev [ 01/Mar/22 ]

AFAICS, this one is missing from that version: https://review.whamcloud.com/31293

Comment by Stephane Thiell [ 01/Mar/22 ]

Oh, interesting! Our servers are based on 2.12.7, which contains "LU-10048 osd: async truncate" but NOT "LU-10048 ofd: take local locks within transaction" (31293). Do you think that's the one? Is there a backport to b2_12? Thanks!
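
One quick way to check which of the LU-10048 patches a given release contains is a git log search against a lustre-release checkout (a sketch; the v2_12_7 tag is assumed to match our build):

  # list LU-10048 commits present in the 2.12.7 release tag
  git clone git://git.whamcloud.com/fs/lustre-release.git
  cd lustre-release
  git log --oneline --grep='LU-10048' v2_12_7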

Comment by Stephane Thiell [ 01/Mar/22 ]

Hi Alex,

This one: https://review.whamcloud.com/#/c/43278/
"LU-13093 osd: fix osd_attr_set race"
has not landed in 2.12 either.

Comment by Stephane Thiell [ 01/Mar/22 ]

It looks like Etienne backported "LU-10048 ofd: take local locks within transaction" to b2_12 in https://review.whamcloud.com/#/c/43277/

We could try to apply both patches, as was done in LU-10048.
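
If we go that route, pulling both changes into a b2_12 tree should look roughly like this (a sketch; the patchset numbers after the change IDs are assumptions and need to be taken from the latest revision shown on Gerrit):

  # fetch and cherry-pick the backports from Gerrit
  # (ref format: refs/changes/<last two digits>/<change>/<patchset>)
  git checkout b2_12
  git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/77/43277/1 && git cherry-pick FETCH_HEAD
  git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/78/43278/1 && git cherry-pick FETCH_HEAD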
