Details
- Type: Bug
- Resolution: Unresolved
- Priority: Critical
- Affects Version/s: Lustre 2.12.7
- Environment: CentOS 7.9 (3.10.0-1160.6.1.el7_lustre.pl1.x86_64); clients running Lustre 2.12.8
- Severity: 3
Description
Last Friday, we hit a deadlock on an OSS of Oak (running Lustre 2.12.7) where many kernel threads became blocked in start_this_handle(). This is the first time we have seen this, and I believe it is due to us running 24 parallel lfs_migrate processes for OST-to-OST migration. The load on the OSS increased dramatically to 500+, but almost no I/O was being performed on the disks anymore. We had to reboot the server to get out of this situation, and we have paused our OST-to-OST data migration until we know more.
I captured a vmcore to further investigate the issue.
Looking at other Lustre tickets on this Jira, this problem looks similar to LU-15117, and I wonder if it is also related to EDU-190 (note, however, that only CentOS 7 is used in our case).
A typical blocked thread is:
PID: 229507  TASK: ffff918339656300  CPU: 2  COMMAND: "ll_ost00_090"
 #0 [ffff9182803d7948] __schedule at ffffffffa2f86d07
 #1 [ffff9182803d79d0] schedule at ffffffffa2f87229
 #2 [ffff9182803d79e0] wait_transaction_locked at ffffffffc0275085 [jbd2]
 #3 [ffff9182803d7a38] add_transaction_credits at ffffffffc0275378 [jbd2]
 #4 [ffff9182803d7a98] start_this_handle at ffffffffc0275601 [jbd2]
 #5 [ffff9182803d7b30] jbd2__journal_start at ffffffffc0275ab3 [jbd2]
 #6 [ffff9182803d7b78] __ldiskfs_journal_start_sb at ffffffffc14425b9 [ldiskfs]
 #7 [ffff9182803d7bb8] osd_trans_start at ffffffffc1516e8e [osd_ldiskfs]
 #8 [ffff9182803d7bf0] ofd_trans_start at ffffffffc1679425 [ofd]
 #9 [ffff9182803d7c20] ofd_attr_set at ffffffffc167ca73 [ofd]
#10 [ffff9182803d7c78] ofd_setattr_hdl at ffffffffc1668dcd [ofd]
#11 [ffff9182803d7cd0] tgt_request_handle at ffffffffc10e789a [ptlrpc]
#12 [ffff9182803d7d58] ptlrpc_server_handle_request at ffffffffc108c73b [ptlrpc]
#13 [ffff9182803d7df8] ptlrpc_main at ffffffffc10900a4 [ptlrpc]
#14 [ffff9182803d7ec8] kthread at ffffffffa28c5c21
#15 [ffff9182803d7f50] ret_from_fork_nospec_begin at ffffffffa2f94ddd
I don't know which thread is holding the transaction, and with 40 OSTs on this OSS that probably makes it harder to track down. But some threads are in osd_write_lock() as reported in LU-15117:
PID: 143327  TASK: ffff9169e5f84200  CPU: 3  COMMAND: "ll_ost_out01_00"
 #0 [ffff91a1b3e4f8f8] __schedule at ffffffffa2f86d07
 #1 [ffff91a1b3e4f980] schedule at ffffffffa2f87229
 #2 [ffff91a1b3e4f990] rwsem_down_write_failed at ffffffffa2f88965
 #3 [ffff91a1b3e4fa28] call_rwsem_down_write_failed at ffffffffa2b97767
 #4 [ffff91a1b3e4fa70] down_write at ffffffffa2f8655d
 #5 [ffff91a1b3e4fa88] osd_write_lock at ffffffffc150ae8c [osd_ldiskfs]
 #6 [ffff91a1b3e4fab0] out_tx_xattr_set_exec at ffffffffc10f3a8b [ptlrpc]
 #7 [ffff91a1b3e4fb50] out_tx_end at ffffffffc10ec741 [ptlrpc]
 #8 [ffff91a1b3e4fb90] out_handle at ffffffffc10f0962 [ptlrpc]
 #9 [ffff91a1b3e4fcd0] tgt_request_handle at ffffffffc10e789a [ptlrpc]
#10 [ffff91a1b3e4fd58] ptlrpc_server_handle_request at ffffffffc108c73b [ptlrpc]
#11 [ffff91a1b3e4fdf8] ptlrpc_main at ffffffffc10900a4 [ptlrpc]
#12 [ffff91a1b3e4fec8] kthread at ffffffffa28c5c21
#13 [ffff91a1b3e4ff50] ret_from_fork_nospec_begin at ffffffffa2f94ddd
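For anyone looking at the attached backtraces, a rough triage can be done with something like the following (the filename is just a placeholder for the attachment):

  # count threads stuck waiting for journal credits vs. waiting on an object rwsem
  grep -c 'start_this_handle' foreach_bt.txt
  grep -c 'osd_write_lock' foreach_bt.txt
  # the transaction holder should be among the few threads that are neither idle nor
  # waiting above, e.g. the jbd2 commit thread or a thread that already started its handle
  grep -A16 'jbd2_journal_commit_transaction' foreach_bt.txt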
We had not seen this problem while running a lower number of parallel lfs_migrate processes, such as 8, which we had been doing for days without issue. On Friday, we ramped up the OST-to-OST data migration by increasing the number of parallel workers to 16 and finally to 24 concurrent lfs_migrate processes (each working on a different file list generated by lfs find + fpart). This last, more aggressive configuration ran for a few hours before we hit this deadlock on one OSS, oak-io6-s2.
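For reference, each round of the migration looks roughly like the sketch below; the paths, OST index, and worker count are illustrative, not the exact commands we ran:

  # build a file list for the OST being drained, split it, and run one lfs_migrate per chunk
  lfs find /oak --ost 42 --type f > /tmp/ost0042.files
  fpart -n 24 -i /tmp/ost0042.files -o /tmp/chunk
  for f in /tmp/chunk.*; do
      lfs_migrate < "$f" &
  done
  wait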
lfs_migrate, which uses lfs migrate underneath, performs the data transfer into a volatile file as root, and once the transfer is done it seems to issue an explicit chown which, if I am not mistaken, propagates to the OSTs as ofd_attr_set. We can see many of those in the crash dump.
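A quick count against the attached backtraces gives an idea of how many (filename is again a placeholder):

  # number of OST service threads currently handling a setattr (chown) request
  grep -c 'ofd_setattr_hdl' foreach_bt.txt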
I'm attaching the output of "foreach bt" from the vmcore we took, and I have just uploaded the vmcore itself to the Whamcloud FTP as vmcore-oak-io6-s2_2022-02-25-18-13-45 (md5: a850b3d9af0ac70732023b6e756ab188).