
High OSS load due to possible deadlock w/ ofd_create_hdl and ofd_quotactl backtraces

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.3
    • Labels: None
    • Environment: lustre-2.12.3_4_g142b4d4-1.el7.x86_64, kernel 3.10.0-957.27.2.el7_lustre.pl2.x86_64, MOFED 4.7
    • Severity: 3

    Description

      We found an OSS on Fir (2.12.3_4) heavily loaded and with hung threads this morning; it started during the night. We took a crash dump, which I uploaded to the FTP as vmcore_fir-io8-s1_2019-12-12-07-42-58, along with the kernel debuginfo packages (kernel-debuginfo-3.10.0-957.27.2.el7_lustre.pl2.x86_64.rpm and kernel-debuginfo-common-x86_64-3.10.0-957.27.2.el7_lustre.pl2.x86_64.rpm).
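
      For reference, the hung threads in the dump can be walked with the crash utility against the matching debuginfo; a minimal sketch (the vmlinux location is the usual debuginfo path, the PID comes from the console log below):

      # open the dump with the debug vmlinux matching the running kernel
      crash /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux \
            vmcore_fir-io8-s1_2019-12-12-07-42-58

      # inside crash: list uninterruptible (hung) tasks and dump their stacks
      crash> ps -m | grep UN
      crash> foreach UN bt
      crash> bt 33911      # the ll_ost03_106 thread from the console log below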

      The last messages on the console before the crash dump were:

      [264018.187097] LNet: Service thread pid 33911 was inactive for 1200.94s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [264018.204204] LNet: Skipped 1 previous similar message
      [264018.209269] Pid: 33911, comm: ll_ost03_106 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
      [264018.219804] Call Trace:
      [264018.222354]  [<ffffffffc155deb3>] ofd_create_hdl+0xcb3/0x20e0 [ofd]
      [264018.228754]  [<ffffffffc0f5036a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [264018.235831]  [<ffffffffc0ef724b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [264018.243637]  [<ffffffffc0efabac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      [264018.250067]  [<ffffffffa7ac2e81>] kthread+0xd1/0xe0
      [264018.255066]  [<ffffffffa8177c24>] ret_from_fork_nospec_begin+0xe/0x21
      [264018.261642]  [<ffffffffffffffff>] 0xffffffffffffffff
      [264018.266754] LustreError: dumping log to /tmp/lustre-log.1576165038.33911
      [264126.221225] LNet: Service thread pid 34617 was inactive for 200.36s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [264126.238268] Pid: 34617, comm: ll_ost03_111 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
      [264126.248785] Call Trace:
      [264126.251360]  [<ffffffffc02c3085>] wait_transaction_locked+0x85/0xd0 [jbd2]
      [264126.258376]  [<ffffffffc02c3368>] add_transaction_credits+0x268/0x2f0 [jbd2]
      [264126.265544]  [<ffffffffc02c35e1>] start_this_handle+0x1a1/0x430 [jbd2]
      [264126.272211]  [<ffffffffc02c3a93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
      [264126.278951]  [<ffffffffc131b309>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
      [264126.286491]  [<ffffffffc1312ad3>] ldiskfs_acquire_dquot+0x53/0xb0 [ldiskfs]
      [264126.293573]  [<ffffffffa7cb026a>] dqget+0x3fa/0x450
      [264126.298588]  [<ffffffffa7cb1074>] dquot_get_dqblk+0x14/0x1f0
      [264126.304371]  [<ffffffffc143c6d5>] osd_acct_index_lookup+0x235/0x480 [osd_ldiskfs]
      [264126.311997]  [<ffffffffc13a380d>] lquotactl_slv+0x27d/0x9d0 [lquota]
      [264126.318488]  [<ffffffffc1556a3c>] ofd_quotactl+0x13c/0x380 [ofd]
      [264126.324630]  [<ffffffffc0f5036a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [264126.331681]  [<ffffffffc0ef724b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [264126.339496]  [<ffffffffc0efabac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      [264126.345912]  [<ffffffffa7ac2e81>] kthread+0xd1/0xe0
      [264126.350926]  [<ffffffffa8177c24>] ret_from_fork_nospec_begin+0xe/0x21
      [264126.357490]  [<ffffffffffffffff>] 0xffffffffffffffff
      [264126.362604] LustreError: dumping log to /tmp/lustre-log.1576165146.34617
      [264136.461417] Pid: 34429, comm: ll_ost03_110 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
      [264136.471940] Call Trace:
      [264136.474507]  [<ffffffffc02c3085>] wait_transaction_locked+0x85/0xd0 [jbd2]
      [264136.481514]  [<ffffffffc02c3368>] add_transaction_credits+0x268/0x2f0 [jbd2]
      [264136.488699]  [<ffffffffc02c35e1>] start_this_handle+0x1a1/0x430 [jbd2]
      [264136.495351]  [<ffffffffc02c3a93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
      [264136.502103]  [<ffffffffc131b309>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
      [264136.509629]  [<ffffffffc1312ad3>] ldiskfs_acquire_dquot+0x53/0xb0 [ldiskfs]
      [264136.516725]  [<ffffffffa7cb026a>] dqget+0x3fa/0x450
      [264136.521728]  [<ffffffffa7cb1074>] dquot_get_dqblk+0x14/0x1f0
      [264136.527514]  [<ffffffffc143c6d5>] osd_acct_index_lookup+0x235/0x480 [osd_ldiskfs]
      [264136.535127]  [<ffffffffc13a380d>] lquotactl_slv+0x27d/0x9d0 [lquota]
      [264136.541618]  [<ffffffffc1556a3c>] ofd_quotactl+0x13c/0x380 [ofd]
      [264136.547746]  [<ffffffffc0f5036a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [264136.554806]  [<ffffffffc0ef724b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [264136.562610]  [<ffffffffc0efabac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      [264136.569039]  [<ffffffffa7ac2e81>] kthread+0xd1/0xe0
      [264136.574042]  [<ffffffffa8177c24>] ret_from_fork_nospec_begin+0xe/0x21
      [264136.580632]  [<ffffffffffffffff>] 0xffffffffffffffff
      [264136.585737] LustreError: dumping log to /tmp/lustre-log.1576165156.34429
      [264349.590511] SysRq : Trigger a crash
      

      Attaching:

      We restarted the OSS and a few OSTs couldn't complete their recovery (stuck at 0s) but a manual abort_recovery did work.
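
      For the record, a sketch of the kind of commands involved in checking and aborting recovery (the OST name is illustrative):

      # watch recovery progress for the OSTs on this OSS
      lctl get_param obdfilter.*.recovery_status

      # abort recovery on a target stuck at 0s (device name is an example)
      lctl --device fir-OST0020 abort_recovery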

      Thanks,
      Stephane

      Attachments

        Activity

          [LU-13072] High OSS load due to possible deadlock w/ ofd_create_hdl and ofd_quotactl backtraces
          ys Yang Sheng added a comment -

          Please reopen it if this is hit again.


          sthiell Stephane Thiell added a comment -

          Hi YangSheng,

          Thanks! Next time I'll try to collect disk I/O info; that's a great idea indeed. But so far, this OST has been running fine since this event.

          Best,

          Stephane

          ys Yang Sheng added a comment -

          Hi, Stephane,

          I have investigated all of the threads carefully and cannot find any clue of a deadlock. I suggest you collect disk I/O info (e.g. with iostat) when this issue is hit again, then compare it with normal data (collected without the issue) to determine whether it is a storage problem.
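
          For example, something along these lines could be captured both while the problem is occurring and during normal operation (the output file name is illustrative):

          # extended per-device statistics, 10-second samples with timestamps
          # compare await/%util for the OST's md device against a healthy period
          iostat -x -t -m 10 > iostat_during_issue.log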

          Thanks,
          YangSheng


          sthiell Stephane Thiell added a comment -

          Hi YangSheng,

          Thank you very much for your time on this, I really appreciate it. After the reboot, all OSTs came back online and everything was fine on all targets, even on /dev/md2. I wonder if this could be some kind of mdraid deadlock, or perhaps a very transient slowdown of a raid6 array.

          Stephane

          ys Yang Sheng added a comment -

          Hi, Stephane,

          After investigating, I think this issue is likely related to a storage problem. All of the ofd_create_hdl threads were blocked in wait_transaction_locked, and the ofd_quotactl threads were blocked in wait_transaction_locked as well. They are all waiting to start a new transaction. The running transaction was blocked by:

          PID: 67875  TASK: ffff891241b65140  CPU: 9   COMMAND: "ll_ost_io01_019"
           #0 [ffff891241e2f8e8] __schedule at ffffffffa816aa72
           #1 [ffff891241e2f978] schedule at ffffffffa816af19
           #2 [ffff891241e2f988] osd_trans_stop at ffffffffc140fe75 [osd_ldiskfs]
           #3 [ffff891241e2fa40] ofd_trans_stop at ffffffffc1569c75 [ofd]
           #4 [ffff891241e2fa50] ofd_commitrw_write at ffffffffc1570c34 [ofd]
           #5 [ffff891241e2fad8] ofd_commitrw at ffffffffc157510c [ofd]
           #6 [ffff891241e2fb58] tgt_brw_write at ffffffffc0f5460b [ptlrpc]
           #7 [ffff891241e2fcd0] tgt_request_handle at ffffffffc0f5036a [ptlrpc]
           #8 [ffff891241e2fd58] ptlrpc_server_handle_request at ffffffffc0ef724b [ptlrpc]
           #9 [ffff891241e2fdf8] ptlrpc_main at ffffffffc0efabac [ptlrpc]
          #10 [ffff891241e2fec8] kthread at ffffffffa7ac2e81
          #11 [ffff891241e2ff50] ret_from_fork_nospec_begin at ffffffffa8177c24
          

          That thread was waiting for some data to be written out. It is in turn waiting on a few threads like:

          PID: 67874  TASK: ffff891241b64100  CPU: 42  COMMAND: "ll_ost_io02_019"
           #0 [ffff891241e2b4a8] __schedule at ffffffffa816aa72
           #1 [ffff891241e2b538] schedule at ffffffffa816af19
           #2 [ffff891241e2b548] bitmap_startwrite at ffffffffa7f9cc65
           #3 [ffff891241e2b5c8] add_stripe_bio at ffffffffc1383471 [raid456]
           #4 [ffff891241e2b628] raid5_make_request at ffffffffc1388db4 [raid456]
           #5 [ffff891241e2b710] md_handle_request at ffffffffa7f8fa30
           #6 [ffff891241e2b780] md_make_request at ffffffffa7f8fb99
           #7 [ffff891241e2b7a8] generic_make_request at ffffffffa7d46e37
           #8 [ffff891241e2b800] submit_bio at ffffffffa7d470e0
           #9 [ffff891241e2b858] osd_submit_bio at ffffffffc14255ac [osd_ldiskfs]
          #10 [ffff891241e2b868] osd_do_bio at ffffffffc1427925 [osd_ldiskfs]
          #11 [ffff891241e2b9d0] osd_write_commit at ffffffffc142820c [osd_ldiskfs]
          #12 [ffff891241e2ba50] ofd_commitrw_write at ffffffffc157121e [ofd]
          #13 [ffff891241e2bad8] ofd_commitrw at ffffffffc157510c [ofd]
          #14 [ffff891241e2bb58] tgt_brw_write at ffffffffc0f5460b [ptlrpc]
          #15 [ffff891241e2bcd0] tgt_request_handle at ffffffffc0f5036a [ptlrpc]
          #16 [ffff891241e2bd58] ptlrpc_server_handle_request at ffffffffc0ef724b [ptlrpc]
          #17 [ffff891241e2bdf8] ptlrpc_main at ffffffffc0efabac [ptlrpc]
          #18 [ffff891241e2bec8] kthread at ffffffffa7ac2e81
          #19 [ffff891241e2bf50] ret_from_fork_nospec_begin at ffffffffa8177c24
          

          They were running on /dev/md2, which looks like a software raid6 device. I suggest you investigate the write performance of this device to check whether it is slower than the others.
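
          A minimal sketch of how /dev/md2 could be compared against the other arrays on the same OSS (standard md/sysstat tools, nothing Lustre-specific):

          # raid geometry, member health, and any rebuild/resync activity
          cat /proc/mdstat
          mdadm --detail /dev/md2

          # compare write latency and utilization across arrays (w_await, %util columns)
          iostat -x -m 5

          # raid5/6 stripe cache size also affects write throughput
          cat /sys/block/md2/md/stripe_cache_size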

          Thanks,
          YangSheng

          green Oleg Drokin added a comment -
          PID: 66168  TASK: ffff8912f5c68000  CPU: 0   COMMAND: "jbd2/md2-8"
           #0 [ffff8912f34cfc08] __schedule at ffffffffa816aa72
           #1 [ffff8912f34cfc98] schedule at ffffffffa816af19
           #2 [ffff8912f34cfca8] jbd2_journal_commit_transaction at ffffffffc02c62bc [jbd2]
           #3 [ffff8912f34cfe48] kjournald2 at ffffffffc02cce89 [jbd2]
           #4 [ffff8912f34cfec8] kthread at ffffffffa7ac2e81
           #5 [ffff8912f34cff50] ret_from_fork_nospec_begin at ffffffffa8177c24
          

          So jbd2 is committing a transaction and it is taking a long time, so nobody else can start a new transaction for a long time. I also see many other threads waiting on IO, including debugfs (which I presume reads from the same device?), but also many OST IO threads. If there is a lot of IO contention, it is somewhat understandable that IO is slow, including slow transaction commits, which impacts things like starting new transactions.
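
          If this recurs, the journal's own statistics give a quick read on how long commits are taking; a small sketch (the jbd2 instance name is assumed from the kjournald2 thread name above):

          # per-journal statistics, including average transaction commit time
          cat /proc/fs/jbd2/md2-8/info

          # in the crash session, the same journal thread's stack can be re-checked with
          crash> bt 66168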

          pjones Peter Jones added a comment -

          Yang Sheng

          Could you please advise?

          Thanks

          Peter


          People

            Assignee: ys Yang Sheng
            Reporter: sthiell Stephane Thiell
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved: