[LU-11465] OSS/MDS deadlock in 2.10.5 Created: 03/Oct/18  Updated: 12/Aug/22  Resolved: 12/Aug/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.4, Lustre 2.10.5, Lustre 2.10.6
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Marek Magrys Assignee: Alex Zhuravlev
Resolution: Incomplete Votes: 3
Labels: None
Environment:

CentOS 7, 3.10.0-862.2.3.el7_lustre.x86_64, 1 MDS (+1 HA pair), 4 OSS


Attachments: File cyf_dec10_client_stacks.log     File cyf_dec10_server_stacks.log     Zip Archive logs.zip    
Issue Links:
Duplicate
Related
is related to LU-10048 osd-ldiskfs to truncate outside of ma... Resolved
is related to LU-5152 Can't enforce block quota when unpriv... Resolved
is related to LU-11613 MDS and OSS locked up wait_transactio... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After upgrading the Lustre servers from CentOS 6 and the IEEL/DDN build of Lustre 2.5 to CentOS 7 and Lustre 2.10.5, we are experiencing stability issues related to an MDS/OSS deadlock, most likely caused by a hung OST thread. The issue is visible on only one of our two filesystems, the smaller one; both run on an identical HW and SW stack, so the problem is most likely tied to a specific workload. This is mostly a mid-term storage filesystem, so more metadata operations happen on it than on the scratch one. We initially thought the issue might be caused by our attempt to enable project quotas (tune2fs -O project on all targets), so we turned the feature off again (tune2fs -O ^project), which didn't change anything. All combinations of MDT/OST quota enforcement (initially 'g', then 'gp', now 'none') made no difference either, so it is probably unrelated. As possibly relevant background, we run with sync_journal=1 as a workaround for a client memory-accounting problem, suggested by DDN some time ago. The crash happens at least once every few hours, sometimes sequentially one after another, on different OSS nodes. I also tagged 2.10.6 as affected, as we have tried everything from 2.10.4 up to the current b2_10.
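(For reference, the feature and quota toggles described above correspond roughly to the commands below. This is only an illustrative sketch: the device path and filesystem name are placeholders, not values taken from this ticket.)

# enable, and later disable again, the ldiskfs project-quota feature on a target device
tune2fs -O project /dev/mapper/ost0
tune2fs -O ^project /dev/mapper/ost0

# switch quota enforcement on MDTs/OSTs (run on the MGS node)
lctl conf_param testfs.quota.mdt=g
lctl conf_param testfs.quota.ost=gp
lctl conf_param testfs.quota.ost=none

# the sync_journal workaround mentioned above, applied on the OSS nodes
lctl set_param obdfilter.*.sync_journal=1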

On OSS we get:

 

[Tue Oct  2 11:56:44 2018] LNet: Service thread pid 11401 was inactive for 200.27s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[Tue Oct  2 11:56:44 2018] Pid: 11401, comm: ll_ost_out00_00 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct  2 11:56:44 2018] Call Trace:
[Tue Oct  2 11:56:44 2018]  [<ffffffff8395aa77>] call_rwsem_down_write_failed+0x17/0x30
[Tue Oct  2 11:56:44 2018]  [<ffffffffc0eeaa8c>] osd_write_lock+0x5c/0xe0 [osd_ldiskfs]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc0b5b119>] out_tx_attr_set_exec+0x69/0x3f0 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc0b55591>] out_tx_end+0xe1/0x5c0 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc0b596d3>] out_handle+0x1453/0x1bc0 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc0b4f38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc0af7e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc0afb592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [<ffffffff836bae31>] kthread+0xd1/0xe0
[Tue Oct  2 11:56:44 2018]  [<ffffffff83d1f5f7>] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct  2 11:56:44 2018]  [<ffffffffffffffff>] 0xffffffffffffffff
[Tue Oct  2 11:56:44 2018] LustreError: dumping log to /tmp/lustre-log.1538474204.11401
[Tue Oct  2 11:56:44 2018] Pid: 35664, comm: ll_ost_io00_098 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct  2 11:56:44 2018] Call Trace:
[Tue Oct  2 11:56:44 2018]  [<ffffffffc0e7e495>] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc0e76533>] jbd2_journal_stop+0x343/0x3d0 [jbd2]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc112ab4c>] __ldiskfs_journal_stop+0x3c/0xb0 [ldiskfs]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc0efb783>] osd_trans_stop+0x183/0x850 [osd_ldiskfs]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc107d352>] ofd_trans_stop+0x22/0x60 [ofd]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc10835f4>] ofd_commitrw_write+0x7e4/0x1c90 [ofd]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc10877a9>] ofd_commitrw+0x4c9/0xae0 [ofd]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc0b80864>] obd_commitrw+0x2f3/0x336 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc0b5338d>] tgt_brw_write+0xffd/0x17d0 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc0b4f38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc0af7e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc0afb592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [<ffffffff836bae31>] kthread+0xd1/0xe0
[Tue Oct  2 11:56:45 2018]  [<ffffffff83d1f5f7>] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct  2 11:56:45 2018]  [<ffffffffffffffff>] 0xffffffffffffffff
[Tue Oct  2 11:56:45 2018] Pid: 35678, comm: ll_ost_io00_112 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct  2 11:56:45 2018] Call Trace:
[Tue Oct  2 11:56:45 2018]  [<ffffffffc0e7e495>] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
[Tue Oct  2 11:56:45 2018]  [<ffffffffc0e76533>] jbd2_journal_stop+0x343/0x3d0 [jbd2]
[Tue Oct  2 11:56:45 2018]  [<ffffffffc112ab4c>] __ldiskfs_journal_stop+0x3c/0xb0 [ldiskfs]
[Tue Oct  2 11:56:45 2018]  [<ffffffffc0efb783>] osd_trans_stop+0x183/0x850 [osd_ldiskfs]
[Tue Oct  2 11:56:45 2018]  [<ffffffffc107d352>] ofd_trans_stop+0x22/0x60 [ofd]
[Tue Oct  2 11:56:45 2018]  [<ffffffffc10835f4>] ofd_commitrw_write+0x7e4/0x1c90 [ofd]
[Tue Oct  2 11:56:45 2018]  [<ffffffffc10877a9>] ofd_commitrw+0x4c9/0xae0 [ofd]
[Tue Oct  2 11:56:45 2018]  [<ffffffffc0b80864>] obd_commitrw+0x2f3/0x336 [ptlrpc]
[Tue Oct  2 11:56:45 2018]  [<ffffffffc0b5338d>] tgt_brw_write+0xffd/0x17d0 [ptlrpc]
[Tue Oct  2 11:56:45 2018]  [<ffffffffc0b4f38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct  2 11:56:45 2018]  [<ffffffffc0af7e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct  2 11:56:45 2018]  [<ffffffffc0afb592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct  2 11:56:45 2018]  [<ffffffff836bae31>] kthread+0xd1/0xe0
[Tue Oct  2 11:56:45 2018]  [<ffffffff83d1f5f7>] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct  2 11:56:45 2018]  [<ffffffffffffffff>] 0xffffffffffffffff

 

 

On MDS:

[Tue Oct  2 11:56:44 2018] LNet: Service thread pid 69588 was inactive for 200.49s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[Tue Oct  2 11:56:44 2018] Pid: 69588, comm: mdt00_095 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct  2 11:56:44 2018] Call Trace:
[Tue Oct  2 11:56:44 2018]  [<ffffffffc0f7e140>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc0f7e61d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc14de463>] osp_remote_sync+0xd3/0x200 [osp]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc14c6dd0>] osp_attr_set+0x4c0/0x5d0 [osp]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc1421c6b>] lod_sub_attr_set+0x1cb/0x460 [lod]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc1403be6>] lod_obj_stripe_attr_set_cb+0x16/0x30 [lod]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc140fa96>] lod_obj_for_each_stripe+0xb6/0x230 [lod]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc1411043>] lod_attr_set+0x2f3/0x9a0 [lod]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc14810a0>] mdd_attr_set_internal+0x120/0x2a0 [mdd]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc1481e8d>] mdd_attr_set+0x8ad/0xce0 [mdd]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc13645f5>] mdt_reint_setattr+0xba5/0x1060 [mdt]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc1364b33>] mdt_reint_rec+0x83/0x210 [mdt]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc134636b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc1351f07>] mdt_reint+0x67/0x140 [mdt]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc0fee38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc0f96e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [<ffffffffc0f9a592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [<ffffffff946bae31>] kthread+0xd1/0xe0
[Tue Oct  2 11:56:44 2018]  [<ffffffff94d1f5f7>] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct  2 11:56:44 2018]  [<ffffffffffffffff>] 0xffffffffffffffff
[Tue Oct  2 11:56:44 2018] LustreError: dumping log to /tmp/lustre-log.1538474204.69588
[Tue Oct  2 11:56:49 2018] LNet: Service thread pid 7634 was inactive for 200.51s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[Tue Oct  2 11:56:49 2018] Pid: 7634, comm: mdt_rdpg00_000 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct  2 11:56:49 2018] Call Trace:
[Tue Oct  2 11:56:49 2018]  [<ffffffffc06ef085>] wait_transaction_locked+0x85/0xd0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc06ef368>] add_transaction_credits+0x268/0x2f0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc06ef5e1>] start_this_handle+0x1a1/0x430 [jbd2]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc06efa93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc118aa99>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc12cac8e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc1002512>] top_trans_start+0x702/0x940 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc13ec3f1>] lod_trans_start+0x31/0x40 [lod]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc148c1ba>] mdd_trans_start+0x1a/0x20 [mdd]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc1481b69>] mdd_attr_set+0x589/0xce0 [mdd]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc136c1d6>] mdt_mfd_close+0x1a6/0x610 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc1371951>] mdt_close_internal+0x121/0x220 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc1371c70>] mdt_close+0x220/0x780 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc0fee38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc0f96e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc0f9a592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffff946bae31>] kthread+0xd1/0xe0
[Tue Oct  2 11:56:49 2018]  [<ffffffff94d1f5f7>] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct  2 11:56:49 2018]  [<ffffffffffffffff>] 0xffffffffffffffff
[Tue Oct  2 11:56:49 2018] LustreError: dumping log to /tmp/lustre-log.1538474210.7634
[Tue Oct  2 11:56:49 2018] Pid: 69508, comm: mdt00_067 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct  2 11:56:49 2018] Call Trace:
[Tue Oct  2 11:56:49 2018]  [<ffffffffc06ef085>] wait_transaction_locked+0x85/0xd0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc06ef368>] add_transaction_credits+0x268/0x2f0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc06ef5e1>] start_this_handle+0x1a1/0x430 [jbd2]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc06efa93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc118aa99>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc12cac8e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc1369de7>] mdt_empty_transno+0xf7/0x840 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc136cf1e>] mdt_mfd_open+0x8de/0xe70 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc136da2b>] mdt_finish_open+0x57b/0x690 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc136f308>] mdt_reint_open+0x17c8/0x3190 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc1364b33>] mdt_reint_rec+0x83/0x210 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc134636b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc1346892>] mdt_intent_reint+0x162/0x430 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc1351671>] mdt_intent_policy+0x441/0xc70 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc0f3b2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc0f64b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc0fea452>] tgt_enqueue+0x62/0x210 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc0fee38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc0f96e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc0f9a592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffff946bae31>] kthread+0xd1/0xe0
[Tue Oct  2 11:56:49 2018]  [<ffffffff94d1f5f7>] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct  2 11:56:49 2018]  [<ffffffffffffffff>] 0xffffffffffffffff
[Tue Oct  2 11:56:49 2018] Pid: 69607, comm: mdt00_105 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct  2 11:56:49 2018] Call Trace:
[Tue Oct  2 11:56:49 2018]  [<ffffffffc06ef085>] wait_transaction_locked+0x85/0xd0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc06ef368>] add_transaction_credits+0x268/0x2f0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc06ef5e1>] start_this_handle+0x1a1/0x430 [jbd2]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc06efa93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc118aa99>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc12cac8e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc1369de7>] mdt_empty_transno+0xf7/0x840 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc136cf1e>] mdt_mfd_open+0x8de/0xe70 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc136da2b>] mdt_finish_open+0x57b/0x690 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc136f308>] mdt_reint_open+0x17c8/0x3190 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc1364b33>] mdt_reint_rec+0x83/0x210 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc134636b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc1346892>] mdt_intent_reint+0x162/0x430 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc1351671>] mdt_intent_policy+0x441/0xc70 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc0f3b2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc0f64b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc0fea452>] tgt_enqueue+0x62/0x210 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc0fee38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc0f96e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc0f9a592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffff946bae31>] kthread+0xd1/0xe0
[Tue Oct  2 11:56:49 2018]  [<ffffffff94d1f5f7>] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct  2 11:56:49 2018]  [<ffffffffffffffff>] 0xffffffffffffffff
[Tue Oct  2 11:56:49 2018] Pid: 69594, comm: mdt00_098 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct  2 11:56:49 2018] Call Trace:
[Tue Oct  2 11:56:49 2018]  [<ffffffffc06ef085>] wait_transaction_locked+0x85/0xd0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc06ef368>] add_transaction_credits+0x268/0x2f0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc06ef5e1>] start_this_handle+0x1a1/0x430 [jbd2]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc06efa93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc118aa99>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc12cac8e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc1002512>] top_trans_start+0x702/0x940 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc13ec3f1>] lod_trans_start+0x31/0x40 [lod]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc148c1ba>] mdd_trans_start+0x1a/0x20 [mdd]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc14763c0>] mdd_create+0xbe0/0x1400 [mdd]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc136fcb5>] mdt_reint_open+0x2175/0x3190 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc1364b33>] mdt_reint_rec+0x83/0x210 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc134636b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc1346892>] mdt_intent_reint+0x162/0x430 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc1351671>] mdt_intent_policy+0x441/0xc70 [mdt]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc0f3b2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc0f64b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc0fea452>] tgt_enqueue+0x62/0x210 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc0fee38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc0f96e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffffc0f9a592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [<ffffffff946bae31>] kthread+0xd1/0xe0
[Tue Oct  2 11:56:49 2018]  [<ffffffff94d1f5f7>] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct  2 11:56:49 2018]  [<ffffffffffffffff>] 0xffffffffffffffff

Please have a look at the above; the full dmesg and Lustre log files are attached as an archive.

 



 Comments   
Comment by Alex Zhuravlev [ 05/Oct/18 ]

looks like a LU-10048 duplicate

Comment by Marek Magrys [ 05/Oct/18 ]

Might be, but the mentioned LU was minor priority, with the fix going only to master (2.12). In our case we just hit the same problem on our main filesystem, so I think it would be good to backport the fix to 2.10, as the 2.10.6 release is around the corner.

Comment by Lukasz Flis [ 10/Oct/18 ]

Is there a patch for 2.10.5 available somewhere?

Comment by Marion Hakanson (Inactive) [ 18/Oct/18 ]

We seem to be having the same deadlock here. Sometimes rebooting the MDS will clear it, sometimes not. All Lustre servers are at 2.10.5, CentOS-7.5, upgraded from IEEL-2.3/CentOS-6. Clients are 2.10.2, CentOS-7.4.

Oh, our OSTs are all ZFS, with the MDT being ldiskfs. The symptoms we see are hundreds of threads on the MDS in "D" state, and all clients hang. Similar log entries to the other poster's.

Suggestions on how to recover from or avoid this issue would be very welcome.

 

Comment by Lukasz Flis [ 18/Oct/18 ]

Marion, do you have any idea which application causes this problem in your environment?
We @CYFRONET are still trying to isolate the problematic type of workload in order to build a simple reproducer out of it.

If you can share what applications you are running, maybe we'll find a common factor.

Thanks

Comment by Marion Hakanson (Inactive) [ 18/Oct/18 ]

No, we've not identified a triggering workload. We have a mixture of genomics analysis (many tiny jobs), MRI analysis, and Relion (CryoEM) MPI, but so far no correlation with when the issue occurs.

We upgraded on 09-October, ran for about a week without anything unusual.  We did enable jobstats on Friday the 12th, and the first incident happened Sunday, with another on Tuesday (both of those cleared by rebooting just the MDS).

Yesterday and today, we've had multiple instances, and rebooting the MDS/OSSs has not cleared things.  Immediately after rebooting, the MDS gets on the order of one stuck thread for each client (250+), and I/O hangs on clients.  We've been unable to clear the hangs except by rebooting all clients and servers, so our cluster has been unusable in that time.

We disabled jobstats this morning, just prior to rebooting everything again, and are waiting to see if that has helped.

Again, we need help, both with a patch, and with suggestions on an easier way to recover from this.

Comment by Marion Hakanson (Inactive) [ 19/Oct/18 ]

We are still seeing the problem occur after disabling jobstats.

 

Comment by Marion Hakanson (Inactive) [ 06/Nov/18 ]

For Whamcloud/DDN folks, we also had a case (#111147) with DDN engineering to assist with diagnosing this issue at OHSU.  You should be able to see the details here:

https://community.ddn.com/50038000012Pme2

 

Comment by Alex Zhuravlev [ 08/Nov/18 ]

I'm able to reproduce the issue, and a couple of patches under LU-10048 seem to solve it. They are in testing now.

Comment by Lukasz Flis [ 08/Nov/18 ]

Alex, we can test Lustre with and without the patch on our testing instance tomorrow if you can share the reproducer code.

Comment by Andreas Dilger [ 16/Nov/18 ]

I think that the LU-5152 patch may be the root cause of this problem. It sends a sync during chgrp to the OSTs, but I suspect that this induces a deadlock when the OSTs are trying to get new quota from the master.

I'm thinking that we should just revert the LU-5152 patch for now, since it is also causing performance issues. In the meantime, it means that quota would no longer properly handle file group ownership changes if group quota is enabled.

Comment by Andreas Dilger [ 16/Nov/18 ]

Hongchao, in the LU-5152 patch it may be OK to remove the dt_sync() call and instead ensure in the OSP code that the chgrp is properly ordered with previous setattr requests. That could avoid the deadlock here, as well as the performance problems reported in LU-11303. Also, there is already a mechanism to force sync permission changes via permission_needs_sync(), so we definitely should not be doing two syncs per setattr.

I've pushed https://review.whamcloud.com/33676 to revert the LU-5152 patch since I think it is causing as many problems as it solves.

Comment by Alex Zhuravlev [ 16/Nov/18 ]

Andreas, I think the deadlock wasn't caused by dt_sync() but rather by the sync setattr RPC to the OST (which is there to verify the group has enough quota).

Comment by Andreas Dilger [ 16/Nov/18 ]

In either case, reverting the LU-5152 patch should avoid this problem. I'll note that this issue is marked as affecting 2.10.4, 2.10.5, and 2.10.6, and the LU-5152 patch landed in 2.10.4, so there is at least a strong correlation pointing to it as the root cause.

Comment by Andreas Dilger [ 20/Nov/18 ]

For 2.10.5 the https://review.whamcloud.com/33682 revert of LU-5152 has passed testing and is the simplest solution. For 2.12 or 2.13 we are looking into a proper solution that preserves chgrp behavior without impacting the performance or inducing deadlocks.

Comment by Marek Magrys [ 21/Nov/18 ]

OK, we could take the patches for a spin on one of our production systems. Should we pull both https://review.whamcloud.com/33682 (revert of LU-5152) and https://review.whamcloud.com/33586 ("LU-10048 osd: async truncate"), or would you recommend sticking with just 33682 for now?

The frequency of the issue has dropped a lot over the last weeks, so it will take some time before we can confirm that the problem is solved.

Comment by Alex Zhuravlev [ 21/Nov/18 ]

The revert of LU-5152 should be enough.
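(For anyone who wants to try this on a 2.10 tree, the revert can be pulled straight from Gerrit roughly as below. This is a sketch; <N> stands for the current patchset number of change 33682, which is not given in this ticket.)

# assumes an existing checkout of the Lustre tree on branch b2_10
git checkout b2_10
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/82/33682/<N>
git cherry-pick FETCH_HEAD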

Comment by Lukasz Flis [ 10/Dec/18 ]

Short update:

After moving to 2.10.6 RC2 with LU-5152 reverted, we still see hung threads on the OSSes:

 

 

[507798.983947] LNet: Service thread pid 17130 was inactive for 200.47s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
 [507799.001471] Pid: 17130, comm: ll_ost_io00_094 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
 [507799.012231] Call Trace:
 [507799.014947] [<ffffffffc0eac085>] wait_transaction_locked+0x85/0xd0 [jbd2]
 [507799.022084] [<ffffffffc0eac368>] add_transaction_credits+0x268/0x2f0 [jbd2]
 [507799.029399] [<ffffffffc0eac5e1>] start_this_handle+0x1a1/0x430 [jbd2]
 [507799.036186] [<ffffffffc0eaca93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
 [507799.043074] [<ffffffffc0f22009>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
 [507799.050909] [<ffffffffc0fecb1e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
 [507799.058151] [<ffffffffc11242ae>] ofd_trans_start+0x6e/0xf0 [ofd]
 [507799.064513] [<ffffffffc112a75b>] ofd_commitrw_write+0x94b/0x1c90 [ofd]
 [507799.071400] [<ffffffffc112e7a9>] ofd_commitrw+0x4c9/0xae0 [ofd]
 [507799.077676] [<ffffffffc0bb7824>] obd_commitrw+0x2f3/0x336 [ptlrpc]
 [507799.084277] [<ffffffffc0b8a38d>] tgt_brw_write+0xffd/0x17d0 [ptlrpc]
 [507799.091046] [<ffffffffc0b8638a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
 [507799.098231] [<ffffffffc0b2ee4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
 [507799.106355] [<ffffffffc0b32592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
 [507799.112928] [<ffffffffbc4bae31>] kthread+0xd1/0xe0
 [507799.118077] [<ffffffffbcb1f5f7>] ret_from_fork_nospec_end+0x0/0x39
 [507799.124610] [<ffffffffffffffff>] 0xffffffffffffffff
 

 

Comment by Alex Zhuravlev [ 10/Dec/18 ]

Have you updated the MDS? Can you provide all stack traces, please?

Comment by Lukasz Flis [ 10/Dec/18 ]

We have updated all servers, MDS + OSS, together.

Please find attached the stack traces from the servers, along with the corresponding stacks from the clients affected by the hangups:

cyf_dec10_client_stacks.log cyf_dec10_server_stacks.log
There were no stack traces on the MDS today. There were some in past days, so I can attach them as well if needed.

 

 

Comment by Alex Zhuravlev [ 10/Dec/18 ]

Hm, I see only writing threads on the OSS (ofd_commitrw_write() and around), but nobody else. Either the log is missing important trace(s) or it's something different. Is it possible to dump all threads (echo t >/proc/sysrq-trigger) and attach them to the ticket?
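(A full task dump as requested above can be captured roughly like this; the output path is just an example.)

# make sure the 't' sysrq command is allowed
echo 1 > /proc/sys/kernel/sysrq
# write the stack of every task to the kernel ring buffer, then save it
echo t > /proc/sysrq-trigger
dmesg > /tmp/$(hostname)-all-stacks.txt

Note that the ring buffer may be too small to hold all stacks on a busy server; capturing over a serial/IPMI console or booting with a larger log_buf_len avoids truncation.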

Comment by Marek Magrys [ 10/Dec/18 ]

We will bump the servers' kernel to the latest RHEL 7.6 kernel with the patches from 2.10.6; our kernels from 2.10.4 lacked some ext4 patches, which might be the root cause of the stack traces above.

Comment by Andrew Perepechko [ 26/Jun/19 ]

Has this bug really been fixed in master?

LU-5152 hasn't been reverted in master; instead, the MDS transaction became async.

However, this bug is not a deadlock between the MDS and OSS as the commit message claims. It's an OSS deadlock; the MDS just waits forever for a reply, with an open transaction that freezes other MDS journal users. The OSS deadlocks because it takes oo_sem and gets a transaction handle in a different order for common operations and OUT operations, e.g.:

crash> bt ffff88081329cf10
PID: 50457  TASK: ffff88081329cf10  CPU: 14  COMMAND: "ll_ost_out01_00"
 #0 [ffff88026aca3988] __schedule at ffffffff816b3de4
 #1 [ffff88026aca3a10] schedule at ffffffff816b4409
 #2 [ffff88026aca3a20] rwsem_down_write_failed at ffffffff816b5cf5
 #3 [ffff88026aca3ab8] call_rwsem_down_write_failed at ffffffff81338247
 #4 [ffff88026aca3b00] down_write at ffffffff816b356d
 #5 [ffff88026aca3b18] osd_write_lock at ffffffffc151eb0c [osd_ldiskfs]
 #6 [ffff88026aca3b40] out_tx_attr_set_exec at ffffffffc0ea3399 [ptlrpc]
 #7 [ffff88026aca3b78] out_tx_end at ffffffffc0e9d771 [ptlrpc]
 #8 [ffff88026aca3bb8] out_handle at ffffffffc0ea1952 [ptlrpc]
 #9 [ffff88026aca3cf8] tgt_request_handle at ffffffffc0e988ba [ptlrpc]
#10 [ffff88026aca3d40] ptlrpc_server_handle_request at ffffffffc0e3df13 [ptlrpc]
#11 [ffff88026aca3de0] ptlrpc_main at ffffffffc0e41862 [ptlrpc]
#12 [ffff88026aca3ec8] kthread at ffffffff810b4031
#13 [ffff88026aca3f50] ret_from_fork at ffffffff816c155d

crash> bt 0xffff88038d7eaf70
PID: 114059  TASK: ffff88038d7eaf70  CPU: 11  COMMAND: "ll_ost00_040"
 #0 [ffff8804b073b968] __schedule at ffffffff816b3de4
 #1 [ffff8804b073b9f8] schedule at ffffffff816b4409
 #2 [ffff8804b073ba08] wait_transaction_locked at ffffffffc0739085 [jbd2]
 #3 [ffff8804b073ba60] add_transaction_credits at ffffffffc0739368 [jbd2]
 #4 [ffff8804b073bac0] start_this_handle at ffffffffc07395e1 [jbd2]
 #5 [ffff8804b073bb58] jbd2__journal_start at ffffffffc0739a93 [jbd2]
 #6 [ffff8804b073bba0] __ldiskfs_journal_start_sb at ffffffffc1469e59 [ldiskfs]
 #7 [ffff8804b073bbe0] osd_trans_start at ffffffffc152a2ce [osd_ldiskfs]
 #8 [ffff8804b073bc18] ofd_trans_start at ffffffffc15b01ae [ofd]
 #9 [ffff8804b073bc48] ofd_attr_set at ffffffffc15b3173 [ofd]
#10 [ffff8804b073bca0] ofd_setattr_hdl at ffffffffc159e8ed [ofd]
#11 [ffff8804b073bcf8] tgt_request_handle at ffffffffc0e988ba [ptlrpc]
#12 [ffff8804b073bd40] ptlrpc_server_handle_request at ffffffffc0e3df13 [ptlrpc]
#13 [ffff8804b073bde0] ptlrpc_main at ffffffffc0e41862 [ptlrpc]
#14 [ffff8804b073bec8] kthread at ffffffff810b4031
#15 [ffff8804b073bf50] ret_from_fork at ffffffff816c155d
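(The two backtraces above are taken with the crash utility against a vmcore; for reference, output like this can be reproduced roughly as follows. The vmlinux and vmcore paths are typical kdump/debuginfo defaults, not values from this ticket.)

# open the dump with the matching debuginfo kernel
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/<dump-dir>/vmcore
crash> foreach bt            # backtraces of all tasks
crash> bt ffff88081329cf10   # backtrace of one task by task_struct address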
Comment by Andrew Perepechko [ 26/Jun/19 ]

looks like a LU-10048 duplicate

Ah, sorry, I missed this link when first reading the ticket. It really does seem to fix this issue:

https://review.whamcloud.com/#/c/31293/60/lustre/ofd/ofd_objects.c

Comment by Lukasz Flis [ 18/Sep/19 ]

@Andrew Perepechko
The patch you mentioned does not apply cleanly on b2_10. Are you considering backporting it to the 2.10 line?
In the past month we have had 10+ lockups on OSTs related to this issue.

The patch you mentioned is a subset of LU-11613; is there a plan to backport the whole set?
Can the change from https://review.whamcloud.com/#/c/31293/60/lustre/ofd/ofd_objects.c be safely used without the other changes from LU-11613?

Comment by Alex Zhuravlev [ 16/Apr/20 ]

luchuan_sugon, do you still need this ticket open?
