[LU-11465] OSS/MDS deadlock in 2.10.5 Created: 03/Oct/18 Updated: 12/Aug/22 Resolved: 12/Aug/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.4, Lustre 2.10.5, Lustre 2.10.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Marek Magrys | Assignee: | Alex Zhuravlev |
| Resolution: | Incomplete | Votes: | 3 |
| Labels: | None | ||
| Environment: |
CentOS 7, 3.10.0-862.2.3.el7_lustre.x86_64, 1 MDS (+1 HA pair), 4 OSS |
||
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
After upgrading the Lustre servers from CentOS 6 with the IEEL/DDN build of Lustre 2.5 to CentOS 7 with Lustre 2.10.5, we are experiencing stability issues related to an MDS/OSS deadlock, most likely caused by a hung OST thread. The issue is visible on only one of our two filesystems, the smaller one; both run on an identical hardware and software stack, so the problem is most likely tied to a specific workload. That filesystem is mostly mid-term storage, so it sees more metadata operations than the scratch filesystem. We initially suspected our attempt to enable project quotas (tune2fs -O project on all targets), so we turned the feature off again (tune2fs -O ^project), which changed nothing. Likewise, none of the MDT/OST quota enforcement combinations (initially 'g', then 'gp', now 'none') made any difference, so quota enforcement is probably unrelated. One possibly relevant detail: we run with sync_journal=1 as a workaround for a client memory-accounting problem suggested by DDN some time ago (a sketch of these settings follows the stack traces below). The crash happens at least once every few hours, sometimes one right after another, on different OSS nodes. I also tagged 2.10.6 as affected, since we have tried everything from 2.10.4 to the current b2_10. On the OSS we get:
[Tue Oct 2 11:56:44 2018] LNet: Service thread pid 11401 was inactive for 200.27s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[Tue Oct 2 11:56:44 2018] Pid: 11401, comm: ll_ost_out00_00 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct 2 11:56:44 2018] Call Trace:
[Tue Oct 2 11:56:44 2018] [<ffffffff8395aa77>] call_rwsem_down_write_failed+0x17/0x30
[Tue Oct 2 11:56:44 2018] [<ffffffffc0eeaa8c>] osd_write_lock+0x5c/0xe0 [osd_ldiskfs]
[Tue Oct 2 11:56:44 2018] [<ffffffffc0b5b119>] out_tx_attr_set_exec+0x69/0x3f0 [ptlrpc]
[Tue Oct 2 11:56:44 2018] [<ffffffffc0b55591>] out_tx_end+0xe1/0x5c0 [ptlrpc]
[Tue Oct 2 11:56:44 2018] [<ffffffffc0b596d3>] out_handle+0x1453/0x1bc0 [ptlrpc]
[Tue Oct 2 11:56:44 2018] [<ffffffffc0b4f38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct 2 11:56:44 2018] [<ffffffffc0af7e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct 2 11:56:44 2018] [<ffffffffc0afb592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct 2 11:56:44 2018] [<ffffffff836bae31>] kthread+0xd1/0xe0
[Tue Oct 2 11:56:44 2018] [<ffffffff83d1f5f7>] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct 2 11:56:44 2018] [<ffffffffffffffff>] 0xffffffffffffffff
[Tue Oct 2 11:56:44 2018] LustreError: dumping log to /tmp/lustre-log.1538474204.11401
[Tue Oct 2 11:56:44 2018] Pid: 35664, comm: ll_ost_io00_098 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct 2 11:56:44 2018] Call Trace:
[Tue Oct 2 11:56:44 2018] [<ffffffffc0e7e495>] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
[Tue Oct 2 11:56:44 2018] [<ffffffffc0e76533>] jbd2_journal_stop+0x343/0x3d0 [jbd2]
[Tue Oct 2 11:56:44 2018] [<ffffffffc112ab4c>] __ldiskfs_journal_stop+0x3c/0xb0 [ldiskfs]
[Tue Oct 2 11:56:44 2018] [<ffffffffc0efb783>] osd_trans_stop+0x183/0x850 [osd_ldiskfs]
[Tue Oct 2 11:56:44 2018] [<ffffffffc107d352>] ofd_trans_stop+0x22/0x60 [ofd]
[Tue Oct 2 11:56:44 2018] [<ffffffffc10835f4>] ofd_commitrw_write+0x7e4/0x1c90 [ofd]
[Tue Oct 2 11:56:44 2018] [<ffffffffc10877a9>] ofd_commitrw+0x4c9/0xae0 [ofd]
[Tue Oct 2 11:56:44 2018] [<ffffffffc0b80864>] obd_commitrw+0x2f3/0x336 [ptlrpc]
[Tue Oct 2 11:56:44 2018] [<ffffffffc0b5338d>] tgt_brw_write+0xffd/0x17d0 [ptlrpc]
[Tue Oct 2 11:56:44 2018] [<ffffffffc0b4f38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct 2 11:56:44 2018] [<ffffffffc0af7e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct 2 11:56:44 2018] [<ffffffffc0afb592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct 2 11:56:44 2018] [<ffffffff836bae31>] kthread+0xd1/0xe0
[Tue Oct 2 11:56:45 2018] [<ffffffff83d1f5f7>] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct 2 11:56:45 2018] [<ffffffffffffffff>] 0xffffffffffffffff
[Tue Oct 2 11:56:45 2018] Pid: 35678, comm: ll_ost_io00_112 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct 2 11:56:45 2018] Call Trace:
[Tue Oct 2 11:56:45 2018] [<ffffffffc0e7e495>] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
[Tue Oct 2 11:56:45 2018] [<ffffffffc0e76533>] jbd2_journal_stop+0x343/0x3d0 [jbd2]
[Tue Oct 2 11:56:45 2018] [<ffffffffc112ab4c>] __ldiskfs_journal_stop+0x3c/0xb0 [ldiskfs]
[Tue Oct 2 11:56:45 2018] [<ffffffffc0efb783>] osd_trans_stop+0x183/0x850 [osd_ldiskfs]
[Tue Oct 2 11:56:45 2018] [<ffffffffc107d352>] ofd_trans_stop+0x22/0x60 [ofd]
[Tue Oct 2 11:56:45 2018] [<ffffffffc10835f4>] ofd_commitrw_write+0x7e4/0x1c90 [ofd]
[Tue Oct 2 11:56:45 2018] [<ffffffffc10877a9>] ofd_commitrw+0x4c9/0xae0 [ofd]
[Tue Oct 2 11:56:45 2018] [<ffffffffc0b80864>] obd_commitrw+0x2f3/0x336 [ptlrpc]
[Tue Oct 2 11:56:45 2018] [<ffffffffc0b5338d>] tgt_brw_write+0xffd/0x17d0 [ptlrpc]
[Tue Oct 2 11:56:45 2018] [<ffffffffc0b4f38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct 2 11:56:45 2018] [<ffffffffc0af7e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct 2 11:56:45 2018] [<ffffffffc0afb592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct 2 11:56:45 2018] [<ffffffff836bae31>] kthread+0xd1/0xe0
[Tue Oct 2 11:56:45 2018] [<ffffffff83d1f5f7>] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct 2 11:56:45 2018] [<ffffffffffffffff>] 0xffffffffffffffff
On the MDS:
[Tue Oct 2 11:56:44 2018] LNet: Service thread pid 69588 was inactive for 200.49s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[Tue Oct 2 11:56:44 2018] Pid: 69588, comm: mdt00_095 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct 2 11:56:44 2018] Call Trace:
[Tue Oct 2 11:56:44 2018] [<ffffffffc0f7e140>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
[Tue Oct 2 11:56:44 2018] [<ffffffffc0f7e61d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
[Tue Oct 2 11:56:44 2018] [<ffffffffc14de463>] osp_remote_sync+0xd3/0x200 [osp]
[Tue Oct 2 11:56:44 2018] [<ffffffffc14c6dd0>] osp_attr_set+0x4c0/0x5d0 [osp]
[Tue Oct 2 11:56:44 2018] [<ffffffffc1421c6b>] lod_sub_attr_set+0x1cb/0x460 [lod]
[Tue Oct 2 11:56:44 2018] [<ffffffffc1403be6>] lod_obj_stripe_attr_set_cb+0x16/0x30 [lod]
[Tue Oct 2 11:56:44 2018] [<ffffffffc140fa96>] lod_obj_for_each_stripe+0xb6/0x230 [lod]
[Tue Oct 2 11:56:44 2018] [<ffffffffc1411043>] lod_attr_set+0x2f3/0x9a0 [lod]
[Tue Oct 2 11:56:44 2018] [<ffffffffc14810a0>] mdd_attr_set_internal+0x120/0x2a0 [mdd]
[Tue Oct 2 11:56:44 2018] [<ffffffffc1481e8d>] mdd_attr_set+0x8ad/0xce0 [mdd]
[Tue Oct 2 11:56:44 2018] [<ffffffffc13645f5>] mdt_reint_setattr+0xba5/0x1060 [mdt]
[Tue Oct 2 11:56:44 2018] [<ffffffffc1364b33>] mdt_reint_rec+0x83/0x210 [mdt]
[Tue Oct 2 11:56:44 2018] [<ffffffffc134636b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[Tue Oct 2 11:56:44 2018] [<ffffffffc1351f07>] mdt_reint+0x67/0x140 [mdt]
[Tue Oct 2 11:56:44 2018] [<ffffffffc0fee38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct 2 11:56:44 2018] [<ffffffffc0f96e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct 2 11:56:44 2018] [<ffffffffc0f9a592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct 2 11:56:44 2018] [<ffffffff946bae31>] kthread+0xd1/0xe0
[Tue Oct 2 11:56:44 2018] [<ffffffff94d1f5f7>] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct 2 11:56:44 2018] [<ffffffffffffffff>] 0xffffffffffffffff
[Tue Oct 2 11:56:44 2018] LustreError: dumping log to /tmp/lustre-log.1538474204.69588
[Tue Oct 2 11:56:49 2018] LNet: Service thread pid 7634 was inactive for 200.51s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[Tue Oct 2 11:56:49 2018] Pid: 7634, comm: mdt_rdpg00_000 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct 2 11:56:49 2018] Call Trace:
[Tue Oct 2 11:56:49 2018] [<ffffffffc06ef085>] wait_transaction_locked+0x85/0xd0 [jbd2]
[Tue Oct 2 11:56:49 2018] [<ffffffffc06ef368>] add_transaction_credits+0x268/0x2f0 [jbd2]
[Tue Oct 2 11:56:49 2018] [<ffffffffc06ef5e1>] start_this_handle+0x1a1/0x430 [jbd2]
[Tue Oct 2 11:56:49 2018] [<ffffffffc06efa93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
[Tue Oct 2 11:56:49 2018] [<ffffffffc118aa99>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
[Tue Oct 2 11:56:49 2018] [<ffffffffc12cac8e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
[Tue Oct 2 11:56:49 2018] [<ffffffffc1002512>] top_trans_start+0x702/0x940 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffffc13ec3f1>] lod_trans_start+0x31/0x40 [lod]
[Tue Oct 2 11:56:49 2018] [<ffffffffc148c1ba>] mdd_trans_start+0x1a/0x20 [mdd]
[Tue Oct 2 11:56:49 2018] [<ffffffffc1481b69>] mdd_attr_set+0x589/0xce0 [mdd]
[Tue Oct 2 11:56:49 2018] [<ffffffffc136c1d6>] mdt_mfd_close+0x1a6/0x610 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc1371951>] mdt_close_internal+0x121/0x220 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc1371c70>] mdt_close+0x220/0x780 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc0fee38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffffc0f96e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffffc0f9a592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffff946bae31>] kthread+0xd1/0xe0
[Tue Oct 2 11:56:49 2018] [<ffffffff94d1f5f7>] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct 2 11:56:49 2018] [<ffffffffffffffff>] 0xffffffffffffffff
[Tue Oct 2 11:56:49 2018] LustreError: dumping log to /tmp/lustre-log.1538474210.7634
[Tue Oct 2 11:56:49 2018] Pid: 69508, comm: mdt00_067 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct 2 11:56:49 2018] Call Trace:
[Tue Oct 2 11:56:49 2018] [<ffffffffc06ef085>] wait_transaction_locked+0x85/0xd0 [jbd2]
[Tue Oct 2 11:56:49 2018] [<ffffffffc06ef368>] add_transaction_credits+0x268/0x2f0 [jbd2]
[Tue Oct 2 11:56:49 2018] [<ffffffffc06ef5e1>] start_this_handle+0x1a1/0x430 [jbd2]
[Tue Oct 2 11:56:49 2018] [<ffffffffc06efa93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
[Tue Oct 2 11:56:49 2018] [<ffffffffc118aa99>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
[Tue Oct 2 11:56:49 2018] [<ffffffffc12cac8e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
[Tue Oct 2 11:56:49 2018] [<ffffffffc1369de7>] mdt_empty_transno+0xf7/0x840 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc136cf1e>] mdt_mfd_open+0x8de/0xe70 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc136da2b>] mdt_finish_open+0x57b/0x690 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc136f308>] mdt_reint_open+0x17c8/0x3190 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc1364b33>] mdt_reint_rec+0x83/0x210 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc134636b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc1346892>] mdt_intent_reint+0x162/0x430 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc1351671>] mdt_intent_policy+0x441/0xc70 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc0f3b2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffffc0f64b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffffc0fea452>] tgt_enqueue+0x62/0x210 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffffc0fee38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffffc0f96e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffffc0f9a592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffff946bae31>] kthread+0xd1/0xe0
[Tue Oct 2 11:56:49 2018] [<ffffffff94d1f5f7>] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct 2 11:56:49 2018] [<ffffffffffffffff>] 0xffffffffffffffff
[Tue Oct 2 11:56:49 2018] Pid: 69607, comm: mdt00_105 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct 2 11:56:49 2018] Call Trace:
[Tue Oct 2 11:56:49 2018] [<ffffffffc06ef085>] wait_transaction_locked+0x85/0xd0 [jbd2]
[Tue Oct 2 11:56:49 2018] [<ffffffffc06ef368>] add_transaction_credits+0x268/0x2f0 [jbd2]
[Tue Oct 2 11:56:49 2018] [<ffffffffc06ef5e1>] start_this_handle+0x1a1/0x430 [jbd2]
[Tue Oct 2 11:56:49 2018] [<ffffffffc06efa93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
[Tue Oct 2 11:56:49 2018] [<ffffffffc118aa99>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
[Tue Oct 2 11:56:49 2018] [<ffffffffc12cac8e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
[Tue Oct 2 11:56:49 2018] [<ffffffffc1369de7>] mdt_empty_transno+0xf7/0x840 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc136cf1e>] mdt_mfd_open+0x8de/0xe70 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc136da2b>] mdt_finish_open+0x57b/0x690 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc136f308>] mdt_reint_open+0x17c8/0x3190 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc1364b33>] mdt_reint_rec+0x83/0x210 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc134636b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc1346892>] mdt_intent_reint+0x162/0x430 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc1351671>] mdt_intent_policy+0x441/0xc70 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc0f3b2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffffc0f64b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffffc0fea452>] tgt_enqueue+0x62/0x210 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffffc0fee38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffffc0f96e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffffc0f9a592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffff946bae31>] kthread+0xd1/0xe0
[Tue Oct 2 11:56:49 2018] [<ffffffff94d1f5f7>] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct 2 11:56:49 2018] [<ffffffffffffffff>] 0xffffffffffffffff
[Tue Oct 2 11:56:49 2018] Pid: 69594, comm: mdt00_098 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct 2 11:56:49 2018] Call Trace:
[Tue Oct 2 11:56:49 2018] [<ffffffffc06ef085>] wait_transaction_locked+0x85/0xd0 [jbd2]
[Tue Oct 2 11:56:49 2018] [<ffffffffc06ef368>] add_transaction_credits+0x268/0x2f0 [jbd2]
[Tue Oct 2 11:56:49 2018] [<ffffffffc06ef5e1>] start_this_handle+0x1a1/0x430 [jbd2]
[Tue Oct 2 11:56:49 2018] [<ffffffffc06efa93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
[Tue Oct 2 11:56:49 2018] [<ffffffffc118aa99>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
[Tue Oct 2 11:56:49 2018] [<ffffffffc12cac8e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
[Tue Oct 2 11:56:49 2018] [<ffffffffc1002512>] top_trans_start+0x702/0x940 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffffc13ec3f1>] lod_trans_start+0x31/0x40 [lod]
[Tue Oct 2 11:56:49 2018] [<ffffffffc148c1ba>] mdd_trans_start+0x1a/0x20 [mdd]
[Tue Oct 2 11:56:49 2018] [<ffffffffc14763c0>] mdd_create+0xbe0/0x1400 [mdd]
[Tue Oct 2 11:56:49 2018] [<ffffffffc136fcb5>] mdt_reint_open+0x2175/0x3190 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc1364b33>] mdt_reint_rec+0x83/0x210 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc134636b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc1346892>] mdt_intent_reint+0x162/0x430 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc1351671>] mdt_intent_policy+0x441/0xc70 [mdt]
[Tue Oct 2 11:56:49 2018] [<ffffffffc0f3b2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffffc0f64b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffffc0fea452>] tgt_enqueue+0x62/0x210 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffffc0fee38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffffc0f96e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffffc0f9a592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct 2 11:56:49 2018] [<ffffffff946bae31>] kthread+0xd1/0xe0
[Tue Oct 2 11:56:49 2018] [<ffffffff94d1f5f7>] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct 2 11:56:49 2018] [<ffffffffffffffff>] 0xffffffffffffffff
Please have a look at the traces above; the full dmesg and Lustre log files are attached as an archive.
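For reference, the settings described above amount to roughly the following commands (a sketch only; the filesystem name "fsname" and the device paths are placeholders, not our real names):

```
# Project quota feature: enabled on every MDT/OST, then disabled again
# (run against each unmounted target device; the path is an example)
tune2fs -O project /dev/mapper/ost0
tune2fs -O ^project /dev/mapper/ost0

# Quota enforcement combinations tried on the MGS: 'g', then 'gp', now 'none'
# (the equivalent fsname.quota.mdt parameter was changed the same way)
lctl conf_param fsname.quota.ost=g
lctl conf_param fsname.quota.ost=gp
lctl conf_param fsname.quota.ost=none

# DDN-suggested workaround for the client memory accounting problem (on each OSS)
lctl set_param obdfilter.fsname-OST*.sync_journal=1
```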
|
| Comments |
| Comment by Alex Zhuravlev [ 05/Oct/18 ] |
|
looks like a |
| Comment by Marek Magrys [ 05/Oct/18 ] |
|
Might be, but the LU mentioned was treated as a minor priority, with the fix landed only on master (2.12). We have just hit the same problem on our main filesystem, so I think it would be good to backport the fix to 2.10, as the 2.10.6 release is around the corner. |
| Comment by Lukasz Flis [ 10/Oct/18 ] |
|
Is there a patch for 2.10.5 available somewhere? |
| Comment by Marion Hakanson (Inactive) [ 18/Oct/18 ] |
|
We seem to be hitting the same deadlock here. Sometimes rebooting the MDS will clear it, sometimes not. All Lustre servers are at 2.10.5 on CentOS 7.5 and were upgraded from IEEL 2.3/CentOS 6; clients are 2.10.2 on CentOS 7.4. Our OSTs are all ZFS, with the MDT being ldiskfs. The symptoms we see are hundreds of threads on the MDS in "D" state and all clients hanging, with log entries similar to the other poster's. Suggestions on how to recover from or avoid this issue would be very welcome.
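If it helps with triage, the stuck threads can be enumerated and their kernel stacks collected with something like the following (a generic sketch using standard procfs interfaces; the output path is arbitrary):

```
# List every task currently stuck in uninterruptible sleep ("D" state)
ps -eo state,pid,comm | awk '$1 == "D"'

# Capture the kernel stack of each such task for attaching to the ticket
for pid in $(ps -eo state,pid | awk '$1 == "D" {print $2}'); do
    echo "== PID $pid ($(cat /proc/$pid/comm 2>/dev/null)) =="
    cat /proc/$pid/stack 2>/dev/null
done > /tmp/d-state-stacks.$(hostname).txt
```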
|
| Comment by Lukasz Flis [ 18/Oct/18 ] |
|
Marion, do you have any idea which application triggers this problem in your environment? If you can share which applications you are running, maybe we'll find a common factor. Thanks |
| Comment by Marion Hakanson (Inactive) [ 18/Oct/18 ] |
|
No, we've not identified a triggering workload. We have a mixture of genomics analysis (many tiny jobs), MRI analysis, and Relion (CryoEM) MPI. But so far no correlation with the issue occurring. We upgraded on 09-October, ran for about a week without anything unusual. We did enable jobstats on Friday the 12th, and the first incident happened Sunday, with another on Tuesday (both of those cleared by rebooting just the MDS). Yesterday and today, we've had multiple instances, and rebooting the MDS/OSSs has not cleared things. Immediately after rebooting, the MDS gets on the order of one stuck thread for each client (250+), and I/O hangs on clients. We've been unable to clear the hangs except by rebooting all clients and servers, so our cluster has been unusable in that time. We disabled jobstats this morning, just prior to rebooting everything again, and are waiting to see if that has helped. Again, we need help, both with a patch, and with suggestions on an easier way to recover from this. |
| Comment by Marion Hakanson (Inactive) [ 19/Oct/18 ] |
|
We are still seeing the problem occur after disabling jobstats.
|
| Comment by Marion Hakanson (Inactive) [ 06/Nov/18 ] |
|
For Whamcloud/DDN folks, we also had a case (#111147) with DDN engineering to assist with diagnosing this issue at OHSU. You should be able to see the details here: https://community.ddn.com/50038000012Pme2
|
| Comment by Alex Zhuravlev [ 08/Nov/18 ] |
|
I'm able to reproduce the issue and a couple of patches under |
| Comment by Lukasz Flis [ 08/Nov/18 ] |
|
Alex, we can test Lustre with and without the patch on our testing instance tomorrow if you can share the reproducer code. |
| Comment by Andreas Dilger [ 16/Nov/18 ] |
|
I think that the I'm thinking that we should just revert the |
| Comment by Andreas Dilger [ 16/Nov/18 ] |
|
Hongchao, in the I've pushed https://review.whamcloud.com/33676 to revert the |
| Comment by Alex Zhuravlev [ 16/Nov/18 ] |
|
Andreas, I think the deadlock wasn't caused by dt_sync() but rather by the synchronous setattr RPC to the OST (which is used to verify that the group has enough quota). |
| Comment by Andreas Dilger [ 16/Nov/18 ] |
|
In either case, reverting the |
| Comment by Andreas Dilger [ 20/Nov/18 ] |
|
For 2.10.5 the https://review.whamcloud.com/33682 revert of |
| Comment by Marek Magrys [ 21/Nov/18 ] |
|
Ok, we could take the patches for a spin on one of our production systems. Should we pull both https://review.whamcloud.com/33682 (revert of
The frequency of the issue has gone down a lot over the last few weeks, so it will take some time before we can confirm that the problem is solved. |
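For anyone else wanting to test-build this, pulling a change from Gerrit onto b2_10 looks roughly like the following (a sketch; the patch set number "1" is a placeholder, use whatever the review page currently shows):

```
# Check out the b2_10 branch of the Lustre tree
git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git checkout b2_10

# Fetch change 33682 from Gerrit (refs/changes/<last-2-digits>/<change>/<patchset>)
# and apply it on top of b2_10
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/82/33682/1
git cherry-pick FETCH_HEAD
```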
| Comment by Alex Zhuravlev [ 21/Nov/18 ] |
|
revert of |
| Comment by Lukasz Flis [ 10/Dec/18 ] |
|
Short update: After moving to 2.10.6 RC2 with
[507798.983947] LNet: Service thread pid 17130 was inactive for 200.47s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[507799.001471] Pid: 17130, comm: ll_ost_io00_094 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[507799.012231] Call Trace:
[507799.014947] [<ffffffffc0eac085>] wait_transaction_locked+0x85/0xd0 [jbd2]
[507799.022084] [<ffffffffc0eac368>] add_transaction_credits+0x268/0x2f0 [jbd2]
[507799.029399] [<ffffffffc0eac5e1>] start_this_handle+0x1a1/0x430 [jbd2]
[507799.036186] [<ffffffffc0eaca93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
[507799.043074] [<ffffffffc0f22009>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
[507799.050909] [<ffffffffc0fecb1e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
[507799.058151] [<ffffffffc11242ae>] ofd_trans_start+0x6e/0xf0 [ofd]
[507799.064513] [<ffffffffc112a75b>] ofd_commitrw_write+0x94b/0x1c90 [ofd]
[507799.071400] [<ffffffffc112e7a9>] ofd_commitrw+0x4c9/0xae0 [ofd]
[507799.077676] [<ffffffffc0bb7824>] obd_commitrw+0x2f3/0x336 [ptlrpc]
[507799.084277] [<ffffffffc0b8a38d>] tgt_brw_write+0xffd/0x17d0 [ptlrpc]
[507799.091046] [<ffffffffc0b8638a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[507799.098231] [<ffffffffc0b2ee4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[507799.106355] [<ffffffffc0b32592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[507799.112928] [<ffffffffbc4bae31>] kthread+0xd1/0xe0
[507799.118077] [<ffffffffbcb1f5f7>] ret_from_fork_nospec_end+0x0/0x39
[507799.124610] [<ffffffffffffffff>] 0xffffffffffffffff
|
| Comment by Alex Zhuravlev [ 10/Dec/18 ] |
|
Have you updated the MDS? Can you provide all stack traces, please? |
| Comment by Lukasz Flis [ 10/Dec/18 ] |
|
We have updated all servers, MDS and OSS together. Please find attached the stack traces from the servers and the corresponding stacks from clients affected by the hangups: cyf_dec10_client_stacks.log
|
| Comment by Alex Zhuravlev [ 10/Dec/18 ] |
|
Hm, I see only writing threads on the OSS (ofd_commitrw_write() and around), but nobody else. Either the log is missing important trace(s) or it's something different. Is it possible to dump all threads (echo t > /proc/sysrq-trigger) and attach the output to the ticket? |
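For reference, capturing that dump amounts to roughly the following (a minimal sketch; the output file name is arbitrary):

```
# Allow the task-dump sysrq function, then dump all task stacks
# into the kernel ring buffer
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger

# Save the ring buffer right away; the dump is large and can be overwritten
dmesg > /tmp/sysrq-all-tasks.$(hostname).txt
```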
| Comment by Marek Magrys [ 10/Dec/18 ] |
|
We will bump the servers' kernel to the latest RHEL 7.6 kernel with the patches from 2.10.6; our 2.10.4 kernels lacked some ext4 patches, which might be the root cause of the stack traces above. |
| Comment by Andrew Perepechko [ 26/Jun/19 ] |
|
Has this bug been really fixed in master?
However, this bug is not a deadlock between the MDS and OSS as the commit message claims. It is an OSS deadlock; the MDS just waits forever for a reply while holding an open transaction, which freezes other MDS journal users. The OSS deadlocks because it takes oo_sem and the transaction handle in a different order for common operations and OUT operations, e.g.:
crash> bt ffff88081329cf10
PID: 50457 TASK: ffff88081329cf10 CPU: 14 COMMAND: "ll_ost_out01_00"
 #0 [ffff88026aca3988] __schedule at ffffffff816b3de4
 #1 [ffff88026aca3a10] schedule at ffffffff816b4409
 #2 [ffff88026aca3a20] rwsem_down_write_failed at ffffffff816b5cf5
 #3 [ffff88026aca3ab8] call_rwsem_down_write_failed at ffffffff81338247
 #4 [ffff88026aca3b00] down_write at ffffffff816b356d
 #5 [ffff88026aca3b18] osd_write_lock at ffffffffc151eb0c [osd_ldiskfs]
 #6 [ffff88026aca3b40] out_tx_attr_set_exec at ffffffffc0ea3399 [ptlrpc]
 #7 [ffff88026aca3b78] out_tx_end at ffffffffc0e9d771 [ptlrpc]
 #8 [ffff88026aca3bb8] out_handle at ffffffffc0ea1952 [ptlrpc]
 #9 [ffff88026aca3cf8] tgt_request_handle at ffffffffc0e988ba [ptlrpc]
#10 [ffff88026aca3d40] ptlrpc_server_handle_request at ffffffffc0e3df13 [ptlrpc]
#11 [ffff88026aca3de0] ptlrpc_main at ffffffffc0e41862 [ptlrpc]
#12 [ffff88026aca3ec8] kthread at ffffffff810b4031
#13 [ffff88026aca3f50] ret_from_fork at ffffffff816c155d

crash> bt 0xffff88038d7eaf70
PID: 114059 TASK: ffff88038d7eaf70 CPU: 11 COMMAND: "ll_ost00_040"
 #0 [ffff8804b073b968] __schedule at ffffffff816b3de4
 #1 [ffff8804b073b9f8] schedule at ffffffff816b4409
 #2 [ffff8804b073ba08] wait_transaction_locked at ffffffffc0739085 [jbd2]
 #3 [ffff8804b073ba60] add_transaction_credits at ffffffffc0739368 [jbd2]
 #4 [ffff8804b073bac0] start_this_handle at ffffffffc07395e1 [jbd2]
 #5 [ffff8804b073bb58] jbd2__journal_start at ffffffffc0739a93 [jbd2]
 #6 [ffff8804b073bba0] __ldiskfs_journal_start_sb at ffffffffc1469e59 [ldiskfs]
 #7 [ffff8804b073bbe0] osd_trans_start at ffffffffc152a2ce [osd_ldiskfs]
 #8 [ffff8804b073bc18] ofd_trans_start at ffffffffc15b01ae [ofd]
 #9 [ffff8804b073bc48] ofd_attr_set at ffffffffc15b3173 [ofd]
#10 [ffff8804b073bca0] ofd_setattr_hdl at ffffffffc159e8ed [ofd]
#11 [ffff8804b073bcf8] tgt_request_handle at ffffffffc0e988ba [ptlrpc]
#12 [ffff8804b073bd40] ptlrpc_server_handle_request at ffffffffc0e3df13 [ptlrpc]
#13 [ffff8804b073bde0] ptlrpc_main at ffffffffc0e41862 [ptlrpc]
#14 [ffff8804b073bec8] kthread at ffffffff810b4031
#15 [ffff8804b073bf50] ret_from_fork at ffffffff816c155d
|
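The two sides of that inversion are easy to count in a saved "foreach bt" dump from crash (illustrative sketch; "foreach-bt.txt" is a hypothetical file holding that output, not an attachment on this ticket):

```
# Side A: OUT handler threads that already opened a transaction in out_tx_end()
# and are now blocked taking the object rwsem
grep -B6 'out_tx_attr_set_exec' foreach-bt.txt | grep -c 'rwsem_down_write_failed'

# Side B: ordinary OST threads blocked in jbd2 while starting or finishing
# a handle against the transaction that can no longer close
grep -B8 -E 'ofd_attr_set|ofd_commitrw_write' foreach-bt.txt | \
    grep -c -E 'wait_transaction_locked|jbd2_log_wait_commit'
```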
| Comment by Andrew Perepechko [ 26/Jun/19 ] |
Ah, sorry, I missed this link when first reading the ticket. It really does seem to fix this issue: https://review.whamcloud.com/#/c/31293/60/lustre/ofd/ofd_objects.c |
| Comment by Lukasz Flis [ 18/Sep/19 ] |
|
@Andrew Perepechko The patch you mentioned is a subset of |
| Comment by Alex Zhuravlev [ 16/Apr/20 ] |
|
luchuan_sugon, do you still need this ticket open? |