<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:44:06 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-11465] OSS/MDS deadlock in 2.10.5</title>
                <link>https://jira.whamcloud.com/browse/LU-11465</link>
                <project id="10000" key="LU">Lustre</project>
                <description>&lt;p&gt;After an upgrade of the Lustre servers from CentOS 6 and the IEEL/DDN version of Lustre 2.5 to CentOS 7 and Lustre 2.10.5, we are experiencing stability issues related to an MDS/OSS deadlock, most likely caused by a hung OST thread. The issue is visible on only one of two filesystems, the smaller one; both run on an identical HW and SW stack, so the issue is most likely related to a specific workload on that filesystem. It is mostly a mid-term storage filesystem, so more metadata ops happen on it than on the scratch one. We initially thought the issue might be caused by our attempt to enable project quotas (tune2fs -O project on all targets), so we turned it off (tune2fs -O ^project), which didn&apos;t change a thing. Likewise, no combination of MDT/OST quota enforcement (initially &apos;g&apos;, then &apos;gp&apos;, now &apos;none&apos;) made any difference, so quota enforcement is probably unrelated to the problem. In terms of relevant information, we run with sync_journal=1 as a workaround for a problem with memory accounting on the clients, suggested by DDN some time ago. The crash happens at least once every few hours, sometimes sequentially one after another, on different OSS nodes. I also tagged 2.10.6 as affected, as we tried everything from 2.10.4 to the current b2_10.&lt;/p&gt;

&lt;p&gt;On OSS we get:&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
[Tue Oct  2 11:56:44 2018] LNet: Service thread pid 11401 was inactive &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 200.27s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; debugging purposes:
[Tue Oct  2 11:56:44 2018] Pid: 11401, comm: ll_ost_out00_00 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct  2 11:56:44 2018] Call Trace:
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffff8395aa77&amp;gt;] call_rwsem_down_write_failed+0x17/0x30
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc0eeaa8c&amp;gt;] osd_write_lock+0x5c/0xe0 [osd_ldiskfs]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc0b5b119&amp;gt;] out_tx_attr_set_exec+0x69/0x3f0 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc0b55591&amp;gt;] out_tx_end+0xe1/0x5c0 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc0b596d3&amp;gt;] out_handle+0x1453/0x1bc0 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc0b4f38a&amp;gt;] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc0af7e4b&amp;gt;] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc0afb592&amp;gt;] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffff836bae31&amp;gt;] kthread+0xd1/0xe0
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffff83d1f5f7&amp;gt;] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
[Tue Oct  2 11:56:44 2018] LustreError: dumping log to /tmp/lustre-log.1538474204.11401
[Tue Oct  2 11:56:44 2018] Pid: 35664, comm: ll_ost_io00_098 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct  2 11:56:44 2018] Call Trace:
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc0e7e495&amp;gt;] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc0e76533&amp;gt;] jbd2_journal_stop+0x343/0x3d0 [jbd2]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc112ab4c&amp;gt;] __ldiskfs_journal_stop+0x3c/0xb0 [ldiskfs]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc0efb783&amp;gt;] osd_trans_stop+0x183/0x850 [osd_ldiskfs]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc107d352&amp;gt;] ofd_trans_stop+0x22/0x60 [ofd]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc10835f4&amp;gt;] ofd_commitrw_write+0x7e4/0x1c90 [ofd]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc10877a9&amp;gt;] ofd_commitrw+0x4c9/0xae0 [ofd]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc0b80864&amp;gt;] obd_commitrw+0x2f3/0x336 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc0b5338d&amp;gt;] tgt_brw_write+0xffd/0x17d0 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc0b4f38a&amp;gt;] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc0af7e4b&amp;gt;] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc0afb592&amp;gt;] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffff836bae31&amp;gt;] kthread+0xd1/0xe0
[Tue Oct  2 11:56:45 2018]  [&amp;lt;ffffffff83d1f5f7&amp;gt;] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct  2 11:56:45 2018]  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
[Tue Oct  2 11:56:45 2018] Pid: 35678, comm: ll_ost_io00_112 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct  2 11:56:45 2018] Call Trace:
[Tue Oct  2 11:56:45 2018]  [&amp;lt;ffffffffc0e7e495&amp;gt;] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
[Tue Oct  2 11:56:45 2018]  [&amp;lt;ffffffffc0e76533&amp;gt;] jbd2_journal_stop+0x343/0x3d0 [jbd2]
[Tue Oct  2 11:56:45 2018]  [&amp;lt;ffffffffc112ab4c&amp;gt;] __ldiskfs_journal_stop+0x3c/0xb0 [ldiskfs]
[Tue Oct  2 11:56:45 2018]  [&amp;lt;ffffffffc0efb783&amp;gt;] osd_trans_stop+0x183/0x850 [osd_ldiskfs]
[Tue Oct  2 11:56:45 2018]  [&amp;lt;ffffffffc107d352&amp;gt;] ofd_trans_stop+0x22/0x60 [ofd]
[Tue Oct  2 11:56:45 2018]  [&amp;lt;ffffffffc10835f4&amp;gt;] ofd_commitrw_write+0x7e4/0x1c90 [ofd]
[Tue Oct  2 11:56:45 2018]  [&amp;lt;ffffffffc10877a9&amp;gt;] ofd_commitrw+0x4c9/0xae0 [ofd]
[Tue Oct  2 11:56:45 2018]  [&amp;lt;ffffffffc0b80864&amp;gt;] obd_commitrw+0x2f3/0x336 [ptlrpc]
[Tue Oct  2 11:56:45 2018]  [&amp;lt;ffffffffc0b5338d&amp;gt;] tgt_brw_write+0xffd/0x17d0 [ptlrpc]
[Tue Oct  2 11:56:45 2018]  [&amp;lt;ffffffffc0b4f38a&amp;gt;] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct  2 11:56:45 2018]  [&amp;lt;ffffffffc0af7e4b&amp;gt;] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct  2 11:56:45 2018]  [&amp;lt;ffffffffc0afb592&amp;gt;] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct  2 11:56:45 2018]  [&amp;lt;ffffffff836bae31&amp;gt;] kthread+0xd1/0xe0
[Tue Oct  2 11:56:45 2018]  [&amp;lt;ffffffff83d1f5f7&amp;gt;] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct  2 11:56:45 2018]  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;On MDS:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
[Tue Oct  2 11:56:44 2018] LNet: Service thread pid 69588 was inactive &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 200.49s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; debugging purposes:
[Tue Oct  2 11:56:44 2018] Pid: 69588, comm: mdt00_095 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct  2 11:56:44 2018] Call Trace:
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc0f7e140&amp;gt;] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc0f7e61d&amp;gt;] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc14de463&amp;gt;] osp_remote_sync+0xd3/0x200 [osp]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc14c6dd0&amp;gt;] osp_attr_set+0x4c0/0x5d0 [osp]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc1421c6b&amp;gt;] lod_sub_attr_set+0x1cb/0x460 [lod]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc1403be6&amp;gt;] lod_obj_stripe_attr_set_cb+0x16/0x30 [lod]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc140fa96&amp;gt;] lod_obj_for_each_stripe+0xb6/0x230 [lod]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc1411043&amp;gt;] lod_attr_set+0x2f3/0x9a0 [lod]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc14810a0&amp;gt;] mdd_attr_set_internal+0x120/0x2a0 [mdd]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc1481e8d&amp;gt;] mdd_attr_set+0x8ad/0xce0 [mdd]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc13645f5&amp;gt;] mdt_reint_setattr+0xba5/0x1060 [mdt]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc1364b33&amp;gt;] mdt_reint_rec+0x83/0x210 [mdt]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc134636b&amp;gt;] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc1351f07&amp;gt;] mdt_reint+0x67/0x140 [mdt]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc0fee38a&amp;gt;] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc0f96e4b&amp;gt;] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffc0f9a592&amp;gt;] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffff946bae31&amp;gt;] kthread+0xd1/0xe0
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffff94d1f5f7&amp;gt;] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct  2 11:56:44 2018]  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
[Tue Oct  2 11:56:44 2018] LustreError: dumping log to /tmp/lustre-log.1538474204.69588
[Tue Oct  2 11:56:49 2018] LNet: Service thread pid 7634 was inactive &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 200.51s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; debugging purposes:
[Tue Oct  2 11:56:49 2018] Pid: 7634, comm: mdt_rdpg00_000 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct  2 11:56:49 2018] Call Trace:
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc06ef085&amp;gt;] wait_transaction_locked+0x85/0xd0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc06ef368&amp;gt;] add_transaction_credits+0x268/0x2f0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc06ef5e1&amp;gt;] start_this_handle+0x1a1/0x430 [jbd2]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc06efa93&amp;gt;] jbd2__journal_start+0xf3/0x1f0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc118aa99&amp;gt;] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc12cac8e&amp;gt;] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc1002512&amp;gt;] top_trans_start+0x702/0x940 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc13ec3f1&amp;gt;] lod_trans_start+0x31/0x40 [lod]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc148c1ba&amp;gt;] mdd_trans_start+0x1a/0x20 [mdd]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc1481b69&amp;gt;] mdd_attr_set+0x589/0xce0 [mdd]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc136c1d6&amp;gt;] mdt_mfd_close+0x1a6/0x610 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc1371951&amp;gt;] mdt_close_internal+0x121/0x220 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc1371c70&amp;gt;] mdt_close+0x220/0x780 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc0fee38a&amp;gt;] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc0f96e4b&amp;gt;] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc0f9a592&amp;gt;] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffff946bae31&amp;gt;] kthread+0xd1/0xe0
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffff94d1f5f7&amp;gt;] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
[Tue Oct  2 11:56:49 2018] LustreError: dumping log to /tmp/lustre-log.1538474210.7634
[Tue Oct  2 11:56:49 2018] Pid: 69508, comm: mdt00_067 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct  2 11:56:49 2018] Call Trace:
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc06ef085&amp;gt;] wait_transaction_locked+0x85/0xd0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc06ef368&amp;gt;] add_transaction_credits+0x268/0x2f0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc06ef5e1&amp;gt;] start_this_handle+0x1a1/0x430 [jbd2]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc06efa93&amp;gt;] jbd2__journal_start+0xf3/0x1f0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc118aa99&amp;gt;] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc12cac8e&amp;gt;] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc1369de7&amp;gt;] mdt_empty_transno+0xf7/0x840 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc136cf1e&amp;gt;] mdt_mfd_open+0x8de/0xe70 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc136da2b&amp;gt;] mdt_finish_open+0x57b/0x690 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc136f308&amp;gt;] mdt_reint_open+0x17c8/0x3190 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc1364b33&amp;gt;] mdt_reint_rec+0x83/0x210 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc134636b&amp;gt;] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc1346892&amp;gt;] mdt_intent_reint+0x162/0x430 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc1351671&amp;gt;] mdt_intent_policy+0x441/0xc70 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc0f3b2ba&amp;gt;] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc0f64b53&amp;gt;] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc0fea452&amp;gt;] tgt_enqueue+0x62/0x210 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc0fee38a&amp;gt;] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc0f96e4b&amp;gt;] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc0f9a592&amp;gt;] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffff946bae31&amp;gt;] kthread+0xd1/0xe0
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffff94d1f5f7&amp;gt;] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
[Tue Oct  2 11:56:49 2018] Pid: 69607, comm: mdt00_105 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct  2 11:56:49 2018] Call Trace:
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc06ef085&amp;gt;] wait_transaction_locked+0x85/0xd0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc06ef368&amp;gt;] add_transaction_credits+0x268/0x2f0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc06ef5e1&amp;gt;] start_this_handle+0x1a1/0x430 [jbd2]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc06efa93&amp;gt;] jbd2__journal_start+0xf3/0x1f0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc118aa99&amp;gt;] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc12cac8e&amp;gt;] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc1369de7&amp;gt;] mdt_empty_transno+0xf7/0x840 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc136cf1e&amp;gt;] mdt_mfd_open+0x8de/0xe70 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc136da2b&amp;gt;] mdt_finish_open+0x57b/0x690 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc136f308&amp;gt;] mdt_reint_open+0x17c8/0x3190 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc1364b33&amp;gt;] mdt_reint_rec+0x83/0x210 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc134636b&amp;gt;] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc1346892&amp;gt;] mdt_intent_reint+0x162/0x430 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc1351671&amp;gt;] mdt_intent_policy+0x441/0xc70 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc0f3b2ba&amp;gt;] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc0f64b53&amp;gt;] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc0fea452&amp;gt;] tgt_enqueue+0x62/0x210 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc0fee38a&amp;gt;] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc0f96e4b&amp;gt;] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc0f9a592&amp;gt;] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffff946bae31&amp;gt;] kthread+0xd1/0xe0
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffff94d1f5f7&amp;gt;] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
[Tue Oct  2 11:56:49 2018] Pid: 69594, comm: mdt00_098 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
[Tue Oct  2 11:56:49 2018] Call Trace:
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc06ef085&amp;gt;] wait_transaction_locked+0x85/0xd0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc06ef368&amp;gt;] add_transaction_credits+0x268/0x2f0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc06ef5e1&amp;gt;] start_this_handle+0x1a1/0x430 [jbd2]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc06efa93&amp;gt;] jbd2__journal_start+0xf3/0x1f0 [jbd2]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc118aa99&amp;gt;] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc12cac8e&amp;gt;] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc1002512&amp;gt;] top_trans_start+0x702/0x940 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc13ec3f1&amp;gt;] lod_trans_start+0x31/0x40 [lod]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc148c1ba&amp;gt;] mdd_trans_start+0x1a/0x20 [mdd]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc14763c0&amp;gt;] mdd_create+0xbe0/0x1400 [mdd]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc136fcb5&amp;gt;] mdt_reint_open+0x2175/0x3190 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc1364b33&amp;gt;] mdt_reint_rec+0x83/0x210 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc134636b&amp;gt;] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc1346892&amp;gt;] mdt_intent_reint+0x162/0x430 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc1351671&amp;gt;] mdt_intent_policy+0x441/0xc70 [mdt]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc0f3b2ba&amp;gt;] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc0f64b53&amp;gt;] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc0fea452&amp;gt;] tgt_enqueue+0x62/0x210 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc0fee38a&amp;gt;] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc0f96e4b&amp;gt;] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffc0f9a592&amp;gt;] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffff946bae31&amp;gt;] kthread+0xd1/0xe0
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffff94d1f5f7&amp;gt;] ret_from_fork_nospec_end+0x0/0x39
[Tue Oct  2 11:56:49 2018]  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Please have a look at the above; the full dmesg and Lustre log files are attached as an archive.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</description>
                <environment>CentOS 7, 3.10.0-862.2.3.el7_lustre.x86_64, 1 MDS (+1 HA pair), 4 OSS</environment>
        <key id="53484">LU-11465</key>
            <summary>OSS/MDS deadlock in 2.10.5</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="4">Incomplete</resolution>
                                        <assignee username="bzzz">Alex Zhuravlev</assignee>
                                    <reporter username="m.magrys">Marek Magrys</reporter>
                        <labels>
                    </labels>
                <created>Wed, 3 Oct 2018 17:07:13 +0000</created>
                <updated>Fri, 12 Aug 2022 21:52:51 +0000</updated>
                            <resolved>Fri, 12 Aug 2022 21:52:51 +0000</resolved>
                                    <version>Lustre 2.10.4</version>
                    <version>Lustre 2.10.5</version>
                    <version>Lustre 2.10.6</version>
                                                        <due></due>
                            <votes>3</votes>
                                    <watches>19</watches>
                                                                            <comments>
                            <comment id="234419" author="bzzz" created="Fri, 5 Oct 2018 10:05:56 +0000"  >&lt;p&gt;looks like a &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10048&quot; title=&quot;osd-ldiskfs to truncate outside of main transaction&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10048&quot;&gt;&lt;del&gt;LU-10048&lt;/del&gt;&lt;/a&gt; duplicate&lt;/p&gt;</comment>
                            <comment id="234450" author="m.magrys" created="Fri, 5 Oct 2018 17:58:48 +0000"  >&lt;p&gt;Might be, but the mentioned LU was minor priority, with the fix landed only on master (2.12). In our case we just hit the same problem on our main filesystem, so I think it would be good to backport the resolution to 2.10, as the 2.10.6 release is around the corner.&lt;/p&gt;</comment>
                            <comment id="234705" author="lflis" created="Wed, 10 Oct 2018 11:22:40 +0000"  >&lt;p&gt;Is there a patch for 2.10.5 available somewhere?&lt;/p&gt;</comment>
                            <comment id="235073" author="hakanson" created="Thu, 18 Oct 2018 06:02:58 +0000"  >&lt;p&gt;We seem to be having the same deadlock here. &#160;Sometimes rebooting the MDS will clear it, sometimes not. &#160;All Lustre servers are at 2.10.5, CentOS-7.5, were upgraded from IEEL-2.3/CentOS-6. &#160;Clients are 2.10.2, CentOS-7.4.&lt;/p&gt;

&lt;p&gt;&#160;Oh, our OSTs are all ZFS, with the MDT being ldiskfs. &#160;The symptoms we see are hundreds of threads on the MDS in &quot;D&quot; state, and all clients hang. &#160;Similar log entries as the other poster.&lt;/p&gt;

&lt;p&gt;Suggestions on how to recover from or avoid this issue would be very welcome.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="235078" author="lflis" created="Thu, 18 Oct 2018 09:46:12 +0000"  >&lt;p&gt;Marion, do you have any idea which application causes this problem in your environment?&lt;br/&gt;
We @CYFRONET are still trying to isolate the problematic type of workload in order to make a simple reproducer out of it.&lt;/p&gt;

&lt;p&gt;If you can share what applications you are running, maybe we&apos;ll find a common factor.&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;</comment>
                            <comment id="235118" author="hakanson" created="Thu, 18 Oct 2018 19:12:35 +0000"  >&lt;p&gt;No, we&apos;ve not identified a triggering workload.&#160; We have a mixture of genomics analysis (many tiny jobs), MRI analysis, and Relion (CryoEM) MPI.&#160; But so far no correlation with the issue occurring.&lt;/p&gt;

&lt;p&gt;We upgraded on 09-October, ran for about a week without anything unusual.&#160; We did enable jobstats on Friday the 12th, and the first incident happened Sunday, with another on Tuesday (both of those cleared by rebooting just the MDS).&lt;/p&gt;

&lt;p&gt;Yesterday and today, we&apos;ve had multiple instances, and rebooting the MDS/OSSs has not cleared things.&#160; Immediately after rebooting, the MDS gets on the order of one stuck thread for each client (250+), and I/O hangs on clients.&#160; We&apos;ve been unable to clear the hangs except by rebooting all clients and servers, so our cluster has been unusable in that time.&lt;/p&gt;

&lt;p&gt;We disabled jobstats this morning, just prior to rebooting everything again, and are waiting to see if that has helped.&lt;/p&gt;

&lt;p&gt;Again, we need help, both with a patch, and with suggestions on an easier way to recover from this.&lt;/p&gt;</comment>
                            <comment id="235137" author="hakanson" created="Fri, 19 Oct 2018 06:17:01 +0000"  >&lt;p&gt;We are still seeing the problem occur after disabling jobstats.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="236496" author="hakanson" created="Tue, 6 Nov 2018 23:24:47 +0000"  >&lt;p&gt;For Whamcloud/DDN folks, we also had a case (#111147) with DDN engineering to assist with diagnosing this issue at OHSU.&#160; You should be able to see the details here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://community.ddn.com/50038000012Pme2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://community.ddn.com/50038000012Pme2&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="236663" author="bzzz" created="Thu, 8 Nov 2018 17:06:43 +0000"  >&lt;p&gt;I&apos;m able to reproduce the issue, and a couple of patches under &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10048&quot; title=&quot;osd-ldiskfs to truncate outside of main transaction&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10048&quot;&gt;&lt;del&gt;LU-10048&lt;/del&gt;&lt;/a&gt; seem to solve it; they are in testing now.&lt;/p&gt;</comment>
                            <comment id="236688" author="lflis" created="Thu, 8 Nov 2018 18:42:50 +0000"  >&lt;p&gt;Alex, we can test Lustre with and without the patch on our testing instance tomorrow if you can share the reproducer code.&lt;/p&gt;</comment>
                            <comment id="237112" author="adilger" created="Fri, 16 Nov 2018 20:56:05 +0000"  >&lt;p&gt;I think that the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5152&quot; title=&quot;Can&amp;#39;t enforce block quota when unprivileged user change group&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5152&quot;&gt;&lt;del&gt;LU-5152&lt;/del&gt;&lt;/a&gt; patch may be the root cause of this problem.  It sends a sync to the OSTs during chgrp, but I suspect that this induces a deadlock when the OSTs are trying to get new quota from the master.&lt;/p&gt;

&lt;p&gt;I&apos;m thinking that we should just revert the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5152&quot; title=&quot;Can&amp;#39;t enforce block quota when unprivileged user change group&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5152&quot;&gt;&lt;del&gt;LU-5152&lt;/del&gt;&lt;/a&gt; patch for now, since it is also causing performance issues.  In the meantime, it means that quota would no longer properly handle file group ownership changes if group quota is enabled.&lt;/p&gt;</comment>
                            <comment id="237114" author="adilger" created="Fri, 16 Nov 2018 21:16:34 +0000"  >&lt;p&gt;Hongchao, in the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5152&quot; title=&quot;Can&amp;#39;t enforce block quota when unprivileged user change group&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5152&quot;&gt;&lt;del&gt;LU-5152&lt;/del&gt;&lt;/a&gt; patch it may be OK to remove the &lt;tt&gt;dt_sync()&lt;/tt&gt; call, and instead ensure in the OSP code to properly order the chgrp with previous setattr requests.  That could avoid the deadlock here, as well as the performance problems reported in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11303&quot; title=&quot;slow chgrp as user when quotas are enabled&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11303&quot;&gt;&lt;del&gt;LU-11303&lt;/del&gt;&lt;/a&gt;.  Also, there is already a mechanism to force sync permission changes via &lt;tt&gt;permission_needs_sync()&lt;/tt&gt; so we definitely should &lt;b&gt;not&lt;/b&gt; be doing two syncs per setattr.&lt;/p&gt;

&lt;p&gt;I&apos;ve pushed &lt;a href=&quot;https://review.whamcloud.com/33676&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/33676&lt;/a&gt; to revert the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5152&quot; title=&quot;Can&amp;#39;t enforce block quota when unprivileged user change group&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5152&quot;&gt;&lt;del&gt;LU-5152&lt;/del&gt;&lt;/a&gt; patch since I think it is causing as many problems as it solves.&lt;/p&gt;</comment>
                            <comment id="237115" author="bzzz" created="Fri, 16 Nov 2018 21:23:51 +0000"  >&lt;p&gt;Andreas, I think the deadlock wasn&apos;t caused by dt_sync() but rather by the sync setattr RPC to the OST (which is to verify that the group has enough quota).&lt;/p&gt;</comment>
                            <comment id="237119" author="adilger" created="Fri, 16 Nov 2018 22:17:14 +0000"  >&lt;p&gt;In either case, reverting the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5152&quot; title=&quot;Can&amp;#39;t enforce block quota when unprivileged user change group&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5152&quot;&gt;&lt;del&gt;LU-5152&lt;/del&gt;&lt;/a&gt; patch should avoid this problem.  I&apos;ll note that this issue is marked as affecting 2.10.4, 2.10.5, and 2.10.6, and the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5152&quot; title=&quot;Can&amp;#39;t enforce block quota when unprivileged user change group&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5152&quot;&gt;&lt;del&gt;LU-5152&lt;/del&gt;&lt;/a&gt; patch was landed in 2.10.4 so it is at least a strong correlation for being the root cause.&lt;/p&gt;</comment>
                            <comment id="237224" author="adilger" created="Tue, 20 Nov 2018 00:31:12 +0000"  >&lt;p&gt;For 2.10.5 the &lt;a href=&quot;https://review.whamcloud.com/33682&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/33682&lt;/a&gt; revert of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5152&quot; title=&quot;Can&amp;#39;t enforce block quota when unprivileged user change group&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5152&quot;&gt;&lt;del&gt;LU-5152&lt;/del&gt;&lt;/a&gt; has passed testing and is the simplest solution.  For 2.12 or 2.13 we are looking into a proper solution that preserves chgrp behavior without impacting the performance or inducing deadlocks.&lt;/p&gt;</comment>
                            <comment id="237347" author="m.magrys" created="Wed, 21 Nov 2018 16:41:29 +0000"  >&lt;p&gt;Ok, we could take the patches for a spin on one of our production systems. Should we pull both&#160;&lt;a href=&quot;https://review.whamcloud.com/33682&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/33682&lt;/a&gt; (revert of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5152&quot; title=&quot;Can&amp;#39;t enforce block quota when unprivileged user change group&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5152&quot;&gt;&lt;del&gt;LU-5152&lt;/del&gt;&lt;/a&gt;) and &lt;a href=&quot;https://review.whamcloud.com/33586&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/33586&lt;/a&gt; (&quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10048&quot; title=&quot;osd-ldiskfs to truncate outside of main transaction&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10048&quot;&gt;&lt;del&gt;LU-10048&lt;/del&gt;&lt;/a&gt; osd: async truncate&quot;), or would you recommend sticking with just 33682 for now?&lt;/p&gt;

&lt;p&gt;The frequency of the issue has dropped considerably over the last few weeks, so it will take some time before we can confirm that the problem is solved.&lt;/p&gt;</comment>
                            <comment id="237352" author="bzzz" created="Wed, 21 Nov 2018 18:03:42 +0000"  >&lt;p&gt;revert of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5152&quot; title=&quot;Can&amp;#39;t enforce block quota when unprivileged user change group&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5152&quot;&gt;&lt;del&gt;LU-5152&lt;/del&gt;&lt;/a&gt; should be enough.&lt;/p&gt;</comment>
                            <comment id="238269" author="lflis" created="Mon, 10 Dec 2018 11:57:56 +0000"  >&lt;p&gt;Short update:&lt;/p&gt;

&lt;p&gt;After moving to 2.10.6 RC2 with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5152&quot; title=&quot;Can&amp;#39;t enforce block quota when unprivileged user change group&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5152&quot;&gt;&lt;del&gt;LU-5152&lt;/del&gt;&lt;/a&gt; reverted, we still see hung threads on the OSSes:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
[507798.983947] LNet: Service thread pid 17130 was inactive &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 200.47s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; debugging purposes:
 [507799.001471] Pid: 17130, comm: ll_ost_io00_094 3.10.0-862.2.3.el7_lustre.x86_64 #1 SMP Tue May 22 17:36:23 UTC 2018
 [507799.012231] Call Trace:
 [507799.014947] [&amp;lt;ffffffffc0eac085&amp;gt;] wait_transaction_locked+0x85/0xd0 [jbd2]
 [507799.022084] [&amp;lt;ffffffffc0eac368&amp;gt;] add_transaction_credits+0x268/0x2f0 [jbd2]
 [507799.029399] [&amp;lt;ffffffffc0eac5e1&amp;gt;] start_this_handle+0x1a1/0x430 [jbd2]
 [507799.036186] [&amp;lt;ffffffffc0eaca93&amp;gt;] jbd2__journal_start+0xf3/0x1f0 [jbd2]
 [507799.043074] [&amp;lt;ffffffffc0f22009&amp;gt;] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
 [507799.050909] [&amp;lt;ffffffffc0fecb1e&amp;gt;] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
 [507799.058151] [&amp;lt;ffffffffc11242ae&amp;gt;] ofd_trans_start+0x6e/0xf0 [ofd]
 [507799.064513] [&amp;lt;ffffffffc112a75b&amp;gt;] ofd_commitrw_write+0x94b/0x1c90 [ofd]
 [507799.071400] [&amp;lt;ffffffffc112e7a9&amp;gt;] ofd_commitrw+0x4c9/0xae0 [ofd]
 [507799.077676] [&amp;lt;ffffffffc0bb7824&amp;gt;] obd_commitrw+0x2f3/0x336 [ptlrpc]
 [507799.084277] [&amp;lt;ffffffffc0b8a38d&amp;gt;] tgt_brw_write+0xffd/0x17d0 [ptlrpc]
 [507799.091046] [&amp;lt;ffffffffc0b8638a&amp;gt;] tgt_request_handle+0x92a/0x1370 [ptlrpc]
 [507799.098231] [&amp;lt;ffffffffc0b2ee4b&amp;gt;] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
 [507799.106355] [&amp;lt;ffffffffc0b32592&amp;gt;] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
 [507799.112928] [&amp;lt;ffffffffbc4bae31&amp;gt;] kthread+0xd1/0xe0
 [507799.118077] [&amp;lt;ffffffffbcb1f5f7&amp;gt;] ret_from_fork_nospec_end+0x0/0x39
 [507799.124610] [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
</comment>
                            <comment id="238270" author="bzzz" created="Mon, 10 Dec 2018 12:01:15 +0000"  >&lt;p&gt;have you updated MDS? can you provide all stack traces please?&lt;/p&gt;</comment>
                            <comment id="238275" author="lflis" created="Mon, 10 Dec 2018 13:44:15 +0000"  >&lt;p&gt;We have updated all servers: MDS + OSS together.&lt;/p&gt;

&lt;p&gt;Please find attached the stack traces from the servers, and the corresponding stacks from the clients affected by the hangups.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/31602/31602_cyf_dec10_client_stacks.log&quot; title=&quot;cyf_dec10_client_stacks.log attached to LU-11465&quot;&gt;cyf_dec10_client_stacks.log&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/31603/31603_cyf_dec10_server_stacks.log&quot; title=&quot;cyf_dec10_server_stacks.log attached to LU-11465&quot;&gt;cyf_dec10_server_stacks.log&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;br/&gt;
 There were no stack traces on the MDS today. There were some in past days, so I can attach those as well if needed.&lt;/p&gt;

</comment>
                            <comment id="238278" author="bzzz" created="Mon, 10 Dec 2018 13:53:57 +0000"  >&lt;p&gt;Hm, I see only writing threads on the OSS (ofd_commitrw_write() and around), but nobody else. Either the log is missing important trace(s) or it&apos;s something different. Is it possible to dump all threads (echo t &amp;gt;/proc/sysrq-trigger) and attach them to the ticket?&lt;/p&gt;</comment>
                            <comment id="238283" author="m.magrys" created="Mon, 10 Dec 2018 14:07:57 +0000"  >&lt;p&gt;We will bump the servers&apos; kernel to the latest RHEL 7.6 with patches from 2.10.6; our kernels from 2.10.4 lacked some ext4 patches, which might be the root cause of the mentioned stack traces.&lt;/p&gt;</comment>
                            <comment id="250104" author="panda" created="Wed, 26 Jun 2019 19:52:30 +0000"  >&lt;p&gt;Has this bug really been fixed in master?&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5152&quot; title=&quot;Can&amp;#39;t enforce block quota when unprivileged user change group&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5152&quot;&gt;&lt;del&gt;LU-5152&lt;/del&gt;&lt;/a&gt; hasn&apos;t been reverted in master, instead the MDS transaction became async.&lt;/p&gt;

&lt;p&gt;However, this bug is not a deadlock between MDS and OSS as the commit message claims. It&apos;s an OSS deadlock; the MDS just waits forever for a reply with an open transaction, freezing other MDS journal users. The OSS deadlocks because it takes oo_sem and gets a transaction handle in a different order for common operations and OUT operations, e.g.:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
crash&amp;gt; bt ffff88081329cf10
PID: 50457  TASK: ffff88081329cf10  CPU: 14  COMMAND: &lt;span class=&quot;code-quote&quot;&gt;&quot;ll_ost_out01_00&quot;&lt;/span&gt;
 #0 [ffff88026aca3988] __schedule at ffffffff816b3de4
 #1 [ffff88026aca3a10] schedule at ffffffff816b4409
 #2 [ffff88026aca3a20] rwsem_down_write_failed at ffffffff816b5cf5
 #3 [ffff88026aca3ab8] call_rwsem_down_write_failed at ffffffff81338247
 #4 [ffff88026aca3b00] down_write at ffffffff816b356d
 #5 [ffff88026aca3b18] osd_write_lock at ffffffffc151eb0c [osd_ldiskfs]
 #6 [ffff88026aca3b40] out_tx_attr_set_exec at ffffffffc0ea3399 [ptlrpc]
 #7 [ffff88026aca3b78] out_tx_end at ffffffffc0e9d771 [ptlrpc]
 #8 [ffff88026aca3bb8] out_handle at ffffffffc0ea1952 [ptlrpc]
 #9 [ffff88026aca3cf8] tgt_request_handle at ffffffffc0e988ba [ptlrpc]
#10 [ffff88026aca3d40] ptlrpc_server_handle_request at ffffffffc0e3df13 [ptlrpc]
#11 [ffff88026aca3de0] ptlrpc_main at ffffffffc0e41862 [ptlrpc]
#12 [ffff88026aca3ec8] kthread at ffffffff810b4031
#13 [ffff88026aca3f50] ret_from_fork at ffffffff816c155d

crash&amp;gt; bt 0xffff88038d7eaf70
PID: 114059  TASK: ffff88038d7eaf70  CPU: 11  COMMAND: &lt;span class=&quot;code-quote&quot;&gt;&quot;ll_ost00_040&quot;&lt;/span&gt;
 #0 [ffff8804b073b968] __schedule at ffffffff816b3de4
 #1 [ffff8804b073b9f8] schedule at ffffffff816b4409
 #2 [ffff8804b073ba08] wait_transaction_locked at ffffffffc0739085 [jbd2]
 #3 [ffff8804b073ba60] add_transaction_credits at ffffffffc0739368 [jbd2]
 #4 [ffff8804b073bac0] start_this_handle at ffffffffc07395e1 [jbd2]
 #5 [ffff8804b073bb58] jbd2__journal_start at ffffffffc0739a93 [jbd2]
 #6 [ffff8804b073bba0] __ldiskfs_journal_start_sb at ffffffffc1469e59 [ldiskfs]
 #7 [ffff8804b073bbe0] osd_trans_start at ffffffffc152a2ce [osd_ldiskfs]
 #8 [ffff8804b073bc18] ofd_trans_start at ffffffffc15b01ae [ofd]
 #9 [ffff8804b073bc48] ofd_attr_set at ffffffffc15b3173 [ofd]
#10 [ffff8804b073bca0] ofd_setattr_hdl at ffffffffc159e8ed [ofd]
#11 [ffff8804b073bcf8] tgt_request_handle at ffffffffc0e988ba [ptlrpc]
#12 [ffff8804b073bd40] ptlrpc_server_handle_request at ffffffffc0e3df13 [ptlrpc]
#13 [ffff8804b073bde0] ptlrpc_main at ffffffffc0e41862 [ptlrpc]
#14 [ffff8804b073bec8] kthread at ffffffff810b4031
#15 [ffff8804b073bf50] ret_from_fork at ffffffff816c155d
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
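The AB-BA ordering described in the comment above can be sketched as a toy model. This is purely illustrative Python, not Lustre code: the names oo_sem, out_attr_set and ofd_attr_set are borrowed from the stack traces, and modelling the jbd2 transaction slot as a single lock is an assumption of the sketch.

```python
import threading
import time

oo_sem = threading.Lock()    # stands in for the object rwsem taken by osd_write_lock()
journal = threading.Lock()   # stands in for the jbd2 transaction state

def out_attr_set():
    # OUT path: the transaction is already open when the object lock is taken
    # (out_tx_end, then out_tx_attr_set_exec, then osd_write_lock)
    with journal:
        time.sleep(0.5)
        with oo_sem:
            pass

def ofd_attr_set():
    # regular path: object lock first, then start the transaction
    # (ofd_attr_set, then ofd_trans_start, then jbd2__journal_start)
    with oo_sem:
        time.sleep(0.5)
        with journal:
            pass
```

Started concurrently, each thread acquires its first resource, sleeps, and then blocks forever on the resource held by the other thread, mirroring the ll_ost_out01_00 and ll_ost00_040 traces.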
                            <comment id="250107" author="panda" created="Wed, 26 Jun 2019 20:42:18 +0000"  >&lt;blockquote&gt;&lt;p&gt;looks like a &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10048&quot; title=&quot;osd-ldiskfs to truncate outside of main transaction&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10048&quot;&gt;&lt;del&gt;LU-10048&lt;/del&gt;&lt;/a&gt; duplicate&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Ah, sorry, I missed this link when first reading this ticket.  It really does seem to fix this issue:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://review.whamcloud.com/#/c/31293/60/lustre/ofd/ofd_objects.c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/31293/60/lustre/ofd/ofd_objects.c&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="254962" author="lflis" created="Wed, 18 Sep 2019 10:37:39 +0000"  >&lt;p&gt;@Andrew Perepechko&lt;br/&gt;
The patch you&apos;ve mentioned does not apply cleanly on b2_10. Are you considering backporting it to the 2.10 line?&lt;br/&gt;
In the past month we have had 10+ lockups on OSTs related to this issue.&lt;/p&gt;

&lt;p&gt;The patch you mentioned is a subset of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11613&quot; title=&quot;MDS and OSS locked up wait_transaction_locked+0x85/0xd0 [jbd2]&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11613&quot;&gt;&lt;del&gt;LU-11613&lt;/del&gt;&lt;/a&gt;; is there a plan to backport the whole set?&lt;br/&gt;
Can the change from &lt;a href=&quot;https://review.whamcloud.com/#/c/31293/60/lustre/ofd/ofd_objects.c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/31293/60/lustre/ofd/ofd_objects.c&lt;/a&gt; be safely used without the other changes from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11613&quot; title=&quot;MDS and OSS locked up wait_transaction_locked+0x85/0xd0 [jbd2]&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11613&quot;&gt;&lt;del&gt;LU-11613&lt;/del&gt;&lt;/a&gt;?&lt;/p&gt;</comment>
                            <comment id="267790" author="bzzz" created="Thu, 16 Apr 2020 07:31:42 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=luchuan_sugon&quot; class=&quot;user-hover&quot; rel=&quot;luchuan_sugon&quot;&gt;luchuan_sugon&lt;/a&gt;, do you still need this ticket open?&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                                        </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="48520">LU-10048</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="25048">LU-5152</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="53909">LU-11613</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="31602" name="cyf_dec10_client_stacks.log" size="28213" author="lflis" created="Mon, 10 Dec 2018 13:41:53 +0000"/>
                            <attachment id="31603" name="cyf_dec10_server_stacks.log" size="119785" author="lflis" created="Mon, 10 Dec 2018 13:41:54 +0000"/>
                            <attachment id="31130" name="logs.zip" size="16188481" author="m.magrys" created="Wed, 3 Oct 2018 17:06:21 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i003hj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>