<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:40:47 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-11082] stuck threads on MDS</title>
                <link>https://jira.whamcloud.com/browse/LU-11082</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;our main filesystem has hung approximately 10 times since April.&lt;/p&gt;

&lt;p&gt;the symptoms are stuck thread messages on the MDS (it has happened on both our MDS&apos;s), and then soon afterwards clients can&apos;t stat any files in the filesystem.&lt;/p&gt;

&lt;p&gt;we have 4 filesystems that share the same MDS hardware and only the large DNE filesystem with 3 MDTs has hung, so we think this is probably a DNE issue.&lt;/p&gt;

&lt;p&gt;there are no logs that indicate this is an MDS hardware problem. despite that, we&apos;ve swapped out a lot of hardware and tested the MDS components as well as we can, and it hasn&apos;t helped at all. &apos;zpool iostat 2&apos; shows some tiny amounts of i/o are still going to the MDTs when the filesystems are hung, so, e.g., hardware raid isn&apos;t just silently stopping with no logs.&lt;/p&gt;

&lt;p&gt;MDT&apos;s are zmirror on top of hardware raid1 (so 4-way replicated).&lt;/p&gt;

&lt;p&gt;to un-hang the filesystem, we failover or reboot the MDS. every time we can&apos;t umount one of the 3 MDTs in the /dagg(fred) filesystem, and we have to forcibly power off the MDS (which led to LU-10887 etc.). recovery these days goes ok, and then the filesystem works again. it seems random which of the three MDTs hangs on umount. we&apos;ve lfsck&apos;d the namespace about 5 times.&lt;/p&gt;

&lt;p&gt;here are the first stack traces from each of the 2 separate hangs today, which are pretty typical. usually it&apos;s a mdt_rdpg* thread that hangs and those stack traces all look very similar, but occasionally it&apos;s a mdt* thread and there are some variations on a theme in those.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;/var/log/messages:Jun 11 11:06:59 warble2 kernel: LNet: Service thread pid 116276 was inactive for 200.63s. The thread might be hung, or it might only be slow and will res
ume later. Dumping the stack trace for debugging purposes:
/var/log/messages:Jun 11 11:06:59 warble2 kernel: Pid: 116276, comm: mdt_rdpg00_024
/var/log/messages:Jun 11 11:06:59 warble2 kernel: #012Call Trace:
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffff8109eb8b&amp;gt;] ? recalc_sigpending+0x1b/0x50
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffff816b40e9&amp;gt;] schedule+0x29/0x70
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffffc0e16f10&amp;gt;] top_trans_wait_result+0xa6/0x155 [ptlrpc]
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffff810c7c70&amp;gt;] ? default_wake_function+0x0/0x20
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffffc0df882b&amp;gt;] top_trans_stop+0x42b/0x930 [ptlrpc]
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffffc164f5e9&amp;gt;] lod_trans_stop+0x259/0x340 [lod]
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffffc0df8fae&amp;gt;] ? top_trans_start+0x27e/0x940 [ptlrpc]
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffffc16ed1ea&amp;gt;] mdd_trans_stop+0x2a/0x46 [mdd]
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffffc16e2bcb&amp;gt;] mdd_attr_set+0x5eb/0xce0 [mdd]
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffffc0d81d47&amp;gt;] ? lustre_msg_add_version+0x27/0xa0 [ptlrpc]
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffffc15ce186&amp;gt;] mdt_mfd_close+0x1a6/0x610 [mdt]
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffffc15d3901&amp;gt;] mdt_close_internal+0x121/0x220 [mdt]
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffffc15d3c20&amp;gt;] mdt_close+0x220/0x780 [mdt]
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffffc0de52ba&amp;gt;] tgt_request_handle+0x92a/0x1370 [ptlrpc]
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffffc0d8de2b&amp;gt;] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffffc0d8a458&amp;gt;] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffff810c7c82&amp;gt;] ? default_wake_function+0x12/0x20
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffff810bdc4b&amp;gt;] ? __wake_up_common+0x5b/0x90
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffffc0d91572&amp;gt;] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffffc0d90ae0&amp;gt;] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffff810b4031&amp;gt;] kthread+0xd1/0xe0
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffff810b3f60&amp;gt;] ? kthread+0x0/0xe0
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffff816c055d&amp;gt;] ret_from_fork+0x5d/0xb0
/var/log/messages:Jun 11 11:06:59 warble2 kernel: [&amp;lt;ffffffff810b3f60&amp;gt;] ? kthread+0x0/0xe0
/var/log/messages:Jun 11 11:06:59 warble2 kernel: 
/var/log/messages:Jun 11 11:06:59 warble2 kernel: LustreError: dumping log to /tmp/lustre-log.1528679219.116276
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Jun 11 14:17:28 warble2 kernel: LNet: Service thread pid 22072 was inactive for 200.35s. The thread might be hung, or it might only be slow and will resume later. Dumping 
the stack trace for debugging purposes:
Jun 11 14:17:28 warble2 kernel: Pid: 22072, comm: mdt00_001
Jun 11 14:17:28 warble2 kernel: #012Call Trace:
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffff8109eb8b&amp;gt;] ? recalc_sigpending+0x1b/0x50
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffff816b40e9&amp;gt;] schedule+0x29/0x70
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffffc0dd3f10&amp;gt;] top_trans_wait_result+0xa6/0x155 [ptlrpc]
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffff810c7c70&amp;gt;] ? default_wake_function+0x0/0x20
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffffc0db582b&amp;gt;] top_trans_stop+0x42b/0x930 [ptlrpc]
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffffc10f35e9&amp;gt;] lod_trans_stop+0x259/0x340 [lod]
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffffc0db5fae&amp;gt;] ? top_trans_start+0x27e/0x940 [ptlrpc]
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffffc11911ea&amp;gt;] mdd_trans_stop+0x2a/0x46 [mdd]
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffffc117eb31&amp;gt;] mdd_rename+0x4d1/0x14a0 [mdd]
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffffc103b07a&amp;gt;] mdt_reint_rename_internal.isra.36+0x166a/0x20c0 [mdt]
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffffc103d3fb&amp;gt;] mdt_reint_rename_or_migrate.isra.39+0x19b/0x860 [mdt]
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffffc0d0eff0&amp;gt;] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffffc0d09440&amp;gt;] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffffc103daf3&amp;gt;] mdt_reint_rename+0x13/0x20 [mdt]
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffffc1041b33&amp;gt;] mdt_reint_rec+0x83/0x210 [mdt]
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffffc102337b&amp;gt;] mdt_reint_internal+0x5fb/0x9c0 [mdt]
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffffc102ef07&amp;gt;] mdt_reint+0x67/0x140 [mdt]
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffffc0da22ba&amp;gt;] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffffc0d4ae2b&amp;gt;] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffffc0d47458&amp;gt;] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffff810c7c82&amp;gt;] ? default_wake_function+0x12/0x20
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffff810bdc4b&amp;gt;] ? __wake_up_common+0x5b/0x90
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffffc0d4e572&amp;gt;] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffffc0d4dae0&amp;gt;] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffff810b4031&amp;gt;] kthread+0xd1/0xe0
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffff810b3f60&amp;gt;] ? kthread+0x0/0xe0
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffff816c055d&amp;gt;] ret_from_fork+0x5d/0xb0
Jun 11 14:17:28 warble2 kernel: [&amp;lt;ffffffff810b3f60&amp;gt;] ? kthread+0x0/0xe0
Jun 11 14:17:28 warble2 kernel: 
Jun 11 14:17:28 warble2 kernel: LustreError: dumping log to /tmp/lustre-log.1528690648.22072
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I&apos;ve attached logs of all the stack traces&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;zgrep warble /var/log/messages-20180[3456]* /var/log/messages | egrep -v &apos;crmd|pengine&apos; | grep -i &apos;stack trace&apos; -B20 -A40 &amp;gt; ~/lustre-bugs/warble-stack-traces.txt
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;clients at the moment are a mix of centos7.4 and 2.10.3/2.10.4, and centos7.5 2.10.4.&lt;br/&gt;
servers are centos7.4 and 2.10.4 with zfs 0.7.9 and boot with nopti.&lt;/p&gt;

&lt;p&gt;we have inherited directory striping on the main filesystem, so each new dir is on a random one of the three MDTs.&lt;/p&gt;

&lt;p&gt;we have no idea what i/o pattern triggers the lockups, sorry.&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</description>
                <environment>x86_64 centos7.4 OPA ZFS 0.7.9 lustre 2.10.4</environment>
        <key id="52539">LU-11082</key>
            <summary>stuck threads on MDS</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="laisiyao">Lai Siyao</assignee>
                                    <reporter username="scadmin">SC Admin</reporter>
                        <labels>
                    </labels>
                <created>Mon, 11 Jun 2018 16:02:36 +0000</created>
                <updated>Thu, 16 Aug 2018 18:55:12 +0000</updated>
                            <resolved>Thu, 16 Aug 2018 18:55:11 +0000</resolved>
                                    <version>Lustre 2.10.3</version>
                    <version>Lustre 2.10.4</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="229423" author="pjones" created="Mon, 11 Jun 2018 17:28:11 +0000"  >&lt;p&gt;Lai&lt;/p&gt;

&lt;p&gt;Can you please investigate?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="229452" author="laisiyao" created="Tue, 12 Jun 2018 13:38:59 +0000"  >&lt;p&gt;I think rename is the culprit, because I&apos;ve seen such a deadlock in other tests. The root cause is that &apos;rename&apos; needs to lock both source and target parents, and there are two cases in which it should lock them in reverse order to avoid deadlock:&lt;br/&gt;
1. source parent is a child of target parent.&lt;br/&gt;
2. both source and target parents are stripes of a striped directory, and the stripe index of source parent is after that of target parent.&lt;/p&gt;
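The two reverse-order cases above can be sketched with ordinary locks. This is a hypothetical illustration in plain Python threading, not Lustre&apos;s actual LDLM code; the class and function names here are invented:

```python
import threading

class Dir:
    """Toy stand-in for a parent directory (or a stripe of a striped dir)."""
    def __init__(self, index, parent=None):
        self.index = index            # stripe index within a striped directory
        self.parent = parent
        self.lock = threading.Lock()

def is_ancestor(a, b):
    """True if a is b or an ancestor of b."""
    while b is not None:
        if b is a:
            return True
        b = b.parent
    return False

def lock_parents_for_rename(src_parent, tgt_parent):
    """Take both parent locks, reversing the usual source-first order in the
    two cases listed above: (1) the source parent is a child of the target
    parent, or (2) the source parent's stripe index is after the target's."""
    first, second = src_parent, tgt_parent
    if is_ancestor(tgt_parent, src_parent) or src_parent.index > tgt_parent.index:
        first, second = tgt_parent, src_parent
    if first is second:               # same-directory rename: one lock only
        first.lock.acquire()
        return first, second
    first.lock.acquire()
    second.lock.acquire()
    return first, second
```

Two threads renaming in opposite directions between the same pair of stripes then agree on a single lock order, so neither can hold one lock while blocking forever on the other.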

&lt;p&gt;Now &apos;rename&apos; only checks the first situation; I&apos;ve cooked a fix under ticket &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4684&quot; title=&quot;DNE3: allow migrating DNE striped directory&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4684&quot;&gt;&lt;del&gt;LU-4684&lt;/del&gt;&lt;/a&gt;: &lt;a href=&quot;https://review.whamcloud.com/#/c/32701/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/32701/&lt;/a&gt;. But it&apos;s not fully tested and reviewed yet, so you may want to wait for it to be ready.&lt;/p&gt;</comment>
                            <comment id="229486" author="scadmin" created="Wed, 13 Jun 2018 08:49:20 +0000"  >&lt;p&gt;that sounds plausible for sure.&lt;/p&gt;

&lt;p&gt;this is an important issue to us. the cluster is broken when the main filesystem hangs.&lt;br/&gt;
can we raise the priority of this somehow?&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="229515" author="pjones" created="Wed, 13 Jun 2018 18:26:20 +0000"  >&lt;p&gt;Robin&lt;/p&gt;

&lt;p&gt;Don&apos;t worry, we are treating this as a priority. However, experience has shown that any changes to the DNE code need to be very thoroughly tested and this includes multiple days of stress testing. We will definitely advise as soon as we are confident that the fix does not have any undesired side-effects.&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="229518" author="scadmin" created="Wed, 13 Jun 2018 18:41:31 +0000"  >&lt;p&gt;coolio. please let us know if you&apos;d like us to try something out. I don&apos;t mind if things crash, as long as it&apos;s contributing to progress.&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="229706" author="scadmin" created="Sun, 24 Jun 2018 06:59:24 +0000"  >&lt;p&gt;Hiya,&lt;/p&gt;

&lt;p&gt;5 more reboots of the MDS&apos;s today.&lt;/p&gt;

&lt;p&gt;this time they didn&apos;t stop until a script that was doing find and then chgrp/chmod on a client was killed.&lt;/p&gt;

&lt;p&gt;this process does no mv or rename so doesn&apos;t seem to fit in with your current theory of why the MDS threads hang.&lt;/p&gt;

&lt;p&gt;I&apos;ll attach the trace from the first hang on the MDS.&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="229717" author="scadmin" created="Mon, 25 Jun 2018 04:40:01 +0000"  >&lt;p&gt;Hiya,&lt;/p&gt;

&lt;p&gt;now that I think about it, I&apos;m pretty sure the find/chgrp/chmod script was also running in April when we first hit MDT hangs (and then &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10887&quot; title=&quot;2 MDTs stuck in WAITING&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10887&quot;&gt;&lt;del&gt;LU-10887&lt;/del&gt;&lt;/a&gt; as fallout), so it might be a reproducer for this. presumably users might also have been doing similar things to cause the in-between hangs.&lt;/p&gt;

&lt;p&gt;this script is on 1 client and runs ~5 of the below at once, each on separate directories&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;( find $p -mount ! -gid $gid -ls -exec chgrp -h $g {} \; -type d -exec chmod g+s {} \; ; find $p -mount -type d ! -perm -g=s -exec chmod g+s {} \; -ls )
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;it forces file and dir group ownership to match the project directory name, and sets g+s on dirs. i.e. a poor man&apos;s project/directory quotas. it&apos;s a pretty common thing for sites to want to do.&lt;/p&gt;

&lt;p&gt;the script itself is a bit more complicated than the above, with some VFS stuff to try to limit inode lists getting too long on the sweeping client and slowing down the kernel, but the above line is the essence of it. let me know if you want the full thing.&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="229720" author="laisiyao" created="Mon, 25 Jun 2018 08:26:08 +0000"  >&lt;p&gt;Hmm, it looks like there are more deadlocks than rename, I&apos;m looking into them. Thanks!&lt;/p&gt;</comment>
                            <comment id="229770" author="laisiyao" created="Thu, 28 Jun 2018 03:24:17 +0000"  >&lt;p&gt;I just uploaded three patches; I&apos;d suggest you apply them all and then test again:&lt;br/&gt;
&lt;a href=&quot;https://review.whamcloud.com/#/c/32589/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/32589/&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;https://review.whamcloud.com/#/c/32738/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/32738/&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;https://review.whamcloud.com/#/c/32701/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/32701/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="229773" author="scadmin" created="Thu, 28 Jun 2018 09:55:18 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;any chance you can make a version of &lt;a href=&quot;https://review.whamcloud.com/#/c/32589/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/32589/&lt;/a&gt; for 2.10?&lt;br/&gt;
it doesn&apos;t apply at the moment.&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="230337" author="scadmin" created="Tue, 17 Jul 2018 08:16:14 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;how&apos;s this coming along?&lt;br/&gt;
there are 1000&apos;s of lines of code different between b2_10 and master across these 4 files, so I&apos;m not confident about backporting this myself to try it out.&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="230755" author="laisiyao" created="Mon, 23 Jul 2018 15:23:07 +0000"  >&lt;p&gt;The backport for 2.10 of &lt;a href=&quot;https://review.whamcloud.com/#/c/32589/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/32589/&lt;/a&gt; is on &lt;a href=&quot;https://review.whamcloud.com/32853&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/32853&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="230876" author="scadmin" created="Wed, 25 Jul 2018 05:31:32 +0000"  >&lt;p&gt;thanks. we have that on our MDS&apos;s now.&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="231061" author="pjones" created="Mon, 30 Jul 2018 13:18:22 +0000"  >&lt;p&gt;Robin&lt;/p&gt;

&lt;p&gt;How long will this patch need to be running before you would conclude that it has improved the situation?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="231068" author="scadmin" created="Mon, 30 Jul 2018 14:54:56 +0000"  >&lt;p&gt;Hi Peter,&lt;/p&gt;

&lt;p&gt;no issues so far.&lt;/p&gt;

&lt;p&gt;however I haven&apos;t yet run the chmod/chgrp sweep script which was maybe a trigger for this issue.&lt;br/&gt;
partly because (TBH) I&apos;ve been enjoying the stability, but also because I haven&apos;t had any free time.&lt;br/&gt;
I&apos;ll run it this week and report back.&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="231371" author="scadmin" created="Fri, 3 Aug 2018 06:21:22 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;after 2 days of running, the find/chgrp/chmod sweep through all files almost completed, but not quite.&lt;/p&gt;

&lt;p&gt;once again the MDTs hung, but it looked different to the previous hangs. this time all 3 MDTs wouldn&apos;t umount, and previously it&apos;s just been (a random) one of them. resolution was the same though - had to power cycle the MDS.&lt;/p&gt;

&lt;p&gt;I&apos;ll attach all warble related syslog (warble1 is where all 3 MDTs are mounted at the moment).&lt;br/&gt;
I&apos;ll attach the first lustre-log.1533268164.11346.gz.&lt;/p&gt;

&lt;p&gt;stack traces were not printed anywhere because&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Aug  3 13:49:24 warble1 kernel: LNet: 17382:0:(linux-debug.c:185:libcfs_call_trace()) can&apos;t show stack: kernel doesn&apos;t export show_task
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;is this a new change?&lt;/p&gt;

&lt;p&gt;we&apos;re running 3.10.0-862.9.1.el7.x86_64 centos7.5 everywhere now. server lustre is 2.10.4 plus -&amp;gt;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;  lu10683-lu11093-checksum-overquota-gerrit32788-1fb85e7e.patch
  lu10988-lfsck2-gerrit32522-21d33c11.patch
  lu11074-mdc-xattr-gerrit32739-dea1cde9.patch
  lu11107-xattr-gerrit32753-c96a8f08.patch
  lu11111-lfsck-gerrit32796-693fe452.patch
  lu11082-lu11103-stuckMdtThreads-gerrit32853-3dc08caa.diff
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;because the symptoms are different, if I had to guess I&apos;d say that the patch here has helped a lot, and perhaps fixed the issue, and maybe we&apos;re running into something new now, but without stacktraces I really have no idea.&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="231383" author="pjones" created="Fri, 3 Aug 2018 13:15:17 +0000"  >&lt;p&gt;Is the stack trace issue &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11062&quot; title=&quot;Backtrace stack printing is broken in RHEL 7.5&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11062&quot;&gt;&lt;del&gt;LU-11062&lt;/del&gt;&lt;/a&gt; perhaps?&lt;/p&gt;</comment>
                            <comment id="231851" author="scadmin" created="Mon, 13 Aug 2018 06:35:57 +0000"  >&lt;p&gt;Hi Peter,&lt;/p&gt;

&lt;p&gt;yes, that looks like it. thanks.&lt;br/&gt;
I&apos;ve included the b2_10 version of the stacktrace patch into our MDS builds, and kicked off a re-run of the chmod/chgrp script.&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="232076" author="scadmin" created="Thu, 16 Aug 2018 18:48:15 +0000"  >&lt;p&gt;the chgrp/chmod script finished successfully. fixed up about 10M files and a lot of dirs over 2.5 days. no crash on the MDS.&lt;/p&gt;

&lt;p&gt;as the hung threads in the previous crash looked different to those at the start of this ticket, and this most recent sweep finished without issue, I would err on the side of saying this issue has been fixed.&lt;/p&gt;

&lt;p&gt;thanks!&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="232079" author="pjones" created="Thu, 16 Aug 2018 18:55:12 +0000"  >&lt;p&gt;Good news. So let&apos;s close out this ticket then&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="52600">LU-11103</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="52601">LU-11104</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="52599">LU-11102</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="30433" name="lustre-log.1529816295.52919" size="52192252" author="scadmin" created="Sun, 24 Jun 2018 07:04:29 +0000"/>
                            <attachment id="30699" name="lustre-log.1533268164.11346.gz" size="6362592" author="scadmin" created="Fri, 3 Aug 2018 06:23:59 +0000"/>
                            <attachment id="30698" name="messages.warble" size="84973" author="scadmin" created="Fri, 3 Aug 2018 06:23:57 +0000"/>
                            <attachment id="30327" name="warble-stack-traces.txt" size="981498" author="scadmin" created="Mon, 11 Jun 2018 15:54:29 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzy8f:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>