<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:59:15 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-13201] soft lockup on ldlm_reprocess_all</title>
                <link>https://jira.whamcloud.com/browse/LU-13201</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We have upgraded to 2.12.3 recently and have been seeing bad soft lockups, going as far as hard lockups, due to &lt;tt&gt;ldlm_reprocess_all&lt;/tt&gt; in two different patterns:&lt;/p&gt;

&lt;p&gt;On OSS all active tasks are in either of these:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; #6 [ffff881c93bafbd8] _raw_spin_lock at ffffffff816b6b70
 #7 [ffff881c93bafbe8] __ldlm_reprocess_all at ffffffffc0e0b79a [ptlrpc]
 #8 [ffff881c93bafc38] ldlm_reprocess_all at ffffffffc0e0ba30 [ptlrpc]
 #9 [ffff881c93bafc48] ldlm_handle_enqueue0 at ffffffffc0e33314 [ptlrpc]
#10 [ffff881c93bafcd8] tgt_enqueue at ffffffffc0eb8572 [ptlrpc]
#11 [ffff881c93bafcf8] tgt_request_handle at ffffffffc0ebe8ba [ptlrpc]
#12 [ffff881c93bafd40] ptlrpc_server_handle_request at ffffffffc0e63f13 [ptlrpc]
#13 [ffff881c93bafde0] ptlrpc_main at ffffffffc0e67862 [ptlrpc]
#14 [ffff881c93bafec8] kthread at ffffffff810b4031
#15 [ffff881c93baff50] ret_from_fork at ffffffff816c155d

//

 #4 [ffff881c92687c08] native_queued_spin_lock_slowpath at ffffffff810fdfc2
 #5 [ffff881c92687c10] queued_spin_lock_slowpath at ffffffff816a8ff4
 #6 [ffff881c92687c20] _raw_spin_lock at ffffffff816b6b70
 #7 [ffff881c92687c30] lock_res_and_lock at ffffffffc0e0302c [ptlrpc]
 #8 [ffff881c92687c48] ldlm_handle_enqueue0 at ffffffffc0e339a8 [ptlrpc]
(the one at this comment:
/* We never send a blocking AST until the lock is granted, but
  * we can tell it right now */
)
 #9 [ffff881c92687cd8] tgt_enqueue at ffffffffc0eb8572 [ptlrpc]
#10 [ffff881c92687cf8] tgt_request_handle at ffffffffc0ebe8ba [ptlrpc]
#11 [ffff881c92687d40] ptlrpc_server_handle_request at ffffffffc0e63f13 [ptlrpc]
#12 [ffff881c92687de0] ptlrpc_main at ffffffffc0e67862 [ptlrpc]
#13 [ffff881c92687ec8] kthread at ffffffff810b4031
#14 [ffff881c92687f50] ret_from_fork at ffffffff816c155d
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On MGS, active tasks are:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; #4 [ffff881d33ff3b80] native_queued_spin_lock_slowpath at ffffffff810fdfc2
 #5 [ffff881d33ff3b88] queued_spin_lock_slowpath at ffffffff816a8ff4
 #6 [ffff881d33ff3b98] _raw_spin_lock at ffffffff816b6b70
 #7 [ffff881d33ff3ba8] ldlm_handle_conflict_lock at ffffffffc0d8fa08 [ptlrpc]
 #8 [ffff881d33ff3be0] ldlm_lock_enqueue at ffffffffc0d8ff23 [ptlrpc]
 #9 [ffff881d33ff3c48] ldlm_handle_enqueue0 at ffffffffc0db8843 [ptlrpc]
#10 [ffff881d33ff3cd8] tgt_enqueue at ffffffffc0e3d572 [ptlrpc]
#11 [ffff881d33ff3cf8] tgt_request_handle at ffffffffc0e438ba [ptlrpc]
#12 [ffff881d33ff3d40] ptlrpc_server_handle_request at ffffffffc0de8f13 [ptlrpc]
#13 [ffff881d33ff3de0] ptlrpc_main at ffffffffc0dec862 [ptlrpc]
#14 [ffff881d33ff3ec8] kthread at ffffffff810b4031
#15 [ffff881d33ff3f50] ret_from_fork at ffffffff816c155d
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the OSS case the resource was an object, so we could easily trace it back to the file from lr_name by looking up the &lt;tt&gt;oid&lt;/tt&gt;.&lt;br/&gt;
The resource had huge lists in lr_granted (23668 entries) and lr_waiting (70074 entries), and I&apos;ve seen resources with more on other crashes (this one happened quite a few times). All of these locks have l_req_mode = LCK_PR and in this case correspond to mmap&apos;d libraries (e.g. MPI dlopen&apos;d objects used by jobs); they were all cancelled, in this case, by an rsync updating the mtime on the inode (utimensat): the rsync source has sub-second timestamp precision and Lustre does not, so rsync faithfully keeps trying to set the timestamps back.&lt;br/&gt;
For now I&apos;m making sure these rsyncs no longer touch the files, but that might not always be possible.&lt;/p&gt;

&lt;p&gt;In the MGS case, I&apos;ve seen a filesystem lock (lr_name = fsname in hex) with either CONFIG_T_RECOVER (0x2) when some OSS reconnected, or CONFIG_T_CONFIG (0x0) when updating an OST pool (pool_add/remove); from memory we have around 300k entries in the list. In the crash I&apos;m looking at there was only one entry in lr_granted and 383193 in lr_waiting: all are in LCK_CR except the first lock in lr_waiting.&lt;/p&gt;

&lt;p&gt;I&apos;ve looked at the last granted lock and it&apos;s associated with a thread waiting on an allocation:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;PID: 22186  TASK: ffff881f4a6f6eb0  CPU: 12  COMMAND: &quot;ll_mgs_0003&quot;
 #0 [ffff881f1f433a18] __schedule at ffffffff816b3de4
 #1 [ffff881f1f433aa8] __cond_resched at ffffffff810c4bd6
 #2 [ffff881f1f433ac0] _cond_resched at ffffffff816b46aa
 #3 [ffff881f1f433ad0] kmem_cache_alloc at ffffffff811e3f15
 #4 [ffff881f1f433b10] LNetMDBind at ffffffffc0b3005c [lnet]
 #5 [ffff881f1f433b50] ptl_send_buf at ffffffffc0dd3c6f [ptlrpc]
 #6 [ffff881f1f433c08] ptlrpc_send_reply at ffffffffc0dd705b [ptlrpc]
 #7 [ffff881f1f433c80] target_send_reply_msg at ffffffffc0d9854e [ptlrpc]
 #8 [ffff881f1f433ca0] target_send_reply at ffffffffc0da2a5e [ptlrpc]
 #9 [ffff881f1f433cf8] tgt_request_handle at ffffffffc0e43527 [ptlrpc]
#10 [ffff881f1f433d40] ptlrpc_server_handle_request at ffffffffc0de8f13 [ptlrpc]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;so it is just waiting to be scheduled, which is not easy with spinlocks hogging all the cores. It doesn&apos;t even look like a client problem here, but at this point the server was already slow to respond, causing more OSTs to reconnect, causing more recoveries, causing more of these... so it is hard to tell.&lt;/p&gt;


&lt;hr /&gt;


&lt;p&gt;In both cases the root problem looks the same to me: why are we iterating over this list under a spinlock so often? We shouldn&apos;t be using a spinlock for a long list traversal in the first place.&lt;br/&gt;
I think our setup is especially bad because we have mixed interconnects with different latencies: by the time the first of the slower half of the clients agrees to cancel its lock, the faster half of the clients have already all tried to reconnect, so lr_waiting is big and all the remaining accepted cancels will be horribly slow (because each acknowledged cancel triggers a list traversal to check whether some waiting locks can now be granted).&lt;/p&gt;



&lt;p&gt;Anyway, this has become quite problematic for us lately and doesn&apos;t look much better on master, so I would be more than open to workarounds or ideas to try out.&lt;/p&gt;</description>
                <environment></environment>
        <key id="57991">LU-13201</key>
            <summary>soft lockup on ldlm_reprocess_all</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="pjones">Peter Jones</assignee>
                                    <reporter username="martinetd">Dominique Martinet</reporter>
                        <labels>
                    </labels>
                <created>Tue, 4 Feb 2020 17:55:01 +0000</created>
                <updated>Sun, 13 Dec 2020 08:40:13 +0000</updated>
                                            <version>Lustre 2.12.3</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
<comment id="262628" author="pjones" created="Wed, 5 Feb 2020 14:32:11 +0000"  >&lt;p&gt;Thanks Dominique&lt;/p&gt;</comment>
<comment id="262648" author="simmonsja" created="Wed, 5 Feb 2020 17:39:16 +0000"  >&lt;p&gt;Can you reproduce this easily? If so, can you try patch&#160;&lt;a href=&quot;https://review.whamcloud.com/#/c/35483/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/35483&lt;/a&gt;?&lt;/p&gt;</comment>
<comment id="262707" author="martinetd" created="Thu, 6 Feb 2020 11:51:06 +0000"  >&lt;p&gt;Unfortunately it looks like we&apos;re missing something, as configuration updates (OST pool manipulations) are much faster today... We suspect a router wasn&apos;t reliable.&lt;br/&gt;
I&apos;ll try to see if I can reproduce something similar with fewer clients on a test cluster by adding some deliberately slow routers... It looks like the LNet fault framework can do something like that if I figure out how to use it, or systemtap will do &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/tongue.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;

&lt;p&gt;Will keep this updated.&lt;/p&gt;</comment>
<comment id="263222" author="martinetd" created="Thu, 13 Feb 2020 12:03:23 +0000"  >&lt;p&gt;I&apos;ve had a frustrating time trying to reproduce this. I started with servers, routers and clients on 2.12.3, since our servers are 2.12.3, but downgrading routers/clients to 2.10.8 as we have in production does not seem to change much.&lt;/p&gt;

&lt;p&gt;The (unsuccessful) test setup was full TCP, 4 CPUs per VM, with:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;one VM with just MGT/one MDT&lt;/li&gt;
	&lt;li&gt;two VMs with two OSS each&lt;/li&gt;
	&lt;li&gt;two groups of ~32 VMs with different @tcpX networks, including two routers (so three networks, one for servers/routers, and two for each client/routers half)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;I&apos;ve tried going as high as ~3500 mounts on the clients (50 mounts per client, mounted on routers as well), but even that is &quot;just&quot; 10k locks on the MGT, which is less than a third of what we have in production.&lt;br/&gt;
In that case neither inducing a small (&amp;lt;5ms) delay on one router (so half the packets to half the nodes are slower) nor inducing larger delays (50-100ms) on any client made much difference; but looking with &lt;tt&gt;perf probe -m ptlrpc __ldlm_reprocess_all:27 res&lt;/tt&gt; / &lt;tt&gt;perf record -e probe:__ldlm_reprocess_all -a&lt;/tt&gt; I see the function being called quite a few times. I think I just don&apos;t have sufficient scale yet to make the list traversal slow enough to exhibit soft lockups.&lt;/p&gt;

&lt;p&gt;Given I had seen &amp;gt;100k locks on a single file on OSTs, I assumed it would be simple to get many locks taken on a file, but mmap()ing a file twice on a client, even at different offsets, does not take two locks, so I&apos;m not sure how the production codes managed to get ~130k locks on a single file with jobs running on a few hundred nodes... In our case the locks were all taken on MPI libraries, e.g. libhcoll or ucx, so I assumed a simple shared read/exec mmap would do, but it doesn&apos;t look like it is that simple :/&lt;/p&gt;


&lt;p&gt;So, long story short, I think I&apos;m stuck until I can figure out how to artificially blow up the list of locks taken on a single resource; does someone have an idea for that?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00t6n:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>