<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:14:53 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
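
A minimal sketch of building such a request URL in Python. The base URL below is a hypothetical placeholder, not an endpoint confirmed by this document; only the repeated 'field' parameter comes from the note above.

```python
from urllib.parse import urlencode


def with_fields(base_url, fields):
    """Append one 'field' query parameter per requested field name."""
    query = urlencode([("field", name) for name in fields])
    # Use '&' when the URL already carries a query string, '?' otherwise.
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + query


# Hypothetical base URL, used only for illustration.
print(with_fields("https://jira.example.com/issue.xml", ["key", "summary"]))
```

The helper name with_fields is an assumption made for this example; any code that appends the same query string would work equally well.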
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-15034] Lustre 2.12.7 client deadlock on quota check</title>
                <link>https://jira.whamcloud.com/browse/LU-15034</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;&lt;b&gt;Summary:&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;Lustre 2.12.7 clients occasionally deadlock in the quota check routine on file access (so far this has happened on ~9 nodes out of ~1100).&#160; The deadlocked processes will not terminate on their own.&#160; The clients appear to deadlock in one of two ways.&#160; Either the client gets stuck in &lt;font color=&quot;#000000&quot;&gt;ptlrpc_queue_wait&lt;/font&gt; -&amp;gt; &lt;font color=&quot;#000000&quot;&gt;ptlrpc_set_wait&lt;/font&gt;, or &lt;font color=&quot;#000000&quot;&gt;ptlrpc_queue_wait&lt;/font&gt; fails and the client then deadlocks in &lt;font color=&quot;#000000&quot;&gt;cl_lock_request&lt;/font&gt; -&amp;gt; &lt;font color=&quot;#000000&quot;&gt;cl_sync_io_wait&lt;/font&gt;.&#160; A process that deadlocks in &lt;font color=&quot;#000000&quot;&gt;ptlrpc_queue_wait&lt;/font&gt; does not eventually fail and move on to &lt;font color=&quot;#000000&quot;&gt;cl_lock_request&lt;/font&gt;.&#160; On the server side, when this occurs the OSS server(s) usually report a message about the client reconnecting, but no other errors.&#160; Sometimes these deadlocks seem to affect only the stuck processes; other times they also seem to block other users from accessing files on the Lustre mount (this may depend on how many processes deadlock: on one node, 4 separate users deadlocked and the Lustre mount on that node was completely hosed).&lt;/p&gt;


&lt;p&gt;We are doing quota enforcement, and the users whose processes deadlock are not over quota.&lt;/p&gt;

&lt;p&gt;We recently upgraded the servers from Lustre 2.8 to 2.12.7 and the clients from 2.10.8 to 2.12.7.&lt;/p&gt;

&lt;p&gt;Below are the two types of deadlocked stacks.&lt;/p&gt;

&lt;p&gt;Type 1: deadlock in &lt;font color=&quot;#000000&quot;&gt;ptlrpc_queue_wait&lt;/font&gt;&lt;/p&gt;


&lt;p&gt;&lt;font color=&quot;#000000&quot;&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0bd5c60&amp;gt;&amp;#93;&lt;/span&gt; ptlrpc_set_wait+0x480/0x790 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt; &lt;/font&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0bd5ff3&amp;gt;&amp;#93;&lt;/span&gt; ptlrpc_queue_wait+0x83/0x230 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt; &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0bbac42&amp;gt;&amp;#93;&lt;/span&gt; ldlm_cli_enqueue+0x3e2/0x930 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt; &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0d47bc9&amp;gt;&amp;#93;&lt;/span&gt; osc_enqueue_base+0x219/0x690 &lt;span class=&quot;error&quot;&gt;&amp;#91;osc&amp;#93;&lt;/span&gt; &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0d525c9&amp;gt;&amp;#93;&lt;/span&gt; osc_lock_enqueue+0x379/0x830 &lt;span class=&quot;error&quot;&gt;&amp;#91;osc&amp;#93;&lt;/span&gt; &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0a52225&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_enqueue+0x65/0x120 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt; &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0cf72e5&amp;gt;&amp;#93;&lt;/span&gt; lov_lock_enqueue+0x95/0x150 &lt;span class=&quot;error&quot;&gt;&amp;#91;lov&amp;#93;&lt;/span&gt; &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0a52225&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_enqueue+0x65/0x120 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt; &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0a527b7&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_request+0x67/0x1f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt; &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0a566bb&amp;gt;&amp;#93;&lt;/span&gt; cl_io_lock+0x2bb/0x3d0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt; &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0a569ea&amp;gt;&amp;#93;&lt;/span&gt; cl_io_loop+0xba/0x1c0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt; &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0e250e0&amp;gt;&amp;#93;&lt;/span&gt; ll_file_io_generic+0x590/0xc90 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt; &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0e265b3&amp;gt;&amp;#93;&lt;/span&gt; ll_file_aio_read+0x3a3/0x450 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt; &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0e26760&amp;gt;&amp;#93;&lt;/span&gt; ll_file_read+0x100/0x1c0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt; &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffbc24e3af&amp;gt;&amp;#93;&lt;/span&gt; vfs_read+0x9f/0x170 &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffbc24f22f&amp;gt;&amp;#93;&lt;/span&gt; SyS_read+0x7f/0xf0 &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffbc795f92&amp;gt;&amp;#93;&lt;/span&gt; system_call_fastpath+0x25/0x2a &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffffffffff&amp;gt;&amp;#93;&lt;/span&gt; 0xffffffffffffffff&lt;/p&gt;



&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Type 2: deadlock after &lt;font color=&quot;#000000&quot;&gt;ptlrpc_queue_wait&lt;/font&gt; fails.&lt;/p&gt;

&lt;p&gt;Message sent to syslog:&lt;/p&gt;

&lt;p&gt;&lt;font color=&quot;#000000&quot;&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;4155019.167715&amp;#93;&lt;/span&gt; LustreError: 17861:0:(osc_quota.c:308:osc_quotactl()) ptlrpc_queue_wait failed, rc: -4&lt;/font&gt;&lt;/p&gt;


&lt;p&gt;Followed by deadlocked stack:&lt;/p&gt;


&lt;p&gt;&lt;font color=&quot;#000000&quot;&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0a765b5&amp;gt;&amp;#93;&lt;/span&gt; cl_sync_io_wait+0x2b5/0x3d0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt; &lt;/font&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0a73906&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_request+0x1b6/0x1f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt; &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0f8e9b1&amp;gt;&amp;#93;&lt;/span&gt; cl_glimpse_lock+0x311/0x370 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt; &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0f8ed3d&amp;gt;&amp;#93;&lt;/span&gt; cl_glimpse_size0+0x20d/0x240 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt; &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0f491ca&amp;gt;&amp;#93;&lt;/span&gt; ll_getattr+0x22a/0x5c0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt; &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff89853e99&amp;gt;&amp;#93;&lt;/span&gt; vfs_getattr+0x49/0x80 &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff89853f15&amp;gt;&amp;#93;&lt;/span&gt; vfs_fstat+0x45/0x80 &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff89854484&amp;gt;&amp;#93;&lt;/span&gt; SYSC_newfstat+0x24/0x60 &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8985485e&amp;gt;&amp;#93;&lt;/span&gt; SyS_newfstat+0xe/0x10 &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff89d95f92&amp;gt;&amp;#93;&lt;/span&gt; system_call_fastpath+0x25/0x2a &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffffffffff&amp;gt;&amp;#93;&lt;/span&gt; 0xffffffffffffffff&lt;/p&gt;
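
&lt;p&gt;The two stack types above differ in their topmost frame, so a stack dump could be sorted into one type or the other with a short script. The following is a minimal sketch, assuming each stack is available as plain text in the format shown above; the names classify_stack and TOP_FRAME_TO_TYPE are hypothetical and not part of Lustre or any existing tool.&lt;/p&gt;

&lt;pre&gt;
```python
import re

# Topmost function for each deadlock type described in this report.
TOP_FRAME_TO_TYPE = {
    "ptlrpc_set_wait": "type 1",
    "cl_sync_io_wait": "type 2",
}

# Matches "function+0xoffset/0xsize" inside one kernel stack line.
FRAME = re.compile(r"(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def classify_stack(stack_text):
    """Return 'type 1', 'type 2', or 'unknown' based on the first frame found."""
    for line in stack_text.splitlines():
        match = FRAME.search(line)
        if match:
            # Only the topmost frame decides the type; deeper frames are ignored.
            return TOP_FRAME_TO_TYPE.get(match.group(1), "unknown")
    return "unknown"
```
&lt;/pre&gt;

&lt;p&gt;Any stack whose first frame is neither of the two functions is reported as unknown rather than guessed at.&lt;/p&gt;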



&lt;p&gt;&lt;b&gt;Env:&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;OS: CentOS 7.9 (CentOS packaged OFED on client)&lt;/p&gt;

&lt;p&gt;Kernel: &lt;font color=&quot;#000000&quot;&gt;3.10.0-1160.36.2.el7.x86_64&lt;/font&gt;&lt;br/&gt;
Lustre server: 2.12.7&lt;/p&gt;

&lt;p&gt;Lustre client: 2.12.7&lt;/p&gt;

&lt;p&gt;Network: InfiniBand (combination of EDR, FDR and QDR)&lt;/p&gt;</description>
                <environment>CentOS 7.9 with the included CentOS OFED.  Lustre server and client version 2.12.7.</environment>
        <key id="66250">LU-15034</key>
            <summary>Lustre 2.12.7 client deadlock on quota check</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="jmatthews">Jim Matthews</reporter>
                        <labels>
                    </labels>
                <created>Sat, 25 Sep 2021 20:48:32 +0000</created>
                <updated>Tue, 5 Oct 2021 16:44:57 +0000</updated>
                                            <version>Lustre 2.12.7</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="313978" author="JIRAUSER16929" created="Sat, 25 Sep 2021 20:50:46 +0000"  >&lt;p&gt;Above editor interpreted my &amp;gt;&apos;s it seems, that line crossed out should not be crossed out.&lt;/p&gt;</comment>
                            <comment id="313979" author="JIRAUSER16929" created="Sat, 25 Sep 2021 23:50:58 +0000"  >&lt;p&gt;I should clarify my statement above: &quot;The deadlocked processes will not terminate on their own.&quot;&#160; The processes can&apos;t be killed using -9, the only way to clear is to reboot the node.&lt;/p&gt;</comment>
                            <comment id="314552" author="JIRAUSER16929" created="Fri, 1 Oct 2021 23:58:15 +0000"  >&lt;p&gt;Just wondering if anyone had a chance to look at this...&#160; Thanks!&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                    <customfield id="customfield_10030" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic/Theme</customfieldname>
                        <customfieldvalues>
                                        <label>clientdeadlock</label>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i025fj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>