<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:26:33 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-2597] Soft lockup messages on Sequoia in ptlrpc_check_set under ll_fsync</title>
                <link>https://jira.whamcloud.com/browse/LU-2597</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;This looks similar to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2263&quot; title=&quot;CPU Soft Lockups due to many threads spinning on import lock on Sequoia IO nodes&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2263&quot;&gt;&lt;del&gt;LU-2263&lt;/del&gt;&lt;/a&gt;, but looking at the stacks and function address listings, I&apos;m not certain they are the same just yet.&lt;/p&gt;

&lt;p&gt;When running a large IOR using our 2.3.58-3chaos tag, we saw many CPU Soft Lockup messages on the console of many clients. All of the stacks I have been able to uncover look like this:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;NIP [c000000000439724] ._spin_lock+0x2c/0x44                                       
LR [8000000003a8e81c] .ptlrpc_check_set+0x43cc/0x5120 [ptlrpc]                     
Call Trace:                                                                        
[c0000003ec7bf600] [8000000003a8e75c] .ptlrpc_check_set+0x430c/0x5120 [ptlrpc] (unreliable)
[c0000003ec7bf7b0] [8000000003a8fa5c] .ptlrpc_set_wait+0x4ec/0xcb0 [ptlrpc]        
[c0000003ec7bf920] [8000000003a90974] .ptlrpc_queue_wait+0xd4/0x380 [ptlrpc]       
[c0000003ec7bf9e0] [80000000059c3564] .mdc_sync+0x104/0x340 [mdc]                  
[c0000003ec7bfa90] [800000000709bf08] .lmv_sync+0x2c8/0x820 [lmv]                  
[c0000003ec7bfb80] [80000000068232dc] .ll_fsync+0x23c/0xc50 [lustre]               
[c0000003ec7bfc80] [c000000000100420] .vfs_fsync_range+0xb0/0x104                  
[c0000003ec7bfd30] [c000000000100518] .do_fsync+0x3c/0x6c                          
[c0000003ec7bfdc0] [c000000000100588] .SyS_fsync+0x18/0x28                         
[c0000003ec7bfe30] [c000000000000580] syscall_exit+0x0/0x2c
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Searching through all of the console logs for that day, I found three unique addresses listed within &lt;tt&gt;ptlrpc_check_set&lt;/tt&gt;:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;$ grep -E &quot;LR.*ptlrpc_check_set&quot; R* | grep 2013-01-08 | \
  sed &apos;s/.*\(ptlrpc_check_set+[a-z0-9]\+\).*/\1/&apos; | \
  sort | uniq
ptlrpc_check_set+0x43cc
ptlrpc_check_set+0x50c
ptlrpc_check_set+0xeec
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

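&lt;p&gt;The same triage pipeline can be exercised against inline sample lines (hypothetical log content standing in for the real &lt;tt&gt;R*&lt;/tt&gt; console logs), to show how duplicate return addresses collapse to one line each:&lt;/p&gt;

```shell
# Sketch of the console-log triage above, fed hypothetical sample lines
# rather than the real R* files; the grep/sed/sort/uniq stages are unchanged.
printf '%s\n' \
  'R01:2013-01-08 LR [8000000003a8e81c] .ptlrpc_check_set+0x43cc/0x5120' \
  'R02:2013-01-08 LR [8000000003a8a95c] .ptlrpc_check_set+0x50c/0x5120' \
  'R03:2013-01-08 LR [8000000003a8e81c] .ptlrpc_check_set+0x43cc/0x5120' |
  grep -E "LR.*ptlrpc_check_set" | grep 2013-01-08 |
  sed 's/.*\(ptlrpc_check_set+[a-z0-9]\+\).*/\1/' |
  sort | uniq
# Duplicate offsets collapse, leaving one line per unique return address.
```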
&lt;p&gt;Listing these addresses is a bit confusing to me: they aren&apos;t near any &lt;tt&gt;spin_lock&lt;/tt&gt; calls, as I expected. Rather, they seem to correlate with debug print statements.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;ptlrpc_check_set+0x43cc&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;(gdb) l *ptlrpc_check_set+0x43cc
0x4e81c is in ptlrpc_check_set (/builddir/build/BUILD/lustre-2.3.58/lustre/ptlrpc/client.c:1113).

1109 static int ptlrpc_check_status(struct ptlrpc_request *req)                      
1110 {                                                                               
1111         int err;                                                                
1112         ENTRY;                                                                  
1113                                                                                 
1114         err = lustre_msg_get_status(req-&amp;gt;rq_repmsg);                            
1115         if (lustre_msg_get_type(req-&amp;gt;rq_repmsg) == PTL_RPC_MSG_ERR) {           
1116                 struct obd_import *imp = req-&amp;gt;rq_import;                        
1117                 __u32 opc = lustre_msg_get_opc(req-&amp;gt;rq_reqmsg);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;ptlrpc_check_set+0x50c&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;(gdb) l *ptlrpc_check_set+0x50c
0x4a95c is in ptlrpc_check_set (/builddir/build/BUILD/lustre-2.3.58/lustre/ptlrpc/client.c:1774).

1771                 ptlrpc_rqphase_move(req, RQ_PHASE_COMPLETE);                    
1772                                                                                 
1773                 CDEBUG(req-&amp;gt;rq_reqmsg != NULL ? D_RPCTRACE : 0,                 
1774                         &quot;Completed RPC pname:cluuid:pid:xid:nid:&quot;               
1775                         &quot;opc %s:%s:%d:&quot;LPU64&quot;:%s:%d\n&quot;,                         
1776                         cfs_curproc_comm(), imp-&amp;gt;imp_obd-&amp;gt;obd_uuid.uuid,        
1777                         lustre_msg_get_status(req-&amp;gt;rq_reqmsg), req-&amp;gt;rq_xid,
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Any ideas?&lt;/p&gt;

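&lt;p&gt;For reference, a kernel built with &lt;tt&gt;CONFIG_LOCK_STAT&lt;/tt&gt; exposes per-lock contention counters through &lt;tt&gt;/proc/lock_stat&lt;/tt&gt;. A rough sketch of capturing statistics around a reproducing run (paths per the kernel&apos;s lockstat documentation; untested here, and assumes the ION kernel has the option enabled):&lt;/p&gt;

```shell
# Sketch, assuming CONFIG_LOCK_STAT=y (not yet the case on the IONs):
# clear counters, run the workload, then dump the hottest locks.
echo 0 > /proc/lock_stat              # clear any existing statistics
echo 1 > /proc/sys/kernel/lock_stat   # enable collection
# ... run the IOR job that reproduces the soft lockups ...
echo 0 > /proc/sys/kernel/lock_stat   # stop collection
head -n 40 /proc/lock_stat            # most-contended locks appear first
```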
&lt;p&gt;I&apos;m working on getting &lt;tt&gt;CONFIG_DEBUG_SPINLOCK&lt;/tt&gt; and &lt;tt&gt;CONFIG_LOCK_STAT&lt;/tt&gt; enabled on the IONs. I&apos;m hopeful that will work and will allow me to determine which spin lock the threads are contending on.&lt;/p&gt;</description>
                <environment></environment>
        <key id="17126">LU-2597</key>
            <summary>Soft lockup messages on Sequoia in ptlrpc_check_set under ll_fsync</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="yong.fan">nasf</assignee>
                                    <reporter username="prakash">Prakash Surya</reporter>
                        <labels>
                            <label>MB</label>
                            <label>sequoia</label>
                            <label>topsequoia</label>
                    </labels>
                <created>Wed, 9 Jan 2013 18:27:49 +0000</created>
                <updated>Tue, 19 Feb 2013 18:26:43 +0000</updated>
                            <resolved>Tue, 19 Feb 2013 18:26:43 +0000</resolved>
                                    <version>Lustre 2.4.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="50285" author="pjones" created="Thu, 10 Jan 2013 15:42:17 +0000"  >&lt;p&gt;Alex&lt;/p&gt;

&lt;p&gt;Any ideas?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="50470" author="bzzz" created="Tue, 15 Jan 2013 07:27:29 +0000"  >&lt;p&gt;hmm, but the first trace is pretty specific about spin_lock() ?&lt;/p&gt;</comment>
                            <comment id="50476" author="prakash" created="Tue, 15 Jan 2013 10:32:18 +0000"  >&lt;p&gt;Hence, the confusion. All of the stacks I looked at ended in a spin_lock call, but I don&apos;t immediately see that when looking at the source lines gdb gives me. &lt;/p&gt;</comment>
                            <comment id="50479" author="bzzz" created="Tue, 15 Jan 2013 11:18:25 +0000"  >&lt;p&gt;well, instructions can be reordered or removed due to optimizations.&lt;/p&gt;</comment>
                            <comment id="50486" author="adilger" created="Tue, 15 Jan 2013 13:27:06 +0000"  >&lt;p&gt;This does appear to be related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2263&quot; title=&quot;CPU Soft Lockups due to many threads spinning on import lock on Sequoia IO nodes&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2263&quot;&gt;&lt;del&gt;LU-2263&lt;/del&gt;&lt;/a&gt;, linking these issues.&lt;/p&gt;</comment>
                            <comment id="51012" author="yong.fan" created="Wed, 23 Jan 2013 00:41:51 +0000"  >&lt;p&gt;How many CPU processors are on the client? Is it the same configuration as the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2263&quot; title=&quot;CPU Soft Lockups due to many threads spinning on import lock on Sequoia IO nodes&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2263&quot;&gt;&lt;del&gt;LU-2263&lt;/del&gt;&lt;/a&gt; case?&lt;/p&gt;</comment>
                            <comment id="51034" author="prakash" created="Wed, 23 Jan 2013 12:17:58 +0000"  >&lt;p&gt;Yes, the nodes are the same hardware and software configuration as &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2263&quot; title=&quot;CPU Soft Lockups due to many threads spinning on import lock on Sequoia IO nodes&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2263&quot;&gt;&lt;del&gt;LU-2263&lt;/del&gt;&lt;/a&gt; (although Lustre versions may slightly differ).&lt;/p&gt;</comment>
                            <comment id="52550" author="yong.fan" created="Sat, 16 Feb 2013 23:20:28 +0000"  >&lt;p&gt;Have we still seen the issue after the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2263&quot; title=&quot;CPU Soft Lockups due to many threads spinning on import lock on Sequoia IO nodes&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2263&quot;&gt;&lt;del&gt;LU-2263&lt;/del&gt;&lt;/a&gt; patch was applied?&lt;/p&gt;</comment>
                            <comment id="52694" author="prakash" created="Tue, 19 Feb 2013 12:36:03 +0000"  >&lt;p&gt;Not that I can recall; I&apos;m OK with closing it. It can always be reopened if it comes up again in the future. Chris, do you agree?&lt;/p&gt;</comment>
                            <comment id="52713" author="morrone" created="Tue, 19 Feb 2013 17:02:56 +0000"  >&lt;p&gt;I am fine with closing this.&lt;/p&gt;</comment>
                            <comment id="52719" author="jlevi" created="Tue, 19 Feb 2013 18:26:43 +0000"  >&lt;p&gt;Closing per comments in ticket.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="16548">LU-2263</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvevj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>6051</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>