<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:03:24 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-6805] at_init is not safe to use anywhere but on initialization</title>
                <link>https://jira.whamcloud.com/browse/LU-6805</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;at_init() modifies part of struct adaptive_timeout without taking at_lock. That makes it unsafe to use anywhere except at creation time.&lt;br/&gt;
That is, at_init() should not be used in ptlrpc_connect_interpret() or in ptlrpc_service_part_init().&lt;/p&gt;</description>
                <environment></environment>
        <key id="30974">LU-6805</key>
            <summary>at_init is not safe to use anywhere but on initialization</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="vsaveliev">Vladimir Saveliev</reporter>
                        <labels>
                            <label>patch</label>
                    </labels>
                <created>Tue, 7 Jul 2015 13:39:08 +0000</created>
                <updated>Thu, 23 Nov 2017 18:21:45 +0000</updated>
                            <resolved>Thu, 16 Jul 2015 12:47:10 +0000</resolved>
                                                    <fixVersion>Lustre 2.8.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="120556" author="gerrit" created="Tue, 7 Jul 2015 14:06:19 +0000"  >&lt;p&gt;Vladimir Saveliev (vladimir_saveliev@xyratex.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/15522&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/15522&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6805&quot; title=&quot;at_init is not safe to use anywhere but on initialization&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6805&quot;&gt;&lt;del&gt;LU-6805&lt;/del&gt;&lt;/a&gt; ptlrpc: use smp unsafe at_init only for initialization&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 568e59b44e4f093436ff83f1ffd860aaff114b46&lt;/p&gt;</comment>
                            <comment id="120592" author="adilger" created="Tue, 7 Jul 2015 17:45:18 +0000"  >&lt;p&gt;What problems are seen without this patch? Have you seen issues with this in real life, or was this found by code inspection or static analysis?&lt;/p&gt;

&lt;p&gt;It seems like a problem that the spinlock is being reset while it might be locked. In rare cases that might cause an oops if spinlock debugging is enabled. If that is the case, then including the stack trace in the bug and/or patch would make it easier to correlate any future failures with this ticket and patch.&lt;/p&gt;</comment>
                            <comment id="120602" author="amk" created="Tue, 7 Jul 2015 18:00:02 +0000"  >&lt;p&gt;Yes, we&apos;re seeing threads stuck in infinite loops trying to get the spinlock. This problem has been seen on at least 2 customer systems and a few internal test systems.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;&amp;gt; crash&amp;gt; ps -m | grep &apos;\^&apos;
&amp;gt; ^[0 00:00:00.002] [IN]  PID: 573    TASK: ffff88083370b800  CPU: 12  COMMAND: &quot;kworker/12:1&quot;
&amp;gt; ^[0 00:00:00.003] [IN]  PID: 5698   TASK: ffff88083cb70800  CPU: 15  COMMAND: &quot;gsock_send_1&quot;
[skip]
&amp;gt; ^[0 00:00:00.252] [IN]  PID: 5785   TASK: ffff88083f074040  CPU: 0   COMMAND: &quot;ptlrpcd_15&quot;
&amp;gt; ^[0 00:41:02.337] [RU]  PID: 5781   TASK: ffff880833673800  CPU: 2   COMMAND: &quot;ptlrpcd_11&quot;
             ^41 minutes  running on CPU 2
&amp;gt; crash&amp;gt; bt 5781
&amp;gt; PID: 5781   TASK: ffff880833673800  CPU: 2   COMMAND: &quot;ptlrpcd_11&quot;
&amp;gt;     [exception RIP: _raw_spin_lock+27]
&amp;gt;     RIP: ffffffff8141eabb  RSP: ffff8807e29d3bc0  RFLAGS: 00000202
&amp;gt;     RAX: 0000000000000266  RBX: ffff8808378d8ba0  RCX: ffffffffa0559aa0
&amp;gt;     RDX: 0000000000001000  RSI: 0000000000000000  RDI: ffff8808378d8bd0
&amp;gt;     RBP: ffff8807e29d3bc0   R8: 0000000000000001   R9: 0000000000000001
&amp;gt;     R10: 0000000000000001  R11: 0000ffffffffff0a  R12: 0000000000000258
&amp;gt;     R13: 000000000000003b  R14: 0000000055944d05  R15: 0000000000000028
&amp;gt;     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
&amp;gt; --- &amp;lt;NMI exception stack&amp;gt; ---
&amp;gt;  #3 [ffff8807e29d3bc0] _raw_spin_lock at ffffffff8141eabb
&amp;gt;  #4 [ffff8807e29d3bc8] at_measured at ffffffffa04c0123 [ptlrpc]
&amp;gt;  #5 [ffff8807e29d3c38] ptlrpc_at_adj_net_latency at ffffffffa0494547 [ptlrpc]
&amp;gt;  #6 [ffff8807e29d3c78] after_reply at ffffffffa0497223 [ptlrpc]
&amp;gt;  #7 [ffff8807e29d3ce8] ptlrpc_check_set at ffffffffa049c3d0 [ptlrpc]
&amp;gt;  #8 [ffff8807e29d3d78] ptlrpcd_check at ffffffffa04c810b [ptlrpc]
&amp;gt;  #9 [ffff8807e29d3dd8] ptlrpcd at ffffffffa04c87bb [ptlrpc]
&amp;gt; #10 [ffff8807e29d3ee8] kthread at ffffffff8107373e
&amp;gt; #11 [ffff8807e29d3f48] kernel_thread_helper at ffffffff81427bb4

spinlock is ptlrpc_request.rq_import-&amp;gt;imp_at.iat_net_latency.at_lock

&amp;gt; crash&amp;gt; ptlrpc
&amp;gt; 
&amp;gt; Sent RPCS: ptlrpc_request_set.set_requests-&amp;gt;rq_set_chain
&amp;gt; thread        ptlrpc_request      pid xid                   nid                opc  phase  bulk  sent/deadline
&amp;gt; ===============================================================================================
&amp;gt; ptlrpcd_11:   ffff8807fa1fc000   5781 x1505497848647028     325@gni             400 RPC    0:0 1435782346/1435782478 
&amp;gt; ===============================================================================================

&amp;gt; crash&amp;gt; ptlrpc_request ffff8807fa1fc000 | grep rq_import
&amp;gt;   rq_import_generation = 1, 
&amp;gt;   rq_import = 0xffff8808378d8800, 
 
&amp;gt; crash&amp;gt; struct -o obd_import | grep imp_at
&amp;gt;    [896] struct imp_at imp_at;
&amp;gt; crash&amp;gt; struct -o imp_at | grep iat_net_latency
&amp;gt;    [32] struct adaptive_timeout iat_net_latency;
&amp;gt; crash&amp;gt; struct -o adaptive_timeout | grep at_lock
&amp;gt;   [48] spinlock_t at_lock;

So spinlock address: 
    0xffff8808378d8800 + 896 + 32 + 48 = ffff8808378d8bd0

&amp;gt; crash&amp;gt; spinlock_t -x ffff8808378d8bd0
&amp;gt; struct spinlock_t {
&amp;gt;   {
&amp;gt;     rlock = {
&amp;gt;       raw_lock = {
&amp;gt;         slock = 0x6666
&amp;gt;       }
&amp;gt;     }
&amp;gt;   }
&amp;gt; }

This at_lock is not locked.
&amp;gt; crash&amp;gt; px xtdumpregs | grep -B 15 -A 6 ffff8808378d8bd0
&amp;gt;   }, {
&amp;gt;     r15 = 0x28, 
&amp;gt;     r14 = 0x55944d05, 
&amp;gt;     r13 = 0x3b, 
&amp;gt;     r12 = 0x258, 
&amp;gt;     bp = 0xffff8807e29d3bc0, 
&amp;gt;     bx = 0xffff8808378d8ba0, 
&amp;gt;     r11 = 0xffffffffff0a, 
&amp;gt;     r10 = 0x1, 
&amp;gt;     r9 = 0x1, 
&amp;gt;     r8 = 0x1, 
&amp;gt;     ax = 0x266, 
&amp;gt;     cx = 0xffffffffa0559aa0, 
&amp;gt;     dx = 0x1000, 
&amp;gt;     si = 0x0, 
&amp;gt;     di = 0xffff8808378d8bd0, 
&amp;gt;     orig_ax = 0xffffffffffffffff, 
&amp;gt;     ip = 0xffffffff8141eabb, 
&amp;gt;     cs = 0x10, 
&amp;gt;     flags = 0x202, 
&amp;gt;     sp = 0xffff8807e29d3bc0, 
&amp;gt;     ss = 0x18

&amp;gt;&amp;gt;     RAX: 0000000000000266  RBX: ffff8808378d8ba0  RCX: ffffffffa0559aa0

&amp;gt; crash&amp;gt; dis _raw_spin_lock
&amp;gt; 0xffffffff8141eaa0 &amp;lt;_raw_spin_lock&amp;gt;:	push   %rbp
&amp;gt; 0xffffffff8141eaa1 &amp;lt;_raw_spin_lock+1&amp;gt;:	mov    %rsp,%rbp
&amp;gt; 0xffffffff8141eaa4 &amp;lt;_raw_spin_lock+4&amp;gt;:	data32 data32 data32 xchg %ax,%ax
&amp;gt; 0xffffffff8141eaa9 &amp;lt;_raw_spin_lock+9&amp;gt;:	mov    $0x100,%eax
&amp;gt; 0xffffffff8141eaae &amp;lt;_raw_spin_lock+14&amp;gt;:	lock xadd %ax,(%rdi)
&amp;gt; 0xffffffff8141eab3 &amp;lt;_raw_spin_lock+19&amp;gt;:	cmp    %ah,%al
&amp;gt; 0xffffffff8141eab5 &amp;lt;_raw_spin_lock+21&amp;gt;:	je     0xffffffff8141eabd &amp;lt;_raw_spin_lock+29&amp;gt;
&amp;gt; 0xffffffff8141eab7 &amp;lt;_raw_spin_lock+23&amp;gt;:	pause  
&amp;gt; 0xffffffff8141eab9 &amp;lt;_raw_spin_lock+25&amp;gt;:	mov    (%rdi),%al
&amp;gt; 0xffffffff8141eabb &amp;lt;_raw_spin_lock+27&amp;gt;:	jmp    0xffffffff8141eab3 &amp;lt;_raw_spin_lock+19&amp;gt;   &amp;lt;--- current location
&amp;gt; 0xffffffff8141eabd &amp;lt;_raw_spin_lock+29&amp;gt;:	leaveq 
&amp;gt; 0xffffffff8141eabe &amp;lt;_raw_spin_lock+30&amp;gt;:	retq   
&amp;gt; 0xffffffff8141eabf &amp;lt;_raw_spin_lock+31&amp;gt;:	nop

So %ah == 0x02 and %al == 0x66 (RAX is 0x266). The value of %ah does not change in the spinlock loop, so %al needs to wrap around to 2 before the loop will exit. (Note: my assembly skills are pretty rusty, so there is a good chance I&apos;ve got something wrong here, but the state of the spinlock compared to the values in use here looks suspicious.)

The dklog shows that other threads have requested this particular spinlock over 300 times since pid 5781 made this hung request. The fact that these threads are not also hung suggests that the spinlock requests were successful.

&amp;gt; 00000100:00001000:15:1435782195.117166:0:5781:0:(import.c:1586:at_measured()) add 40 to ffff880837297c48 time=33 v=40 (40 40 0 0)
&amp;gt; 00000100:00001000:14:1435782195.260830:0:5777:0:(import.c:1586:at_measured()) add 40 to ffff880837297c48 time=33 v=40 (40 40 0 0)
&amp;gt; 00000100:00001000:15:1435782195.270513:0:5773:0:(import.c:1586:at_measured()) add 40 to ffff880837297c48 time=33 v=40 (40 40 0 0)
&amp;gt; 00000100:00001000:14:1435782195.285561:0:5777:0:(import.c:1586:at_measured()) add 40 to ffff880837297c48 time=33 v=40 (40 40 0 0)
&amp;gt; 00000100:00001000:15:1435782195.291757:0:5773:0:(import.c:1586:at_measured()) add 40 to ffff880837297c48 time=33 v=40 (40 40 0 0)
...
&amp;gt; 00000100:00001000:1:1435782534.441946:0:5778:0:(import.c:1586:at_measured()) add 130 to ffff880837297c48 time=72 v=130 (130 0 40 40)
&lt;/pre&gt;
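&lt;p&gt;The address arithmetic and the register reading in the session above can be re-checked with a short script. This is only a sketch: the constants are copied from the crash output, the helper names at_lock_address and split_ticket_word are made up for illustration, and the byte layout (low byte is the ticket currently being served, high byte is the next ticket to hand out) is an assumption read off the cmp %ah,%al loop in the disassembly, not something confirmed against the kernel source.&lt;/p&gt;
&lt;pre&gt;
```python
# Values copied from the crash session above.
RQ_IMPORT = 0xffff8808378d8800   # ptlrpc_request.rq_import
OFF_IMP_AT = 896                 # offset of imp_at in obd_import
OFF_IAT_NET_LATENCY = 32         # offset of iat_net_latency in imp_at
OFF_AT_LOCK = 48                 # offset of at_lock in adaptive_timeout


def at_lock_address(import_addr):
    """Address of imp_at.iat_net_latency.at_lock for an import (hypothetical helper)."""
    return import_addr + OFF_IMP_AT + OFF_IAT_NET_LATENCY + OFF_AT_LOCK


def split_ticket_word(word):
    """Split a 16-bit word into (high byte, low byte); hypothetical helper."""
    return divmod(word % 0x10000, 0x100)


# Assumed layout: high byte = next ticket, low byte = ticket being served.
my_ticket, seen_owner = split_ticket_word(0x266)    # RAX of the spinning thread
next_ticket, owner = split_ticket_word(0x6666)      # slock as read from memory

print(hex(at_lock_address(RQ_IMPORT)))   # 0xffff8808378d8bd0
print(my_ticket, seen_owner)             # 2 102
print(owner == next_ticket)              # True: the lock word itself reads as free
print((my_ticket - owner) % 256)         # 156 unlocks until the served ticket wraps to 2
```
&lt;/pre&gt;
&lt;p&gt;Under that reading the spinning thread holds ticket 2 while the lock is serving 0x66, which agrees with the note above that %al would have to wrap around to 2 before the loop exits.&lt;/p&gt;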
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="120694" author="vsaveliev" created="Wed, 8 Jul 2015 14:30:25 +0000"  >&lt;p&gt;One of the issue instances contains the following logs:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00001000:9:1435782375.689537:0:5910:0:(import.c:1586:at_measured()) add 121 to ffff8807ecb12ba0 time=146 v=96 (96 1 1 1)
00000100:00001000:19:1435782375.689537:0:5896:0:(import.c:1586:at_measured()) add 14 to ffff8807ecb12ba0 time=146 v=96 (96 1 1 1)
00000100:00001000:9:1435782375.689541:0:5910:0:(import.c:1644:at_measured()) AT ffff8807ecb12ba0 change: old=96 new=121 delta=25 (val=121) hist 121 1 1 1
00000100:00001000:19:1435782375.689542:0:5896:0:(import.c:1586:at_measured()) add 14 to ffff8807ecb12ba0 time=1435782375 v=0 (0 0 0 0)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
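&lt;p&gt;To find records like the last one in a longer debug log, the at_measured() lines can be filtered mechanically. A rough sketch follows. The regular expression is written against the four lines quoted above only, and parse_add_line and looks_reset are illustrative names, not part of any Lustre tool. It flags "add" lines where the current value and the whole history read zero, which is what the last quoted line shows.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```python
import re

# Written against the "add" lines quoted above; other log formats are not handled.
ADD_LINE = re.compile(
    r"\(import\.c:\d+:at_measured\(\)\) add (\d+) to ([0-9a-f]+) "
    r"time=(\d+) v=(\d+) \((\d+) (\d+) (\d+) (\d+)\)"
)


def parse_add_line(line):
    """Return a dict for an at_measured() "add" line, or None if it does not match."""
    match = ADD_LINE.search(line)
    if match is None:
        return None
    fields = match.groups()
    return {
        "pid": line.split(":")[5],        # sixth colon-separated header field
        "value": int(fields[0]),
        "at": fields[1],
        "time": int(fields[2]),
        "current": int(fields[3]),
        "hist": [int(x) for x in fields[4:8]],
    }


def looks_reset(record):
    """True when the current value and every history slot read zero."""
    return record["current"] == 0 and not any(record["hist"])


LOG = [
    "00000100:00001000:9:1435782375.689537:0:5910:0:(import.c:1586:at_measured()) add 121 to ffff8807ecb12ba0 time=146 v=96 (96 1 1 1)",
    "00000100:00001000:19:1435782375.689537:0:5896:0:(import.c:1586:at_measured()) add 14 to ffff8807ecb12ba0 time=146 v=96 (96 1 1 1)",
    "00000100:00001000:9:1435782375.689541:0:5910:0:(import.c:1644:at_measured()) AT ffff8807ecb12ba0 change: old=96 new=121 delta=25 (val=121) hist 121 1 1 1",
    "00000100:00001000:19:1435782375.689542:0:5896:0:(import.c:1586:at_measured()) add 14 to ffff8807ecb12ba0 time=1435782375 v=0 (0 0 0 0)",
]

records = [r for r in map(parse_add_line, LOG) if r is not None]
suspects = [r for r in records if looks_reset(r)]
for r in suspects:
    print(r["pid"], r["at"], r["value"])   # 5896 ffff8807ecb12ba0 14
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;A zeroed history can also be legitimate right after creation, so a hit from this filter is a place to look, not proof of a race.&lt;/p&gt;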

&lt;p&gt;The crash dump shows that pid 5896 is spinning in at_measured() &amp;#45;&amp;gt; spin_lock(&amp;amp;at-&amp;gt;at_lock) on the adaptive_timeout (ffff8807ecb12ba0) on which at_init() had recently been called.&lt;br/&gt;
It is possible that the spinlock was corrupted by a race between the unprotected initialization and regular spin_lock() calls.&lt;/p&gt;
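&lt;p&gt;One way such a race could leave a waiter stranded can be sketched with a toy model of a ticket lock. This is a simplified, single-threaded illustration of one hypothetical interleaving, not a reconstruction of what happened on the affected systems. The class TicketLock and its methods are invented for the example, and the 8-bit counters are an assumption.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```python
class TicketLock:
    """Toy 8-bit ticket lock: 'owner' is the ticket being served,
    'next' is the next ticket to hand out (hypothetical model)."""

    def __init__(self):
        self.owner = 0
        self.next = 0

    def take_ticket(self):
        ticket = self.next
        self.next = (self.next + 1) % 256
        return ticket

    def may_enter(self, ticket):
        return self.owner == ticket

    def unlock(self):
        self.owner = (self.owner + 1) % 256

    def unsafe_reinit(self):
        # Stands in for re-running an initializer on a lock that is in use.
        self.owner = 0
        self.next = 0


def simulate():
    lock = TicketLock()
    # Two uncontended lock/unlock cycles move the counters to owner=2, next=2.
    for _ in range(2):
        ticket = lock.take_ticket()
        assert lock.may_enter(ticket)
        lock.unlock()

    holder = lock.take_ticket()   # ticket 2, enters immediately
    waiter = lock.take_ticket()   # ticket 3, has to wait for the holder
    assert lock.may_enter(holder) and not lock.may_enter(waiter)

    lock.unlock()                 # owner becomes 3, the waiter's turn ...
    lock.unsafe_reinit()          # ... but the lock is reset before the waiter re-reads it

    served = 0
    for _ in range(2):            # later callers still get through
        ticket = lock.take_ticket()
        if lock.may_enter(ticket):
            served += 1
            lock.unlock()

    return waiter, lock.owner, lock.next, served, lock.may_enter(waiter)


print(simulate())   # (3, 2, 2, 2, False)
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;In this model the later callers take and release the lock normally while the earlier waiter keeps a ticket that no longer matches the counter, and the lock word reads as free the whole time.&lt;/p&gt;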

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; adaptive_timeout ffff8807ecb12ba0
struct adaptive_timeout {
  at_binstart = 0, 
  at_hist = {0, 0, 0, 0}, 
  at_flags = 0, 
  at_current = 0, 
  at_worst_ever = 0, 
  at_worst_time = 1435782375, 
  at_lock = {
    {
      rlock = {
        raw_lock = {
          slock = 514
        }
      }
    }
  }
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="121408" author="gerrit" created="Thu, 16 Jul 2015 03:09:49 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/15522/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/15522/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6805&quot; title=&quot;at_init is not safe to use anywhere but on initialization&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6805&quot;&gt;&lt;del&gt;LU-6805&lt;/del&gt;&lt;/a&gt; ptlrpc: use smp unsafe at_init only for initialization&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 96ddb3b168297e7d59a2f4b7b357549f2632bcb4&lt;/p&gt;</comment>
                            <comment id="121422" author="pjones" created="Thu, 16 Jul 2015 12:47:10 +0000"  >&lt;p&gt;Landed for 2.8&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxhj3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>