<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:00:33 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-6476] conf-sanity: test_53a Error: &apos;test failed to respond and timed out&apos; </title>
                <link>https://jira.whamcloud.com/browse/LU-6476</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;This issue was created by maloo for bfaccini &amp;lt;bruno.faccini@intel.com&amp;gt;&lt;br/&gt;
This issue relates to the following test suite run: &lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/b5f17674-e6d2-11e4-83f0-5254006e85c2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/b5f17674-e6d2-11e4-83f0-5254006e85c2&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One of the test nodes running test_53a()/thread_sanity()/lustre_rmmod hit an Oops in _spin_lock_irqsave(), with the following stack dumped on the console:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;13:04:49:BUG: unable to handle kernel NULL pointer dereference at (null)
13:04:49:IP: [&amp;lt;ffffffff8152cdcf&amp;gt;] _spin_lock_irqsave+0x1f/0x40
13:04:49:PGD 7985a067 PUD 7bb69067 PMD 0 
13:04:49:Oops: 0002 [#1] SMP 
13:04:49:last sysfs file: /sys/devices/system/cpu/possible
13:04:49:CPU 1 
13:04:49:Modules linked in: ptlrpc(-)(U) obdclass(U) ksocklnd(U) lnet(U) libcfs(U) sha512_generic nfs fscache nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ptlrpc_gss]
13:04:49:
13:04:49:Pid: 9166, comm: rmmod Not tainted 2.6.32-504.12.2.el6.x86_64 #1 Red Hat KVM
13:04:49:RIP: 0010:[&amp;lt;ffffffff8152cdcf&amp;gt;]  [&amp;lt;ffffffff8152cdcf&amp;gt;] _spin_lock_irqsave+0x1f/0x40
13:04:49:RSP: 0018:ffff88007b1c5db8  EFLAGS: 00010082
13:04:49:RAX: 0000000000010000 RBX: 0000000000000000 RCX: 0000000000000000
13:04:49:RDX: 0000000000000282 RSI: 0000000000000003 RDI: 0000000000000000
13:04:49:RBP: ffff88007b1c5db8 R08: ffff880079865540 R09: 0000000000000001
13:04:49:R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007b900aa0
13:04:49:R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
13:04:49:FS:  00007f46301fe700(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
13:04:49:CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
13:04:49:CR2: 0000000000000000 CR3: 000000007cb24000 CR4: 00000000000006e0
13:04:49:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
13:04:49:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
13:04:49:Process rmmod (pid: 9166, threadinfo ffff88007b1c4000, task ffff88007b900aa0)
13:04:49:Stack:
13:04:49: ffff88007b1c5df8 ffffffff8105bd02 ffff880078f849c0 0000000000000880
13:04:49:&amp;lt;d&amp;gt; ffff88007b900aa0 ffff88007b1c5e68 0000000000000000 0000000000000001
13:04:49:&amp;lt;d&amp;gt; ffff88007b1c5e08 ffffffffa0761cdf ffff88007b1c5e18 ffffffffa0743104
13:04:49:Call Trace:
13:04:49: [&amp;lt;ffffffff8105bd02&amp;gt;] __wake_up+0x32/0x70
13:04:49: [&amp;lt;ffffffffa0761cdf&amp;gt;] lnet_acceptor_stop+0x3f/0x50 [lnet]
13:04:49: [&amp;lt;ffffffffa0743104&amp;gt;] LNetNIFini+0x74/0x100 [lnet]
13:04:49: [&amp;lt;ffffffffa09ed775&amp;gt;] ptlrpc_ni_fini+0x135/0x1b0 [ptlrpc]
13:04:49: [&amp;lt;ffffffffa09ed803&amp;gt;] ptlrpc_exit_portals+0x13/0x20 [ptlrpc]
13:04:49: [&amp;lt;ffffffffa0a2fcaa&amp;gt;] ptlrpc_exit+0x22/0x38 [ptlrpc]
13:04:49: [&amp;lt;ffffffff810bcf14&amp;gt;] sys_delete_module+0x194/0x260
13:04:49: [&amp;lt;ffffffff810e5c87&amp;gt;] ? audit_syscall_entry+0x1d7/0x200
13:04:49: [&amp;lt;ffffffff8100b072&amp;gt;] system_call_fastpath+0x16/0x1b
13:04:49:Code: c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 9c 58 0f 1f 44 00 00 48 89 c2 fa 66 0f 1f 44 00 00 b8 00 00 01 00 &amp;lt;f0&amp;gt; 0f c1 07 0f b7 c8 c1 e8 10 39 c1 74 0e f3 90 0f 1f 44 00 00 
13:04:49:RIP  [&amp;lt;ffffffff8152cdcf&amp;gt;] _spin_lock_irqsave+0x1f/0x40
13:04:49: RSP &amp;lt;ffff88007b1c5db8&amp;gt;
13:04:49:CR2: 0000000000000000
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This looks like a race between the execution of lnet_acceptor_stop() and a concurrent (and unexpected?) exit of the acceptor thread, where lnet_acceptor_stop():&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;void
lnet_acceptor_stop(void)
{
        if (lnet_acceptor_state.pta_shutdown) /* not running */
                return;

        lnet_acceptor_state.pta_shutdown = 1;
        wake_up_all(sk_sleep(lnet_acceptor_state.pta_sock-&amp;gt;sk));

        /* block until acceptor signals exit */
        wait_for_completion(&amp;amp;lnet_acceptor_state.pta_signal);
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;tries to wake up all socket sleepers without protecting itself against a concurrent exit of the acceptor thread in the lnet_acceptor() routine:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;        }

        sock_release(lnet_acceptor_state.pta_sock);
        lnet_acceptor_state.pta_sock = NULL;

        CDEBUG(D_NET, &quot;Acceptor stopping\n&quot;);

        /* unblock lnet_acceptor_stop() */
        complete(&amp;amp;lnet_acceptor_state.pta_signal);
        return 0;
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;where the acceptor socket is released and its sk_sleep wait queue of sleepers is cleared, under protection of the sk_callback_lock RW lock, in sock_orphan(). So it seems that lnet_acceptor_stop() should either use, or implement the same behavior as, sock_def_wakeup():&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;static void sock_def_wakeup(struct sock *sk)
{
        read_lock(&amp;amp;sk-&amp;gt;sk_callback_lock);
        if (sk_has_sleeper(sk))
                wake_up_interruptible_all(sk-&amp;gt;sk_sleep);
        read_unlock(&amp;amp;sk-&amp;gt;sk_callback_lock);
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;instead of simply calling wake_up_all().&lt;br/&gt;
This may have been introduced by the recent integration of patches implementing the dynamic acceptor start/stop feature.&lt;br/&gt;
I will push a patch to try out the fix suggested above.&lt;/p&gt;</description>
                <environment></environment>
        <key id="29559">LU-6476</key>
            <summary>conf-sanity: test_53a Error: &apos;test failed to respond and timed out&apos; </summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="maloo">Maloo</reporter>
                        <labels>
                    </labels>
                <created>Mon, 20 Apr 2015 12:16:37 +0000</created>
                <updated>Sat, 11 Jul 2015 12:57:21 +0000</updated>
                            <resolved>Sat, 11 Jul 2015 12:57:21 +0000</resolved>
                                    <version>Lustre 2.8.0</version>
                                    <fixVersion>Lustre 2.8.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="112376" author="yong.fan" created="Mon, 20 Apr 2015 12:45:34 +0000"  >&lt;p&gt;Another failure instance:&lt;br/&gt;
&lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/88b81062-df63-11e4-bf2e-5254006e85c2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/88b81062-df63-11e4-bf2e-5254006e85c2&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="112388" author="gerrit" created="Mon, 20 Apr 2015 14:29:59 +0000"  >&lt;p&gt;Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/14503&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/14503&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6476&quot; title=&quot;conf-sanity: test_53a Error: &amp;#39;test failed to respond and timed out&amp;#39; &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6476&quot;&gt;&lt;del&gt;LU-6476&lt;/del&gt;&lt;/a&gt; lnet: avoid race during acceptor thread termination&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 9f23308ae7b338d0c0f4b43d2e408d50fca4b3f6&lt;/p&gt;</comment>
                            <comment id="112421" author="adilger" created="Mon, 20 Apr 2015 17:51:45 +0000"  >&lt;p&gt;It looks like this was likely added via &quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6002&quot; title=&quot;DLC: startup acceptor dynamically.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6002&quot;&gt;&lt;del&gt;LU-6002&lt;/del&gt;&lt;/a&gt; lnet: startup acceptor thread dynamically&quot; patch &lt;a href=&quot;http://review.whamcloud.com/13010&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13010&lt;/a&gt; &lt;/p&gt;</comment>
                            <comment id="113081" author="ashehata" created="Tue, 21 Apr 2015 20:53:43 +0000"  >&lt;p&gt;I don&apos;t think this is related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6002&quot; title=&quot;DLC: startup acceptor dynamically.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6002&quot;&gt;&lt;del&gt;LU-6002&lt;/del&gt;&lt;/a&gt;.  The wake_up_all() call in lnet_acceptor_stop() was introduced by: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6245&quot; title=&quot;Untangle userland and kernel space support for libcfs&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6245&quot;&gt;&lt;del&gt;LU-6245&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;ed88907a (nathan          2007-02-10 00:05:05 +0000 505) 
ed88907a (nathan          2007-02-10 00:05:05 +0000 506) void
ed88907a (nathan          2007-02-10 00:05:05 +0000 507) lnet_acceptor_stop(void)
ed88907a (nathan          2007-02-10 00:05:05 +0000 508) {
91df044a (Amir Shehata    2015-01-12 17:14:33 -0800 509)        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (lnet_acceptor_state.pta_shutdown) &lt;span class=&quot;code-comment&quot;&gt;/* not running */&lt;/span&gt;
2841be33 (John L. Hammond 2013-03-13 14:46:08 -0500 510)                &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt;;
19205dfe (maxim           2007-09-28 21:46:35 +0000 511) 
2841be33 (John L. Hammond 2013-03-13 14:46:08 -0500 512)        lnet_acceptor_state.pta_shutdown = 1;
0b868add (James Simmons   2015-03-13 13:11:19 -0400 513)        wake_up_all(sk_sleep(lnet_acceptor_state.pta_sock-&amp;gt;sk));
b7455572 (maxim           2009-02-03 13:43:21 +0000 514) 
2841be33 (John L. Hammond 2013-03-13 14:46:08 -0500 515)        &lt;span class=&quot;code-comment&quot;&gt;/* block until acceptor signals exit */&lt;/span&gt;
2841be33 (John L. Hammond 2013-03-13 14:46:08 -0500 516)        wait_for_completion(&amp;amp;lnet_acceptor_state.pta_signal);
19205dfe (maxim           2007-09-28 21:46:35 +0000 517) }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="113119" author="bfaccini" created="Wed, 22 Apr 2015 13:38:46 +0000"  >&lt;p&gt;Having taken a closer look at the code (though this may need to be confirmed by an in-depth crash-dump analysis), I am now convinced that this can only occur when some socket activity runs concurrently with the shutdown/lnet_acceptor_stop() process. The sleepers list must therefore be accessed under protection, and the preferred way to do so is the sk-&amp;gt;sk_state_change() method.&lt;/p&gt;</comment>
                            <comment id="114726" author="gerrit" created="Fri, 8 May 2015 15:04:33 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/14503/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/14503/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6476&quot; title=&quot;conf-sanity: test_53a Error: &amp;#39;test failed to respond and timed out&amp;#39; &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6476&quot;&gt;&lt;del&gt;LU-6476&lt;/del&gt;&lt;/a&gt; lnet: avoid race during acceptor thread termination&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 7574e32e64b77851163d0a8d3141b406d03621be&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="27833">LU-6002</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxb1j:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>