<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:24:17 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-2327] Clients stuck in mount</title>
                <link>https://jira.whamcloud.com/browse/LU-2327</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We have another instance of clients getting stuck at mount time.  Out of 768 clients, 766 mounted successfully, and 2 appear to be stuck.  So far it has been several minutes.  They both had the same soft lockup messages on the console:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LustreError: 20578:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
BUG: soft lockup - CPU#33 stuck for 68s! [ll_ping:3244]
Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
NIP: c00000000042e190 LR: 8000000003a9ff2c CTR: c00000000042e160
REGS: c0000003ca60fba0 TRAP: 0901   Not tainted  (2.6.32-220.23.3.bgq.13llnl.V1R1M2.bgq62_16.ppc64)
MSR: 0000000080029000 &amp;lt;EE,ME,CE&amp;gt;  CR: 84228424  XER: 20000000
TASK = c0000003ca51e600[3244] &apos;ll_ping&apos; THREAD: c0000003ca60c000 CPU: 33
GPR00: 0000000080000021 c0000003ca60fe20 c0000000006de510 c0000003c00c0a78
GPR04: 2222222222222222 0000000000000000 c0000003ca60fd38 0000000000000000
GPR08: 0000000000000000 0000000080000021 0000000000000000 c00000000042e160
GPR12: 8000000003aebda8 c00000000075d200
NIP [c00000000042e190] ._spin_lock+0x30/0x44
LR [8000000003a9ff2c] .ptlrpc_pinger_main+0x19c/0xcc0 [ptlrpc]
Call Trace:
[c0000003ca60fe20] [8000000003a9fea0] .ptlrpc_pinger_main+0x110/0xcc0 [ptlrpc] (unreliable)
[c0000003ca60ff90] [c00000000001a9e0] .kernel_thread+0x54/0x70
Instruction dump:
38000000 980d0c94 812d0000 7c001829 2c000000 40c20010 7d20192d 40c2fff0
4c00012c 2fa00000 4dfe0020 7c210b78 &amp;lt;80030000&amp;gt; 2fa00000 40defff4 7c421378
BUG: soft lockup - CPU#49 stuck for 68s! [ptlrpcd_rcv:3175]
Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
NIP: c00000000042e198 LR: 8000000003a66af4 CTR: c00000000042e160
REGS: c0000003e9bf3900 TRAP: 0901   Not tainted  (2.6.32-220.23.3.bgq.13llnl.V1R1M2.bgq62_16.ppc64)
MSR: 0000000080029000 &amp;lt;EE,ME,CE&amp;gt;  CR: 84228444  XER: 20000000
TASK = c0000003ea1f0740[3175] &apos;ptlrpcd_rcv&apos; THREAD: c0000003e9bf0000 CPU: 49
GPR00: 0000000080000021 c0000003e9bf3b80 c0000000006de510 c0000003c00c0a78
GPR04: 0000000000000001 0000000000000000 0000000000000000 0000000000000000
GPR08: c0000003c00c0a60 0000000080000031 c00000036a060900 c00000000042e160
GPR12: 8000000003aebda8 c00000000076a200
NIP [c00000000042e198] ._spin_lock+0x38/0x44
LR [8000000003a66af4] .ptlrpc_check_set+0x4f4/0x4e80 [ptlrpc]
Call Trace:
[c0000003e9bf3b80] [8000000003a66964] .ptlrpc_check_set+0x364/0x4e80 [ptlrpc] (unreliable)
[c0000003e9bf3d20] [8000000003abd1cc] .ptlrpcd_check+0x66c/0x8a0 [ptlrpc]
[c0000003e9bf3e40] [8000000003abd6b8] .ptlrpcd+0x2b8/0x510 [ptlrpc]
[c0000003e9bf3f90] [c00000000001a9e0] .kernel_thread+0x54/0x70
Instruction dump:
812d0000 7c001829 2c000000 40c20010 7d20192d 40c2fff0 4c00012c 2fa00000
4dfe0020 7c210b78 80030000 2fa00000 &amp;lt;40defff4&amp;gt; 7c421378 4bffffc8 7c0802a6
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Both nodes are pretty much idle.  I&apos;ll see what other information I can gather.&lt;/p&gt;</description>
                <environment>Sequoia, 2.3.54-2chaos, github.com/chaos/lustre</environment>
        <key id="16679">LU-2327</key>
            <summary>Clients stuck in mount</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="bzzz">Alex Zhuravlev</assignee>
                                    <reporter username="morrone">Christopher Morrone</reporter>
                        <labels>
                            <label>llnl</label>
                            <label>ptr</label>
                    </labels>
                <created>Wed, 14 Nov 2012 17:24:45 +0000</created>
                <updated>Wed, 15 Apr 2020 17:47:46 +0000</updated>
                            <resolved>Wed, 15 Apr 2020 17:47:46 +0000</resolved>
                                    <version>Lustre 2.4.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="47817" author="morrone" created="Wed, 14 Nov 2012 17:48:25 +0000"  >&lt;p&gt;Actually, there is a pretty common low level of activity from ldlm_poold, and ptlrpcds appear now and then in top.&lt;/p&gt;</comment>
                            <comment id="47818" author="morrone" created="Wed, 14 Nov 2012 17:50:40 +0000"  >&lt;p&gt;Attaching log files.  The following files:&lt;/p&gt;

&lt;p&gt;seqio652_lustre.log&lt;br/&gt;
seqio542_lustre.log&lt;/p&gt;

&lt;p&gt;were just an &quot;lctl dk&quot; on the nodes that were hung in mount.&lt;/p&gt;

&lt;p&gt;This file:&lt;/p&gt;

&lt;p&gt;seqio652_lustre2.log&lt;/p&gt;

&lt;p&gt;is &quot;lctl dk&quot; output after enabling full debugging and waiting a few seconds.&lt;/p&gt;

&lt;p&gt;This file:&lt;/p&gt;

&lt;p&gt;seqio542_console_sysrq.log&lt;/p&gt;

&lt;p&gt;contains the console output captured after I did both &quot;l&quot; and &quot;t&quot; sysrq.&lt;/p&gt;</comment>
                            <comment id="47820" author="prakash" created="Wed, 14 Nov 2012 20:44:11 +0000"  >&lt;p&gt;At first glance, this looks like a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2263&quot; title=&quot;CPU Soft Lockups due to many threads spinning on import lock on Sequoia IO nodes&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2263&quot;&gt;&lt;del&gt;LU-2263&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="47825" author="morrone" created="Wed, 14 Nov 2012 21:05:29 +0000"  >&lt;p&gt;Perhaps related, but in this instance there is no significant lock contention when I log in.  CPU usage is very low.  My understanding from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2263&quot; title=&quot;CPU Soft Lockups due to many threads spinning on import lock on Sequoia IO nodes&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2263&quot;&gt;&lt;del&gt;LU-2263&lt;/del&gt;&lt;/a&gt; is that the node has very high load due to lock contention.&lt;/p&gt;</comment>
                            <comment id="47827" author="prakash" created="Wed, 14 Nov 2012 21:23:55 +0000"  >&lt;p&gt;But there definitely was significant lock contention at some point, 68 seconds worth according to the message. Perhaps you caught it and logged in just as it cleared up? Or I may have mis-characterized the &quot;locking up&quot; part in the description of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2263&quot; title=&quot;CPU Soft Lockups due to many threads spinning on import lock on Sequoia IO nodes&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2263&quot;&gt;&lt;del&gt;LU-2263&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Either way, it appears to be contending on an (the same?) import lock as in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2263&quot; title=&quot;CPU Soft Lockups due to many threads spinning on import lock on Sequoia IO nodes&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2263&quot;&gt;&lt;del&gt;LU-2263&lt;/del&gt;&lt;/a&gt;:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;(gdb) l *ptlrpc_pinger_main+0x19c
0x7ff2c is in ptlrpc_pinger_main (/builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/pinger.c:230).
225     /builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/pinger.c: No such file or directory.
        in /builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/pinger.c
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;lustre/ptlrpc/pinger.c:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; 225 static void ptlrpc_pinger_process_import(struct obd_import *imp,                
 226                                          unsigned long this_ping)               
 227 {                                                                               
 228         int force, level;                                                       
 229                                                                                 
 230         cfs_spin_lock(&amp;amp;imp-&amp;gt;imp_lock);                                          
 231         level = imp-&amp;gt;imp_state;                                                 
 232         force = imp-&amp;gt;imp_force_verify;                                          
 233         if (force)                                                              
 234                 imp-&amp;gt;imp_force_verify = 0;                                      
 235         cfs_spin_unlock(&amp;amp;imp-&amp;gt;imp_lock);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="47828" author="prakash" created="Wed, 14 Nov 2012 21:27:27 +0000"  >&lt;p&gt;If this is deemed a different issue that &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2263&quot; title=&quot;CPU Soft Lockups due to many threads spinning on import lock on Sequoia IO nodes&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2263&quot;&gt;&lt;del&gt;LU-2263&lt;/del&gt;&lt;/a&gt;, I probably incorrectly resolved &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2141&quot; title=&quot;Soft lockup at mount time&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2141&quot;&gt;&lt;del&gt;LU-2141&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="47852" author="pjones" created="Thu, 15 Nov 2012 11:07:09 +0000"  >&lt;p&gt;Alex will triage this one&lt;/p&gt;</comment>
                            <comment id="48146" author="prakash" created="Tue, 20 Nov 2012 18:14:09 +0000"  >&lt;p&gt;I&apos;m looking at the &lt;tt&gt;seqio542_console_sysrq.log&lt;/tt&gt; file..&lt;/p&gt;

&lt;p&gt;The &lt;tt&gt;mount.lustre&lt;/tt&gt; task is sleeping:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2012-11-14 14:32:30.029575 {DefaultControlEventListener} [mmcs]{549}.0.0: mount.lustre  D 00000fff7bd0903c     0 20578  20577 0x00000000
2012-11-14 14:32:30.029627 {DefaultControlEventListener} [mmcs]{549}.0.0: Call Trace:
2012-11-14 14:32:30.029679 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db2cc0] [c000000000687f48] svc_rdma_ops+0xa5a0/0x198c0 (unreliable)
2012-11-14 14:32:30.029731 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db2e90] [c000000000008de0] .__switch_to+0xc4/0x100
2012-11-14 14:32:30.029784 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db2f20] [c00000000042b0e0] .schedule+0x858/0x9c0
2012-11-14 14:32:30.029836 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db31d0] [c00000000042c16c] .__mutex_lock_slowpath+0x208/0x390
2012-11-14 14:32:30.029888 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db32d0] [c00000000042c30c] .mutex_lock+0x18/0x34
2012-11-14 14:32:30.029940 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3350] [8000000003a9ea34] .ptlrpc_pinger_add_import+0x94/0x3c0 [ptlrpc]
2012-11-14 14:32:30.029993 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3410] [8000000003a35974] .client_connect_import+0x3b4/0x5d0 [ptlrpc]
2012-11-14 14:32:30.030045 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db34f0] [8000000005143ce0] .lov_connect_obd+0x860/0x1590 [lov]
2012-11-14 14:32:30.030097 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3610] [8000000005144f44] .lov_connect+0x534/0xbb0 [lov]
2012-11-14 14:32:30.030150 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3750] [80000000069c603c] .ll_fill_super+0x799c/0xace0 [lustre]
2012-11-14 14:32:30.030203 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3900] [80000000024820c4] .lustre_fill_super+0x4c4/0x8e0 [obdclass]
2012-11-14 14:32:30.030255 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db39d0] [c0000000000d4aa0] .get_sb_nodev+0x84/0xe8
2012-11-14 14:32:30.030307 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3a80] [800000000245b5a8] .lustre_get_sb+0x28/0x40 [obdclass]
2012-11-14 14:32:30.030360 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3b10] [c0000000000d3244] .vfs_kern_mount+0x80/0x114
2012-11-14 14:32:30.030412 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3bc0] [c0000000000d3340] .do_kern_mount+0x58/0x130
2012-11-14 14:32:30.030464 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3c80] [c0000000000f20fc] .do_mount+0x8c8/0x984
2012-11-14 14:32:30.030516 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3d70] [c0000000000f2270] .SyS_mount+0xb8/0x124
2012-11-14 14:32:30.030568 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3e30] [c000000000000580] syscall_exit+0x0/0x2c
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The dump of the active CPUs shows two threads stuck on the import lock I detailed above. The &quot;important&quot; one (I think) is below:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2012-11-14 14:32:16.108436 {DefaultControlEventListener} [mmcs]{549}.0.1: SysRq : Show backtrace of all active CPUs
2012-11-14 14:32:16.148433 {DefaultControlEventListener} [mmcs]{549}.8.1: CPU33:
2012-11-14 14:32:16.188544 {DefaultControlEventListener} [mmcs]{549}.8.1: Call Trace:
2012-11-14 14:32:16.228443 {DefaultControlEventListener} [mmcs]{549}.8.1: [c00000000fef3b20] [c000000000008160] .show_stack+0x7c/0x184 (unreliable)
2012-11-14 14:32:16.268527 {DefaultControlEventListener} [mmcs]{549}.8.1: [c00000000fef3bd0] [c000000000275548] .showacpu+0x64/0x94
2012-11-14 14:32:16.308459 {DefaultControlEventListener} [mmcs]{549}.8.1: [c00000000fef3c60] [c000000000068114] .generic_smp_call_function_interrupt+0x11c/0x258
2012-11-14 14:32:16.348385 {DefaultControlEventListener} [mmcs]{549}.8.1: [c00000000fef3d40] [c00000000001c14c] .smp_message_recv+0x34/0x78
2012-11-14 14:32:16.388393 {DefaultControlEventListener} [mmcs]{549}.8.1: [c00000000fef3dc0] [c000000000024250] .bgq_ipi_dispatch+0x118/0x18c
2012-11-14 14:32:16.428248 {DefaultControlEventListener} [mmcs]{549}.8.1: [c00000000fef3e50] [c000000000079d28] .handle_IRQ_event+0x88/0x18c
2012-11-14 14:32:16.472316 {DefaultControlEventListener} [mmcs]{549}.8.1: [c00000000fef3f00] [c00000000007c8a0] .handle_percpu_irq+0x8c/0x100
2012-11-14 14:32:16.518329 {DefaultControlEventListener} [mmcs]{549}.8.1: [c00000000fef3f90] [c00000000001a848] .call_handle_irq+0x1c/0x2c
2012-11-14 14:32:16.558212 {DefaultControlEventListener} [mmcs]{549}.8.1: [c0000003ca60fa80] [c0000000000055a0] .do_IRQ+0x154/0x1e0
2012-11-14 14:32:16.598349 {DefaultControlEventListener} [mmcs]{549}.8.1: [c0000003ca60fb30] [c0000000000134dc] exc_external_input_book3e+0x110/0x114
2012-11-14 14:32:16.638096 {DefaultControlEventListener} [mmcs]{549}.8.1: --- Exception: 501 at ._spin_lock+0x2c/0x44
2012-11-14 14:32:16.678061 {DefaultControlEventListener} [mmcs]{549}.8.1:     LR = .ptlrpc_pinger_main+0x19c/0xcc0 [ptlrpc]
2012-11-14 14:32:16.717910 {DefaultControlEventListener} [mmcs]{549}.8.1: [c0000003ca60fe20] [8000000003a9fea0] .ptlrpc_pinger_main+0x110/0xcc0 [ptlrpc] (unreliable)
2012-11-14 14:32:16.757975 {DefaultControlEventListener} [mmcs]{549}.8.1: [c0000003ca60ff90] [c00000000001a9e0] .kernel_thread+0x54/0x70
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So it looks like one thread is stuck in &lt;tt&gt;ptlrpc_pinger_process_import&lt;/tt&gt; spinning on the &lt;tt&gt;imp-&amp;gt;imp_lock&lt;/tt&gt; while still holding the &lt;tt&gt;pinger_mutex&lt;/tt&gt;; as a result, &lt;tt&gt;mount.lustre&lt;/tt&gt; gets stuck in &lt;tt&gt;ptlrpc_pinger_add_import&lt;/tt&gt; waiting to obtain the &lt;tt&gt;pinger_mutex&lt;/tt&gt;, which is held by the thread spinning on the import lock.&lt;/p&gt;
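&lt;p&gt;To make that chain concrete, here is a minimal userspace sketch (a hypothetical pthread analogue, not Lustre code) in which one thread spins on a never-released spin lock while holding a mutex, so a second thread then hangs on that mutex the same way &lt;tt&gt;mount.lustre&lt;/tt&gt; does:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;/* hypothetical pthread analogue of the blocking chain; the names mirror
 * the Lustre ones but this is not Lustre code */
#include &amp;lt;pthread.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;unistd.h&amp;gt;

static pthread_mutex_t pinger_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_spinlock_t imp_lock;

static void *pinger(void *arg)
{
        pthread_mutex_lock(&amp;amp;pinger_mutex);   /* like ptlrpc_pinger_main() */
        pthread_spin_lock(&amp;amp;imp_lock);        /* spins forever: nobody unlocks */
        pthread_spin_unlock(&amp;amp;imp_lock);
        pthread_mutex_unlock(&amp;amp;pinger_mutex);
        return NULL;
}

static void *mount_thread(void *arg)
{
        pthread_mutex_lock(&amp;amp;pinger_mutex);   /* like ptlrpc_pinger_add_import() */
        puts(&quot;mount proceeds&quot;);                 /* never reached */
        pthread_mutex_unlock(&amp;amp;pinger_mutex);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_spin_init(&amp;amp;imp_lock, PTHREAD_PROCESS_PRIVATE);
        pthread_spin_lock(&amp;amp;imp_lock);        /* simulate the missing unlock */
        pthread_create(&amp;amp;t1, NULL, pinger, NULL);
        sleep(1);                               /* let the pinger take the mutex */
        pthread_create(&amp;amp;t2, NULL, mount_thread, NULL);
        pthread_join(t2, NULL);                 /* hangs, like mount.lustre */
        return 0;
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;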

&lt;p&gt;The question that remains to be answered is: why is the active thread spinning on &lt;tt&gt;imp-&amp;gt;imp_lock&lt;/tt&gt; in the first place?&lt;/p&gt;</comment>
                            <comment id="48147" author="morrone" created="Tue, 20 Nov 2012 18:31:49 +0000"  >&lt;p&gt;I do not see any processes spinning when I run &quot;top&quot;, so I don&apos;t think we have anyone stuck waiting on a spin lock.  I may have just caught it when it happened to be grabbing the lock.  Or the stack may not be entirely reliable.&lt;/p&gt;</comment>
                            <comment id="48149" author="prakash" created="Tue, 20 Nov 2012 19:00:34 +0000"  >&lt;p&gt;But.. But.. The console messages.. The kernel was just mistaken about the thread being stuck for so long?&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;BUG: soft lockup - CPU#33 stuck for 68s! [ll_ping:3244]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;top and the console seem to be at odds.. Personally I&apos;d trust the kernel messages, but I can&apos;t back up that claim.&lt;/p&gt;</comment>
                            <comment id="48151" author="morrone" created="Tue, 20 Nov 2012 19:06:53 +0000"  >&lt;p&gt;soft lockups are often not a permanent situation.  I don&apos;t suspect that ll_ping is permanently stuck.  I saw it appear in top occasionally, as you normally would on an otherwise idle system.  But I&apos;m pretty sure it is not stuck permanently on a single spin lock.&lt;/p&gt;</comment>
                            <comment id="48154" author="prakash" created="Tue, 20 Nov 2012 19:16:27 +0000"  >&lt;p&gt;Liang, can you please look this one over?&lt;/p&gt;</comment>
                            <comment id="48169" author="liang" created="Wed, 21 Nov 2012 04:21:01 +0000"  >&lt;p&gt;I think this is different with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2263&quot; title=&quot;CPU Soft Lockups due to many threads spinning on import lock on Sequoia IO nodes&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2263&quot;&gt;&lt;del&gt;LU-2263&lt;/del&gt;&lt;/a&gt;. There are only two threads are spinning on imp_lock, one is ptlrpc_main, the other is ptlrpcd()-&amp;gt;ptlrpc_check_set(). imp_lock is a heavy contention lock, but I can&apos;t explain why both threads are spinning on it but no thread actually hold it.&lt;br/&gt;
I will update you if I found something later.&lt;/p&gt;</comment>
                            <comment id="48900" author="bzzz" created="Fri, 7 Dec 2012 05:29:08 +0000"  >&lt;p&gt;Christopher, Prakash, could you translace ptlrpc_check_set+0x4f4 to the source line, please?&lt;/p&gt;</comment>
                            <comment id="48919" author="prakash" created="Fri, 7 Dec 2012 13:48:21 +0000"  >&lt;p&gt;I&apos;m taking this from the description in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2263&quot; title=&quot;CPU Soft Lockups due to many threads spinning on import lock on Sequoia IO nodes&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2263&quot;&gt;&lt;del&gt;LU-2263&lt;/del&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;(gdb) l *ptlrpc_check_set+0x4f4
0x46b04 is in ptlrpc_check_set (/builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/client.c:1852).
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;1849 &amp;gt;-------&amp;gt;-------&amp;gt;-------libcfs_nid2str(imp-&amp;gt;imp_connection-&amp;gt;c_peer.nid),
1850 &amp;gt;-------&amp;gt;-------&amp;gt;-------lustre_msg_get_opc(req-&amp;gt;rq_reqmsg));
1851 
1852                 cfs_spin_lock(&amp;amp;imp-&amp;gt;imp_lock);
1853                 /* Request already may be not on sending or delaying list. This
1854                  * may happen in the case of marking it erroneous for the case
1855                  * ptlrpc_import_delay_req(req, status) find it impossible to
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="48971" author="bzzz" created="Mon, 10 Dec 2012 08:55:55 +0000"  >&lt;p&gt;I think this is just another symptom of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2263&quot; title=&quot;CPU Soft Lockups due to many threads spinning on import lock on Sequoia IO nodes&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2263&quot;&gt;&lt;del&gt;LU-2263&lt;/del&gt;&lt;/a&gt; - the pinger got stuck awaiting for imp_lock and holding pinger_mutex. then mount got stuck awaiting for pinger_mutex.&lt;/p&gt;

&lt;p&gt;though one thing in the log doesn&apos;t quite fit the theory:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2012-11-14 14:08:38.748069 {DefaultControlEventListener} [mmcs]{549}.8.1: BUG: soft lockup - CPU#33 stuck for 68s! [ll_ping:3244]
...
2012-11-14 14:32:16.108436 {DefaultControlEventListener} [mmcs]{549}.0.1: SysRq : Show backtrace of all active CPUs
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;so, we had no &quot;soft lockup&quot; messages for ~24min while they are supposed to repeat every ~67s if the thread is making no progress.&lt;/p&gt;

&lt;p&gt;also I&apos;d say there should be little or zero contention at this point: we&apos;re in the middle of mount, and the only possible activity is connect and/or ping.&lt;/p&gt;

&lt;p&gt;it&apos;d be interesting to have a dump and see details of import/lock.&lt;/p&gt;</comment>
                            <comment id="48985" author="prakash" created="Mon, 10 Dec 2012 11:15:25 +0000"  >&lt;p&gt;We don&apos;t have kdump enabled on these systems, but I can try to collect a lustre debug log if that is helpful. Will that contain the information we need (i.e. who is holding the lock)?&lt;/p&gt;

&lt;p&gt;Also, will the &quot;soft lockup&quot; message repeat for the same thread? I had thought it would only print once for each thread, I could be wrong though.&lt;/p&gt;</comment>
                            <comment id="48986" author="bzzz" created="Mon, 10 Dec 2012 12:14:57 +0000"  >&lt;p&gt;yes, I verified that the message will be printed few times with a simple while(1); in the module.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;BUG: soft lockup - CPU#0 stuck for 67s! [osp-syn-0:3547]
...
BUG: soft lockup - CPU#0 stuck for 67s! [osp-syn-0:3547]
...
BUG: soft lockup - CPU#0 stuck for 67s! [osp-syn-0:3547]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;btw, was the kernel compiled with CONFIG_DEBUG_SPINLOCK? this would let us exclude some basic problems at least (like recursive locks).&lt;/p&gt;
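&lt;p&gt;for reference, the reproducer was essentially the following (module boilerplate assumed; only the while(1) matters) - loading it pins one CPU and the soft-lockup message repeats:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;/* sketch of the soft-lockup reproducer; boilerplate is assumed */
#include &amp;lt;linux/module.h&amp;gt;
#include &amp;lt;linux/init.h&amp;gt;

static int __init spin_forever_init(void)
{
        while (1)
                ;               /* busy loop, never yields the CPU */
        return 0;               /* unreachable */
}

module_init(spin_forever_init);
MODULE_LICENSE(&quot;GPL&quot;);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>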
                            <comment id="48992" author="prakash" created="Mon, 10 Dec 2012 12:42:29 +0000"  >&lt;p&gt;Unfortunately, it does not have &lt;tt&gt;CONFIG_DEBUG_SPINLOCK&lt;/tt&gt; enabled. When trying to enable various kernel config options, I run into boot issues. So far, I&apos;ve only been able to successfully enable &lt;tt&gt;CONFIG_DEBUG_SPINLOCK_SLEEP&lt;/tt&gt;. I&apos;ll look into getting &lt;tt&gt;CONFIG_DEBUG_SPINLOCK&lt;/tt&gt; enabled again, next time we get time on Sequoia.&lt;/p&gt;</comment>
                            <comment id="49005" author="bzzz" created="Mon, 10 Dec 2012 15:16:13 +0000"  >&lt;p&gt;please consider CONFIG_LOCK_STAT to collect stats on the locking.&lt;/p&gt;</comment>
                            <comment id="49109" author="bzzz" created="Wed, 12 Dec 2012 05:15:19 +0000"  >&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/#change,4808&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,4808&lt;/a&gt; - the purpose of this debug patch is to make sure this is not ptlrpc_check_set() itself taking too long due to lots of requests.&lt;/p&gt;</comment>
                            <comment id="49168" author="prakash" created="Wed, 12 Dec 2012 19:35:12 +0000"  >&lt;p&gt;Alex, I&apos;ll pull that in.&lt;/p&gt;</comment>
                            <comment id="49462" author="morrone" created="Wed, 19 Dec 2012 14:48:31 +0000"  >&lt;p&gt;Ok, I believe that we&apos;re stuck in spin locks now.  It turns out that &quot;top&quot; isn&apos;t reporting our spinning threads on the ppc64 nodes for some reason.  But when I make my terminal tall enough to fit the 68 processors, I see 4 of the 68 processors at 100% usage.  Two of the 4 match up with IBM processes that always spin by design (polling their network) on cpus 66 and 67.  The other two match up with the cpus that sysrq-l lists as stuck in Lustre code:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2012-12-19 10:46:54.184688 {DefaultControlEventListener} [mmcs]{656}.0.2: --- Exception: 501 at ._spin_lock+0x30/0x44
2012-12-19 10:46:54.185089 {DefaultControlEventListener} [mmcs]{656}.0.2:     LR = .ptlrpc_check_set+0x50c/0x5010 [ptlrpc]
2012-12-19 10:46:54.185599 {DefaultControlEventListener} [mmcs]{656}.0.2: [c0000003e095fb70] [8000000003a19bc4] .ptlrpc_check_set+0x374/0x5010 [ptlrpc] (unreliable)
2012-12-19 10:46:54.186145 {DefaultControlEventListener} [mmcs]{656}.0.2: [c0000003e095fd20] [8000000003a7030c] .ptlrpcd_check+0x66c/0x8a0 [ptlrpc]
2012-12-19 10:46:54.186599 {DefaultControlEventListener} [mmcs]{656}.0.2: [c0000003e095fe40] [8000000003a707f8] .ptlrpcd+0x2b8/0x510 [ptlrpc]
2012-12-19 10:46:54.187056 {DefaultControlEventListener} [mmcs]{656}.0.2: [c0000003e095ff90] [c00000000001b9e0] .kernel_thread+0x54/0x70
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2012-12-19 10:46:54.193259 {DefaultControlEventListener} [mmcs]{656}.1.1: --- Exception: 501 at ._spin_lock+0x38/0x44
2012-12-19 10:46:54.193759 {DefaultControlEventListener} [mmcs]{656}.1.1:     LR = .ptlrpc_pinger_main+0x19c/0xcc0 [ptlrpc]
2012-12-19 10:46:54.194223 {DefaultControlEventListener} [mmcs]{656}.1.1: [c0000003c18fbe20] [8000000003a52fc0] .ptlrpc_pinger_main+0x110/0xcc0 [ptlrpc] (unreliable)
2012-12-19 10:46:54.194700 {DefaultControlEventListener} [mmcs]{656}.1.1: [c0000003c18fbf90] [c00000000001b9e0] .kernel_thread+0x54/0x70
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The former process is ptlrpcd_rcv and the latter is ll_ping.&lt;/p&gt;

&lt;p&gt;These are, not at all coincidentally, the two processes for which there were soft lockup reports on the console:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;BUG: soft lockup - CPU#2 stuck for 67s! [ptlrpcd_rcv:3361]
BUG: soft lockup - CPU#5 stuck for 67s! [ll_ping:3378]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Alex, we have the &lt;a href=&quot;http://review.whamcloud.com/#change,4808&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;4808&lt;/a&gt; patch applied, but I do not see the message from that on the console.&lt;/p&gt;</comment>
                            <comment id="49478" author="bzzz" created="Thu, 20 Dec 2012 03:55:31 +0000"  >&lt;p&gt;ok, thanks for the update. could you verify with gdb this is exactly the same place again?&lt;/p&gt;</comment>
                            <comment id="49495" author="morrone" created="Thu, 20 Dec 2012 12:47:59 +0000"  >&lt;p&gt;Yes, ll_ping is stuck in the first spin_lock() at the beginning of ptlrpc_pinger_process_import, and ptlrpc_rcv is stuck in the spin_lock() in ptlrpc_check_set() right before the &quot;/* Request already may be not&quot;.&lt;/p&gt;

&lt;p&gt;The same places that Prakash noted in earlier comments.&lt;/p&gt;</comment>
                            <comment id="49498" author="bzzz" created="Thu, 20 Dec 2012 12:55:33 +0000"  >&lt;p&gt;got it, thanks. it would be ideal to try with a debug-enabled kernel.. in the meantime I&apos;ll try to cook a local debug patch. sorry, but I still have no any good idea what&apos;s happening here.&lt;/p&gt;</comment>
                            <comment id="49508" author="morrone" created="Thu, 20 Dec 2012 16:40:05 +0000"  >&lt;p&gt;We have about as much kernel debugging enabled as we can.  Crash doesn&apos;t work on these ppc64 nodes, and the kernel won&apos;t boot when many of the kernel debugging options are enabled.  We&apos;ll add anything else that we can get to work, but I wouldn&apos;t plan on getting more.&lt;/p&gt;</comment>
                            <comment id="49526" author="liang" created="Fri, 21 Dec 2012 01:41:12 +0000"  >&lt;p&gt;this patch fixed a deadlock for ptlrpc: &lt;a href=&quot;http://review.whamcloud.com/#change,4880&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,4880&lt;/a&gt; , hope it can be helpful for this too.&lt;/p&gt;</comment>
                            <comment id="49537" author="bzzz" created="Fri, 21 Dec 2012 04:33:00 +0000"  >&lt;p&gt;this patch should dump request/import information if the lock can&apos;t be acquired in 5s: &lt;a href=&quot;http://review.whamcloud.com/4881&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/4881&lt;/a&gt;&lt;br/&gt;
it would be great to give it a spin along with Liang&apos;s patch, if possible.&lt;/p&gt;
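&lt;p&gt;roughly, the idea in that patch is the following (just a sketch; the actual change on gerrit may differ):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;/* sketch only: take imp_lock via spin_trylock() and complain if it
 * cannot be acquired within ~5s, instead of spinning silently */
static void imp_lock_debug(struct obd_import *imp)
{
        unsigned long deadline = jiffies + 5 * HZ;

        while (!spin_trylock(&amp;amp;imp-&amp;gt;imp_lock)) {
                if (time_after(jiffies, deadline)) {
                        CERROR(&quot;import %p, imp_lock = %u\n&quot;, imp,
                               *(unsigned int *)&amp;amp;imp-&amp;gt;imp_lock);
                        deadline = jiffies + 5 * HZ;    /* keep nagging */
                }
                cpu_relax();
        }
        /* caller now holds imp_lock */
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>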
                            <comment id="49731" author="prakash" created="Thu, 27 Dec 2012 16:08:53 +0000"  >&lt;p&gt;Liang, Alex, I pulled in those two patches (with a small tweak to 4881). It will probably be a week or two until we get time on the machine again to test them out.&lt;/p&gt;</comment>
                            <comment id="50160" author="morrone" created="Tue, 8 Jan 2013 16:55:00 +0000"  >&lt;p&gt;Testing lustre &lt;a href=&quot;https://github.com/chaos/lustre/commits/2.3.58-4chaos&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;2.3.58-4chaos&lt;/a&gt;, we hit the mount hang.&lt;/p&gt;

&lt;p&gt;The console shows:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
2013-01-08 13:41:15.901582 {DefaultControlEventListener} [mmcs]{592}.0.2: LustreError: 3261:0:(client.c:1793:ptlrpc_check_set()) @@@ spining..  req@c00000039b36b
400 x1423631018098861/t0(0) o8-&amp;gt;lsfull-OST0034-osc-c0000003c1767800@172.20.1.52@o2ib500:28/4 lens 400/264 e 0 to 0 dl 1357681320 ref 1 fl Complete:RN/0/0 rc 0/0
2013-01-08 13:41:15.902369 {DefaultControlEventListener} [mmcs]{592}.0.2: LustreError: 3261:0:(client.c:1797:ptlrpc_check_set()) import c0000003ce3c7000, obd c00
00003ce29fc80, imp_lock = 2147483654
2013-01-08 13:41:15.902795 {DefaultControlEventListener} [mmcs]{592}.0.2: LustreError: 3261:0:(client.c:1800:ptlrpc_check_set()) LBUG
2013-01-08 13:41:15.903241 {DefaultControlEventListener} [mmcs]{592}.0.2: Call Trace:
2013-01-08 13:41:15.903679 {DefaultControlEventListener} [mmcs]{592}.0.2: [c0000003e91af980] [c000000000008d7c] .show_stack+0x7c/0x184 (unreliable)
2013-01-08 13:41:15.904114 {DefaultControlEventListener} [mmcs]{592}.0.2: [c0000003e91afa30] [8000000000ae0cb8] .libcfs_debug_dumpstack+0xd8/0x150 [libcfs]
2013-01-08 13:41:15.904602 {DefaultControlEventListener} [mmcs]{592}.0.2: [c0000003e91afae0] [8000000000ae1480] .lbug_with_loc+0x50/0xc0 [libcfs]
2013-01-08 13:41:15.905036 {DefaultControlEventListener} [mmcs]{592}.0.2: [c0000003e91afb70] [8000000003abaa7c] .ptlrpc_check_set+0x62c/0x50f0 [ptlrpc]
2013-01-08 13:41:15.905469 {DefaultControlEventListener} [mmcs]{592}.0.2: [c0000003e91afd20] [8000000003b115cc] .ptlrpcd_check+0x66c/0x8a0 [ptlrpc]
2013-01-08 13:41:15.905893 {DefaultControlEventListener} [mmcs]{592}.0.2: [c0000003e91afe40] [8000000003b11ab8] .ptlrpcd+0x2b8/0x510 [ptlrpc]
2013-01-08 13:41:15.906341 {DefaultControlEventListener} [mmcs]{592}.0.2: [c0000003e91aff90] [c00000000001b9e0] .kernel_thread+0x54/0x70
2013-01-08 13:41:15.906756 {DefaultControlEventListener} [mmcs]{592}.2.1: LustreError: dumping log to /tmp/lustre-log.1357681275.3261
2013-01-08 13:41:15.907559 {DefaultControlEventListener} [mmcs]{592}.0.1: ^GMessage from syslogd@(none) at Jan  8 13:41:15 ...
2013-01-08 13:41:15.908017 {DefaultControlEventListener} [mmcs]{592}.0.1:  kernel:LustreError: 3261:0:(client.c:1800:ptlrpc_check_set()) LBUG
2013-01-08 13:42:23.488699 {DefaultControlEventListener} [mmcs]{592}.1.2: BUG: soft lockup - CPU#6 stuck for 68s! [ll_ping:3278]
2013-01-08 13:42:23.489613 {DefaultControlEventListener} [mmcs]{592}.1.2: Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(
U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
2013-01-08 13:42:23.490135 {DefaultControlEventListener} [mmcs]{592}.1.2: NIP: c00000000043972c LR: 8000000003af430c CTR: c0000000004396f8
2013-01-08 13:42:23.490608 {DefaultControlEventListener} [mmcs]{592}.1.2: REGS: c0000003e91f3ba0 TRAP: 0901   Not tainted  (2.6.32-220.23.3.bgq.15llnl.V1R1M2.bgq
62_16.ppc64)
2013-01-08 13:42:23.491118 {DefaultControlEventListener} [mmcs]{592}.1.2: MSR: 0000000080029000 &amp;lt;EE,ME,CE&amp;gt;  CR: 84222484  XER: 20000000
2013-01-08 13:42:23.491631 {DefaultControlEventListener} [mmcs]{592}.1.2: TASK = c0000003e46e6f60[3278] &apos;ll_ping&apos; THREAD: c0000003e91f0000 CPU: 6
2013-01-08 13:42:23.492171 {DefaultControlEventListener} [mmcs]{592}.1.2: GPR00: 0000000080000006 c0000003e91f3e20 c0000000006eece8 c0000003ce3c7278 
2013-01-08 13:42:23.492694 {DefaultControlEventListener} [mmcs]{592}.1.2: GPR04: 2222222222222222 0000000000000000 c0000003e91f3d38 0000000000000000 
2013-01-08 13:42:23.493173 {DefaultControlEventListener} [mmcs]{592}.1.2: GPR08: 0000000000000000 0000000080000006 000000000000000f c0000000004396f8 
2013-01-08 13:42:23.493711 {DefaultControlEventListener} [mmcs]{592}.1.2: GPR12: 8000000003b401a8 c000000000757300 
2013-01-08 13:42:23.494240 {DefaultControlEventListener} [mmcs]{592}.1.2: NIP [c00000000043972c] ._spin_lock+0x34/0x44
2013-01-08 13:42:23.494775 {DefaultControlEventListener} [mmcs]{592}.1.2: LR [8000000003af430c] .ptlrpc_pinger_main+0x19c/0xcc0 [ptlrpc]
2013-01-08 13:42:23.495329 {DefaultControlEventListener} [mmcs]{592}.1.2: Call Trace:
2013-01-08 13:42:23.495845 {DefaultControlEventListener} [mmcs]{592}.1.2: [c0000003e91f3e20] [8000000003af4280] .ptlrpc_pinger_main+0x110/0xcc0 [ptlrpc] (unrelia
ble)
2013-01-08 13:42:23.496386 {DefaultControlEventListener} [mmcs]{592}.1.2: [c0000003e91f3f90] [c00000000001b9e0] .kernel_thread+0x54/0x70
2013-01-08 13:42:23.496898 {DefaultControlEventListener} [mmcs]{592}.1.2: Instruction dump:
2013-01-08 13:42:23.497449 {DefaultControlEventListener} [mmcs]{592}.1.2: 980d0c94 812d0000 7c001829 2c000000 40c20010 7d20192d 40c2fff0 4c00012c 
2013-01-08 13:42:23.497994 {DefaultControlEventListener} [mmcs]{592}.1.2: 2fa00000 4dfe0020 7c210b78 80030000 &amp;lt;2fa00000&amp;gt; 40defff4 7c421378 4bffffc8 
2013-01-08 13:48:58.511240 {DefaultControlEventListener} [mmcs]{592}.14.1: 2013-01-08 13:48:50.540 (INFO ) [0xfffa294f0c0] ibm.cios.sysiod.ClientMonitor: Using large Region=1048576 small =65536 Compile date=Oct 29 2012 Time=18:06:51
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;See the attached lustre log file &quot;lustre-log.1357681275.3261.txt&quot;.&lt;/p&gt;</comment>
                            <comment id="50403" author="bzzz" created="Mon, 14 Jan 2013 05:40:02 +0000"  >&lt;p&gt;00000100:00020000:2.0:1357681275.788314:1632:3261:0:(client.c:1797:ptlrpc_check_set()) import c0000003ce3c7000, obd c0000003ce29fc80, imp_lock = 2147483654&lt;br/&gt;
(gdb) p/x 2147483654&lt;br/&gt;
$1 = 0x80000006&lt;br/&gt;
#ifdef CONFIG_PPC64&lt;br/&gt;
/* use 0x800000yy when locked, where yy == CPU number */&lt;br/&gt;
#define LOCK_TOKEN	(*(u32 *)(&amp;amp;get_paca()-&amp;gt;lock_token))&lt;/p&gt;

&lt;p&gt;so, imp_lock was held by cpu #6, which was later reported with:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2013-01-08 13:42:23.488699 {DefaultControlEventListener} [mmcs]{592}.1.2: BUG: soft lockup - CPU#6 stuck for 68s! [ll_ping:3278]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;this means either CPU#6 is trying to acquire imp_lock recursively (unlikely, the code is easy to follow) or CPU#6 has scheduled out a thread holding imp_lock and is running the pinger.&lt;/p&gt;
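&lt;p&gt;for the record, decoding the token is trivial (standalone sketch; the value is taken from the log above):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;/* decode a PPC64 spinlock word per the LOCK_TOKEN comment above:
 * 0x800000yy when locked, yy == owning CPU number */
#include &amp;lt;stdio.h&amp;gt;

int main(void)
{
        unsigned int slock = 2147483654;        /* imp_lock value from the log */

        if (slock &amp;amp; 0x80000000)
                printf(&quot;held by CPU#%u\n&quot;, slock &amp;amp; 0xff);  /* prints 6 */
        else
                printf(&quot;not held (0x%x)\n&quot;, slock);
        return 0;
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>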
                            <comment id="50417" author="prakash" created="Mon, 14 Jan 2013 10:43:49 +0000"  >&lt;p&gt;I have enabled &lt;tt&gt;CONFIG_DEBUG_SPINLOCK&lt;/tt&gt; on these kernels, but was unable to turn on &lt;tt&gt;CONFIG_LOCK_STAT&lt;/tt&gt;.&lt;/p&gt;

&lt;p&gt;Chris, the next time we hit this, can you log into the node and check the console? If we schedule while holding a spin lock, it should print this message:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;        printk(KERN_ERR                                                         
                &quot;BUG: sleeping function called from invalid context at %s:%d\n&quot;,
                        file, line);                                            
        printk(KERN_ERR                                                         
                &quot;in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n&quot;,    
                        in_atomic(), irqs_disabled(),                           
                        current-&amp;gt;pid, current-&amp;gt;comm);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This &lt;em&gt;should&lt;/em&gt; also make it to the logs, but I don&apos;t trust the way we filter some messages out.&lt;/p&gt;</comment>
                            <comment id="50418" author="bzzz" created="Mon, 14 Jan 2013 10:46:47 +0000"  >&lt;p&gt;oh, great news. thanks, Prakash!&lt;/p&gt;</comment>
                            <comment id="50435" author="adilger" created="Mon, 14 Jan 2013 15:14:40 +0000"  >&lt;p&gt;I&apos;ve landed &lt;a href=&quot;http://review.whamcloud.com/#change,4880&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,4880&lt;/a&gt; to master.&lt;/p&gt;</comment>
                            <comment id="50465" author="bzzz" created="Mon, 14 Jan 2013 23:18:39 +0000"  >&lt;p&gt;Andreas, it seems 4880 didn&apos;t help in this case, see comment from Dec 28&lt;/p&gt;</comment>
                            <comment id="50815" author="prakash" created="Fri, 18 Jan 2013 12:00:32 +0000"  >&lt;p&gt;So, during testing yesterday I looped through many &lt;tt&gt;mount&lt;/tt&gt;/&lt;tt&gt;unmounts&lt;/tt&gt; of the file system on each of the Sequoia IONs running &lt;a href=&quot;https://github.com/chaos/lustre/commits/2.3.58-5chaos&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;2.3.58-5chaos&lt;/a&gt; and didn&apos;t run into a single hang. This is confusing because Chris noted that we hit the hang on 2.358-4chaos, and not much has changed between the two tags:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;$ git diff 2.3.58-4chaos..2.3.58-5chaos
diff --git a/lustre/lod/lod_qos.c b/lustre/lod/lod_qos.c
index 187f85c..4e9a651 100644
--- a/lustre/lod/lod_qos.c
+++ b/lustre/lod/lod_qos.c
@@ -1264,9 +1264,11 @@ static int lod_qos_parse_config(const struct lu_env *env,
 
        if (magic == __swab32(LOV_USER_MAGIC_V1)) {
                lustre_swab_lov_user_md_v1(v1);
+               magic = v1-&amp;gt;lmm_magic;
        } else if (magic == __swab32(LOV_USER_MAGIC_V3)) {
                v3 = buf-&amp;gt;lb_buf;
                lustre_swab_lov_user_md_v3(v3);
+               magic = v3-&amp;gt;lmm_magic;
        }
 
        if (unlikely(magic != LOV_MAGIC_V1 &amp;amp;&amp;amp; magic != LOV_MAGIC_V3)) {
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Also, I looked into the kernel sources some more, and contrary to my comment above, I don&apos;t think anything will be printed to the console if we schedule while holding a spinlock. The lines I quoted above only print if &lt;tt&gt;might_sleep&lt;/tt&gt; is called, and the message printed from within &lt;tt&gt;schedule_debug&lt;/tt&gt; is only printed if &lt;tt&gt;CONFIG_PREEMPT&lt;/tt&gt; is enabled (which is not on the IONs).&lt;/p&gt;</comment>
                            <comment id="51349" author="morrone" created="Mon, 28 Jan 2013 13:29:30 +0000"  >&lt;p&gt;Lowered priority on this one.  We haven&apos;t seen the problem in a couple of weeks.  Not entirely sure why/where it could have been fixed, but this is not going to get our attention if we don&apos;t see it again.&lt;/p&gt;</comment>
                            <comment id="52237" author="morrone" created="Tue, 12 Feb 2013 17:21:00 +0000"  >&lt;p&gt;Had some evidence that this problem is still hitting.  Of course, the debugging patch itself is fatal (calls LBUG()), so I&apos;m not 100% sure that the problem wasn&apos;t just the debug patch itself.  I modified the debug patch to print the debug message and drop a backtrace, but not LBUG.&lt;/p&gt;

&lt;p&gt;But sadly, I don&apos;t think this bug fixed itself. :)&lt;/p&gt;</comment>
                            <comment id="52241" author="prakash" created="Tue, 12 Feb 2013 20:02:03 +0000"  >&lt;p&gt;I&apos;ve bumped the priority of this back to &quot;Blocker&quot; status as we&apos;ve hit this a number of times today, running 2.3.58-12chaos.&lt;/p&gt;

&lt;p&gt;We&apos;ve modified the debug patch a bit, but it basically shows the same information. A few example messages from the logs:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2013-02-12 16:04:00.355651 {DefaultControlEventListener} [mmcs]{169}.0.1: LustreError: 6708:0:(client.c:1793:ptlrpc_check_set()) @@@ spinning too long...  req@c000000386202000 x1426811847180481/t0(0) o8-&amp;gt;lsfull-OST0048-osc-c0000003c6ee5c00@172.20.1.72@o2ib500:28/4 lens 400/264 e 0 to 0 dl 1360713885 ref 1 fl Complete:RN/0/0 rc 0/0
2013-02-12 16:04:00.356429 {DefaultControlEventListener} [mmcs]{169}.0.1: LustreError: 6708:0:(client.c:1797:ptlrpc_check_set()) import c000000383708800, obd c000000358703880, imp_lock = 2147483652
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2013-02-12 16:05:42.669084 {DefaultControlEventListener} [mmcs]{535}.0.2: LustreError: 5383:0:(client.c:1793:ptlrpc_check_set()) @@@ spinning too long...  req@c00000039ea4d800 x1426811960426622/t0(0) o8-&amp;gt;lsfull-OST0005-osc-c00000038533e000@172.20.1.5@o2ib500:28/4 lens 400/264 e 0 to 0 dl 1360713987 ref 1 fl Complete:RN/0/0 rc 0/0
2013-02-12 16:05:42.669920 {DefaultControlEventListener} [mmcs]{535}.0.2: LustreError: 5383:0:(client.c:1797:ptlrpc_check_set()) import c000000382a4a800, obd c0000003c360c5c0, imp_lock = 2147483655
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It&apos;s interesting to note that the &lt;tt&gt;raw_lock.slock&lt;/tt&gt; value is slightly different in these two messages. In hex: &lt;tt&gt;2147483652 = 0x80000004 != 2147483655 = 0x80000007&lt;/tt&gt;.&lt;/p&gt;</comment>
                            <comment id="52356" author="bzzz" created="Thu, 14 Feb 2013 05:42:48 +0000"  >&lt;p&gt;Christopher, do you think the system get back to &quot;normal&quot; after you removed LBUG() ? or those threads keep spinning on the locks forever?&lt;/p&gt;

&lt;p&gt;Prakash, the values are different because the locks belong to different imports, so their current owners differ (CPU#4 and CPU#7).&lt;/p&gt;</comment>
                            <comment id="52389" author="morrone" created="Thu, 14 Feb 2013 13:00:27 +0000"  >&lt;p&gt;It hit again after removing the LBUG(), so now I can definitively say that the two threads get stuck spinning on a spin lock indefinitely.&lt;/p&gt;
</comment>
                            <comment id="52391" author="bzzz" created="Thu, 14 Feb 2013 13:04:12 +0000"  >&lt;p&gt;OK, then there should be a list of the backtraces ?&lt;/p&gt;</comment>
                            <comment id="52406" author="morrone" created="Thu, 14 Feb 2013 17:16:21 +0000"  >&lt;p&gt;Please see the attached files seqio162_console.txt (contains sysrq-l and sysrq-t backtraces) and seqio162_lustre_log.txt.&lt;/p&gt;</comment>
                            <comment id="52431" author="bzzz" created="Fri, 15 Feb 2013 07:48:05 +0000"  >&lt;p&gt;ok, thanks... are you still running on the kernel with CONFIG_DEBUG_SPINLOCK ? if so, let&apos;s drop that lustre patch and try to see an owner?&lt;/p&gt;</comment>
                            <comment id="52445" author="prakash" created="Fri, 15 Feb 2013 12:01:27 +0000"  >&lt;p&gt;Unfortunately, CONFIG_DEBUG_SPINLOCK was dropped when we upgraded the kernel a couple weeks ago. I need to talk to the admins about why they dropped it, and if it can get enabled again. I&apos;m also interested in finding out who the owner is.&lt;/p&gt;</comment>
                            <comment id="52463" author="morrone" created="Fri, 15 Feb 2013 13:49:08 +0000"  >&lt;p&gt;Yes, when we went to the RHEL6.3 kernel on Sequoia last Thursday, we lost CONFIG_DEBUG_SPINLOCK.  Apparently the kernel would no longer operate with that enabled, as we&apos;ve seen with other debug settings.&lt;/p&gt;</comment>
                            <comment id="52736" author="bzzz" created="Wed, 20 Feb 2013 04:27:22 +0000"  >&lt;p&gt;any chance to get it back ?&lt;/p&gt;</comment>
                            <comment id="52752" author="prakash" created="Wed, 20 Feb 2013 12:13:14 +0000"  >&lt;p&gt;I&apos;m not sure yet, I need to get a test machine to play around with the kernel and determine what the underlying issue is with that option set. Pending work to be done. Is that the best option available to us? I&apos;d imagine knowing what the &quot;owner&quot; field is would tell us a lot.&lt;/p&gt;</comment>
                            <comment id="52753" author="bzzz" created="Wed, 20 Feb 2013 12:23:10 +0000"  >&lt;p&gt;well, I&apos;ve read through the code few times yet. Liang did as well. still no idea how do we get into this.. sorry. I hoped to get a hint from the kernel.&lt;/p&gt;</comment>
                            <comment id="52835" author="prakash" created="Thu, 21 Feb 2013 16:00:23 +0000"  >&lt;p&gt;Alex, we&apos;ve re-enabled CONFIG_DEBUG_SPINLOCK on the ION kernels. I&apos;ve added a patch to print out the &lt;tt&gt;void *owner&lt;/tt&gt;, &lt;tt&gt;int magic&lt;/tt&gt;, and &lt;tt&gt;int owner_cpu&lt;/tt&gt; fields of the &lt;tt&gt;spinlock_t&lt;/tt&gt; when the &quot;spinning&quot; message is printed. Is there any other information I should print when this hits? Or any other debugging info that would be useful?&lt;/p&gt;</comment>
                            <comment id="53047" author="prakash" created="Tue, 26 Feb 2013 16:13:39 +0000"  >&lt;p&gt;I&apos;ve tried reproducing the hang with CONFIG_DEBUG_SPINLOCK enabled, but haven&apos;t been able to trigger it.&lt;/p&gt;</comment>
                            <comment id="53795" author="jlevi" created="Tue, 12 Mar 2013 12:18:51 +0000"  >&lt;p&gt;Have you been able to reproduce this issue? Or should we close this ticket and reopen if it occurs again?&lt;/p&gt;</comment>
                            <comment id="53796" author="morrone" created="Tue, 12 Mar 2013 12:37:31 +0000"  >&lt;p&gt;We seem to have the problem that enabling kernel spin lock debugging makes the problem go away.  I think that explains why we had a period of time before that the problem went away as well.&lt;/p&gt;

&lt;p&gt;So I would say that the ticket should remain open, because there seems to be a lustre race here somewhere.  But the priority can be dropped in favor of other things that are more pressing.&lt;/p&gt;</comment>
                            <comment id="60114" author="morrone" created="Thu, 6 Jun 2013 19:59:02 +0000"  >&lt;p&gt;This is still a problem.  Unfortunately, we never seem to hit the problem while CONFIG_DEBUG_SPINLOCK is enabled, but it hits fairly frequently with it disabled.&lt;/p&gt;

&lt;p&gt;In the latest occurrence that I investigated yesterday, I see the following stacks:&lt;/p&gt;

&lt;p&gt;CPU1:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;ptlrpc_check_set
ptlrpcd_check
ptlrpcd
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It appears to be stuck spinning on the &quot;while (spin_trylock (&amp;amp;imp-&amp;gt;imp_lock)&quot; line.&lt;/p&gt;

&lt;p&gt;CPU5:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;ptlrpc_pinger_process_import (inlined)
ptlrpc_pinger_main
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It is spinning on &quot;spin_lock(&amp;amp;imp-&amp;gt;imp_lock)&quot; in ptlrpc_pinger_process_import(), while holding the pinger mutex.&lt;/p&gt;

&lt;p&gt;Not spinning, but stuck sleeping on the pinger mutex is the mount.lustre process:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;schedule
__mutex_lock_slowpath
mutex_lock
ptlrpc_pinger_add_import
client_connect_import
lov_connect_obd
lov_connect
ll_fill_super
lustre_fill_super
get_sb_nodev
lustre_get_sb
vfs_kern_mount
do_kern_mount
do_mount
sys_mount
syscall_exit
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;But mount.lustre waiting on the mutex is understood since the other thread is holding it.  We still need to understand who is holding the import spin lock.&lt;/p&gt;

&lt;p&gt;Looking at our debugging statement (&quot;spinning too long&quot;), I see that the value of the spin lock is 2147483652. In hex, that is:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;0x80000004&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Looking at the Linux kernel PowerPC spin lock code, I see that when the lock is set, the least significant byte is set to the CPU number.  So the lock value looks correct, and apparently the lock was taken on a CPU that is not currently in use by any processes.&lt;/p&gt;

&lt;p&gt;So that just seems to further support the assumption that Lustre is failing to unlock the imp_lock somewhere.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="17086">LU-2572</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="12152" name="lustre-log.1357681275.3261.txt" size="3164674" author="morrone" created="Tue, 8 Jan 2013 16:52:21 +0000"/>
                            <attachment id="12256" name="seqio162_console.txt" size="1660608" author="morrone" created="Thu, 14 Feb 2013 17:16:21 +0000"/>
                            <attachment id="12255" name="seqio162_lustre_log.txt" size="3203611" author="morrone" created="Thu, 14 Feb 2013 17:16:21 +0000"/>
                            <attachment id="12050" name="seqio542_console_sysrq.log" size="1735499" author="morrone" created="Wed, 14 Nov 2012 17:50:40 +0000"/>
                            <attachment id="12049" name="seqio542_lustre.log" size="8379960" author="morrone" created="Wed, 14 Nov 2012 17:50:40 +0000"/>
                            <attachment id="12048" name="seqio652_lustre.log" size="8388351" author="morrone" created="Wed, 14 Nov 2012 17:50:40 +0000"/>
                            <attachment id="12047" name="seqio652_lustre2.log" size="4219813" author="morrone" created="Wed, 14 Nov 2012 17:50:40 +0000"/>
                    </attachments>
                <subtasks>
                            <subtask id="17086">LU-2572</subtask>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10490" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>End date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Thu, 26 Jun 2014 17:24:45 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvc9j:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>5558</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10493" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>Start date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Wed, 14 Nov 2012 17:24:45 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    </customfields>
    </item>
</channel>
</rss>