<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:53:55 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5720] lustre client hangs on possible imp_lock deadlock</title>
                <link>https://jira.whamcloud.com/browse/LU-5720</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;The node becomes unresponsive to users and the lustre client appears to be hung after being evicted by the MDT.  The node remains responsive to SysRq.  After crashing the node, it boots and mounts lustre successfully.&lt;/p&gt;

&lt;p&gt;The symptoms develop as follows:&lt;/p&gt;

&lt;p&gt;First the node starts reporting connection lost/connection restored notices for an OST (the same one repeatedly).  Then the node reports it has been evicted by the MDT.  There are then a series of failure messages that appear to be the normal consequence of the eviction.&lt;/p&gt;

&lt;p&gt;We then start seeing &quot;spinning too long&quot; messages from ptlrpc_check_set() within the ptlrpcd_rcv task, and the kernel starts reporting soft lockups on tasks ptlrpcd* and ll_ping.  The node becomes unresponsive to everything other than SysRq.  The operators then crash the node, and it comes up and mounts lustre successfully.&lt;/p&gt;</description>
                <environment>Sequoia, 2.6.32-431.23.3.1chaos, github.com/chaos/lustre </environment>
        <key id="26922">LU-5720</key>
            <summary>lustre client hangs on possible imp_lock deadlock</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="niu">Niu Yawei</assignee>
                                    <reporter username="ofaaland">Olaf Faaland</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Wed, 8 Oct 2014 23:44:05 +0000</created>
                <updated>Mon, 21 Nov 2016 17:46:36 +0000</updated>
                            <resolved>Mon, 21 Nov 2016 17:46:36 +0000</resolved>
                                    <version>Lustre 2.4.2</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="95979" author="ofaaland" created="Wed, 8 Oct 2014 23:45:27 +0000"  >&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;When the console log reports the node is evicted by mds, the errors
that follow are in:                                                
        mdc_enqueue, vvp_io_init, ll_close_inode_openhandle,       
        lmv_fid_alloc, mdc_intent_getattr_async_interpret,         
        ll_inode_revalidate_fini                                   

The stack dump following the &quot;spinning too long&quot; message is:
        ptlrpc_check_set+0x638                              
        ptlrpcd_check+0x66c
        ptlrpcd+0x37c                                       

The soft lockups and their stack dumps vary somewhat between instances,
so I&apos;ve listed them below.                                             

When the operators execute SysRq &quot;t&quot; (list current tasks and their
information) we see ll_ping, ptlrpcd_rcv, and one or more ptlrpcd_#
tasks are active.  This matches what we see in crash when analyzing
the core.                                                          

Unfortunately the machine is classified and so I cannot post the log
data.  Also, neither the SysRq output nor the crash dump shows the stack
traces for the active lustre tasks.                                     

Soft lockup details:
9/16 instance       

        soft lockups reported on ptlrpcd*, ll_ping, ll_flush  (all these
        tasks seem to complete before the machine is crashed)           
                ptlrpcd_# with stack traces like:                       
                        NIP: spin_lock                                  
                        ptlrpc_check_set+0x103c                         
                        ptlrpcd_check+0x66c                             
                        ptlrpcd+0x37c                                   

                or like:
                        NIP: spin_lock
                        ptlrpc_set_import_discon+0x58
                        ptlrpc_fail_import+0x8c      
                        ptlrpc_expire_one_request+0x464
                        ptlrpc_expired_set+0x104       
                        ptlrpcd+0x35c                  

                ll_flush stack traces like this:
                        NIP: spin_lock          
                        sptlrpc_import_sec_ref+0x1c
                        import_sec_validate_get+0x50
                        sptlrpc_req_get_ctx+0x9c    
                        __ptlrpc_request_bufs_pack+0xa4
                        ptlrpc_request_pack+0x34       
                        osc_brw_prep_request+0x21      
                        osc_build_rpc+0xc8c            
                        ...                            

                ll_ping stack traces like this:
                        NIP: spin_lock         
                        mutex_lock+0x34        
                        ptlrpc_pinger_main+0x190
                                [this is pinger.c:259
                                spin_lock(&amp;amp;imp-&amp;gt;imp_lock)]


9/27 instance
        soft lockups on ptlrpcd*, ll_ping (all these tasks seem to
        complete before the machine is crashed)                   

                ll_ping stack traces like this:
                        NIP: spin_lock
                        ptlrpc_pinger_main+0x190
                                [this is pinger.c:259
                                spin_lock(&amp;amp;imp-&amp;gt;imp_lock)]
                ptlrpcd_# with stack traces like:
                        NIP: spin_lock
                        ptlrpc_set_import_discon+0x58
                        ptlrpc_fail_import+0x8c
                        ptlrpc_expire_one_request+0x464
                        ptlrpc_expired_set+0x104
                        ptlrpcd+0x35c
                or like:
                        NIP: spin_lock
                        sptlrpc_import_sec_ref+0x1c
                        import_sec_validate_get+0x50
                        sptlrpc_req_refresh_ctx+0x10c
                        ptlrpc_check_set+0x1d34
                        ptlrpcd_check+0x66c
                        ptlrpcd+0x37c
                or like:
                        NIP: spin_lock
                        ptlrpc_check_set+0x103c
                        ptlrpcd_check+0x66c
                        ptlrpcd+0x37c
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="95980" author="ofaaland" created="Wed, 8 Oct 2014 23:46:39 +0000"  >&lt;p&gt;Looked for similar tickets.  &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2327&quot; title=&quot;Clients stuck in mount&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2327&quot;&gt;&lt;del&gt;LU-2327&lt;/del&gt;&lt;/a&gt; has some similarities.  I also see in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2263&quot; title=&quot;CPU Soft Lockups due to many threads spinning on import lock on Sequoia IO nodes&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2263&quot;&gt;&lt;del&gt;LU-2263&lt;/del&gt;&lt;/a&gt; that Liang mentions the stack traces suggest there may be a case where imp_lock is not unlocked.&lt;/p&gt;</comment>
                            <comment id="95982" author="pjones" created="Thu, 9 Oct 2014 00:43:03 +0000"  >&lt;p&gt;Niu&lt;/p&gt;

&lt;p&gt;Could you please assist with this issue?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="95983" author="green" created="Thu, 9 Oct 2014 01:16:16 +0000"  >&lt;p&gt;I think this is really &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2263&quot; title=&quot;CPU Soft Lockups due to many threads spinning on import lock on Sequoia IO nodes&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2263&quot;&gt;&lt;del&gt;LU-2263&lt;/del&gt;&lt;/a&gt; that we only had a &quot;guess&quot; patch for and possibly guessed wrong.&lt;/p&gt;

&lt;p&gt;What&apos;s the exact tag in your github tree this system is running at, btw?&lt;/p&gt;</comment>
                            <comment id="95984" author="green" created="Thu, 9 Oct 2014 01:26:20 +0000"  >&lt;p&gt;Another question, you say that &quot;all these tasks seem to complete before the machine is crashed&quot;  - do you mean that you do not have a single task with example traces in the crashdump and all traces are from the log, or do you mean that it&apos;s just ll_flush that completes?&lt;/p&gt;

&lt;p&gt;If it&apos;s the former, how long did you wait in this state before crashing the box (on the off chance the condition clears itself after a while)?&lt;/p&gt;</comment>
                            <comment id="95985" author="green" created="Thu, 9 Oct 2014 01:53:19 +0000"  >&lt;p&gt;Also I guess since you have the crashdump, please check for me if you see threads in a state similar to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4300&quot; title=&quot;ptlrpcd threads deadlocked in cl_lock_mutex_get&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4300&quot;&gt;&lt;del&gt;LU-4300&lt;/del&gt;&lt;/a&gt; (in the initial report).&lt;/p&gt;</comment>
                            <comment id="95988" author="ofaaland" created="Thu, 9 Oct 2014 03:20:56 +0000"  >&lt;p&gt;Oleg:&lt;br/&gt;
  1) This system was at tag 2.4.2-14chaos&lt;br/&gt;
  2) By &quot;tasks seem to complete&quot; I meant that I thought those tasks were able to take hold of the lock before the crash, because the soft lockups were not continually reported.  However, I am not certain now; the &quot;spinning too long&quot; messages continue for &amp;gt;2 hours, and although I see ptlrpcd_rcv appear in soft lockups, they are not repeated every 67 seconds or anything like it.  I will look at that more tomorrow and get back to you.&lt;/p&gt;

&lt;p&gt;The stack traces are almost entirely from the soft lockup messages in the logs.   &lt;/p&gt;

&lt;p&gt;I don&apos;t have a core for the 9/27 instance.&lt;br/&gt;
For the 9/16 instance, the crashdump provides a stack for flush-lustre-1:&lt;br/&gt;
 writeback_sb_inodes&lt;br/&gt;
 writeback_inodes_wb&lt;br/&gt;
 wb_writeback&lt;br/&gt;
 wb_do_writeback&lt;br/&gt;
 bdi_writeback_task&lt;br/&gt;
 bdi_start_fn&lt;/p&gt;

&lt;p&gt;However, there is no stack for:&lt;br/&gt;
 ptlrpcd_rcv&lt;br/&gt;
 ptlrpcd_0&lt;br/&gt;
 ptlrpcd_2&lt;br/&gt;
 ll_ping&lt;/p&gt;

&lt;p&gt;The rest of the cores are idle (swapper).&lt;/p&gt;</comment>
                            <comment id="96054" author="ofaaland" created="Thu, 9 Oct 2014 17:30:01 +0000"  >&lt;p&gt;No, I see no threads similar to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4300&quot; title=&quot;ptlrpcd threads deadlocked in cl_lock_mutex_get&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4300&quot;&gt;&lt;del&gt;LU-4300&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="96093" author="green" created="Thu, 9 Oct 2014 23:04:31 +0000"  >&lt;p&gt;Thanks, Olaf.&lt;br/&gt;
Any idea when you&apos;ll be able to cross-check whether the regularity of the soft lockups indeed suggests the locks are actually released?&lt;/p&gt;

&lt;p&gt;Also when you say there&apos;s no stack for those processes - do you mean we don&apos;t know where they sit and they are on cpu (though usually crash shows stack for on-cpu threads too) or something else?&lt;/p&gt;

&lt;p&gt;Another side indicator you could use is to check that once there&apos;s a soft lockup message for some process (esp. one that is not bound to any cpu), it always repeats on the same cpu (if it does not, that&apos;s a pretty big hint that the lock was obtained, then released, and then the process was rescheduled somewhere else later on and again was waiting for a long time).&lt;/p&gt;</comment>
                            <comment id="96102" author="green" created="Thu, 9 Oct 2014 23:30:33 +0000"  >&lt;p&gt;Also, do you have CONFIG_DEBUG_SPINLOCK enabled by any chance? If you do, then we can see what cpu holds the lock from the memory dump and perhaps see who&apos;s sitting on that cpu and what it is doing.&lt;/p&gt;</comment>
                            <comment id="96104" author="ofaaland" created="Fri, 10 Oct 2014 00:19:23 +0000"  >&lt;p&gt;Oleg,&lt;/p&gt;

&lt;p&gt;I cannot find a way to prove whether any of those tasks were able to take hold of the lock or not.  &lt;/p&gt;

&lt;p&gt;All the details below are for the 9/16 incident, which is the one I have a crash dump for.&lt;/p&gt;

&lt;p&gt;I looked more closely at the logs and found that the soft lockup message appears only once for each task, and the tasks are on distinct CPUs:&lt;/p&gt;

&lt;p&gt;ptlrpcd_rcv  cpu0&lt;br/&gt;
ptlrpcd_0  cpu6&lt;br/&gt;
ptlrpcd_2  cpu10&lt;br/&gt;
flush-lustre-1 cpu8&lt;br/&gt;
ll_ping  cpu13&lt;/p&gt;

&lt;p&gt;Furthermore, those are the same CPUs that those tasks are on in the dump.  The &quot;spinning too long&quot; message from within ptlrpcd_rcv gives all the same information, including the same deadline, for the entire 2+ hours, and produces only one &quot;soft lockup&quot; message.  Based on this, I believe these tasks may have been spinning on the lock until the crash.&lt;/p&gt;

&lt;p&gt;Crash produces a stack for only two tasks that were on CPUs:  the task that triggered the page fault, which was in swapper, and the flush-lustre-1 task.  For the other tasks on-cpu, &quot;bt &amp;lt;pid&amp;gt;&quot; produces only the header giving the PID, address of the task struct, cpu, and command.&lt;/p&gt;

&lt;p&gt;Unfortunately, CONFIG_DEBUG_SPINLOCK was not enabled so we got no report of the owner of the lock.  I&apos;ll find out whether that&apos;s something we can change.&lt;/p&gt;</comment>
                            <comment id="96105" author="green" created="Fri, 10 Oct 2014 00:30:11 +0000"  >&lt;p&gt;I found that sometimes using other options of bt helps to produce a backtrace when normal bt does not work,&lt;br/&gt;
like say bt -o.&lt;/p&gt;

&lt;p&gt;It is still kind of strange that the flush-lustre thread backtrace does not match the backtrace from watchdog.&lt;/p&gt;</comment>
                            <comment id="96107" author="ofaaland" created="Fri, 10 Oct 2014 01:02:16 +0000"  >&lt;p&gt;Thanks, Oleg.  I tried all the options except -o and -O (as this machine is PPC).  -T produces a listing (for example, on ll_ping task)  but I don&apos;t believe it&apos;s a valid backtrace.  No lustre functions appear at all.&lt;/p&gt;

&lt;p&gt;You&apos;re right about the flush-lustre thread.  I just verified the backtraces are different.&lt;/p&gt;</comment>
                            <comment id="96110" author="green" created="Fri, 10 Oct 2014 02:11:01 +0000"  >&lt;p&gt;Thanks.&lt;/p&gt;

&lt;p&gt;While examining your code I found you carry a patch &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2327&quot; title=&quot;Clients stuck in mount&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2327&quot;&gt;&lt;del&gt;LU-2327&lt;/del&gt;&lt;/a&gt; that also seems very related to the problem at hand. Also unresolved at the time.&lt;br/&gt;
Now, at the end of it there&apos;s this interesting comment by Chris:&lt;br/&gt;
&quot;Looking at the Linux kernel PowerPC spin lock code, I see that when lock is set, the least significant byte is set to the CPU number.&quot;&lt;/p&gt;

&lt;p&gt;Granted, in that ticket it was later determined that the lock was taken by a cpu that was idle, meaning something just forgot to release the lock. But I still wonder if you can check the current situation in your crashdump too, just in case.&lt;/p&gt;</comment>
                            <comment id="96115" author="morrone" created="Fri, 10 Oct 2014 03:14:22 +0000"  >&lt;p&gt;Ah, that is right, I remember that.&lt;/p&gt;

&lt;p&gt;On one hit of this bug, from seqlac6, 2014-09-16 19:09:24, imp_lock = 2147483658.&lt;/p&gt;

&lt;p&gt;In hex, that is 0x8000000a.  So in theory CPU 10 locked the spin lock.  Fortunately, unlike &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2327&quot; title=&quot;Clients stuck in mount&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2327&quot;&gt;&lt;del&gt;LU-2327&lt;/del&gt;&lt;/a&gt;, CPU 10 appears to have ptlrpcd_2 running on it at the time of the crash dump.  Granted, the crash dump was manually triggered more than ten minutes later, but let&apos;s assume it is stuck and holding the spin lock for now.&lt;/p&gt;

&lt;p&gt;I also found that I could get some messy stack information for ptlrpcd_2.  Olaf is new, and we told him about running &quot;mod -S&quot; when you use crash.  Unfortunately that advice probably made things worse in this instance on PPC.  I seem to get somewhat more useful symbol information &lt;em&gt;without&lt;/em&gt; using mod -S for the active processes.  The symbols all change drastically, and not to anything sensible, when I use mod -S.&lt;/p&gt;

&lt;p&gt;Crash still can&apos;t give me a clean backtrace, but if I use &quot;bt -T&quot; I can at least get symbols.  I haven&apos;t attempted to disambiguate the symbols into a sane stack, but here it is:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;sprintf
kernel_thread
symbol_string
ptlrpc_set_import_discon
pointer
hvc_console_print
pointer
hvc_console_print
up
up
try_acquire_console_sem
release_console_sem
try_acquire_console_sem
vprintk
task_tick_fair
scheduler_tick
run_local_timers
update_process_times
account_system_time
account_system_time
do_softirq
xics_get_irq_lpar
irq_exit
do_IRQ
timer_interrupt
__wake_up
hardware_interrupt_entry
cfs_waitq_signal
ptlrpc_init
cfs_free
cfs_free
_spin_lock
_spin_lock
ptlrpc_set_import_discon
lnet_md_unlink
lnet_eq_enqueue_event
LNetMDUnlink
_debug_req
ptlrpc_set_import_discon
ptlrpc_unregister_reply
ptlrpc_fail_import
ptlrpc_expire_one_request
ptlrpc_expired_set
cfs_waitq_timedwait
ptlrpcd
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The bottom of the stack contains ptlrpcd(), and that matches this being the ptlrpcd_2 thread.  The lower stack also looks like things the ptlrpcd would do.&lt;/p&gt;</comment>
                            <comment id="96118" author="green" created="Fri, 10 Oct 2014 03:56:06 +0000"  >&lt;p&gt;I audited the call paths leading to ptlrpc_set_import_discon from ptlrpc_expired_set(), and I don&apos;t see any other users of imp_lock on the way, which seems to mean the lock was locked by some other thread.&lt;/p&gt;

&lt;p&gt;I guess there&apos;s another alternative for the lock not being released, though. Another somewhat plausible scenario would be if something that holds this spinlock goes to sleep.&lt;br/&gt;
Without seeing what other threads you might be having that sleep while being called from ptlrpc code, it&apos;s hard to evaluate that.&lt;br/&gt;
I noticed that at least on x86, crash ps output has the last cpu the thread was running on (third column labeled CPU in my case), which might help to further shrink suspect threads.&lt;/p&gt;

&lt;p&gt;I noticed that you tried to run with spinlock debug in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2327&quot; title=&quot;Clients stuck in mount&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2327&quot;&gt;&lt;del&gt;LU-2327&lt;/del&gt;&lt;/a&gt;, that actually has code to detect a case of getting spinlock on one cpu, and releasing it on another. Something like:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; kernel: [28303.243508] BUG: spinlock wrong CPU on CPU#0, ptlrpcd_0/12315 (Not tainted)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;You did not report anything like that back then (other than setting this made the problem disappear outright for some reason), though.&lt;/p&gt;

&lt;p&gt;Either way, if we discount memory corruption (pretty low probability I think) and compiler issues (probably pretty low probability too), the only two remaining options are: something sleeps holding this spinlock, or something forgets to release it (and I looked through all callers several times, and everything matches up properly).&lt;/p&gt;</comment>
                            <comment id="97321" author="ofaaland" created="Thu, 23 Oct 2014 21:05:08 +0000"  >&lt;p&gt;Oleg, we found one task that ps indicates last ran on CPU #10, and was sleeping at the time the node was crashed.  Task 15059, comm=srun.  Its backtrace is:&lt;/p&gt;

&lt;p&gt;schedule&lt;br/&gt;
__mutex_lock_slowpath&lt;br/&gt;
mutex_lock&lt;br/&gt;
mdc_enqueue&lt;br/&gt;
lmv_enqueue&lt;br/&gt;
ll_layout_refresh&lt;br/&gt;
vvp_io_init&lt;br/&gt;
cl_io_init0&lt;br/&gt;
ll_file_io_generic&lt;br/&gt;
ll_file_aio_write&lt;br/&gt;
do_sync_readv_writev&lt;br/&gt;
do_readv_writev&lt;br/&gt;
sys_writev&lt;/p&gt;

&lt;p&gt;I don&apos;t see that it would be holding an import lock, but I may have missed something.&lt;/p&gt;</comment>
                            <comment id="97505" author="green" created="Sat, 25 Oct 2014 14:50:15 +0000"  >&lt;p&gt;Just to confirm, is this the only task that is reported as last run on cpu 10?&lt;/p&gt;</comment>
                            <comment id="97649" author="ofaaland" created="Tue, 28 Oct 2014 00:57:23 +0000"  >&lt;p&gt;No, 110 tasks were reported as last running on cpu 10. &lt;/p&gt;

&lt;p&gt;There were 10 tasks that last ran on cpu 10 with stacks like this (comm=ldlm_bl_*):&lt;br/&gt;
schedule&lt;br/&gt;
cfs_waitq_wait&lt;br/&gt;
ldlm_bl_thread_main&lt;/p&gt;

&lt;p&gt;1 like this (comm=ldlm_cb02_007):&lt;br/&gt;
cfs_waitq_wait&lt;br/&gt;
ptlrpc_wait_event&lt;br/&gt;
ptlrpc_main&lt;/p&gt;

&lt;p&gt;The other 98 tasks that had last run on cpu 10 had no lustre functions in their stacks.&lt;/p&gt;</comment>
                            <comment id="97858" author="green" created="Wed, 29 Oct 2014 16:52:11 +0000"  >&lt;p&gt;Hm, I see. This is not really promising then.&lt;/p&gt;

&lt;p&gt;Is this still something that hits for you regularly now or is this like in the past hit a few times and then stopped?&lt;/p&gt;</comment>
                            <comment id="97862" author="ofaaland" created="Wed, 29 Oct 2014 17:13:33 +0000"  >&lt;p&gt;This is a current problem.  We are hitting it intermittently.  We have 6 lac nodes on Sequoia; the most recent incident was last Tuesday.&lt;/p&gt;

&lt;p&gt;At the end of today we plan to install the kernel that has spinlock debug turned on, so we will have more information next time it happens (if it doesn&apos;t magically go away with the kernel change).&lt;/p&gt;</comment>
                            <comment id="97894" author="green" created="Wed, 29 Oct 2014 23:23:23 +0000"  >&lt;p&gt;Ok.&lt;br/&gt;
Last time the problem did go away magically, I believe.&lt;br/&gt;
Please keep us informed about occurrences (or lack of them) with the spinlock-debug enabled kernel.&lt;/p&gt;</comment>
                            <comment id="174189" author="niu" created="Fri, 18 Nov 2016 05:04:51 +0000"  >&lt;p&gt;Any further report on this issue? Can we close it?&lt;/p&gt;</comment>
                            <comment id="174480" author="ofaaland" created="Mon, 21 Nov 2016 17:25:50 +0000"  >&lt;p&gt;I will check and get back to you.&lt;/p&gt;</comment>
                            <comment id="174482" author="ofaaland" created="Mon, 21 Nov 2016 17:44:57 +0000"  >&lt;p&gt;After rebuilding the kernel with CONFIG_DEBUG_SPINLOCK enabled, we stopped seeing the problem.  You can close it.  We won&apos;t have any additional information to help track it down.&lt;/p&gt;</comment>
                            <comment id="174483" author="pjones" created="Mon, 21 Nov 2016 17:46:36 +0000"  >&lt;p&gt;ok. Thanks Olaf.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10490" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>End date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Mon, 4 May 2015 23:44:05 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwy5b:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>16049</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10493" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>Start date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Wed, 8 Oct 2014 23:44:05 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    </customfields>
    </item>
</channel>
</rss>