[LU-5720] lustre client hangs on possible imp_lock deadlock Created: 08/Oct/14  Updated: 21/Nov/16  Resolved: 21/Nov/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.2
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Olaf Faaland Assignee: Niu Yawei (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: llnl
Environment:

Sequoia, 2.6.32-431.23.3.1chaos, github.com/chaos/lustre


Severity: 3
Rank (Obsolete): 16049

 Description   

The node becomes unresponsive to users and the lustre client appears to be hung after being evicted by the MDT. The node remains responsive to SysRq. After crashing the node, it boots and mounts lustre successfully.

The symptoms develop as follows:

First the node starts reporting connection lost/connection restored
notices for an OST (same one repeatedly). Then the node reports it has
been evicted by the MDT. There are then a series of failure messages
that appear to be the normal consequence of the eviction.

We then start seeing "spinning too long" messages from
ptlrpc_check_set() within the ptlrpcd_rcv task, and the kernel starts
reporting soft lockups on tasks ptlrpcd* and ll_ping. The node becomes
unresponsive to everything other than SysRq. The operators then crash
the node, and it comes up and mounts lustre successfully.



 Comments   
Comment by Olaf Faaland [ 08/Oct/14 ]
When the console log reports the node is evicted by mds, the errors
that follow are in:                                                
        mdc_enqueue, vvp_io_init, ll_close_inode_openhandle,       
        lmv_fid_alloc, mdc_intent_getattr_async_interpret,         
        ll_inode_revalidate_fini                                   

The stack dump following the "spinning too long" message is:
        ptlrpc_check_set+0x638                              
        ptlrpcd_check+0x66c
        ptlrpcd+0x37c                                       

The soft lockups and their stack dumps vary somewhat between instances,
so I've listed them below.                                             

When the operators execute SysRq "t" (list current tasks and their
information) we see ll_ping, ptlrpcd_rcv, and one or more ptlrpcd_#
tasks are active.  This matches what we see in crash when analyzing
the core.                                                          

Unfortunately the machine is classified and so I cannot post the log
data.  Also, neither the SysRq output nor the crash dump shows the stack
traces for the active lustre tasks.                                     

Soft lockup details:
9/16 instance       

        soft lockups reported on ptlrpcd*, ll_ping, ll_flush  (all these
        tasks seem to complete before the machine is crashed)           
                ptlrpcd_# with stack traces like:                       
                        NIP: spin_lock                                  
                        ptlrpc_check_set+0x103c                         
                        ptlrpcd_check+0x66c                             
                        ptlrpcd+0x37c                                   

                or like:
                        NIP: spin_lock
                        ptlrpc_set_import_discon+0x58
                        ptlrpc_fail_import+0x8c      
                        ptlrpc_expire_one_request+0x464
                        ptlrpc_expired_set+0x104       
                        ptlrpcd+0x35c                  

                ll_flush stack traces like this:
                        NIP: spin_lock          
                        sptlrpc_import_sec_ref+0x1c
                        import_sec_validate_get+0x50
                        sptlrpc_req_get_ctx+0x9c    
                        __ptlrpc_request_bufs_pack+0xa4
                        ptlrpc_request_pack+0x34       
                        osc_brw_prep_request+0x21      
                        osc_build_rpc+0xc8c            
                        ...                            

                ll_ping stack traces like this:
                        NIP: spin_lock         
                        mutex_lock+0x34        
                        ptlrpc_pinger_main+0x190
                                [this is pinger.c:259
                                spin_lock(&imp->imp_lock)]


9/27 instance
        soft lockups on ptlrpcd*, ll_ping (all these tasks seem to
        complete before the machine is crashed)                   

                ll_ping stack traces like this:
                        NIP: spin_lock
                        ptlrpc_pinger_main+0x190
                                [this is pinger.c:259
                                spin_lock(&imp->imp_lock)]
                ptlrpcd_# with stack traces like:
                        NIP: spin_lock
                        ptlrpc_set_import_discon+0x58
                        ptlrpc_fail_import+0x8c
                        ptlrpc_expire_one_request+0x464
                        ptlrpc_expired_set+0x104
                        ptlrpcd+0x35c
                or like:
                        NIP: spin_lock
                        sptlrpc_import_sec_ref+0x1c
                        import_sec_validate_get+0x50
                        sptlrpc_req_refresh_ctx+0x10c
                        ptlrpc_check_set+0x1d34
                        ptlrpcd_check+0x66c
                        ptlrpcd+0x37c
                or like:
                        NIP: spin_lock
                        ptlrpc_check_set+0x103c
                        ptlrpcd_check+0x66c
                        ptlrpcd+0x37c
Comment by Olaf Faaland [ 08/Oct/14 ]

Looked for similar tickets. LU-2327 has some similarities. I also see that in LU-2263 Liang mentions stack traces suggesting there may be a case where imp_lock is not unlocked.

Comment by Peter Jones [ 09/Oct/14 ]

Niu

Could you please assist with this issue?

Thanks

Peter

Comment by Oleg Drokin [ 09/Oct/14 ]

I think this is really LU-2263, for which we only had a "guess" patch and possibly guessed wrong.

By the way, what is the exact tag in your github tree that this system is running?

Comment by Oleg Drokin [ 09/Oct/14 ]

Another question: you say that "all these tasks seem to complete before the machine is crashed". Do you mean that you do not have a single task with the example traces in the crashdump and all traces are from the log, or do you mean that it's just ll_flush that completes?

If it's the former, how long did you wait in this state before crashing the box (on the off chance the condition clears itself after a while)?

Comment by Oleg Drokin [ 09/Oct/14 ]

Also, since you have the crashdump, please check for me whether you see threads in a state similar to LU-4300 (as in the initial report).

Comment by Olaf Faaland [ 09/Oct/14 ]

Oleg:
1) This system was at tag 2.4.2-14chaos
2) By "tasks seem to complete" I meant that I thought those tasks were able to take hold of the lock before the crash, because the soft lockups were not continually reported. However, I an not certain now; the "spinning too long" messages continue for >2 hours, and although I see ptlrpcd_rcv appear in soft lockups, they are not repeated every 67 seconds nor anything like it. I will look at that more tomorrow and get back to you.

The stack traces are almost entirely from the soft lockup messages in the logs.

I don't have a core for the 9/27 instance.
For the 9/16 instance, the crashdump provides a stack for flush-lustre-1:
writeback_sb_inodes
writeback_inodes_wb
wb_writeback
wb_do_writeback
bdi_writeback_task
bdi_start_fn

However, there is no stack for:
ptlrpcd_rcv
ptlrpcd_0
ptlrpcd_2
ll_ping

The rest of the cores are idle (swapper).

Comment by Olaf Faaland [ 09/Oct/14 ]

No, I see no threads similar to LU-4300.

Comment by Oleg Drokin [ 09/Oct/14 ]

Thanks, Olaf.
Any idea when you'll be able to cross-check whether the regularity of the soft lockups indeed suggests that the locks are actually being released?

Also, when you say there's no stack for those processes - do you mean that we don't know where they sit even though they are on a cpu (though usually crash shows the stack for on-cpu threads too), or something else?

Another side indicator you could use is to check whether, once there's a soft lockup message for some process (esp. one that is not bound to any cpu), it always repeats on the same cpu (if it does not, that's a pretty big hint that the lock was obtained, then released, and the process was later rescheduled somewhere else and again waited for a long time).

Comment by Oleg Drokin [ 09/Oct/14 ]

Also, do you have CONFIG_DEBUG_SPINLOCK enabled by any chance? If you do, then we can see from the memory dump which cpu holds the lock, and perhaps see who is sitting on that cpu and what it is doing.
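
For reference, CONFIG_DEBUG_SPINLOCK makes the lock itself record its last acquirer. A rough sketch of the 2.6.32-era spinlock_t with that option on, field names as I recall them from include/linux/spinlock_types.h:

    typedef struct {
            raw_spinlock_t raw_lock;        /* architecture-specific lock word */
    #ifdef CONFIG_DEBUG_SPINLOCK
            unsigned int magic, owner_cpu;  /* owner_cpu: cpu that last took the lock */
            void *owner;                    /* task_struct pointer of the acquirer */
    #endif
    } spinlock_t;

With that set, dumping the obd_import's imp_lock in crash (e.g. with the struct command) should show owner and owner_cpu directly.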

Comment by Olaf Faaland [ 10/Oct/14 ]

Oleg,

I cannot find a way to prove whether any of those tasks were able to take hold of the lock or not.

All the details below are for the 9/16 incident, which is the one I have a crash dump for.

I looked more closely at the logs and found that the soft lockup message appears only once for each task, and the tasks are on distinct CPUs:

ptlrpcd_rcv cpu0
ptlrpcd_0 cpu6
ptlrpcd_2 cpu10
flush-lustre-1 cpu8
ll_ping cpu13

Furthermore, those are the same CPUs that those tasks are on in the dump. The "spinning too long" message from within ptlrpcd_rcv gives all the same information, including the same deadline, for the entire 2+ hours, and produces only one "soft lockup" message. Based on this, I believe these tasks may have been spinning on the lock until the crash.

Crash produces a stack for only two of the tasks that were on CPUs: the task that triggered the page fault (which was in swapper) and the flush-lustre-1 task. For the other on-cpu tasks, "bt <pid>" produces only the header giving the PID, address of the task struct, cpu, and command.

Unfortunately, CONFIG_DEBUG_SPINLOCK was not enabled so we got no report of the owner of the lock. I'll find out whether that's something we can change.

Comment by Oleg Drokin [ 10/Oct/14 ]

I have found that sometimes using other options of bt helps to produce a backtrace when plain bt does not work, like, say, bt -o.

It is still kind of strange that the flush-lustre thread backtrace does not match the backtrace from the watchdog.

Comment by Olaf Faaland [ 10/Oct/14 ]

Thanks, Oleg. I tried all the options except -o and -O (as this machine is PPC). -T produces a listing (for example, on the ll_ping task), but I don't believe it's a valid backtrace; no lustre functions appear at all.

You're right about the flush-lustre thread. I just verified the backtraces are different.

Comment by Oleg Drokin [ 10/Oct/14 ]

Thanks.

While examining your code I found you carry a patch from LU-2327, which also seems very related to the problem at hand, and was also unresolved at the time.
Now, at the end of it there's this interesting comment by Chris:
"Looking at the Linux kernel PowerPC spin lock code, I see that when lock is set, the least significant byte is set to the CPU number."

Granted, in that ticket it was later determined that the lock was taken by a cpu that was idle, meaning something just forgot to release the lock. But I still wonder if you can check the current situation in your crashdump too, just in case.

Comment by Christopher Morrone [ 10/Oct/14 ]

Ah, that is right, I remember that.

On one hit of this bug, from seqlac6, 2014-09-16 19:09:24, imp_lock = 2147483658.

In hex, that is 0x8000000a, so in theory CPU 10 locked the spin lock. Fortunately, unlike LU-2327, CPU 10 appears to have ptlrpcd_2 running on it at the time of the crash dump. Granted, the crash dump was manually triggered more than ten minutes later, but let's assume it is stuck and holding the spin lock for now.
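
A quick decode of that raw value, just to make the least-significant-byte reading explicit; treating the high bit as merely the "locked" marker is my assumption, the low-byte-is-the-cpu part is from the PowerPC spin lock code quoted above:

    #include <stdio.h>

    /* Decode the imp_lock word read out of the dump.  On this PPC64 kernel the
     * least significant byte of a held lock holds the acquiring cpu number. */
    int main(void)
    {
            unsigned int imp_lock = 2147483658U;  /* value from the crash dump */

            printf("imp_lock = 0x%08x, holder cpu = %u\n",
                   imp_lock, imp_lock & 0xffU);   /* prints 0x8000000a, cpu 10 */
            return 0;
    }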

I also found that I could get some messy stack information for ptlrpcd_2. Olaf is new, and we told him about running "mod -S" when using crash. Unfortunately, that advice probably made things worse in this instance on PPC. I seem to get somewhat more useful symbol information without using mod -S for the active processes; the symbols all change drastically, and not to anything sensible, when I use mod -S.

Crash still can't give me a clean backtrace, but if I use "bt -T" I can at least get symbols. I haven't attempted to disambiguate the symbols into a sane stack, but here it is:

sprintf
kernel_thread
symbol_string
ptlrpc_set_import_discon
pointer
hvc_console_print
pointer
hvc_console_print
up
up
try_acquire_console_sem
release_console_sem
try_acquire_console_sem
vprintk
task_tick_fair
scheduler_tick
run_local_times
update_process_times
account_system_time
account_system_time
do_softirq
xics_get_irq_lpar
irq_exit
do_IRQ
timer_interrupt
__wake_up
hardware_interrupt_entry
cfs_waitq_signal
ptlrpc_init
cfs_free
cfs_free
_spin_lock
_spin_lock
ptlrpc_set_import_discon
lnet_md_unlink
lnet_eq_eneueue_event
LNetMDUnlink
_debug_req
ptlrpc_set_import_discon
ptlrpc_unregister_reply
ptlrpc_fail_import
ptlrpc_expire_one_request
ptlrpc_expired_set
cfs_waitq_timedwait
ptlrpcd

The bottom of the stack contains ptlrpcd(), and that matches this being the ptlrpcd_2 thread. The lower stack also looks like things the ptlrpcd would do.

Comment by Oleg Drokin [ 10/Oct/14 ]

I audited the call paths leading to ptlrpc_set_import_discon() from ptlrpc_expired_set(), and I don't see any other users of imp_lock along the way, which seems to mean that the lock was taken by some other thread.

I guess there's another possible explanation for the lock not being released, though: a somewhat plausible scenario is that something goes to sleep while holding this spinlock.
Without seeing which of your other threads sleep while being called from ptlrpc code, it's hard to evaluate that.
I noticed that, at least on x86, crash ps output shows the last cpu each thread was running on (the third column, labeled CPU, in my case), which might help to further narrow down the suspect threads.

I noticed that you tried to run with spinlock debug in LU-2327; that actually has code to detect the case of taking a spinlock on one cpu and releasing it on another. Something like:

 kernel: [28303.243508] BUG: spinlock wrong CPU on CPU#0, ptlrpcd_0/12315 (Not tainted)

You did not report anything like that back then (other than that setting this made the problem disappear outright for some reason), though.

Either way, if we discount memory corruption (pretty low probability, I think) and compiler issues (probably pretty low probability too), the only two remaining options are: something sleeps holding this spinlock, or something forgets to release it (and I looked through all the callers several times, and everything matches up properly).
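
To make the first option concrete, here is a small userspace analogue (not Lustre code, the names are invented) of "something sleeps holding this spinlock"; in the kernel the waiter role is played by the ptlrpcd/ll_ping threads that report the soft lockups:

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Userspace sketch of the hypothesized bug: the holder blocks inside the
     * critical section, so every other contender spins at 100% cpu until the
     * holder finally runs again and drops the lock. */
    static pthread_spinlock_t imp_lock;

    static void *holder(void *arg)
    {
            pthread_spin_lock(&imp_lock);
            sleep(5);                        /* stand-in for any blocking call
                                              * made while the lock is held */
            pthread_spin_unlock(&imp_lock);
            return NULL;
    }

    static void *waiter(void *arg)
    {
            pthread_spin_lock(&imp_lock);    /* busy-waits here the whole time */
            pthread_spin_unlock(&imp_lock);
            printf("waiter finally got the lock\n");
            return NULL;
    }

    int main(void)
    {
            pthread_t h, w;

            pthread_spin_init(&imp_lock, PTHREAD_PROCESS_PRIVATE);
            pthread_create(&h, NULL, holder, NULL);
            sleep(1);                        /* let the holder win the race */
            pthread_create(&w, NULL, waiter, NULL);
            pthread_join(h, NULL);
            pthread_join(w, NULL);
            return 0;
    }

Whether anything in the ptlrpc paths can actually block with imp_lock held is exactly the open question.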

Comment by Olaf Faaland [ 23/Oct/14 ]

Oleg, we found one task that ps indicates last ran on CPU #10, and was sleeping at the time the node was crashed. Task 15059, comm=srun. Its backtrace is:

schedule
__mutex_lock_slowpath
mutex_lock
mdc_enqueue
lmv_enqueue
ll_layout_refresh
vvp_io_init
cl_io_init0
ll_file_io_generic
ll_file_aio_write
do_sync_readv_writev
do_readv_writev
sys_writev

I don't see that it would be holding an import lock, but I may have missed something.

Comment by Oleg Drokin [ 25/Oct/14 ]

Just to confirm, is this the only task that is reported as having last run on cpu 10?

Comment by Olaf Faaland [ 28/Oct/14 ]

No, 110 tasks were reported as last running on cpu 10.

There were 10 tasks that last ran on cpu 10 with stacks like this (comm=ldlm_bl_*):
schedule
cfs_waitq_wait
ldlm_bl_thread_main

1 like this (comm=ldlm_cb02_007):
cfs_waitq_wait
ptlrpc_wait_event
ptlrpc_main

The other 98 tasks that had last run on cpu 10 had no lustre functions in their stacks.

Comment by Oleg Drokin [ 29/Oct/14 ]

Hm, I see. This is not really promising then.

Is this still something that hits you regularly now, or did it, like in the past, hit a few times and then stop?

Comment by Olaf Faaland [ 29/Oct/14 ]

This is a current problem; we are hitting it intermittently. We have 6 lac nodes on sequoia, and the most recent incident was last Tuesday.

At the end of today we plan to install the kernel that has spinlock debug turned on, so we will have more information next time it happens (if it doesn't magically go away with the kernel change).

Comment by Oleg Drokin [ 29/Oct/14 ]

Ok.
Last time the problem did go away magically, I believe.
Please keep us informed about occurrences (or lack of them) with the spinlock-debug-enabled kernel.

Comment by Niu Yawei (Inactive) [ 18/Nov/16 ]

Any further reports on this issue? Can we close it?

Comment by Olaf Faaland [ 21/Nov/16 ]

I will check and get back to you.

Comment by Olaf Faaland [ 21/Nov/16 ]

After rebuilding the kernel with CONFIG_DEBUG_SPINLOCK enabled, we stopped seeing the problem. You can close it. We won't have any additional information to help track it down.

Comment by Peter Jones [ 21/Nov/16 ]

ok. Thanks Olaf.
