LU-5720: lustre client hangs on possible imp_lock deadlock

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.2
    • Environment: Sequoia, 2.6.32-431.23.3.1chaos, github.com/chaos/lustre
    • Severity: 3
    • 16049

    Description

      The node becomes unresponsive to users and the lustre client appears to be hung after being evicted by the MDT. The node remains responsive to SysRq. After crashing the node, it boots and mounts lustre successfully.

      The symptoms develop as follows:

      First the node starts reporting connection lost/connection restored
      notices for an OST (same one repeatedly). Then the node reports it has
      been evicted by the MDT. There are then a series of failure messages
      that appear to be the normal consequence of the eviction.

      We then start seeing "spinning too long" messages from
      ptlrpc_check_set() within the ptlrpcd_rcv task, and the kernel starts
      reporting soft lockups on tasks ptlrpcd* and ll_ping. The node becomes
      unresponsive to everything other than SysRq. The operators then crash
      the node, and it comes up and mounts lustre successfully.

      Attachments

        Activity

          [LU-5720] lustre client hangs on possible imp_lock deadlock

          niu Niu Yawei (Inactive) added a comment -

          Any further report on this issue? Can we close it?
          green Oleg Drokin added a comment -

          Ok.
          Last time the problem did go away magically, I believe.
          Please keep us informed about occurrences (or lack of them) with the spinlock-debug enabled kernel.

          ofaaland Olaf Faaland added a comment - edited

          This is a current problem. We are hitting it intermittently. We have 6 lac nodes on Sequoia; the most recent incident was last Tuesday.

          At the end of today we plan to install the kernel that has spinlock debug turned on, so we will have more information next time it happens (if it doesn't magically go away with the kernel change).

          green Oleg Drokin added a comment -

          Hm, I see. This is not really promising then.

          Is this still something that hits you regularly now, or did it, as in the past, hit a few times and then stop?


          ofaaland Olaf Faaland added a comment -

          No, 110 tasks were reported as last running on CPU 10.

          There were 10 tasks that last ran on cpu 10 with stacks like this (comm=ldlm_bl_*):
          schedule
          cfs_waitq_wait
          ldlm_bl_thread_main

          1 like this (comm=ldlm_cb02_007):
          cfs_waitq_wait
          ptlrpc_wait_event
          ptlrpc_main

          The other 98 tasks that had last run on CPU 10 had no lustre functions in their stacks.

          green Oleg Drokin added a comment -

          Just to confirm, is this the only task that is reported as last run on cpu 10?

          ofaaland Olaf Faaland added a comment -

          Oleg, we found one task that ps indicates last ran on CPU #10, and was sleeping at the time the node was crashed. Task 15059, comm=srun. Its backtrace is:

          schedule
          __mutex_lock_slowpath
          mutex_lock
          mdc_enqueue
          lmv_enqueue
          ll_layout_refresh
          vvp_io_init
          cl_io_init0
          ll_file_io_generic
          ll_file_aio_write
          do_sync_readv_writev
          do_readv_writev
          sys_writev

          I don't see that it would be holding an import lock, but I may have missed something.

          green Oleg Drokin added a comment -

          I audited the call paths leading to ptlrpc_set_import_discon() from ptlrpc_expired_set() and I don't see any other users of imp_lock on the way, which seems to mean that the lock was locked by some other thread.

          I guess there's another alternative for the lock not being released, though: a somewhat plausible scenario is that something holding this spinlock goes to sleep.
          Without seeing which of your other threads sleep while being called from ptlrpc code, it's hard to evaluate that.
          I noticed that, at least on x86, crash ps output shows the last CPU each thread ran on (third column, labeled CPU in my case), which might help further narrow down the suspect threads.

          I noticed that you tried to run with spinlock debug in LU-2327; that actually has code to detect the case of taking a spinlock on one CPU and releasing it on another. Something like:

           kernel: [28303.243508] BUG: spinlock wrong CPU on CPU#0, ptlrpcd_0/12315 (Not tainted)
          

          You did not report anything like that back then (other than that setting this made the problem disappear outright for some reason), though.
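
          (That message comes from the generic spinlock debugging code: with CONFIG_DEBUG_SPINLOCK each lock records which CPU took it, and the unlock path complains when a different CPU releases it. A simplified sketch of that check, not the exact lib/spinlock_debug.c source:)

           /* simplified: debug spinlocks carry owner bookkeeping */
           struct debug_spinlock {
                   unsigned int raw;         /* the real lock word */
                   unsigned int owner_cpu;   /* CPU that took the lock */
           };

           static void debug_spin_unlock_check(struct debug_spinlock *lock,
                                               unsigned int this_cpu)
           {
                   if (lock->owner_cpu != this_cpu)
                           printk(KERN_EMERG "BUG: spinlock wrong CPU on CPU#%u\n",
                                  this_cpu);   /* the message quoted above */
           }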

          Either way, if we discount memory corruption (pretty low probability, I think) and compiler issues (probably pretty low probability too), the only two remaining options are: something sleeps holding this spinlock, or something forgets to release it (and I looked through all callers several times, and everything matches up properly).
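
          A minimal generic sketch of those two failure modes, using a stand-in lock rather than the real imp_lock (illustration only, not Lustre code):

           #include <linux/spinlock.h>
           #include <linux/delay.h>

           static DEFINE_SPINLOCK(example_lock);   /* stand-in for imp->imp_lock */

           /* failure mode 1: sleeping with the spinlock held */
           static void sleeps_while_holding(void)
           {
                   spin_lock(&example_lock);
                   /* BUG: msleep() schedules; any other CPU waiting on
                    * example_lock now spins until a soft lockup is reported */
                   msleep(100);
                   spin_unlock(&example_lock);
           }

           /* failure mode 2: a path that forgets to release the lock */
           static int forgets_to_release(int error)
           {
                   spin_lock(&example_lock);
                   if (error)
                           return error;   /* BUG: early return leaks the lock */
                   spin_unlock(&example_lock);
                   return 0;
           }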


          morrone Christopher Morrone (Inactive) added a comment -

          Ah, that is right, I remember that.

          On one hit of this bug, from seqlac6, 2014-09-16 19:09:24, imp_lock = 2147483658.

          In hex, that is 0x8000000a. So in theory CPU 10 locked the spin lock. Fortunately, unlike LU-2327, CPU 10 appears to have ptlrpcd_2 running on it at the time of the crash dump. Granted, the crash dump was manually triggered more than ten minutes later, but let's assume it is stuck and holding the spin lock for now.
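
          Spelling the arithmetic out (assuming the 0x800000yy layout described in LU-2327, where the low byte is the locking CPU; userspace decode for illustration only):

           #include <stdio.h>

           int main(void)
           {
                   unsigned int slock = 2147483658u;   /* imp_lock value from the dump */

                   printf("raw          = 0x%08x\n", slock);        /* 0x8000000a */
                   printf("locked token = 0x%06x\n", slock >> 8);   /* 0x800000 -> held */
                   printf("holder cpu   = %u\n", slock & 0xffu);    /* 10 */
                   return 0;
           }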

          I also found that I could get some messy stack information for ptlrpcd_2. Olaf is new, and we told him about running "mod -S" when you use crash. Unfortunately that advice probably made things worse in this instance on PPC. I seem to get somewhat more useful symbol information without using mod -S for the active processes. The symbols all change drastically, and not to anything sensible, when I use mod -S.

          Crash still can't give me a clean backtrace, but if I use "bt -T" I can at least get symbols. I haven't attempted to disambiguate the symbols into a sane stack, but here it is:

          sprintf
          kernel_thread
          symbol_string
          ptlrpc_set_import_discon
          pointer
          hvc_console_print
          pointer
          hvc_console_print
          up
          up
          try_acquire_console_sem
          release_console_sem
          try_acquire_console_sem
          vprintk
          task_tick_fair
          scheduler_tick
          run_local_timers
          update_process_times
          account_system_time
          account_system_time
          do_softirq
          xics_get_irq_lpar
          irq_exit
          do_IRQ
          timer_interrupt
          __wake_up
          hardware_interrupt_entry
          cfs_waitq_signal
          ptlrpc_init
          cfs_free
          cfs_free
          _spin_lock
          _spin_lock
          ptlrpc_set_import_discon
          lnet_md_unlink
          lnet_eq_enqueue_event
          LNetMDUnlink
          _debug_req
          ptlrpc_set_import_discon
          ptlrpc_unregister_reply
          ptlrpc_fail_import
          ptlrpc_expire_one_request
          ptlrpc_expired_set
          cfs_waitq_timedwait
          ptlrpcd
          

          The bottom of the stack contains ptlrpcd(), and that matches this being the ptlrpcd_2 thread. The lower stack also looks like things the ptlrpcd would do.

          green Oleg Drokin added a comment -

          Thanks.

          While examining your code I found you carry a patch from LU-2327 that also seems very related to the problem at hand, and was also unresolved at the time.
          Now, at the end of it there's this interesting comment by Chris:
          "Looking at the Linux kernel PowerPC spin lock code, I see that when lock is set, the least significant byte is set to the CPU number."

          Granted, in that ticket it was later determined that the lock was taken by a CPU that was idle, meaning something just forgot to release the lock. But I still wonder if you can check the current situation in your crash dump too, just in case.

          ofaaland Olaf Faaland added a comment -

          Thanks, Oleg. I tried all the options except -o and -O (as this machine is PPC). -T produces a listing (for example, on the ll_ping task) but I don't believe it's a valid backtrace. No lustre functions appear at all.

          You're right about the flush-lustre thread. I just verified the backtraces are different.


          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: ofaaland Olaf Faaland
            Votes: 0
            Watchers: 7
