[LU-5720] lustre client hangs on possible imp_lock deadlock Created: 08/Oct/14 Updated: 21/Nov/16 Resolved: 21/Nov/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Olaf Faaland | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
Sequoia, 2.6.32-431.23.3.1chaos, github.com/chaos/lustre |
||
| Severity: | 3 |
| Rank (Obsolete): | 16049 |
| Description |
|
The node becomes unresponsive to users and the lustre client appears to be hung after being evicted by the MDT. The node remains responsive to SysRq. After crashing the node, it boots and mounts lustre successfully. The symptoms develop as follows: first the node starts reporting connection lost / connection restored messages; we then start seeing "spinning too long" messages from |
| Comments |
| Comment by Olaf Faaland [ 08/Oct/14 ] |
When the console log reports the node is evicted by mds, the errors
that follow are in:
mdc_enqueue, vvp_io_init, ll_close_inode_openhandle,
lmv_fid_alloc, mdc_intent_getattr_async_interpret,
ll_inode_revalidate_fini
The stack dump following the "spinning too long" message is:
ptlrpc_check_set+0x638
ptlrpcd_check+0x66c
ptlrpcd+0x37c
The soft lockups and their stack dumps vary somewhat between instances,
so I've listed them below.
When the operators execute SysRq "t" (list current tasks and their
information) we see ll_ping, ptlrpcd_rcv, and one or more ptlrpcd_#
tasks are active. This matches what we see in crash when analyzing
the core.
Unfortunately the machine is classified and so I cannot post the log
data. Also, neither the SysRq output nor the crash dump shows the stack
traces for the active lustre tasks.
Soft lockup details:
9/16 instance
soft lockups reported on ptlrpcd*, ll_ping, ll_flush (all these
tasks seem to complete before the machine is crashed)
ptlrpcd_# with stack traces like:
NIP: spin_lock
ptlrpc_check_set+0x103c
ptlrpcd_check+0x66c
ptlrpcd+0x37c
or like:
NIP: spin_lock
ptlrpc_set_import_discon+0x58
ptlrpc_fail_import+0x8c
ptlrpc_expire_one_request+0x464
ptlrpc_expired_set+0x104
ptlrpcd+0x35c
ll_flush stack traces like this:
NIP: spin_lock
sptlrpc_import_sec_ref+0x1c
import_sec_validate_get+0x50
sptlrpc_req_get_ctx+0x9c
__ptlrpc_request_bufs_pack+0xa4
ptlrpc_request_pack+0x34
osc_brw_prep_request+0x21
osc_build_rpc+0xc8c
...
ll_ping stack traces like this:
NIP: spin_lock
mutex_lock+0x34
ptlrpc_pinger_main+0x190
[this is pinger.c:259
spin_lock(&imp->imp_lock)]
9/27 instance
soft lockups on ptlrpcd*, ll_ping (all these tasks seem to
complete before the machine is crashed)
ll_ping stack traces like this:
NIP: spin_lock
ptlrpc_pinger_main+0x190
[this is pinger.c:259
spin_lock(&imp->imp_lock)]
ptlrpcd_# with stack traces like:
NIP: spin_lock
ptlrpc_set_import_discon+0x58
ptlrpc_fail_import+0x8c
ptlrpc_expire_one_request+0x464
ptlrpc_expired_set+0x104
ptlrpcd+0x35c
or like:
NIP: spin_lock
sptlrpc_import_sec_ref+0x1c
import_sec_validate_get+0x50
sptlrpc_req_refresh_ctx+0x10c
ptlrpc_check_set+0x1d34
ptlrpcd_check+0x66c
ptlrpcd+0x37c
or like:
NIP: spin_lock
ptlrpc_check_set+0x103c
ptlrpcd_check+0x66c
ptlrpcd+0x37c
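For context, the ll_ping traces above bottom out in the pinger's per-import loop; the following is a rough, abridged paraphrase of that loop (not an exact copy of ptlrpc_pinger_main(); details vary by version), just to show why the pinger contends on both the pinger mutex and imp_lock:

```
/* Abridged sketch of the pinger main loop, paraphrased from the 2.4-era
 * ptlrpc/pinger.c -- approximate, not authoritative. */
cfs_list_t *iter;

mutex_lock(&pinger_mutex);              /* the mutex_lock frame in the ll_ping trace */
cfs_list_for_each(iter, &pinger_imports) {
        struct obd_import *imp = cfs_list_entry(iter, struct obd_import,
                                                imp_pinger_chain);

        spin_lock(&imp->imp_lock);      /* pinger.c:259 -- where the soft lockups
                                         * show the NIP spinning */
        /* check imp_state / imp_force_verify and decide whether to ping */
        spin_unlock(&imp->imp_lock);
}
mutex_unlock(&pinger_mutex);
```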
|
| Comment by Olaf Faaland [ 08/Oct/14 ] |
|
Looked for similar tickets. |
| Comment by Peter Jones [ 09/Oct/14 ] |
|
Niu, could you please assist with this issue? Thanks. Peter |
| Comment by Oleg Drokin [ 09/Oct/14 ] |
|
I think this is really What's the exact tag in your github tree this system is running at, btw? |
| Comment by Oleg Drokin [ 09/Oct/14 ] |
|
Another question, you say that "all these tasks seem to complete before the machine is crashed" - do you mean that you do not have a single task with example traces in the crashdump and all traces are from the log, or do you mean that it's just ll_flush that completes? If it's the former, how long did you wait in this state before crashing the box (on the off chance the condition clears itself after a while)? |
| Comment by Oleg Drokin [ 09/Oct/14 ] |
|
Also I guess since you have the crashdump, please check for me if you see threads in a state similar to |
| Comment by Olaf Faaland [ 09/Oct/14 ] |
|
Oleg: The stack traces are almost entirely from the soft lockup messages in the logs. I don't have a core for the 9/27 instance. However, there is no stack for: The rest of the cores are idle (swapper). |
| Comment by Olaf Faaland [ 09/Oct/14 ] |
|
No, I see no threads similar to |
| Comment by Oleg Drokin [ 09/Oct/14 ] |
|
Thanks, Olaf. Also, when you say there's no stack for those processes, do you mean we don't know where they sit even though they are on a CPU (though usually crash shows stacks for on-CPU threads too), or something else? Another side indicator you could use: once there's a soft lockup message for some process (especially one that is not bound to any CPU), check whether it always repeats on the same CPU. If it does not, that's a pretty big hint that the lock was obtained, then released, and the process was later rescheduled somewhere else and was again waiting for a long time. |
| Comment by Oleg Drokin [ 09/Oct/14 ] |
|
Also, do you have CONFIG_DEBUG_SPINLOCK enabled by any chance? If you do, then we can see from the memory dump which CPU holds the lock, and perhaps see who is sitting on that CPU and what it is doing.
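For reference, with that option set the lock structure carries extra debug fields recording the holder, roughly like the sketch below (an approximation of the 2.6.32-era definition; the exact field layout varies between kernel versions):

```
/* Approximate shape of spinlock_t when CONFIG_DEBUG_SPINLOCK is enabled
 * (2.6.32-era); treat this as a sketch, not the exact upstream definition. */
typedef struct {
        raw_spinlock_t raw_lock;
        unsigned int magic;
        unsigned int owner_cpu;     /* CPU that currently holds the lock */
        void *owner;                /* task_struct pointer of the holder */
} spinlock_t;
```

With those fields populated, dumping imp_lock from the crash dump would name the owning CPU and task directly instead of leaving only the raw lock word to decode. |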
| Comment by Olaf Faaland [ 10/Oct/14 ] |
|
Oleg, I cannot find a way to prove whether any of those tasks ever managed to take the lock or not. All the details below are for the 9/16 incident, which is the one I have a crash dump for. I looked more closely at the logs and found that the soft lockup message appears only once for each task, and each is on a distinct CPU:
ptlrpcd_rcv cpu0
Furthermore, those are the same CPUs that those tasks are on in the dump. The "spinning too long" message from within ptlrpcd_rcv gives all the same information, including the same deadline, for the entire 2+ hours, and produces only one "soft lockup" message. Based on this, I believe these tasks may have been spinning on the lock until the crash. Crash produces a stack for only two tasks that were on CPUs: the task that triggered the page fault, which was in swapper, and the flush-lustre-1 task. For the other on-CPU tasks, "bt <pid>" produces only the header giving the PID, address of the task struct, CPU, and command. Unfortunately, CONFIG_DEBUG_SPINLOCK was not enabled, so we got no report of the owner of the lock. I'll find out whether that's something we can change. |
| Comment by Oleg Drokin [ 10/Oct/14 ] |
|
I have found that sometimes using other options of bt helps produce a backtrace when normal bt does not work. It is still kind of strange that the flush-lustre thread backtrace does not match the backtrace from the watchdog.
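For anyone following along, the kind of alternative invocations being suggested look roughly like this (the PID is just a placeholder; -t and -T scan the stack for kernel text addresses instead of unwinding it, so the output is noisy but sometimes usable when the normal unwind prints nothing):

```
crash> bt 12345        # normal unwind; on these tasks it produced only the header
crash> bt -t 12345     # scan from the last known stack location for text symbols
crash> bt -T 12345     # scan the entire stack for text symbols
crash> bt -f 12345     # dump raw stack frame data for manual inspection
```
|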
| Comment by Olaf Faaland [ 10/Oct/14 ] |
|
Thanks, Oleg. I tried all the options except -o and -O (as this machine is PPC). -T produces a listing (for example, on ll_ping task) but I don't believe it's a valid backtrace. No lustre functions appear at all. You're right about the flush-lustre thread. I just verified the backtraces are different. |
| Comment by Oleg Drokin [ 10/Oct/14 ] |
|
Thanks. While examining your code I found you carry a patch. Granted, in that ticket it was later determined that the lock was taken by a cpu that was idle, meaning something just forgot to release the lock. But I still wonder if you can check the current situation in your crashdump too, just in case. |
| Comment by Christopher Morrone [ 10/Oct/14 ] |
|
Ah, that is right, I remember that. On one hit of this bug, from seqlac6, 2014-09-16 19:09:24, imp_lock = 2147483658. In hex, that is 0x8000000a. So in theory CPU 10 locked the spin lock. Fortunately, unlike I also found that I could get some messy stack information for ptlrpcd_2. Olaf is new, and we told him about running "mod -S" when you use crash. Unfortunately that advice probably made things worse in this instance on PPC. I seem to get somewhat more useful symbol information without using mod -S for the active processes. The symbols all change drastically, and not to anything sensible, when I use mod -S. Crash still can't give me a clean backtrace, but if I use "bt -T" I can at least get symbols. I haven't attempted to disambiguate the symbols into a sane stack, but here it is:
sprintf
kernel_thread
symbol_string
ptlrpc_set_import_discon
pointer
hvc_console_print
pointer
hvc_console_print
up
up
try_acquire_console_sem
release_console_sem
try_acquire_console_sem
vprintk
task_tick_fair
scheduler_tick
run_local_timers
update_process_times
account_system_time
account_system_time
do_softirq
xics_get_irq_lpar
irq_exit
do_IRQ
timer_interrupt
__wake_up
hardware_interrupt_entry
cfs_waitq_signal
ptlrpc_init
cfs_free
cfs_free
_spin_lock
_spin_lock
ptlrpc_set_import_discon
lnet_md_unlink
lnet_eq_enqueue_event
LNetMDUnlink
_debug_req
ptlrpc_set_import_discon
ptlrpc_unregister_reply
ptlrpc_fail_import
ptlrpc_expire_one_request
ptlrpc_expired_set
cfs_waitq_timedwait
ptlrpcd
The bottom of the stack contains ptlrpcd(), and that matches this being the ptlrpcd_2 thread. The lower stack also looks like things the ptlrpcd would do.
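A hedged note on where the "CPU 10" reading comes from: as I understand the 2.6.32-era ppc64 code, the spinlock word holds a per-CPU token read from the paca, with the constant lock token 0x8000 in the upper 16 bits and the logical CPU (paca) index in the lower 16 bits. Under that assumption the observed value decodes as follows (plain userspace C, just for the arithmetic):

```
/* Decode the imp_lock value from the dump, assuming the ppc64 layout
 * described above: 0x8000 token in the high half, CPU index in the low half. */
#include <stdio.h>

int main(void)
{
        unsigned int slock = 2147483658u;       /* value observed in the dump */

        printf("raw = 0x%08x, cpu index = %u\n",
               slock, slock & 0xffffu);         /* prints 0x8000000a, 10 */
        return 0;
}
```

That matches the "CPU 10 locked the spin lock" reading above. |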
| Comment by Oleg Drokin [ 10/Oct/14 ] |
|
I audited the call paths leading to ptlrpc_set_import_discon from ptlrpc_expired_set() and I don't see any other users of imp_lock on the way, which seems to mean that the lock was locked by some other thread. I guess there's another alternative for the lock not being released, though: another somewhat plausible scenario would be if something that holds this spinlock went to sleep. I noticed that you tried to run with spinlock debug in the kernel:
[28303.243508] BUG: spinlock wrong CPU on CPU#0, ptlrpcd_0/12315 (Not tainted)
You did not report anything like that back then (other than that setting this made the problem disappear outright for some reason), though. Either way, if we discount memory corruption (pretty low probability, I think) and compiler issues (probably pretty low probability too), the only two remaining options are: something sleeps holding this spinlock, or something forgets to release it (and I looked through all callers several times, and everything matches up properly).
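To make the sleeping-holder scenario concrete, here is a purely hypothetical fragment (not anything found in Lustre) showing why such a path survives a lock/unlock audit yet leaves every contender spinning, which is consistent with the soft lockups above:

```
/* Hypothetical illustration only -- not actual Lustre code. */
static struct mutex some_hypothetical_mutex;    /* made-up lock for the example */

static void broken_path(struct obd_import *imp)
{
        spin_lock(&imp->imp_lock);
        /* Anything that can schedule here -- mutex_lock(), a GFP_KERNEL
         * allocation, waiting on an event -- is the bug: the holder sleeps,
         * remains the owner, and every other CPU spins on imp_lock. */
        mutex_lock(&some_hypothetical_mutex);
        mutex_unlock(&some_hypothetical_mutex);
        spin_unlock(&imp->imp_lock);
}
```
|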
| Comment by Olaf Faaland [ 23/Oct/14 ] |
|
Oleg, we found one task that ps indicates last ran on CPU #10 and was sleeping at the time the node was crashed: task 15059, comm=srun. Its backtrace is:
schedule
I don't see that it would be holding an import lock, but I may have missed something.
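For the record, that kind of query can be approximated in crash roughly as below (the CPU column position in the ps output and the exact filtering are assumptions; adjust to the local crash version):

```
crash> ps | awk '$3 == 10'     # filter the task list on the last-run CPU column
crash> bt 15059                # then inspect individual candidates, e.g. the srun task
```
|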
| Comment by Oleg Drokin [ 25/Oct/14 ] |
|
Just to confirm, is this the only task that is reported as last run on cpu 10? |
| Comment by Olaf Faaland [ 28/Oct/14 ] |
|
No, 110 tasks were reported as last running on cpu 10. There were 10 tasks that last ran on cpu 10 with stacks like this (comm=ldlm_bl_*): 1 like this (comm=ldlm_cb02_007): the other 98 tasks that had last run on cpu10 had no lustre functions in their stacks. |
| Comment by Oleg Drokin [ 29/Oct/14 ] |
|
Hm, I see. This is not really promising then. Is this still something that hits you regularly now, or did it hit a few times in the past and then stop? |
| Comment by Olaf Faaland [ 29/Oct/14 ] |
|
This is a current problem. We are hitting it intermittently. We have 6 lac nodes on sequoia; the most recent incident was last Tuesday. At the end of today we plan to install the kernel that has spinlock debug turned on, so we will have more information the next time it happens (if it doesn't magically go away with the kernel change). |
| Comment by Oleg Drokin [ 29/Oct/14 ] |
|
Ok. |
| Comment by Niu Yawei (Inactive) [ 18/Nov/16 ] |
|
Any further report on this issue? Can we close it? |
| Comment by Olaf Faaland [ 21/Nov/16 ] |
|
I will check and get back to you. |
| Comment by Olaf Faaland [ 21/Nov/16 ] |
|
After rebuilding the kernel with CONFIG_DEBUG_SPINLOCK enabled, we stopped seeing the problem. You can close it. We won't have any additional information to help track it down. |
| Comment by Peter Jones [ 21/Nov/16 ] |
|
ok. Thanks Olaf. |