[LU-3596] deadlock in kiblnd Created: 16/Jul/13  Updated: 01/Sep/17  Resolved: 01/Sep/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.9
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Oz Rentas Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File lu-3596.out    
Severity: 3
Rank (Obsolete): 9116

 Description   

We recently had a few OSSes at NOAA crash due to deadlocks detected by the NMI watchdog. I was able to get a vmcore from one of them and analyze it, and it looks like we are hitting LU-78:
crash> bt
PID: 15515 TASK: ffff810c39c237e0 CPU: 0 COMMAND: "kiblnd_connd"
#0 [ffffffff804b8dc0] crash_kexec at ffffffff800b1192
#1 [ffffffff804b8e80] die_nmi at ffffffff80065285
#2 [ffffffff804b8ea0] nmi_watchdog_tick at ffffffff80065a66
#3 [ffffffff804b8ef0] default_do_nmi at ffffffff80065609
#4 [ffffffff804b8f40] do_nmi at ffffffff800658f1
#5 [ffffffff804b8f50] nmi at ffffffff80064ecf
[exception RIP: __write_lock_failed+15]
RIP: ffffffff80062197 RSP: ffff81061978dc90 RFLAGS: 00000087
RAX: ffffc20000000000 RBX: ffff8107eb940140 RCX: 0000000000000001
RDX: 0000000000006000 RSI: 0000000000000003 RDI: ffffffff8032d42c
RBP: ffffc20000000000 R8: 0000000000000000 R9: ffff810c39c237e0
R10: ffff8105645d2000 R11: 0000000000002000 R12: ffffffffffffffff
R13: 0000000000007000 R14: 0000000000000001 R15: 0000000000000002
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#6 [ffff81061978dc90] __write_lock_failed at ffffffff80062197
#7 [ffff81061978dc90] _write_lock at ffffffff80064a7d
#8 [ffff81061978dc98] __get_vm_area_node at ffffffff800d51c1
#9 [ffff81061978dcd8] __vmalloc_node at ffffffff800d5952
#10 [ffff81061978dcf8] kiblnd_create_tx_pool at ffffffff8b127e0e
#11 [ffff81061978dd68] kiblnd_pool_alloc_node at ffffffff8b1247a9
#12 [ffff81061978ddc8] kiblnd_get_idle_tx at ffffffff8b12d8d0
#13 [ffff81061978ddd8] kiblnd_check_sends at ffffffff8b12e9b7
#14 [ffff81061978dde8] kiblnd_check_txs at ffffffff8b12c22c
#15 [ffff81061978de48] kiblnd_check_conns at ffffffff8b12ea68
#16 [ffff81061978dea8] kiblnd_connd at ffffffff8b136063
#17 [ffff81061978df48] kernel_thread at ffffffff8005dfc1

It appears that LU-78 was identified on 1.8.x, but the fix never landed there. If this crash is related to that bug, would it be possible to get an updated patch for potential inclusion in the next 1.8 release?

Thanks.



 Comments   
Comment by Bruno Faccini (Inactive) [ 16/Jul/13 ]

Hello Kit,
I am taking ownership of this ticket, since it looks like a new occurrence/dup of LU-78, which I was somewhat involved in !!...
Since you have a crash-dump for this deadlock, and just to double-check that it is still the same problem, can you check whether there is a thread that has been runnable but not scheduled for a long time, and get its stack-trace for me?
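For reference, something along these lines in crash should surface such a thread (just a sketch; exact option support depends on the crash version shipped with this kernel):

crash> ps -l          (tasks sorted by last_run timestamp, most recently run first)
crash> foreach RU bt  (backtraces of every runnable task)
crash> bt <pid>       (full trace of any suspicious pid found above)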

Comment by Kit Westneat (Inactive) [ 16/Jul/13 ]

Hi Bruno,

All the last_ran values from the runnable tasks look similar:
last_ran = 2233784280351367,
last_ran = 2233784307286260,
last_ran = 2233784366546510,
last_ran = 2233784494221153,
last_ran = 2233784499282262,
last_ran = 2233784500452337,
last_ran = 2233784508413095,
last_ran = 2233784520565286,
last_ran = 2233784548748409,
last_ran = 2233784548788509,
last_ran = 2233784588049211,
last_ran = 2233795686537779,
last_ran = 2233796503806272,
last_ran = 2233796699153227,
last_ran = 2233805294134249,
last_ran = 2233805294247424,
last_ran = 2233805294287834,

Is that the correct field in the task struct to be looking at?

This is the bt for the oldest last_ran:
crash> bt 28352
PID: 28352 TASK: ffff810bd8f01860 CPU: 0 COMMAND: "ll_ost_137"
#0 [ffff810bb7ebbd50] schedule at ffffffff80062fa0
#1 [ffff810bb7ebbe28] ptlrpc_wait_event at ffffffff8b0afee5
#2 [ffff810bb7ebbeb8] ptlrpc_main at ffffffff8b0b03eb
#3 [ffff810bb7ebbf48] kernel_thread at ffffffff8005dfc1

I looked through the runnable kiblnd_sd_* tasks (there were only 2), and saw this, which looks like LU-78:
crash> bt 15504
PID: 15504 TASK: ffff810c390da0c0 CPU: 17 COMMAND: "kiblnd_sd_13"
#0 [ffff810c3fd3af20] crash_nmi_callback at ffffffff8007c396
#1 [ffff810c3fd3af40] do_nmi at ffffffff800658e5
#2 [ffff810c3fd3af50] nmi at ffffffff80064ecf
[exception RIP: __write_lock_failed+9]
RIP: ffffffff80062191 RSP: ffff810619ad7db0 RFLAGS: 00000087
RAX: 0000000000000286 RBX: ffff8105db4da0c0 RCX: 0000000000000100
RDX: 000000000000029a RSI: 00000000fffffffb RDI: ffffffff8b14d39c
RBP: ffff8105db4da0c0 R8: 0000000000000001 R9: 00000000ffffffff
R10: 0000000000000000 R11: 000000000000000a R12: 00000000fffffffb
R13: 0000000000000001 R14: ffffc20010d5fd10 R15: 000000000000000c
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#3 [ffff810619ad7db0] __write_lock_failed at ffffffff80062191
#4 [ffff810619ad7db0] _write_lock_irqsave at ffffffff80064b83
#5 [ffff810619ad7db8] kiblnd_close_conn at ffffffff8b12c844
#6 [ffff810619ad7dd8] kiblnd_tx_complete at ffffffff8b12f20d
#7 [ffff810619ad7e58] kiblnd_scheduler at ffffffff8b136769
#8 [ffff810619ad7f48] kernel_thread at ffffffff8005dfc1

Here is the only other runnable IB related thread:
PID: 9331 TASK: ffff81063f2707a0 CPU: 12 COMMAND: "ipoib"
#0 [ffff810c3fc43f20] crash_nmi_callback at ffffffff8007c396
#1 [ffff810c3fc43f40] do_nmi at ffffffff800658e5
#2 [ffff810c3fc43f50] nmi at ffffffff80064ecf
[exception RIP: __smp_call_function_many+150]
RIP: ffffffff80077925 RSP: ffff810c3dde7c30 RFLAGS: 00000297
RAX: 0000000000000016 RBX: ffff810c3dde7cb0 RCX: 0000000000000000
RDX: 00000000000000ff RSI: 00000000000000ff RDI: 00000000000000c0
RBP: 0000000000000017 R8: 0000000000000018 R9: 0000000000000028
R10: ffff810c3dde7bd0 R11: ffff8106293e7600 R12: 0000000000000001
R13: 0000000000000000 R14: ffffffff800777f9 R15: ffff810c3dde7cb0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#3 [ffff810c3dde7c30] __smp_call_function_many at ffffffff80077925
#4 [ffff810c3dde7c78] smp_call_function_many at ffffffff80077a27
#5 [ffff810c3dde7ca8] smp_call_function at ffffffff80077b18
#6 [ffff810c3dde7cf8] on_each_cpu at ffffffff80096bcd
#7 [ffff810c3dde7d18] __remove_vm_area at ffffffff800d55f8
#8 [ffff810c3dde7d38] remove_vm_area at ffffffff800d5627
#9 [ffff810c3dde7d48] __vunmap at ffffffff800d5673
#10 [ffff810c3dde7d68] mlx4_buf_free at ffffffff884f8352
#11 [ffff810c3dde7d88] mlx4_ib_destroy_qp at ffffffff88542219
#12 [ffff810c3dde7dc8] ib_destroy_qp at ffffffff884a1bb0
#13 [ffff810c3dde7df8] ipoib_cm_tx_reap at ffffffff88741999
#14 [ffff810c3dde7e38] run_workqueue at ffffffff8004d8ac
#15 [ffff810c3dde7e78] worker_thread at ffffffff8004a1e5
#16 [ffff810c3dde7ee8] kthread at ffffffff80032c45
#17 [ffff810c3dde7f48] kernel_thread at ffffffff8005dfc1

Thanks.

Comment by Bruno Faccini (Inactive) [ 17/Jul/13 ]

Hello Cory,
I see only 17 runnable (running?) threads; how many cores does your OSS have? I presume 16, with all of them in NMI context and the 17th being pid 28352?
Can you check whether you get the same order/threads in the "ps -l" output in crash?
Also, can you provide the "dmesg/log", "ps -l", and "foreach bt" outputs from crash too? Or maybe the crash-dump itself, if site policy allows?

Comment by Kit Westneat (Inactive) [ 17/Jul/13 ]

Hi Bruno,

There are actually 24 CPUs. I may have misinterpreted running vs runnable. I looked for all the tasks with RU in ps, but is that running? I've attached the requested output.

The vmcore is 2GB; is there a server I could put it on?

Thanks,
Kit

Comment by Bruno Faccini (Inactive) [ 17/Jul/13 ]

RU in ps stands for RUnnable, not just running; that was my intent, to get all runnable threads rather than only the running ones.
Hmm, but if there are 24 cores you should see at least 24 RU tasks, including some swapper/idle threads. So I definitely need the outputs of the 3 crash sub-commands I requested earlier; you said you attached them, but I don't see them?
Concerning the crash-dump upload, I have sent you the instructions privately.

Comment by Bruno Faccini (Inactive) [ 18/Jul/13 ]

Having had a look at the crash-dump extracts you posted in the lu-3596.out file, I am less and less convinced this is a dup of LU-78. In LU-78 the deadlock occurred because the lock owner was re-scheduled while a lock competitor was elected onto its CPU, whereas here the lock owner is stuck waiting for a TLB-flush IPI broadcast to complete ...

Also, is the OSS running with /proc/sys/kernel/nmi_watchdog set?

Comment by Bruno Faccini (Inactive) [ 23/Jul/13 ]

Ok, from the crash-dump, the scenario is different from LU-78: there are 3 threads involved, all 3 spinning on their CPUs, with the following stack-traces at the time of the NMI watchdog:

PID: 9331   TASK: ffff81063f2707a0  CPU: 12  COMMAND: "ipoib"
 #0 [ffff810c3fc43f20] crash_nmi_callback at ffffffff8007c396
 #1 [ffff810c3fc43f40] do_nmi at ffffffff800658e5
 #2 [ffff810c3fc43f50] nmi at ffffffff80064ecf
    [exception RIP: __smp_call_function_many+150]
    RIP: ffffffff80077925  RSP: ffff810c3dde7c30  RFLAGS: 00000297
    RAX: 0000000000000016  RBX: ffff810c3dde7cb0  RCX: 0000000000000000
    RDX: 00000000000000ff  RSI: 00000000000000ff  RDI: 00000000000000c0
    RBP: 0000000000000017   R8: 0000000000000018   R9: 0000000000000028
    R10: ffff810c3dde7bd0  R11: ffff8106293e7600  R12: 0000000000000001
    R13: 0000000000000000  R14: ffffffff800777f9  R15: ffff810c3dde7cb0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #3 [ffff810c3dde7c30] __smp_call_function_many at ffffffff80077925
 #4 [ffff810c3dde7c78] smp_call_function_many at ffffffff80077a27
 #5 [ffff810c3dde7ca8] smp_call_function at ffffffff80077b18
 #6 [ffff810c3dde7cf8] on_each_cpu at ffffffff80096bcd
 #7 [ffff810c3dde7d18] __remove_vm_area at ffffffff800d55f8
 #8 [ffff810c3dde7d38] remove_vm_area at ffffffff800d5627
 #9 [ffff810c3dde7d48] __vunmap at ffffffff800d5673
#10 [ffff810c3dde7d68] mlx4_buf_free at ffffffff884f8352 [mlx4_core]
#11 [ffff810c3dde7d88] mlx4_ib_destroy_qp at ffffffff88542219 [mlx4_ib]
#12 [ffff810c3dde7dc8] ib_destroy_qp at ffffffff884a1bb0 [ib_core]
#13 [ffff810c3dde7df8] ipoib_cm_tx_reap at ffffffff88741999 [ib_ipoib]
#14 [ffff810c3dde7e38] run_workqueue at ffffffff8004d8ac
#15 [ffff810c3dde7e78] worker_thread at ffffffff8004a1e5
#16 [ffff810c3dde7ee8] kthread at ffffffff80032c45
#17 [ffff810c3dde7f48] kernel_thread at ffffffff8005dfc1

PID: 15504  TASK: ffff810c390da0c0  CPU: 17  COMMAND: "kiblnd_sd_13"
 #0 [ffff810c3fd3af20] crash_nmi_callback at ffffffff8007c396
 #1 [ffff810c3fd3af40] do_nmi at ffffffff800658e5
 #2 [ffff810c3fd3af50] nmi at ffffffff80064ecf
    [exception RIP: __write_lock_failed+9]
    RIP: ffffffff80062191  RSP: ffff810619ad7db0  RFLAGS: 00000087
    RAX: 0000000000000286  RBX: ffff8105db4da0c0  RCX: 0000000000000100
    RDX: 000000000000029a  RSI: 00000000fffffffb  RDI: ffffffff8b14d39c
    RBP: ffff8105db4da0c0   R8: 0000000000000001   R9: 00000000ffffffff
    R10: 0000000000000000  R11: 000000000000000a  R12: 00000000fffffffb
    R13: 0000000000000001  R14: ffffc20010d5fd10  R15: 000000000000000c
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #3 [ffff810619ad7db0] __write_lock_failed at ffffffff80062191
 #4 [ffff810619ad7db0] _write_lock_irqsave at ffffffff80064b83
 #5 [ffff810619ad7db8] kiblnd_close_conn at ffffffff8b12c844 [ko2iblnd]
 #6 [ffff810619ad7dd8] kiblnd_tx_complete at ffffffff8b12f20d [ko2iblnd]
 #7 [ffff810619ad7e58] kiblnd_scheduler at ffffffff8b136769 [ko2iblnd]
 #8 [ffff810619ad7f48] kernel_thread at ffffffff8005dfc1

PID: 15515  TASK: ffff810c39c237e0  CPU: 0   COMMAND: "kiblnd_connd"
 #0 [ffffffff804b8dc0] crash_kexec at ffffffff800b1192
 #1 [ffffffff804b8e80] die_nmi at ffffffff80065285
 #2 [ffffffff804b8ea0] nmi_watchdog_tick at ffffffff80065a66
 #3 [ffffffff804b8ef0] default_do_nmi at ffffffff80065609
 #4 [ffffffff804b8f40] do_nmi at ffffffff800658f1
 #5 [ffffffff804b8f50] nmi at ffffffff80064ecf
    [exception RIP: __write_lock_failed+15]
    RIP: ffffffff80062197  RSP: ffff81061978dc90  RFLAGS: 00000087
    RAX: ffffc20000000000  RBX: ffff8107eb940140  RCX: 0000000000000001
    RDX: 0000000000006000  RSI: 0000000000000003  RDI: ffffffff8032d42c
    RBP: ffffc20000000000   R8: 0000000000000000   R9: ffff810c39c237e0
    R10: ffff8105645d2000  R11: 0000000000002000  R12: ffffffffffffffff
    R13: 0000000000007000  R14: 0000000000000001  R15: 0000000000000002
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff81061978dc90] __write_lock_failed at ffffffff80062197
 #7 [ffff81061978dc90] _write_lock at ffffffff80064a7d
 #8 [ffff81061978dc98] __get_vm_area_node at ffffffff800d51c1
 #9 [ffff81061978dcd8] __vmalloc_node at ffffffff800d5952
#10 [ffff81061978dcf8] kiblnd_create_tx_pool at ffffffff8b127e0e [ko2iblnd]
#11 [ffff81061978dd68] kiblnd_pool_alloc_node at ffffffff8b1247a9 [ko2iblnd]
#12 [ffff81061978ddc8] kiblnd_get_idle_tx at ffffffff8b12d8d0 [ko2iblnd]
#13 [ffff81061978ddd8] kiblnd_check_sends at ffffffff8b12e9b7 [ko2iblnd]
#14 [ffff81061978dde8] kiblnd_check_txs at ffffffff8b12c22c [ko2iblnd]
#15 [ffff81061978de48] kiblnd_check_conns at ffffffff8b12ea68 [ko2iblnd]
#16 [ffff81061978dea8] kiblnd_connd at ffffffff8b136063 [ko2iblnd]
#17 [ffff81061978df48] kernel_thread at ffffffff8005dfc1

where:

_ 9331/"ipoib" thread owns vmlist_lock, during mlx4 driver buffer unmap, but then wants all others CPUs to flush their TLB and thus sent an IPI to each of them and it is now busy waiting for one reluctant peer ...

_ 15504/"kiblnd_sd_13" thread is spinning/waiting on kiblnd_data.kib_global_lock in kiblnd_close_conn() but with interrupts disabled, and thus is likely to be the one still missing the IPI and to flush its TLB !!

_ 15515/"kiblnd_connd" thread owns kiblnd_data.kib_global_lock since kiblnd_check_conns() but is now spinning/waiting for vmlist_lock due to kmem alloc in kiblnd_create_tx_pool().

So 9331 blocks 15515, which blocks 15504, which in turn blocks 9331 !!

But even though this is a new scenario, different from the 2 already described in LU-78, I can now agree that it would also be fixed by the same patch, since with it thread 15515 will no longer spin on vmlist_lock while already owning kiblnd_data.kib_global_lock !!
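
To make that concrete, here is a rough sketch of the pattern the fix relies on (in the spirit of the LU-78 change, not the actual patch; kiblnd_collect_conns_locked() is a hypothetical helper and some names may not match the b1_8 code exactly): connections are gathered into a private list under the global rwlock, the lock is dropped, and only then is the path that may vmalloc() a new tx pool entered.

#include <linux/list.h>
#include <linux/spinlock.h>

static void
kiblnd_check_conns_sketch(int idx)
{
        struct list_head  checksends;
        unsigned long     flags;
        kib_conn_t       *conn;

        INIT_LIST_HEAD(&checksends);

        read_lock_irqsave(&kiblnd_data.kib_global_lock, flags);
        /* only take a reference on each connection needing a timeout/
         * keepalive check and move it to the private list; no memory
         * allocation may happen while kib_global_lock is held */
        kiblnd_collect_conns_locked(idx, &checksends);  /* hypothetical helper */
        read_unlock_irqrestore(&kiblnd_data.kib_global_lock, flags);

        /* the global lock is now dropped, so kiblnd_check_sends() is free
         * to grow the tx pool via vmalloc() (and hence take vmlist_lock)
         * without re-creating the 3-way deadlock analyzed above */
        while (!list_empty(&checksends)) {
                conn = list_entry(checksends.next, kib_conn_t, ibc_connd_list);
                list_del(&conn->ibc_connd_list);

                kiblnd_check_sends(conn);
                kiblnd_conn_decref(conn);
        }
}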

Comment by Kit Westneat (Inactive) [ 23/Jul/13 ]

Thanks for the analysis! I have ported the patch from LU-78 to b1_8:
http://review.whamcloud.com/7092

Does that look correct?

Comment by Bruno Faccini (Inactive) [ 24/Jul/13 ]

Yes, the patch/back-port looks ok. We need to wait for Liang's feedback to confirm. Also, I had to re-trigger the auto-tests due to an unrelated failure.

Comment by Oz Rentas [ 01/Sep/17 ]

This is resolved. Please close.

Comment by Peter Jones [ 01/Sep/17 ]

Thanks Oz
