Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • Lustre 2.7.0
    • None
    • 2.7.1-fe
    • 3
    • 9223372036854775807

    Description

      OSS deadlocked; unable to ping Ethernet or IB interfaces. Console showed no errors.
      Attaching a full trace of all threads. Most notable are the kiblnd threads:

      ID: 8711   TASK: ffff882020b12ab0  CPU: 4   COMMAND: "kiblnd_sd_01_01"
       #0 [ffff880060c86e90] crash_nmi_callback at ffffffff81032256
       #1 [ffff880060c86ea0] notifier_call_chain at ffffffff81568515
       #2 [ffff880060c86ee0] atomic_notifier_call_chain at ffffffff8156857a
       #3 [ffff880060c86ef0] notify_die at ffffffff810a44fe
       #4 [ffff880060c86f20] do_nmi at ffffffff8156618f
       #5 [ffff880060c86f50] nmi at ffffffff815659f0
          [exception RIP: _spin_lock+33]
          RIP: ffffffff81565261  RSP: ffff882021b75b70  RFLAGS: 00000293
          RAX: 0000000000002b8e  RBX: ffff880ffe7dd240  RCX: 0000000000000000
          RDX: 0000000000002b8b  RSI: 0000000000000003  RDI: ffff88201ee3f140
          RBP: ffff882021b75b70   R8: 6950000000000000   R9: 4a80000000000000
          R10: 0000000000000001  R11: 0000000000000001  R12: 0000000000000018
          R13: ffff881013262e40  R14: ffff8820268ecac0  R15: 0000000000000004
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
      --- <NMI exception stack> ---
       #6 [ffff882021b75b70] _spin_lock at ffffffff81565261
       #7 [ffff882021b75b78] cfs_percpt_lock at ffffffffa049edab [libcfs]
       #8 [ffff882021b75bb8] lnet_ptl_match_md at ffffffffa0529605 [lnet]
       #9 [ffff882021b75c38] lnet_parse_local at ffffffffa05306e7 [lnet]
      #10 [ffff882021b75cd8] lnet_parse at ffffffffa05316da [lnet]
      #11 [ffff882021b75d68] kiblnd_handle_rx at ffffffffa0a16f3b [ko2iblnd]
      #12 [ffff882021b75db8] kiblnd_scheduler at ffffffffa0a182be [ko2iblnd]
      #13 [ffff882021b75ee8] kthread at ffffffff8109dc8e
      #14 [ffff882021b75f48] kernel_thread at ffffffff8100c28a
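
      Frames #6-#8 show the thread spinning in _spin_lock underneath cfs_percpt_lock() on the
      way into lnet_ptl_match_md(). As a rough illustration of the per-CPT locking pattern
      involved (a hypothetical sketch, not the actual libcfs source), assuming one spinlock per
      CPU partition:

      #include <linux/spinlock.h>

      /*
       * Hypothetical sketch of a per-CPT lock: one spinlock per CPU partition,
       * with a negative "exclusive" index taking every partition lock in order.
       * A single partition lock that is never released (or whose memory is
       * corrupted) stalls every caller at spin_lock(), which is where the NMI
       * backtrace above catches the kiblnd scheduler threads.
       */
      struct percpt_lock_sketch {
              spinlock_t **pls_locks;   /* one lock per CPU partition (CPT) */
              int          pls_ncpts;   /* number of partitions */
      };

      static void percpt_lock_sketch(struct percpt_lock_sketch *pls, int index)
      {
              if (index >= 0) {
                      spin_lock(pls->pls_locks[index]);       /* lock a single partition */
              } else {
                      int i;

                      for (i = 0; i < pls->pls_ncpts; i++)    /* "exclusive": lock them all */
                              spin_lock(pls->pls_locks[i]);
              }
      }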
      

      Attachments

        Issue Links

          Activity

            [LU-8334] OSS lockup

            mhanafi Mahmoud Hanafi added a comment -

            Close this case; we will track LU-7980.

            bfaccini Bruno Faccini (Inactive) added a comment -

            After working with Oleg's help on the crash dump for this ticket, there is strong evidence (a similar slab-corruption signature was seen during the LU-7980 investigation) that this is a duplicate of LU-7980. It would make sense to apply the fix from that ticket and see whether it is sufficient.
            green Oleg Drokin added a comment -

            Yes, the next pointer does point to a valid area:

            crash> p *(struct slab *)0xffff8817213f0000
            $4 = {
              list = {
                next = 0xffff880b7de4b800, 
                prev = 0xffff880b01a70800
              }, 
              colouroff = 1214182228249458060, 
              s_mem = 0x0, 
              inuse = 2690861296, 
              free = 4294967295, 
              nodeid = 0
            }
            

            I suspect it's the same type, but I am not sure how to check that easily.
            I do not see a real loop in there with some light probing, but I see that there's an alternative valid path around this next node:

            crash> p *(struct slab *)0xffff880b7de4b800
            $5 = {
              list = {
                next = 0xffff880c60693800, 
                prev = 0xffff8817213f0000
              }, 
              colouroff = 1214182228245329292, 
              s_mem = 0x0, 
              inuse = 2690860912, 
              free = 4294967295, 
              nodeid = 0
            }
            crash> p *(struct slab *)0xffff880b01a70800
            $6 = {
              list = {
                next = 0xffff8817213f0000, 
                prev = 0xffff88180be3e340
              }, 
              colouroff = 1214182229118793100, 
              s_mem = 0x0, 
              inuse = 2690860912, 
              free = 4294967295, 
              nodeid = 0
            }
            
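            (If it helps with checking the type: crash's plain kmem command, given an address,
            reports which kmem_cache and slab that address falls in, for example:

            crash> kmem 0xffff880b7de4b800

            That should show whether these list nodes all belong to the same size-512 full list;
            offered as a suggestion only, not something verified against this dump.)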

            bfaccini Bruno Faccini (Inactive) added a comment -

            And what about the next = 0xffff8817213f0000 pointer, is it pointing to a valid area? And if yes, of which type/family?
            Also, does it lead to a loop that may be the cause of the looping/pseudo-hung execution of s_show()?
            Last, does the corruption of the slab at 0xffff8813e0ab36c0 seem to come from an overrun from previous locations?
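
            (For the loop question, one way to probe it, assuming these addresses really are
            embedded list_head nodes, is to let crash walk the chain; its list command should
            stop with a "duplicate list entry" error if the walk revisits a node:

            crash> list 0xffff8817213f0000

            This is only a suggestion for probing the structure, not something verified against
            this dump.)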
            green Oleg Drokin added a comment - edited
            crash> p *(struct slab *)0xffff8813e0ab36c0
            $2 = {
              list = {
                next = 0xffff8817213f0000, 
                prev = 0xffff881ae71d4540
              }, 
              colouroff = 0, 
              s_mem = 0xffff881ae71d4000, 
              inuse = 0, 
              free = 0, 
              nodeid = 1
            }
            
            

            kmem -S size-512 does not list the content of this particular slab because it is corrupted.
            The s_mem location is also invalid, so I cannot peek inside.

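            (On the overrun question, one way to look for it is to dump the raw memory just
            ahead of the corrupted descriptor with crash's rd command and see whether a
            neighbouring object's data pattern runs into it, for example (the starting offset
            here is arbitrary, just somewhat before 0xffff8813e0ab36c0):

            crash> rd 0xffff8813e0ab3640 32

            Again only a suggestion; that region has not been inspected in this dump.)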

            Right, "crash/kmem -s" had already indicated this!
            But then, what about the full content of this slab and even "kmem -S size-512" output ?

            bfaccini Bruno Faccini (Inactive) added a comment - Right, "crash/kmem -s" had already indicated this! But then, what about the full content of this slab and even "kmem -S size-512" output ?
            green Oleg Drokin added a comment -

            it's "size-512"


            bfaccini Bruno Faccini (Inactive) added a comment -

            Well, the slab at ffff8813e0ab36c0 may be corrupted and cause s_show() to loop... Also, for which kmem_cache has it been allocated?
            green Oleg Drokin added a comment -

            Bruno: it looks like you are on to something.

            When I run kmem -s on the crash dump, I get these errors:

            kmem: size-512: full list: slab: ffff8813e0ab36c0  bad prev pointer: ffff881ae71d4540
            kmem: size-512: full list: slab: ffff8813e0ab36c0  bad inuse counter: 0
            kmem: size-512: full list: slab: ffff8813e0ab36c0  bad s_mem pointer: ffff881ae71d4000
            

            Indeed, ffff881ae71d4540 is an invalid kernel address that I cannot access. The same goes for ffff881ae71d4000.

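            (For reference, the fields crash is complaining about correspond to the SLAB-era
            slab descriptor, roughly as defined in the 2.6.32-era mm/slab.c this kernel is
            based on:

            struct slab {
                    struct list_head list;       /* linkage on the cache's full/partial/free lists */
                    unsigned long colouroff;     /* colour offset applied to s_mem */
                    void *s_mem;                 /* address of the first object in the slab */
                    unsigned int inuse;          /* number of objects currently allocated */
                    kmem_bufctl_t free;          /* index of the first free object, or BUFCTL_END */
                    unsigned short nodeid;       /* NUMA node the slab's pages came from */
            };

            So "bad prev pointer", "bad inuse counter" and "bad s_mem pointer" are crash's sanity
            checks on list.prev, inuse and s_mem while walking the full list, which fits the
            descriptor itself having been overwritten.)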

            mhanafi Mahmoud Hanafi added a comment -

            Could I get an update on the progress of this issue?

            Thanks,

            People

              bfaccini Bruno Faccini (Inactive)
              mhanafi Mahmoud Hanafi
              Votes:
              0
              Watchers:
              7

              Dates

                Created:
                Updated:
                Resolved: