[LU-7099] Crash in kiblnd_pool_alloc_node Created: 03/Sep/15  Updated: 14/Jun/18  Resolved: 12/Jul/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: Sebastien Piechurski Assignee: Doug Oucharek (Inactive)
Resolution: Fixed Votes: 0
Labels: p4b
Environment:

MDS installed with Bull 2.5.3 version


Issue Links:
Related
is related to LU-5678 kernel crash due to NULL pointer dere... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

The MDS crashed in kiblnd_pool_alloc_node, with the message "unable to handle kernel NULL pointer dereference at 0000000000000010".

This looks exactly the same as LU-5678, but since patch http://review.whamcloud.com/12852 was already applied, I am opening this new ticket.

crash> sys         
  SYSTEM MAP: /dumps/lib/kernel-debuginfo/2.6.32-504.8.1.el6.Bull.70.x86_64/boot/System.map-2.6.32-504.8.1.el6.Bull.70.x86_64
DEBUG KERNEL: /dumps/lib/kernel-debuginfo/2.6.32-504.8.1.el6.Bull.70.x86_64/modules/vmlinux (2.6.32-504.8.1.el6.Bull.70.x86_64)
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 48 [OFFLINE: 24]
        DATE: Wed Apr  1 16:58:18 2015
      UPTIME: 00:54:42
LOAD AVERAGE: 0.64, 5.04, 7.86
       TASKS: 682
    NODENAME: taurusmds6
     RELEASE: 2.6.32-504.8.1.el6.Bull.70.x86_64
     VERSION: #1 SMP Tue Feb 10 14:51:21 CET 2015
     MACHINE: x86_64  (2399 Mhz)
      MEMORY: 128 GB
       PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000010"
crash> bt
PID: 9622   TASK: ffff881066c50080  CPU: 1   COMMAND: "kiblnd_sd_00_02"
 #0 [ffff880f23ee3630] machine_kexec at ffffffff8103b71b
 #1 [ffff880f23ee3690] crash_kexec at ffffffff810c9852  
 #2 [ffff880f23ee3760] oops_end at ffffffff8152ec30
 #3 [ffff880f23ee3790] no_context at ffffffff8104c80b   
 #4 [ffff880f23ee37e0] __bad_area_nosemaphore at ffffffff8104ca95
 #5 [ffff880f23ee3830] bad_area_nosemaphore at ffffffff8104cb63
 #6 [ffff880f23ee3840] __do_page_fault at ffffffff8104d2bf
 #7 [ffff880f23ee3960] do_page_fault at ffffffff81530b7e
 #8 [ffff880f23ee3990] page_fault at ffffffff8152df35   
    [exception RIP: kiblnd_pool_alloc_node+73]
    RIP: ffffffffa0b77439  RSP: ffff880f23ee3a40  RFLAGS: 00010207
    RAX: 0000000000000000  RBX: ffff880fec59ce40  RCX: 000000000000003f
    RDX: 0000000000000010  RSI: 0000000000000002  RDI: ffff880fec59ce40
    RBP: ffff880f23ee3a80   R8: 72f8000000000000   R9: 97c0000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff880fec59ce70
    R13: ffff880f23ee3a48  R14: ffff880fec59ce50  R15: 0000000000000012
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff880f23ee3a88] kiblnd_get_idle_tx at ffffffffa0b81fa9 [ko2iblnd]
#10 [ffff880f23ee3aa8] kiblnd_check_sends at ffffffffa0b857b5 [ko2iblnd]
#11 [ffff880f23ee3b08] kiblnd_post_rx at ffffffffa0b87dd8 [ko2iblnd]
#12 [ffff880f23ee3b58] kiblnd_recv at ffffffffa0b882c6 [ko2iblnd]
#13 [ffff880f23ee3be8] lnet_ni_recv at ffffffffa05f9ecb [lnet]
#14 [ffff880f23ee3c38] lnet_drop_message at ffffffffa05facf1 [lnet]
#15 [ffff880f23ee3c78] lnet_parse at ffffffffa05ff672 [lnet]
#16 [ffff880f23ee3d58] kiblnd_handle_rx at ffffffffa0b889db [ko2iblnd]
#17 [ffff880f23ee3da8] kiblnd_rx_complete at ffffffffa0b896c3 [ko2iblnd]
#18 [ffff880f23ee3df8] kiblnd_complete at ffffffffa0b89872 [ko2iblnd]
#19 [ffff880f23ee3e08] kiblnd_scheduler at ffffffffa0b89c2a [ko2iblnd]
#20 [ffff880f23ee3ee8] kthread at ffffffff8109e66e
#21 [ffff880f23ee3f48] kernel_thread at ffffffff8100c20a

crash> struct kib_poolset_t ffff880fec59ce40
struct kib_poolset_t {
  ps_lock = {
    raw_lock = {
      slock = 131072
    }
  },
  ps_net = 0x0,
  ps_name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
  ps_pool_list = {
    next = 0x0,
    prev = 0x0
  },
  ps_failed_pool_list = {
    next = 0x0,
    prev = 0x0
  },
  ps_next_retry = 0,
  ps_increasing = 0,
  ps_pool_size = 0,
  ps_cpt = 0,
  ps_pool_create = 0x0,
  ps_pool_destroy = 0x0,
  ps_node_init = 0x0,
  ps_node_fini = 0x0
}

I will upload the dump shortly for analysis.
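
One possible reading of the fault address, offered as a sketch and not as a confirmed root cause: the poolset dump is entirely zeroed, so ps_pool_list.next is NULL, and RDX in the exception frame holds 0x10, the same value as the faulting address. If the pool structure begins with its list linkage and the free list follows immediately (an assumption; the layout below is a hypothetical stand-in, not copied from the ko2iblnd headers), then walking the pool list from a NULL head and testing the first pool's free list would read address 0x10 on a 64-bit kernel. The names sketch_pool and sketch_first_read are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/*
 * HYPOTHETICAL, simplified pool layout. Assumes the list linkage is the
 * first member and the free list follows it directly. This is an
 * illustration, not the real ko2iblnd definition.
 */
struct sketch_pool {
    struct list_head po_list;      /* assumed offset 0x00 */
    struct list_head po_free_list; /* assumed offset 0x10 with 8-byte pointers */
};

/*
 * Returns the address a list walk would read first when checking whether
 * the pool's free list is empty, given the value of ps_pool_list.next.
 * Only integer arithmetic is used, so nothing is dereferenced.
 */
uintptr_t sketch_first_read(uintptr_t ps_pool_list_next)
{
    /* container_of(): step back from the embedded list_head to the pool. */
    uintptr_t pool = ps_pool_list_next - offsetof(struct sketch_pool, po_list);

    /* An emptiness check on po_free_list reads po_free_list.next. */
    return pool + offsetof(struct sketch_pool, po_free_list)
                + offsetof(struct list_head, next);
}
```

Under these assumptions, sketch_first_read(0) evaluates to 0x10 on a platform with 8-byte pointers, which matches the address in the PANIC line. Whether the real structure has this layout, and why the poolset was zeroed in the first place, would still have to be checked against the dump and the ko2iblnd source.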



Comments
Comment by Sebastien Piechurski [ 03/Sep/15 ]

The dump, with all required objects, is currently being uploaded to the FTP site under uploads/LU-7099.

Comment by Joseph Gmitter (Inactive) [ 03/Sep/15 ]

Hi Amir,
Can you have a look at this issue?
Thanks.
Joe

Comment by Chris Horn [ 24/Nov/15 ]

Can you share any details on the problem here and the proposed solution? Is there a corresponding fix for master?

Comment by Gerrit Updater [ 18/May/16 ]

Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: http://review.whamcloud.com/20322
Subject: LU-7099 lnet: lock improvement for ko2iblnd
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9a5b7d7364b0beb16e2a28619595020641753cf4

Comment by Gerrit Updater [ 11/Jul/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20322/
Subject: LU-7099 lnet: lock improvement for ko2iblnd
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ddcade952026343bb0d2a56745558dca1cbdafa3

Comment by Peter Jones [ 12/Jul/16 ]

Landed for 2.9
