Lustre / LU-7099

Crash in kiblnd_pool_alloc_node

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.5.3
    • Environment: MDS installed with Bull 2.5.3 version
    • Severity: 3

    Description

      The MDS crashed in kiblnd_pool_alloc_node, with the message "unable to handle kernel NULL pointer dereference at 0000000000000010".

      This looks exactly the same as LU-5678, but since patch http://review.whamcloud.com/12852 was already applied, I am opening this new ticket.

      crash> sys         
        SYSTEM MAP: /dumps/lib/kernel-debuginfo/2.6.32-504.8.1.el6.Bull.70.x86_64/boot/System.map-2.6.32-504.8.1.el6.Bull.70.x86_64
      DEBUG KERNEL: /dumps/lib/kernel-debuginfo/2.6.32-504.8.1.el6.Bull.70.x86_64/modules/vmlinux (2.6.32-504.8.1.el6.Bull.70.x86_64)
          DUMPFILE: vmcore  [PARTIAL DUMP]
              CPUS: 48 [OFFLINE: 24]
              DATE: Wed Apr  1 16:58:18 2015
            UPTIME: 00:54:42
      LOAD AVERAGE: 0.64, 5.04, 7.86
             TASKS: 682
          NODENAME: taurusmds6
           RELEASE: 2.6.32-504.8.1.el6.Bull.70.x86_64
           VERSION: #1 SMP Tue Feb 10 14:51:21 CET 2015
           MACHINE: x86_64  (2399 Mhz)
            MEMORY: 128 GB
             PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000010"
      crash> bt
      PID: 9622   TASK: ffff881066c50080  CPU: 1   COMMAND: "kiblnd_sd_00_02"
       #0 [ffff880f23ee3630] machine_kexec at ffffffff8103b71b
       #1 [ffff880f23ee3690] crash_kexec at ffffffff810c9852  
       #2 [ffff880f23ee3760] oops_end at ffffffff8152ec30
       #3 [ffff880f23ee3790] no_context at ffffffff8104c80b   
       #4 [ffff880f23ee37e0] __bad_area_nosemaphore at ffffffff8104ca95
       #5 [ffff880f23ee3830] bad_area_nosemaphore at ffffffff8104cb63
       #6 [ffff880f23ee3840] __do_page_fault at ffffffff8104d2bf
       #7 [ffff880f23ee3960] do_page_fault at ffffffff81530b7e
       #8 [ffff880f23ee3990] page_fault at ffffffff8152df35   
          [exception RIP: kiblnd_pool_alloc_node+73]
          RIP: ffffffffa0b77439  RSP: ffff880f23ee3a40  RFLAGS: 00010207
          RAX: 0000000000000000  RBX: ffff880fec59ce40  RCX: 000000000000003f
          RDX: 0000000000000010  RSI: 0000000000000002  RDI: ffff880fec59ce40
          RBP: ffff880f23ee3a80   R8: 72f8000000000000   R9: 97c0000000000000
          R10: 0000000000000000  R11: 0000000000000000  R12: ffff880fec59ce70
          R13: ffff880f23ee3a48  R14: ffff880fec59ce50  R15: 0000000000000012
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #9 [ffff880f23ee3a88] kiblnd_get_idle_tx at ffffffffa0b81fa9 [ko2iblnd]
      #10 [ffff880f23ee3aa8] kiblnd_check_sends at ffffffffa0b857b5 [ko2iblnd]
      #11 [ffff880f23ee3b08] kiblnd_post_rx at ffffffffa0b87dd8 [ko2iblnd]
      #12 [ffff880f23ee3b58] kiblnd_recv at ffffffffa0b882c6 [ko2iblnd]
      #13 [ffff880f23ee3be8] lnet_ni_recv at ffffffffa05f9ecb [lnet]
      #14 [ffff880f23ee3c38] lnet_drop_message at ffffffffa05facf1 [lnet]
      #15 [ffff880f23ee3c78] lnet_parse at ffffffffa05ff672 [lnet]
      #16 [ffff880f23ee3d58] kiblnd_handle_rx at ffffffffa0b889db [ko2iblnd]
      #17 [ffff880f23ee3da8] kiblnd_rx_complete at ffffffffa0b896c3 [ko2iblnd]
      #18 [ffff880f23ee3df8] kiblnd_complete at ffffffffa0b89872 [ko2iblnd]
      #19 [ffff880f23ee3e08] kiblnd_scheduler at ffffffffa0b89c2a [ko2iblnd]
      #20 [ffff880f23ee3ee8] kthread at ffffffff8109e66e
      #21 [ffff880f23ee3f48] kernel_thread at ffffffff8100c20a
      
      crash> struct kib_poolset_t ffff880fec59ce40
      struct kib_poolset_t {
        ps_lock = {
          raw_lock = {
            slock = 131072
          }
        },
        ps_net = 0x0,
        ps_name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
        ps_pool_list = {
          next = 0x0,
          prev = 0x0
        },
        ps_failed_pool_list = {
          next = 0x0,
          prev = 0x0
        },
        ps_next_retry = 0,
        ps_increasing = 0,
        ps_pool_size = 0,
        ps_cpt = 0,
        ps_pool_create = 0x0,
        ps_pool_destroy = 0x0,
        ps_node_init = 0x0,
        ps_node_fini = 0x0
      }
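
      The fault address (0x10) together with ps_pool_list.next = 0x0 in the dump above is consistent with a list walk starting from a zeroed, never-initialised list head. The following is only a minimal userspace sketch of that arithmetic, not the ko2iblnd code: the demo_* types and functions are hypothetical, and it assumes that the allocator walks ps_pool_list and tests each pool's free list, and that a pool begins with two list heads (po_list, then po_free_list). It computes the address such a walk would read first, without dereferencing it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* Hypothetical, trimmed-down pool: only the two leading list heads matter.
 * The field order is an assumption made for this sketch. */
struct demo_pool {
    struct list_head po_list;      /* link in the pool set's pool list */
    struct list_head po_free_list; /* free nodes owned by this pool */
};

/* Hypothetical, trimmed-down pool set. */
struct demo_poolset {
    struct list_head ps_pool_list;
};

/* Address that a walk over ps_pool_list would read when it tests whether the
 * first pool's free list is empty. Only the address is computed; nothing is
 * dereferenced, so this is safe to call on a zeroed pool set. */
static uintptr_t demo_first_read_address(const struct demo_poolset *ps)
{
    /* Equivalent of container_of(ps->ps_pool_list.next, struct demo_pool, po_list). */
    uintptr_t pool = (uintptr_t)ps->ps_pool_list.next
                     - offsetof(struct demo_pool, po_list);

    /* A list_empty() test on po_free_list reads po_free_list.next. */
    return pool + offsetof(struct demo_pool, po_free_list)
                + offsetof(struct list_head, next);
}

/* An initialised list head is never NULL: an empty one points at itself. */
static int demo_head_initialised(const struct list_head *head)
{
    return head->next != NULL && head->prev != NULL;
}

/* Same effect as the kernel's INIT_LIST_HEAD(). */
static void demo_init_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}
```

      With a pool set cleared by memset(), demo_first_read_address() returns offsetof(struct demo_pool, po_free_list), which is 0x10 when pointers are 8 bytes wide, matching the faulting address reported in PANIC. After demo_init_head() the head points at itself, so the same walk would end immediately instead of following a NULL pointer.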
      

      I will upload the dump shortly for analysis.

            People

              Assignee: Doug Oucharek (Inactive)
              Reporter: Sebastien Piechurski