Lustre / LU-5962

OSS crash in lnet_ptl_match_md() due to a NULL pointer dereference because ptl->ptl_rotor is negative

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • Lustre 2.5.3
    • None
    • kernel 2.6.32-431.23.3 + bull fix
      lustre 2.5.3 + bull fix
    • 3
    • 16654

    Description

      During a restart (umount/mount) of OSTs, we hit an OSS crash
      caused by a NULL pointer dereference:

      BUG: unable to handle kernel NULL pointer dereference at (null)
      IP: [<ffffffffa06b37c0>] lnet_ptl_match_md+0x250/0x870 [lnet]
      PGD 0 Oops: 0000 [#1] SMP 
      last sysfs file: /sys/devices/pci0000:00/0000:00:05.0/0000:05:00.1/host7/rport-7:0-0/target7:0:0/7:0:0:3/state
      CPU 12 
      Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) lquota(U)
      fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) sha512_generic sha256_generic crc32c_intel nfs lockd
      fscache auth_rpcgss nfs_acl sunrpc ipmi_devintf cpufreq_ondemand acpi_cpufreq freq_table mperf rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U)
      ib_ipoib(U) ib_cm(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) mlx4_core(U) dm_round_robin scsi_dh_rdac
      dm_multipath uinput sg lpc_ich mfd_core ioatdma compat(U) igb dca i2c_algo_bit i2c_core ptp pps_core lpfc scsi_transport_fc scsi_tgt
      ext4 jbd2 mbcache sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod megaraid_sas [last unloaded: libcfs]
      
      Pid: 25204, comm: kiblnd_sd_00_01 Tainted: G        W  ---------------    2.6.32-431.23.3.el6.Bull.56.x86_64 #1 BULL bullx super-node
      RIP: 0010:[<ffffffffa06b37c0>]  [<ffffffffa06b37c0>] lnet_ptl_match_md+0x250/0x870 [lnet]
      RSP: 0018:ffff880c70589bf0  EFLAGS: 00010287
      RAX: ffffffffd4888cbe RBX: ffff880c70589cf0 RCX: 00000000d4888cbd
      RDX: fffffffffffffffe RSI: ffff880c5787b7d0 RDI: 0000000000000003
      RBP: ffff880c70589c70 R08: 8980000000000000 R09: 4c00000000000000
      R10: 000000000000002c R11: 0000000000000012 R12: ffff880434b24000
      R13: ffff880c40941f40 R14: ffff880c40941f40 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff8800282c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      CR2: 0000000000000000 CR3: 0000000001a85000 CR4: 00000000000007e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process kiblnd_sd_00_01 (pid: 25204, threadinfo ffff880c70588000, task ffff880c77ac9500)
      Stack:
       ffff8803ab5eb278 ffff880c70589cb8 ffff8803ab5eb140 ffff8804d4888cbd
      <d> ffff880c70589c70 ffffffffa06c5c36 ffff880c70589c70 0000000000000246
      <d> ffff880c70589c70 ffff8803c94d1580 0000000000000000 ffff880434b24000
      Call Trace:
       [<ffffffffa06bb05b>] lnet_parse+0xb9b/0x18c0 [lnet]
       [<ffffffffa08947fb>] kiblnd_handle_rx+0x2cb/0x640 [ko2iblnd]
       [<ffffffffa08954e3>] kiblnd_rx_complete+0x2d3/0x420 [ko2iblnd]
       [<ffffffffa0895692>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
       [<ffffffffa0895a4a>] kiblnd_scheduler+0x33a/0x7b0 [ko2iblnd]
       [<ffffffff81099f56>] kthread+0x96/0xa0
       [<ffffffff8100c20a>] child_rip+0xa/0x20
      Code: 00 00 00 48 8b 5d d8 4c 8b 65 e0 4c 8b 6d e8 4c 8b 75 f0 4c 8b 7d f8 c9 c3 66 90 49 8b 45 30 4c 8b 38 4d 85 ff 0f 84 39 fe ff ff
       <41> 8b 37 48 8b 3d c6 62 02 00 e8 01 6d f9 ff 8b 0d 77 64 02 00 
      
      crash> sys
            KERNEL: /usr/lib/debug/lib/modules/2.6.32-431.23.3.el6.Bull.56.x86_64/vmlinux
          DUMPFILE: vmcore  [PARTIAL DUMP]
              CPUS: 32
              DATE: Mon Nov  3 17:30:13 2014
            UPTIME: 28 days, 01:29:06
      LOAD AVERAGE: 75.19, 18.37, 8.71
             TASKS: 2258
          NODENAME: bigfoot27
           RELEASE: 2.6.32-431.23.3.el6.Bull.56.x86_64
           VERSION: #1 SMP Thu Jul 31 16:27:31 CEST 2014
           MACHINE: x86_64  (2266 Mhz)
            MEMORY: 64 GB
             PANIC: "Oops: 0000 [#1] SMP " (check log for details)
      crash> 
      crash> bt
      PID: 25204  TASK: ffff880c77ac9500  CPU: 12  COMMAND: "kiblnd_sd_00_01"
       #0 [ffff880c705897e0] machine_kexec at ffffffff8103914b
       #1 [ffff880c70589840] crash_kexec at ffffffff810c6042
       #2 [ffff880c70589910] oops_end at ffffffff8152d9d0
       #3 [ffff880c70589940] no_context at ffffffff8104a19b
       #4 [ffff880c70589990] __bad_area_nosemaphore at ffffffff8104a425
       #5 [ffff880c705899e0] bad_area_nosemaphore at ffffffff8104a4f3
       #6 [ffff880c705899f0] __do_page_fault at ffffffff8104ac4f
       #7 [ffff880c70589b10] do_page_fault at ffffffff8152f91e
       #8 [ffff880c70589b40] page_fault at ffffffff8152ccd5
          [exception RIP: lnet_ptl_match_md+592]
          RIP: ffffffffa06b37c0  RSP: ffff880c70589bf0  RFLAGS: 00010287
          RAX: ffffffffd4888cbe  RBX: ffff880c70589cf0  RCX: 00000000d4888cbd
          RDX: fffffffffffffffe  RSI: ffff880c5787b7d0  RDI: 0000000000000003
          RBP: ffff880c70589c70   R8: 8980000000000000   R9: 4c00000000000000
          R10: 000000000000002c  R11: 0000000000000012  R12: ffff880434b24000
          R13: ffff880c40941f40  R14: ffff880c40941f40  R15: 0000000000000000
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #9 [ffff880c70589c78] lnet_parse at ffffffffa06bb05b [lnet]
      #10 [ffff880c70589d58] kiblnd_handle_rx at ffffffffa08947fb [ko2iblnd]
      #11 [ffff880c70589da8] kiblnd_rx_complete at ffffffffa08954e3 [ko2iblnd]
      #12 [ffff880c70589df8] kiblnd_complete at ffffffffa0895692 [ko2iblnd]
      #13 [ffff880c70589e08] kiblnd_scheduler at ffffffffa0895a4a [ko2iblnd]
      #14 [ffff880c70589ee8] kthread at ffffffff81099f56
      #15 [ffff880c70589f48] kernel_thread at ffffffff8100c20a
      

      The ptl variable used by lnet_ptl_match_md() can be recovered from
      the dump; the crash occurs because ptl_rotor is negative:

      crash> struct lnet_portal 0xffff880c40941f40
      struct lnet_portal {
        ptl_lock = {
          raw_lock = {
            slock = 409081954
          }
        }, 
        ptl_index = 28, 
        ptl_options = 5, 
        ptl_msg_stealing = {
          next = 0xffff880c40941f50, 
          prev = 0xffff880c40941f50
        }, 
        ptl_msg_delayed = {
          next = 0xffff880c40941f60, 
          prev = 0xffff880c40941f60
        }, 
        ptl_mtables = 0xffff880c5787b7d0, 
        ptl_rotor = -729246580, 
        ptl_mt_nmaps = 4, 
        ptl_mt_maps = 0xffff880c40941f80
      }
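
      To illustrate why a negative rotor can lead to a bad pointer, here is a
      minimal standalone sketch. It assumes the rotor is reduced modulo
      ptl_mt_nmaps to pick a slot in ptl_mt_maps; that usage is not shown in
      this report, and the helper names below are hypothetical, not LNet
      functions. In C99 and later, the remainder of a negative dividend is
      zero or negative, so the computed slot can fall before the start of
      the array.

```c
#include <assert.h>

/* Hypothetical helper (not an LNet function): reduce a signed
 * round-robin counter to a slot index, as a signed rotor would. */
static int pick_slot_signed(int rotor, int nmaps)
{
	assert(nmaps > 0);
	/* C99 '%' truncates toward zero: a negative rotor gives a
	 * remainder in the range -(nmaps - 1) .. 0. */
	return rotor % nmaps;
}

/* Same reduction with an unsigned counter, shown only for comparison:
 * the result always stays in the range 0 .. nmaps - 1. */
static unsigned int pick_slot_unsigned(unsigned int rotor, unsigned int nmaps)
{
	assert(nmaps > 0);
	return rotor % nmaps;
}
```

      With nmaps = 4 as in the dump, a signed rotor of -729248578 (the 32-bit
      value 0xd4888cbe seen in RAX) reduces to -2, which would match the
      0xfffffffffffffffe held in RDX if that register is the slot index.
      That reading of the registers is an inference, not something confirmed
      by the disassembly.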
      
      

      Proposed fix:

      Nov-04 15:11:34 [root@lascaux0 lustre-2.5.3] # diff -up lnet/lnet/lib-ptl.c lnet/lnet/lib-ptl.c.apr
      --- lnet/lnet/lib-ptl.c 2014-09-11 18:04:07.000000000 +0200
      +++ lnet/lnet/lib-ptl.c.apr     2014-11-04 15:11:34.935503533 +0100
      @@ -773,6 +773,7 @@ lnet_ptl_setup(struct lnet_portal *ptl,
              }
      
              ptl->ptl_index = index;
      +        ptl->ptl_rotor = 0;
              CFS_INIT_LIST_HEAD(&ptl->ptl_msg_delayed);
              CFS_INIT_LIST_HEAD(&ptl->ptl_msg_stealing);
       #ifdef __KERNEL__
      

      My full analysis trace is attached.

            People

              bfaccini Bruno Faccini (Inactive)
              apercher Antoine Percher