  1. Lustre
  2. LU-3398

NULL pointer dereference in dump_stack()

Details

    • Bug
    • Resolution: Won't Fix
    • Blocker
    • None
    • Lustre 1.8.8, Lustre 1.8.7, Lustre 1.8.9
    • Kernel version: 2.6.32-131.0.15.el6.x86_64
      Lustre version: 1.8.7
    • 4
    • 8413

    Description

      Several different Lustre clients crashed from time to time because of a NULL pointer dereference in dump_stack(). The clients were not busy when they crashed, but the Lustre watchdog expired, reporting that an 'ldlm_cb' thread had been inactive for 0.00s.

      I think the cause of the crashes is that the walk_stack field of struct stacktrace_ops is not initialized by Lustre-1.8.X. The problem is already fixed in Lustre-2.1 (https://jira.hpdd.intel.com/browse/LU-73), but the patch has not landed in Lustre-1.8.X. The attached patch is for Lustre-1.8.9.
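
      The suspected failure mode can be modelled in userspace. The sketch below is not the Lustre or kernel code: every `model_*` name, `count_address`, and the two ops tables are hypothetical, and the model returns -1 where the real dump_trace() would jump through the NULL pointer. It only illustrates why an ops table that omits walk_stack is fatal once the caller invokes that callback unconditionally. In kernels of this era the stock x86 walker is print_context_stack(); that the attached patch assigns exactly that function is an assumption here.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct stacktrace_ops. */
struct model_stacktrace_ops {
    void (*address)(void *data, unsigned long addr);
    /* Callback the dumper invokes to walk the stack. */
    int (*walk_stack)(const struct model_stacktrace_ops *ops, void *data,
                      const unsigned long *stack, size_t depth);
};

/* Address callback: counts how many return addresses were reported. */
static void count_address(void *data, unsigned long addr)
{
    (void)addr;
    ++*(int *)data;
}

/* Minimal walker: reports every entry of the fake stack. */
static int model_walk_stack(const struct model_stacktrace_ops *ops, void *data,
                            const unsigned long *stack, size_t depth)
{
    for (size_t i = 0; i < depth; i++)
        ops->address(data, stack[i]);
    return (int)depth;
}

/* Unpatched case: walk_stack is omitted, so it is implicitly NULL. */
static const struct model_stacktrace_ops ops_unpatched = {
    .address = count_address,
};

/* Patched case: walk_stack is set explicitly. */
static const struct model_stacktrace_ops ops_patched = {
    .address    = count_address,
    .walk_stack = model_walk_stack,
};

/*
 * Model of the dumper. The NULL check exists only so this sketch can
 * report the problem (-1) instead of crashing the way the oops does.
 */
static int model_dump_trace(const struct model_stacktrace_ops *ops, void *data,
                            const unsigned long *stack, size_t depth)
{
    if (ops->walk_stack == NULL)
        return -1;
    return ops->walk_stack(ops, data, stack, depth);
}
```

      Calling model_dump_trace() with ops_unpatched returns -1 without reporting any address, while ops_patched walks the whole fake stack.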

      The logs in the crash dumps are similar. The following is one of them:

      DUMPFILE: vmcore [PARTIAL DUMP]
      CPUS: 16
      DATE: Sat May 18 23:58:37 2013
      UPTIME: 32 days, 12:06:40
      LOAD AVERAGE: 0.44, 0.42, 0.42
      TASKS: 653
      NODENAME: t114
      RELEASE: 2.6.32-131.0.15.el6.x86_64
      VERSION: #1 SMP Tue May 10 15:42:40 EDT 2011
      MACHINE: x86_64 (2593 Mhz)
      MEMORY: 64 GB
      PANIC: "Oops: 0010 [#1] SMP " (check log for details)
      PID: 0
      COMMAND: "swapper"
      TASK: ffff88081c24ea80 (1 of 16) [THREAD_INFO: ffff88101c14c000]
      CPU: 10
      STATE: TASK_RUNNING (PANIC)

      crash> bt
      PID: 0 TASK: ffff88081c24ea80 CPU: 10 COMMAND: "swapper"
      #0 [ffff88085c4438f0] machine_kexec at ffffffff810310db
      #1 [ffff88085c443950] crash_kexec at ffffffff810b63b2
      #2 [ffff88085c443a20] oops_end at ffffffff814dec50
      #3 [ffff88085c443a50] no_context at ffffffff81040cdb
      #4 [ffff88085c443aa0] __bad_area_nosemaphore at ffffffff81040f65
      #5 [ffff88085c443af0] bad_area_nosemaphore at ffffffff81041033
      #6 [ffff88085c443b00] __do_page_fault at ffffffff8104170d
      #7 [ffff88085c443c20] do_page_fault at ffffffff814e0c3e
      #8 [ffff88085c443c50] page_fault at ffffffff814ddfe5
      #9 [ffff88085c443da8] libcfs_debug_dumpstack at ffffffffa11948f5 [libcfs]
      #10 [ffff88085c443dc8] lcw_cb at ffffffffa11a0025 [libcfs]
      #11 [ffff88085c443e48] run_timer_softirq at ffffffff81079f57
      #12 [ffff88085c443ed8] __do_softirq at ffffffff8106f717
      #13 [ffff88085c443f48] call_softirq at ffffffff8100c2cc
      #14 [ffff88085c443f60] do_softirq at ffffffff8100df05
      #15 [ffff88085c443f80] irq_exit at ffffffff8106f505
      #16 [ffff88085c443f90] smp_apic_timer_interrupt at ffffffff814e35f0
      #17 [ffff88085c443fb0] apic_timer_interrupt at ffffffff8100bc93
      --- <IRQ stack> ---
      #18 [ffff88101c14de38] apic_timer_interrupt at ffffffff8100bc93
      [exception RIP: mwait_idle+119]
      RIP: ffffffff810141a7 RSP: ffff88101c14dee8 RFLAGS: 00000246
      RAX: 0000000000000000 RBX: ffff88101c14def8 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffff88101c14dfd8 RDI: ffff88101c911ec0
      RBP: ffffffff8100bc8e R8: 0000000000000000 R9: 0000000000000000
      R10: 00000000ffffffff R11: 0000000000000000 R12: ffffffff81b7c0b8
      R13: 0000000000000000 R14: ffffffff810ece03 R15: ffff88101c14de68
      ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
      #19 [ffff88101c14df00] cpu_idle at ffffffff81009e96
      crash> log
      ...
      LustreError: 11-0: an error occurred while communicating with 172.19.4.13@o2ib. The ost_write operation failed with -122
      LustreError: Skipped 6 previous similar messages
      Lustre: Service thread pid 43738 completed after 0.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      Lustre: Service thread pid 43738 was inactive for 0.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Pid: 43738, comm: ldlm_cb_07

      Call Trace:
      BUG: unable to handle kernel NULL pointer dereference at (null)
      IP: [<(null)>] (null)
      PGD 100bcce067 PUD 100bccf067 PMD 0
      Oops: 0010 [#1] SMP
      last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/0000:07:00.0/device
      CPU 10
      Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc mptctl mptbase mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) autofs4 ipmi_devintf ipmi_si ipmi_msghandler sunrpc rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) iw_nes(U) libcrc32c iw_cxgb3(U) cxgb3(U) mlx4_ib(U) mlx4_en(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_region_hash dm_log vhost_net macvtap macvlan tun kvm_intel kvm uinput power_meter hwmon nvidia(P)(U) sg hpilo hpwdt igb(U) dca microcode serio_raw iTCO_wdt iTCO_vendor_support shpchp ext4 mbcache jbd2 sd_mod crc_t10dif nouveau ttm drm_kms_helper drm i2c_algo_bit i2c_core video output hpsa(U) mpt2sas(U) scsi_transport_sas raid_class ahci dm_mod [last unloaded: scsi_wait_scan]

      Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc mptctl mptbase mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) autofs4 ipmi_devintf ipmi_si ipmi_msghandler sunrpc rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) iw_nes(U) libcrc32c iw_cxgb3(U) cxgb3(U) mlx4_ib(U) mlx4_en(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_region_hash dm_log vhost_net macvtap macvlan tun kvm_intel kvm uinput power_meter hwmon nvidia(P)(U) sg hpilo hpwdt igb(U) dca microcode serio_raw iTCO_wdt iTCO_vendor_support shpchp ext4 mbcache jbd2 sd_mod crc_t10dif nouveau ttm drm_kms_helper drm i2c_algo_bit i2c_core video output hpsa(U) mpt2sas(U) scsi_transport_sas raid_class ahci dm_mod [last unloaded: scsi_wait_scan]
      Pid: 0, comm: swapper Tainted: P ---------------- 2.6.32-131.0.15.el6.x86_64 #1 ProLiant SL250s Gen8
      RIP: 0010:[<0000000000000000>] [<(null)>] (null)
      RSP: 0018:ffff88085c443d08 EFLAGS: 00010246
      RAX: ffff88085c443d6c RBX: ffff88021611bd00 RCX: ffffffffa11a0320
      RDX: ffff88021611bdc0 RSI: ffff88021611bd00 RDI: ffff88021611a000
      RBP: ffff88085c443da0 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000004 R11: 0000000000000000 R12: 000000000000cc20
      R13: ffffffffa11a0320 R14: 0000000000000000 R15: ffff88085c443fc0
      FS: 0000000000000000(0000) GS:ffff88085c440000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      CR2: 0000000000000000 CR3: 000000100bccd000 CR4: 00000000000406e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process swapper (pid: 0, threadinfo ffff88101c14c000, task ffff88081c24ea80)
      Stack:
      ffffffff8100e5a0 ffff88085c443d6c ffff8801be15eb40 0000000000000000
      <0> 0000000000000000 ffff88101c14c000 ffff88101c14dfd8 ffff88021611a000
      <0> 000000000000000a ffff88085c440000 ffff88021611bdc0 ffff88085c443d70
      Call Trace:
      <IRQ>
      [<ffffffff8100e5a0>] ? dump_trace+0x190/0x3b0
      [<ffffffffa11948f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [<ffffffffa11a0025>] lcw_cb+0x255/0x4cc [libcfs]
      [<ffffffff81077f31>] ? ftrace_raw_event_timer_expire_entry+0xb1/0xc0
      [<ffffffffa119fdd0>] ? lcw_cb+0x0/0x4cc [libcfs]
      [<ffffffff81079f57>] run_timer_softirq+0x197/0x340
      [<ffffffff8106f717>] __do_softirq+0xb7/0x1e0
      [<ffffffff81092ca0>] ? hrtimer_interrupt+0x140/0x250
      [<ffffffff8100c2cc>] call_softirq+0x1c/0x30
      [<ffffffff8100df05>] do_softirq+0x65/0xa0
      [<ffffffff8106f505>] irq_exit+0x85/0x90
      [<ffffffff814e35f0>] smp_apic_timer_interrupt+0x70/0x9b
      [<ffffffff8100bc93>] apic_timer_interrupt+0x13/0x20
      <EOI>
      [<ffffffff810141a7>] ? mwait_idle+0x77/0xd0
      [<ffffffff810141f0>] ? mwait_idle+0xc0/0xd0
      [<ffffffff81009e96>] cpu_idle+0xb6/0x110
      [<ffffffff814d493c>] start_secondary+0x202/0x245
      Code: Bad RIP value.
      RIP [<(null)>] (null)
      RSP <ffff88085c443d08>
      CR2: 0000000000000000
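
      The oops header above is consistent with a call through a NULL function pointer: the error code is 0010 (hex), and both RIP and CR2 are 0. As a rough illustration, the sketch below decodes the standard x86 page-fault error-code bits; the struct and function names are hypothetical and exist only for this example.

```c
#include <stdbool.h>

/* Hypothetical container for the decoded x86 page-fault error-code bits. */
struct pf_error {
    bool protection_violation; /* bit 0: set = protection fault, clear = page not present */
    bool write;                /* bit 1: set = write access, clear = read */
    bool user_mode;            /* bit 2: set = fault raised in user mode */
    bool reserved_bit;         /* bit 3: reserved bit set in a paging entry */
    bool instruction_fetch;    /* bit 4: fault was an instruction fetch */
};

/* Split the numeric code printed after "Oops:" into its individual flags. */
static struct pf_error decode_pf_error(unsigned long code)
{
    struct pf_error e;

    e.protection_violation = (code & 0x01) != 0;
    e.write                = (code & 0x02) != 0;
    e.user_mode            = (code & 0x04) != 0;
    e.reserved_bit         = (code & 0x08) != 0;
    e.instruction_fetch    = (code & 0x10) != 0;
    return e;
}
```

      For 0x0010, only instruction_fetch is set: a kernel-mode instruction fetch from a not-present page, which with CR2 at 0 means the CPU tried to execute code at address 0.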

          People

            bobijam Zhenyu Xu
            lixi Li Xi (Inactive)