Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.1.0
-
3
-
4709
Description
We have seen several crashes with the following signature (Lustre 2.1.0):
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffffa0409336>] lc_watchdog_del_pending+0x56/0x70 [libcfs]
PGD 8783c2067 PUD 87703f067 PMD 0
Oops: 0002 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu31/cache/index2/shared_cpu_map
CPU 3
Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) exportfs ost(U) mgc(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) lquota(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) sunrpc cpufreq_ondemand acpi_cpufreq freq_table rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_region_hash dm_log dm_round_robin uinput usbhid hid ghes sg i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ehci_hcd uhci_hcd ioatdma hed lpfc scsi_transport_fc scsi_tgt igb dca ext4 jbd2 sd_mod crc_t10dif dm_multipath dm_mod megaraid_sas ahci [last unloaded: scsi_wait_scan]
Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) exportfs ost(U) mgc(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) lquota(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) sunrpc cpufreq_ondemand acpi_cpufreq freq_table rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_region_hash dm_log dm_round_robin uinput usbhid hid ghes sg i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ehci_hcd uhci_hcd ioatdma hed lpfc scsi_transport_fc scsi_tgt igb dca ext4 jbd2 sd_mod crc_t10dif dm_multipath dm_mod megaraid_sas ahci [last unloaded: scsi_wait_scan]
Pid: 21930, comm: tgt_recov Not tainted 2.6.32-131.17.1.bl6.Bull.27.0.x86_64 #1 bullx super-node
RIP: 0010:[<ffffffffa0409336>] [<ffffffffa0409336>] lc_watchdog_del_pending+0x56/0x70 [libcfs]
RSP: 0018:ffff8810461bb560 EFLAGS: 00010286
RAX: ffff880c8e452e20 RBX: ffff88107d07e790 RCX: 0000000000001000
RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffffffa04296a0
RBP: ffff8810461bb570 R08: 0000000000000002 R09: 0000000000000000
R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: ffff88107d07e7e8
R13: 0000000000000002 R14: ffff8810461bbc70 R15: ffff8810763fc038
FS: 00002ba692f28b20(0000) GS:ffff880c8e400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000087c1d8000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process tgt_recov (pid: 21930, threadinfo ffff8810461b8000, task ffff88107d74e100)
Stack:
 ffff88107d07e790 0000000000000004 ffff8810461bb5b0 ffffffffa0409388
<0> ffff8810461bb5b0 0000000000000000 ffff88107dd3e680 0000000000000002
<0> ffff8810461bbc70 0000000000000000 ffff8810461bb710 ffffffffa06ce579
Call Trace:
 [<ffffffffa0409388>] lc_watchdog_disable+0x38/0x120 [libcfs]
 [<ffffffffa06ce579>] quota_chk_acq_common+0x179/0xbc0 [lquota]
 [<ffffffffa06cc170>] ? quota_acquire_common+0x0/0x130 [lquota]
 [<ffffffff8118d82f>] ? generic_block_bmap+0x3f/0x50
 [<ffffffffa08fa6c0>] ? filter_alloc_iobuf+0x170/0x850 [obdfilter]
 [<ffffffffa08fba28>] filter_commitrw_write+0xc88/0x2ec8 [obdfilter]
 [<ffffffff8147de2a>] ? thread_return+0x4e/0x754
 [<ffffffff810657ec>] ? lock_timer_base+0x3c/0x70
 [<ffffffff8106629b>] ? try_to_del_timer_sync+0x7b/0xe0
 [<ffffffffa08ee62d>] filter_commitrw+0x2bd/0x2e0 [obdfilter]
 [<ffffffffa05879a5>] ? lustre_msg_buf+0x85/0x90 [ptlrpc]
 [<ffffffffa05b5c5b>] ? __req_capsule_get+0x14b/0x6b0 [ptlrpc]
 [<ffffffffa00895ec>] ? lprocfs_counter_add+0x12c/0x170 [lvfs]
 [<ffffffffa08a3f5a>] obd_commitrw+0x11a/0x410 [ost]
 [<ffffffffa08ad922>] ost_brw_write+0x1132/0x1870 [ost]
 [<ffffffff812fd4a0>] ? vt_console_print+0x260/0x330
 [<ffffffff81258d33>] ? cpumask_next_and+0x23/0x40
 [<ffffffffa0549940>] ? target_bulk_timeout+0x0/0xe0 [ptlrpc]
 [<ffffffffa08b1cd5>] ost_handle+0x3325/0x4b90 [ost]
 [<ffffffffa07f3686>] ? vvp_session_key_init+0x76/0x1d0 [lustre]
 [<ffffffffa08ae9b0>] ? ost_handle+0x0/0x4b90 [ost]
 [<ffffffffa054cde6>] handle_recovery_req+0x1f6/0x330 [ptlrpc]
 [<ffffffffa0549830>] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]
 [<ffffffff8107a2b0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa054d347>] target_recovery_thread+0x3a7/0xf50 [ptlrpc]
 [<ffffffff810583b6>] ? do_exit+0x5c6/0x870
 [<ffffffffa054cfa0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
 [<ffffffff810041aa>] child_rip+0xa/0x20
 [<ffffffffa054cfa0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
 [<ffffffffa054cfa0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
 [<ffffffff810041a0>] ? child_rip+0x0/0x20
Code: e8 e0 74 07 e1 48 8b 1c 24 4c 8b 64 24 08 c9 c3 48 c7 c7 a0 96 42 a0 e8 f9 73 07 e1 48 8b 53 58 48 8b 43 60 48 c7 c7 a0 96 42 a0 <48> 89 42 08 48 89 10 83 6b 04 01 4c 89 63 58 4c 89 63 60 e8 a2
RIP [<ffffffffa0409336>] lc_watchdog_del_pending+0x56/0x70 [libcfs]
 RSP <ffff8810461bb560>
CR2: 0000000000000008
At the time of the crash (unless I missed something), RBX references the lcw, which does not look healthy:
crash> lc_watchdog 0xffff88107d07e790
struct lc_watchdog {
  lcw_lock = {
    raw_lock = {
      slock = 64
    }
  },
  lcw_refcount = 0,
  lcw_timer = {
    entry = {
      next = 0xffff880ff09ac000,
      prev = 0x60a04c00000001
    },
    expires = 18446744069414584320,
    function = 0x7800000078,
    data = 120,
    base = 0xffffffff81a16ae0,
    start_site = 0x400,
    start_comm = "\000\000@\000\000\000\000\000\001\000\000\000\000\000\000",
    start_pid = 0
  },
  lcw_list = {
    next = 0x0,
    prev = 0xffff88107d07e7f0
  },
  lcw_last_touched = 18446612203131365360,
  lcw_task = 0x0,
  lcw_callback = 0x9ae64cd295,
  lcw_data = 0x6de64,
  lcw_pid = -1900685735,
  lcw_state = 12
}
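One plausible reading of the fault address, offered as a sketch and not as the actual libcfs source: the faulting bytes `<48> 89 42 08` decode to `mov %rax,0x8(%rdx)`, RDX is 0, and the preceding loads read from offsets 0x58 and 0x60 of RBX, with R12 equal to RBX + 0x58. That pattern matches the first store of a list unlink (`entry->next->prev = entry->prev`) on an entry whose `next` is NULL, which would write to address 8, the value in CR2. The userspace C below only computes that address and never performs the store; `first_unlink_store` and `entry_is_linked` are hypothetical helper names, and the offset of 8 assumes a 64-bit build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same two-pointer layout as the kernel's struct list_head. */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/*
 * Address written by the first store of a list_del()-style unlink:
 *     entry->next->prev = entry->prev;
 * The address is only computed here; nothing is dereferenced.
 */
static uintptr_t first_unlink_store(const struct list_head *entry)
{
    assert(entry != NULL);
    return (uintptr_t)entry->next + offsetof(struct list_head, prev);
}

/* Hypothetical guard: an entry is safe to unlink only if both links are set. */
static int entry_is_linked(const struct list_head *entry)
{
    return entry->next != NULL && entry->prev != NULL;
}
```

With `next` set to NULL, as in the `lcw_list` dump above, `first_unlink_store` returns the offset of `prev` (8 on x86_64), and `entry_is_linked` reports the entry as not linked. This only explains where the write landed; it does not say how the structure came to hold a NULL `next` pointer.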
Alex.