[LU-1125] crash in lc_watchdog_del_pending during tgt_recov Created: 21/Feb/12 Updated: 07/Jun/12 Resolved: 06/Mar/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | Lustre 2.2.0, Lustre 2.1.2 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Alexandre Louvet | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | paj | ||
| Severity: | 3 |
| Rank (Obsolete): | 4709 |
| Description |
|
We have seen several crashes with the following signature (Lustre 2.1.0):

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffffa0409336>] lc_watchdog_del_pending+0x56/0x70 [libcfs]
PGD 8783c2067 PUD 87703f067 PMD 0
Oops: 0002 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu31/cache/index2/shared_cpu_map
CPU 3
Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) exportfs ost(U) mgc(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) lquota(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) sunrpc cpufreq_ondemand acpi_cpufreq freq_table rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_region_hash dm_log dm_round_robin uinput usbhid hid ghes sg i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ehci_hcd uhci_hcd ioatdma hed lpfc scsi_transport_fc scsi_tgt igb dca ext4 jbd2 sd_mod crc_t10dif dm_multipath dm_mod megaraid_sas ahci [last unloaded: scsi_wait_scan]
Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) exportfs ost(U) mgc(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) lquota(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) sunrpc cpufreq_ondemand acpi_cpufreq freq_table rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_region_hash dm_log dm_round_robin uinput usbhid hid ghes sg i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ehci_hcd uhci_hcd ioatdma hed lpfc scsi_transport_fc scsi_tgt igb dca ext4 jbd2 sd_mod crc_t10dif dm_multipath dm_mod megaraid_sas ahci [last unloaded: scsi_wait_scan]
Pid: 21930, comm: tgt_recov Not tainted 2.6.32-131.17.1.bl6.Bull.27.0.x86_64 #1 bullx super-node
RIP: 0010:[<ffffffffa0409336>] [<ffffffffa0409336>] lc_watchdog_del_pending+0x56/0x70 [libcfs]
RSP: 0018:ffff8810461bb560 EFLAGS: 00010286
RAX: ffff880c8e452e20 RBX: ffff88107d07e790 RCX: 0000000000001000
RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffffffa04296a0
RBP: ffff8810461bb570 R08: 0000000000000002 R09: 0000000000000000
R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: ffff88107d07e7e8
R13: 0000000000000002 R14: ffff8810461bbc70 R15: ffff8810763fc038
FS: 00002ba692f28b20(0000) GS:ffff880c8e400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000087c1d8000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process tgt_recov (pid: 21930, threadinfo ffff8810461b8000, task ffff88107d74e100)
Stack:
ffff88107d07e790 0000000000000004 ffff8810461bb5b0 ffffffffa0409388
<0> ffff8810461bb5b0 0000000000000000 ffff88107dd3e680 0000000000000002
<0> ffff8810461bbc70 0000000000000000 ffff8810461bb710 ffffffffa06ce579
Call Trace:
[<ffffffffa0409388>] lc_watchdog_disable+0x38/0x120 [libcfs]
[<ffffffffa06ce579>] quota_chk_acq_common+0x179/0xbc0 [lquota]
[<ffffffffa06cc170>] ? quota_acquire_common+0x0/0x130 [lquota]
[<ffffffff8118d82f>] ? generic_block_bmap+0x3f/0x50
[<ffffffffa08fa6c0>] ? filter_alloc_iobuf+0x170/0x850 [obdfilter]
[<ffffffffa08fba28>] filter_commitrw_write+0xc88/0x2ec8 [obdfilter]
[<ffffffff8147de2a>] ? thread_return+0x4e/0x754
[<ffffffff810657ec>] ? lock_timer_base+0x3c/0x70
[<ffffffff8106629b>] ? try_to_del_timer_sync+0x7b/0xe0
[<ffffffffa08ee62d>] filter_commitrw+0x2bd/0x2e0 [obdfilter]
[<ffffffffa05879a5>] ? lustre_msg_buf+0x85/0x90 [ptlrpc]
[<ffffffffa05b5c5b>] ? __req_capsule_get+0x14b/0x6b0 [ptlrpc]
[<ffffffffa00895ec>] ? lprocfs_counter_add+0x12c/0x170 [lvfs]
[<ffffffffa08a3f5a>] obd_commitrw+0x11a/0x410 [ost]
[<ffffffffa08ad922>] ost_brw_write+0x1132/0x1870 [ost]
[<ffffffff812fd4a0>] ? vt_console_print+0x260/0x330
[<ffffffff81258d33>] ? cpumask_next_and+0x23/0x40
[<ffffffffa0549940>] ? target_bulk_timeout+0x0/0xe0 [ptlrpc]
[<ffffffffa08b1cd5>] ost_handle+0x3325/0x4b90 [ost]
[<ffffffffa07f3686>] ? vvp_session_key_init+0x76/0x1d0 [lustre]
[<ffffffffa08ae9b0>] ? ost_handle+0x0/0x4b90 [ost]
[<ffffffffa054cde6>] handle_recovery_req+0x1f6/0x330 [ptlrpc]
[<ffffffffa0549830>] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]
[<ffffffff8107a2b0>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa054d347>] target_recovery_thread+0x3a7/0xf50 [ptlrpc]
[<ffffffff810583b6>] ? do_exit+0x5c6/0x870
[<ffffffffa054cfa0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
[<ffffffff810041aa>] child_rip+0xa/0x20
[<ffffffffa054cfa0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
[<ffffffffa054cfa0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
[<ffffffff810041a0>] ? child_rip+0x0/0x20
Code: e8 e0 74 07 e1 48 8b 1c 24 4c 8b 64 24 08 c9 c3 48 c7 c7 a0 96 42 a0 e8 f9 73 07 e1 48 8b 53 58 48 8b 43 60 48 c7 c7 a0 96 42 a0 <48> 89 42 08 48 89 10 83 6b 04 01 4c 89 63 58 4c 89 63 60 e8 a2
RIP [<ffffffffa0409336>] lc_watchdog_del_pending+0x56/0x70 [libcfs]
RSP <ffff8810461bb560>
CR2: 0000000000000008

At the time of the crash (unless I missed something), RBX references the lcw, which does not look very healthy:

crash> lc_watchdog 0xffff88107d07e790
struct lc_watchdog {
lcw_lock = {
raw_lock = {
slock = 64
}
},
lcw_refcount = 0,
lcw_timer = {
entry = {
next = 0xffff880ff09ac000,
prev = 0x60a04c00000001
},
expires = 18446744069414584320,
function = 0x7800000078,
data = 120,
base = 0xffffffff81a16ae0,
start_site = 0x400,
start_comm = "\000\000@\000\000\000\000\000\001\000\000\000\000\000\000",
start_pid = 0
},
lcw_list = {
next = 0x0,
prev = 0xffff88107d07e7f0
},
lcw_last_touched = 18446612203131365360,
lcw_task = 0x0,
lcw_callback = 0x9ae64cd295,
lcw_data = 0x6de64,
lcw_pid = -1900685735,
lcw_state = 12
}
Alex. |
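One way to read the fault address, offered as a sketch and not as a confirmed analysis: the marked bytes in the Code: line (`<48> 89 42 08`) decode to a store to `0x8(%rdx)`, RDX is 0 in the register dump, and the dump shows `lcw_list.next = 0x0`. That fits the first store of a list unlink, `next->prev = prev`, landing on address 8, which matches CR2. The userspace C below only reproduces that arithmetic. The `struct list_head` layout is assumed to match the kernel's two-pointer layout, and `unlink_first_store_addr` and `dumped_lcw_list` are hypothetical helper names, not libcfs functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's struct list_head (layout assumed). */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/*
 * Address that the first store of a list unlink, next->prev = prev,
 * would write to. Computed arithmetically; next is never dereferenced.
 */
static uintptr_t unlink_first_store_addr(const struct list_head *entry)
{
    assert(entry != NULL);
    return (uintptr_t)entry->next + offsetof(struct list_head, prev);
}

/* Hypothetical helper: lcw_list with the values shown in the crash dump. */
static struct list_head dumped_lcw_list(void)
{
    struct list_head l;

    l.next = NULL; /* lcw_list.next = 0x0 */
    l.prev = (struct list_head *)(uintptr_t)0xffff88107d07e7f0ULL;
    return l;
}
```

On a 64-bit build, `unlink_first_store_addr` applied to the dumped values yields 8, the same value reported in CR2.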
| Comments |
| Comment by Peter Jones [ 22/Feb/12 ] |
|
Bobi, could you look into this one please? Thanks, Peter |
| Comment by Zhenyu Xu [ 22/Feb/12 ] |
|
The recovery thread has not initialized its watchdog correctly. Patch tracking at http://review.whamcloud.com/2174 |
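To illustrate why initialization matters for this kind of crash, here is a minimal userspace sketch. It is not the patch at the URL above, whose contents are not shown in this ticket. It assumes a kernel-style circular list in which an initialized entry points to itself, so that unlinking an entry that was never queued is harmless. All names (`watchdog_sketch`, `list_del_init_sketch`, and so on) are hypothetical and are not libcfs APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's circular doubly linked list. */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* An initialized head or entry points to itself. */
static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* Insert entry right after head. */
static void list_add_sketch(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Unlink entry and re-initialize it; safe on a self-linked entry. */
static void list_del_init_sketch(struct list_head *entry)
{
    /* A NULL link means the entry was never initialized (or was freed). */
    assert(entry->next != NULL && entry->prev != NULL);
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    init_list_head(entry);
}

/* Hypothetical, heavily reduced watchdog: only the pending-list linkage. */
struct watchdog_sketch {
    struct list_head pending;
    int state;
};

static void watchdog_sketch_init(struct watchdog_sketch *w)
{
    init_list_head(&w->pending);
    w->state = 0;
}
```

With the entry initialized, removing a watchdog that was never put on the pending list leaves it self-linked instead of writing through a NULL `next` pointer.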
| Comment by Sebastien Buisson (Inactive) [ 22/Feb/12 ] |
|
Hi, I cannot access the patch on Gerrit, I get a server error. Sebastien. |
| Comment by Peter Jones [ 22/Feb/12 ] |
|
Strange. It seems to be working ok for me atm. Is this problem still occurring for you? |
| Comment by Sebastien Buisson (Inactive) [ 22/Feb/12 ] |
|
Yes, still the same. Here is the exact error message: Application Error |
| Comment by Build Master (Inactive) [ 29/Feb/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Peter Jones [ 06/Mar/12 ] |
|
Landed for 2.2 |