[LU-1125] crash in lc_watchdog_del_pending during tgt_recov
Created: 21/Feb/12
Updated: 07/Jun/12
Resolved: 06/Mar/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: Lustre 2.2.0, Lustre 2.1.2

Type: Bug
Priority: Minor
Reporter: Alexandre Louvet
Assignee: Zhenyu Xu
Resolution: Fixed
Votes: 0
Labels: paj

Severity: 3
Rank (Obsolete): 4709

 Description   

We have seen several crashes with the following signature (Lustre 2.1.0):

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffffa0409336>] lc_watchdog_del_pending+0x56/0x70 [libcfs]
PGD 8783c2067 PUD 87703f067 PMD 0 
Oops: 0002 [#1] SMP 
last sysfs file: /sys/devices/system/cpu/cpu31/cache/index2/shared_cpu_map
CPU 3 
Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) exportfs ost(U) mgc(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) lquota(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) sunrpc cpufreq_ondemand acpi_cpufreq freq_table rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_region_hash dm_log dm_round_robin uinput usbhid hid ghes sg i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ehci_hcd uhci_hcd ioatdma hed lpfc scsi_transport_fc scsi_tgt igb dca ext4 jbd2 sd_mod crc_t10dif dm_multipath dm_mod megaraid_sas ahci [last unloaded: scsi_wait_scan]

Pid: 21930, comm: tgt_recov Not tainted 2.6.32-131.17.1.bl6.Bull.27.0.x86_64 #1 bullx super-node
RIP: 0010:[<ffffffffa0409336>]  [<ffffffffa0409336>] lc_watchdog_del_pending+0x56/0x70 [libcfs]
RSP: 0018:ffff8810461bb560  EFLAGS: 00010286
RAX: ffff880c8e452e20 RBX: ffff88107d07e790 RCX: 0000000000001000
RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffffffa04296a0
RBP: ffff8810461bb570 R08: 0000000000000002 R09: 0000000000000000
R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: ffff88107d07e7e8
R13: 0000000000000002 R14: ffff8810461bbc70 R15: ffff8810763fc038
FS:  00002ba692f28b20(0000) GS:ffff880c8e400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000087c1d8000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process tgt_recov (pid: 21930, threadinfo ffff8810461b8000, task ffff88107d74e100)
Stack:
 ffff88107d07e790 0000000000000004 ffff8810461bb5b0 ffffffffa0409388
<0> ffff8810461bb5b0 0000000000000000 ffff88107dd3e680 0000000000000002
<0> ffff8810461bbc70 0000000000000000 ffff8810461bb710 ffffffffa06ce579
Call Trace:
 [<ffffffffa0409388>] lc_watchdog_disable+0x38/0x120 [libcfs]
 [<ffffffffa06ce579>] quota_chk_acq_common+0x179/0xbc0 [lquota]
 [<ffffffffa06cc170>] ? quota_acquire_common+0x0/0x130 [lquota]
 [<ffffffff8118d82f>] ? generic_block_bmap+0x3f/0x50
 [<ffffffffa08fa6c0>] ? filter_alloc_iobuf+0x170/0x850 [obdfilter]
 [<ffffffffa08fba28>] filter_commitrw_write+0xc88/0x2ec8 [obdfilter]
 [<ffffffff8147de2a>] ? thread_return+0x4e/0x754
 [<ffffffff810657ec>] ? lock_timer_base+0x3c/0x70
 [<ffffffff8106629b>] ? try_to_del_timer_sync+0x7b/0xe0
 [<ffffffffa08ee62d>] filter_commitrw+0x2bd/0x2e0 [obdfilter]
 [<ffffffffa05879a5>] ? lustre_msg_buf+0x85/0x90 [ptlrpc]
 [<ffffffffa05b5c5b>] ? __req_capsule_get+0x14b/0x6b0 [ptlrpc]
 [<ffffffffa00895ec>] ? lprocfs_counter_add+0x12c/0x170 [lvfs]
 [<ffffffffa08a3f5a>] obd_commitrw+0x11a/0x410 [ost]
 [<ffffffffa08ad922>] ost_brw_write+0x1132/0x1870 [ost]
 [<ffffffff812fd4a0>] ? vt_console_print+0x260/0x330
 [<ffffffff81258d33>] ? cpumask_next_and+0x23/0x40
 [<ffffffffa0549940>] ? target_bulk_timeout+0x0/0xe0 [ptlrpc]
 [<ffffffffa08b1cd5>] ost_handle+0x3325/0x4b90 [ost]
 [<ffffffffa07f3686>] ? vvp_session_key_init+0x76/0x1d0 [lustre]
 [<ffffffffa08ae9b0>] ? ost_handle+0x0/0x4b90 [ost]
 [<ffffffffa054cde6>] handle_recovery_req+0x1f6/0x330 [ptlrpc]
 [<ffffffffa0549830>] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]
 [<ffffffff8107a2b0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa054d347>] target_recovery_thread+0x3a7/0xf50 [ptlrpc]
 [<ffffffff810583b6>] ? do_exit+0x5c6/0x870
 [<ffffffffa054cfa0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
 [<ffffffff810041aa>] child_rip+0xa/0x20
 [<ffffffffa054cfa0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
 [<ffffffffa054cfa0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
 [<ffffffff810041a0>] ? child_rip+0x0/0x20
Code: e8 e0 74 07 e1 48 8b 1c 24 4c 8b 64 24 08 c9 c3 48 c7 c7 a0 96 42 a0 e8 f9 73 07 e1 48 8b 53 58 48 8b 43 60 48 c7 c7 a0 96 42 a0 <48> 89 42 08 48 89 10 83 6b 04 01 4c 89 63 58 4c 89 63 60 e8 a2 
RIP  [<ffffffffa0409336>] lc_watchdog_del_pending+0x56/0x70 [libcfs]
 RSP <ffff8810461bb560>
CR2: 0000000000000008

At the time of the crash (unless I missed something), RBX references the lcw, which does not look very healthy:

crash> lc_watchdog 0xffff88107d07e790
struct lc_watchdog {
  lcw_lock = {
    raw_lock = {
      slock = 64
    }
  }, 
  lcw_refcount = 0, 
  lcw_timer = {
    entry = {
      next = 0xffff880ff09ac000, 
      prev = 0x60a04c00000001
    }, 
    expires = 18446744069414584320, 
    function = 0x7800000078, 
    data = 120, 
    base = 0xffffffff81a16ae0, 
    start_site = 0x400, 
    start_comm = "\000\000@\000\000\000\000\000\001\000\000\000\000\000\000", 
    start_pid = 0
  }, 
  lcw_list = {
    next = 0x0, 
    prev = 0xffff88107d07e7f0
  }, 
  lcw_last_touched = 18446612203131365360, 
  lcw_task = 0x0, 
  lcw_callback = 0x9ae64cd295, 
  lcw_data = 0x6de64, 
  lcw_pid = -1900685735, 
  lcw_state = 12
}
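
The register values and "Code:" bytes in the oops can be cross-checked against the struct dump with a little arithmetic. The sketch below is only an illustration: the displacements are read off the instruction bytes around the faulting `<48> 89 42 08`, the helper name `effective_address` is made up for this example, and the interpretation that the faulting store is the `next->prev = prev` step of a list removal on `lcw_list` is an assumption, not something confirmed in this ticket.

```python
# Register values copied from the oops above.
RBX = 0xffff88107d07e790   # lcw pointer, per the reporter's reading
RDX = 0x0                  # loaded from 0x58(%rbx)
R12 = 0xffff88107d07e7e8
CR2 = 0x8                  # faulting address
OOPS_ERROR_CODE = 0x0002   # from "Oops: 0002"

# Displacements decoded from the "Code:" bytes:
#   48 8b 53 58  -> mov 0x58(%rbx),%rdx
#   48 8b 43 60  -> mov 0x60(%rbx),%rax
#   48 89 42 08  -> mov %rax,0x8(%rdx)   (faulting instruction)
NEXT_DISP = 0x58
PREV_DISP = 0x60
STORE_DISP = 0x08


def effective_address(base, disp):
    """Hypothetical helper: 64-bit base + displacement."""
    return (base + disp) & 0xFFFFFFFFFFFFFFFF


# Address of the two-pointer field the code loads from; it matches R12.
lcw_list_addr = effective_address(RBX, NEXT_DISP)

# Address the faulting store writes to; it matches CR2.
fault_addr = effective_address(RDX, STORE_DISP)

# Bit 1 of the x86 page-fault error code is set for a write access.
is_write = bool(OOPS_ERROR_CODE & 0x2)

print(hex(lcw_list_addr))  # 0xffff88107d07e7e8
print(hex(fault_addr))     # 0x8
print(is_write)            # True
```

Under these assumptions, the field at RBX + 0x58 lines up with `lcw_list` in the dump, whose `next` is 0x0, so a write to `next + 8` lands on address 0x8, which is the reported CR2.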

Alex.



 Comments   
Comment by Peter Jones [ 22/Feb/12 ]

Bobi

Could you look into this one please?

Thanks

Peter

Comment by Zhenyu Xu [ 22/Feb/12 ]

The recovery thread has not initialized its watchdog correctly. The patch is tracked at http://review.whamcloud.com/2174

Comment by Sebastien Buisson (Inactive) [ 22/Feb/12 ]

Hi,

I cannot access the patch on Gerrit, I get a server error.
I can view other patches from other tickets, but not this one.

Sebastien.

Comment by Peter Jones [ 22/Feb/12 ]

Strange. It seems to be working OK for me at the moment. Is this problem still occurring for you?

Comment by Sebastien Buisson (Inactive) [ 22/Feb/12 ]

Yes, still the same.
Diego and Gregoire suffer from the same issue.

Here is the exact error message:

Application Error
Server Unavailable
0

Comment by Build Master (Inactive) [ 29/Feb/12 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #493
LU-1125 recovery: initial recovery thread's watchdog (Revision 039c582adfb8fbb537f1b3dcacd518a6681b0cef)

Result = SUCCESS
Oleg Drokin : 039c582adfb8fbb537f1b3dcacd518a6681b0cef
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Feb/12 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #493
LU-1125 recovery: initial recovery thread's watchdog (Revision 039c582adfb8fbb537f1b3dcacd518a6681b0cef)

Result = SUCCESS
Oleg Drokin : 039c582adfb8fbb537f1b3dcacd518a6681b0cef
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Feb/12 ]

Integrated in lustre-master » i686,server,el5,ofa #493
LU-1125 recovery: initial recovery thread's watchdog (Revision 039c582adfb8fbb537f1b3dcacd518a6681b0cef)

Result = SUCCESS
Oleg Drokin : 039c582adfb8fbb537f1b3dcacd518a6681b0cef
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Feb/12 ]

Integrated in lustre-master » x86_64,client,el5,ofa #493
LU-1125 recovery: initial recovery thread's watchdog (Revision 039c582adfb8fbb537f1b3dcacd518a6681b0cef)

Result = SUCCESS
Oleg Drokin : 039c582adfb8fbb537f1b3dcacd518a6681b0cef
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Feb/12 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #493
LU-1125 recovery: initial recovery thread's watchdog (Revision 039c582adfb8fbb537f1b3dcacd518a6681b0cef)

Result = SUCCESS
Oleg Drokin : 039c582adfb8fbb537f1b3dcacd518a6681b0cef
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Feb/12 ]

Integrated in lustre-master » x86_64,client,el6,ofa #493
LU-1125 recovery: initial recovery thread's watchdog (Revision 039c582adfb8fbb537f1b3dcacd518a6681b0cef)

Result = SUCCESS
Oleg Drokin : 039c582adfb8fbb537f1b3dcacd518a6681b0cef
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Feb/12 ]

Integrated in lustre-master » i686,client,el5,ofa #493
LU-1125 recovery: initial recovery thread's watchdog (Revision 039c582adfb8fbb537f1b3dcacd518a6681b0cef)

Result = SUCCESS
Oleg Drokin : 039c582adfb8fbb537f1b3dcacd518a6681b0cef
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Feb/12 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #493
LU-1125 recovery: initial recovery thread's watchdog (Revision 039c582adfb8fbb537f1b3dcacd518a6681b0cef)

Result = SUCCESS
Oleg Drokin : 039c582adfb8fbb537f1b3dcacd518a6681b0cef
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Feb/12 ]

Integrated in lustre-master » x86_64,server,el6,ofa #493
LU-1125 recovery: initial recovery thread's watchdog (Revision 039c582adfb8fbb537f1b3dcacd518a6681b0cef)

Result = SUCCESS
Oleg Drokin : 039c582adfb8fbb537f1b3dcacd518a6681b0cef
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Feb/12 ]

Integrated in lustre-master » x86_64,server,el5,ofa #493
LU-1125 recovery: initial recovery thread's watchdog (Revision 039c582adfb8fbb537f1b3dcacd518a6681b0cef)

Result = SUCCESS
Oleg Drokin : 039c582adfb8fbb537f1b3dcacd518a6681b0cef
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Feb/12 ]

Integrated in lustre-master » i686,server,el6,inkernel #493
LU-1125 recovery: initial recovery thread's watchdog (Revision 039c582adfb8fbb537f1b3dcacd518a6681b0cef)

Result = SUCCESS
Oleg Drokin : 039c582adfb8fbb537f1b3dcacd518a6681b0cef
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Feb/12 ]

Integrated in lustre-master » i686,client,el6,inkernel #493
LU-1125 recovery: initial recovery thread's watchdog (Revision 039c582adfb8fbb537f1b3dcacd518a6681b0cef)

Result = SUCCESS
Oleg Drokin : 039c582adfb8fbb537f1b3dcacd518a6681b0cef
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Feb/12 ]

Integrated in lustre-master » i686,client,el5,inkernel #493
LU-1125 recovery: initial recovery thread's watchdog (Revision 039c582adfb8fbb537f1b3dcacd518a6681b0cef)

Result = SUCCESS
Oleg Drokin : 039c582adfb8fbb537f1b3dcacd518a6681b0cef
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Feb/12 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #493
LU-1125 recovery: initial recovery thread's watchdog (Revision 039c582adfb8fbb537f1b3dcacd518a6681b0cef)

Result = SUCCESS
Oleg Drokin : 039c582adfb8fbb537f1b3dcacd518a6681b0cef
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Feb/12 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #493
LU-1125 recovery: initial recovery thread's watchdog (Revision 039c582adfb8fbb537f1b3dcacd518a6681b0cef)

Result = SUCCESS
Oleg Drokin : 039c582adfb8fbb537f1b3dcacd518a6681b0cef
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Feb/12 ]

Integrated in lustre-master » i686,server,el5,inkernel #493
LU-1125 recovery: initial recovery thread's watchdog (Revision 039c582adfb8fbb537f1b3dcacd518a6681b0cef)

Result = SUCCESS
Oleg Drokin : 039c582adfb8fbb537f1b3dcacd518a6681b0cef
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Feb/12 ]

Integrated in lustre-master » i686,client,el6,ofa #493
LU-1125 recovery: initial recovery thread's watchdog (Revision 039c582adfb8fbb537f1b3dcacd518a6681b0cef)

Result = SUCCESS
Oleg Drokin : 039c582adfb8fbb537f1b3dcacd518a6681b0cef
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Feb/12 ]

Integrated in lustre-master » i686,server,el6,ofa #493
LU-1125 recovery: initial recovery thread's watchdog (Revision 039c582adfb8fbb537f1b3dcacd518a6681b0cef)

Result = SUCCESS
Oleg Drokin : 039c582adfb8fbb537f1b3dcacd518a6681b0cef
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Peter Jones [ 06/Mar/12 ]

Landed for 2.2
