Lustre / LU-1125

crash in lc_watchdog_del_pending during tgt_recov

    Description

      We have seen several crashes with the following signature (Lustre 2.1.0):

      BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      IP: [<ffffffffa0409336>] lc_watchdog_del_pending+0x56/0x70 [libcfs]
      PGD 8783c2067 PUD 87703f067 PMD 0 
      Oops: 0002 [#1] SMP 
      last sysfs file: /sys/devices/system/cpu/cpu31/cache/index2/shared_cpu_map
      CPU 3 
      Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) exportfs ost(U) mgc(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) lquota(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) sunrpc cpufreq_ondemand acpi_cpufreq freq_table rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_region_hash dm_log dm_round_robin uinput usbhid hid ghes sg i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ehci_hcd uhci_hcd ioatdma hed lpfc scsi_transport_fc scsi_tgt igb dca ext4 jbd2 sd_mod crc_t10dif dm_multipath dm_mod megaraid_sas ahci [last unloaded: scsi_wait_scan]
      Pid: 21930, comm: tgt_recov Not tainted 2.6.32-131.17.1.bl6.Bull.27.0.x86_64 #1 bullx super-node
      RIP: 0010:[<ffffffffa0409336>]  [<ffffffffa0409336>] lc_watchdog_del_pending+0x56/0x70 [libcfs]
      RSP: 0018:ffff8810461bb560  EFLAGS: 00010286
      RAX: ffff880c8e452e20 RBX: ffff88107d07e790 RCX: 0000000000001000
      RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffffffa04296a0
      RBP: ffff8810461bb570 R08: 0000000000000002 R09: 0000000000000000
      R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: ffff88107d07e7e8
      R13: 0000000000000002 R14: ffff8810461bbc70 R15: ffff8810763fc038
      FS:  00002ba692f28b20(0000) GS:ffff880c8e400000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000008 CR3: 000000087c1d8000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process tgt_recov (pid: 21930, threadinfo ffff8810461b8000, task ffff88107d74e100)
      Stack:
       ffff88107d07e790 0000000000000004 ffff8810461bb5b0 ffffffffa0409388
      <0> ffff8810461bb5b0 0000000000000000 ffff88107dd3e680 0000000000000002
      <0> ffff8810461bbc70 0000000000000000 ffff8810461bb710 ffffffffa06ce579
      Call Trace:
       [<ffffffffa0409388>] lc_watchdog_disable+0x38/0x120 [libcfs]
       [<ffffffffa06ce579>] quota_chk_acq_common+0x179/0xbc0 [lquota]
       [<ffffffffa06cc170>] ? quota_acquire_common+0x0/0x130 [lquota]
       [<ffffffff8118d82f>] ? generic_block_bmap+0x3f/0x50
       [<ffffffffa08fa6c0>] ? filter_alloc_iobuf+0x170/0x850 [obdfilter]
       [<ffffffffa08fba28>] filter_commitrw_write+0xc88/0x2ec8 [obdfilter]
       [<ffffffff8147de2a>] ? thread_return+0x4e/0x754
       [<ffffffff810657ec>] ? lock_timer_base+0x3c/0x70
       [<ffffffff8106629b>] ? try_to_del_timer_sync+0x7b/0xe0
       [<ffffffffa08ee62d>] filter_commitrw+0x2bd/0x2e0 [obdfilter]
       [<ffffffffa05879a5>] ? lustre_msg_buf+0x85/0x90 [ptlrpc]
       [<ffffffffa05b5c5b>] ? __req_capsule_get+0x14b/0x6b0 [ptlrpc]
       [<ffffffffa00895ec>] ? lprocfs_counter_add+0x12c/0x170 [lvfs]
       [<ffffffffa08a3f5a>] obd_commitrw+0x11a/0x410 [ost]
       [<ffffffffa08ad922>] ost_brw_write+0x1132/0x1870 [ost]
       [<ffffffff812fd4a0>] ? vt_console_print+0x260/0x330
       [<ffffffff81258d33>] ? cpumask_next_and+0x23/0x40
       [<ffffffffa0549940>] ? target_bulk_timeout+0x0/0xe0 [ptlrpc]
       [<ffffffffa08b1cd5>] ost_handle+0x3325/0x4b90 [ost]
       [<ffffffffa07f3686>] ? vvp_session_key_init+0x76/0x1d0 [lustre]
       [<ffffffffa08ae9b0>] ? ost_handle+0x0/0x4b90 [ost]
       [<ffffffffa054cde6>] handle_recovery_req+0x1f6/0x330 [ptlrpc]
       [<ffffffffa0549830>] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]
       [<ffffffff8107a2b0>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa054d347>] target_recovery_thread+0x3a7/0xf50 [ptlrpc]
       [<ffffffff810583b6>] ? do_exit+0x5c6/0x870
       [<ffffffffa054cfa0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
       [<ffffffff810041aa>] child_rip+0xa/0x20
       [<ffffffffa054cfa0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
       [<ffffffffa054cfa0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
       [<ffffffff810041a0>] ? child_rip+0x0/0x20
      Code: e8 e0 74 07 e1 48 8b 1c 24 4c 8b 64 24 08 c9 c3 48 c7 c7 a0 96 42 a0 e8 f9 73 07 e1 48 8b 53 58 48 8b 43 60 48 c7 c7 a0 96 42 a0 <48> 89 42 08 48 89 10 83 6b 04 01 4c 89 63 58 4c 89 63 60 e8 a2 
      RIP  [<ffffffffa0409336>] lc_watchdog_del_pending+0x56/0x70 [libcfs]
       RSP <ffff8810461bb560>
      CR2: 0000000000000008
      

      At the time of the crash (unless I missed something), RBX references the lcw, which does not look healthy:

      crash> lc_watchdog 0xffff88107d07e790
      struct lc_watchdog {
        lcw_lock = {
          raw_lock = {
            slock = 64
          }
        }, 
        lcw_refcount = 0, 
        lcw_timer = {
          entry = {
            next = 0xffff880ff09ac000, 
            prev = 0x60a04c00000001
          }, 
          expires = 18446744069414584320, 
          function = 0x7800000078, 
          data = 120, 
          base = 0xffffffff81a16ae0, 
          start_site = 0x400, 
          start_comm = "\000\000@\000\000\000\000\000\001\000\000\000\000\000\000", 
          start_pid = 0
        }, 
        lcw_list = {
          next = 0x0, 
          prev = 0xffff88107d07e7f0
        }, 
        lcw_last_touched = 18446612203131365360, 
        lcw_task = 0x0, 
        lcw_callback = 0x9ae64cd295, 
        lcw_data = 0x6de64, 
        lcw_pid = -1900685735, 
        lcw_state = 12
      }
      

      Alex.

            People

              bobijam Zhenyu Xu
              louveta Alexandre Louvet (Inactive)