Lustre / LU-1368

lctl abort_recovery deadlocked

    Description

      Recovery was aborted on an OST because it was taking too long (possibly due to LU-1352). Recovery never completed for one OST. The 'lctl abort_recovery --device=3' process hung with the following backtrace.

      PID: 25072  TASK: ffff880311ab2080  CPU: 9   COMMAND: "lctl"
       #0 [ffff880311b5fae8] schedule at ffffffff814eeee0
       #1 [ffff880311b5fbb0] schedule_timeout at ffffffff814efd95
       #2 [ffff880311b5fc60] wait_for_common at ffffffff814efa13
       #3 [ffff880311b5fcf0] wait_for_completion at ffffffff814efb2d
       #4 [ffff880311b5fd00] target_stop_recovery_thread at ffffffffa063b360 [ptlrpc]
       #5 [ffff880311b5fd20] filter_iocontrol at ffffffffa0b07ceb [obdfilter]
       #6 [ffff880311b5fd90] class_handle_ioctl at ffffffffa0509c37 [obdclass]
       #7 [ffff880311b5fe40] obd_class_ioctl at ffffffffa04fa21b [obdclass]
       #8 [ffff880311b5fe60] vfs_ioctl at ffffffff8118ab72
       #9 [ffff880311b5fea0] do_vfs_ioctl at ffffffff8118ad14
      #10 [ffff880311b5ff30] sys_ioctl at ffffffff8118b291
      #11 [ffff880311b5ff80] system_call_fastpath at ffffffff8100b0f2
      

      Also, tgt_recov backtrace:

      PID: 23416  TASK: ffff88033603b500  CPU: 15  COMMAND: "tgt_recov"
       #0 [ffff880310bd38f0] schedule at ffffffff814eeee0
       #1 [ffff880310bd39b8] schedule_timeout at ffffffff814efd12
       #2 [ffff880310bd3a68] cfs_waitq_timedwait at ffffffffa0422521 [libcfs]
       #3 [ffff880310bd3a78] target_bulk_io at ffffffffa0641ea0 [ptlrpc]
       #4 [ffff880310bd3b48] ost_brw_write at ffffffffa0abc21b [ost]
       #5 [ffff880310bd3cb8] ost_handle at ffffffffa0abf0e8 [ost]
       #6 [ffff880310bd3de8] handle_recovery_req at ffffffffa063bcac [ptlrpc]
       #7 [ffff880310bd3e28] target_recovery_thread at ffffffffa063c0b8 [ptlrpc]
       #8 [ffff880310bd3f48] kernel_thread at ffffffff8100c14a
      

      Also attaching complete 'foreach bt' output from crash.
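
      For anyone working through the attached 'foreach bt' output, the following is a minimal sketch of how the traces could be grouped by task and searched for a given function. It assumes the output uses the same header and frame layout as the two backtraces above. The names parse_backtraces and tasks_calling are hypothetical helpers and are not part of crash or Lustre, and SAMPLE is an abbreviated copy of the frames quoted in this report.

```python
import re

# Task header as printed by crash, e.g.
#   PID: 25072  TASK: ffff880311ab2080  CPU: 9   COMMAND: "lctl"
HEADER_RE = re.compile(
    r'PID:\s*(\d+)\s+TASK:\s*([0-9a-f]+)\s+CPU:\s*(\d+)\s+COMMAND:\s*"([^"]*)"'
)

# Stack frame, e.g.
#   #4 [ffff880311b5fd00] target_stop_recovery_thread at ffffffffa063b360 [ptlrpc]
# The trailing [module] is optional (absent for core kernel symbols).
FRAME_RE = re.compile(
    r'#(\d+)\s+\[([0-9a-f]+)\]\s+(\S+)\s+at\s+([0-9a-f]+)(?:\s+\[(\w+)\])?'
)

# Abbreviated copy of the two backtraces quoted in this report.
SAMPLE = '''
PID: 25072  TASK: ffff880311ab2080  CPU: 9   COMMAND: "lctl"
 #0 [ffff880311b5fae8] schedule at ffffffff814eeee0
 #3 [ffff880311b5fcf0] wait_for_completion at ffffffff814efb2d
 #4 [ffff880311b5fd00] target_stop_recovery_thread at ffffffffa063b360 [ptlrpc]
 #5 [ffff880311b5fd20] filter_iocontrol at ffffffffa0b07ceb [obdfilter]

PID: 23416  TASK: ffff88033603b500  CPU: 15  COMMAND: "tgt_recov"
 #0 [ffff880310bd38f0] schedule at ffffffff814eeee0
 #2 [ffff880310bd3a68] cfs_waitq_timedwait at ffffffffa0422521 [libcfs]
 #3 [ffff880310bd3a78] target_bulk_io at ffffffffa0641ea0 [ptlrpc]
 #7 [ffff880310bd3e28] target_recovery_thread at ffffffffa063c0b8 [ptlrpc]
'''


def parse_backtraces(text):
    """Group frame lines under the most recent task header."""
    tasks = []
    for line in text.splitlines():
        header = HEADER_RE.search(line)
        if header:
            tasks.append({
                "pid": int(header.group(1)),
                "command": header.group(4),
                "frames": [],  # list of (function, module or None)
            })
            continue
        frame = FRAME_RE.search(line)
        if frame and tasks:
            tasks[-1]["frames"].append((frame.group(3), frame.group(5)))
    return tasks


def tasks_calling(tasks, function):
    """Return the tasks whose stack contains the given function name."""
    return [t for t in tasks if any(name == function for name, _ in t["frames"])]


if __name__ == "__main__":
    parsed = parse_backtraces(SAMPLE)
    for name in ("target_stop_recovery_thread", "target_bulk_io"):
        for task in tasks_calling(parsed, name):
            print(task["pid"], task["command"], "is in", name)
```

      Run against the full attachment, the same search could be used to list every thread that is waiting in target_stop_recovery_thread or target_bulk_io.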

      LLNL-bugzilla-ID: 1607

          People

            hongchao.zhang Hongchao Zhang
            nedbass Ned Bass (Inactive)