Details
Bug
Resolution: Cannot Reproduce
Minor
None
Lustre 2.1.1
3
9749
Description
Recovery was aborted on an OST because it was taking too long (possibly due to LU-1352). Recovery never completed for one OST. The 'lctl abort_recovery --device=3' process hung with the following backtrace:
PID: 25072  TASK: ffff880311ab2080  CPU: 9  COMMAND: "lctl"
 #0 [ffff880311b5fae8] schedule at ffffffff814eeee0
 #1 [ffff880311b5fbb0] schedule_timeout at ffffffff814efd95
 #2 [ffff880311b5fc60] wait_for_common at ffffffff814efa13
 #3 [ffff880311b5fcf0] wait_for_completion at ffffffff814efb2d
 #4 [ffff880311b5fd00] target_stop_recovery_thread at ffffffffa063b360 [ptlrpc]
 #5 [ffff880311b5fd20] filter_iocontrol at ffffffffa0b07ceb [obdfilter]
 #6 [ffff880311b5fd90] class_handle_ioctl at ffffffffa0509c37 [obdclass]
 #7 [ffff880311b5fe40] obd_class_ioctl at ffffffffa04fa21b [obdclass]
 #8 [ffff880311b5fe60] vfs_ioctl at ffffffff8118ab72
 #9 [ffff880311b5fea0] do_vfs_ioctl at ffffffff8118ad14
#10 [ffff880311b5ff30] sys_ioctl at ffffffff8118b291
#11 [ffff880311b5ff80] system_call_fastpath at ffffffff8100b0f2
Also, tgt_recov backtrace:
PID: 23416  TASK: ffff88033603b500  CPU: 15  COMMAND: "tgt_recov"
 #0 [ffff880310bd38f0] schedule at ffffffff814eeee0
 #1 [ffff880310bd39b8] schedule_timeout at ffffffff814efd12
 #2 [ffff880310bd3a68] cfs_waitq_timedwait at ffffffffa0422521 [libcfs]
 #3 [ffff880310bd3a78] target_bulk_io at ffffffffa0641ea0 [ptlrpc]
 #4 [ffff880310bd3b48] ost_brw_write at ffffffffa0abc21b [ost]
 #5 [ffff880310bd3cb8] ost_handle at ffffffffa0abf0e8 [ost]
 #6 [ffff880310bd3de8] handle_recovery_req at ffffffffa063bcac [ptlrpc]
 #7 [ffff880310bd3e28] target_recovery_thread at ffffffffa063c0b8 [ptlrpc]
 #8 [ffff880310bd3f48] kernel_thread at ffffffff8100c14a
Also attaching complete 'foreach bt' output from crash.
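One way to read the two backtraces together: the lctl process is in wait_for_completion under target_stop_recovery_thread, waiting for the recovery thread to exit, while tgt_recov is itself in a timed wait inside target_bulk_io while replaying a write. The Python sketch below only illustrates that general pattern and is not Lustre code. All names in it are hypothetical, and whether the real bulk wait ignores the abort request is an assumption that this ticket does not confirm. Timeouts are used so the demonstration terminates; the kernel's wait_for_completion has none.

```python
import threading


def simulate_abort(worker_checks_abort: bool, wait_timeout: float = 0.5) -> bool:
    """Return True if the abort caller sees the recovery thread finish in time.

    Illustrative model only; every name here is hypothetical.
    """
    bulk_done = threading.Event()        # never set: bulk I/O that never completes
    abort_requested = threading.Event()  # set by the abort caller
    thread_finished = threading.Event()  # the completion the abort caller waits on
    release = threading.Event()          # demo-only escape hatch so the thread can exit

    def recovery_thread() -> None:
        # Replay one request that blocks in repeated timed waits for bulk data.
        while not bulk_done.wait(timeout=0.05):
            if worker_checks_abort and abort_requested.is_set():
                break  # the wait loop notices the abort and gives up
            if release.is_set():
                break  # demo cleanup only, has no real-world counterpart
        thread_finished.set()

    worker = threading.Thread(target=recovery_thread, daemon=True)
    worker.start()

    # Abort path: request the abort, then wait for the recovery thread to finish.
    abort_requested.set()
    finished = thread_finished.wait(timeout=wait_timeout)

    # Let the worker exit so the demo does not leak a thread.
    release.set()
    worker.join()
    return finished


if __name__ == "__main__":
    print("wait loop checks abort:", simulate_abort(True))
    print("wait loop ignores abort:", simulate_abort(False))
```

In this model the abort caller returns promptly only when the blocked wait loop itself checks for the abort request; otherwise it waits for as long as the bulk wait lasts.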
LLNL-bugzilla-ID: 1607
Hi, Mikhail
Is the patch still needed, and does it need to be updated against master? Thanks!