[LU-1368] lctl abort_recovery deadlocked Created: 03/May/12 Updated: 12/Dec/17 Resolved: 12/Dec/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Ned Bass | Assignee: | Hongchao Zhang |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | llnl | ||
| Environment: | |||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9749 |
| Description |
|
Recovery was aborted for on an OST because it was taking too long (possibly due to PID: 25072 TASK: ffff880311ab2080 CPU: 9 COMMAND: "lctl" #0 [ffff880311b5fae8] schedule at ffffffff814eeee0 #1 [ffff880311b5fbb0] schedule_timeout at ffffffff814efd95 #2 [ffff880311b5fc60] wait_for_common at ffffffff814efa13 #3 [ffff880311b5fcf0] wait_for_completion at ffffffff814efb2d #4 [ffff880311b5fd00] target_stop_recovery_thread at ffffffffa063b360 [ptlrpc] #5 [ffff880311b5fd20] filter_iocontrol at ffffffffa0b07ceb [obdfilter] #6 [ffff880311b5fd90] class_handle_ioctl at ffffffffa0509c37 [obdclass] #7 [ffff880311b5fe40] obd_class_ioctl at ffffffffa04fa21b [obdclass] #8 [ffff880311b5fe60] vfs_ioctl at ffffffff8118ab72 #9 [ffff880311b5fea0] do_vfs_ioctl at ffffffff8118ad14 #10 [ffff880311b5ff30] sys_ioctl at ffffffff8118b291 #11 [ffff880311b5ff80] system_call_fastpath at ffffffff8100b0f2 Also, tgt_recov backtrace: PID: 23416 TASK: ffff88033603b500 CPU: 15 COMMAND: "tgt_recov" #0 [ffff880310bd38f0] schedule at ffffffff814eeee0 #1 [ffff880310bd39b8] schedule_timeout at ffffffff814efd12 #2 [ffff880310bd3a68] cfs_waitq_timedwait at ffffffffa0422521 [libcfs] #3 [ffff880310bd3a78] target_bulk_io at ffffffffa0641ea0 [ptlrpc] #4 [ffff880310bd3b48] ost_brw_write at ffffffffa0abc21b [ost] #5 [ffff880310bd3cb8] ost_handle at ffffffffa0abf0e8 [ost] #6 [ffff880310bd3de8] handle_recovery_req at ffffffffa063bcac [ptlrpc] #7 [ffff880310bd3e28] target_recovery_thread at ffffffffa063c0b8 [ptlrpc] #8 [ffff880310bd3f48] kernel_thread at ffffffff8100c14a Also attaching complete 'foreach bt' output from crash. LLNL-bugzilla-ID: 1607 |
| Comments |
| Comment by Peter Jones [ 03/May/12 ] |
|
Hi Hongchao Could you please look into this one? Thanks Peter |
| Comment by Ned Bass [ 03/May/12 ] |
|
From the affected server's logs, note the negative recovery time. 2012-05-03 13:24:54 Lustre: lsc-OST0132: Denying connection for new client 192.168.117.130@o2ib10 (at 9bbecaa5-b07c-320a-ef31-4620f1b26f3c), waiting for 7 clients in recovery for -12:-06 This seems to support the comments in t = cfs_timer_deadline(&target->obd_recovery_timer); So we must have cfs_time_current > target->obd_recovery_timer. |
| Comment by Hongchao Zhang [ 04/May/12 ] |
|
during recovery, the request will be kept forever if the server can't complete it, else { ... } newdl = cfs_time_current_sec() + at_get(&svc->srv_at_estimate); <--- here, the new deadline is set. and in target_bulk_io, the request will be waited until it reaches the deadline of the request! _ |
| Comment by Oleg Drokin [ 08/May/12 ] |
|
hm, negative recovery time. |
| Comment by Hongchao Zhang [ 10/May/12 ] |
|
the |
| Comment by Hongchao Zhang [ 11/May/12 ] |
|
the patch is tracked at http://review.whamcloud.com/#change,2720 |
| Comment by Mikhail Pershin [ 05/Jun/12 ] |
|
This looks like just another |
| Comment by Hongchao Zhang [ 18/Jun/12 ] |
|
the main problem here is if the LNetPut or LNetGet cost much long time(longer than the srv_at_estimate) to transfer data, |
| Comment by Hongchao Zhang [ 04/Jul/14 ] |
|
Hi, Mikhail Is the patch still needed, and to be updated against master? Thanks! |