[LU-1368] lctl abort_recovery deadlocked Created: 03/May/12  Updated: 12/Dec/17  Resolved: 12/Dec/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Ned Bass Assignee: Hongchao Zhang
Resolution: Cannot Reproduce Votes: 0
Labels: llnl
Environment:

https://github.com/chaos/lustre/commits/2.1.1-4chaos


Attachments: Text File foreach_bt.txt    
Severity: 3
Rank (Obsolete): 9749

 Description   

Recovery was aborted on an OST because it was taking too long (possibly due to LU-1352), but recovery never completed for that OST. The 'lctl abort_recovery --device=3' process was hung with the following backtrace.

PID: 25072  TASK: ffff880311ab2080  CPU: 9   COMMAND: "lctl"
 #0 [ffff880311b5fae8] schedule at ffffffff814eeee0
 #1 [ffff880311b5fbb0] schedule_timeout at ffffffff814efd95
 #2 [ffff880311b5fc60] wait_for_common at ffffffff814efa13
 #3 [ffff880311b5fcf0] wait_for_completion at ffffffff814efb2d
 #4 [ffff880311b5fd00] target_stop_recovery_thread at ffffffffa063b360 [ptlrpc]
 #5 [ffff880311b5fd20] filter_iocontrol at ffffffffa0b07ceb [obdfilter]
 #6 [ffff880311b5fd90] class_handle_ioctl at ffffffffa0509c37 [obdclass]
 #7 [ffff880311b5fe40] obd_class_ioctl at ffffffffa04fa21b [obdclass]
 #8 [ffff880311b5fe60] vfs_ioctl at ffffffff8118ab72
 #9 [ffff880311b5fea0] do_vfs_ioctl at ffffffff8118ad14
#10 [ffff880311b5ff30] sys_ioctl at ffffffff8118b291
#11 [ffff880311b5ff80] system_call_fastpath at ffffffff8100b0f2

Also, tgt_recov backtrace:

PID: 23416  TASK: ffff88033603b500  CPU: 15  COMMAND: "tgt_recov"
 #0 [ffff880310bd38f0] schedule at ffffffff814eeee0
 #1 [ffff880310bd39b8] schedule_timeout at ffffffff814efd12
 #2 [ffff880310bd3a68] cfs_waitq_timedwait at ffffffffa0422521 [libcfs]
 #3 [ffff880310bd3a78] target_bulk_io at ffffffffa0641ea0 [ptlrpc]
 #4 [ffff880310bd3b48] ost_brw_write at ffffffffa0abc21b [ost]
 #5 [ffff880310bd3cb8] ost_handle at ffffffffa0abf0e8 [ost]
 #6 [ffff880310bd3de8] handle_recovery_req at ffffffffa063bcac [ptlrpc]
 #7 [ffff880310bd3e28] target_recovery_thread at ffffffffa063c0b8 [ptlrpc]
 #8 [ffff880310bd3f48] kernel_thread at ffffffff8100c14a

Also attaching the complete 'foreach bt' output from crash.
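
For illustration only, a minimal userspace sketch of the hang shape (illustrative names, not Lustre symbols; the real wait in target_bulk_io is a timed wait whose deadline keeps moving, which amounts to the same thing): the abort_recovery path blocks on a completion that the recovery thread is supposed to signal, while the recovery thread is itself blocked waiting for bulk I/O that never finishes.

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool recovery_done = false;  /* "completion" that lctl waits on */
static bool bulk_done     = false;  /* never set: the bulk I/O never finishes */

/* tgt_recov analog: replaying a write, stuck in the bulk wait. */
static void *recovery_thread(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!bulk_done)                    /* target_bulk_io() analog */
                pthread_cond_wait(&cond, &lock);
        recovery_done = true;                 /* never reached */
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
}

/* lctl abort_recovery analog: waits for the recovery thread to finish. */
int main(void)
{
        pthread_t tid;
        pthread_create(&tid, NULL, recovery_thread, NULL);
        pthread_mutex_lock(&lock);
        while (!recovery_done)                /* wait_for_completion() analog */
                pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
        pthread_join(tid, NULL);
        return 0;                             /* never reached */
}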

LLNL-bugzilla-ID: 1607



 Comments   
Comment by Peter Jones [ 03/May/12 ]

Hi Hongchao

Could you please look into this one?

Thanks

Peter

Comment by Ned Bass [ 03/May/12 ]

From the affected server's logs, note the negative recovery time.

2012-05-03 13:24:54 Lustre: lsc-OST0132: Denying connection for new client 192.168.117.130@o2ib10 (at 9bbecaa5-b07c-320a-ef31-4620f1b26f3c), waiting for 7 clients in recovery for -12:-06

This seems to support the comments in LU-1352. Time here is calculated as

t = cfs_timer_deadline(&target->obd_recovery_timer);
t = cfs_time_sub(t, cfs_time_current());

So we must have cfs_time_current() greater than the deadline recorded in target->obd_recovery_timer.
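
For illustration, a standalone sketch (not the actual logging code) of how a current time past the recorded deadline produces a string like "-12:-06": once the difference goes negative, C's truncating division and modulo make both the minutes and seconds fields negative.

#include <stdio.h>

int main(void)
{
        long deadline = 10000;           /* recorded recovery deadline, seconds */
        long now      = 10726;           /* current time, 726s past the deadline */
        long left     = deadline - now;  /* cfs_time_sub() analog: -726 */

        /* Both halves come out negative, matching the log message
         * (the exact zero-padding in the real format string differs). */
        printf("waiting for 7 clients in recovery for %ld:%ld\n",
               left / 60, left % 60);    /* prints -12:-6 */
        return 0;
}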

Comment by Hongchao Zhang [ 04/May/12 ]

during recovery, a request will be kept alive indefinitely if the server can't complete it.
In ptlrpc_at_send_early_reply:
...
if (req->rq_export &&
    lustre_msg_get_flags(req->rq_reqmsg) &
    (MSG_REPLAY | MSG_REQ_REPLAY_DONE | MSG_LOCK_REPLAY_DONE)) {
        /* During recovery, we don't want to send too many early
         * replies, but on the other hand we want to make sure the
         * client has enough time to resend if the rpc is lost. So
         * during the recovery period send at least 4 early replies,
         * spacing them every at_extra if we can. at_estimate should
         * always equal this fixed value during recovery. */
        at_measured(&svc->srv_at_estimate, min(at_extra,
                    req->rq_export->exp_obd->obd_recovery_timeout / 4));
} else {
        ...
}

newdl = cfs_time_current_sec() + at_get(&svc->srv_at_estimate); <--- here, the new deadline is set.
...

and in target_bulk_io, the request is waited on until it reaches its deadline!
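
To make the interaction concrete, a standalone sketch under the assumptions above (illustrative values, not the target_bulk_io source; 30s stands in for min(at_extra, obd_recovery_timeout / 4)): the bulk wait is bounded by the request deadline, but each early reply moves that deadline forward by at_estimate, so the wait never expires.

#include <stdio.h>

int main(void)
{
        long now         = 0;
        long at_estimate = 30;                  /* min(at_extra, obd_recovery_timeout / 4) */
        long deadline    = now + at_estimate;

        /* Server-side bulk wait, bounded by the request deadline.  Each
         * time the deadline gets close, an early reply resets it to
         * now + at_estimate (the "newdl" assignment above), so the wait
         * never times out.  Capped at 10 rounds just to terminate. */
        for (int round = 0; round < 10; round++) {
                now += at_estimate - 1;         /* bulk still not done as deadline nears */
                deadline = now + at_estimate;   /* early reply extends it again */
                printf("round %2d: now=%4ld deadline=%4ld (still waiting)\n",
                       round, now, deadline);
        }
        return 0;
}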


Comment by Oleg Drokin [ 08/May/12 ]

hm, negative recovery time.
I saw a couple bugs like this.
might be dup of LU-1166

Comment by Hongchao Zhang [ 10/May/12 ]

LU-1352 is likely caused by the negative recovery time; this ticket is related to it but may contain a new issue.
I will attach a possible patch for it.

Comment by Hongchao Zhang [ 11/May/12 ]

the patch is tracked at http://review.whamcloud.com/#change,2720

Comment by Mikhail Pershin [ 05/Jun/12 ]

This looks like just another LU-1166 case. As for the replay request issue, I don't see a problem here: why is the request waiting, and what is it waiting for? ptlrpc_at_send_early_reply() is about sending an early reply; it doesn't affect request processing time on the server. Can you explain a bit more what problem you see here?

Comment by Hongchao Zhang [ 18/Jun/12 ]

the main problem here is that if LNetPut or LNetGet takes a long time (longer than srv_at_estimate) to transfer the data,
the client still sees the AT early reply and continues to wait instead of failing the request.
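
A standalone sketch of the client-side view (illustrative numbers, not the client RPC code; 30s stands in for the server's advertised estimate): each early reply arriving before the deadline extends the client's timeout, so a very slow bulk transfer is never failed and resent.

#include <stdio.h>

int main(void)
{
        long now     = 0;
        long timeout = 100;                   /* client's current RPC deadline */

        /* Each time the deadline approaches, an early (AT) reply arrives
         * and the client extends its timeout instead of failing and
         * resending the request.  Capped at 5 rounds so the sketch
         * terminates. */
        for (int round = 0; round < 5; round++) {
                now = timeout - 1;            /* about to time out... */
                timeout = now + 30;           /* ...early reply extends the deadline */
                printf("round %d: now=%ld, new client deadline=%ld\n",
                       round, now, timeout);
        }
        return 0;
}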

Comment by Hongchao Zhang [ 04/Jul/14 ]

Hi, Mikhail

Is the patch still needed, and should it be updated against master? Thanks!
