[LU-8197] early reply causes replay request deadline decrease - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.9.0, Lustre 2.10.0
Affects Version/s: None
Labels:
- patch

Severity: 3

Description

When calculating the timeout to send in a reply, ptlrpc_at_set_reply() uses at_get(&svcpt->scp_at_estimate):

static void ptlrpc_at_set_reply(struct ptlrpc_request *req, int flags)
{
...
        if (req->rq_type == PTL_RPC_MSG_ERR &&
            (req->rq_export == NULL || req->rq_export->exp_obd->obd_recovering))
                lustre_msg_set_timeout(req->rq_repmsg, 0);
        else
                lustre_msg_set_timeout(req->rq_repmsg,
                                       at_get(&svcpt->scp_at_estimate));
...
}

Setting the timeout this way for replay requests shortens the replay request deadline on the client, which leads to early timeouts and client disconnect/reconnect cycles, as in the example below:

00:0.0:1455205888.609614:0:4097:0:(client.c:2898:ptlrpc_replay_req()) @@@ REPLAY  req@ffff88001f993400 x1525892057668728/t17179869186(17179869186) o4->lustre-OST0000-osc-ffff880022d97800@192.168.2.100@tcp:6/4 lens 488/416 e 0 to 0 dl 1455205882 ref 1 fl New:R/4/0 rc 0/0
00000100:00001000:0.0:1455205889.610452:0:4097:0:(client.c:399:ptlrpc_at_recv_early_reply()) @@@ Early reply #1, new deadline in 1s (-5s)  req@ffff88001f993400 x1525892057668728/t17179869186(17179869186) o4->lustre-OST0000-osc-ffff880022d97800@192.168.2.100@tcp:6/4 lens 488/416 e 1 to 0 dl 1455205890 ref 2 fl Rpc:/4/ffffffff rc 0/-1
00000100:00001000:0.0:1455205900.616950:0:4097:0:(client.c:399:ptlrpc_at_recv_early_reply()) @@@ Early reply #2, new deadline in -4s (-5s)  req@ffff88001f993400 x1525892057668728/t17179869186(17179869186) o4->lustre-OST0000-osc-ffff880022d97800@192.168.2.100@tcp:6/4 lens 488/416 e 2 to 0 dl 1455205896 ref 2 fl Rpc:/6/ffffffff rc 0/-1
00000100:00000400:0.0:1455205894.611675:0:4097:0:(client.c:1979:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1455205888/real 1455205888]  req@ffff88001f993400 x1525892057668728/t17179869186(17179869186) o4->lustre-OST0000-osc-ffff880022d97800@192.168.2.100@tcp:6/4 lens 488/416 e 1 to 1 dl 1455205890 ref 2 fl Rpc:X/4/ffffffff rc 0/-1
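The figures in the "Early reply #1" line can be checked by hand. The sketch below assumes the message reports two values: the seconds remaining until the new deadline, and (in parentheses) the change relative to the previous deadline. Under that assumption the previous deadline of 1455205895 is inferred from the log rather than printed in it, and the helper names are hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: seconds left from 'now' until 'deadline'
 * (negative when the deadline has already passed). */
static long secs_until(long deadline, long now)
{
	return deadline - now;
}

/* Hypothetical helper: how far the deadline moved; a negative
 * result means the deadline was pulled earlier. */
static long deadline_change(long new_dl, long old_dl)
{
	return new_dl - old_dl;
}

/* Print the two figures in the same shape as the log message. */
static void print_early_reply(long new_dl, long old_dl, long now)
{
	printf("new deadline in %lds (%lds)\n",
	       secs_until(new_dl, now), deadline_change(new_dl, old_dl));
}
```

With now = 1455205889 and dl = 1455205890 taken from the first early reply, the remaining time is 1s. A change of -5s means the early reply moved the deadline five seconds earlier instead of extending it, which is the decrease this ticket describes.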

Instead, during recovery the timeout should be calculated in a way similar to how ptlrpc_at_send_early_reply() calculates the new deadline for replay requests:

static int ptlrpc_at_send_early_reply(struct ptlrpc_request *req)
{
...
        if (req->rq_export &&
            lustre_msg_get_flags(req->rq_reqmsg) &
            (MSG_REPLAY | MSG_REQ_REPLAY_DONE | MSG_LOCK_REPLAY_DONE)) {
                /* During recovery, we don't want to send too many early
                 * replies, but on the other hand we want to make sure the
                 * client has enough time to resend if the rpc is lost. So
                 * during the recovery period send at least 4 early replies,
                 * spacing them every at_extra if we can. at_estimate should
                 * always equal this fixed value during recovery. */
                /* Don't account request processing time into AT history
                 * during recovery, it is not service time we need but
                 * includes also waiting time for recovering clients */
                newdl = cfs_time_current_sec() + min(at_extra,
                        req->rq_export->exp_obd->obd_recovery_timeout / 4);

...
}
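The proposal above could be sketched roughly as follows. This is an illustration only, not the patch that resolved the ticket: struct reply_ctx, min_int() and pick_reply_timeout() are hypothetical stand-ins for the fields that ptlrpc_at_set_reply() reads from the request, and the sketch ignores any time the request has already spent on the server.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the request state consulted when
 * choosing the timeout to put in a reply. */
struct reply_ctx {
	bool is_error;         /* rq_type == PTL_RPC_MSG_ERR */
	bool has_export;       /* rq_export != NULL */
	bool recovering;       /* exp_obd->obd_recovering */
	bool is_replay;        /* any MSG_*REPLAY* flag set on the request */
	int  at_estimate;      /* at_get(&svcpt->scp_at_estimate) */
	int  at_extra;         /* early reply spacing, in seconds */
	int  recovery_timeout; /* obd_recovery_timeout, in seconds */
};

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/* Choose the timeout to report in the reply. */
static int pick_reply_timeout(const struct reply_ctx *c)
{
	/* Unchanged case: error replies with no export or during recovery. */
	if (c->is_error && (!c->has_export || c->recovering))
		return 0;

	/* Assumed fix: replay requests get the same fixed extension that
	 * ptlrpc_at_send_early_reply() uses for the new deadline. */
	if (c->has_export && c->is_replay)
		return min_int(c->at_extra, c->recovery_timeout / 4);

	/* Normal case: the adaptive service time estimate. */
	return c->at_estimate;
}
```

With at_extra = 30 and obd_recovery_timeout = 300, a replay request would be told 30 seconds even if the service estimate is only 5 seconds, so the deadline computed by the client would not shrink.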

People

Assignee: Mikhail Pershin

Reporter: Vladimir Saveliev

Dates

Created: 24/May/16 12:36 PM

Updated: 19/May/17 12:00 PM

Resolved: 29/Jun/16 5:19 PM