[LU-8197] early reply causes replay request deadline decrease Created: 24/May/16  Updated: 19/May/17  Resolved: 29/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0, Lustre 2.10.0

Type: Bug Priority: Minor
Reporter: Vladimir Saveliev Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Duplicate
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When calculating the timeout to send in a reply, ptlrpc_at_set_reply() uses at_get(&svcpt->scp_at_estimate):

static void ptlrpc_at_set_reply(struct ptlrpc_request *req, int flags)
{
...
        if (req->rq_type == PTL_RPC_MSG_ERR &&
            (req->rq_export == NULL || req->rq_export->exp_obd->obd_recovering))
                lustre_msg_set_timeout(req->rq_repmsg, 0);
        else
                lustre_msg_set_timeout(req->rq_repmsg,
                                       at_get(&svcpt->scp_at_estimate));
...
}

Setting the timeout this way for replay requests decreases their deadlines, which leads to early timeouts and client disconnect/reconnect cycles, as in the example below:

00:0.0:1455205888.609614:0:4097:0:(client.c:2898:ptlrpc_replay_req()) @@@ REPLAY  req@ffff88001f993400 x1525892057668728/t17179869186(17179869186) o4->lustre-OST0000-osc-ffff880022d97800@192.168.2.100@tcp:6/4 lens 488/416 e 0 to 0 dl 1455205882 ref 1 fl New:R/4/0 rc 0/0
00000100:00001000:0.0:1455205889.610452:0:4097:0:(client.c:399:ptlrpc_at_recv_early_reply()) @@@ Early reply #1, new deadline in 1s (-5s)  req@ffff88001f993400 x1525892057668728/t17179869186(17179869186) o4->lustre-OST0000-osc-ffff880022d97800@192.168.2.100@tcp:6/4 lens 488/416 e 1 to 0 dl 1455205890 ref 2 fl Rpc:/4/ffffffff rc 0/-1
00000100:00001000:0.0:1455205900.616950:0:4097:0:(client.c:399:ptlrpc_at_recv_early_reply()) @@@ Early reply #2, new deadline in -4s (-5s)  req@ffff88001f993400 x1525892057668728/t17179869186(17179869186) o4->lustre-OST0000-osc-ffff880022d97800@192.168.2.100@tcp:6/4 lens 488/416 e 2 to 0 dl 1455205896 ref 2 fl Rpc:/6/ffffffff rc 0/-1
00000100:00000400:0.0:1455205894.611675:0:4097:0:(client.c:1979:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1455205888/real 1455205888]  req@ffff88001f993400 x1525892057668728/t17179869186(17179869186) o4->lustre-OST0000-osc-ffff880022d97800@192.168.2.100@tcp:6/4 lens 488/416 e 1 to 1 dl 1455205890 ref 2 fl Rpc:X/4/ffffffff rc 0/-1
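
The figures in the two early-reply lines above can be checked by hand. The sketch below assumes that the first number in "new deadline in 1s (-5s)" is the new deadline minus the current time, and that the number in parentheses is the new deadline minus the previous one. That reading is an assumption and is not stated in this ticket. The helper names are hypothetical, and the timestamps are the log values truncated to whole seconds.

```c
/* Hypothetical helpers for reading the early-reply log lines.
 * Assumption: "(-5s)" is new_deadline - previous_deadline. */

/* Seconds left until the deadline; negative means it has already passed. */
static long seconds_until_deadline(long now, long deadline)
{
        return deadline - now;
}

/* Recover the previous deadline from the new one and the logged delta. */
static long previous_deadline(long new_deadline, long delta)
{
        return new_deadline - delta;
}

/* Early reply #1: logged at 1455205889, dl 1455205890, delta -5.
 *   seconds_until_deadline(1455205889L, 1455205890L) gives 1
 *   previous_deadline(1455205890L, -5) gives 1455205895
 * Early reply #2: logged at 1455205900, dl 1455205896, delta -5.
 *   seconds_until_deadline(1455205900L, 1455205896L) gives -4
 */
```

Under that reading, each early reply moved the deadline 5 seconds earlier, and the second one set a deadline that had already expired.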

During recovery, the reply timeout should instead be calculated the same way ptlrpc_at_send_early_reply() calculates the new deadline for replay requests:

static int ptlrpc_at_send_early_reply(struct ptlrpc_request *req)
{
...
        if (req->rq_export &&
            lustre_msg_get_flags(req->rq_reqmsg) &
            (MSG_REPLAY | MSG_REQ_REPLAY_DONE | MSG_LOCK_REPLAY_DONE)) {
                /* During recovery, we don't want to send too many early
                 * replies, but on the other hand we want to make sure the
                 * client has enough time to resend if the rpc is lost. So
                 * during the recovery period send at least 4 early replies,
                 * spacing them every at_extra if we can. at_estimate should
                 * always equal this fixed value during recovery. */
                /* Don't account request processing time into AT history
                 * during recovery, it is not service time we need but
                 * includes also waiting time for recovering clients */
                newdl = cfs_time_current_sec() + min(at_extra,
                        req->rq_export->exp_obd->obd_recovery_timeout / 4);

...
}
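
A minimal sketch of the selection proposed above, assuming the reply timeout for replay requests is taken from the same min(at_extra, obd_recovery_timeout / 4) expression and all other requests keep the service estimate. Everything prefixed with sk_ or SK_ is hypothetical: the struct, the helper and the flag values are placeholders for illustration, not Lustre definitions, and this is not the patch that landed.

```c
/* Placeholder flag bits; NOT the real values of MSG_REPLAY and friends. */
#define SK_MSG_REPLAY           0x1u
#define SK_MSG_REQ_REPLAY_DONE  0x2u
#define SK_MSG_LOCK_REPLAY_DONE 0x4u

/* Hypothetical stand-in for the few request fields the decision needs. */
struct sk_request {
        unsigned int flags;            /* request message flags */
        int          has_export;       /* non-zero if the request has an export */
        long         recovery_timeout; /* stands in for obd_recovery_timeout, seconds */
};

static long sk_min(long a, long b)
{
        return a < b ? a : b;
}

/* Pick the timeout to put in the reply.
 * Replay requests get the fixed recovery spacing; everything else
 * keeps the adaptive service estimate (at_estimate). */
static long sk_reply_timeout(const struct sk_request *req,
                             long at_extra, long at_estimate)
{
        unsigned int replay_mask = SK_MSG_REPLAY | SK_MSG_REQ_REPLAY_DONE |
                                   SK_MSG_LOCK_REPLAY_DONE;

        if (req->has_export && (req->flags & replay_mask))
                return sk_min(at_extra, req->recovery_timeout / 4);

        return at_estimate;
}
```

With at_extra = 30 and a recovery timeout of 300 seconds, a replay request would get 30 seconds regardless of how small the service estimate is. With a recovery timeout of 60 seconds it would get 15.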


 Comments   
Comment by Gerrit Updater [ 24/May/16 ]

Vladimir Saveliev (vladimir_saveliev@xyratex.com) uploaded a new patch: http://review.whamcloud.com/20399
Subject: LU-8197 ptlrpc: do not reduce replay request deadline
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 278ec48c5a74da279b168d9436c40af900005eb8

Comment by Peter Jones [ 24/May/16 ]

Mike

Could you please review this patch?

Thanks

Peter

Comment by Gerrit Updater [ 27/Jun/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20399/
Subject: LU-8197 ptlrpc: do not reduce replay request deadline
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f3ce48ce543791d60326777d95ab2c9c03826965

Comment by Joseph Gmitter (Inactive) [ 29/Jun/16 ]

Landed to master for 2.9.0

Comment by Gerrit Updater [ 17/Nov/16 ]

Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/23833
Subject: LU-8197 ptlrpc: set correct rq_timeout for normal replay reply
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: af6a922f26fce7a6093c2074be3cf0a9f3c61940

Comment by Gerrit Updater [ 17/Dec/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23833/
Subject: LU-8197 ptlrpc: set correct rq_timeout for normal replay reply
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: db263714fb93fbed2822af775df5fcdb96eab9a2
