[LU-8197] early reply causes replay request deadline decrease Created: 24/May/16 Updated: 19/May/17 Resolved: 29/Jun/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.9.0, Lustre 2.10.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Vladimir Saveliev | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
When calculating the timeout to be sent in a reply, ptlrpc_at_set_reply() uses at_get(&svcpt->scp_at_estimate):

static void ptlrpc_at_set_reply(struct ptlrpc_request *req, int flags)
{
	...
	if (req->rq_type == PTL_RPC_MSG_ERR &&
	    (req->rq_export == NULL ||
	     req->rq_export->exp_obd->obd_recovering))
		lustre_msg_set_timeout(req->rq_repmsg, 0);
	else
		lustre_msg_set_timeout(req->rq_repmsg,
				       at_get(&svcpt->scp_at_estimate));
	...
}

Setting the timeout that way for replay requests causes the replay request deadline to decrease, leading to an early timeout and client disconnect/reconnect, as in the example shown below:

00:0.0:1455205888.609614:0:4097:0:(client.c:2898:ptlrpc_replay_req()) @@@ REPLAY req@ffff88001f993400 x1525892057668728/t17179869186(17179869186) o4->lustre-OST0000-osc-ffff880022d97800@192.168.2.100@tcp:6/4 lens 488/416 e 0 to 0 dl 1455205882 ref 1 fl New:R/4/0 rc 0/0

00000100:00001000:0.0:1455205889.610452:0:4097:0:(client.c:399:ptlrpc_at_recv_early_reply()) @@@ Early reply #1, new deadline in 1s (-5s) req@ffff88001f993400 x1525892057668728/t17179869186(17179869186) o4->lustre-OST0000-osc-ffff880022d97800@192.168.2.100@tcp:6/4 lens 488/416 e 1 to 0 dl 1455205890 ref 2 fl Rpc:/4/ffffffff rc 0/-1

00000100:00001000:0.0:1455205900.616950:0:4097:0:(client.c:399:ptlrpc_at_recv_early_reply()) @@@ Early reply #2, new deadline in -4s (-5s) req@ffff88001f993400 x1525892057668728/t17179869186(17179869186) o4->lustre-OST0000-osc-ffff880022d97800@192.168.2.100@tcp:6/4 lens 488/416 e 2 to 0 dl 1455205896 ref 2 fl Rpc:/6/ffffffff rc 0/-1

00000100:00000400:0.0:1455205894.611675:0:4097:0:(client.c:1979:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1455205888/real 1455205888] req@ffff88001f993400 x1525892057668728/t17179869186(17179869186) o4->lustre-OST0000-osc-ffff880022d97800@192.168.2.100@tcp:6/4 lens 488/416 e 1 to 1 dl 1455205890 ref 2 fl Rpc:X/4/ffffffff rc 0/-1

Instead, in the recovery case, the timeout should be calculated the same way the new deadline is calculated for replay requests in ptlrpc_at_send_early_reply():

static int ptlrpc_at_send_early_reply(struct ptlrpc_request *req)
{
	...
	if (req->rq_export &&
	    lustre_msg_get_flags(req->rq_reqmsg) &
	    (MSG_REPLAY | MSG_REQ_REPLAY_DONE | MSG_LOCK_REPLAY_DONE)) {
		/* During recovery, we don't want to send too many early
		 * replies, but on the other hand we want to make sure the
		 * client has enough time to resend if the rpc is lost. So
		 * during the recovery period send at least 4 early replies,
		 * spacing them every at_extra if we can. at_estimate should
		 * always equal this fixed value during recovery. */
		/* Don't account request processing time into AT history
		 * during recovery, it is not service time we need but
		 * includes also waiting time for recovering clients */
		newdl = cfs_time_current_sec() +
			min(at_extra,
			    req->rq_export->exp_obd->obd_recovery_timeout / 4);
	...
}
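For illustration, here is a minimal sketch of the direction suggested above: ptlrpc_at_set_reply() special-cases replay RPCs during recovery so the advertised timeout mirrors the newdl calculation in ptlrpc_at_send_early_reply(). This is a sketch assembled from the two excerpts above, not the actual patch that landed via http://review.whamcloud.com/20399; only identifiers already shown in those excerpts are used, and the rest of the function body is elided the same way.

static void ptlrpc_at_set_reply(struct ptlrpc_request *req, int flags)
{
	...
	if (req->rq_type == PTL_RPC_MSG_ERR &&
	    (req->rq_export == NULL ||
	     req->rq_export->exp_obd->obd_recovering)) {
		lustre_msg_set_timeout(req->rq_repmsg, 0);
	} else if (req->rq_export &&
		   lustre_msg_get_flags(req->rq_reqmsg) &
		   (MSG_REPLAY | MSG_REQ_REPLAY_DONE |
		    MSG_LOCK_REPLAY_DONE)) {
		/* Replay RPC during recovery: advertise the same window
		 * used for newdl in ptlrpc_at_send_early_reply(), so the
		 * deadline derived by the client from this reply cannot
		 * shrink below what the recovery spacing allows. */
		lustre_msg_set_timeout(req->rq_repmsg,
				       min(at_extra,
					   req->rq_export->exp_obd->
						obd_recovery_timeout / 4));
	} else {
		/* Normal case: use the adaptive timeout estimate,
		 * as before. */
		lustre_msg_set_timeout(req->rq_repmsg,
				       at_get(&svcpt->scp_at_estimate));
	}
	...
}

The key point is that during recovery the reply carries min(at_extra, obd_recovery_timeout / 4) rather than at_estimate, matching the spacing of early replies, so repeated early replies keep the client's replay deadline stable instead of shrinking it.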
| Comments |
| Comment by Gerrit Updater [ 24/May/16 ] |
|
Vladimir Saveliev (vladimir_saveliev@xyratex.com) uploaded a new patch: http://review.whamcloud.com/20399 |
| Comment by Peter Jones [ 24/May/16 ] |
|
Mike, could you please review this patch? Thanks, Peter |
| Comment by Gerrit Updater [ 27/Jun/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20399/ |
| Comment by Joseph Gmitter (Inactive) [ 29/Jun/16 ] |
|
Landed to master for 2.9.0 |
| Comment by Gerrit Updater [ 17/Nov/16 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/23833 |
| Comment by Gerrit Updater [ 17/Dec/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23833/ |