[LU-15115] ptlrpc resend on EINPROGRESS timeouts can be not correct Created: 15/Oct/21 Updated: 04/Jan/23 Resolved: 27/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexander Zarochentsev | Assignee: | Alexander Zarochentsev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
It appears that a request resend after an EINPROGRESS reply may not happen until the request expires:

00000100:00100000:2.0:1630664693.569340:0:2236:0:(client.c:1461:after_reply()) @@@ Resending request on EINPROGRESS req@ffff9873c9fb6300 x1709875676995456/t0(0) o36->lustre-MDT0000-mdc-ffff9873b91a8800@0@lo:12/10 lens 496/456 e 0 to 0 dl 1630664736 ref 2 fl Rpc:R/0/0 rc 0/-115 job:'unlinkmany.0'
...
00000100:00100000:2.0:1630664735.433532:0:2236:0:(niobuf.c:757:ptl_send_rpc()) @@@ Allocating new xid for resend on EINPROGRESS req@ffff9873c9fb6300 x1709875677015680/t0(0) o36->lustre-MDT0000-mdc-ffff9873b91a8800@0@lo:12/10 lens 496/456 e 0 to 0 dl 1630664736 ref 2 fl Rpc:RS/0/0 rc 0/-115 job:'unlinkmany.0'

As the timestamps above show, the resend happens roughly 42 seconds after the -EINPROGRESS reply, essentially at the request deadline (dl 1630664736). At the same time, after_reply() contains a code snippet with different logic:
/* retry indefinitely on EINPROGRESS */
if (lustre_msg_get_status(req->rq_repmsg) == -EINPROGRESS &&
    ptlrpc_no_resend(req) == 0 && !req->rq_no_retry_einprogress) {
        time64_t now = ktime_get_real_seconds();

        DEBUG_REQ(req->rq_nr_resend > 0 ? D_ERROR : D_RPCTRACE, req,
                  "Resending request on EINPROGRESS");
        spin_lock(&req->rq_lock);
        req->rq_resend = 1;
        req->rq_deadline = now;
        spin_unlock(&req->rq_lock);
        req->rq_nr_resend++;

        /* Readjust the timeout for current conditions */
        ptlrpc_at_set_req_timeout(req);

        /* delay resend to give a chance to the server to get ready.
         * The delay is increased by 1s on every resend and is capped to
         * the current request timeout (i.e. obd_timeout if AT is off,
         * or AT service time x 125% + 5s, see at_est2timeout) */
        if (req->rq_nr_resend > req->rq_timeout)
                req->rq_sent = now + req->rq_timeout;
        else
                req->rq_sent = now + req->rq_nr_resend;
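
For reference, below is a minimal standalone sketch of the back-off described in the comment above (the delay grows by 1s on every resend and is capped at rq_timeout). The function name and the 43s timeout value are illustrative assumptions, not actual Lustre code:

/*
 * Illustrative sketch only: computes the intended EINPROGRESS resend time
 * as min(rq_nr_resend, rq_timeout) seconds from now.
 */
#include <stdio.h>
#include <time.h>

static time_t einprogress_resend_time(time_t now, unsigned int nr_resend,
                                      unsigned int rq_timeout)
{
        /* delay = min(nr_resend, rq_timeout) seconds */
        unsigned int delay = nr_resend > rq_timeout ? rq_timeout : nr_resend;

        return now + delay;
}

int main(void)
{
        time_t now = time(NULL);

        /* The 43s timeout is an assumed value; with it, the first resend is
         * expected ~1s after the reply, not the ~42s seen in the log above. */
        printf("resend #1 delay: %lds\n",
               (long)(einprogress_resend_time(now, 1, 43) - now));
        printf("resend #50 delay: %lds\n",
               (long)(einprogress_resend_time(now, 50, 43) - now));
        return 0;
}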
|
| Comments |
| Comment by Alexander Zarochentsev [ 15/Oct/21 ] |
|
The issue was discovered while running sanity test 831:

== sanity test 831: throttling unlink/setattr queuing on OSP ========================================================== 20:47:10 (1634320030)
total: 1000 open/close in 3.51 seconds: 285.16 ops/second
 - unlinked 0 (time 1634320036 ; total 0 ; last 0)
total: 1000 unlinks in 390 seconds: 2.564103 unlinks/second
PASS 831 (398s)
== sanity test complete, duration 490 sec ================ 20:53:48 (1634320428)
Stopping clients: devvm1 /mnt/lustre (opts:-f)
Stopping client devvm1 /mnt/lustre opts:-f
Stopping clients: devvm1 /mnt/lustre2 (opts:-f)
Stopping /mnt/lustre-mds1 (opts:-f) on devvm1
Stopping /mnt/lustre-mds2 (opts:-f) on devvm1
Stopping /mnt/lustre-ost1 (opts:-f) on devvm1
Stopping /mnt/lustre-ost2 (opts:-f) on devvm1
LNET ready to unload
modules unloaded.
[root@devvm1 tests]# |
| Comment by Gerrit Updater [ 15/Oct/21 ] |
|
"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45266 |
| Comment by Gerrit Updater [ 27/Oct/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45266/ |
| Comment by Peter Jones [ 27/Oct/21 ] |
|
Landed for 2.15 |
| Comment by Gerrit Updater [ 23/Dec/21 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45926 |