Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-15115

ptlrpc resend on EINPROGRESS timeouts can be not correct

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.15.0
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      It appears request resend on EINPROGRESS reply may happen not early than the request expires:

      00000100:00100000:2.0:1630664693.569340:0:2236:0:(client.c:1461:after_reply()) @@@ Resending request on EINPROGRESS  req@ffff9873c9fb6300 x1709875676995456/t0(0) o36->lustre-MDT0000-mdc-ffff9873b91a8800@0@lo:12/10 lens 496/456 e 0 to 0 dl 1630664736 ref 2 fl Rpc:R/0/0 rc 0/-115 job:'unlinkmany.0'
      
      ...
      
      00000100:00100000:2.0:1630664735.433532:0:2236:0:(niobuf.c:757:ptl_send_rpc()) @@@ Allocating new xid for resend on EINPROGRESS  req@ffff9873c9fb6300 x1709875677015680/t0(0) o36->lustre-MDT0000-mdc-ffff9873b91a8800@0@lo:12/10 lens 496/456 e 0 to 0 dl 1630664736 ref 2 fl Rpc:RS/0/0 rc 0/-115 job:'unlinkmany.0'
      

      as you can see above, the resend is in 40 sec after getting -EINPROGRESS reply.

      at the same time, after_reply() has a code snippet with another logic:

              /* retry indefinitely on EINPROGRESS */
              if (lustre_msg_get_status(req->rq_repmsg) == -EINPROGRESS &&
                  ptlrpc_no_resend(req) == 0 && !req->rq_no_retry_einprogress) {
                      time64_t now = ktime_get_real_seconds();
      
                      DEBUG_REQ(req->rq_nr_resend > 0 ? D_ERROR : D_RPCTRACE, req,
                                "Resending request on EINPROGRESS");
                      spin_lock(&req->rq_lock);
                      req->rq_resend = 1;
                      req->rq_deadline = now;
                      spin_unlock(&req->rq_lock);
                      req->rq_nr_resend++;
      
                      /* Readjust the timeout for current conditions */
                      ptlrpc_at_set_req_timeout(req);
                      /* delay resend to give a chance to the server to get ready.
                       * The delay is increased by 1s on every resend and is capped to
                       * the current request timeout (i.e. obd_timeout if AT is off,
                       * or AT service time x 125% + 5s, see at_est2timeout) */
                      if (req->rq_nr_resend > req->rq_timeout)
                              req->rq_sent = now + req->rq_timeout;
                      else
                              req->rq_sent = now + req->rq_nr_resend;
      

      Attachments

        Issue Links

          Activity

            [LU-15115] ptlrpc resend on EINPROGRESS timeouts can be not correct

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45926
            Subject: LU-15115 ptlrpc: recalc timer on EINPROGRESS reply
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 53719e696f91f925bd119a2f097f02937c132b6d

            gerrit Gerrit Updater added a comment - "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45926 Subject: LU-15115 ptlrpc: recalc timer on EINPROGRESS reply Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: 53719e696f91f925bd119a2f097f02937c132b6d
            pjones Peter Jones added a comment -

            Landed for 2.15

            pjones Peter Jones added a comment - Landed for 2.15

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45266/
            Subject: LU-15115 ptlrpc: recalc timer on EINPROGRESS reply
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 9a5bace55a5ddb8a928af2de1b199e968f3fbecd

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45266/ Subject: LU-15115 ptlrpc: recalc timer on EINPROGRESS reply Project: fs/lustre-release Branch: master Current Patch Set: Commit: 9a5bace55a5ddb8a928af2de1b199e968f3fbecd

            "Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45266
            Subject: LU-15115 ptlrpc: recalc timer on EINPROGRESS reply
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 737f6d2e7a32d95669c6039241db2873313fc387

            gerrit Gerrit Updater added a comment - "Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45266 Subject: LU-15115 ptlrpc: recalc timer on EINPROGRESS reply Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 737f6d2e7a32d95669c6039241db2873313fc387

            the issue was discovered while running LU-15114 test, sometimes the unlinks take abnormally long time:

            == sanity test 831: throttling unlink/setattr queuing on OSP ========================================================== 20:47:10 (1634320030)
            total: 1000 open/close in 3.51 seconds: 285.16 ops/second
             - unlinked 0 (time 1634320036 ; total 0 ; last 0)
            total: 1000 unlinks in 390 seconds: 2.564103 unlinks/second
            PASS 831 (398s)
            == sanity test complete, duration 490 sec ================ 20:53:48 (1634320428)
            Stopping clients: devvm1 /mnt/lustre (opts:-f)
            Stopping client devvm1 /mnt/lustre opts:-f
            Stopping clients: devvm1 /mnt/lustre2 (opts:-f)
            Stopping /mnt/lustre-mds1 (opts:-f) on devvm1
            Stopping /mnt/lustre-mds2 (opts:-f) on devvm1
            Stopping /mnt/lustre-ost1 (opts:-f) on devvm1
            Stopping /mnt/lustre-ost2 (opts:-f) on devvm1
            LNET ready to unload
            modules unloaded.
            [root@devvm1 tests]# 
            
            
            zam Alexander Zarochentsev added a comment - the issue was discovered while running LU-15114 test, sometimes the unlinks take abnormally long time: == sanity test 831: throttling unlink/setattr queuing on OSP ========================================================== 20:47:10 (1634320030) total: 1000 open/close in 3.51 seconds: 285.16 ops/second - unlinked 0 (time 1634320036 ; total 0 ; last 0) total: 1000 unlinks in 390 seconds: 2.564103 unlinks/second PASS 831 (398s) == sanity test complete, duration 490 sec ================ 20:53:48 (1634320428) Stopping clients: devvm1 /mnt/lustre (opts:-f) Stopping client devvm1 /mnt/lustre opts:-f Stopping clients: devvm1 /mnt/lustre2 (opts:-f) Stopping /mnt/lustre-mds1 (opts:-f) on devvm1 Stopping /mnt/lustre-mds2 (opts:-f) on devvm1 Stopping /mnt/lustre-ost1 (opts:-f) on devvm1 Stopping /mnt/lustre-ost2 (opts:-f) on devvm1 LNET ready to unload modules unloaded. [root@devvm1 tests]#

            People

              zam Alexander Zarochentsev
              zam Alexander Zarochentsev
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: