[LU-17446] BL AST should stop resending if lock is cancelled

Details

    • Bug
    • Resolution: Unresolved
    • Major

    Description

      It was recently observed on at least two sites that the AST resend logic keeps resending ASTs until the timeout is hit, even when the cancel for that lock was received long ago, so there is no point in insisting on getting the AST reply.

       

      ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1704895372/real 1704895372]
          req@000000004fc3550d x1785656878161152/t0(0) o103->fs00-MDT0004-mdc-ff4c19380fe88000
          10.0.1.102@tcp:12/10 lens 224/224 e 0 to 1 dl 1704895483 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
      ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1704895399/real 1704895399]
          req@000000001ee9fdbf x1785656878165248/t0(0) o103->fs00-MDT0004-mdc-ff4c19380fe88000
          10.0.1.102@tcp:12/10 lens 224/224 e 0 to 1 dl 1704895510 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
      ptlrpc_expire_one_request()) Skipped 1 previous similar message
      lnet_handle_recovery_reply()) peer NI (10.0.1.101@tcp) recovery failed with -110
      lnet_handle_recovery_reply()) Skipped 1837 previous similar messages
      ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1704895426/real 1704895426]
          req@00000000674a17cf x1785656878167232/t0(0) o103->fs00-MDT0005-mdc-ff4c19380fe88000
          10.0.1.102@tcp:12/10 lens 224/224 e 0 to 1 dl 1704895537 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
      ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1704895426/real 1704895426]
          req@00000000eef404a4 x1785656878167168/t0(0) o103->fs00-MDT0004-mdc-ff4c19380fe88000
          10.0.1.102@tcp:12/10 lens 224/224 e 0 to 1 dl 1704895537 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
      

      Once the timeout is eventually hit, the AST error message itself reports that the cancel for the lock was already received, so the client is not even evicted as a result:

      (client.c:1273:ptlrpc_import_delay_req()) @@@ send limit expired  req@00000000063d38a1
         x1788029434094272/t0(0) lustre-MDT0001@1.1.1.1@tcp:15/16 lens 328/224 e 0 to 1
         dl 1705530737 ref 1 fl Rpc:XQU/2/ffffffff rc 0/-1 job:''
      
      
      (ldlm_lockd.c:739:ldlm_handle_ast_error()) ### blocking AST (req@00000000063d38a1 x1788029434094272)
         timeout from nid 1.1.1.1@tcp, but cancel was received (AST reply lost?)
         ns: lustre-MDT0001_UUID lock: 000000008783e209/0x4937295654f3a72a lrc: 1/0,0 mode: --/PR
         res: [0x24001144f:0x8c35:0x0].0x0 bits 0x13/0x0 rrc: 6 type: IBT gid 0 flags: 0x44a01400000020
         nid: 1.1.1.2@tcp remote: 0x6196afd4e206854a expref: 7 pid: 376559 timeout: 337171 lvb_type: 0
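
The behaviour requested in this ticket can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: every name below (toy_lock, toy_ast, toy_send_bl_ast and the result codes) is hypothetical and none of it is the real ldlm or ptlrpc code. It shows a resend loop that checks whether the lock has already been cancelled before each resend, instead of running until the resend budget (standing in for the timeout) is used up.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins; not the real Lustre structures. */
struct toy_lock {
	bool cancelled;		/* set once the client's cancel has arrived */
};

struct toy_ast {
	struct toy_lock *lock;	/* lock this blocking AST was sent for */
	int sends;		/* how many times the AST was (re)sent */
	int max_sends;		/* resend budget, standing in for the timeout */
};

enum toy_ast_result {
	TOY_AST_REPLIED,	/* the AST reply arrived */
	TOY_AST_LOCK_CANCELLED,	/* cancel seen, so resending stopped */
	TOY_AST_TIMED_OUT,	/* budget exhausted with no reply and no cancel */
};

/*
 * reply_after:  the reply arrives once this many sends were made
 *               (negative means the reply is lost for good).
 * cancel_after: the cancel arrives once this many sends were made
 *               (negative means the lock is never cancelled).
 */
static enum toy_ast_result toy_send_bl_ast(struct toy_ast *ast,
					   int reply_after, int cancel_after)
{
	while (ast->sends < ast->max_sends) {
		/* The check asked for here: do not resend for a cancelled lock. */
		if (ast->lock->cancelled)
			return TOY_AST_LOCK_CANCELLED;

		ast->sends++;
		if (reply_after >= 0 && ast->sends >= reply_after)
			return TOY_AST_REPLIED;
		if (cancel_after >= 0 && ast->sends >= cancel_after)
			ast->lock->cancelled = true;
	}
	/* Mirrors the "timeout ... but cancel was received" case in the log. */
	return ast->lock->cancelled ? TOY_AST_LOCK_CANCELLED : TOY_AST_TIMED_OUT;
}
```

With a lost reply and a cancel that arrives after the first send, the model stops after one send rather than using up the whole budget, which is the outcome the ticket title asks for.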
      

          Activity

            adilger Andreas Dilger added a comment:

            vitaly_fertman suggested a better approach on patch 53739 to resolve this issue:

            Once the BL AST RPCs are sent out, we wait for the RPC set to complete in full, whereas the key part for us is to get the new lock granted. I think the right way to go here would be to delegate the RPC set to a separate thread (probably just let ptlrpcd handle it asynchronously) and wait for the wanted lock itself. That would unblock the MDT thread even if some blocking ASTs are still in progress.

            I am not sure we want to introduce new complexity by linking BL ASTs to locks. We may let BL ASTs hang around until the next resend, as they would no longer block new enqueues (IIUC, only a limited number of BL ASTs have this problem, so memory consumption is small and it does not seem to be a big problem), so only ldlm_update_resend() of the current patch would be needed.
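
The difference between the two waiting strategies in the comment above can be sketched with a tiny timing model. This is an illustration only: toy_blocker and both helper functions are hypothetical names, the tick values are made up, and nothing here reflects how ptlrpcd or the MDT service threads are actually implemented. Each conflicting lock has a tick at which its cancel arrives and a tick at which its BL AST RPC finishes (by reply or by timeout).

```c
#include <stddef.h>

/* Hypothetical model of one conflicting lock and its blocking AST. */
struct toy_blocker {
	int cancel_tick;	/* when the cancel for the conflicting lock arrives */
	int ast_done_tick;	/* when its BL AST RPC completes (reply or timeout) */
};

static int toy_max(int a, int b)
{
	return a > b ? a : b;
}

/* Waiting for the whole RPC set: every AST must finish first. */
static int toy_unblock_wait_for_set(const struct toy_blocker *b, size_t n)
{
	int tick = 0;

	for (size_t i = 0; i < n; i++)
		tick = toy_max(tick,
			       toy_max(b[i].cancel_tick, b[i].ast_done_tick));
	return tick;
}

/* Waiting for the wanted lock itself: only the cancels matter. */
static int toy_unblock_wait_for_lock(const struct toy_blocker *b, size_t n)
{
	int tick = 0;

	for (size_t i = 0; i < n; i++)
		tick = toy_max(tick, b[i].cancel_tick);
	return tick;
}
```

For example, with one blocker whose cancel arrives at tick 2 but whose AST reply is lost until tick 111, and a second blocker that cancels and replies at tick 3, the first helper returns 111 and the second returns 3. In this model the enqueueing thread is released as soon as the last cancel is in, while the slow AST is left to finish on its own.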

            gerrit Gerrit Updater added a comment:

            "Oleg Drokin <green@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53739
            Subject: LU-17446 ldlm: Do not wait for AST RPC completion on lock cancel
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 578d772b3481fab2fb5307cb34be8ffffb9f292e

            People

              green Oleg Drokin
              green Oleg Drokin
