Lustre / LU-17446

BL AST should stop resending if lock is cancelled

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Severity: 3

    Description

      It was recently observed on at least two sites that the AST resend logic keeps resending a blocking AST until the timeout is hit, even when the cancel for that lock was received long ago. At that point there is no reason to insist on getting the AST reply.

       

      ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1704895372/real 1704895372]
          req@000000004fc3550d x1785656878161152/t0(0) o103->fs00-MDT0004-mdc-ff4c19380fe88000
          10.0.1.102@tcp:12/10 lens 224/224 e 0 to 1 dl 1704895483 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
      ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1704895399/real 1704895399]
          req@000000001ee9fdbf x1785656878165248/t0(0) o103->fs00-MDT0004-mdc-ff4c19380fe88000
          10.0.1.102@tcp:12/10 lens 224/224 e 0 to 1 dl 1704895510 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
      ptlrpc_expire_one_request()) Skipped 1 previous similar message
      lnet_handle_recovery_reply()) peer NI (10.0.1.101@tcp) recovery failed with -110
      lnet_handle_recovery_reply()) Skipped 1837 previous similar messages
      ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1704895426/real 1704895426]
          req@00000000674a17cf x1785656878167232/t0(0) o103->fs00-MDT0005-mdc-ff4c19380fe88000
          10.0.1.102@tcp:12/10 lens 224/224 e 0 to 1 dl 1704895537 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
      ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1704895426/real 1704895426]
          req@00000000eef404a4 x1785656878167168/t0(0) o103->fs00-MDT0004-mdc-ff4c19380fe88000
          10.0.1.102@tcp:12/10 lens 224/224 e 0 to 1 dl 1704895537 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
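
To see how long each of these requests waited, the "sent" timestamp can be subtracted from the "dl" (deadline) field of the same entry. The helper below is a minimal sketch of that arithmetic in plain C. The function name rpc_wait_seconds is hypothetical and is not part of Lustre; the field labels are taken from the log excerpt above.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a Lustre API): given one log entry as text,
 * return "dl" minus "sent" in seconds, or -1 if either field is missing.
 */
long long rpc_wait_seconds(const char *entry)
{
    const char *sent = strstr(entry, "[sent ");
    const char *dl = strstr(entry, " dl ");
    long long sent_ts, dl_ts;

    if (sent == NULL || dl == NULL)
        return -1;
    /* "[sent 1704895372/real ..." -> 1704895372 */
    if (sscanf(sent, "[sent %lld", &sent_ts) != 1)
        return -1;
    /* " dl 1704895483 ref ..." -> 1704895483 */
    if (sscanf(dl, " dl %lld", &dl_ts) != 1)
        return -1;
    return dl_ts - sent_ts;
}
```

For the first entry above this gives 1704895483 - 1704895372 = 111 seconds, and the other three entries show the same 111 second gap between send time and deadline.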
      

      Eventually, once the timeout is hit, the AST error message itself says as much ("cancel was received"), so the client is not even evicted as a result:

      (client.c:1273:ptlrpc_import_delay_req()) @@@ send limit expired  req@00000000063d38a1
         x1788029434094272/t0(0) lustre-MDT0001@1.1.1.1@tcp:15/16 lens 328/224 e 0 to 1
         dl 1705530737 ref 1 fl Rpc:XQU/2/ffffffff rc 0/-1 job:''
      
      
      (ldlm_lockd.c:739:ldlm_handle_ast_error()) ### blocking AST (req@00000000063d38a1 x1788029434094272)
         timeout from nid 1.1.1.1@tcp, but cancel was received (AST reply lost?)
         ns: lustre-MDT0001_UUID lock: 000000008783e209/0x4937295654f3a72a lrc: 1/0,0 mode: --/PR
         res: [0x24001144f:0x8c35:0x0].0x0 bits 0x13/0x0 rrc: 6 type: IBT gid 0 flags: 0x44a01400000020
         nid: 1.1.1.2@tcp remote: 0x6196afd4e206854a expref: 7 pid: 376559 timeout: 337171 lvb_type: 0
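
The behaviour requested in this ticket can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: it is not Lustre code and not the actual fix, and every name in it (ast_state, ast_next_action, simulate_resends) is hypothetical. It models a resend loop that checks whether the cancel has arrived before each resend, instead of waiting for the deadline.

```c
#include <stdbool.h>

/* Hypothetical model of one blocking AST; not a Lustre structure. */
enum ast_action { AST_RESEND, AST_STOP_CANCELLED, AST_STOP_TIMEOUT };

struct ast_state {
    bool cancel_received;   /* cancel for the lock has arrived */
    long long now;          /* current time, seconds */
    long long deadline;     /* time at which the AST gives up */
};

/* Decide what to do before the next resend attempt. */
enum ast_action ast_next_action(const struct ast_state *st)
{
    if (st->cancel_received)
        return AST_STOP_CANCELLED;  /* proposed: no reply needed any more */
    if (st->now >= st->deadline)
        return AST_STOP_TIMEOUT;
    return AST_RESEND;
}

/*
 * Count resends between 'sent' and 'deadline' when one resend happens
 * every 'interval' seconds. 'cancel_at' is the time the cancel arrives,
 * or a negative value if it never does.
 */
int simulate_resends(long long sent, long long deadline,
                     long long interval, long long cancel_at)
{
    struct ast_state st = { false, sent, deadline };
    int resends = 0;

    if (interval <= 0)
        return 0;   /* avoid a loop that never advances */

    for (;;) {
        st.cancel_received = cancel_at >= 0 && st.now >= cancel_at;
        if (ast_next_action(&st) != AST_RESEND)
            break;
        resends++;
        st.now += interval;
    }
    return resends;
}
```

In this toy model, simulate_resends(0, 111, 10, -1) performs 12 resends before the deadline, while simulate_resends(0, 111, 10, 25) stops after 3 because the cancel is seen at the next check. The 10 second interval and the cancel time are made-up inputs; only the 111 second window comes from the logs above.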
      

            People

              Assignee: Oleg Drokin (green)
              Reporter: Oleg Drokin (green)