  1. Lustre
  2. LU-18229

BLAST and CANCELLING lock still can be batched with others in one cancel RPC

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.17.0
    • None
    • None
    • 3

    Description

      The MDS runs out of memory (OOM). At that time it has 36M locks granted, SLV == 1, and it is receiving a very large number of cancel RPCs with 1 lock handle in each.

      One possible cause of locks being sent one by one is:

      LU-16285 ldlm: send the cancel RPC asap
      

      Even after fixing that with:

      LU-16285 ldlm: BL_AST lock cancel still can be batched
      

      it is still possible for CANCELLING locks, i.e. those which have been taken by another thread for cancelling but for which an RPC has not been formed or sent yet. In that case a separate cancel RPC (with just 1 lock handle) is sent.

      How can there be so many BLASTs for locks which are already in the process of being cancelled? The client activity is still not clear in full, but theoretically, in a low-memory condition the server may start massively reclaiming locks, i.e. sending out many BLAST RPCs. Combined with a small SLV, this may result in a cancel RPC with 1K locks being prepared (which may take some time due to data flush) while 1K BLAST RPCs arrive for the same set of locks, which results in 1K separate cancel RPCs.

      Let's try to get this well optimised so that even a CANCELLING lock can still be batched.
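
      To make the scenario above concrete, here is a toy Python model of it. It is not Lustre code: `ToyClient`, `simulate` and the `batch_cancelling` switch are hypothetical names invented for illustration, and the way a one-handle cancel removes the lock from the batch being prepared is an assumption of the model. It only counts how many cancel RPCs go out when blocking ASTs arrive for locks that are already CANCELLING.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class State(Enum):
    GRANTED = auto()
    CANCELLING = auto()  # taken by a canceller thread, cancel RPC not sent yet


@dataclass
class ToyClient:
    # Hypothetical policy switch, not a real Lustre tunable.
    batch_cancelling: bool
    state: dict = field(default_factory=dict)      # handle -> State
    pending: list = field(default_factory=list)    # handles collected for one RPC
    sent_rpcs: list = field(default_factory=list)  # each entry: handles in one RPC

    def grant(self, handle):
        self.state[handle] = State.GRANTED

    def start_batch(self, handles):
        """A canceller thread takes the locks; the RPC is not sent yet."""
        for h in handles:
            self.state[h] = State.CANCELLING
            self.pending.append(h)

    def on_blocking_ast(self, handle):
        """Server asks for the lock back while it may already be CANCELLING."""
        if self.state.get(handle) is not State.CANCELLING:
            return  # other states are outside this toy model
        if self.batch_cancelling:
            return  # leave the handle in the batch being prepared
        # Assumed model of the reported behaviour: a separate cancel RPC
        # carrying just this one handle.
        self.pending.remove(handle)
        del self.state[handle]
        self.sent_rpcs.append([handle])

    def flush_batch(self):
        """The prepared batch finally goes out as one cancel RPC."""
        if self.pending:
            self.sent_rpcs.append(list(self.pending))
            for h in self.pending:
                del self.state[h]
            self.pending.clear()


def simulate(num_locks, batch_cancelling):
    """Return the number of handles carried by each cancel RPC sent."""
    client = ToyClient(batch_cancelling=batch_cancelling)
    handles = list(range(num_locks))
    for h in handles:
        client.grant(h)
    client.start_batch(handles)    # batch is being prepared (e.g. data flush)
    for h in handles:
        client.on_blocking_ast(h)  # BL ASTs arrive for the same locks
    client.flush_batch()
    return [len(rpc) for rpc in client.sent_rpcs]
```

      In this model, `simulate(1000, batch_cancelling=False)` yields 1000 RPCs with one handle each, while `simulate(1000, batch_cancelling=True)` yields a single RPC carrying all 1000 handles, which is the difference this ticket aims for.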

            People

              vitaly_fertman Vitaly Fertman