
LU-21: Allow parallel modifying metadata operations on MDS

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.1.0
    • Affects Version/s: Lustre 1.8.6
    • Labels: None

    Description

      Description of the problem:
      In the 1.8 codebase, all parallel modifying operations on a single directory are bottlenecked on a single LDLM lock. To make matters worse, the DLM lock is not only held while the operation is performed, but also until a confirmation is received from the client, for each operation one by one (a process known as rep-ack).
      Another downside is that MDT threads are actually blocked waiting for the lock, so if sufficiently many clients try to modify a directory's content in parallel, they occupy all MDT threads, and even unrelated processes cannot perform any metadata operations until the backlog of requests is cleared. This often stalls much of the filesystem activity for many seconds to several minutes when certain jobs are run (observed at ORNL during large-scale file-creation jobs).

      Proposed solution:

      Introduce an extra 'hash' property for inodebits locks, which users pass along with the bits they want to lock. Locks with intersecting bits and incompatible lock modes would only be declared incompatible, and block each other, if the hash values match or if either lock has the special hash value of zero.
      The MDT threads for modifying operations would initialize the value to the hash of the name of the file the operation modifies.
      Clients, and the MDS for other operations, would keep the hash value at zero in requested locks.
      Since the parent lock obtained during modifying metadata operations on the MDS is not passed to the client, there is no protocol change.
      The LDLM policy descriptor in the wire structure also has enough space to accommodate a 64-bit hash value.
      As a result, MDS threads would stop blocking on shared-directory operations, with the exception of the case where multiple threads attempt to create the same file. That use case will need to be addressed separately.
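
      As a rough, minimal sketch (not the actual Lustre LDLM code: the struct layout, enum values and helper names below are illustrative assumptions), the conflict check described above could look like this:

      /* Hash-extended inodebits conflict check: two locks conflict only if
       * their bits intersect, their modes are incompatible, and their name
       * hashes match or either hash is 0 ("the whole directory"). */
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      enum lock_mode { LCK_EX, LCK_PW, LCK_PR, LCK_CW, LCK_CR, LCK_NL, LCK_NR };

      /* Classic DLM mode compatibility matrix (1 = compatible). */
      static const bool mode_compat[LCK_NR][LCK_NR] = {
              /*          EX  PW  PR  CW  CR  NL */
              /* EX */  {  0,  0,  0,  0,  0,  1 },
              /* PW */  {  0,  0,  0,  0,  1,  1 },
              /* PR */  {  0,  0,  1,  0,  1,  1 },
              /* CW */  {  0,  0,  0,  1,  1,  1 },
              /* CR */  {  0,  1,  1,  1,  1,  1 },
              /* NL */  {  1,  1,  1,  1,  1,  1 },
      };

      struct lock_desc {
              enum lock_mode  mode;
              uint64_t        bits;   /* inodebit mask (lookup, update, ...) */
              uint64_t        hash;   /* hash of the name being modified, or 0 */
      };

      static bool ibits_conflict(const struct lock_desc *a,
                                 const struct lock_desc *b)
      {
              if (!(a->bits & b->bits))
                      return false;           /* disjoint bits never conflict */
              if (mode_compat[a->mode][b->mode])
                      return false;           /* compatible modes never conflict */
              if (a->hash == 0 || b->hash == 0)
                      return true;            /* hash 0 blocks every name */
              return a->hash == b->hash;      /* only the same name conflicts */
      }

      int main(void)
      {
              struct lock_desc create_a  = { LCK_EX, 0x2, 0x1111 };
              struct lock_desc create_b  = { LCK_EX, 0x2, 0x2222 };
              struct lock_desc whole_dir = { LCK_PR, 0x2, 0 };

              /* Creates of two different names no longer block each other... */
              printf("create vs create: %d\n", ibits_conflict(&create_a, &create_b));
              /* ...but a lock with hash 0 (e.g. a client's lock) still does. */
              printf("create vs hash 0: %d\n", ibits_conflict(&create_a, &whole_dir));
              return 0;
      }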

      Downsides:
      The lock resource on the MDS corresponding to the parent directory in which shared operations are performed will accumulate a lot of locks (one for each name being modified in the directory), and traversing them might become CPU-consuming as the number of modifications grows.
      This problem could be significantly compounded should the Commit on Share (COS) logic be implemented in the 1.8 codebase later on, as that lets locks on the resource accumulate for several seconds, until the underlying filesystem commits the transactions, instead of just for the duration of the RPC round-trip of the rep-ack process.

      Still, in the absence of COS logic the end result should be better than the current behavior, as forward progress could be made on multiple CPUs.
      UNMEASURED: large SMP systems need to be tested separately to see how data ping-pong between CPUs iterating over the locks on the resource affects overall performance when many clients do shared-directory creates/unlinks.

      Alternatives:
      An approach implemented in the 2.0+ code could be adopted where, instead of a value in the inodebits policy, the hash of the name being modified would be used as part of the resource name. This would eliminate the downsides listed for the first method.
      This approach works like this: for every modifying operation, two locks would need to be taken. One exclusive lock on the per-name resource protects against concurrent operations on the same name, and the other lock, taken just on the parent resource ID in a "semi-shared" mode, lets multiple MDS threads obtain compatible locks while still conflicting with the lock used by clients to guard READDIR pages.
      Since all modifying metadata operations would now need to take two locks on the server, the obvious downside of this approach is that CPU utilization on the MDS would grow, and metadata operations not done in a shared directory would still be somewhat penalized by the extra lock.
      UNMEASURED: it is possible that on large SMP systems the overhead from CPU data and spinlock ping-pong in the case of shared-directory modifications might be quite substantial.
      Protocol compatibility for this alternative approach is still not a concern, since clients currently obtain a PR lock on the parent directory; if the server threads use a CW lock on the parent directory, they would not conflict among themselves but would conflict with the PR locks held by clients.
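
      As a rough sketch only (the resource layout, the FNV-1a name hash and all helper names are assumptions for illustration, not the actual 2.0+ implementation), the two-lock scheme could be expressed like this:

      /* Alternative scheme: one EX lock on a per-name resource plus one CW
       * lock on the parent resource for every modifying operation. */
      #include <stdint.h>
      #include <stdio.h>

      enum lock_mode { LCK_EX, LCK_PW, LCK_PR, LCK_CW, LCK_CR };

      struct res_id { uint64_t f[4]; };        /* simplified resource name */

      struct lock_req {
              struct res_id   res;
              enum lock_mode  mode;
      };

      /* FNV-1a, standing in for whatever name hash the server would use. */
      static uint64_t name_hash(const char *name)
      {
              uint64_t h = 0xcbf29ce484222325ULL;

              for (; *name != '\0'; name++)
                      h = (h ^ (uint8_t)*name) * 0x100000001b3ULL;
              return h;
      }

      /* Build the two lock requests a modifying operation (create/unlink of
       * "name" under "parent") would enqueue:
       *   [0] EX on a per-name resource (parent id + name hash): serializes
       *       only concurrent operations on the very same name;
       *   [1] CW on the parent resource itself: CW/CW is compatible, so MDS
       *       threads proceed in parallel, yet CW conflicts with the PR locks
       *       clients hold to guard cached READDIR pages. */
      static void build_locks(const struct res_id *parent, const char *name,
                              struct lock_req out[2])
      {
              out[0].res = *parent;
              out[0].res.f[3] = name_hash(name);
              out[0].mode = LCK_EX;

              out[1].res = *parent;
              out[1].mode = LCK_CW;
      }

      int main(void)
      {
              struct res_id dir = { { 0x200000007ULL, 0, 0, 0 } };
              struct lock_req locks[2];

              build_locks(&dir, "file-000123", locks);
              printf("per-name resource slot: %#llx\n",
                     (unsigned long long)locks[0].res.f[3]);
              return 0;
      }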

      Out of scope:
      Modification of the VFS and backend filesystem to allow parallel processing of multiple modifying requests within a single directory (the so-called pdirops patch).

      Useful information:
      The LDLM modifications outlined for the first approach are implemented in bug 21660 (patch id 30074) for the 2.0 codebase, which should be fairly similar to 1.8 in the LDLM area.


        Activity

          [LU-21] Allow parallel modifying metadata operations on MDS
          green Oleg Drokin added a comment -

          This is not going to be fixed in 1.8 after all.

          green Oleg Drokin added a comment -

          The back-end fs modifications are out of scope just because this is a totally different kind of work that could be worked on and tracked separately if desired. The idea for now is to boost 1.8 performance to reach certain goals, and just allowing us to scale in the shared create/unlink case is going to significantly boost this use case. Perhaps significantly enough that no further improvements in the 1.8 codebase would be necessary.

          I agree big SMP systems might be a concern. How big a concern remains to be determined by testing.

          As for porting the MDD layer, let's see the numbers from just allowing MDS threads to scale into ldiskfs and see whether we need any more performance or not. If we do, then we can start looking into what additional measures we can implement. It's not like we have infinite resources to push forward multiple releases for infinite time, unfortunately, so let's carefully pick our priorities.


          liang Liang Zhen (Inactive) added a comment -

          Oleg, nice to see you,
          I'm a little concerned about this:

          Out of scope:
          Modification of the VFS and backend filesystem to allow parallel processing of multiple modifying requests within a single directory (the so-called pdirops patch).

          You probably remember I did some metadata performance tests over ldiskfs directly several months ago. The results show the pdirops patch can help with file creation; it's a nice increase but not a dramatic boost, because I think there are still quite a lot of serialized operations in ldiskfs (i.e. contention on buffer_head, contention on the dynlock itself, which is used by the pdirops patch). Also, file removal performance will drop significantly (I think parallel removal will increase the chance of disk seeks). I'm worried things could be worse on super large SMP systems. What do you think?

          As you know, I'm working on an MDD dir-cache + schedulers to localize FS operations on a few cores and reduce data ping-pong (I already have a working prototype). However, we don't have such a layer in 1.8; do you think it would be possible to build a very simple layer like that in 1.8 instead of using the pdirops patch?

          Liang

          green Oleg Drokin added a comment -

          An additional concern I envision with such a project for 1.8 is that Oracle would be very unlikely to pick up an improvement like that into their 1.8 release.


          People

            Assignee: green Oleg Drokin
            Reporter: green Oleg Drokin
            Votes: 0
            Watchers: 7
