
Allow parallel modifying metadata operations on MDS

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.1.0
    • Affects Version/s: Lustre 1.8.6
    • Labels: None

    Description

      Description of the problem:
      In the 1.8 codebase, all parallel modifying operations on a single directory are bottlenecked on a single ldlm lock. To make matters worse, the DLM lock is not only held while the operation is performed, but also until a confirmation is received from the client executing each op, one by one (a process known as rep-ack).
      Another downside is that MDT threads are actually blocked waiting for the lock, so if sufficiently many clients try to modify a directory's content in parallel, they occupy all MDT threads, and even unrelated processes cannot perform any metadata operations until the backlog of requests is cleared. This often leads to delays of many seconds to several minutes for much of the filesystem activity when certain jobs are run (observed at ORNL during large-scale file-creation jobs).

      Proposed solution:

      Introduce an extra 'hash' property for inodebits locks. Users will pass it along with the bits they want to lock. Locks with intersecting bits and incompatible lock modes would be declared incompatible and block each other only if their hash values match, or if either lock has the special hash value of zero.
      MDT threads performing modifying operations would initialize the value to the hash of the name of the file the operation modifies.
      Clients, and the MDS for other operations, would keep the hash value at zero in requested locks.
      Since the parent lock obtained during modifying metadata operations on the MDS is not passed to the client, there is no protocol change.
      The ldlm policy descriptor in the wire structure also has enough space to accommodate a 64-bit hash value.
      As a result, MDS threads would stop blocking on shared-directory operations, with the exception of the case where multiple threads attempt to create the same file; that use case will need to be addressed separately.
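      To make the compatibility rule concrete, here is a minimal sketch in C. The struct and names (ib_lock, ib_hash, ib_locks_compatible, modes_conflict) are illustrative assumptions, not the actual Lustre ldlm API:

          /* Sketch of the proposed hash-aware conflict check for inodebits
           * locks; types and names are illustrative only. */
          #include <stdbool.h>
          #include <stdint.h>

          struct ib_lock {
                  uint64_t ib_bits; /* inodebits requested by this lock */
                  uint64_t ib_hash; /* hash of the name; 0 = whole directory */
          };

          /* Return true when the two locks may be granted concurrently.
           * 'modes_conflict' is the result of the usual ldlm lock-mode
           * compatibility check, passed in to keep the sketch self-contained. */
          static bool ib_locks_compatible(const struct ib_lock *a,
                                          const struct ib_lock *b,
                                          bool modes_conflict)
          {
                  /* Disjoint bit sets never conflict. */
                  if ((a->ib_bits & b->ib_bits) == 0)
                          return true;
                  /* Compatible lock modes never conflict. */
                  if (!modes_conflict)
                          return true;
                  /* A zero hash matches everything, preserving the current
                   * whole-directory semantics. */
                  if (a->ib_hash == 0 || b->ib_hash == 0)
                          return false;
                  /* Different name hashes mean different names in the
                   * directory, so the operations may run in parallel. */
                  return a->ib_hash != b->ib_hash;
          }

      Under this rule, two creates of different names in the same directory carry different hashes and proceed in parallel, while a client lock with hash zero still conflicts with both.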

      Downsides:
      The lock resource on the MDS corresponding to the parent directory in which shared operations are performed will accumulate a lot of locks (one for each name being modified in the directory), and traversing them might become CPU-consuming as the number of modifications grows.
      This problem could be significantly compounded should the Commit on Share (COS) logic be implemented in the 1.8 codebase later on, since COS lets locks accumulate in the resource for several seconds, until the underlying filesystem commits the transactions, instead of just for the duration of an RPC round-trip in the rep-ack process.

      Still, in the absence of COS logic the end result should be better than the current behavior, as forward progress could be made on multiple CPUs.
      UNMEASURED: large SMP systems need to be tested separately to see how data ping-pong between CPUs iterating over locks on the resource affects overall performance when many clients are doing shared-directory creates/unlinks.

      Alternatives:
      An approach implemented in the 2.0+ code could be adopted where, instead of a value in the inodebits policy, the hash of the name being modified would be used as part of the resource name. This would entirely eliminate the downsides listed for the first method.
      This approach works like this: for every modifying operation, two locks would need to be taken. One is an exclusive lock on the per-name resource, to protect against concurrent operations on the same name; the other is a lock on just the parent resource ID in a "semi-shared" mode, so that multiple MDS threads can obtain compatible locks while the lock used by clients to guard READDIR pages still conflicts.
      Since all modifying metadata operations would now need to take two locks on the server, the obvious downside of this approach is that CPU utilization on the MDS would grow, and metadata operations not performed in a shared directory would still be somewhat penalized by the extra lock.
      UNMEASURED: it is possible that on large SMP systems the overhead from CPU data and spinlock ping-pong in the shared-directory modification case might be quite substantial.
      Protocol compatibility for this alternative approach is still not a concern: clients currently obtain a PR lock on the parent directory, so if the server threads use a CW lock on the parent directory, they will not conflict among themselves but will conflict with the PR locks held by clients.
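      For illustration, here is a sketch of the two-lock sequence under this alternative. The resource layout and helper names (res_id, take_lock, name_hash64) are assumptions made for the sketch, not the real Lustre enqueue API:

          /* Sketch: every modifying operation takes a "semi-shared" CW lock
           * on the parent plus an EX lock on a per-name resource. */
          #include <stdint.h>
          #include <stdio.h>

          struct res_id {
                  uint64_t r_fid;       /* parent directory identifier */
                  uint64_t r_name_hash; /* 0 for the plain parent resource */
          };

          /* VAX-style DLM modes: CW locks are mutually compatible but
           * conflict with the PR locks clients hold for READDIR pages. */
          enum lock_mode { LCK_EX, LCK_PW, LCK_PR, LCK_CW };

          /* Stub standing in for a real ldlm enqueue. */
          static void take_lock(struct res_id res, enum lock_mode mode)
          {
                  printf("lock fid=%llu hash=%llu mode=%d\n",
                         (unsigned long long)res.r_fid,
                         (unsigned long long)res.r_name_hash, mode);
          }

          /* Illustrative 64-bit FNV-1a name hash. */
          static uint64_t name_hash64(const char *name)
          {
                  uint64_t h = 14695981039346656037ULL;
                  while (*name)
                          h = (h ^ (uint64_t)(unsigned char)*name++) *
                              1099511628211ULL;
                  return h;
          }

          static void lock_for_modification(uint64_t parent_fid,
                                            const char *name)
          {
                  struct res_id parent   = { parent_fid, 0 };
                  struct res_id per_name = { parent_fid, name_hash64(name) };

                  /* 1. CW on the parent: many MDS threads may hold it at
                   * once, but clients' PR locks on the directory conflict. */
                  take_lock(parent, LCK_CW);

                  /* 2. EX on the per-name resource: serializes only the
                   * operations that touch this particular name. */
                  take_lock(per_name, LCK_EX);
          }

          int main(void)
          {
                  /* Two different names: the EX locks land on different
                   * resources and do not block each other. */
                  lock_for_modification(42, "file-0001");
                  lock_for_modification(42, "file-0002");
                  return 0;
          }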

      Out of scope:
      Modification of the VFS and backend filesystem to allow parallel processing of multiple modifying requests within a single directory (the so-called pdirops patch).

      Useful information:
      The ldlm modifications outlined for the first approach are implemented in bug 21660 (patch id 30074) for the 2.0 codebase, which should be quite similar to 1.8 in the ldlm area.

      Attachments

        Activity

          [LU-21] Allow parallel modifying metadata operations on MDS
          green Oleg Drokin made changes -
          Resolution New: Fixed [ 1 ]
          Status Original: Open [ 1 ] New: Resolved [ 5 ]
          green Oleg Drokin added a comment -

          This is not going to be fixed in 1.8 after all.

          pjones Peter Jones made changes -
          Fix Version/s New: Lustre 2.1.0 [ 10021 ]
          Fix Version/s Original: Lustre 2.0.0 [ 10011 ]
          pjones Peter Jones made changes -
          Affects Version/s New: Lustre 1.8.6 [ 10022 ]
          Affects Version/s Original: Lustre 1.8.x [ 10010 ]
          ian Ian Colle (Inactive) made changes -
          Comment [ Oleg,

          This looks like you've already got a great start on where we need to head
          for the MDS performance improvement.

          Under this new contract we need to add a few things to present to the
          customer. Please take a look at the following wiki link that describes
          what we're looking for in the Solution Architecture document and let me
          know what you think.

          http://wiki.whamcloud.com/display/ENG/Solution+Architecture

          I have a meeting with Galen on Thursday morning, so anything you could get
          done before then would be greatly appreciated. I'd be happy to format it
          in Word or do the wordsmithing, but I'm dependent upon you for the
          technical content - especially for the Use Cases, Test Plans, and
          Acceptance Criteria.

          I'm still trying to work out a Skype time that will be convenient for Sanf
          and us. How late do you work in the evenings?

          Please don't hesitate to ping me on Skype if you have any questions or
          comments.

          Thanks,
            
          Ian R. Colle, PMP
          Project Manager
          Whamcloud
          Cell: +1 303.601.7713
          Email: ian@whamcloud.com
          ]
          sarah Sarah Liu made changes -
          Assignee Original: Sarah Liu [ sarah ] New: Oleg Drokin [ green ]
          sarah Sarah Liu made changes -
          Assignee Original: Oleg Drokin [ green ] New: Sarah Liu [ sarah ]
          green Oleg Drokin made changes -
          Description edited (typo corrections only; text otherwise identical to the description above).
          green Oleg Drokin added a comment -

          The back-end fs modifications are out of scope just because this is a totally different kind of work that could be worked on and tracked separately if desired. The idea for now is to boost 1.8 performance to reach certain goals, and just allowing us to scale in the shared create/unlink case is going to significantly boost this use case. Perhaps significantly enough that no further improvements in the 1.8 codebase would be necessary.

          I agree big SMP systems might be a concern. How big a concern is to be determined by testing.

          As for porting the mdd layer, let's see the numbers from just allowing MDS threads to scale into ldiskfs, and see if we need any more performance or not. If we do, then we can start looking into what additional measures we can implement. It's not like we have infinite resources to push multiple releases forward for infinite time, unfortunately, so let's carefully pick our priorities.


          liang Liang Zhen (Inactive) added a comment -

          Oleg, nice to see you,
          I'm a little concerned about this:

          Out of scope:
          Modification of vfs and backend filesystem allowing parallel processing of multiple modifying requests within a single directory in parallel. (so called pdirops patch).

          You probably remember I did some metadata performance tests over ldiskfs directly several months ago. The results show the pdirops patch can help with file creation; it's a nice increase but not a dramatic boost, because I think there are still quite a lot of serialized operations inside ldiskfs (e.g. contention on buffer_head, and contention on the dynlock itself, which is used by the pdirops patch). Also, file removal performance will drop significantly (I think parallel removal increases the chance of disk seeks). I'm worried things could be worse on a super-large SMP system. What do you think?

          As you know, I'm working on an MDD dir-cache + schedulers to localize FS operations on a few cores and reduce data ping-pong (I already have a working prototype). However, we don't have such a layer in 1.8; do you think it would be possible to build a very simple layer like that in 1.8 instead of using the pdirops patch?

          Liang


          People

            Assignee: green Oleg Drokin
            Reporter: green Oleg Drokin
            Votes: 0
            Watchers: 7

            Dates

              Created:
              Updated:
              Resolved: