Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: Lustre 2.15.0
    • Fix Version/s: Lustre 2.15.0

    Description

      When LSOM is not used on a cluster it would be better to disable it, but I don't see such an option for now.
      While analyzing an MDS vmcore taken under a high load average, we found that the LSOM feature contributes significantly to the load.
      205 threads were blocked in mdt_lsom_update(). A few threads got further and were waiting to take the osd lock for read. It seems that mdt_lsom_update() has a serious issue with a single shared file because of its mdt-level mutex, taken for every close request.
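
      For illustration, a minimal userspace sketch of the serialization (hypothetical names, not the actual mdt code): every close of the same file funnels through one object-wide mutex, so concurrent closers queue behind each other, which matches the blocked threads in the vmcore.

      #include <pthread.h>

      /* Hypothetical stand-in for the per-object state in mdt. */
      struct obj {
              pthread_mutex_t som_mutex;   /* one mutex shared by every closer */
              unsigned long   som_size;    /* cached lazy size-on-MDT */
      };

      /* Called once per close RPC: with a single shared file, all service
       * threads queue on the same som_mutex, like the 205 blocked threads. */
      static void lsom_update_on_close(struct obj *o, unsigned long new_size)
      {
              pthread_mutex_lock(&o->som_mutex);
              if (new_size > o->som_size)
                      o->som_size = new_size;  /* mdt also rewrites the xattr here */
              pthread_mutex_unlock(&o->som_mutex);
      }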

      Attachments

        Activity

          [LU-15252] option to disable LSOM updates

          "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45709/
          Subject: LU-15252 mdt: reduce contention at mdt_lsom_update
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: c8b7afe4970415f8dae84f5e20661f8a3b3681a0

          gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45709/ Subject: LU-15252 mdt: reduce contention at mdt_lsom_update Project: fs/lustre-release Branch: master Current Patch Set: Commit: c8b7afe4970415f8dae84f5e20661f8a3b3681a0

          adilger Andreas Dilger added a comment -

          There is still a second patch in flight that fixes the performance issue instead of working around it.

          pjones Peter Jones added a comment -

          Landed for 2.15


          "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45619/
          Subject: LU-15252 mdc: add client tunable to disable LSOM update
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 19172ed37851fdd5731b1319c12151f5cb1fe267

          gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45619/ Subject: LU-15252 mdc: add client tunable to disable LSOM update Project: fs/lustre-release Branch: master Current Patch Set: Commit: 19172ed37851fdd5731b1319c12151f5cb1fe267

          "Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45709
          Subject: LU-15252 mdt: reduce contention at mdt_lsom_update
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 495a7eb370cf9dbb5eec67bbd0a59ae206cdb68a

          gerrit Gerrit Updater added a comment - "Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45709 Subject: LU-15252 mdt: reduce contention at mdt_lsom_update Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 495a7eb370cf9dbb5eec67bbd0a59ae206cdb68a

          adilger Andreas Dilger added a comment -

          aboyko, I think the LSOM update can be fairly lazy, and there isn't a serious danger if some updates are lost, but there should still be occasional writes of new LSOM data to disk (a policy check along these lines is sketched after this list):

          • if the inode is being written already for some other reason (e.g. atime update, link count, etc.)
          • after a long time since the last LSOM update (e.g. 60s, like atime_diff); maybe the LSOM and atime writes could happen at the same time?
          • when the last client closes the file, including when the client is evicted

          My main concern would be that files which are never properly closed will also never get LSOM updates. That could happen with config files or shared libraries for jobs that run a very long time, and/or files that are being accessed by different jobs and always have some client process holding them open.
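
          A minimal sketch of that write-out policy (hypothetical names; the 60s interval is an assumption modeled on atime_diff, and this is not the Lustre implementation):

          #include <stdbool.h>
          #include <time.h>

          #define LSOM_INTERVAL 60          /* assumed interval, like atime_diff */

          struct lsom_state {
                  time_t last_write;        /* when LSOM last hit disk */
                  int    open_count;        /* remaining opens from all clients */
                  bool   inode_dirty;       /* inode written for another reason */
          };

          /* Decide on each close (or eviction) whether to persist LSOM now. */
          static bool lsom_should_write(const struct lsom_state *s, time_t now)
          {
                  if (s->inode_dirty)       /* piggyback on an unrelated inode write */
                          return true;
                  if (now - s->last_write >= LSOM_INTERVAL)
                          return true;      /* periodic refresh for long-open files */
                  if (s->open_count == 0)   /* last close, including client eviction */
                          return true;
                  return false;
          }

          The first branch also lets a long-open file pick up new LSOM whenever its inode is written for some other reason, which is what addresses the never-closed files mentioned above.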


          aboyko Alexander Boyko added a comment -

          > Similarly, it should be possible to cache a flag on the mdt_object if there is no LOV EA or DoM is used, since this changes very rarely, so reading the LOV EA just for these two bits of information is expensive. That would allow checking whether an LSOM update is needed without the object mutex.

          Maybe it is better to store LSOM on the mdt object and update the xattr only on the last close? In that case only a failover could leave the lazy size stale in the xattr, but I think that is acceptable for LSOM (a sketch follows below).
          Andreas, any objection?
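
          A rough sketch of that idea (hypothetical names, plain userspace C rather than mdt code): keep LSOM in the in-memory object on every close, and write the xattr only when the open count drops to zero. The unlocked update can race, but that only makes the cached value approximate, which LSOM tolerates.

          #include <stdatomic.h>

          struct mdt_obj {
                  atomic_int    open_count;  /* opens across all clients */
                  unsigned long lsom_size;   /* cached lazy size, approximate */
          };

          static void write_lsom_xattr(struct mdt_obj *o)
          {
                  /* persist o->lsom_size to the SOM xattr (stubbed out here) */
                  (void)o;
          }

          static void obj_close(struct mdt_obj *o, unsigned long size)
          {
                  if (size > o->lsom_size)   /* cheap, unlocked in-memory update */
                          o->lsom_size = size;
                  if (atomic_fetch_sub(&o->open_count, 1) == 1)
                          write_lsom_xattr(o);  /* disk write on last close only */
          }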


          aboyko Alexander Boyko added a comment -

          Andreas, I've made a quick fix for Lustre clients only. But I agree with you, LSOM requires fixes on the server side to improve single-shared-file performance. LSOM is still enabled by default, so the change has no impact unless the new tunable is set (see the sketch below).
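
          A sketch of what such a client-side gate could look like (everything here is assumed for illustration; the real tunable and field names are in patch 45619, not reproduced here):

          #include <stdbool.h>

          struct close_req {
                  bool has_som;              /* whether SOM attrs ride on the close */
          };

          struct mdc_device {
                  bool lsom_enabled;         /* the tunable, on by default */
          };

          static void pack_som_attrs(struct close_req *req)
          {
                  req->has_som = true;       /* fill in size/blocks for the MDT */
          }

          /* With the tunable off, the close RPC carries no SOM attributes, so
           * the MDT never enters mdt_lsom_update() for this client's closes. */
          static void mdc_pack_close(struct mdc_device *mdc, struct close_req *req)
          {
                  req->has_som = false;
                  if (mdc->lsom_enabled)
                          pack_som_attrs(req);
          }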


          simmonsja James A Simmons added a comment -

          I agree with Andreas. LSOM is too critical for us to disable, but at the same time we don't want our MDS servers bogged down.

          adilger Andreas Dilger added a comment - edited

          It would be better to fix the reason why LSOM updates are slow. From the comments, this is due to lock contention on the MDS, but there should be a way to avoid it, since this is a "lazy" size and does not have to be totally accurate all the time. For example, checking without the lock whether the incoming LSOM size/blocks is already smaller than the current size/blocks, since LSOM should only be increasing (see the sketch below).

          Also, it may be possible to batch LSOM updates in memory for a few seconds as long as the open counter > 0, since we know/expect some later close will write it to disk, and the update does not need to be part of a transaction if there is no other reason for it.

          Similarly, it should be possible to cache a flag on the mdt_object if there is no LOV EA or DoM is used, since this changes very rarely, so reading the LOV EA just for these two bits of information is expensive. That would allow checking whether an LSOM update is needed without the object mutex.
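
          A sketch of the unlocked pre-check from the first idea above (hypothetical names; it relies on LSOM only ever increasing, so a stale read that skips an update merely defers it to a later close):

          #include <pthread.h>
          #include <stdatomic.h>

          struct obj {
                  _Atomic unsigned long som_size;
                  pthread_mutex_t       som_mutex;
          };

          static void lsom_maybe_update(struct obj *o, unsigned long new_size)
          {
                  /* Fast path, no lock: most closes of a shared file bring no
                   * larger size and can return immediately. */
                  if (new_size <= atomic_load_explicit(&o->som_size,
                                                       memory_order_relaxed))
                          return;

                  pthread_mutex_lock(&o->som_mutex);
                  if (new_size > o->som_size)        /* recheck under the lock */
                          atomic_store_explicit(&o->som_size, new_size,
                                                memory_order_relaxed);
                  pthread_mutex_unlock(&o->som_mutex);
          }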


          People

            Assignee: aboyko Alexander Boyko
            Reporter: aboyko Alexander Boyko
            Votes: 0
            Watchers: 8

            Dates

              Created:
              Updated:
              Resolved: