
Size on MDT with guarantee of eventual consistency

Details

    • Type: New Feature
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.12.0

    Description

      I believe that size on MDT has been discussed for a long time, and there were even some implementations of it before. I am creating this ticket to discuss it again, because keeping file sizes on MDTs seems very important for the new policy engine of Lustre (LiPE) that I am currently working on.

      LiPE scans MDTs directly and extracts almost all of the file attributes. LiPE evaluates a series of mathematical expressions based on these attribute values, and the expression values determine which rules the corresponding file matches. This works perfectly for almost all file metadata except file size, because the MDT doesn't keep file sizes. That is why we want to add file size on MDT.

      Given that size on MDT has been discussed for a long time, I believe many of the problems/difficulties of implementing this feature have been recognized by people in the Lustre community. And I think it is obvious that implementing a strict size on MDT with strong guarantees is too hard.

      For LiPE, I think file sizes with a guarantee of eventual consistency should be enough for most use cases, because: 1) Smart administrators will leave enough margin in data management; I don't think a smart administrator will define any dangerous rule based on the exact file size without enough margin on timestamps and file size. 2) Most management actions can be withdrawn without any data loss. And 3) Data removal is usually double/triple checked before being committed; it is reasonable to ask the administrator to double check the sizes of files to be removed on a Lustre client if size on MDT is not precise all the time.

      Still, we have a lot of choices about how to implement size on MDT, even if we choose to implement a relaxed/lazy version. I believe a lot of the related work from the past could be reused. I guess using a new extended attribute for size on MDT might be better than using the i_size field in the inode structure, since Data-on-MDT is coming. And the size on MDT should be synced in a couple of scenarios that provide enough consistency guarantees while introducing little performance impact, for example: 1) when the last file close finishes, and 2) when a significant time has passed since the last sync.
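      The two sync triggers above could be sketched roughly as follows. This is only an illustration of the proposed policy, not Lustre code; the names (should_sync_size, open_count, last_sync) and the 600-second threshold are all hypothetical:

```python
# Hypothetical sketch: sync the MDT-side size when the last opener closes
# the file, or when the cached value is older than a staleness threshold.
SYNC_INTERVAL = 600.0  # seconds; assumed staleness threshold, not a real tunable

def should_sync_size(open_count: int, now: float, last_sync: float) -> bool:
    """Return True when the lazy size should be pushed to the MDT."""
    if open_count == 0:
        # the last file close just finished (trigger 1)
        return True
    # significant time has passed since the last sync (trigger 2)
    return (now - last_sync) >= SYNC_INTERVAL
```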

      I'd like to work on this once it has been fully discussed and a design is agreed on by all the people involved. Any advice would be appreciated. Thanks!

      Attachments

        Issue Links

          Activity


            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33565
            Subject: LU-9538 utils: update description of ldiskfs xattrs
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b889b2caf3f791083a3785c3c60eb9b78127eca5

            adilger Andreas Dilger added a comment - - edited

            John, the llsom_sync tool, if running on the MDSes, will monitor the Changelog and update the LSOM data by default 10 minutes after the file was modified. It also aggregates updates so that multiple file modifications in the prior 10 minutes do not result in multiple LSOM updates (it is set to the most current size/blocks value).

            If the llsom_sync tool is not running, then the majority of new files will still have the LSOM data updated at close, except when there are strange file write orderings (e.g. many clients doing write/truncate/etc.), or the clients crash before they close the file. That update typically happens as soon as the client closes the file on the MDS.

            Files will also have their LSOM data updated to the current size/blocks when opened and closed by any client (if it has changed), so it is naturally correcting itself over time. That is all the llsom_sync tool is doing in the end - open and close the file after (presumably) it has stopped being modified. If it is still being modified, or is modified again later, there will be another Changelog record written, and llsom_sync will open/close the file another time.
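            The aggregation behavior described above could be sketched like this. All names here are illustrative, not the actual llsom_sync implementation: per-FID pending records keep only the newest size/blocks, and a record is flushed once the file has been idle for the (default) 10-minute window, so many modifications produce a single LSOM update:

```python
# Sketch of llsom_sync-style changelog aggregation (hypothetical names).
from dataclasses import dataclass

IDLE_WINDOW = 600.0  # seconds, matching the 10-minute default described above

@dataclass
class Pending:
    size: int
    blocks: int
    last_modified: float

def process(records, now, pending):
    """Fold changelog records into `pending`; return FIDs ready to flush.

    records: iterable of (fid, size, blocks, mtime) tuples.
    pending: dict mapping fid -> Pending, mutated in place.
    """
    for fid, size, blocks, mtime in records:
        cur = pending.get(fid)
        if cur is None or mtime >= cur.last_modified:
            # aggregate: keep only the most current size/blocks per FID
            pending[fid] = Pending(size, blocks, mtime)
    # flush (and drop) entries idle for at least IDLE_WINDOW
    ready = [fid for fid, p in pending.items()
             if now - p.last_modified >= IDLE_WINDOW]
    for fid in ready:
        del pending[fid]
    return ready
```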


            "When a significant time has passed since the last sync": was this value defined? Is there a config variable?

            pjones Peter Jones added a comment -

            Landed for 2.12


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/30124/
            Subject: LU-9538 utils: Tool for syncing file LSOM xattr
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: caba6b9af07567ff4cdae9f6450f399cd3ca445e


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32918/
            Subject: LU-9538 utils: fix lfs xattr.h header usage
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: cc234da91b6c00cbe681d7352320df94c09dc288


            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32918
            Subject: LU-9538 utils: fix lfs xattr.h header usage
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 59921f66904c17b77a69f9bb4bc0b0d8676d32f4


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/29960/
            Subject: LU-9538 mdt: Lazy size on MDT
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: f1ebf88aef2101ff9ee30b0ddea107e8f700c07f


            For the 2.12 release, it would be great if lfs find could be enhanced to use the LSOM data from the MDS when checking -size or -blocks. Maybe an lfs find --lazy option could be added to determine whether the LSOM data is used or not. At first, this could use the lgetxattr("trusted.som") interface to get the LSOM attr, but eventually this should be converted to use the statx(AT_STATX_DONT_SYNC) interface on the client. That is an internal implementation detail that the user should not care about when using --lazy, and it can be done at some later time.

            Ideally, the use of lgetxattr() would avoid sending an extra RPC to the MDS to fetch the lazy size, but this is not going to be worse than fetching the size from the OSS nodes, as it would only involve a single MDS_GETXATTR RPC (and may already be prefetched to the client).
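            A hypothetical sketch of fetching and decoding the "trusted.som" xattr from the client side. The binary layout assumed here (a 16-bit flags word, padding, then little-endian 64-bit size and blocks) is purely illustrative, not a statement of the actual on-disk struct; real code should use the definitions shipped in the Lustre headers:

```python
# Sketch: read the lazy size from the trusted.som xattr (assumed layout).
import os
import struct

SOM_FMT = "<H6xQQ"  # assumed: u16 flags, 6 pad bytes, u64 size, u64 blocks

def decode_som(raw: bytes):
    """Decode an assumed LSOM xattr blob into (flags, size, blocks)."""
    flags, size, blocks = struct.unpack_from(SOM_FMT, raw)
    return flags, size, blocks

def read_lazy_size(path: str):
    """Fetch and decode the LSOM xattr for a file on a Lustre mount."""
    return decode_som(os.getxattr(path, "trusted.som"))
```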


            Pasting my comment from the patch here for future discussion:

            It is clearly understood that LSOM will be inaccurate in many cases, especially when there are concurrent writes/truncates. However, LSOM is designed to be inaccurate in this way. The accuracy of LSOM cannot be trusted at any time. It is only a way to speed up scanning tools or policy engines. Whenever the MDT scanning tools or policy engines need an accurate size, they should check the real file size/blocks.

            For example, the policy engine might want to run "lfs migrate" on all of the files that are bigger than 1GB to balance OST usage. Even though LSOM might be inaccurate, it is very likely that small files have small LSOM and large files have large LSOM. In the worst case, all of the LSOM values could be wrong. But because most large files have large LSOM, it is very likely the policy engine can find most of the files that are bigger than 1GB. If 90% of the files larger than 1GB have LSOM larger than 1GB, then the policy engine can find and migrate 90% of the files. And that is totally good enough for OST usage balancing.

            After the policy engine gets the list of "suspected" files that could be larger than 1GB by scanning LSOM, it can double check by getting the real file size. If the file size is smaller than 1GB, the policy engine can skip the migration.
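            A minimal sketch of the two-phase check described above, with hypothetical helper names: first filter candidates by the possibly stale LSOM size, then confirm each candidate against the authoritative size before migrating:

```python
# Sketch: two-phase size filter for a policy engine (illustrative names).
ONE_GB = 1 << 30

def select_for_migration(files, real_size):
    """files: iterable of (path, lsom_size); real_size: path -> true size."""
    # phase 1: cheap MDT-side scan using the (possibly stale) LSOM size
    candidates = [path for path, lsom in files if lsom > ONE_GB]
    # phase 2: double check with the real size; skip files the stale
    # LSOM overstated
    return [path for path in candidates if real_size(path) > ONE_GB]
```

Note that a file whose LSOM understates its real size (large file, small LSOM) is missed by phase 1 entirely; as the comment goes on to explain, the syncing tool corrects such entries eventually, so the next scan loop picks them up.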

            There might be some files that are larger than 1GB but have an LSOM smaller than 1GB. The LSOM syncing tool can help to sync the LSOM eventually, and the policy engine may find and migrate those files in the next loop.

            Even with the LSOM syncing tool, the policy engine could still miss some files that are larger than 1GB. But that is totally fine. It is very likely that only a very small fraction of the files are missed. And please note that the file system is changing all the time. Even if a tool could print all of the files larger than 1GB at a given scanning time, conditions could change in the next second because of writes/truncates. In these kinds of use cases, it is impossible to ensure 100% accuracy, and 100% accuracy is unnecessary for these use cases anyway. To get 100% accuracy, the administrator would need to 1) stop all of the I/O, and 2) scan the whole file system using the real file sizes. I can't think of any use case for that. It is really exclusive and really slow, which I don't think is useful for data management on an actively running file system.

            I think the current design is already enough for a lot of use cases. Of course, improving LSOM would always be nice. The accuracy that the policy engine can get from scanning LSOM could perhaps be improved from 99% to 99.9%. But we need to consider whether the effort to decrease the inaccuracy from 1% to 0.1% is really worth it. And we currently don't have any data or experience about the accuracy at all. I think we need to land the LSOM feature and use it for the current use cases. We will soon know whether the accuracy is enough or not, and enough statistics can be collected to make that decision or judgment at that time. And if the accuracy is obviously not enough for some use cases or in some corner cases, we can improve it later at any time.


            By "here" I mean on the current patch.


            People

              Assignee: Li Xi (Inactive)
              Reporter: Li Xi (Inactive)
              Votes: 0
              Watchers: 26
