
Size on MDT with guarantee of eventual consistency

Details

    • Type: New Feature
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.12.0
    • Labels: None

    Description

      I believe that size on MDT has been discussed for a long time, and there have even been some implementations of it before. I am creating this ticket to discuss it again, because keeping file sizes on MDTs seems very important for the new policy engine for Lustre (LiPE) that I am currently working on.

      LiPE scans MDTs directly and extracts almost all of the file attributes. LiPE then evaluates a series of mathematical expressions over these attribute values, and the expression values determine which rules the corresponding file matches. This works perfectly for almost all file metadata except file sizes, because the MDT does not keep file sizes. That is why we want to add file size on MDT.

      Given that file size on MDT has been discussed for a long time, I believe many of the problems and difficulties of implementing this feature have already been recognized by the Lustre community. And I think it is obvious that implementing strict size on MDT with strong guarantees is too hard.

      For LiPE, I think file sizes with a guarantee of eventual consistency should be enough for most use cases, because 1) sensible administrators will leave enough margin in their data management policies; I don't think a sensible administrator will define any dangerous rule based on exact file sizes without sufficient margins on timestamps and sizes. 2) Most management actions can be withdrawn without any data loss. And 3) data removal is usually double- or triple-checked before being committed. It is reasonable to ask the administrator to double-check, on a Lustre client, the sizes of files to be removed if file size on MDT is not precise at all times.

      Still, we have a lot of choices about how to implement file size on MDT, even if we choose to implement a relaxed/lazy version. I believe a lot of related work from the past could be reused. I guess using a new extended attribute for file size on MDT might be better than using the i_size field in the inode structure, since Data-on-MDT is coming. And file size on MDT should be synced in a couple of scenarios that provide enough consistency guarantees while introducing little performance impact, for example 1) when the last close of the file finishes, and 2) when a significant time has passed since the last sync.

      I'd like to work on this once it has been fully discussed and a design is agreed on by everyone involved. Any advice would be appreciated. Thanks!
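As a sketch of the extended-attribute approach described above, a scanner might read a lazily maintained size from a per-file xattr. The xattr name "trusted.som" appears later in this ticket; the struct layout and the helper get_lazy_size() below are illustrative assumptions, not a stable Lustre ABI.

```c
/* Sketch: read a lazily maintained size-on-MDT value from an xattr.
 * The "trusted.som" name comes from this ticket's discussion; the
 * struct layout here is an ASSUMPTION for illustration only. */
#include <stdint.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Hypothetical on-disk layout of the lazy size-on-MDT xattr. */
struct lsom_attr {
        uint16_t lsa_valid;       /* validity flags */
        uint16_t lsa_reserved[3];
        uint64_t lsa_size;        /* lazily updated file size */
        uint64_t lsa_blocks;      /* lazily updated block count */
};

/* Returns 0 and fills *size on success; -1 if the xattr is absent
 * (e.g. on a non-Lustre filesystem) or has an unexpected length. */
int get_lazy_size(const char *path, uint64_t *size)
{
        struct lsom_attr lsa;
        ssize_t rc = lgetxattr(path, "trusted.som", &lsa, sizeof(lsa));

        if (rc != (ssize_t)sizeof(lsa))
                return -1;
        *size = lsa.lsa_size;
        return 0;
}
```

A caller that gets -1 back would fall back to a regular stat() of the file, which is exactly the "double check the real size" pattern discussed below.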

      Attachments

        Issue Links

          Activity

            [LU-9538] Size on MDT with guarantee of eventual consistency

            johnbent John Bent (Inactive) added a comment -

            "when a significant time has passed since the last sync": was this value defined? Is there a config variable?
            pjones Peter Jones added a comment -

            Landed for 2.12


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/30124/
            Subject: LU-9538 utils: Tool for syncing file LSOM xattr
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: caba6b9af07567ff4cdae9f6450f399cd3ca445e

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32918/
            Subject: LU-9538 utils: fix lfs xattr.h header usage
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: cc234da91b6c00cbe681d7352320df94c09dc288

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32918
            Subject: LU-9538 utils: fix lfs xattr.h header usage
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 59921f66904c17b77a69f9bb4bc0b0d8676d32f4

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/29960/
            Subject: LU-9538 mdt: Lazy size on MDT
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: f1ebf88aef2101ff9ee30b0ddea107e8f700c07f

            adilger Andreas Dilger added a comment -

            For the 2.12 release, it would be great if lfs find could be enhanced to use the LSOM data from the MDS when checking -size or -blocks. Maybe an lfs find --lazy option could be added to determine whether the LSOM data is used. At first, this could use the lgetxattr("trusted.som") interface to get the LSOM attr, but eventually it should be converted to use the statx(AT_STATX_DONT_SYNC) interface on the client. That is an internal implementation detail that the user should not care about when using --lazy, and it can be done at some later time.

            Ideally, the use of lgetxattr() would avoid sending an extra RPC to the MDS to fetch the lazy size, but even so this is not going to be worse than fetching the size from the OSS nodes, as it would only involve a single MDS_GETXATTR RPC (and the xattr may already be prefetched to the client).
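The statx(AT_STATX_DONT_SYNC) interface mentioned above is a standard Linux syscall (glibc 2.28+); the flag tells the kernel that a cached or lazily maintained value is acceptable, which is what would let a Lustre client answer from LSOM data without OST RPCs. A minimal sketch, with a helper name of my own choosing; on local filesystems the flag is effectively a no-op:

```c
/* Sketch: fetch a possibly-stale file size via statx(), telling the
 * kernel that a cached/lazy value is acceptable. On Lustre this is
 * what would allow the size to come from MDT-side LSOM data. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>

/* Returns 0 and fills *size with a possibly-stale size, -1 on error. */
int get_size_dont_sync(const char *path, long long *size)
{
        struct statx stx;

        if (statx(AT_FDCWD, path, AT_STATX_DONT_SYNC, STATX_SIZE, &stx) != 0)
                return -1;
        *size = (long long)stx.stx_size;
        return 0;
}
```

Hiding the lgetxattr-vs-statx choice behind a helper like this is what makes it an internal implementation detail from the point of view of lfs find --lazy.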

            lixi Li Xi (Inactive) added a comment -

            Pasting my comment from the patch here for future discussion:

            It is clearly understood that LSOM will be inaccurate in many cases, especially when there are concurrent writes/truncates. However, LSOM is designed to tolerate this inaccuracy. The accuracy of LSOM cannot be trusted at any time. It is only a way to speed up scanning tools or policy engines. Whenever an MDT scanning tool or policy engine needs an accurate size, it should check the real file size/blocks.

            For example, the policy engine might want to run "lfs migrate" on all of the files that are bigger than 1GB to balance OST usage. Even though LSOM might be inaccurate, it is very likely that small files have small LSOM values and large files have large LSOM values. In the worst case, all of the LSOM values could be wrong. But because most large files have large LSOM values, it is very likely the policy engine can find most of the files that are bigger than 1GB. If 90% of the files larger than 1GB have LSOM larger than 1GB, then the policy engine can find and migrate 90% of the files. And that is good enough for OST usage balancing.

            After the policy engine gets the list of "suspected" files that could be larger than 1GB by scanning LSOM, it can double-check by getting the real file size. If the file size is smaller than 1GB, the policy engine can skip the migration.

            There might be some files that are larger than 1GB but have LSOM smaller than 1GB. The LSOM syncing tool can help sync the LSOM eventually, and the policy engine may find and migrate those files in a later loop.

            Even with the LSOM syncing tool, the policy engine could still miss some files that are larger than 1GB. But that is fine. It is very likely that only a very small fraction of the files is missed. And please note that the file system is changing all the time. Even if a tool could print every file larger than 1GB at the moment of scanning, conditions could change the next second because of writes/truncates. In these kinds of use cases, it is impossible to ensure 100% accuracy, and 100% accuracy is unnecessary. To get 100% accuracy, the administrator would need to 1) stop all of the I/O, and 2) scan the whole file system using real file sizes. I cannot think of any use case for that. It is exclusive and slow, and I do not think it is useful for data management on an actively running file system.

            I think the current design is already sufficient for a lot of use cases. Of course, improving LSOM would always be nice. The accuracy that a policy engine can get from scanning LSOM could perhaps be improved from 99% to 99.9%. But we need to consider whether the effort to decrease the inaccuracy from 1% to 0.1% is really worth it. And we currently have no data or experience about the accuracy at all. I think we need to land the LSOM feature and use it for the current use cases. We will soon know whether the accuracy is sufficient, and enough statistics can be collected to make that judgment. If the accuracy turns out to be insufficient for some use cases or corner cases, we can improve it later at any time.
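The two-phase scan described above (a cheap filter on lazy sizes, then a double-check against the real size before acting) can be sketched as follows; the function names and threshold are illustrative only:

```c
/* Sketch of the two-phase policy-engine scan: phase 1 builds a
 * candidate list from (possibly wrong) lazy sizes scanned off the
 * MDT, phase 2 confirms each candidate against the authoritative
 * size before acting on it. Names and threshold are illustrative. */
#include <stddef.h>
#include <stdint.h>
#include <sys/stat.h>

/* Phase 1: cheap filter on lazy sizes. Caller provides out_idx with
 * room for at least n entries; returns the number of candidates. */
size_t select_candidates(const uint64_t *lazy_sizes, size_t n,
                         uint64_t threshold, size_t *out_idx)
{
        size_t count = 0;

        for (size_t i = 0; i < n; i++)
                if (lazy_sizes[i] >= threshold)
                        out_idx[count++] = i;
        return count;
}

/* Phase 2: double-check one candidate with the real size; returns 1
 * only if the file truly exceeds the threshold and should be acted
 * on (e.g. migrated), 0 otherwise. */
int confirm_candidate(const char *path, uint64_t threshold)
{
        struct stat st;

        if (stat(path, &st) != 0)
                return 0;   /* file vanished: skip, retry next loop */
        return (uint64_t)st.st_size >= threshold;
}
```

A false positive in phase 1 (small file with a stale large LSOM) only costs one stat() in phase 2; a false negative is picked up in a later loop once the syncing tool has refreshed the xattr, which is the eventual-consistency argument made above.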

            vitaly_fertman Vitaly Fertman added a comment -

            By "here" I mean in the current patch.

            vitaly_fertman Vitaly Fertman added a comment -

            I would like to clarify again when the [cm]time logic will be covered. Originally SOM, despite its name, was intended to handle all of the OSS-side attributes. With a kind of lazy SOM, it looks like there is no need to go to the OSS at all, which is not true; thus not all of the Robinhood issues are resolved, since Robinhood has time-based policies. Yes, we can defer this to the next patch, so let's create a ticket; but from my point of view it makes more sense to add it right here.

            vitaly_fertman Vitaly Fertman added a comment -

            Alex, that's right, it is more about a guessed SOM, which may still be useful for things like Robinhood, and it would be better not to lose it completely.

            People

              Assignee: lixi Li Xi (Inactive)
              Reporter: lixi Li Xi (Inactive)
              Votes: 0
              Watchers: 26

              Dates

                Created:
                Updated:
                Resolved: