Details

    • Type: Technical task
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.7.0
    • 7020

    Description

      Record linkEA verification history in RAM

      To know which linkEA entries on object_A have been verified, LFSCK must pin object_A in RAM and record the verification history of its linkEA entries. To avoid exhausting available memory, not all objects are pinned in RAM. Let V be the number of linkEA entries verified so far, N the hard link count, and L the number of linkEA entries. LFSCK pins an object in RAM only when its first linkEA entry has just been verified and the object has multiple hard links or multiple linkEA entries, i.e. (V == 1) && (N > 1 || L > 1). Consider the following cases (a sketch of the pin decision follows the list):

      L > 1 || N > 1
      LFSCK treats the linkEA entries as unverified because the in-RAM verification history is absent. In this case the object must be pinned in RAM until all of its linkEA entries are verified.

      L == 1 && N == 1
      Typically, this is a singly-linked object. If LFSCK finds the directory entry pointing to object_A that matches the unique linkEA entry, processing is complete. Otherwise, if a name entry pointing to object_A does not match the unique linkEA entry, a new linkEA entry is added and 'L' increases ('N' does not), which turns this into case 1; object_A and its linkEA verification history are then pinned in RAM.

      It is possible that the first name entry found matches the unique linkEA entry, so that L == V == N == 1; in that case we neither record the object in lfsck_namespace nor pin it in RAM. But as the LFSCK scan proceeds, more name entries pointing to the same object may be found; at that point, as the new linkEA entries are added, the object is pinned in RAM, recorded in the lfsck_namespace file, and double scanned later. For a large system this kind of "upgrading" is very rare; we prefer to double scan these objects rather than pin many unnecessary objects in RAM.

      L == 0
      This is usually an IGIF object. When new linkEA entries are added, it becomes case 2 or case 1.
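
      As a minimal sketch (hypothetical structure and function names, not the actual Lustre code), the pin decision above could look like this:

          #include <stdbool.h>

          struct ns_verify_hist {
                  unsigned int vh_verified; /* V: linkEA entries verified so far */
                  unsigned int vh_nlink;    /* N: hard link count of the object */
                  unsigned int vh_linkea;   /* L: entries stored in the linkEA */
          };

          static inline bool ns_need_pin(const struct ns_verify_hist *h)
          {
                  /* Pin when the first entry has just been verified and more
                   * work remains: (V == 1) && (N > 1 || L > 1). */
                  return h->vh_verified == 1 &&
                         (h->vh_nlink > 1 || h->vh_linkea > 1);
          }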

      If too many objects are pinned in RAM, the server may come under memory pressure. To avoid exhausting memory, LFSCK needs to unpin objects from RAM. The following un-pinning conditions apply (a sketch follows the list):

      L == V
      All the known linkEA entries on the object are valid. Although more directory entries pointing to the object may still be found as the LFSCK scan proceeds, it is unnecessary to keep maintaining the verification history; instead, set an on-disk VERIFIED flag on the object in the lfsck_namespace file. If more directory entries pointing to the object are found later, LFSCK detects this flag and just adds the new linkEA entries without maintaining the verification history.

      Memory pressure
      All objects with L == V have been unpinned from RAM but there is still memory pressure. LFSCK then unpins some half-verified objects from RAM. Since these objects were stored in lfsck_namespace when they were pinned in RAM, any invalid linkEA entries on the unpinned, half-processed objects can still be handled during the double scan.
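
      A minimal sketch of the unpin policy, again with hypothetical names (the VERIFIED flag constant below is assumed, not taken from the Lustre tree):

          #include <stdbool.h>

          #define NS_FLAG_VERIFIED 0x01 /* assumed on-disk flag, illustrative only */

          struct pinned_obj {
                  unsigned int po_verified;   /* V: verified linkEA entries */
                  unsigned int po_linkea;     /* L: linkEA entries on the object */
                  unsigned int po_disk_flags; /* flags mirrored to lfsck_namespace */
          };

          /* Condition 1: L == V. Everything known is valid, so persist a
           * VERIFIED flag and let the caller drop the in-RAM history. */
          static bool unpin_if_verified(struct pinned_obj *po)
          {
                  if (po->po_verified == po->po_linkea) {
                          po->po_disk_flags |= NS_FLAG_VERIFIED;
                          return true; /* caller frees the pinned history */
                  }
                  return false;
          }

          /* Condition 2: under continued memory pressure, half-verified
           * objects may be unpinned as well; their lfsck_namespace records
           * ensure invalid linkEA entries are caught in the double scan. */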

      Attachments

      Issue Links

      Activity

            [LU-5820] LFSCK 4: Record linkEA verification history in RAM
            pjones Peter Jones added a comment -

            Landed for 2.7


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12809/
            Subject: LU-5820 lfsck: use multiple namespace LFSCK trace files
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 16beb6a0bd9633585781978e01c3bb21d44b2f69


            yong.fan nasf (Inactive) added a comment -

            It has been verified that caching the linkEA verification history in RAM may be unnecessary, but, as shown above, when a lot of FIDs are recorded in the namespace LFSCK trace file, namespace LFSCK performance slows down. In fact, the namespace LFSCK uses the trace file to record the FID of any object that has multiple hard links, has a remote name entry, contains some uncertain inconsistency, and so on. A single namespace LFSCK trace file is not efficient, especially when there are millions of FIDs to record. So it is still valuable to improve namespace LFSCK performance by enhancing the trace file handling: use multiple namespace LFSCK trace files, with a per-file semaphore to control concurrent access to each trace file. The patch http://review.whamcloud.com/#/c/12809/ does that.

            We need to land http://review.whamcloud.com/#/c/12809/ on master before b2_7 is released, to avoid compatibility issues in the future.
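
            A minimal sketch of that scheme, assuming kernel semaphore primitives and an illustrative file count (the real patch may differ in names and details):

                #include <linux/semaphore.h>
                #include <linux/types.h>

                #define LFSCK_TRACE_FILES 32 /* assumed count, illustrative only */

                struct lfsck_trace_file {
                        struct semaphore ltf_sem; /* serializes access to this file */
                        /* handle to the backing trace file object would live here */
                };

                /* each ltf_sem is sema_init()ed to 1 during LFSCK setup */
                static struct lfsck_trace_file lfsck_trace[LFSCK_TRACE_FILES];

                /* Hash the FID's object id so a given object always maps to the
                 * same trace file; unrelated objects spread across the files and
                 * no longer contend on a single lock. */
                static struct lfsck_trace_file *lfsck_trace_select(__u32 oid)
                {
                        return &lfsck_trace[oid % LFSCK_TRACE_FILES];
                }

                static void lfsck_trace_record(__u32 oid /* , record payload */)
                {
                        struct lfsck_trace_file *ltf = lfsck_trace_select(oid);

                        down(&ltf->ltf_sem);
                        /* ... insert/update the FID record in this trace file ... */
                        up(&ltf->ltf_sem);
                }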


            adilger Andreas Dilger added a comment -

            According to http://www.pdsi-scidac.org/fsstats/ and fsstats that I've collected from other Lustre sites, the number of files with more than one hard link is very small in HPC. Of the 24 sites that I have data for, 99.98% of 196M total inodes had only a single link, and 6 of the 24 sites had no hard-linked files at all. Based on the performance results shown, processing time increases by about 1% per 1% of hard-linked files, so this optimization would typically only improve speed by about 0.02%, and it doesn't seem worth the extra complexity for such a marginal improvement. Also, for very large filesystems or filesystems with many hard links, the extra memory usage may slow down normal usage due to cache pressure.
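
            Spelling out the arithmetic behind that 0.02% estimate, using the fsstats numbers above:

                \[
                \text{hard-linked fraction} \approx 100\% - 99.98\% = 0.02\%, \qquad
                \text{expected speedup} \approx 0.02\% \times \frac{1\%\ \text{time}}{1\%\ \text{hard links}} = 0.02\%.
                \]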

            yong.fan nasf (Inactive) added a comment - edited

            Here are some test results:

            -------------------------
            Test Environment: OpenSFS Cluster

            node  | CPU                                                              | RAM  | DISK              | Role               | Partition
            mds03 | 2 * Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 16 logical processors | 64GB | 200GB SATA        | MDS (MDT0), client | /dev/sda1
            oss01 | 2 * Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 16 logical processors | 64GB | 2TB DS S12S-R2240 | OSS (OST0)         | /dev/sda1

            -------------------------
            Test Description:
            Use a single MDT and a single OST to test namespace LFSCK performance. Create 1,000 sub-directories under /ROOT; under each sub-directory, generate 10,000 regular files, N of which have two hard links. We test several values of N: 100, 200, 300, 400, 500, so the percentage of multiply-linked files ranges from 1% to 5%. For each percentage, we run the namespace LFSCK routine check without caching the linkEA verification history in RAM, and also with a simulation of caching the linkEA verification history in RAM. Each case is tested 3 times. (A sketch of generating such a file set follows.)
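
            A minimal sketch of how such a file set could be generated (illustrative only; the actual test scripts are in the patches referenced below):

                #include <errno.h>
                #include <fcntl.h>
                #include <stdio.h>
                #include <sys/stat.h>
                #include <unistd.h>

                /* Create 1,000 directories of 10,000 files each; the first
                 * nhard files in each directory get a second hard link. */
                static int make_fileset(const char *root, int nhard)
                {
                        char dir[512], file[640], hard[640];
                        int d, i, fd;

                        for (d = 0; d < 1000; d++) {
                                snprintf(dir, sizeof(dir), "%s/d%04d", root, d);
                                if (mkdir(dir, 0755) != 0 && errno != EEXIST)
                                        return -errno;
                                for (i = 0; i < 10000; i++) {
                                        snprintf(file, sizeof(file), "%s/f%05d", dir, i);
                                        fd = open(file, O_CREAT | O_WRONLY, 0644);
                                        if (fd < 0)
                                                return -errno;
                                        close(fd);
                                        if (i >= nhard)
                                                continue;
                                        snprintf(hard, sizeof(hard), "%s/h%05d", dir, i);
                                        if (link(file, hard) != 0) /* second hard link */
                                                return -errno;
                                }
                        }
                        return 0;
                }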

            -------------------------
            Raw data (times in seconds; values that were originally marked in red look abnormal and are excluded from the averages):

            hardlink % | cycle | phase1 (no cache linkEA) | phase2 (no cache linkEA) | phase1 (cache linkEA) | phase2 (cache linkEA)
            1          | 1     | 90                       | 1                        | 98                    | 0
            1          | 2     | 101                      | 1                        | 100                   | 0
            1          | 3     | 99                       | 1                        | 99                    | 0
            2          | 1     | 102                      | 2                        | 97                    | 0
            2          | 2     | 100                      | 2                        | 102                   | 26
            2          | 3     | 100                      | 2                        | 101                   | 0
            3          | 1     | 103                      | 2                        | 102                   | 1
            3          | 2     | 102                      | 2                        | 101                   | 1
            3          | 3     | 100                      | 2                        | 101                   | 1
            4          | 1     | 135                      | 3                        | 102                   | 1
            4          | 2     | 104                      | 3                        | 103                   | 1
            4          | 3     | 102                      | 3                        | 102                   | 1
            5          | 1     | 106                      | 4                        | 100                   | 1
            5          | 2     | 106                      | 4                        | 102                   | 1
            5          | 3     | 105                      | 4                        | 100                   | 1

            -------------------------
            Average data without invalid ones (marked as red digits)

            hardlink % | total time (s, no cache linkEA) | total time (s, cache linkEA) | improvement from caching
            1          | 101                             | 99                           | 1.98%
            2          | 103                             | 100                          | 2.91%
            3          | 105                             | 102                          | 2.86%
            4          | 106                             | 103                          | 2.83%
            5          | 110                             | 102                          | 7.27%
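
            The last column is the relative saving from caching, (no-cache - cache) / no-cache; for example, the 1% row gives

                \[
                \frac{101 - 99}{101} = 1.98\%.
                \]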

            -------------------------
            Conclusion:
            For a file set with 10,000,000 files, if the percentage of multiply-linked files is not more than 4%, the performance improvement from caching the linkEA verification history in RAM is less than 3%. In the real world, 4% multiply-linked files is already a very high percentage. On the other hand, a real implementation of caching the linkEA verification history in RAM would be more complex than the current simulation, and its performance improvement would not reach the simulated numbers, because the simulation simplifies and ignores a lot of corner/race cases.

            So caching the linkEA verification history in RAM may not be worthwhile in most cases.


            gerrit Gerrit Updater added a comment -

            Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/12809
            Subject: LU-5820 lfsck: not record FID for double scan repeatedly
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 315a176e8445da3c3eb20b2c2ff5b3f94f1a2bd2


            gerrit Gerrit Updater added a comment -

            Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/12794
            Subject: LU-5820 lfsck: test performance of caching linkea
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c1216f1cf89dcd37729dd7e5234c3b284a890f66


            gerrit Gerrit Updater added a comment -

            Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/12769
            Subject: LU-5820 lfsck: misc patch for LFSCK performance test
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 653db911a0802da003c8b0c9a22fec5eaceb794b


            adilger Andreas Dilger added a comment -

            The first thing to measure here is how much performance impact writing entries to lfsck_namespace actually has. There isn't any benefit to implementing this complex change if it is not going to improve performance. This can be done by running LFSCK on a good-sized filesystem with some percentage of hard links, say 1%, 5%, 10%, 25%, 50%, either with the current code writing lfsck_namespace to a file on disk, or in a hack mode where it is recorded only in memory (e.g. a linked list or similar). If there is no significant difference in performance, there is no reason to implement this change.

            If the performance improvement is significant when inodes are not written to disk, then I think that inodes should not be written to lfsck_namespace unless they are pushed from RAM. The vast majority of objects will not need to be written to disk.
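
            A minimal sketch of the suggested memory-only hack mode, using kernel list primitives (names are illustrative, not from the Lustre tree):

                #include <linux/errno.h>
                #include <linux/list.h>
                #include <linux/slab.h>
                #include <linux/types.h>

                /* One in-RAM record per traced FID, standing in for the on-disk
                 * lfsck_namespace write for the duration of the measurement. */
                struct mem_fid_rec {
                        struct list_head mfr_link;
                        __u64            mfr_seq; /* FID sequence */
                        __u32            mfr_oid; /* FID object id */
                };

                static LIST_HEAD(mem_fid_list);

                static int lfsck_trace_mem(__u64 seq, __u32 oid)
                {
                        struct mem_fid_rec *rec;

                        rec = kmalloc(sizeof(*rec), GFP_NOFS);
                        if (rec == NULL)
                                return -ENOMEM;

                        rec->mfr_seq = seq;
                        rec->mfr_oid = oid;
                        list_add_tail(&rec->mfr_link, &mem_fid_list);
                        return 0;
                }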


            rhenwood Richard Henwood (Inactive) added a comment -

            This is a complex optimization that was omitted from LFSCK 1.5. The design is recorded here for review during future LFSCK phases.


            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: rhenwood Richard Henwood (Inactive)
              Votes: 0
              Watchers: 7
