[LU-15330] ext2fs_get_pathname() very slow for large directory

Details

    • Bug
    • Resolution: Unresolved
    • Major

    Description

      Running e2fsck on an MDT with a very large REMOTE_PARENT_DIR is extremely slow if entries in that directory need to be repaired. On a system with a 60M-entry REMOTE_PARENT_DIR, each directory entry took about 1.4s to repair under PR_3_UNCONNECTED_DIR:

      Unconnected directory inode 2102494 (/REMOTE_PARENT_DIR/???)
      Connect to /lost+found? yes
      Unconnected directory inode 2102510 (/REMOTE_PARENT_DIR/???)
      Connect to /lost+found? yes
      Unconnected directory inode 2102514 (/REMOTE_PARENT_DIR/???)
      Connect to /lost+found? yes
      

      Depending on how many unattached entries there are, this might take days, weeks, or even months to complete (at about 1.4s per entry, 1M files would take roughly 16 days, i.e. about 2 weeks, to repair).

      Attaching ltrace to e2fsck showed that all of the time is spent in ext2fs_get_pathname(), opening and iterating through all of the entries in the huge directory (ltrace slowed the per-file repair time from about 1s to 14s, but the fraction of time spent in that call is the same):

      1638486316.336885 ext2fs_read_inode(0x18ab2f0, 0x261db5f2, 0x7ffdc97a2d00, 10) = 0 <0.000069>
      1638486316.336977 ext2fs_link(0x18ab2f0, 11, 0x7ffdc97a2d80, 0x261db5f2) = 0 <0.001130>
      1638486316.338130 ext2fs_read_inode(0x18ab2f0, 0x261db5f2, 0x7ffdc97a2bd0, 0xa626870) = 0 <0.000071>
      1638486316.338223 ext2fs_icount_increment(0x383027b0, 0x261db5f2, 0, 0x18ab2b0) = 0 <0.000084>
      1638486316.338329 ext2fs_icount_increment(0x1efa400, 0x261db5f2, 0, 0) = 0 <0.000073>
      1638486316.338425 ext2fs_write_inode(0x18ab2f0, 0x261db5f2, 0x7ffdc97a2bd0, 0) = 0 <0.000094>
      1638486316.338542 ext2fs_u32_list_test(0x1efa310, 0x261db5f2, 11, 0) = 0 <0.000069>
      1638486316.338633 ext2fs_dir_iterate(0x18ab2f0, 0x261db5f2, 1, 0 <unfinished ...>
      1638486316.338727 ext2fs_read_inode(0x18ab2f0, 0x83f7c001, 0x7ffdc97a28f0, 0) = 0 <0.000071>
      1638486316.338819 ext2fs_icount_decrement(0x383027b0, 0x83f7c001, 0, 0x18ab2c0) = 0 <0.000080>
      1638486316.338921 ext2fs_read_inode(0x18ab2f0, 11, 0x7ffdc97a28f0, 0) = 0 <0.000070>
      1638486316.339014 ext2fs_icount_increment(0x383027b0, 11, 0, 0x18ab2a0) = 0 <0.000075>
      1638486316.339111 ext2fs_icount_increment(0x1efa400, 11, 0, 0) = 0 <0.000071>
      1638486316.339205 ext2fs_write_inode(0x18ab2f0, 11, 0x7ffdc97a28f0, 0) = 0 <0.000087>
      1638486316.339313 <... ext2fs_dir_iterate resumed> ) = 0 <0.000679>
      1638486316.339337 ext2fs_test_generic_bmap(0x1efa870, 0x261db5f3, 0x7f527a6cce48, 0x7f52765db010) = 4 <0.000070>
      1638486316.339428 ext2fs_mark_generic_bmap(0x296ee90, 0x261db5f3, 4, 2) = 0 <0.000070>
      1638486316.339521 ext2fs_mark_generic_bmap(0x296ee90, 0x83f7c001, 0x261db5f3, 0) = 1 <0.000070>
      1638486316.339614 ext2fs_test_generic_bmap(0x1efa870, 0x261db5f4, 0x7f527a6cce54, 0x7f52765db010) = 8 <0.000069>
      1638486316.339705 ext2fs_mark_generic_bmap(0x296ee90, 0x261db5f4, 8, 3) = 0 <0.000070>
      1638486316.339798 ext2fs_mark_generic_bmap(0x296ee90, 0x1a3e72d1, 0x261db5f4, 0) = 1 <0.000073>
      1638486316.339894 ext2fs_test_generic_bmap(0x1efa870, 0x261db5f5, 0x7f527a6cce60, 0x7f52765db010) = 16 <0.000069>
      1638486316.339985 ext2fs_mark_generic_bmap(0x296ee90, 0x261db5f5, 16, 4) = 0 <0.000069>
      1638486316.340077 ext2fs_mark_generic_bmap(0x296ee90, 0x83f7c001, 0x261db5f5, 0) = 1 <0.000069>
      1638486316.340168 ext2fs_test_generic_bmap(0x1efa870, 0x261db5f6, 0x7f527a6cce6c, 0x7f52765db010) = 32 <0.000069>
      1638486316.340260 ext2fs_mark_generic_bmap(0x296ee90, 0x261db5f6, 32, 5) = 0 <0.000068>
      1638486316.340350 ext2fs_mark_generic_bmap(0x296ee90, 0x545ad40d, 0x261db5f6, 0) = 0 <0.000069>
      1638486316.340443 ext2fs_mark_generic_bmap(0x296ee90, 0x545ad40c, 0x2434746, 0) = 0 <0.000069>
      1638486316.340534 ext2fs_mark_generic_bmap(0x296ee90, 0x1ec6326d, 0x2434743, 0) = 16 <0.000069>
      1638486316.340625 ext2fs_test_generic_bmap(0x1efa870, 0x261db5f7, 0x7f527a6cce78, 0x7f52765db010) = 64 <0.000070>
      1638486316.340717 ext2fs_mark_generic_bmap(0x296ee90, 0x261db5f7, 64, 6) = 0 <0.000069>
      1638486316.340811 dcgettext(0, 0x448684, 5, 335) = 0x448684 <0.000079>
      1638486316.340916 __fprintf_chk(0x7f5363936400, 1, 0x44cb40, 12) = 12 <0.000095>
      1638486316.341033 dcgettext(0, 0x44cc4d, 5, 12)  = 0x44cc4d <0.000072>
      1638486316.341128 __fprintf_chk(0x7f5363936400, 1, 0x44cb40, 9) = 9 <0.000088>
      1638486316.341239 __fprintf_chk(0x7f5363936400, 1, 0x44cb40, 1) = 1 <0.000089>
      1638486316.341350 dcgettext(0, 0x44ccb0, 5, 1)   = 0x44ccb0 <0.000071>
      1638486316.341445 __fprintf_chk(0x7f5363936400, 1, 0x44cb40, 5) = 5 <0.000089>
      1638486316.341557 __fprintf_chk(0x7f5363936400, 1, 0x44cb40, 1) = 1 <0.000092>
      1638486316.341673 __ctype_b_loc()                = 0x7f5364a2d6f0 <0.000062>
      1638486316.341758 __fprintf_chk(0x7f5363936400, 1, 0x44cafd, 0) = 9 <0.000094>
      1638486316.341874 __fprintf_chk(0x7f5363936400, 1, 0x44cb40, 2) = 2 <0.000089>
      1638486316.341985 __ctype_b_loc()                = 0x7f5364a2d6f0 <0.000062>
      1638486316.342069 ext2fs_get_pathname(0x18ab2f0, 0x261db5f7, 0, 0x7ffdc97a2ce0) = 0 <13.722823>
      1638486330.064926 strlen("/REMOTE_PARENT_DIR/???") = 22 <0.000133>
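
      To find the dominating call in a longer trace of this shape, the per-call durations can be extracted line by line. The following is a minimal sketch, assuming every line starts with a timestamp and ends with a "<seconds>" duration as in the trace above (for ltrace this is typically produced by -ttt -T; the exact options used are not stated in this issue). parse_ltrace_line() is a hypothetical helper, not part of ltrace or e2fsprogs.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one trace line of the form
 *   "<timestamp> <function>(<args>) = <ret> <seconds>"
 * On success, copy the function name into `name`, store the duration in
 * `secs` and return 0. Return -1 for lines that carry no duration, such
 * as "<unfinished ...>" lines.
 */
static int parse_ltrace_line(const char *line, char *name, size_t name_sz,
			     double *secs)
{
	const char *p, *end, *lt;
	char *num_end;
	size_t len;

	/* Skip leading whitespace, the timestamp, and the space after it. */
	p = line + strspn(line, " \t");
	p += strcspn(p, " \t");
	p += strspn(p, " \t");

	/* Resumed calls look like "<... name resumed> ) = 0 <0.000679>". */
	if (strncmp(p, "<... ", 5) == 0) {
		p += 5;
		end = strchr(p, ' ');
	} else {
		end = strchr(p, '(');
	}
	if (end == NULL || end == p)
		return -1;
	len = (size_t)(end - p);
	if (len >= name_sz)
		return -1;

	/* The duration is the last "<...>" group on the line. */
	lt = strrchr(line, '<');
	if (lt == NULL || lt < end)
		return -1;
	*secs = strtod(lt + 1, &num_end);
	if (num_end == lt + 1 || *num_end != '>')
		return -1;

	memcpy(name, p, len);
	name[len] = '\0';
	return 0;
}
```

      Summing the returned durations per function name over the whole trace shows which call accounts for the wall-clock time; in the excerpt above a single ext2fs_get_pathname() call takes 13.7s while every other call takes well under a millisecond.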
      

      It isn't currently possible to reduce the number of unattached inodes (LU-14168 might avoid attaching them to lost+found, see options there), and it isn't possible to shrink REMOTE_PARENT_DIR retroactively (LU-10329 and LU-15314 can avoid the problem in the future), so the ext2fs_get_pathname() function itself must be sped up by a few orders of magnitude.

    Activity

            adilger Andreas Dilger added a comment (edited)

            The current patch avoids the parent lookup for unconnected directories, since that will never succeed. That speeds up e2fsck, but still results in thousands or millions of entries in lost+found. I've filed LU-15383 to understand/fix the root cause, but for filesystems that have this problem already, a better outcome would be to use the trusted.lma xattr to generate the filename (from lma_self_fid) and link the entry back into REMOTE_PARENT_DIR instead of lost+found, as long as e2fsck is properly handling the htree insertion.
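
            A rough sketch of the name generation mentioned above, for illustration only. The byte layout assumed for the trusted.lma value (two 32-bit flag words followed by a 64-bit sequence, a 32-bit object ID and a 32-bit version, all little-endian) and the "0xSEQ:0xOID:0xVER" name format are assumptions that must be checked against the Lustre sources. struct sketch_fid, lma_to_fid() and fid_to_name() are hypothetical names, not Lustre or e2fsprogs APIs.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical in-memory copy of the FID stored in the xattr. */
struct sketch_fid {
	uint64_t seq;
	uint32_t oid;
	uint32_t ver;
};

/* Read little-endian integers regardless of host byte order. */
static uint32_t get_le32(const unsigned char *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint64_t get_le64(const unsigned char *p)
{
	return (uint64_t)get_le32(p) | ((uint64_t)get_le32(p + 4) << 32);
}

/*
 * Extract the self FID from a raw xattr value.
 * ASSUMED layout: bytes 0-7 flag words, 8-15 seq, 16-19 oid, 20-23 ver.
 * Returns 0 on success, -1 if the buffer is too short.
 */
static int lma_to_fid(const unsigned char *buf, size_t len,
		      struct sketch_fid *fid)
{
	if (len < 24)
		return -1;
	fid->seq = get_le64(buf + 8);
	fid->oid = get_le32(buf + 16);
	fid->ver = get_le32(buf + 20);
	return 0;
}

/*
 * Format the FID as a directory entry name (ASSUMED format).
 * Returns the name length, or -1 if it does not fit in name_sz.
 */
static int fid_to_name(const struct sketch_fid *fid, char *name,
		       size_t name_sz)
{
	int n = snprintf(name, name_sz, "0x%llx:0x%x:0x%x",
			 (unsigned long long)fid->seq,
			 (unsigned int)fid->oid, (unsigned int)fid->ver);

	return (n < 0 || (size_t)n >= name_sz) ? -1 : n;
}
```

            With a name derived this way, e2fsck could link the inode back into REMOTE_PARENT_DIR under that name instead of creating an entry in lost+found, provided the htree insertion is handled correctly as noted above.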


            gerrit Gerrit Updater added a comment

            "Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/45875/
            Subject: LU-15330 build: update version to 1.46.2.wc4
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set:
            Commit: b7742e844edf00cd8a20af2791d9d0ecb9e96a24

            adilger Andreas Dilger added a comment

            Will be included in e2fsprogs-1.46.2.wc4.

            gerrit Gerrit Updater added a comment

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45875
            Subject: LU-15330 build: update version to 1.46.2.wc4
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: 71626c6cbc398101feac0648879a66c48864e326

            gerrit Gerrit Updater added a comment

            "Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/45785/
            Subject: LU-15330 e2fsck: no parent lookup in disconnected dir
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set:
            Commit: 92846e036c76b820445822edc80fc9cff0310a5d

            gerrit Gerrit Updater added a comment

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45785
            Subject: LU-15330 e2fsck: no parent lookup in disconnected dir
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: ee63db45622128b378abf613ac34ada013097344

            adilger Andreas Dilger added a comment (edited)

            Instead of linearly traversing the whole 60M-entry REMOTE_PARENT_DIR directory each time to resolve the non-existent pathname, there are a few optimizations that could be done:

            • since it appears that ext2fs_get_pathname() is purely informative, and e2fsck has already decided to link the entry into lost+found because it wasn't found during the forward traversal, it isn't clear that spending 1.4s per entry to print "REMOTE_PARENT_DIR/???" is worthwhile. In other cases the lookup might be useful, but not here. A special case could be added to e2fsck to skip the pathname lookup in this situation. The complication is that the call to ext2fs_get_pathname() is made inside fix_problem(PR_3_UNCONNECTED_DIR) because the "%p" directive is used, so a new PR_3_REMOTE_PARENT_DIR would be needed for this.
            • for Lustre, if the parent directory is REMOTE_PARENT_DIR it could get the FID from the trusted.lma xattr and use that to regenerate the filename and reinsert the entry into REMOTE_PARENT_DIR rather than adding it to lost+found. That saves extra work later to recover the file from lost+found.
            • for Lustre, if the parent directory is not REMOTE_PARENT_DIR, it could use the trusted.link xattr to get the filename. In the common case, there will only be a single link entry, and it can be used to reinsert the file into the parent, if it still exists. If there are multiple link entries, it might be able to match the name against the parent inode FID.
            • for generic e2fsck, ext2fs_get_pathname() could be changed to use a hash for filename lookups: either the on-disk htree, if we assume it is working, or an in-memory hash built by reading and caching the whole directory on first access, giving O(1) filename lookups instead of O(60M). The directory itself was 2.9GB on disk, so the hash table could need 7-8GB of RAM (at 128 bytes/entry), but that is within acceptable memory usage for the server. Not yet sure if this is useful or not...
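
            The in-memory hash from the last item could look roughly like the sketch below: an open-addressing table keyed by inode number, filled once while the directory is read, so that each later name lookup is O(1) on average instead of a full directory scan. This is a standalone illustration, not e2fsprogs code. The name_cache_* names are hypothetical, and in e2fsck the insert would presumably be called from a directory-iteration callback.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct name_cache_entry {
	uint32_t ino;		/* 0 marks an empty slot */
	char *name;		/* NUL-terminated copy of the entry name */
};

struct name_cache {
	struct name_cache_entry *slots;
	size_t nslots;		/* always a power of two */
	size_t used;
};

/* Allocate an empty table; nslots must be a power of two. */
static int name_cache_init(struct name_cache *c, size_t nslots)
{
	if (nslots == 0 || (nslots & (nslots - 1)) != 0)
		return -1;
	c->slots = calloc(nslots, sizeof(*c->slots));
	if (c->slots == NULL)
		return -1;
	c->nslots = nslots;
	c->used = 0;
	return 0;
}

/* Find the slot holding `ino`, or the empty slot where it would go. */
static size_t name_cache_slot(const struct name_cache *c, uint32_t ino)
{
	/* Multiplicative hash followed by linear probing. */
	size_t i = (size_t)(ino * 2654435761u) & (c->nslots - 1);

	while (c->slots[i].ino != 0 && c->slots[i].ino != ino)
		i = (i + 1) & (c->nslots - 1);
	return i;
}

/* Record inode -> name; on-disk names are not NUL-terminated. */
static int name_cache_insert(struct name_cache *c, uint32_t ino,
			     const char *name, size_t name_len)
{
	size_t i;
	char *copy;

	/* Keep one slot free so that probing always terminates. */
	if (ino == 0 || c->used + 1 >= c->nslots)
		return -1;
	i = name_cache_slot(c, ino);
	if (c->slots[i].ino == ino)
		return 0;	/* keep the first name seen */
	copy = malloc(name_len + 1);
	if (copy == NULL)
		return -1;
	memcpy(copy, name, name_len);
	copy[name_len] = '\0';
	c->slots[i].ino = ino;
	c->slots[i].name = copy;
	c->used++;
	return 0;
}

/* Return the cached name for `ino`, or NULL if it is not in the table. */
static const char *name_cache_lookup(const struct name_cache *c, uint32_t ino)
{
	size_t i;

	if (ino == 0)
		return NULL;
	i = name_cache_slot(c, ino);
	return c->slots[i].ino == ino ? c->slots[i].name : NULL;
}

static void name_cache_free(struct name_cache *c)
{
	size_t i;

	for (i = 0; i < c->nslots; i++)
		free(c->slots[i].name);
	free(c->slots);
	c->slots = NULL;
	c->nslots = c->used = 0;
}
```

            A lookup that returns NULL corresponds to the "???" case in the e2fsck output: the inode has no entry in the directory, and with the cache that answer costs one probe sequence rather than a pass over all 60M entries. A real implementation would also need to size or grow the table for the directory in question, which is where the 7-8GB estimate above comes from.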

            People

              wc-triage WC Triage
              adilger Andreas Dilger