[LU-15330] ext2fs_get_pathname() very slow for large directory Created: 07/Dec/21  Updated: 04/Apr/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Cloners
is cloned by LU-15383 DNE directories not connected to REMO... Open
Related
is related to LU-10329 DNE3: REMOTE_PARENT_DIR scalability Open
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

Running e2fsck on an MDT with a very large REMOTE_PARENT_DIR is extremely slow if there are entries in that directory that need to be repaired. On one system with a 60M-entry REMOTE_PARENT_DIR, each directory entry took about 1.4s to repair when handling PR_3_UNCONNECTED_DIR:

Unconnected directory inode 2102494 (/REMOTE_PARENT_DIR/???)
Connect to /lost+found? yes
Unconnected directory inode 2102510 (/REMOTE_PARENT_DIR/???)
Connect to /lost+found? yes
Unconnected directory inode 2102514 (/REMOTE_PARENT_DIR/???)
Connect to /lost+found? yes

Depending on how many unattached entries there are, this might take days, weeks, or even months to complete (1M files might take 2 weeks to repair).
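As a rough check of the estimate above, the total repair time scales linearly with the number of unattached entries. The helper name below is illustrative only, and the 1.4s figure is the per-entry cost observed on this one system:

```python
SECONDS_PER_DAY = 86400

def repair_days(entries, seconds_per_entry=1.4):
    """Estimated wall-clock days to repair `entries` unconnected inodes."""
    return entries * seconds_per_entry / SECONDS_PER_DAY

# 1M entries at 1.4s each
print(f"{repair_days(1_000_000):.1f} days")  # prints "16.2 days"
```

That is a little over two weeks for 1M entries, consistent with the figure quoted above.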

Attaching ltrace to e2fsck showed that essentially all of the time is spent in ext2fs_get_pathname(), opening and iterating through all of the entries in the huge directory (ltrace slowed the per-file repair time from about 1s to 14s, but the fraction of time spent in that call is the same):

1638486316.336885 ext2fs_read_inode(0x18ab2f0, 0x261db5f2, 0x7ffdc97a2d00, 10) = 0 <0.000069>
1638486316.336977 ext2fs_link(0x18ab2f0, 11, 0x7ffdc97a2d80, 0x261db5f2) = 0 <0.001130>
1638486316.338130 ext2fs_read_inode(0x18ab2f0, 0x261db5f2, 0x7ffdc97a2bd0, 0xa626870) = 0 <0.000071>
1638486316.338223 ext2fs_icount_increment(0x383027b0, 0x261db5f2, 0, 0x18ab2b0) = 0 <0.000084>
1638486316.338329 ext2fs_icount_increment(0x1efa400, 0x261db5f2, 0, 0) = 0 <0.000073>
1638486316.338425 ext2fs_write_inode(0x18ab2f0, 0x261db5f2, 0x7ffdc97a2bd0, 0) = 0 <0.000094>
1638486316.338542 ext2fs_u32_list_test(0x1efa310, 0x261db5f2, 11, 0) = 0 <0.000069>
1638486316.338633 ext2fs_dir_iterate(0x18ab2f0, 0x261db5f2, 1, 0 <unfinished ...>
1638486316.338727 ext2fs_read_inode(0x18ab2f0, 0x83f7c001, 0x7ffdc97a28f0, 0) = 0 <0.000071>
1638486316.338819 ext2fs_icount_decrement(0x383027b0, 0x83f7c001, 0, 0x18ab2c0) = 0 <0.000080>
1638486316.338921 ext2fs_read_inode(0x18ab2f0, 11, 0x7ffdc97a28f0, 0) = 0 <0.000070>
1638486316.339014 ext2fs_icount_increment(0x383027b0, 11, 0, 0x18ab2a0) = 0 <0.000075>
1638486316.339111 ext2fs_icount_increment(0x1efa400, 11, 0, 0) = 0 <0.000071>
1638486316.339205 ext2fs_write_inode(0x18ab2f0, 11, 0x7ffdc97a28f0, 0) = 0 <0.000087>
1638486316.339313 <... ext2fs_dir_iterate resumed> ) = 0 <0.000679>
1638486316.339337 ext2fs_test_generic_bmap(0x1efa870, 0x261db5f3, 0x7f527a6cce48, 0x7f52765db010) = 4 <0.000070>
1638486316.339428 ext2fs_mark_generic_bmap(0x296ee90, 0x261db5f3, 4, 2) = 0 <0.000070>
1638486316.339521 ext2fs_mark_generic_bmap(0x296ee90, 0x83f7c001, 0x261db5f3, 0) = 1 <0.000070>
1638486316.339614 ext2fs_test_generic_bmap(0x1efa870, 0x261db5f4, 0x7f527a6cce54, 0x7f52765db010) = 8 <0.000069>
1638486316.339705 ext2fs_mark_generic_bmap(0x296ee90, 0x261db5f4, 8, 3) = 0 <0.000070>
1638486316.339798 ext2fs_mark_generic_bmap(0x296ee90, 0x1a3e72d1, 0x261db5f4, 0) = 1 <0.000073>
1638486316.339894 ext2fs_test_generic_bmap(0x1efa870, 0x261db5f5, 0x7f527a6cce60, 0x7f52765db010) = 16 <0.000069>
1638486316.339985 ext2fs_mark_generic_bmap(0x296ee90, 0x261db5f5, 16, 4) = 0 <0.000069>
1638486316.340077 ext2fs_mark_generic_bmap(0x296ee90, 0x83f7c001, 0x261db5f5, 0) = 1 <0.000069>
1638486316.340168 ext2fs_test_generic_bmap(0x1efa870, 0x261db5f6, 0x7f527a6cce6c, 0x7f52765db010) = 32 <0.000069>
1638486316.340260 ext2fs_mark_generic_bmap(0x296ee90, 0x261db5f6, 32, 5) = 0 <0.000068>
1638486316.340350 ext2fs_mark_generic_bmap(0x296ee90, 0x545ad40d, 0x261db5f6, 0) = 0 <0.000069>
1638486316.340443 ext2fs_mark_generic_bmap(0x296ee90, 0x545ad40c, 0x2434746, 0) = 0 <0.000069>
1638486316.340534 ext2fs_mark_generic_bmap(0x296ee90, 0x1ec6326d, 0x2434743, 0) = 16 <0.000069>
1638486316.340625 ext2fs_test_generic_bmap(0x1efa870, 0x261db5f7, 0x7f527a6cce78, 0x7f52765db010) = 64 <0.000070>
1638486316.340717 ext2fs_mark_generic_bmap(0x296ee90, 0x261db5f7, 64, 6) = 0 <0.000069>
1638486316.340811 dcgettext(0, 0x448684, 5, 335) = 0x448684 <0.000079>
1638486316.340916 __fprintf_chk(0x7f5363936400, 1, 0x44cb40, 12) = 12 <0.000095>
1638486316.341033 dcgettext(0, 0x44cc4d, 5, 12)  = 0x44cc4d <0.000072>
1638486316.341128 __fprintf_chk(0x7f5363936400, 1, 0x44cb40, 9) = 9 <0.000088>
1638486316.341239 __fprintf_chk(0x7f5363936400, 1, 0x44cb40, 1) = 1 <0.000089>
1638486316.341350 dcgettext(0, 0x44ccb0, 5, 1)   = 0x44ccb0 <0.000071>
1638486316.341445 __fprintf_chk(0x7f5363936400, 1, 0x44cb40, 5) = 5 <0.000089>
1638486316.341557 __fprintf_chk(0x7f5363936400, 1, 0x44cb40, 1) = 1 <0.000092>
1638486316.341673 __ctype_b_loc()                = 0x7f5364a2d6f0 <0.000062>
1638486316.341758 __fprintf_chk(0x7f5363936400, 1, 0x44cafd, 0) = 9 <0.000094>
1638486316.341874 __fprintf_chk(0x7f5363936400, 1, 0x44cb40, 2) = 2 <0.000089>
1638486316.341985 __ctype_b_loc()                = 0x7f5364a2d6f0 <0.000062>
1638486316.342069 ext2fs_get_pathname(0x18ab2f0, 0x261db5f7, 0, 0x7ffdc97a2ce0) = 0 <13.722823>
1638486330.064926 strlen("/REMOTE_PARENT_DIR/???") = 22 <0.000133>
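To quantify a trace like this without reading it by eye, the per-call durations can be summed by function name. The following is a minimal sketch that assumes the line format shown above (epoch timestamp prefix, trailing `<seconds>` duration); the function name `time_by_function` is hypothetical, and only four lines of the trace are reproduced as sample input:

```python
import re
from collections import defaultdict

# Matches completed calls and "<... name resumed>" lines.
# Lines ending in "<unfinished ...>" carry no duration and are skipped.
CALL_RE = re.compile(
    r"^\d+\.\d+\s+(?:<\.\.\.\s+)?(\w+)(?:\(| resumed>).*<(\d+\.\d+)>\s*$"
)

def time_by_function(lines):
    """Sum the reported duration of each library call by function name."""
    totals = defaultdict(float)
    for line in lines:
        m = CALL_RE.match(line)
        if m:
            totals[m.group(1)] += float(m.group(2))
    return dict(totals)

TRACE = '''
1638486316.338633 ext2fs_dir_iterate(0x18ab2f0, 0x261db5f2, 1, 0 <unfinished ...>
1638486316.339313 <... ext2fs_dir_iterate resumed> ) = 0 <0.000679>
1638486316.341985 __ctype_b_loc()                = 0x7f5364a2d6f0 <0.000062>
1638486316.342069 ext2fs_get_pathname(0x18ab2f0, 0x261db5f7, 0, 0x7ffdc97a2ce0) = 0 <13.722823>
1638486330.064926 strlen("/REMOTE_PARENT_DIR/???") = 22 <0.000133>
'''

totals = time_by_function(TRACE.splitlines())
slowest = max(totals, key=totals.get)
# Note: calls nested inside another call are counted in both totals,
# so this share is only approximate on a full trace.
share = totals[slowest] / sum(totals.values())
print(slowest, f"{share:.2%}")
```

On the sample lines, ext2fs_get_pathname() accounts for more than 99% of the summed call time.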

It isn't currently possible to reduce the number of unattached inodes (LU-14168 might avoid attaching them to lost+found, see options there), and it isn't possible to reduce the size of REMOTE_PARENT_DIR retroactively (LU-10329 and LU-15314 can avoid it in the future), so the ext2fs_get_pathname() function itself must be sped up by a few orders of magnitude.



Comments
Comment by Andreas Dilger [ 07/Dec/21 ]

Instead of linearly traversing the whole 60M-entry REMOTE_PARENT_DIR directory each time to resolve the non-existent pathname, there are a few optimizations that could be done:

  • since it appears that ext2fs_get_pathname() is purely informative, and e2fsck has already decided to link the entry into lost+found because it wasn't found during the forward traversal, it isn't clear that spending 1.4s per entry to print "REMOTE_PARENT_DIR/???" is worthwhile. In other cases this might be useful, but not here. Add a special case to e2fsck to avoid doing the pathname lookup in this case. The complexity is that the call to ext2fs_get_pathname() is done inside fix_problem(PR_3_UNCONNECTED_DIR) because the "%p" directive is used, so a new PR_3_REMOTE_PARENT_DIR would be needed for this.
  • for Lustre, if the parent directory is REMOTE_PARENT_DIR it could get the FID from the trusted.lma xattr and use that to regenerate the filename and reinsert the entry into REMOTE_PARENT_DIR rather than adding it to lost+found. That saves extra work later to recover the file from lost+found.
  • for Lustre, if the parent directory is not REMOTE_PARENT_DIR, it could use the trusted.link xattr to get the filename. In the common case, there will only be a single link entry, and it can be used to reinsert the file into the parent, if it still exists. If there are multiple link entries, it might be able to match the name against the parent inode FID.
  • for generic e2fsck, the ext2fs_get_pathname() function could be changed to use a hash for filename lookups: either the on-disk htree, if we assume it is intact, or an in-memory hash built by reading and caching the whole directory on first access. That would give O(1) lookups of the filename instead of scanning all 60M entries each time. The directory itself was 2.9GB on disk, so the hash table could need 7-8GB of RAM (at 128 bytes/entry), which is still within acceptable memory usage for the server. Not yet sure if this is useful or not...
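The last option can be illustrated with a small model of the idea. This is not libext2fs code: `DirNameCache`, `read_entries`, and `linear_lookup` are hypothetical names, and the directory is stood in for by a callable that yields (inode, name) pairs. The point is only that the directory is scanned once, after which each name lookup is a dictionary access rather than a full rescan:

```python
class DirNameCache:
    """Cache inode -> name for one directory, built on first lookup."""

    def __init__(self, read_entries):
        # read_entries: hypothetical callable yielding (inode, name) pairs
        self._read_entries = read_entries
        self._names = None
        self.scans = 0  # how many times the directory was fully read

    def lookup(self, ino):
        if self._names is None:
            self.scans += 1
            self._names = {}
            for entry_ino, name in self._read_entries():
                # keep the first name seen for an inode
                self._names.setdefault(entry_ino, name)
        return self._names.get(ino)


def linear_lookup(read_entries, ino):
    """Uncached behaviour: rescan the whole directory for every lookup."""
    for entry_ino, name in read_entries():
        if entry_ino == ino:
            return name
    return None


def cache_bytes(entries, bytes_per_entry=128):
    """Memory estimate using the per-entry size assumed in the comment above."""
    return entries * bytes_per_entry


entries = [(12, "a"), (13, "b"), (14, "c")]
cache = DirNameCache(lambda: iter(entries))
print(cache.lookup(13), cache.lookup(99), cache.scans)
print(cache_bytes(60_000_000) / 1e9, "GB")
```

With 60M entries at 128 bytes each, the estimate comes to 7.68GB, matching the 7-8GB range given above. A lookup for an inode that is not in the directory (the unconnected case here) returns nothing without triggering another scan.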
Comment by Gerrit Updater [ 08/Dec/21 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45785
Subject: LU-15330 e2fsck: no parent lookup in disconnected dir
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 1
Commit: ee63db45622128b378abf613ac34ada013097344

Comment by Gerrit Updater [ 17/Dec/21 ]

"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/45785/
Subject: LU-15330 e2fsck: no parent lookup in disconnected dir
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set:
Commit: 92846e036c76b820445822edc80fc9cff0310a5d

Comment by Gerrit Updater [ 17/Dec/21 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45875
Subject: LU-15330 build: update version to 1.46.2.wc4
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 1
Commit: 71626c6cbc398101feac0648879a66c48864e326

Comment by Andreas Dilger [ 17/Dec/21 ]

Will be included in e2fsprogs-1.46.2.wc4.

Comment by Gerrit Updater [ 17/Dec/21 ]

"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/45875/
Subject: LU-15330 build: update version to 1.46.2.wc4
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set:
Commit: b7742e844edf00cd8a20af2791d9d0ecb9e96a24

Comment by Andreas Dilger [ 17/Dec/21 ]

The current patch avoids the parent lookup for unconnected directories, since that will never succeed. That speeds up e2fsck, but still results in thousands or millions of entries in lost+found. I've filed LU-15383 to understand/fix the root cause, but for filesystems that have this problem already, a better outcome would be to use the trusted.lma xattr to generate the filename (from lma_self_fid) and link the entry back into REMOTE_PARENT_DIR instead of lost+found, as long as e2fsck is properly handling the htree insertion.
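As a sketch of the first step of that approach, the FID could be decoded from the raw trusted.lma value. The layout below is an assumption based on the Lustre `struct lustre_mdt_attrs` definition (lma_compat and lma_incompat as 32-bit fields, followed by lma_self_fid as a 64-bit sequence, 32-bit object ID, and 32-bit version, all little-endian) and should be checked against the Lustre headers in use. The function names are hypothetical, and whether REMOTE_PARENT_DIR entry names use exactly this bracketed FID string is not confirmed here:

```python
import struct

# Assumed layout: lma_compat (u32), lma_incompat (u32),
# then lma_self_fid = f_seq (u64), f_oid (u32), f_ver (u32), little-endian.
LMA_HEADER = struct.Struct("<IIQII")

def lma_self_fid(xattr_value):
    """Return (seq, oid, ver) from a raw trusted.lma xattr value."""
    if len(xattr_value) < LMA_HEADER.size:
        raise ValueError("trusted.lma value too short")
    _compat, _incompat, seq, oid, ver = LMA_HEADER.unpack_from(xattr_value)
    return seq, oid, ver

def fid_string(fid):
    """Format a FID in the usual bracketed hex display form."""
    seq, oid, ver = fid
    return f"[{seq:#x}:{oid:#x}:{ver:#x}]"

# Synthetic xattr value for illustration, not taken from a real filesystem.
sample = LMA_HEADER.pack(0, 0, 0x200000401, 0x1, 0x0)
print(fid_string(lma_self_fid(sample)))  # prints "[0x200000401:0x1:0x0]"
```

Reinserting the entry into REMOTE_PARENT_DIR would still depend on e2fsck handling the htree insertion correctly, as noted above.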

Generated at Sat Feb 10 03:17:23 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.