[LU-7381] "e2fsck -fD" on directory may cause extent tree corruption Created: 04/Nov/15  Updated: 13/Oct/16  Resolved: 14/Dec/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.5.5
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Andreas Dilger Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: e2fsck, e2fsprogs

Attachments: LU7381-ost_scratch_61-d0.tar.gz, LU7381-ost_scratch_73-dump_htree.tar.gz, list_ost_objs.sh, ll_recover_zero_length.sh
Issue Links:
Related
is related to LU-7368 e2fsck unsafe to interrupt with quota... Resolved
is related to LU-8706 e2fsck -fDy running forever Resolved
Severity: 3

 Description   

Running e2fsck -fD on an OST upgraded from Lustre 1.8 with large O/0/d* directories (> 300k objects, 1600+ filesystem blocks) may result in the directory becoming corrupted. As yet the reason and mechanism have not been determined, but they may relate to the filesystem upgrade history (Lustre 1.8->2.1->2.5 and/or e2fsck versions), and possibly to whether the original directories were created as block-mapped directories and later upgraded to extent-mapped directories. The corruption itself was that the extent index block logical number (always for block 4 / 5) was too large, and an extent block was missing. In all observed cases, the extent tree was 5 blocks long (possibly a result of 4 extent blocks being moved out of the in-inode i_block[] array and into an external second-level index block).

e2fsck 1.42.12.wc1 (15-Sep-2014)
MMP interval is 7 seconds and total wait time is 30 seconds. Please wait...
Pass 1: Checking inodes, blocks, and sizes
Inode 17825800, end of extent exceeds allowed value
        (logical block 710, physical block 570459684, len 1019)
Clear? no

Inode 17825800, end of extent exceeds allowed value
        (logical block 1729, physical block 570493888, len 4294966836)
Clear? no

Inode 17825800, i_size is 5197824, should be 2908160.  Fix? no

Inode 17825800, i_blocks is 10192, should be 5704.  Fix? no

Inode 17825801, end of extent exceeds allowed value
        (logical block 711, physical block 570459691, len 966)
Clear? no

There doesn't appear to have been any other data corruption on the OST besides the directory extent blocks, but this resulted in several hundred directory leaf blocks being lost, either because the extent index block was already corrupt and not referencing the required blocks, or because e2fsck considered the last extent index blocks corrupt and discarded the contents.
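
As a rough cross-check of the e2fsck output above, the sizes reported for inode 17825800 can be converted into block counts. This is only a sketch: it assumes a 4096-byte filesystem block size and i_blocks counted in 512-byte units, neither of which is stated in the e2fsck output, and the describe() helper is made up for illustration.

```python
# Assumptions (not stated in the e2fsck output): 4096-byte filesystem blocks,
# and i_blocks counted in 512-byte sectors.
BLOCK_SIZE = 4096
SECTOR_SIZE = 512
SECTORS_PER_BLOCK = BLOCK_SIZE // SECTOR_SIZE


def describe(i_size, i_blocks):
    """Return (data blocks implied by i_size, total blocks implied by i_blocks)."""
    data_blocks = i_size // BLOCK_SIZE
    total_blocks = i_blocks // SECTORS_PER_BLOCK
    return data_blocks, total_blocks


# Values reported by e2fsck for inode 17825800.
current = describe(5197824, 10192)
expected = describe(2908160, 5704)

for label, (data, total) in (("current", current), ("expected", expected)):
    print(f"{label}: {data} data blocks, {total} total, {total - data} non-data")
```

Under those assumptions the on-disk values correspond to 1269 data blocks plus 5 other blocks, which would be consistent with the 5-block extent tree mentioned in the description, while the values e2fsck wants to set end at logical block 710, where the first out-of-range extent starts.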

In some cases, it appears that 100% of files were readable from the corrupted directory using debugfs:

debugfs -c -R "ls -l O/0/$DIR" $DEV

even though e2fsck was unhappy with the extent structure and cleared part of the extent tree and dumped the files into lost+found. This was consistent across a large number of OST object (O/0/d*) directories and was not a sign of external corruption or hardware problems. This implies that the directory entries were all moved into the first blocks of the directory, that the blocks in the corrupt part of the directory were somehow "extra", and that the bug lies in the extent handling when shrinking the directory.

During recovery, e2fsck -fyv deleted all the zero-length files that had not had the "lma" FID set on them (i.e. they had never been accessed). To avoid this, the list_ost_objs.sh script was run on all affected OSTs before e2fsck, and then ll_recover_zero_length.sh was run to recreate the zero-length objects after ll_recover_lost_found_objs, and before the filesystem was mounted.



 Comments   
Comment by Chris Hunter (Inactive) [ 04/Nov/15 ]

uploaded debugfs output for ost_scratch_73(/O/0/d2?) & ost_scratch_61 (/O/0/d0)

Comment by Andreas Dilger [ 05/Nov/15 ]

The interesting part is the extent tree dump (not the htree index) from debugfs:

 :
 :
 2/ 2 336/339  1052 -  1052 1258365348 - 1258365348      1 
 2/ 2 337/339  1053 -  1053 1258365355 - 1258365355      1 
 2/ 2 338/339  1054 -  1054 1258365417 - 1258365417      1 
 2/ 2 339/339  1055 -  1055 1258365432 - 1258365432      1 
 1/ 2   4/  5  1056 -  1874 1258324458                 819
 2/ 2   1/340  1056 -  1056 1258365435 - 1258365435      1 
 2/ 2   2/340  1057 -  1057 1258366983 - 1258366983      1
 2/ 2   3/340  1058 -  1059 1258366993 - 1258366994      2 
 :
 :
 2/ 2 338/340  1427 -  1427 1258379312 - 1258379312      1  
 2/ 2 339/340  1428 -  1428 1258379117 - 1258379117      1 
 2/ 2 340/340  1429 -  1429 1258379133 - 1258379133      1 
 1/ 2   5/  5  1875 - 4294968943 1258406330              4294967069
 2/ 2   1/  1  1875 -  1875 1258402260 - 1258402260      1 

The 4/5 extent index block is showing an extent length of 819 blocks, but the extent block only has 373 blocks in the extent, and there appears to be one block missing from the extent tree. The final 1-block extent might have been caused by a later change to the directory after the corruption was originally hit, or may just be using an incorrect logical starting block for the index. In either case, there are not enough blocks to account for the current file size.
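
The very large lengths in the e2fsck and debugfs output look like small negative numbers that have wrapped around as unsigned 32-bit values. The sketch below shows that reading; it is an interpretation, not something confirmed in this ticket. The as_signed32() helper is hypothetical, and the last step assumes that the stat output in the next comment (Size: 6750208) describes the same inode as this extent dump and that the block size is 4096 bytes.

```python
def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


# Lengths printed by e2fsck and debugfs earlier in this ticket.
for length in (4294966836, 4294967069):
    print(length, "->", as_signed32(length))

# Assumption: Size: 6750208 from the stat output in the next comment belongs
# to the same inode as this dump, with 4096-byte blocks.
size_in_blocks = 6750208 // 4096
last_index_start = 1875  # starting logical block of the 5/5 index entry
print(size_in_blocks, size_in_blocks - last_index_start)
```

If those assumptions hold, the length shown for the 5/5 index entry equals the file size in blocks minus the starting logical block of that entry, which would mean the index starts beyond the end of the file.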

Comment by Andreas Dilger [ 05/Nov/15 ]

Stat data for the corrupted directory inode:

Inode: 39321606   Type: directory    Mode:  0700   Flags: 0x81000
Generation: 2310511783    Version: 0x00000000:00000000
User:     0   Group:     0   Size: 6750208
File ACL: 0    Directory ACL: 0
Links: 2   Blockcount: 13232
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x563111cf:15fb2694 -- Wed Oct 28 14:19:59 2015
 atime: 0x52f30c97:9fe5c3ac -- Wed Feb  5 23:16:23 2014
 mtime: 0x563111cf:15fb2694 -- Wed Oct 28 14:19:59 2015
crtime: 0x52f30c97:9fe5c3ac -- Wed Feb  5 23:16:23 2014
Size of extra inode fields: 28
Extended attributes stored in inode body: 
invalid EA entry in inode
EXTENTS:
[better shown by dump_extents above]

Comment by Andreas Dilger [ 09/Nov/15 ]

As yet, I haven't been able to reproduce this problem, but I've been investigating the code to see if the bug can be found this way.

e2fsck/pass1.c::check_blocks() adds directories with 3 or more blocks but no EXT2_INDEX_FL to the "rehash" list, even if the -D option is not specified. This means that corrupted object directories, or directories modified during e2fsck that have the EXT2_INDEX_FL cleared, will be rehashed automatically. This is not the same as the -D flag, since it does not attempt to pack/rehash existing directories that already have the EXT2_INDEX_FL set, which would make up the majority of directories.

e2fsck_rehash_dir() first reads all directory entries into memory, hashes them, and sorts them by hash. It then calls write_directory() to allocate new directory blocks (if needed) to cover the directory entries, and calls ext2fs_block_iterate3->write_dir_block() to iterate over the directory entries and pack them into blocks, writing each one into the previously-allocated blocks of the directory. Once all of the blocks have been written, the remaining blocks of the file are freed from the filesystem by write_dir_block(), and finally the inode is updated with the new block count. I don't yet see where the block count of the file is reduced.
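
The flow described above can be pictured with a small toy model. This is not the e2fsprogs implementation: rehash_dir(), BLOCK_CAPACITY and the use of crc32 as a stand-in hash are all invented for illustration, and the htree index blocks and the extent tree are ignored entirely.

```python
import zlib

BLOCK_CAPACITY = 4  # hypothetical: entries per directory block, for illustration


def rehash_dir(blocks):
    """Toy model of the read / sort / pack / free flow; not the real e2fsck code.

    `blocks` is a list of lists of names, one inner list per directory block.
    Returns (packed_blocks, freed_logical_blocks).
    """
    # 1. Read every entry into memory.
    entries = [name for block in blocks for name in block]
    # 2. Hash the names and sort by hash (crc32 stands in for the htree hash).
    entries.sort(key=lambda name: (zlib.crc32(name.encode()), name))
    # 3. Pack the entries densely into as few blocks as possible.
    packed = [entries[i:i + BLOCK_CAPACITY]
              for i in range(0, len(entries), BLOCK_CAPACITY)]
    # 4. Logical blocks past the packed range are no longer needed.
    freed = list(range(len(packed), len(blocks)))
    return packed, freed


sparse = [["a", "b"], ["c"], ["d", "e"], ["f"], []]
packed, freed = rehash_dir(sparse)
print(len(packed), "blocks kept; freed logical blocks:", freed)
```

In this model six entries spread over five blocks pack into two, so the trailing three logical blocks are released. The real code must also shrink whatever maps those trailing blocks, which is the part under suspicion here.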

Comment by Charles Wright [ 10/Nov/15 ]

Hi Andreas,
Do you think it would help if DDN helped set up a test environment, installed their older software stacks, and then started stepping through the upgrade path we took?
Thanks.

Comment by Andreas Dilger [ 10/Nov/15 ]

Charles, I think the problem is related less to the specific software versions and more to the size of the directories and the pattern in which they were created. As yet, I don't have any easy way to reproduce the large number of separate blocks (over 1600) that fill out a 2-level extent tree to at least the 5-6 index blocks that were allocated to the directories. In my test scripts I'm only ever able to get the code to allocate nicely contiguous ranges of blocks for the directory that stay within a single extent, so I'll have to resort to something different to try and create the large extent tree that I think is needed to reproduce the bug.

I've also been looking at the code to see if I can spot a bug in this area, and at one point I thought I had a lead, but I'm not certain anymore and don't have a way to test it.

Comment by Andreas Dilger [ 11/Nov/15 ]

I was able to reproduce this problem with an e2fsck test script (attached) when shrinking an htree extent directory with only 3 index blocks referenced directly by the inode. The problem is not present on block-mapped directories but looks to be a danger for any user of the "-fD" option with extent-mapped directories.

It looks like the problem occurs when the inode shrinks enough that one of the index blocks is dropped from the end of the file (blocks after logical block 114 were freed), but the write_directory()->write_dir_block() iterator doesn't free the index block 800:

    :
    write_dir_block 113:583 - write
    write_dir_block 114:587 - write
    write_dir_block 115:591 - free
    write_dir_block 116:595 - free
    :
    :
    write_dir_block 165:791 - free
    write_dir_block -1:800 - skip
    write_dir_block 166:795 - free
    write_dir_block 167:799 - free
    write_dir_block 168:804 - free
    write_dir_block 169:808 - free
    write_dir_block 170:812 - free
    write_dir_block 171:813 - free
    write_dir_block 172:814 - free
    write_dir_block -1:800 - skip
    Pass 4: Checking reference counts
    Pass 5: Checking group summary information

The extent tree now has a bogus index block at the end, but somehow is
also missing the valid extent block that was holding the rest of the
file, as shown by debugfs (after "e2fsck -fD" but before the second
e2fsck that detects the corruption) and logical blocks 83-114 are lost:

    debugfs: stat subdir
    Inode: 12 Type: directory Mode: 0755 Flags: 0x81000
    Generation: 0 Version: 0x00000000
    User: 0 Group: 0 Size: 117760
    File ACL: 0 Directory ACL: 0
    Links: 2 Blockcount: 238
    Fragment: Address: 0 Number: 0 Size: 0
    ctime: 0x5642e764 -- Tue Nov 10 23:59:48 2015
    atime: 0x5642e764 -- Tue Nov 10 23:59:48 2015
    mtime: 0x5642e764 -- Tue Nov 10 23:59:48 2015
    EXTENTS:
    (ETB0):146, (0):129, (1):133, (2):137, (3):141, (4):145, (5):150,
    (6):154, (7):158, (8):162, (9):166, (10):170, (11):174, (12):178,
    (13):182, (14):186, (15):190, (16):194, (17):198, (18):202,
    (19):206, (20):210, (21):214, (22):218, (23):222, (24):226,
    (25):230, (26):234, (27):238, (28):242, (29):246, (30):250,
    (31):254, (32):258, (33):262, (34):266, (35):270, (36):274,
    (37):278, (38):282, (39):286, (40):290, (41):294, (42):298,
    (43):302, (44):306, (45):310, (46):314, (47):318, (48):322,
    (49):326, (50):330, (51):334, (52):338, (53):342, (54):346,
    (55):350, (56):354, (57):358, (58):362, (59):366, (60):370,
    (61):374, (62):378, (63):382, (64):386, (65):390, (66):394,
    (67):398, (68):402, (69):406, (70):410, (71):414, (72):418,
    (73):422, (74):426, (75):430, (76):434, (77):438, (78):442,
    (79):446, (80):450, (81):454, (82):458, (ETB0):800, (172):814
    debugfs: extents subdir
    :
    :
    1/ 1 82/ 83 81 - 81 454 - 454 1
    1/ 1 83/ 83 82 - 82 458 - 458 1
    0/ 1 2/ 2 170 - 4294967410 800 4294967241
    1/ 1 1/ 1 172 - 172 814 - 814 1

The i_size is correct for 115 data blocks written, and i_blocks would
be correct if the second index block had not been lost. It seems
the bug is in the extent handling code, but I haven't yet dug into why
the last extent is kept. I tried deleting it like the other blocks,
but the iteration immediately stops with an error that the index block
is corrupted.
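
The write_dir_block trace earlier in this comment can be mimicked with a toy callback to show why index block 800 is left behind. This is a sketch, not the real e2fsck code: the function body and the last_needed parameter are hypothetical, and it only assumes, as the trace suggests, that the iterator reports extent tree blocks with a negative block count.

```python
def write_dir_block(blockcnt, blk, last_needed):
    """Toy callback mirroring the trace above (not the real e2fsck code).

    A negative blockcnt marks a metadata (extent tree) block in the iteration.
    """
    if blockcnt < 0:
        return "skip"
    return "write" if blockcnt <= last_needed else "free"


# (logical block, physical block) pairs taken from the trace above.
trace = [(113, 583), (114, 587), (115, 591), (165, 791),
         (-1, 800), (166, 795), (172, 814)]

still_allocated = []
for blockcnt, blk in trace:
    action = write_dir_block(blockcnt, blk, last_needed=114)
    print(f"write_dir_block {blockcnt}:{blk} - {action}")
    if action != "free":
        still_allocated.append(blk)

print("still allocated:", still_allocated)
```

In this model block 800 stays allocated even though the data blocks around it are freed, because the callback never treats a metadata block as a candidate for freeing.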

Comment by Chris Hunter (Inactive) [ 11/Nov/15 ]

Hi Andreas,
We appreciate the update. You mention "...was able to reproduce this problem with an e2fsck test script (attached) when shrinking an htree extent directory with only 3 index blocks referenced directly by the inode". Are you referring to directory entries that are inline in the inode (i.e. https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Inline_Directories)?

I don't know how the transition from inline entries to an external extent map is done, but can we assume that if there is an extent map (i.e. extent node depth > 0) this bug will not be triggered?

Thanks,
Chris

Comment by Andreas Dilger [ 12/Nov/15 ]

Chris, this is not using the inline data feature, which is not yet enabled for any Lustre filesystems. This is a problem with how e2fsck_rehash_dir() is processing the extra blocks of the directory in an ext4 extent tree after compacting the directory entries. I'm able to reproduce this with as few as 3 index blocks shrinking to 2 index blocks in my test case. For Lustre OSTs this would work out to directories with approximately (3 * (4096 / 12 - 1) * (4096 / 16)) ~= 260k entries (3 index blocks * number of leaf blocks per index * number of entries per leaf block) iff they are shrunk during the directory rehashing stage to need one fewer index block.
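
The estimate can be worked through step by step. The sketch below uses a 12-byte extent entry, which gives the 340 extents per block seen in the dump_extents output earlier in this ticket and reproduces the ~260k figure. The constant names are made up for illustration, and the 16-byte directory entry size is taken from the estimate above.

```python
BLOCK_SIZE = 4096
EXTENT_ENTRY_SIZE = 12  # assumed bytes per extent entry; the header takes one slot
DIRENT_SIZE = 16        # bytes per directory entry, as used in the estimate above

extents_per_block = BLOCK_SIZE // EXTENT_ENTRY_SIZE - 1  # minus the header slot
entries_per_leaf = BLOCK_SIZE // DIRENT_SIZE
index_blocks = 3

total = index_blocks * extents_per_block * entries_per_leaf
print(extents_per_block, entries_per_leaf, total)
```

This counts one directory leaf block per extent, which matches the heavily fragmented single-block extents in the dumps. Directories whose extents cover more than one block would need more entries before reaching 3 index blocks.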

I was discussing this problem with Ted Ts'o (e2fsprogs author) this morning and have some ideas of how to fix it.

Comment by Gerrit Updater [ 13/Nov/15 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/17152
Subject: LU-7381 libext2fs: fix block-mapped file punch
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 1
Commit: 6f983f31d6b9ef5d3c951088da9c4cfa18c57832

Comment by Gerrit Updater [ 13/Nov/15 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/17153
Subject: LU-7381 e2fsck: fix e2fsck -fD directory truncation
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 1
Commit: b8b5f3f0cf69ead19defe8104fde5dbf384059dc

Comment by Andreas Dilger [ 14/Nov/15 ]

Moving the e2fsck_rehash() code over to using ext2fs_punch() to truncate the now-smaller directory allowed my new f_extent_htree test case to pass, but caused failures in other regression tests. It turns out that there were existing bugs in the ext2fs_punch_ind() handling of indirect-block mapped files, and a known bug in ext2fs_punch_ext() (which I didn't hit, but for which I found a patch on e2fsprogs master that seems prudent to port to maint).

I've pushed 4 patches into our local regression testing, which runs the e2fsprogs regression tests on all the server platforms (RHEL/SLES) and then tests the new e2fsprogs with Lustre as well. I've also pushed the patches to the linux-ext4 mailing list for external review and inclusion into the upstream repository.

Comment by Gerrit Updater [ 02/Dec/15 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/17431
Subject: LU-7381 e2fsprogs: update build version to 1.42.13.wc4
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 1
Commit: 7af2dd90d352a1c07abb159b6752b9d66ed9257c

Comment by Andreas Dilger [ 03/Dec/15 ]

Patches have all been accepted into upstream e2fsprogs. Working on a -wc4 release for this as well.

Comment by Gerrit Updater [ 09/Dec/15 ]

Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/17152/
Subject: LU-7381 libext2fs: fix block-mapped file punch
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set:
Commit: 229a4739bd8d68192c669e13c411d57575cdc632

Comment by Gerrit Updater [ 11/Dec/15 ]

Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/17153/
Subject: LU-7381 e2fsck: fix e2fsck -fD directory truncation
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set:
Commit: 7cb8130c79fa80b87c1406056221fc3151184862

Comment by Gerrit Updater [ 11/Dec/15 ]

Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/17431/
Subject: LU-7381 e2fsprogs: update build version to 1.42.13.wc4
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set:
Commit: bc29f4330fc74836ea7b76e9f0adcd2f59fd9660

Comment by Gerrit Updater [ 11/Dec/15 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/17572
Subject: LU-7381 e2fsck: update recommended e2fsprogs version
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e334fe27c9b04cd6052f988613bd55e2b679d3ae

Comment by Andreas Dilger [ 11/Dec/15 ]

The e2fsprogs-1.42.13.wc4 release should also be recommended for other maintenance releases.

Comment by Gerrit Updater [ 14/Dec/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17572/
Subject: LU-7381 e2fsck: update recommended e2fsprogs version
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b3caa5019b8c781499c32a79b2d33a8929f2c045

Comment by Andreas Dilger [ 14/Dec/15 ]

Patch updating lustre/ChangeLog to reference new release has been landed to master for 2.8.0.
