[LU-7381] "e2fsck -fD" on directory may cause extent tree corruption Created: 04/Nov/15 Updated: 13/Oct/16 Resolved: 14/Dec/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0, Lustre 2.5.5 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Andreas Dilger | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | e2fsck, e2fsprogs | ||
| Severity: | 3 |
| Description |
|
Running e2fsck -fD on an OST upgraded from Lustre 1.8 with large O/0/d* directories (> 300k objects, 1600+ filesystem blocks) may result in the directory becoming corrupted. As yet the reason and mechanism have not been determined, but they may relate to the filesystem upgrade history (Lustre 1.8->2.1->2.5 and/or e2fsck versions), and possibly to whether the original directories were created as block-mapped directories and later upgraded to extent-mapped directories. The corruption itself is that the extent index block logical number (always for block 4/5) was too large, and an extent block was missing. In all observed cases, the extent tree was 5 blocks long (possibly a result of 4 extent blocks being moved out of the in-inode i_block[] array and into an external second-level index block).
e2fsck 1.42.12.wc1 (15-Sep-2014)
MMP interval is 7 seconds and total wait time is 30 seconds. Please wait...
Pass 1: Checking inodes, blocks, and sizes
Inode 17825800, end of extent exceeds allowed value
(logical block 710, physical block 570459684, len 1019)
Clear? no
Inode 17825800, end of extent exceeds allowed value
(logical block 1729, physical block 570493888, len 4294966836)
Clear? no
Inode 17825800, i_size is 5197824, should be 2908160. Fix? no
Inode 17825800, i_blocks is 10192, should be 5704. Fix? no
Inode 17825801, end of extent exceeds allowed value
(logical block 711, physical block 570459691, len 966)
Clear? no
There doesn't appear to have been any other data corruption on the OST besides the directory extent blocks, but this resulted in several hundred directory leaf blocks being lost, either because the extent index block was already corrupt and not referencing the required blocks, or because e2fsck considered the last extent index blocks corrupt and discarded their contents. In some cases, it appears that 100% of files were readable from the corrupted directory using debugfs:
debugfs -c -R "ls -l O/0/$DIR" $DEV
even though e2fsck was unhappy with the extent structure, cleared part of the extent tree, and dumped the files into lost+found. This was consistent across a large number of OST object (O/0/d*) directories and was not a sign of external corruption or hardware problems. This implies that the directory entries had all been moved into the first blocks of the directory, that the blocks in the corrupt part of the directory were somehow "extra", and that the bug lies in the extent handling when shrinking the directory.
During recovery, e2fsck -fyv deleted all the zero-length files that had not had the "lma" FID set on them (i.e. they had never been accessed). To avoid this, the list_ost_objs.sh script was run on all affected OSTs before e2fsck, and then ll_recover_zero_length.sh was run to recreate the zero-length objects after ll_recover_lost_found_objs and before the filesystem was mounted. |
| Comments |
| Comment by Chris Hunter (Inactive) [ 04/Nov/15 ] |
|
uploaded debugfs output for ost_scratch_73(/O/0/d2?) & ost_scratch_61 (/O/0/d0) |
| Comment by Andreas Dilger [ 05/Nov/15 ] |
|
The interesting part is the extent tree dump (not the htree index) from debugfs:
:
:
2/ 2 336/339 1052 - 1052 1258365348 - 1258365348 1
2/ 2 337/339 1053 - 1053 1258365355 - 1258365355 1
2/ 2 338/339 1054 - 1054 1258365417 - 1258365417 1
2/ 2 339/339 1055 - 1055 1258365432 - 1258365432 1
1/ 2 4/ 5 1056 - 1874 1258324458 819
2/ 2 1/340 1056 - 1056 1258365435 - 1258365435 1
2/ 2 2/340 1057 - 1057 1258366983 - 1258366983 1
2/ 2 3/340 1058 - 1059 1258366993 - 1258366994 2
:
:
2/ 2 338/340 1427 - 1427 1258379312 - 1258379312 1
2/ 2 339/340 1428 - 1428 1258379117 - 1258379117 1
2/ 2 340/340 1429 - 1429 1258379133 - 1258379133 1
1/ 2 5/ 5 1875 - 4294968943 1258406330 4294967069
2/ 2 1/ 1 1875 - 1875 1258402260 - 1258402260 1
The 4/5 extent index block is showing an extent length of 819 blocks, but the extent block only has 373 blocks in the extent, and there appears to be one block missing from the extent tree. The final 1-block extent might have been caused by a later change to the directory after the corruption was originally hit, or may just be using an incorrect logical starting block for the index. In either case, there are not enough blocks to account for the current file size. |
| Comment by Andreas Dilger [ 05/Nov/15 ] |
|
Stat data for the corrupted directory inode:
Inode: 39321606 Type: directory Mode: 0700 Flags: 0x81000
Generation: 2310511783 Version: 0x00000000:00000000
User: 0 Group: 0 Size: 6750208
File ACL: 0 Directory ACL: 0
Links: 2 Blockcount: 13232
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x563111cf:15fb2694 -- Wed Oct 28 14:19:59 2015
atime: 0x52f30c97:9fe5c3ac -- Wed Feb 5 23:16:23 2014
mtime: 0x563111cf:15fb2694 -- Wed Oct 28 14:19:59 2015
crtime: 0x52f30c97:9fe5c3ac -- Wed Feb 5 23:16:23 2014
Size of extra inode fields: 28
Extended attributes stored in inode body:
invalid EA entry in inode
EXTENTS:
[better shown by dump_extents above] |
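The bogus length reported for the 5/5 index entry (4294967069) is numerically consistent with an unsigned 32-bit wraparound of (file size in blocks - index start block). The following is a minimal Python sketch of that arithmetic only. It assumes a 4096-byte block size (not shown in the stat output, but the value used elsewhere in this ticket), and `wrapped_length()` is a hypothetical helper name; it is not a description of how debugfs actually derives the field.

```python
# Hypothetical helper, not part of e2fsprogs: computes the distance from an
# index entry's starting logical block to the end of the file, truncated to
# an unsigned 32-bit value the way a C __u32 subtraction would wrap.
U32 = 1 << 32

def wrapped_length(size_bytes: int, block_size: int, start_lblk: int) -> int:
    size_blocks = size_bytes // block_size
    return (size_blocks - start_lblk) % U32

# Values taken from the stat and extent dumps for inode 39321606.
I_SIZE = 6750208          # "Size:" from the stat output
BLOCK_SIZE = 4096         # assumption, see lead-in
INDEX_START = 1875        # logical start of index entry 5/5
REPORTED_LEN = 4294967069 # length printed for index entry 5/5

computed = wrapped_length(I_SIZE, BLOCK_SIZE, INDEX_START)
print(I_SIZE // BLOCK_SIZE)      # file size in blocks
print(computed == REPORTED_LEN)  # whether the wraparound explains the length
```

If the comparison holds, the last index entry starts beyond the end of the file as recorded in i_size, which fits the observation that there are not enough blocks to account for the file size.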
| Comment by Andreas Dilger [ 09/Nov/15 ] |
|
As yet, I haven't been able to reproduce this problem, but I've been investigating the code to see if the bug can be found this way. Looking into e2fsck/pass1.c::check_blocks(), it adds directories with 3 or more blocks but no EXT2_INDEX_FL to the "rehash" list, even if the -D option is not specified. This means that corrupted object directories, or directories modified during e2fsck that have the EXT2_INDEX_FL cleared, will be rehashed automatically. This is not the same as the -D flag, since it does not attempt to pack/rehash existing directories that already have the EXT2_INDEX_FL set, which would make up the majority of directories. e2fsck_rehash_dir() first reads all directory entries into memory, hashes them and sorts them by hash, then calls write_directory() to allocate new directory blocks (if needed) to cover the directory entries, and calls ext2fs_block_iterate3->write_dir_block() to iterate over the directory entries and pack them into blocks, writing each one into the previously-allocated blocks of the directory. Once all of the blocks have been written, the remaining blocks of the file are freed from the filesystem by write_dir_block(), and finally the inode is updated with the new block count. I don't yet see where the block count of the file is reduced. |
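The rehash flow described above (read all entries, hash and sort, repack into the leading blocks, free the remainder) can be illustrated with a toy model. This is a sketch under stated assumptions, not the e2fsprogs implementation: `rehash_dir()` is a hypothetical function, directory blocks are modelled as lists of names with a fixed capacity, and `zlib.crc32` stands in for the real htree hash.

```python
import zlib

def rehash_dir(blocks, entries_per_block):
    """Toy model: repack directory entries and report how many blocks become free.

    blocks: list of directory blocks, each a list of entry names.
    Returns (packed_blocks, freed_block_count).
    """
    # 1. Read every directory entry into memory.
    names = [name for blk in blocks for name in blk]
    # 2. Hash the names and sort by hash (crc32 is only a stand-in hash).
    names.sort(key=lambda n: (zlib.crc32(n.encode()), n))
    # 3. Pack the sorted entries densely into the leading blocks.
    packed = [names[i:i + entries_per_block]
              for i in range(0, len(names), entries_per_block)]
    # 4. Blocks past the packed range are no longer needed and get freed.
    freed = max(0, len(blocks) - len(packed))
    return packed, freed

# Four half-empty blocks with a capacity of 4 entries each.
sparse = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]
packed, freed = rehash_dir(sparse, entries_per_block=4)
print(len(packed), freed)
```

The model only tracks data blocks. The bug discussed in this ticket concerns what happens to the extent tree metadata that mapped the freed trailing blocks, which this sketch deliberately leaves out.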
| Comment by Charles Wright [ 10/Nov/15 ] |
|
Hi Andreas, |
| Comment by Andreas Dilger [ 10/Nov/15 ] |
|
Charles, I think the problem is less related to the specific software versions and more to the size of the directories and the pattern in which they were created. As yet, I don't have any easy way to reproduce the large number of separate blocks (over 1600) that fill out the extent tree to 2 levels, with at least 5-6 index blocks allocated to the directories. In my test scripts I'm only ever able to get the code to allocate nicely contiguous ranges of blocks for the directory that stay within a single extent, so I'll have to resort to something different to try and create the large extent tree that I think is needed to reproduce the bug. I've also been looking at the code to see if I can spot a bug in this area, and at one point I thought I had a lead, but I'm not certain anymore and don't have a way to test it. |
| Comment by Andreas Dilger [ 11/Nov/15 ] |
|
I was able to reproduce this problem with an e2fsck test script (attached) when shrinking an htree extent directory with only 3 index blocks referenced directly by the inode. The problem is not present on block-mapped directories but looks to be a danger for any user of the "-fD" option with extent-mapped directories. It looks like the problem occurs when the inode shrinks enough that one of the index blocks is dropped from the end of the file (blocks after logical block 114 were freed), but the write_directory() write_dir_block() iterator doesn't free the index block 800:
:
write_dir_block 113:583 - write
write_dir_block 114:587 - write
write_dir_block 115:591 - free
write_dir_block 116:595 - free
:
:
write_dir_block 165:791 - free
write_dir_block -1:800 - skip
write_dir_block 166:795 - free
write_dir_block 167:799 - free
write_dir_block 168:804 - free
write_dir_block 169:808 - free
write_dir_block 170:812 - free
write_dir_block 171:813 - free
write_dir_block 172:814 - free
write_dir_block -1:800 - skip
Pass 4: Checking reference counts
Pass 5: Checking group summary information
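The decisions visible in the trace above can be replayed with a small Python model. It is a sketch of the behaviour shown in the log, not the write_dir_block() source: `classify()` is a hypothetical function, and the rule that a negative block count marks an extent tree (metadata) block is an assumption drawn from the "-1:800 - skip" lines. Only the trace lines shown above are included; the elided ones are omitted.

```python
NEEDED_BLOCKS = 115  # logical blocks 0..114 are kept after compaction

def classify(blockcnt, needed=NEEDED_BLOCKS):
    """Return the action the trace shows for a (logical block count) value."""
    if blockcnt < 0:
        return "skip"   # metadata block: neither rewritten nor freed
    return "write" if blockcnt < needed else "free"

# (logical block count, physical block) pairs copied from the trace.
trace = [(113, 583), (114, 587), (115, 591), (116, 595), (165, 791),
         (-1, 800), (166, 795), (167, 799), (168, 804), (169, 808),
         (170, 812), (171, 813), (172, 814), (-1, 800)]

freed, kept_metadata = set(), set()
for blockcnt, blk in trace:
    action = classify(blockcnt)
    print(f"write_dir_block {blockcnt}:{blk} - {action}")
    if action == "free":
        freed.add(blk)
    elif action == "skip":
        kept_metadata.add(blk)

print("metadata blocks left allocated:", sorted(kept_metadata))
```

In this model every data block at or beyond logical block 115 is freed, while index block 800 is never released, which matches the leftover index entry in the debugfs output that follows.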
The extent tree now has a bogus index block at the end, but somehow is
debugfs: stat subdir
Inode: 12 Type: directory Mode: 0755 Flags: 0x81000
Generation: 0 Version: 0x00000000
User: 0 Group: 0 Size: 117760
File ACL: 0 Directory ACL: 0
Links: 2 Blockcount: 238
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x5642e764 -- Tue Nov 10 23:59:48 2015
atime: 0x5642e764 -- Tue Nov 10 23:59:48 2015
mtime: 0x5642e764 -- Tue Nov 10 23:59:48 2015
EXTENTS:
(ETB0):146, (0):129, (1):133, (2):137, (3):141, (4):145, (5):150,
(6):154, (7):158, (8):162, (9):166, (10):170, (11):174, (12):178,
(13):182, (14):186, (15):190, (16):194, (17):198, (18):202,
(19):206, (20):210, (21):214, (22):218, (23):222, (24):226,
(25):230, (26):234, (27):238, (28):242, (29):246, (30):250,
(31):254, (32):258, (33):262, (34):266, (35):270, (36):274,
(37):278, (38):282, (39):286, (40):290, (41):294, (42):298,
(43):302, (44):306, (45):310, (46):314, (47):318, (48):322,
(49):326, (50):330, (51):334, (52):338, (53):342, (54):346,
(55):350, (56):354, (57):358, (58):362, (59):366, (60):370,
(61):374, (62):378, (63):382, (64):386, (65):390, (66):394,
(67):398, (68):402, (69):406, (70):410, (71):414, (72):418,
(73):422, (74):426, (75):430, (76):434, (77):438, (78):442,
(79):446, (80):450, (81):454, (82):458, (ETB0):800, (172):814
debugfs: extents subdir
:
:
1/ 1 82/ 83 81 - 81 454 - 454 1
1/ 1 83/ 83 82 - 82 458 - 458 1
0/ 1 2/ 2 170 - 4294967410 800 4294967241
1/ 1 1/ 1 172 - 172 814 - 814 1
The i_size is correct for 115 data blocks written, and i_blocks would |
| Comment by Chris Hunter (Inactive) [ 11/Nov/15 ] |
|
Hi Andreas, I don't know how the transition from inline entries to an external extent map is done, but can we assume that if there is an extent map (i.e. extent node depth > 0) this bug will not be triggered? Thanks, |
| Comment by Andreas Dilger [ 12/Nov/15 ] |
|
Chris, this is not using the inline data feature, which is not yet enabled for any Lustre filesystems. This is a problem with how e2fsck_rehash_dir() is processing the extra blocks of the directory in an ext4 extent tree after compacting the directory entries. I'm able to reproduce this with as few as 3 index blocks shrinking to 2 index blocks in my test case. For Lustre OSTs this would work out to directories with approximately (3 * (4096 / 12 - 1) * (4096 / 16)) ~= 260k entries (3 index blocks * number of leaf blocks per index * number of entries per leaf block) iff they are shrunk during the directory rehashing stage to need one fewer index block. I was discussing this problem with Ted Ts'o (e2fsprogs author) this morning and have some ideas of how to fix it. |
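The ~260k figure can be checked with a few lines of arithmetic. This sketch assumes 4096-byte blocks, 12-byte on-disk extent entries (the ext4 extent header, index, and leaf records are each 12 bytes, which also matches the 340 entries per extent block in the dump earlier in this ticket), and 16 bytes per directory entry as in the estimate above. The constant names are illustrative only.

```python
BLOCK_SIZE = 4096
EXTENT_ENTRY_SIZE = 12  # ext4 extent header/index/leaf record size in bytes
DIRENT_SIZE = 16        # assumed bytes per directory entry, per the estimate
INDEX_BLOCKS = 3

# One 12-byte slot in each extent block is taken by the header.
leaves_per_index = BLOCK_SIZE // EXTENT_ENTRY_SIZE - 1
entries_per_leaf = BLOCK_SIZE // DIRENT_SIZE
total_entries = INDEX_BLOCKS * leaves_per_index * entries_per_leaf

print(leaves_per_index, entries_per_leaf, total_entries)  # 340 256 261120
```

This assumes every extent maps exactly one leaf block, as in the fragmented directories shown above. Directories with contiguous multi-block extents would need far more entries before reaching 3 index blocks.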
| Comment by Gerrit Updater [ 13/Nov/15 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/17152 |
| Comment by Gerrit Updater [ 13/Nov/15 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/17153 |
| Comment by Andreas Dilger [ 14/Nov/15 ] |
|
Moving the e2fsck_rehash() code over to using ext2fs_punch() to truncate the now-smaller directory allowed my new f_extent_htree test case to pass, but it caused test failures in other regression tests. It turns out that there were existing bugs in the ext2fs_punch_ind() handling of indirect-block mapped files, and a known bug in ext2fs_punch_ext() (which I didn't hit, but I found a patch on e2fsprogs master which seems prudent to port to maint). I've pushed 4 patches into our local regression testing, which runs the e2fsprogs regression tests on all the server platforms (RHEL/SLES) and then tests the new e2fsprogs with Lustre as well. I've also pushed the patches to the linux-ext4 mailing list for external review and inclusion into the upstream repository. |
| Comment by Gerrit Updater [ 02/Dec/15 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/17431 |
| Comment by Andreas Dilger [ 03/Dec/15 ] |
|
Patches have all been accepted into upstream e2fsprogs. Working on a -wc4 release for this as well. |
| Comment by Gerrit Updater [ 09/Dec/15 ] |
|
Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/17152/ |
| Comment by Gerrit Updater [ 11/Dec/15 ] |
|
Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/17153/ |
| Comment by Gerrit Updater [ 11/Dec/15 ] |
|
Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/17431/ |
| Comment by Gerrit Updater [ 11/Dec/15 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/17572 |
| Comment by Andreas Dilger [ 11/Dec/15 ] |
|
The e2fsprogs-1.42.13.wc4 release should also be recommended for other maintenance releases. |
| Comment by Gerrit Updater [ 14/Dec/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17572/ |
| Comment by Andreas Dilger [ 14/Dec/15 ] |
|
Patch updating lustre/ChangeLog to reference new release has been landed to master for 2.8.0. |