[LU-12913] fsck found > 1M multiply-claimed blocks Created: 29/Oct/19 Updated: 19/Sep/22 Resolved: 01/Nov/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Joe Mervini | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
lustre 2.10.5/toss3.5 (rhel 7.5) running on Dell R730 servers and DDN SFA12K hardware. |
||
| Issue Links: |
|
| Severity: | 1 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
On another OST that went read-only on the same file system that I reported in the earlier ticket, fsck is reporting more than 1M multiply-claimed blocks. I have never encountered this before, so I'm not sure how to proceed. Any advice would be welcome. |
| Comments |
| Comment by Andreas Dilger [ 29/Oct/19 ] |
|
The duplicate block passes are often spurious these days, in that any kind of random corruption of an inode in a large filesystem results in "duplicate blocks", because any random 32-bit number is also a valid block number in a 16TB filesystem. I don't have the e2fsck output to reference, so I can't say for sure, but my thought is that there are probably a small number of inodes (possibly close together on disk) that are claiming the huge majority of duplicate blocks, along with a bunch of other errors, and they are just random garbage instead of valid inodes. Another possibility is that some inode table blocks were written to the wrong location on disk and are "duplicated" with the original copies of those inodes.

This could be determined by using "debugfs -c -R 'stat <inode_number>' /dev/ostdev" (include angle brackets around inode_number) to dump the attributes of inodes with duplicate blocks, like:

    Inode: 130778   Type: regular    Mode: 0666   Flags: 0x80000
    Generation: 591382355    Version: 0x00000072:0001bd33
    User:  1001   Group:  1001   Size: 60870
    File ACL: 0    Directory ACL: 0
    Links: 1   Blockcount: 120
    Fragment:  Address: 0    Number: 0    Size: 0
     ctime: 0x57618035:00000000 -- Wed Jun 15 10:20:05 2016
     atime: 0x57618035:00000000 -- Wed Jun 15 10:20:05 2016
     mtime: 0x56a3c350:00000000 -- Sat Jan 23 11:15:44 2016
    crtime: 0x57617fff:d0a68f28 -- Wed Jun 15 10:19:11 2016
    Size of extra inode fields: 28
    Extended attributes stored in inode body:
      lma = "08 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 60 03 14 00 00 00 00 00 " (24)
      lma: fid=[0x100000000:0x140360:0x0] compat=8 incompat=0
      fid = "1b 54 01 00 02 00 00 00 f5 fb 01 00 00 00 00 00 " (16)
      fid: parent=[0x20001541b:0x1fbf5:0x0] stripe=0
    EXTENTS:
    (0-14):133924095-133924109

and then check the Lustre FID stored in the "lma" xattr against which inode is referenced by the object directories, using "ls -li O/0/d$((oid % 32))/$((oid))", where oid is the second number in the FID, 0x140360 in this case. The inode number referenced from the object directory should be the same for the "good" inode (130778 in this case), and will not match for the "bad" inode.

Rather than go through a very lengthy duplicate blocks phase only to find those inodes are useless, you could kill e2fsck and manually clear the inodes with "debugfs -w -R 'clri <inode_number> [<inode_number> ...]' /dev/ostdev", or you can start an interactive session with "debugfs -w /dev/ostdev" and use "stat <ino>" and "ls -l /O/0/dNN/nnnnnnn" manually to look at the "bad" inodes (e.g. to check whether they have valid UID/GID/timestamps) and "clri <ino>" to erase them.

There are other e2fsck options for dealing with duplicate blocks on a large scale:

-E extended_options
Set e2fsck extended options. Extended options are comma sepa-
rated, and may take an argument using the equals ('=') sign.
The following options are supported:
clone=dup|zero
Resolve files with shared blocks in pass 1D by giv-
ing each file a private copy of the blocks (dup); or
replacing the shared blocks with private, zero-
filled blocks (zero). The default is dup.
shared=preserve|lost+found|delete
Files with shared blocks discovered in pass 1D are
cloned and then left in place (preserve); cloned and
then disconnected from their parent directory and
reconnected to /lost+found in pass 3 (lost+found);
or simply deleted (delete). The default is preserve.
However, this affects both the "good" and "bad" inodes involved in a duplicate blocks case, since it can't make a good decision on how to handle them automatically (e2fsck does not understand the extra Lustre metadata). This is not the best from a data availability POV, but the "lost+found" or "delete" options are safer if the data should not be potentially visible to other users on the system. Clearing the "bad" inode(s) manually with debugfs will preserve the original data and should not result in any data leaks, but is only practical for a smaller number of inodes.

There is also a "bad inode" detection mechanism in e2fsck that should be detecting the case of inodes with a lot of duplicate blocks and corrupt metadata, but it doesn't appear to be working in your case. If you are running "e2fsck -n" then this is understandable. If you are running "e2fsck -y" or "-p" then it should normally have prevented the duplicate blocks from being an issue at all.

Apologies for the long post and for not being very precise, but I can't make a better assessment without the e2fsck output. Hopefully this provides enough information for you to make an assessment onsite and deal with the issue in a reasonable manner. |
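A rough sketch of the inspect-and-clear workflow described in the comment above, reusing the example values from the debugfs output (inode 130778, oid 0x140360) and the placeholder device /dev/ostdev; the real values would come from the inodes e2fsck reports as owning duplicate blocks:

    # dump the suspect inode; -c opens the filesystem in catastrophic mode
    # (inode and group bitmaps are not read), which is safer on a damaged device
    debugfs -c -R 'stat <130778>' /dev/ostdev

    # oid is the second number in the "lma" fid (0x140360 in the example);
    # the object directory entry lives under O/0/d(oid % 32)
    oid=$((0x140360))
    debugfs -c -R "ls -l /O/0/d$((oid % 32))/$oid" /dev/ostdev

    # if the object directory entry points at a different inode number, the inode
    # stat'ed above is the "bad" copy and can be erased (-w opens read-write)
    debugfs -w -R 'clri <130778>' /dev/ostdev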
| Comment by Joe Mervini [ 29/Oct/19 ] |
|
Andreas, in this case the fsck was started with the -fy flags. Ruth and I were discussing your comments and a path forward. With regard to the bad inode detector, we're wondering whether our first course of action should be to try restarting the fsck with the new version of e2fsprogs that you suggested in the other issue. I am logging the fsck run right now, but it would take me some time to get it moved off the network that it is on, so I thought that I'd just synopsize the output below. I copied verbatim (with the exception of the sizes) the meaningful output after the fsck header info, up until the multiply-claimed blocks output:
Inode 33073713 is in use, but has dtime set. Fix? yes
Inode 33073713 has extra size (size) which is invalid. Fix? yes
Inode 33073713 i_size is (size), should be 0. Fix? yes
Inode 33073713 i_blocks is (size), should be 0. Fix? yes
Inode 33073714 is in use, but has dtime set. Fix? yes
Inode 33073714 has extra size (size) which is invalid. Fix? yes
Inode 33073715 is in use, but has dtime set. Fix? yes
Inode 33073715 has imagic flag set. Clear? yes
Inode 33073715 has extra size (size) which is invalid. Fix? yes
Inode 33073716 is in use, but has dtime set. Fix? yes
Inode 33073716 has imagic flag set. Clear? yes
Inode 33073716 has extra size (size) which is invalid. Fix? yes
Inode 33073717 is in use, but has dtime set. Fix? yes
Inode 33073717 has extra size (size) which is invalid. Fix? yes
Inode 33073718 is in use, but has dtime set. Fix? yes
Inode 33073718 has extra size (size) which is invalid. Fix? yes
Inode 33073718 has compression flag set on filesystem without compression support. Clear? yes
Inode 33073718 i_size is (size), should be 0. Fix? yes
Inode 33073718 i_blocks is (size), should be 0. Fix? yes
Inode 33073719 is in use, but has dtime set. Fix? yes
Inode 33073719 has extra size (size) which is invalid. Fix? yes
Inode 33073719 has compression flag set on filesystem without compression support. Clear? yes
Inode 33073719 has INDEX_FL set but is not a directory. Clear HTree Index? yes
Inode 33073719 has INDEX_FL set on a filesystem without htree support. Clear HTree Index? yes
Inode 33073720 is in use, but has dtime set. Fix? yes
Inode 33073720 has imagic flag set. Clear? yes
Inode 33073720 has extra size (size) which is invalid. Fix? yes
Inode 33073715 has compression flag set on filesystem without compression support. Clear? yes
Inode 33073715 i_size is (size), should be 0. Fix? yes
Inode 33073715 i_blocks is (size), should be 0. Fix? yes
Inode 33073716 i_size is (size), should be (other size). Fix? yes
Inode 33073716 i_blocks is (size), should be (other size). Fix? yes
Inode 33073714 has compression flag set on filesystem without compression support. Clear? yes
Inode 33073714 i_size is (size), should be 0. Fix? yes
Inode 33073714 i_blocks is (size), should be 0. Fix? yes
Inode 33073720 has compression flag set on filesystem without compression support. Clear? yes
Inode 33073720 i_size is (size), should be 0. Fix? yes
Inode 33073720 i_blocks is (size), should be 0. Fix? yes
Inode 33073717 has INDEX_FL set but is not a directory. Clear HTree Index? yes
Inode 33073717 has INDEX_FL set on a filesystem without htree support. Clear HTree Index? yes
Inode 33073717 i_size is (size), should be 0. Fix? yes
Inode 33073717 i_blocks is (size), should be 0. Fix? yes
Deleted inode 400982002 has zero dtime. Fix? yes
Deleted inode 400982003 has zero dtime. Fix? yes
Running additional passes to resolve blocks claimed by more than one inode...
Pass 1B: Rescanning for multiply-claimed blocks
(About 1.3M multiply-claimed blocks were reported, then the cloning operation started.)
|
| Comment by Joe Mervini [ 29/Oct/19 ] |
|
One other thing is that Ruth posted in the other issue that the version of e2fsprogs that is on the download site is a rev back. Is there another place we can get the most recent? |
| Comment by Andreas Dilger [ 30/Oct/19 ] |
|
The e2fsck output looks as I would expect - a series of inodes (33073713-33073720) that are corrupted (showing random errors such as invalid size, flags, blocks). I would guess that e2fsck also reported those inodes as the source of the duplicated blocks. If e2fsck is doing the pass1b block cloning, then interrupting it should have no ill effect. Using the debugfs "clri" command for inodes <33073713> through <33073720> should erase the corrupt inodes and also avoid the duplicate blocks. |
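Following the clri syntax quoted in the earlier comment, that could look something like the sketch below (the device path is a placeholder; if the installed debugfs only accepts one inode per clri, run one clri per inode from an interactive "debugfs -w" session instead):

    # erase the corrupt inodes 33073713 through 33073720 in one pass
    debugfs -w -R 'clri <33073713> <33073714> <33073715> <33073716> <33073717> <33073718> <33073719> <33073720>' /dev/ostdev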
| Comment by Joe Mervini [ 30/Oct/19 ] |
|
Thanks Andreas. We are preparing to do these operations shortly. Would you recommend running another fsck against the file system once we're done or should we be good just mounting it? |
| Comment by Joe Mervini [ 30/Oct/19 ] |
|
Andreas - I did have to kill the fsck with a -9. When I try to clear one of the inodes I am getting this message back:
debugfs 1.45.2.wc1 (27-May-2019)
debugfs: MMP: e2fsck being run while trying to open /dev/mapper/<mapper device>
clri: Filesystem not open
I checked and there are no remnant fsck processes. What do I do?
|
| Comment by Andreas Dilger [ 30/Oct/19 ] |
|
Joe, sorry that I did not see your message sooner. You can clear the e2fsck MMP state with "tune2fs -E clear_mmp -f /dev/ostdev" if you are sure nothing else is using the OST. You should run another e2fsck after the bad inodes are cleared, but it should be much faster without the duplicate-blocks pass. |
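In other words, roughly this sequence, using the tune2fs invocation given above and the mapper-device placeholder from the earlier comment:

    # reset the MMP block left behind by the killed e2fsck
    # (only if nothing else has the OST open)
    tune2fs -E clear_mmp -f /dev/mapper/<mapper device>
    # after the bad inodes are cleared with debugfs, re-run the full check
    e2fsck -fy /dev/mapper/<mapper device>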
| Comment by Joe Mervini [ 30/Oct/19 ] |
|
Andreas - I cleared the mmp flag and am clearing the inodes one at a time. (I was surprised that it actually took quite a while for the first one to clear.) If you see this before I get to the point where I run fsck again, should I use the "-p" or "-y" option? |
| Comment by Andreas Dilger [ 30/Oct/19 ] |
|
Probably the first one was loading the metadata from disk, and the later ones had it cached already. You should probably use "-y" since that will fix any problems found, while "-p" may abort if there are unexpected problems. |
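That is, something along the lines of the following, with the mapper-device placeholder standing in for the actual OST device:

    e2fsck -fy /dev/mapper/<mapper device>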
| Comment by Joe Mervini [ 30/Oct/19 ] |
|
I just tried the fsck with the -p option. I got:
mscratch-OST0023: Inode <496> extent tree (at level 2) could be narrower, IGNORED.
This was repeated for 23 additional inodes. Then I got:
mscratch-OST0023: Inode 919 has an invalid extent node (blk 119809, lblk 0)
mscratch-OST0023: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
I'm guessing I am safe to proceed but am curious what these messages mean. |
| Comment by Joe Mervini [ 30/Oct/19 ] |
|
Oh - just saw your message and started the -fy fsck. It shows those inodes being optimized. It's progressing, so I'll post an update when it finishes. Thanks a lot for your help! |
| Comment by Peter Jones [ 31/Oct/19 ] |
|
Hey jamervi any news? |
| Comment by Joe Mervini [ 31/Oct/19 ] |
|
Oh shoot! I thought I sent this out last night...
The fsck completed and I ran another for good measure. There were 461 instances of inode extent trees being optimized. It then went through Pass 1E: Optimizing extent trees. Pass 2 exposed those inodes that were cleared. We're going to see if we can determine the affected directories tomorrow. Pass 5 had an enormous number of block bitmap differences.
The file system is back online and everything looks ok. |
| Comment by Joe Mervini [ 01/Nov/19 ] |
|
Just wanted to post that this case has been resolved. Andreas, if the opportunity comes up at SC I'd like to buy you a few beers! Maybe pick your brain about the mysteries of fsck... |
| Comment by Peter Jones [ 01/Nov/19 ] |
|
Excellent news! Joe, I hope to see you at SC19 too |