[LU-8706] e2fsck -fDy running forever Created: 13/Oct/16  Updated: 14/Oct/16  Resolved: 14/Oct/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Joe Mervini Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: None
Environment:

toss-2.4-2


Issue Links:
Related
is related to LU-7368 e2fsck unsafe to interrupt with quota... Resolved
is related to LU-7381 "e2fsck -fD" on directory may cause e... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We had several OSSes crash yesterday, apparently related to quotas. The problem was exposed when I ran e2fsck against one of the OSTs. On a subsequent run after the OST was repaired, e2fsck started complaining about the htree and has been spewing "Unattached inode <num> connect to lost+found" messages (similar to what was reported in LU-3542) for the past couple of hours. All the other OSTs checked fine when I just ran e2fsck without the -D option.

Do I just let this run to completion or do I have alternatives?



 Comments   
Comment by Andreas Dilger [ 13/Oct/16 ]

What version of e2fsck are you running? There was a bug in e2fsck (LU-7368) where interrupting it while quotas were enabled could corrupt the filesystem. There was a second bug in the e2fsck -fD option (LU-7381) that was also fixed in e2fsprogs-1.42.13.wc5, which is the latest release.

I suspect that there isn't anything to be done at this stage, and you need to let the e2fsck run to completion. If you are getting entries moved into lost+found, then you will need to run ll_recover_lost_found_objs on the OSTs to move the OST objects from lost+found back into the O/0/d* directories.
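As a rough illustration of where recovered objects end up, the sketch below computes the path of an OST object under the O/0/d* directories. It assumes the usual ldiskfs OST layout, in which an object lands in subdirectory number (object ID modulo 32). The function name ost_object_path and its parameters are hypothetical and are not part of any Lustre tool.

```python
# Illustrative sketch only: assumes the ldiskfs OST layout O/<seq>/d<objid % 32>/<objid>.
# ost_object_path is a hypothetical helper, not a Lustre API.

def ost_object_path(objid: int, seq: int = 0, subdirs: int = 32) -> str:
    """Return the relative path where an OST object is expected to live."""
    if objid < 0:
        raise ValueError("object ID must be non-negative")
    # Objects are spread across `subdirs` directories by object ID modulo.
    return f"O/{seq}/d{objid % subdirs}/{objid}"


if __name__ == "__main__":
    for objid in (33, 64, 2400000):
        print(objid, "->", ost_object_path(objid))
```

Spreading objects across 32 subdirectories keeps each directory far smaller than a single lost+found holding every object.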

Comment by Joe Mervini [ 13/Oct/16 ]

Andreas - On this particular stack we're at 1.42.9-wc1-7 so I'll let it run out. Thank God for ll_recover_lost_found_objs! It's definitely saved me in the past...

Comment by Joe Mervini [ 13/Oct/16 ]

The fsck has been running now for 6.5 hours and the inode count that has been moved to lost+found is ~2.4M. I'm concerned about that directory space. Should I be?

Comment by Joe Mervini [ 13/Oct/16 ]

Another point - this file system is relatively new (just over a year old and I believe that it was formatted against 2.5.3). So I'm hoping that LU-7381 is not relevant.

Comment by Andreas Dilger [ 14/Oct/16 ]

Depending on the amount of corruption, you may have a substantial fraction of all OST objects moved into lost+found. The directory size itself is unlikely to be a problem, since it is grown as more entries are added. However, I believe the insertion is a linear search for each new object, so this has the potential to take a long time. How many objects are on your other OSTs?

Comment by Joe Mervini [ 14/Oct/16 ]

The fsck completed sometime overnight, ll_recover_lost_found_objs was successful, and our file system is back online. I was surprised that the object recovery took only seconds to perform, especially since the fsck took SO long to connect what appeared to be every object to lost+found. I anticipated it would have taken hours.

In any event, this ticket can be closed. Thanks for your support.

Comment by Andreas Dilger [ 14/Oct/16 ]

Glad to hear that this problem was resolved.

The ll_recover_lost_found_objs code runs much faster because it is using the hashed directory code in the kernel to insert objects into the O/0/d* directories, and since there are 32 of them they are 1/32 as large. The e2fsck code is doing a linear search and insertion for each entry in O(n^2) time, and this is a single directory that is 32x as large.
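The difference in scaling can be sketched with a toy model. This is not the e2fsck or kernel code: the linear case scans every existing entry before each insertion, and the hashed case is approximated with a Python set per directory, an assumption standing in for the kernel's hashed directory lookup. All function names here are hypothetical.

```python
# Toy model of the two insertion strategies described above.
# Assumption: a Python set stands in for the kernel's hashed directory index.

def linear_insert_comparisons(n: int) -> int:
    """Closed form: entry k is compared against the k entries already present."""
    return n * (n - 1) // 2


def simulate_linear(n: int) -> int:
    """Insert n unique names into one directory, scanning all entries each time."""
    entries = []
    comparisons = 0
    for name in range(n):
        for existing in entries:
            comparisons += 1
            if existing == name:
                break
        entries.append(name)
    return comparisons


def simulate_hashed(n: int, subdirs: int = 32) -> int:
    """Insert n objects into `subdirs` hashed directories, one lookup each."""
    dirs = [set() for _ in range(subdirs)]
    lookups = 0
    for objid in range(n):
        target = dirs[objid % subdirs]
        lookups += 1
        if objid not in target:
            target.add(objid)
    return lookups


if __name__ == "__main__":
    n = 1000
    print("linear comparisons:", simulate_linear(n))
    print("hashed lookups:", simulate_hashed(n))
    # Extrapolate the linear model to the ~2.4M entries reported in this ticket.
    print("linear model at 2.4M entries:", linear_insert_comparisons(2_400_000))
```

Under this model the linear approach needs about n^2/2 comparisons, while the hashed approach needs roughly one lookup per object, which is consistent with hours for the fsck versus seconds for the recovery.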

Generated at Sat Feb 10 02:19:51 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.