[LU-8706] e2fsck -fDy running forever Created: 13/Oct/16 Updated: 14/Oct/16 Resolved: 14/Oct/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Joe Mervini | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
toss-2.4-2 |
||
| Severity: | 3 | ||||||||||||
| Description |
|
We had several OSSes crash yesterday, and the crashes appear to be related to quotas. The problem was exposed when I ran e2fsck against one of the OSTs: on a subsequent run after the OST was repaired, e2fsck started complaining about the htree and has been spewing "Unattached inode <num> connect to lost+found" messages (similar to what was reported in Do I just let this run to completion, or do I have alternatives? |
| Comments |
| Comment by Andreas Dilger [ 13/Oct/16 ] |
|
What version of e2fsck are you running? There was a bug in e2fsck ( I suspect that there isn't anything to be done at this stage, and you need to let the e2fsck run to completion. If you are getting entries moved into lost+found, then you will need to run ll_recover_lost_found_objs on the OSTs to move the OST objects from lost+found back into the O/0/d* directories. |
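To illustrate what moving objects "back into the O/0/d* directories" means, here is a minimal sketch of the destination-path calculation. It assumes the common ldiskfs OST layout in which an object lives under O/<seq>/d<N>/<objid>, with N equal to the object ID modulo 32; the function name `object_path` and the constant `OBJECT_SUBDIRS` are hypothetical, and how the real ll_recover_lost_found_objs tool determines each object's ID is not modeled here.

```python
# Assumption: an OST object with ID `objid` is stored under
# O/<seq>/d<objid % 32>/<objid>. This mirrors the O/0/d* naming in this
# ticket but is an illustrative sketch, not the tool's actual code.
OBJECT_SUBDIRS = 32


def object_path(objid, seq=0, subdirs=OBJECT_SUBDIRS):
    """Return the relative path where an OST object is expected to live."""
    if objid < 0:
        raise ValueError("object ID must be non-negative")
    return f"O/{seq}/d{objid % subdirs}/{objid}"


if __name__ == "__main__":
    # Show where a few hypothetical object IDs would be placed.
    for objid in (32, 1234, 2400000):
        print(objid, "->", object_path(objid))
```

Because consecutive object IDs land in different d* subdirectories, the recovered objects end up spread roughly evenly across them.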
| Comment by Joe Mervini [ 13/Oct/16 ] |
|
Andreas - On this particular stack we're at 1.42.9-wc1-7 so I'll let it run out. Thank God for ll_recover_lost_found_objs! It's definitely saved me in the past... |
| Comment by Joe Mervini [ 13/Oct/16 ] |
|
The fsck has now been running for 6.5 hours, and about 2.4M inodes have been moved to lost+found. I'm concerned about that directory space. Should I be? |
| Comment by Joe Mervini [ 13/Oct/16 ] |
|
Another point - this file system is relatively new (just over a year old and I believe that it was formatted against 2.5.3). So I'm hoping that |
| Comment by Andreas Dilger [ 14/Oct/16 ] |
|
Depending on the amount of corruption, you may have a substantial fraction of all OST objects moved into lost+found. The directory size itself is unlikely to be a problem, since it is grown as more entries are added. However, I believe the insertion is a linear search for each new object, so this has the potential to take a long time. How many objects are on your other OSTs? |
| Comment by Joe Mervini [ 14/Oct/16 ] |
|
The fsck completed sometime overnight, ll_recover_lost_found_objs was successful, and our file system is back online. I was surprised that the object recovery took only seconds to perform, especially since the fsck took SO long to connect what appeared to be every object to lost+found. I anticipated it would have taken hours. In any event, this ticket can be closed. Thanks for your support. |
| Comment by Andreas Dilger [ 14/Oct/16 ] |
|
Glad to hear that this problem was resolved. The ll_recover_lost_found_objs code runs much faster because it is using the hashed directory code in the kernel to insert objects into the O/0/d* directories, and since there are 32 of them they are 1/32 as large. The e2fsck code is doing a linear search and insertion for each entry, which is O(n^2) time overall, and this is a single directory that is 32x as large. |
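The difference in growth rate can be sketched with a toy model. This is not the e2fsck or kernel code: a Python list stands in for a directory that is scanned linearly before each insert, and a Python set stands in for a hashed directory. All function names are hypothetical, and the counts are comparisons or lookups, not wall-clock time.

```python
def insert_linear(entries, name):
    """Append name after scanning every existing entry; return comparisons made."""
    comparisons = 0
    for existing in entries:
        comparisons += 1
        if existing == name:
            return comparisons  # duplicate found, nothing inserted
    entries.append(name)
    return comparisons


def total_linear_cost(n):
    """Comparisons needed to insert n distinct names into one flat directory."""
    entries = []
    return sum(insert_linear(entries, f"#{i}") for i in range(n))


def closed_form_linear_cost(n):
    """0 + 1 + ... + (n - 1) comparisons, i.e. O(n^2) growth."""
    return n * (n - 1) // 2


def total_hashed_cost(n):
    """Toy stand-in for a hashed directory: one lookup per insert, O(n) total."""
    index = set()
    lookups = 0
    for i in range(n):
        name = f"#{i}"
        lookups += 1
        if name not in index:
            index.add(name)
    return lookups


if __name__ == "__main__":
    n = 2000
    print("linear:", total_linear_cost(n), "hashed:", total_hashed_cost(n))
    # Estimate for the ~2.4M entries mentioned in this ticket, without simulating.
    print("estimated linear comparisons:", closed_form_linear_cost(2_400_000))
```

In this model, doubling the number of entries roughly quadruples the linear cost but only doubles the hashed cost, which is consistent with the hours-versus-seconds difference reported above.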