[LU-6945] Clients reporting missing files Created: 03/Aug/15 Updated: 13/Aug/15 Resolved: 13/Aug/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Joe Mervini | Assignee: | Oleg Drokin |
| Resolution: | Done | Votes: | 0 |
| Labels: | mdt | ||
| Environment: |
Toss 2.13 - Lustre 2.1.4 |
||
| Attachments: |
|
| Severity: | 4 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We recently ran into LBUG errors when running the 2.5.x Lustre client against Lustre 2.1.2; the resolution was to update the servers to 2.1.4. In all cases we encountered data loss: files that previously existed showed zero file length. The assumption at the time was that the file loss was due to the numerous file system crashes we encountered prior to the software update.
This past Friday our last file system running 2.1.2 went down unexpectedly. Since we do not routinely take our file systems down due to demand, and out of a desire to preemptively avoid the issues we encountered on the other file systems, I updated this file system during the outage. Because the OSTs went read-only I performed fscks on all the targets as well as the MDT, as I routinely do, and they came back clean with the exception of a number of "free inode count wrong" and "free block count wrong" messages, which in my experience is normal.
When the file system was returned to service everything appeared fine, but users started reporting that even though they could stat files, trying to open them returned "no such file or directory". The file system was immediately taken down, and a subsequent fsck of the OSTs - which took several hours - put millions of files into lost+found. The MDT came back clean as before. This was the same scenario as was experienced on the file systems that encountered the crashes. As on the other file systems, I needed to use ll_recover_lost_found_objs to restore the objects and then ran another fsck as a sanity check.
Remounting the file system on a 2.1.4 client shows file sizes, but the files cannot be opened. On a 2.5.4 client the files show zero file length. An attempt was made to go back to 2.1.2, but that was impossible because mounting the MDT under lustre produced a "Stale NFS file handle" message. lfs getstripe on a sampling of files that are inaccessible shows the objects, and examining those objects with debugfs shows data in them; in the case of text/ascii files the data can be easily read. Right now we are in a down and critical state.
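For reference, the mapping step described above looks roughly like this (a hedged sketch; paths and device names are hypothetical, and the /O/0/d&lt;objid % 32&gt;/&lt;objid&gt; layout assumed here is the 2.1-era ldiskfs object directory structure):
# on a client: show which OST index (obdidx) and object id (objid) back each stripe
lfs getstripe /scratch1/path/to/inaccessible_file
# on the matching OSS: inspect one of those objects read-only with debugfs
OBJID=15976682                                         # illustrative objid taken from 'lfs getstripe' output
debugfs -c -R "stat /O/0/d$((OBJID % 32))/${OBJID}" /dev/mapper/scratch1_ost0003
|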
| Comments |
| Comment by Ruth Klundt (Inactive) [ 03/Aug/15 ] |
|
debug output from cat of one of the missing files, on the 2.5.4 lustre client. |
| Comment by John Fuchs-Chesney (Inactive) [ 03/Aug/15 ] |
|
Hello Joe, I am reaching out to our experts here and have also alerted DDN, with whom you have been working. We will monitor this closely and provide more information as soon as we can. Thanks, |
| Comment by Oleg Drokin [ 03/Aug/15 ] |
|
Hm, it's kind of a pity you chose the 2.5 client for the client-side debug output and then used a file that was already cached, so all I see in this log is: Are there any OST messages at all while this client-side trouble is ongoing?
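A fresh, uncached trace would be more useful; a minimal sketch of capturing one (the file path is hypothetical, and the full debug mask can be narrowed if the buffer wraps too quickly):
# on the client, before reproducing
lctl set_param debug=-1              # enable full debug logging
lctl clear                           # drop whatever is already in the debug buffer
cat /scratch1/path/to/missing_file   # reproduce against a file never opened on this mount
lctl dk /tmp/client_debug.log        # dump the kernel debug buffer for attachment
|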
| Comment by Ruth Klundt (Inactive) [ 03/Aug/15 ] |
|
FYI, the response to cat is "no such file or directory", the first time and every time. However, the data is present in the object listed by lfs getstripe. I'll re-run it on a fresh mount to verify. There are some Lustre errors server side; I can collect debug there too if it helps. These are on the OSTs: |
| Comment by Oleg Drokin [ 03/Aug/15 ] |
|
"lvbo_init failed for resource 15976682 " basically tells us the lock failed because the object does not exist (-2). |
| Comment by Oleg Drokin [ 03/Aug/15 ] |
|
And when I say mount the OST, I mean mount it as ldiskfs - this is safe to do even while the OST itself is up; the kernel knows how to moderate those accesses.
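A minimal sketch of that kind of ldiskfs-side inspection, mounted read-only (device and mount point names are hypothetical; the object id is the one from the lvbo_init error above):
mount -t ldiskfs -o ro /dev/mapper/scratch1_ost0003 /mnt/ost0003_ldiskfs
ls -l /mnt/ost0003_ldiskfs/O/0/d$((15976682 % 32))/15976682   # does the object exist at all?
umount /mnt/ost0003_ldiskfs
|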
| Comment by Ruth Klundt (Inactive) [ 03/Aug/15 ] |
|
That object is from another similarly missing file in the same dir. And object 15976682 does not exist; it is the 4th of 4 stripes. On the 2.5 client: On the 2.1 client: cat reports "no such file or directory" on both the 2.1 and 2.5 clients. We apparently did not check every object when we were looking at small text files with debugfs. |
| Comment by John Fuchs-Chesney (Inactive) [ 03/Aug/15 ] |
|
Ruth, I don't want to distract you from the main dialog you are having, but if you can give us some details of the hardware you are using, that would be helpful. Is this a DDN Exascaler system, or something different? Thanks, |
| Comment by Oleg Drokin [ 03/Aug/15 ] |
|
Does that mean there are files that cat works on with 2.1 clients but not with 2.5 clients? Is the lfs getstripe output identical there too? |
| Comment by Ruth Klundt (Inactive) [ 03/Aug/15 ] |
|
It is confusing, for sure. The files reported by the user as 'could not open' behave this way, from what we have seen so far. There are many; we've only looked at a couple of them. On a 2.1 client: the size appears correct (non-zero), stat shows it as a regular file, and /bin/cat at the command line gets 'no such file or directory'. On a 2.5 client: the size is 0, stat shows it as an empty regular file, and /bin/cat at the command line reports 'no such file or directory'. lfs getstripe looks OK in both places, and the first couple of objects exist. The text can be dumped with debugfs from those objects. I can double check; perhaps they are all missing one of their objects. |
| Comment by Joe Mervini [ 03/Aug/15 ] |
|
John - It is not an appliance. The file system consists of one SFA12K 5-stack front-ended by 6 Dell R720s (2 MDSs / 4 OSSs). There are 28 x 22TB OSTs. The storage system was purchased through DDN. |
| Comment by Oleg Drokin [ 03/Aug/15 ] |
|
I imagine the size might be coming from the MDS, because we started to store size on the MDS some time ago; 2.5 might be disregarding this info in the face of missing objects while 2.1 did not (I can probably check this). The more important issue is: if all the files are missing at least one object, where did all of those objects disappear to, and how do we get them back? |
| Comment by Ruth Klundt (Inactive) [ 03/Aug/15 ] |
|
Yes, that seems to be the question. The next file I checked, size 2321, was present in the first stripe, but all of the other stripes did not exist on the OSTs. There were several fscks run; I'll defer to Joe for questions about them since I wasn't around for that. I believe he has stored output from them. Is there a chance that he hit a version of fsck with a bug in it? |
| Comment by Oleg Drokin [ 03/Aug/15 ] |
|
So with ll_recover_lost_found_objs - you did not happen to run it in -v mode and save the output, did you? I just want to see if any of the now-missing objects were in fact recovered and then deleted by something again. Additionally, did you see that lost+found was empty after ll_recover_lost_found_objs was run?
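For reference, a run that would have captured that output could look roughly like this (a minimal sketch; the ldiskfs mount point is hypothetical, and -d/-v are the tool's documented directory and verbose options):
ll_recover_lost_found_objs -v -d /mnt/ost0013_ldiskfs/lost+found 2>&1 | tee /root/ost0013_recover.log
|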
| Comment by Joe Mervini [ 03/Aug/15 ] |
|
I did not log the output from the recovery process. But every file that was in lost+found was restored, leaving nothing behind, and I was watching the recovery as it was happening. I was able to scroll back through one of my screens to get to some of the output before it ran out of buffer. Here is a sample: Object /mnt/lustre/local/scratch1-OST0013/O/0/d18/2326194 restored. And that's pretty much what I observed throughout the process. I didn't see any messages other than "restored". |
| Comment by Oleg Drokin [ 04/Aug/15 ] |
|
It's a pity you did not save the output. I guess you can still perform the check: all the objects in your buffer - are they still present where they were moved to? |
| Comment by Ruth Klundt (Inactive) [ 04/Aug/15 ] |
|
The objects in the screen were all verified to be restored, for that OST. However, we find that although /bin/ls of lost+found (mounted ldiskfs) shows no files, debugfs shows items in that directory, e.g.:
 11  (12) .    2  (4084) ..
 0  (4096) #524990    0  (4096) #525635
We haven't cross-referenced the objects against the missing ones yet, but we're wondering if that is expected? |
| Comment by Ruth Klundt (Inactive) [ 04/Aug/15 ] |
|
We've checked all the OSTs, and they all report 4-5k items in lost+found via debugfs, but nothing shows up via ls -l. |
| Comment by Oleg Drokin [ 04/Aug/15 ] |
|
It's OK to see the filenames in the debugfs output for lost+found; because the inode field is zero, that just means they are deleted entries. lost+found is special in that we never want it to be truncated (so that we never need to allocate any data blocks when we want to add entries to it), which is why the names remain there in this deleted state. |
| Comment by Andreas Dilger [ 04/Aug/15 ] |
|
The inodes being reported by debugfs in lost+found can be ignored. They all show a single entry covering the whole block (4096 bytes in size) with inode number 0, which means the entry is unused and should not show up via ls. The lost+found directory is increased in size during e2fsck to hold unreferenced inodes as needed (using the ldiskfs inode number as the filename) but is never shrunk as the files are moved out of the directory, in case it needs to be used again. That is a safety measure on behalf of e2fsck, which tries to avoid allocating new blocks for lost+found during recovery to avoid the potential for further corruption.
The discrepancy between 2.1 and 2.5 clients when accessing files with missing objects may be due to changes in the client code. For "small files" (i.e. those with a size below the stripe containing the missing object) it may be that 2.1 will return the size via stat() as computed from the available objects and ignore the fact that one of the objects is missing until it is read. However, if the object is actually missing then the 2.5 behaviour is "more correct", in that it would be possible to have a sparse file that had part of its data on the missing object.
It may be possible to recover some of the data from files with missing objects if they are actually small files that just happen to be striped over 4 OSTs (== default striping?). On a 2.1 client, which reports the file size via stat instead of returning an error, it would be possible to run something like (untested, for example only):
#!/bin/bash
for F in "$@"; do
    [ -f "$F.recov" ] && echo "$F.recov: already exists" && continue
    SIZE=$(stat -c%s "$F")
    STRIPE_SZ=$(lfs getstripe -S "$F")
    # to be more safe we could assume only the first stripe is valid:
    # STRIPE_CT=1
    # allowing the full stripe count will still eliminate large files that are definitely missing data
    STRIPE_CT=$(lfs getstripe -c "$F")
    (( $SIZE >= $STRIPE_CT * $STRIPE_SZ )) && echo "$F: may be missing data" && continue
    # such small files do not need multiple stripes
    lfs setstripe -c 1 "$F.recov"
    dd if="$F" of="$F.recov" bs=$SIZE count=1 conv=noerror
done
This would try to repair the specified files that have a size below the stripe width and copy them to a new temporary file. It isn't 100% foolproof, since it isn't easy to figure out which object is missing, so there may be some class of files in the 1-4MB size range that have a hole where the missing object is. The other issue that hasn't been discussed here is why the OST was corrupted after the upgrade in the first place. Oleg mentioned that this has happened before with a 2.1->2.5 upgrade, and I'm wondering if there is some ldiskfs patch in the TOSS release that needs to be updated, or some bug in e2fsprogs. What version of e2fsprogs is being used with 2.5?
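One possible way to feed candidate files to such a script from a 2.1 client (a hypothetical usage sketch; the mount point and the script name recover_small.sh are assumptions, and filenames containing whitespace would need extra handling):
# gather small files under a suspect directory tree, then run the recovery sketch on them in batches
lfs find /scratch1/projectX --type f --size -4M > candidates.txt
xargs -a candidates.txt -n 64 ./recover_small.sh
|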
| Comment by Ruth Klundt (Inactive) [ 04/Aug/15 ] |
|
The upgrade on the servers was from Lustre 2.1.2 -> 2.1.4. The clients are generally running the 2.5.4 LLNL version; we have a 2.1.4 client off to the side. The version of e2fsprogs on the servers right now is: I also have a script that uses debugfs to dump objects and note the missing ones, on the back end. Joe mentioned that the fscks appeared to succeed, so we're also puzzled about where the objects went. They don't show up in lost+found as having been in there before and deleted. Is it possible that the last_id values were out of order at some point, and the empty objects were deleted as orphans? But in that case it should affect only newish files?
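For what it's worth, a back-end check along the lines described above could look roughly like this (a minimal sketch under assumptions: the 2.1-era /O/0/d&lt;objid % 32&gt;/&lt;objid&gt; object layout, a hypothetical OST device path, and a plain text list of object ids extracted from lfs getstripe output):
#!/bin/bash
DEV=/dev/mapper/scratch1_ost0013          # hypothetical OST block device
while read -r OBJID; do
    D=$((OBJID % 32))
    OUT=$(debugfs -c -R "stat /O/0/d${D}/${OBJID}" "$DEV" 2>&1)
    if echo "$OUT" | grep -q "File not found"; then
        # debugfs reports "File not found by ext2_lookup" for objects that no longer exist
        echo "$OBJID missing" >> missing_objects.txt
    else
        # object exists: dump its contents for later inspection/restore
        debugfs -c -R "dump /O/0/d${D}/${OBJID} /tmp/obj.${OBJID}" "$DEV"
    fi
done < objid_list.txt
|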
| Comment by Ruth Klundt (Inactive) [ 04/Aug/15 ] |
|
The first 75 files I ran through the debugfs dump script were all either 1 or 2 stripes, total size < 2MiB. I'll need to get the user to verify the sanity of the files. P.S. dd gets "no such file or directory" on these files. |
| Comment by Oleg Drokin [ 04/Aug/15 ] |
|
If it's the first stripe, I imagine you can just copy the object file out of the OST filesystem directly and that would be the content.
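A minimal sketch of that direct copy, using debugfs so the OST does not need to be remounted (the device name is hypothetical; object id 2326194 is only illustrative, taken from the restore output quoted earlier):
# dump the first-stripe object of a small file straight out of the OST's ldiskfs
debugfs -c -R "dump /O/0/d$((2326194 % 32))/2326194 /tmp/file.recovered" /dev/mapper/scratch1_ost0013
|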
| Comment by Joe Mervini [ 05/Aug/15 ] |
|
The version of e2fsprogs that was in the image running 2.1.2 was e2fsprogs-1.42.3.wc3-7.el6.x86_64. I don't know if that would explain why the OSTs got corrupted. |
| Comment by Andreas Dilger [ 10/Aug/15 ] |
|
I see from one of the earlier comments that these are "28 22TB OSTs". In this case, I'd recommend updating to the latest e2fsprogs-1.42.12.wc1, since it includes a large number of fixes made after 1.42.3.wc3 was released 3 years ago. There definitely have been bugs fixed relating to filesystem sizes over 16TB since that time.
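A quick way to confirm what the servers are actually running before and after such an update (assuming RPM-based TOSS server images):
rpm -q e2fsprogs   # packaged version on the server
debugfs -V         # version reported by the installed tools themselves
|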
| Comment by Ruth Klundt (Inactive) [ 10/Aug/15 ] |
|
Thanks for the advice, Andreas. We have found only small files in this condition so far, and we are slowly restoring the items users request. The file system is up and running, so we're probably not critical anymore. I'll leave it to Joe if there is more he would like to investigate with regard to root cause before closing the ticket. |
| Comment by Joe Mervini [ 11/Aug/15 ] |
|
Yes - we might as well close it. I was hoping that Intel might have an idea as to the root cause. My theory is that something changed fundamentally in the way the MDS treats files that don't fill an entire stripe, since the situation only presented itself after bringing the file system back online under 2.1.4. That isn't something I would have expected from a minor version update. In any event, since this was the last of our file systems running the old code, we should not encounter the same problem in the future. |
| Comment by John Fuchs-Chesney (Inactive) [ 13/Aug/15 ] |
|
Joe, there are a number of fixes for large file systems that have been applied in more recent Lustre versions, and it would be quite time-consuming to try to identify exactly what the cause was here. Thanks, |