Details
-
Bug
-
Resolution: Done
-
Critical
-
None
-
Lustre 2.1.6
-
Toss 2.13 - Lustre 2.1.4
-
4
-
9223372036854775807
Description
We recently ran into LBUG errors when running the 2.5.x Lustre client against Lustre 2.1.2; the resolution was to update to 2.1.4. In all cases we encountered data loss, in that files that previously existed showed zero file length. The assumption at the time was that this file loss was due to the numerous file system crashes we encountered prior to the software update.
This past Friday our last file system running 2.1.2 went down unexpectedly. Since we do not routinely take our file systems down due to demand, and out of a desire to preempt the issues we encountered on the other file systems, I updated the file system during the outage. Because the OSTs went read-only, I ran fsck on all the targets as well as the MDT, as I routinely do. They came back clean, with the exception of a number of “free inode count wrong” and “free block count wrong” messages, which in my experience is normal.
When the file system was returned to service everything appeared fine, but users started reporting that even though they could stat files, trying to open them returned “no such file or directory”. The file system was immediately taken down, and a subsequent fsck of the OSTs, which took several hours, put millions of files into lost+found. The MDT came back clean as before. This was the same scenario we experienced on the file systems that encountered the crashes. As was the case on the other file systems, I needed to use ll_recover_lost_found_objs to restore the objects, and then ran another fsck as a sanity check.
After remounting the file system, a 2.1.4 client shows file sizes but the files cannot be opened. On a 2.5.4 client the files show zero file length.
An attempt was made to go back to 2.1.2, but that was impossible because mounting the MDT under Lustre produced a “Stale NFS file handle” message.
Running lfs getstripe on a sampling of the inaccessible files shows the objects, and using debugfs to examine those objects shows data in them; text/ASCII files can be easily read.
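For anyone repeating that check, here is a minimal sketch of how an object ID reported by lfs getstripe could be turned into the on-disk path to hand to debugfs. It assumes the layout visible in the recovery output quoted in this ticket, where an object in sequence 0 lives under O/0/d<objid mod 32>/<objid>. The helper names and the device path in the usage note are hypothetical, and the command is only built as a string, not executed.

```python
import shlex


def ost_object_path(objid: int, seq: int = 0, subdirs: int = 32) -> str:
    """Return the object's path relative to the OST backing file system.

    Assumption: objects are hashed into `subdirs` directories by
    objid modulo `subdirs`, matching the sample paths in this report.
    """
    return f"O/{seq}/d{objid % subdirs}/{objid}"


def debugfs_stat_command(device: str, objid: int) -> str:
    """Build (but do not run) a read-only debugfs 'stat' command line.

    -c opens the device in catastrophic (read-only) mode and -R passes
    a single request; `device` is whatever block device backs the OST.
    """
    request = f"stat {ost_object_path(objid)}"
    return f"debugfs -c -R {shlex.quote(request)} {shlex.quote(device)}"
```

For example, `debugfs_stat_command("/dev/sdX", 2326194)` (with `/dev/sdX` standing in for the real OST device) yields a command that stats `O/0/d18/2326194`.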
Right now we are in a down and critical state.
I did not log the output from the recovery process, but every file that was in lost+found was restored, leaving nothing behind, and I was watching the recovery as it happened. I was able to scroll back through one of my screens to get to some of the output before it ran out of buffer. Here is a sample:
Object /mnt/lustre/local/scratch1-OST0013/O/0/d18/2326194 restored.
Object /mnt/lustre/local/scratch1-OST0013/O/0/d19/2326195 restored.
Object /mnt/lustre/local/scratch1-OST0013/O/0/d20/2326196 restored.
Object /mnt/lustre/local/scratch1-OST0013/O/0/d21/2326197 restored.
Object /mnt/lustre/local/scratch1-OST0013/O/0/d22/2326198 restored.
Object /mnt/lustre/local/scratch1-OST0013/O/0/d23/2326199 restored.
Object /mnt/lustre/local/scratch1-OST0013/O/0/d24/2326200 restored.
Object /mnt/lustre/local/scratch1-OST0013/O/0/d25/2326201 restored.
Object /mnt/lustre/local/scratch1-OST0013/O/0/d26/2326202 restored.
Object /mnt/lustre/local/scratch1-OST0013/O/0/d27/2326203 restored.
Object /mnt/lustre/local/scratch1-OST0013/O/0/d28/2326204 restored.
And that's pretty much what I observed throughout the process. I didn't see any messages other than “restored”.
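Had the full output been captured, it could be summarized mechanically instead of watched by eye. The following is a rough sketch, assuming every line has the "Object <path> restored." shape of the sample above and that the subdirectory should equal the object ID modulo 32, as it does in all of the sampled lines. The function name and return shape are illustrative only.

```python
import re
from collections import Counter

# Matches lines such as:
#   Object /mnt/.../scratch1-OST0013/O/0/d18/2326194 restored.
LINE_RE = re.compile(
    r"^Object (?P<mount>\S+)/O/(?P<seq>\d+)/d(?P<subdir>\d+)/(?P<objid>\d+) restored\.$"
)


def summarize_recovery_log(lines, subdirs=32):
    """Count restored objects per target and flag anything unexpected.

    Returns (per_target, misplaced, unparsed):
      per_target - Counter keyed by the last mount path component
      misplaced  - lines whose dN does not equal objid % subdirs
      unparsed   - non-empty lines that are not a 'restored' message
    """
    per_target = Counter()
    misplaced = []
    unparsed = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            unparsed.append(line)
            continue
        target = match.group("mount").rsplit("/", 1)[-1]
        per_target[target] += 1
        if int(match.group("objid")) % subdirs != int(match.group("subdir")):
            misplaced.append(line)
    return per_target, misplaced, unparsed
```

An empty `unparsed` list would back up the observation that nothing other than "restored" was printed, and an empty `misplaced` list would indicate the objects went back into the directories the layout assumption predicts.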