[LU-6945] Clients reporting missing files - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.1.6
Labels:
- mdt
Environment:
Toss 2.13 - Lustre 2.1.4

Severity:
4
Rank (Obsolete):
9223372036854775807

Description

We recently ran into LBUG errors with running the 2.5.x Lustre client against Lustre 2.1.2 that’s resolution was to update the version to 2.1.4. In all cases we encountered data loss in that files that previously existed show zero file length. The assumption at the time was that this file loss was due to numerous file system crashes that we encountered prior to the the software update.

This past Friday our last file system running 2.1.2 went down unexpectedly. Since we do not routinely take our file systems down due to demand, and a desire to preemptively prevent the issues that we encountered on the other file systems I update the file system during the outage. Because the OSTs went read-only I performed fsck’s on all the targets as well as the MDT as I routinely do, and they came back cleanly with the exception of a number of free inode count wrong and free block count wrong messages - which in my experience is normal.

When the file system was returned to service everything appeared fine but users started reporting that even though they could stat files, when trying to open them they came back as “no such file or directory”. The file system was immediately taken down and a subsequent fsck of the OSTs - which took several hours - put millions of files into lost+found. The MDT came back clean as before. This was the same scenario as was experienced the file systems that encountered the crashes. As was the case on the other file systems I need to use ll_recover_lost_found_objs to restore the objects and then ran another fsck as a sanity check.

Remounting the file system on a 2.1.4 client show file sizes but can not be opened. On a 2.5.4 client the files show zero file length.

An attempt was made to go back to 2.1.2 but that was impossible because mounting the MDT under lustre product a “Stale NFS file handle” message.

lfs getstripe on a sampling files that are inaccessible shows the objects and using debugfs to examine the objects show data in the objects and in the case of text/ascii files they can be easily read.

Right now we are in a down and critical state.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

cat-lustre-log.txt
03/Aug/15 9:11 PM
0.2 kB
Ruth Klundt
debug.txt
03/Aug/15 8:25 PM
4 kB
Joe Mervini
lustre-log.txt
03/Aug/15 8:25 PM
0.2 kB
Joe Mervini

Activity

People

Assignee:: Oleg Drokin

Reporter:: Joe Mervini

Votes:: 0 Vote for this issue

Watchers:: 9 Start watching this issue

Dates

Created:: 03/Aug/15 8:25 PM

Updated:: 13/Aug/15 1:03 AM

Resolved:: 13/Aug/15 1:03 AM