Lustre / LU-6945

Clients reporting missing files

Details

    • Bug
    • Resolution: Done
    • Critical
    • None
    • Lustre 2.1.6
    • Toss 2.13 - Lustre 2.1.4
    • 4

    Description

      We recently ran into LBUG errors when running the 2.5.x Lustre client against Lustre 2.1.2; the resolution was to update to 2.1.4. In all cases we encountered data loss, in that files that previously existed showed zero length. The assumption at the time was that this file loss was due to the numerous file system crashes we encountered prior to the software update.

      This past Friday our last file system running 2.1.2 went down unexpectedly. Since demand keeps us from routinely taking our file systems down, and wanting to preempt the issues we encountered on the other file systems, I updated the file system during the outage. Because the OSTs went read-only, I ran fsck on all the targets as well as the MDT, as I routinely do. They came back clean except for a number of “free inode count wrong” and “free block count wrong” messages, which in my experience is normal.

      When the file system was returned to service everything appeared fine, but users started reporting that although they could stat files, attempts to open them failed with “no such file or directory”. The file system was immediately taken down, and a subsequent fsck of the OSTs - which took several hours - put millions of files into lost+found. The MDT came back clean as before. This was the same scenario we experienced on the file systems that encountered the crashes. As on those file systems, I needed to use ll_recover_lost_found_objs to restore the objects, and then ran another fsck as a sanity check.

      After remounting the file system, a 2.1.4 client shows file sizes but the files cannot be opened. On a 2.5.4 client the files show zero length.

      An attempt was made to go back to 2.1.2, but that was impossible because mounting the MDT under Lustre produced a “Stale NFS file handle” message.

      Running lfs getstripe on a sampling of the inaccessible files shows the objects, and examining those objects with debugfs shows that they contain data; text/ASCII files can be read easily.
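
      For anyone repeating this check, the lookup from lfs getstripe output to a debugfs request can be sketched roughly as below. This is only an illustration, not the exact procedure used here: the sample output, the device name /dev/sdb, and the helper names are hypothetical, and the /O/0/d(objid % 32)/objid object path is an assumption about the ldiskfs OST layout that should be verified on the actual targets.

```python
import re

# Hypothetical `lfs getstripe` output; real column spacing may differ.
SAMPLE = """/mnt/lustre/file
lmm_stripe_count:   2
lmm_stripe_size:    1048576
lmm_stripe_offset:  0
\tobdidx\t\t objid\t\t objid\t\t group
\t     0\t          34976\t       0x88a0\t             0
\t     1\t          34977\t       0x88a1\t             0
"""

def parse_getstripe(text):
    """Return (obdidx, objid) pairs from the stripe table rows."""
    pairs = []
    for line in text.splitlines():
        # Data rows: obdidx, objid (decimal), objid (hex), group
        m = re.match(r"\s*(\d+)\s+(\d+)\s+0x[0-9a-fA-F]+\s+\S+\s*$", line)
        if m:
            pairs.append((int(m.group(1)), int(m.group(2))))
    return pairs

def object_path(objid, group=0):
    # Assumed layout: objects are spread over 32 subdirectories by objid % 32.
    return f"/O/{group}/d{objid % 32}/{objid}"

def debugfs_command(objid, device):
    # -c opens the device read-only in catastrophic mode; -R runs one request.
    return f'debugfs -c -R "stat {object_path(objid)}" {device}'

if __name__ == "__main__":
    for obdidx, objid in parse_getstripe(SAMPLE):
        # /dev/sdb is a placeholder; use the device backing OST index obdidx.
        print(obdidx, debugfs_command(objid, "/dev/sdb"))
```

      The script only builds the command strings; each one would still have to be run on the OSS that hosts the OST with the matching obdidx.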

      Right now we are in a down and critical state.

      Attachments

        1. cat-lustre-log.txt
          0.2 kB
        2. debug.txt
          4 kB
        3. lustre-log.txt
          0.2 kB

          People

            green Oleg Drokin
            jamervi Joe Mervini
            Votes: 0
            Watchers: 9
