Details

    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • None
    • Lustre 2.1.6
    • Environment: Toss 2.13 - Lustre 2.1.4
    • Severity: 4
    • 9223372036854775807

    Description

      We recently ran into LBUG errors when running the 2.5.x Lustre client against Lustre 2.1.2; the resolution was to update to 2.1.4. In all cases we encountered data loss: files that previously existed showed zero file length. The assumption at the time was that this file loss was due to the numerous file system crashes we encountered prior to the software update.

      This past Friday our last file system running 2.1.2 went down unexpectedly. Since we do not routinely take our file systems down due to demand, and out of a desire to preemptively prevent the issues we encountered on the other file systems, I updated the file system during the outage. Because the OSTs went read-only, I performed fscks on all the targets as well as the MDT, as I routinely do, and they came back clean with the exception of a number of "free inode count wrong" and "free block count wrong" messages - which in my experience is normal.

      When the file system was returned to service everything appeared fine, but users started reporting that even though they could stat files, trying to open them returned "no such file or directory". The file system was immediately taken down, and a subsequent fsck of the OSTs - which took several hours - put millions of files into lost+found. The MDT came back clean as before. This was the same scenario as was experienced on the file systems that encountered the crashes. As was the case on the other file systems, I needed to use ll_recover_lost_found_objs to restore the objects and then ran another fsck as a sanity check.
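      For context, a minimal sketch of that repair sequence on one OST, assuming ldiskfs backing targets; the device and mount paths below are placeholders, not our actual configuration:

        # Check the unmounted OST's ldiskfs backing store; orphaned objects may land in lost+found.
        e2fsck -fy /dev/ost_device

        # Mount the target as plain ldiskfs so lost+found can be examined.
        mount -t ldiskfs /dev/ost_device /mnt/ost_ldiskfs

        # Move objects that fsck placed in lost+found back to their O/0/d* locations.
        ll_recover_lost_found_objs -d /mnt/ost_ldiskfs/lost+found

        umount /mnt/ost_ldiskfs

        # Re-run a read-only check as a sanity pass.
        e2fsck -fn /dev/ost_device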

      Remounting the file system on a 2.1.4 client shows file sizes, but the files cannot be opened. On a 2.5.4 client the files show zero file length.

      An attempt was made to go back to 2.1.2, but that was impossible because mounting the MDT under lustre produced a "Stale NFS file handle" message.

      lfs getstripe on a sampling of the inaccessible files shows the objects, and using debugfs to examine those objects shows data in them; in the case of text/ASCII files the contents can be easily read.
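      A minimal sketch of that check; the file path, object ID, and OST device below are placeholders:

        # On a client: list the OST index and object ID for each stripe of the file.
        lfs getstripe /lustre/path/to/file

        # On the OSS serving that OST: 2.1-era ldiskfs objects live under O/0/d(objid % 32)/objid.
        OBJID=12345678
        debugfs -c -R "dump /O/0/d$((OBJID % 32))/$OBJID /tmp/obj.$OBJID" /dev/ost_device

        # For text/ASCII data the dumped object can be inspected directly.
        file /tmp/obj.$OBJID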

      Right now we are in a down and critical state.

      Attachments

        1. cat-lustre-log.txt
          0.2 kB
        2. debug.txt
          4 kB
        3. lustre-log.txt
          0.2 kB

        Activity

          [LU-6945] Clients reporting missing files

          jfc John Fuchs-Chesney (Inactive) added a comment -

          Joe,
          We are going to close this out as you suggest.

          There are a number of fixes for large file systems that have been applied in more recent Lustre versions, and it would be quite time-consuming to try to identify exactly what was the cause here.

          Thanks,
          ~ jfc.

          jamervi Joe Mervini added a comment -

          Yes - we might as well close it. I was hoping that Intel might have an idea as to the root cause. My theory is that something changed fundamentally in the way the MDS treats files that don't fill an entire stripe, since the situation only presented itself after bringing the file system back online under the 2.1.4 version. That isn't something I would have expected in a minor version update.

          In any event, since this was the last of the file systems running the old code we should not encounter the same problem in the future.


          ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment -

          Thanks for the advice, Andreas. We have found only small files in this condition so far, and we are slowly restoring the items users request. The file system is up and running, so we're probably not critical anymore.

          I'll leave it to Joe if there is more he would like to investigate with regard to root cause before closing the ticket.


          adilger Andreas Dilger added a comment -

          I see from one of the earlier comments that these are "28 22TB OSTs". In this case, I'd recommend updating to the latest e2fsprogs-1.42.12.wc1, since it includes a large number of fixes made since 1.42.3.wc3 was released 3 years ago. There definitely were bugs fixed related to filesystem sizes over 16TB in that time.
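          For reference, a quick way to confirm which e2fsprogs the servers are actually running; the package name assumes the RHEL6-based TOSS images mentioned in this ticket:

            rpm -q e2fsprogs     # installed package, e.g. e2fsprogs-1.42.x.wcY-...
            e2fsck -V            # version reported by the tool itself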

          jamervi Joe Mervini added a comment -

          The version of e2fsprogs in the image that was running 2.1.2 was e2fsprogs-1.42.3.wc3-7.el6.x86_64.

          I don't know if that would explain why the OSTs got corrupted.

          green Oleg Drokin added a comment -

          If it's the first stripe, I imagine you can just copy the object file out of the OST file system directly, and that would be the content.
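          A rough sketch of that approach, assuming a 2.1-era ldiskfs OST; the device, mount point, and object ID are placeholders (the objid comes from lfs getstripe for stripe 0):

            # Mount the OST backing store read-only as plain ldiskfs.
            mount -t ldiskfs -o ro /dev/ost_device /mnt/ost_ldiskfs

            # Objects sit under O/0/d(objid % 32)/objid; for a single-stripe file this is the whole file.
            OBJID=12345678
            cp /mnt/ost_ldiskfs/O/0/d$((OBJID % 32))/$OBJID /tmp/recovered.$OBJID

            umount /mnt/ost_ldiskfs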


          ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment -

          The first 75 files I ran through the debugfs dump script were all either 1 or 2 stripes, total size < 2 MiB. I'll need to get the user to verify the sanity of the files.

          PS: dd gets "no such file or directory" on these files.


          ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment -

          The upgrade on the server was from Lustre 2.1.2 -> 2.1.4. The clients are generally running the 2.5.4 LLNL version; we have a 2.1.4 client off to the side.

          The version of e2fsprogs on the servers right now is:
          e2fsprogs-1.42.7.wc2-7.el6.x86_64

          I also have a script on the back end that uses debugfs to dump objects and note the missing ones. Joe mentioned that the fscks appeared to succeed, so we're also puzzled about where the objects went. They don't show up in lost+found as having been in there before and deleted.

          Is it possible that last_id's were out of order at some point, and the empty objects were deleted as orphans? But in that case it should affect only newish files?
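          If it helps to check that theory, a rough sketch of comparing the preallocation counters on the MDS against each OST's LAST_ID; the parameter and path names are as I understand them for 2.x, and the device name is a placeholder:

            # On the MDS: object IDs preallocated from each OST.
            lctl get_param osc.*.prealloc_last_id osc.*.prealloc_next_id

            # On each OSS: the last object ID each OST has handed out.
            lctl get_param obdfilter.*.last_id

            # On-disk LAST_ID for one OST (with the target unmounted from Lustre).
            debugfs -c -R 'dump /O/0/LAST_ID /tmp/LAST_ID' /dev/ost_device
            od -Ax -td8 /tmp/LAST_ID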


          People

            Assignee: green Oleg Drokin
            Reporter: jamervi Joe Mervini
            Votes: 0
            Watchers: 9

            Dates

              Created:
              Updated:
              Resolved: