Details

    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.6
    • Environment: Toss 2.13 - Lustre 2.1.4
    • Severity: 4
    • 9223372036854775807

    Description

      We recently ran into LBUG errors running 2.5.x Lustre clients against Lustre 2.1.2 servers; the resolution was to update the servers to 2.1.4. In all cases we encountered data loss, in that files that previously existed showed zero file length. The assumption at the time was that this file loss was due to the numerous file system crashes we encountered prior to the software update.

      This past Friday our last file system running 2.1.2 went down unexpectedly. Since we do not routinely take our file systems down due to demand, and out of a desire to preempt the issues we encountered on the other file systems, I updated this file system during the outage. Because the OSTs went read-only I performed fscks on all the targets as well as the MDT, as I routinely do, and they came back clean with the exception of a number of "free inode count wrong" and "free block count wrong" messages, which in my experience is normal.
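
      (For reference, the per-target checks are of this general form; the device names below are placeholders, and this assumes the Lustre-patched e2fsprogs:)

        e2fsck -f /dev/mapper/mdt0        # MDT; add -p or -y to auto-fix as appropriate
        for i in $(seq 0 27); do          # one pass per OST (28 OSTs in this file system)
            e2fsck -f /dev/mapper/ost$i
        done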

      When the file system was returned to service everything appeared fine, but users started reporting that even though they could stat files, trying to open them came back with "no such file or directory". The file system was immediately taken down, and a subsequent fsck of the OSTs - which took several hours - put millions of files into lost+found. The MDT came back clean as before. This was the same scenario experienced on the file systems that encountered the crashes. As on the other file systems, I needed to use ll_recover_lost_found_objs to restore the objects, and then ran another fsck as a sanity check.
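
      (The restore step is of this general form; a sketch with a placeholder device name, assuming the OST is mounted as ldiskfs:)

        # Mount the OST backing device as ldiskfs, then sweep its lost+found:
        mount -t ldiskfs /dev/mapper/ost0013 /mnt/lustre/local/scratch1-OST0013
        ll_recover_lost_found_objs -d /mnt/lustre/local/scratch1-OST0013/lost+found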

      Remounting the file system on a 2.1.4 client shows file sizes, but the files cannot be opened. On a 2.5.4 client the files show zero file length.

      An attempt was made to go back to 2.1.2, but that was impossible because mounting the MDT under lustre produced a "Stale NFS file handle" message.

      lfs getstripe on a sampling of the inaccessible files shows the objects, and using debugfs to examine those objects shows data in them; in the case of text/ascii files the contents can be easily read.
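
      (The spot checks are of this general form; the client path and device name below are placeholders, and the object path follows the O/<group>/d<objid mod 32>/<objid> layout visible in the restore log later in this ticket:)

        lfs getstripe /scratch1/path/to/bad-file        # reports (obdidx, objid) pairs
        # Then, on the OSS holding that OST, dump one object, e.g. objid 2326194:
        debugfs -c -R "dump O/0/d18/2326194 /tmp/obj.2326194" /dev/mapper/ost0013
        file /tmp/obj.2326194                           # text/ascii content reads directly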

      Right now we are in a down and critical state.

      Attachments

        1. cat-lustre-log.txt
          0.2 kB
        2. debug.txt
          4 kB
        3. lustre-log.txt
          0.2 kB

        Activity

          [LU-6945] Clients reporting missing files
          jamervi Joe Mervini added a comment -

          I did not log the output from the recovery process. But every file that was in lost+found was restored, leaving nothing behind, and I was watching the recovery as it happened. I was able to scroll back through one of my screens to get some of the output before it ran out of buffer. Here is a sample:

          Object /mnt/lustre/local/scratch1-OST0013/O/0/d18/2326194 restored.
          Object /mnt/lustre/local/scratch1-OST0013/O/0/d19/2326195 restored.
          Object /mnt/lustre/local/scratch1-OST0013/O/0/d20/2326196 restored.
          Object /mnt/lustre/local/scratch1-OST0013/O/0/d21/2326197 restored.
          Object /mnt/lustre/local/scratch1-OST0013/O/0/d22/2326198 restored.
          Object /mnt/lustre/local/scratch1-OST0013/O/0/d23/2326199 restored.
          Object /mnt/lustre/local/scratch1-OST0013/O/0/d24/2326200 restored.
          Object /mnt/lustre/local/scratch1-OST0013/O/0/d25/2326201 restored.
          Object /mnt/lustre/local/scratch1-OST0013/O/0/d26/2326202 restored.
          Object /mnt/lustre/local/scratch1-OST0013/O/0/d27/2326203 restored.
          Object /mnt/lustre/local/scratch1-OST0013/O/0/d28/2326204 restored.

          And that's pretty much what I observed throughout the process. I didn't see any messages other than "restored".

          green Oleg Drokin added a comment -

          So with ll_recover_lost_found_objs - you did not happen to run it in -v mode and save the output, did you? I just want to see if any of the now-missing objects were in fact recovered and then deleted by something again.

          Additionally, did you see that lost+found was empty after ll_recover_lost_found_objs was run?
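
          (If it gets rerun, a capture along these lines would show what I'm after; mount point as in the log excerpt above, log path arbitrary:)

            ll_recover_lost_found_objs -v -d /mnt/lustre/local/scratch1-OST0013/lost+found \
                2>&1 | tee /root/recover-OST0013.log
            find /mnt/lustre/local/scratch1-OST0013/lost+found -mindepth 1 | wc -l  # 0 = nothing left behind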


          ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment -

          Yes, that seems to be the question. The next file I checked, size 2321, was there in the first stripe, but all other stripes did not exist on the OSTs.

          There were several fscks run; I'll defer to Joe for questions about them since I wasn't around for that. I believe he has stored output from them. Is there a chance that he hit a version of fsck with a bug in it?

          green Oleg Drokin added a comment -

          I imagine the size might be received from the MDS, because we started to store size on the MDS some time ago; 2.5 might be disregarding this info in the face of missing objects while 2.1 did not (I can probably check this).

          The more important issue is: if all the files are missing at least one object, where did all of those objects disappear to, and how do we get them back?
          I know you already did e2fsck and relinked all the files back into place, so supposedly nothing should be lost anymore?

          jamervi Joe Mervini added a comment -

          John - It is not an appliance. The file system consists of one SFA12K 5-stack front-ended by 6 Dell R720s (2 MDSs / 4 OSSs). There are 28 22TB OSTs. The storage system was purchased through DDN.


          ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment -

          It is confusing, for sure. The files reported by the user as "could not open" behave this way, from what we have seen so far. There are many; we've only looked at a couple of them:

          On a 2.1 client: size appears correct (non-zero), stat shows it as a regular file, and /bin/cat at the cmd line gets "no such file or directory".

          On a 2.5 client: size is 0, stat shows it as an empty regular file, and /bin/cat at the cmd line reports "no such file or directory".

          lfs getstripe looks ok in both places, and the first couple of objects exist. The text can be dumped with debugfs from those objects. I can double check; perhaps they are all missing one of their objects.

          green Oleg Drokin added a comment -

          Does that mean there are files that cat works on with 2.1 clients but not with 2.5 clients? Identical lfs getstripe output there too?
          Or do files that work in 2.1 also work in 2.5, at least size-wise? Where do accesses to those fail, and with what error?


          jfc John Fuchs-Chesney (Inactive) added a comment -

          Ruth,

          I don't want to distract you from the main dialog you are having – but if you can give us some details of the hardware you are using this will be helpful.

          Is this a DDN Exascaler system, or something different?
          How many OSS's in the system?
          Who has supplied the storage system?

          Thanks,
          ~ jfc.


          ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment -

          That object is from another similarly missing file in the same dir. And object 15976682 does not exist; it is the 4th of 4 stripes.

          on 2.5 client:
          lmm_magic: 0x0BD10BD0
          lmm_seq: 0x21b8413ab
          lmm_object_id: 0xd1bc
          lmm_stripe_count: 4
          lmm_stripe_size: 1048576
          lmm_pattern: 1
          lmm_layout_gen: 0
          lmm_stripe_offset: 9
          obdidx        objid        objid    group
               9     17272775    0x1078fc7        0
              30     15973650     0xf3bd12        0
              28     15985690     0xf3ec1a        0
              36     15976682     0xf3c8ea        0

          On 2.1 client:
          lmm_magic: 0x0BD10BD0
          lmm_seq: 0x21b8413ab
          lmm_object_id: 0xd1bc
          lmm_stripe_count: 4
          lmm_stripe_size: 1048576
          lmm_stripe_pattern: 1
          lmm_stripe_offset: 9
          obdidx        objid        objid    group
               9     17272775    0x1078fc7        0
              30     15973650     0xf3bd12        0
              28     15985690     0xf3ec1a        0
              36     15976682     0xf3c8ea        0

          cat reports "no such file or directory" on both the 2.1 and 2.5 clients.

          We apparently did not check for every object when we were looking at small text files with debugfs.

          green Oleg Drokin added a comment -

          And when I say mount the OST, I mean mount it as ldiskfs - this is safe to do even while the OST itself is up; the kernel knows how to moderate those accesses.
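
          (Concretely, something like this, with a placeholder device name:)

            mount -t ldiskfs -o ro /dev/mapper/ost0013 /mnt/ost_inspect   # read-only is fine for inspection
            ls /mnt/ost_inspect/O/0/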

          green Oleg Drokin added a comment -

          "lvbo_init failed for resource 15976682 " basically tells us the lock failed because the object does not exist (-2).
          Hence you get the client problem. So there's no need to get any more client logs.
          Instead if you can mount the ost this message comes from (I know you cannot tell it from the message so you'll need to trigger it for a known file and work from there) and then find this object in the O/.../ directory to make sure it's really there.
          I onder if 2.5 clients just form the resource names differently than 2.1 (though they should not, also show lfs getstripe output for that same file to see what the object is believed to be. Please do it on both 2.1 and 2.5 clients to see if the numbers match too).
          Additionally when you say that the 2.1 clients cannot open the file - do they get an error from the open? what is the error? Becuse seeing hte size means they do lock the correct resource at least.
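
          (To make the lookup concrete: on ldiskfs the objects sit at O/<group>/d<objid mod 32>/<objid>, matching the restored-object paths earlier in this ticket, and 15976682 mod 32 = 10. So, with a placeholder device name:)

            debugfs -c -R "stat O/0/d10/15976682" /dev/mapper/ostNN
            # "File not found by ext2_lookup" here would confirm the object really is gone (-2 / ENOENT)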

          green Oleg Drokin added a comment - "lvbo_init failed for resource 15976682 " basically tells us the lock failed because the object does not exist (-2). Hence you get the client problem. So there's no need to get any more client logs. Instead if you can mount the ost this message comes from (I know you cannot tell it from the message so you'll need to trigger it for a known file and work from there) and then find this object in the O/.../ directory to make sure it's really there. I onder if 2.5 clients just form the resource names differently than 2.1 (though they should not, also show lfs getstripe output for that same file to see what the object is believed to be. Please do it on both 2.1 and 2.5 clients to see if the numbers match too). Additionally when you say that the 2.1 clients cannot open the file - do they get an error from the open? what is the error? Becuse seeing hte size means they do lock the correct resource at least.

          People

            Assignee: green Oleg Drokin
            Reporter: jamervi Joe Mervini
            Votes: 0
            Watchers: 9
