  1. Lustre
  2. LU-5239

Recovery of small files with corrupt objects


Details

    • Bug
    • Resolution: Not a Bug
    • Major
    • None
    • Lustre 2.4.1
    • None
    • RHEL6.4/distro IB kernel 2.6.32-358.18.1.el6
    • 3
    • 14611

    Description

      We had a backend storage issue on 5/30 that corrupted a number of blocks on the filesystem across different OSTs. Since then we have recovered all filesystem structures with e2fsck and identified what we thought were all of the affected files. Just recently, we discovered a new scenario in which inodes were corrupted and so were cleared by e2fsck. We have identified 665 such files; an ls or stat on any of them returns "Cannot allocate memory". Syslog shows the error:

      Jun 20 20:53:15 f1-oss1d5 kernel: [853846.084587] LustreError: 14391:0:(ldlm_resource.c:1165:ldlm_resource_get()) f1-OST00bc: lvbo_init failed for resource 0xd4805:0x0: rc = -2

      This is expected because object 0xd4805 on f1-OST00bc is invalid (its inode on f1-OST00bc was cleared by e2fsck).
      gaea9:/tmp # lfs getstripe file.F90
      lmm_stripe_count: 4
      lmm_stripe_size: 1048576
      lmm_stripe_offset: 186
      obdidx     objid      objid    group
         186    871222    0xd4b36        0
         187    870647    0xd48f7        0
         188    870405    0xd4805        0
         189    869971    0xd4653        0
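
The link between the syslog message and the layout above can be checked mechanically. What follows is a minimal Python sketch, not part of the original report: the helper names `parse_lvbo_error` and `find_stripe` are hypothetical, and the regular expression assumes the message format of the single syslog line quoted in this ticket. It relies on the OST index in the target name being hexadecimal (0xbc = 188).

```python
import errno
import re

# Matches the "lvbo_init failed" message format quoted in this ticket.
# Assumption: other Lustre versions may word this message differently.
LVBO_RE = re.compile(
    r"(?P<ost>\S+-OST[0-9a-fA-F]+): lvbo_init failed for resource "
    r"(?P<objid>0x[0-9a-fA-F]+):0x[0-9a-fA-F]+: rc = (?P<rc>-?\d+)"
)


def parse_lvbo_error(line):
    """Extract OST name, OST index, object ID and errno name from a log line."""
    m = LVBO_RE.search(line)
    if m is None:
        return None
    rc = int(m.group("rc"))
    return {
        "ost": m.group("ost"),
        # The suffix after "OST" is the target index in hexadecimal.
        "ost_index": int(m.group("ost").rsplit("OST", 1)[1], 16),
        "objid": int(m.group("objid"), 16),
        # Kernel return codes are negated errno values.
        "errno_name": errno.errorcode.get(-rc, "unknown"),
    }


def find_stripe(layout, ost_index, objid):
    """Return the zero-based stripe number of (ost_index, objid), or None."""
    for stripe_no, entry in enumerate(layout):
        if entry == (ost_index, objid):
            return stripe_no
    return None


LOG_LINE = (
    "Jun 20 20:53:15 f1-oss1d5 kernel: [853846.084587] LustreError: "
    "14391:0:(ldlm_resource.c:1165:ldlm_resource_get()) f1-OST00bc: "
    "lvbo_init failed for resource 0xd4805:0x0: rc = -2"
)
# (obdidx, objid) pairs copied from the lfs getstripe output above.
LAYOUT = [(186, 871222), (187, 870647), (188, 870405), (189, 869971)]

err = parse_lvbo_error(LOG_LINE)
print(err)
print("stripe number:", find_stripe(LAYOUT, err["ost_index"], err["objid"]))
```

With the values from this ticket, the parsed object ID is 870405 on OST index 188, which is the third entry (stripe number 2) of the layout, and rc = -2 maps to ENOENT.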

      We would like to attempt recovery of small files <3MB (stripe count 4), where the layout might position the missing object after EOF. We thought a dd if=bad.file of=good.file would return success if EOF was reached before the missing object. However, this method fails with "Cannot allocate memory" even for small files, with dd reporting that it copied only a few kB.
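
The reasoning about which objects a small file actually occupies can be sketched in Python. This is an illustration under stated assumptions, not a Lustre API: it assumes plain round-robin (RAID-0) striping with the stripe size and stripe count reported by lfs getstripe, a file without holes, and the hypothetical helper names `stripe_for_offset` and `stripes_with_data`.

```python
def stripe_for_offset(offset, stripe_size, stripe_count):
    """Map a file offset to (stripe_number, offset_within_that_object)."""
    chunk = offset // stripe_size              # index of the stripe-sized chunk
    stripe_no = chunk % stripe_count           # chunks rotate across the objects
    obj_offset = (chunk // stripe_count) * stripe_size + offset % stripe_size
    return stripe_no, obj_offset


def stripes_with_data(file_size, stripe_size, stripe_count):
    """Return the stripe numbers that hold at least one byte of the file."""
    if file_size <= 0:
        return []
    chunks = (file_size + stripe_size - 1) // stripe_size  # ceiling division
    return list(range(min(chunks, stripe_count)))


STRIPE_SIZE = 1048576   # lmm_stripe_size from the getstripe output
STRIPE_COUNT = 4        # lmm_stripe_count from the getstripe output
BAD_STRIPE = 2          # the third object (zero-based), on obdidx 188

MIB = 1024 * 1024
for size in (512 * 1024, 2 * MIB, 2 * MIB + 1):
    touched = stripes_with_data(size, STRIPE_SIZE, STRIPE_COUNT)
    print(size, touched, "bad object holds data:", BAD_STRIPE in touched)
```

Under these assumptions, file data reaches the third object only once the file grows past 2 MiB, so a file smaller than 1 MiB keeps all of its data in the first object.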

      Why does dd fail to read even files smaller than 1MB, where the bad object is the third of the four stripe objects?
      gaea9:/tmp # dd if=file.F90 of=/tmp/good.out
      dd: reading `file.F90': Cannot allocate memory
      33+0 records in
      33+0 records out
      16896 bytes (17 kB) copied, 0.0531134 s, 318 kB/s

      The resulting good.out is incomplete.

      Is there an alternative method to successfully read to EOF for small files?

      This is not causing downtime, but we would like to recover these files as quickly as reasonably possible.

          People

            bobijam Zhenyu Xu
            blakecaldwell Blake Caldwell