Lustre / LU-3668

ldiskfs_check_descriptors: Block bitmap for group not in group

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.6
    • Severity: 3
    • 9453

    Description

      Our $SCRATCH file system is down and we are unable to mount an OST because corrupted group descriptors are being reported.

      Symptoms:

      (1) cannot mount as normal lustre fs
      (2) also cannot mount as ldiskfs
      (3) e2fsck reports alarming number of issues

      Scenario:

      The OST is a RAID6 (8+2) config with external journals. At 18:06 yesterday, MD raid detected a disk error, evicted the failed disk, and started rebuilding the device with a hot spare. Before the rebuild finished, ldiskfs reported the error below and the device went read-only.

      Jul 29 22:16:40 oss28 kernel: [547129.288298] LDISKFS-fs error (device md14): ldiskfs_lookup: deleted inode referenced: 2463495
      Jul 29 22:16:40 oss28 kernel: [547129.298723] Aborting journal on device md24.
      Jul 29 22:16:40 oss28 kernel: [547129.304211] LustreError: 17212:0:(obd.h:1615:obd_transno_commit_cb()) scratch-OST0124: transno 176013176 commit error: 2
      Jul 29 22:16:40 oss28 kernel: [547129.316134] LustreError: 17212:0:(obd.h:1615:obd_transno_commit_cb()) scratch-OST0124: transno 176013175 commit error: 2
      Jul 29 22:16:40 oss28 kernel: [547129.316136] LDISKFS-fs error (device md14): ldiskfs_journal_start_sb: Detected aborted journal
      Jul 29 22:16:40 oss28 kernel: [547129.316139] LDISKFS-fs (md14): Remounting filesystem read-only

      The host was rebooted at 6am and we have been unable to mount the OST since. We would appreciate suggestions on the best approach (e2fsck, journal rebuilding, etc.) to try to recover this OST.

      I will follow up with output from e2fsck -f -n, which is running now (attempting to use a backup superblock). Typical entries look as follows:

      e2fsck 1.42.7.wc1 (12-Apr-2013)
      Inode table for group 3536 is not in group. (block 103079215118)
      WARNING: SEVERE DATA LOSS POSSIBLE.
      Relocate? no

      Block bitmap for group 3538 is not in group. (block 107524506255360)
      Relocate? no

      Inode bitmap for group 3538 is not in group. (block 18446612162378989568)
      Relocate? no

      Inode table for group 3539 is not in group. (block 3439182177370112)
      WARNING: SEVERE DATA LOSS POSSIBLE.
      Relocate? no

      Block bitmap for group 3541 is not in group. (block 138784755704397824)
      Relocate? no

      Inode table for group 3542 is not in group. (block 7138029487521792000)
      WARNING: SEVERE DATA LOSS POSSIBLE.
      Relocate? no

      Block bitmap for group 3544 is not in group. (block 180388626432)
      Relocate? no

      Inode table for group 3545 is not in group. (block 25769803776)
      WARNING: SEVERE DATA LOSS POSSIBLE.
      Relocate? no

      Block bitmap for group 3547 is not in group. (block 346054104973312)
      Relocate? no

      Inode 503 has compression flag set on filesystem without compression support.
      Clear? no

      Inode 503 has INDEX_FL flag set but is not a directory.
      Clear HTree index? no

      HTREE directory inode 503 has an invalid root node.
      Clear HTree index? no

      HTREE directory inode 503 has an unsupported hash version (40)
      Clear HTree index? no

      HTREE directory inode 503 uses an incompatible htree root node flag.
      Clear HTree index? no

      HTREE directory inode 503 has a tree depth (16) which is too big
      Clear HTree index? no

      Inode 503, i_blocks is 842359139, should be 0. Fix? no

      Inode 504 is in use, but has dtime set. Fix? no

      Inode 504 has imagic flag set. Clear? no

      Inode 504 has a extra size (25649) which is invalid
      Fix? no

      Inode 504 has INDEX_FL flag set but is not a directory.
      Clear HTree index? no

      Inode 562 has INDEX_FL flag set but is not a directory.
      Clear HTree index? no

      HTREE directory inode 562 has an invalid root node.
      Clear HTree index? no

      HTREE directory inode 562 has an unsupported hash version (51)
      Clear HTree index? no

      HTREE directory inode 562 has a tree depth (59) which is too big
      Clear HTree index? no

      Inode 562, i_blocks is 828596838, should be 0. Fix? no

      Inode 563 is in use, but has dtime set. Fix? no

      Inode 563 has imagic flag set. Clear? no

      Inode 563 has a extra size (12387) which is invalid
      Fix? no

      Block #623050609 (3039575950) causes file to be too big. IGNORED.
      Block #623050610 (3038656474) causes file to be too big. IGNORED.
      Block #623050611 (3037435566) causes file to be too big. IGNORED.
      Block #623050612 (3035215768) causes file to be too big. IGNORED.
      Block #623050613 (3031785159) causes file to be too big. IGNORED.
      Block #623050614 (3027736066) causes file to be too big. IGNORED.
      Block #623050615 (3019627313) causes file to be too big. IGNORED.
      Block #623050616 (2970766533) causes file to be too big. IGNORED.
      Block #623050617 (871157932) causes file to be too big. IGNORED.
      Block #623050618 (879167937) causes file to be too big. IGNORED.
      Block #623050619 (883249763) causes file to be too big. IGNORED.
      Block #623050620 (885943218) causes file to be too big. IGNORED.
      Too many illegal blocks in inode 1618.
      Clear inode? no

      Suppress messages? no
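
      For reference, the read-only pass against a backup superblock takes roughly this form (32768 is the usual first backup location for a 4 KB block size, so treat it as an assumption):

      dumpe2fs /dev/md14 | grep -i "superblock at"   # list primary/backup superblock locations
      e2fsck -fn -b 32768 -B 4096 /dev/md14          # read-only check (-n) using a backup superblock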

    Activity

            adilger Andreas Dilger added a comment:

            Closing this as "Not a Bug" since the problem ended up being in the MD RAID/parity rebuilding.

            I'm submitting a patch for "mkfs.lustre --replace" under LU-14, which is an existing ticket for OST replacement (which thankfully was not needed in this case).


            adilger Andreas Dilger added a comment:

            This will only recover the OST objects from l+f and not touch any of the data or metadata. If the old file was deleted and restored, it will have a different MDS inode with different objects, and there will be no impact from running ll_recover_lost_found_objs. If, for some reason, there are objects with the same objid in O/0/d*/ (e.g. some kind of manual recovery of OST objects was done), then the old objects will be left in l+f.


            minyard Tommy Minyard added a comment:

            Thanks for the information on the recovery of lost+found files; this is not something we have typically done in the past. We have a maintenance scheduled for next week and are planning to attempt the recovery at that time. One question has come up from a few of the users we contacted about their "lost" files: some have already copied the files we had planned to recover back from the tape library, and they want to make sure the ll_recover_lost_found_objs recovery will not overwrite the new files. How will the Lustre recovery behave if the previous file has been replaced with a new copy? My suspicion is that it will not overwrite the new file, but I just wanted to get your thoughts on this scenario.


            adilger Andreas Dilger added a comment:

            While I don't think any of the problems seen here relate to Lustre specifically, I'm going to leave this ticket open for implementation of "mkfs.lustre --replace --index=N", which just avoids setting the LDD_F_VIRGIN flag when formatting a new OST.
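
            Once implemented, the invocation would presumably look something like this (the fsname, MGS NID, and device below are placeholders, not from this ticket):

            # mkfs.lustre --ost --replace --index=2 --fsname=myth \
                  --mgsnode=192.168.0.10@tcp /dev/vgmyth/lvmythost2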

            adilger Andreas Dilger added a comment (edited):

            It is possible to recover the files in lost+found using the "ll_recover_lost_found_objs" tool. This will move the OST objects from lost+found back to their proper location, O/0/d{objid % 32}/{objid}, using the information stored in the "fid" xattr of each inode. Any objects that are zero-length have likely never been accessed and could be deleted. This needs to be done with the OST mounted as ldiskfs (it will eventually be done automatically when the LFSCK Phase 2 project is completed).
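
            A minimal sketch of that procedure, assuming the OST device from this ticket and a scratch mountpoint:

            # mount -t ldiskfs /dev/md14 /mnt/ost                   # mount the OST as plain ldiskfs
            # ll_recover_lost_found_objs -v -d /mnt/ost/lost+found  # move objects back under O/0/d*/ using the "fid" xattr
            # umount /mnt/ost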


            koomie Karl W Schulz (Inactive) added a comment:

            Update: the number of e2fsck issues observed was more substantial when we ran with fixes enabled than in the previous run with "-n". However, it did complete after about 2 hours and allowed us to mount via ldiskfs. The mountdata and LAST_ID files looked reasonable and we were subsequently able to mount as a Lustre fs. We do have a small percentage of files in lost+found, and we are going to leave this OST inactive on the MDS until the next maintenance, but it looks like we were able to recover the majority of the data in this case. Thanks for the help and suggestions today. We definitely have not seen anything quite like this before and are rebuilding the raidset with an alternate drive.
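
            A rough sketch of that kind of spot-check, with the OST mounted as ldiskfs (the mountpoint is a placeholder; LAST_ID holds a little-endian 64-bit object ID):

            # tunefs.lustre --print /dev/md14    # dump the CONFIGS/mountdata parameters (target name, index, flags)
            # mount -t ldiskfs /dev/md14 /mnt/ost
            # od -Ax -td8 /mnt/ost/O/0/LAST_ID   # last object ID allocated on this OST
            # umount /mnt/ost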


            koomie Karl W Schulz (Inactive) added a comment:

            Yes, it looks to be using -b 32768, as we can duplicate the results if we specify this value. Trying an actual fix now with e2fsck... fingers crossed.
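
            The fix pass would be essentially the same invocation with fixes applied instead of -n, e.g. (the exact options used are not recorded here, so this is an assumption):

            # e2fsck -fy -b 32768 -B 4096 /dev/md14   # -y applies the suggested fixes non-interactively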


            adilger Andreas Dilger added a comment:

            It looks like e2fsck is already trying one of the backup group descriptors and is able to find a backup that doesn't have any problems, so I would just let it proceed with the one it finds. If the first reported problem is at inode 11468804, that is at least half-way through the filesystem (at 128 inodes per group, per the previous dumpe2fs output), and inode 14291200 is about 60% through the filesystem, so I suspect e2fsck should be able to recover the majority of the filesystem reasonably well.

            It does make sense to allow e2fsck to progress for a while to verify that it isn't finding massive corruption later on, but from the snippet here it looks much better than before.
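
            The estimate can be sanity-checked against the superblock counts; at 128 inodes per group, inode 11468804 lands in group ~89600, which can be compared against the totals from:

            # dumpe2fs -h /dev/md14 | egrep -i "inode count|inodes per group"   # total inode count and inodes per group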


            minyard Tommy Minyard added a comment:

            One quick update: we stopped the array and restarted it without the spare drive that was added last night (so it is currently running with 9 out of 10 drives). At this point, the e2fsck output looks much better than before (see below). One question from our side: should we just let e2fsck use the default superblock, or should we specify one with the -b option? Also, should we be concerned about any of the errors that e2fsck has reported so far? Most look like no major issue, except maybe the first one about the resize inode not being valid. The current e2fsck run is not making any changes. Our plan now is to let it run, see how many errors it finds, and if it is not too bad, rerun it with the -p option to make some repairs. We will still need to add the 10th drive back in and let the array rebuild at some point, but right now we just want to make sure we have a valid MD array that will mount without error.

            [root@oss28.stampede]# e2fsck -fn -B 4096 /dev/md14
            e2fsck 1.42.7.wc1 (12-Apr-2013)
            ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
            e2fsck: Group descriptors look bad... trying backup blocks...
            Resize inode not valid. Recreate? no

            Pass 1: Checking inodes, blocks, and sizes
            Inode 11468804 has an invalid extent node (blk 2936017803, lblk 393)
            Clear? no

            Inode 11468804, i_blocks is 8264, should be 5024. Fix? no

            Inode 11534337 has an invalid extent node (blk 2952816317, lblk 764)
            Clear? no

            Inode 11534337, i_size is 4292608, should be 3129344. Fix? no

            Inode 11534337, i_blocks is 8408, should be 6128. Fix? no

            Inode 13092415 has an invalid extent node (blk 3523217944, lblk 0)
            Clear? no

            Inode 13092415, i_blocks is 544, should be 0. Fix? no

            Inode 14291200 has an invalid extent node (blk 3526886078, lblk 0)
            Clear? no

            Inode 14291200, i_blocks is 2056, should be 0. Fix? no


            adilger Andreas Dilger added a comment:

            This is the process to modify the /CONFIGS/mountdata file copied from OST0001 for OST0002, on my MythTV Lustre filesystem named "myth". I verified at the end that the generated "md2.bin" file was binary-identical to the one that already exists on OST0002.

            # mount -t ldiskfs /dev/vgmyth/lvmythost1 /mnt/tmp # mount other OST as ldiskfs
            # xxd /mnt/tmp/CONFIGS/mountdata > /tmp/md1.asc    # save mountdata for reference
            # xxd /mnt/tmp/CONFIGS/mountdata > /tmp/md2.asc    # save another one for editing
            # vi /tmp/md2.asc                                  # edit 0001 to 0002 in 3 places
            # xxd -r /tmp/md2.asc > /tmp/md2.bin               # convert modified one to binary
            # xxd -r /tmp/md2.bin > /tmp/md2.asc2              # convert back to ASCII to verify
            # diff -u /tmp/md1.asc /tmp/md2.asc2               # compare original and modified
            --- /tmp/md1.asc  2013-07-30 15:40:12.201994814 -0600
            +++ /tmp/md2.asc  2013-07-30 15:40:48.775245386 -0600
            @@ -1,14 +1,14 @@
             0000000: 0100 d01d 0000 0000 0000 0000 0000 0000  ................
            -0000010: 0300 0000 0200 0000 0100 0000 0100 0000  ................
            +0000010: 0300 0000 0200 0000 0200 0000 0100 0000  ................
             0000020: 6d79 7468 0065 0000 0000 0000 0000 0000  myth.e..........
             0000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
             0000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
             0000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
            -0000060: 6d79 7468 2d4f 5354 3030 3031 0000 0000  myth-OST0001....
            +0000060: 6d79 7468 2d4f 5354 3030 3032 0000 0000  myth-OST0002....
             0000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
             0000080: 0000 0000 0000 0000 0000 0000 0000 0000  ................
             0000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
            -00000a0: 6d79 7468 2d4f 5354 3030 3031 5f55 5549  myth-OST0001_UUI
            +00000a0: 6d79 7468 2d4f 5354 3030 3032 5f55 5549  myth-OST0002_UUI
             00000b0: 4400 0000 0000 0000 0000 0000 0000 0000  D...............
             00000c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
             00000d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
            

            Peter, it might make sense to allow a mkfs.lustre formatting option to clear the LDD_F_VIRGIN flag so that this binary editing dance isn't needed, and the "new" OST will not try to register with the MGS.
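
            As a follow-on sanity check, once the edited mountdata is written back to the new target, it can be dumped and compared (the OST0002 device path here is hypothetical):

            # mount -t ldiskfs /dev/vgmyth/lvmythost2 /mnt/tmp2   # hypothetical device for the new OST0002
            # cp /tmp/md2.bin /mnt/tmp2/CONFIGS/mountdata         # install the edited mountdata
            # umount /mnt/tmp2
            # tunefs.lustre --print /dev/vgmyth/lvmythost2        # confirm the target reads back as myth-OST0002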


            minyard Tommy Minyard added a comment:

            The OST is currently deactivated on the MDS; that was one of the first things we did this morning after finding the problem. I have also deactivated it on all client nodes in the cluster to prevent user tasks from hanging when trying to access a file that resides on that OST. I will talk with Karl and we will start testing with read-only assembly of the array to see if we can get it recovered.
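
            For reference, the deactivation steps described here typically look like the following (the OST name is taken from the logs above; exact syntax may vary by Lustre version):

            # lctl dl | grep scratch-OST0124                  # on the MDS: find the OSC device for this OST
            # lctl --device <devno-from-above> deactivate     # stop new object allocations to it
            # lctl set_param osc.scratch-OST0124-*.active=0   # on each client: fail I/O fast instead of hanging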


    People

      Assignee: adilger Andreas Dilger
      Reporter: koomie Karl W Schulz (Inactive)
      Votes: 0
      Watchers: 8
