LU-3668: ldiskfs_check_descriptors: Block bitmap for group not in group

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.6
    • Severity: 3

    Description

      Our $SCRATCH file system is down and we are unable to mount one of its OSTs because of corrupted group descriptors.

      Symptoms:

      (1) cannot mount as a normal Lustre filesystem
      (2) also cannot mount as ldiskfs
      (3) e2fsck reports an alarming number of issues

      Scenario:

      The OST is a RAID-6 (8+2) configuration with external journals. At 18:06 yesterday, MD RAID detected a disk error, evicted the failed disk, and started rebuilding onto a hot spare. Before the rebuild finished, ldiskfs reported the errors below and the device went read-only.
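
      During such a rebuild, the array state and the spare being rebuilt onto can be checked with the standard MD tools; md14 is the affected OST device from the logs below:

        cat /proc/mdstat              # state of every MD array, including "recovery" progress during a rebuild
        mdadm --detail /dev/md14      # active, failed, spare, and rebuilding member devices for this array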

      Jul 29 22:16:40 oss28 kernel: [547129.288298] LDISKFS-fs error (device md14): ldiskfs_lookup: deleted inode referenced: 2463495
      Jul 29 22:16:40 oss28 kernel: [547129.298723] Aborting journal on device md24.
      Jul 29 22:16:40 oss28 kernel: [547129.304211] LustreError: 17212:0:(obd.h:1615:obd_transno_commit_cb()) scratch-OST0124: transno 176013176 commit error: 2
      Jul 29 22:16:40 oss28 kernel: [547129.316134] LustreError: 17212:0:(obd.h:1615:obd_transno_commit_cb()) scratch-OST0124: transno 176013175 commit error: 2
      Jul 29 22:16:40 oss28 kernel: [547129.316136] LDISKFS-fs error (device md14): ldiskfs_journal_start_sb: Detected aborted journal
      Jul 29 22:16:40 oss28 kernel: [547129.316139] LDISKFS-fs (md14): Remounting filesystem read-only

      The host was rebooted at 6am and we have been unable to mount the OST since. We would appreciate suggestions on the best approach to recovering this OST (e2fsck, journal rebuilding, etc.).
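
      A non-destructive first pass along the following lines can help gauge the extent of the damage before any repair is attempted (the device name matches the logs above; 32768 is the usual location of the first backup superblock on a 4KB-block ldiskfs filesystem):

        dumpe2fs -h /dev/md14           # is the primary superblock readable? prints filesystem state and feature flags
        e2fsck -fn -b 32768 /dev/md14   # forced read-only check against a backup superblock; changes nothing on disk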

      I will follow up with output from e2fsck -f -n, which is running now (attempting to use a backup superblock). Typical entries look as follows:

      e2fsck 1.42.7.wc1 (12-Apr-2013)
      Inode table for group 3536 is not in group. (block 103079215118)
      WARNING: SEVERE DATA LOSS POSSIBLE.
      Relocate? no

      Block bitmap for group 3538 is not in group. (block 107524506255360)
      Relocate? no

      Inode bitmap for group 3538 is not in group. (block 18446612162378989568)
      Relocate? no

      Inode table for group 3539 is not in group. (block 3439182177370112)
      WARNING: SEVERE DATA LOSS POSSIBLE.
      Relocate? no

      Block bitmap for group 3541 is not in group. (block 138784755704397824)
      Relocate? no

      Inode table for group 3542 is not in group. (block 7138029487521792000)
      WARNING: SEVERE DATA LOSS POSSIBLE.
      Relocate? no

      Block bitmap for group 3544 is not in group. (block 180388626432)
      Relocate? no

      Inode table for group 3545 is not in group. (block 25769803776)
      WARNING: SEVERE DATA LOSS POSSIBLE.
      Relocate? no

      Block bitmap for group 3547 is not in group. (block 346054104973312)
      Relocate? no

      Inode 503 has compression flag set on filesystem without compression support.
      Clear? no

      Inode 503 has INDEX_FL flag set but is not a directory.
      Clear HTree index? no

      HTREE directory inode 503 has an invalid root node.
      Clear HTree index? no

      HTREE directory inode 503 has an unsupported hash version (40)
      Clear HTree index? no

      HTREE directory inode 503 uses an incompatible htree root node flag.
      Clear HTree index? no

      HTREE directory inode 503 has a tree depth (16) which is too big
      Clear HTree index? no

      Inode 503, i_blocks is 842359139, should be 0. Fix? no

      Inode 504 is in use, but has dtime set. Fix? no

      Inode 504 has imagic flag set. Clear? no

      Inode 504 has a extra size (25649) which is invalid
      Fix? no

      Inode 504 has INDEX_FL flag set but is not a directory.
      Clear HTree index? no

      Inode 562 has INDEX_FL flag set but is not a directory.
      Clear HTree index? no

      HTREE directory inode 562 has an invalid root node.
      Clear HTree index? no

      HTREE directory inode 562 has an unsupported hash version (51)
      Clear HTree index? no

      HTREE directory inode 562 has a tree depth (59) which is too big
      Clear HTree index? no

      Inode 562, i_blocks is 828596838, should be 0. Fix? no

      Inode 563 is in use, but has dtime set. Fix? no

      Inode 563 has imagic flag set. Clear? no

      Inode 563 has a extra size (12387) which is invalid
      Fix? no

      Block #623050609 (3039575950) causes file to be too big. IGNORED.
      Block #623050610 (3038656474) causes file to be too big. IGNORED.
      Block #623050611 (3037435566) causes file to be too big. IGNORED.
      Block #623050612 (3035215768) causes file to be too big. IGNORED.
      Block #623050613 (3031785159) causes file to be too big. IGNORED.
      Block #623050614 (3027736066) causes file to be too big. IGNORED.
      Block #623050615 (3019627313) causes file to be too big. IGNORED.
      Block #623050616 (2970766533) causes file to be too big. IGNORED.
      Block #623050617 (871157932) causes file to be too big. IGNORED.
      Block #623050618 (879167937) causes file to be too big. IGNORED.
      Block #623050619 (883249763) causes file to be too big. IGNORED.
      Block #623050620 (885943218) causes file to be too big. IGNORED.
      Too many illegal blocks in inode 1618.
      Clear inode? no

      Suppress messages? no


          Activity


            minyard Tommy Minyard added a comment -

            We have not had any success getting fsck to run to completion on the corrupted OST. We let e2fsck run on the OSS until it ran out of memory, consuming 55GB, but it did not appear to be making much progress on the repairs. We are currently out of ideas for repair; if you have any further suggestions, please let us know ASAP. We can mount the OST as ldiskfs, but it looks like there is NO data actually on the filesystem under this mount point.

            At this point we think there will not be any way to recover the data, so we are working on the procedure to recreate the OST from scratch. Karl took some notes on how to put the replacement OST back where the previous one was, and we'll follow the instructions in LU-14.
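
            As a rough illustration of how the "no data" observation can be verified, a read-only ldiskfs mount is low-risk (the mount point name is a placeholder):

              mount -t ldiskfs -o ro /dev/md14 /mnt/ost     # read-only, so nothing on the device is modified
              ls /mnt/ost/O/0                               # object directories d0..d31 should appear here if any data survived
              ls /mnt/ost/lost+found | wc -l                # count of orphaned objects that e2fsck moved aside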

            minyard Tommy Minyard added a comment -

            Thanks Peter, the restarted fsck is still running and is consuming more than 20GB of memory right now. It has made a few more repairs since my earlier comment, so it is still progressing. One thing to add: we know that rebuilds can be successful, as we did one manually on the last OST that suffered RAID-6 corruption. In the two rebuilds that have caused problems so far, the two main differences are that they were kicked off automatically and that the OST was still active on the MDS, allowing new files to be written to it.

            I've been digging around looking for reported Linux RAID-6 issues; one page I found does mention a rebuild issue, but it indicated the problem had been fixed in 2.6.32 and later kernels.

            pjones Peter Jones added a comment -

            Tommy

            This is very strange. I have reopened the ticket so we can continue to track this until we have a clearer picture about what is going on.

            Peter


            minyard Tommy Minyard added a comment -

            Andreas, I know you just closed this yesterday, but we have now had the exact same sequence of events happen on a second OSS after a drive failed and the automatic spare rebuild started. Note that we upgraded to the 2.6.32-358.11.1 kernel provided in the Intel distribution for the Lustre 2.1.6 release on July 23rd, and since then both drive failures have led to this same sequence of events. We had plenty of drive failures and automatic rebuilds with our previous 2.1.5 release and its corresponding kernel (not sure of the version offhand). We have not seen other reports of this in our searching yet, but this has to be more than a coincidence. For now we have disabled automatic spare rebuilding on all of our OSSes.

            Now the bad news: even after following our previous procedure of removing the drive that was most recently added and rebuilt onto, we cannot get e2fsck to complete on the RAID device. It ran overnight and grew to 27GB in memory, but had not written anything to the screen for almost 12 hours before we gave up and restarted it (it took only a few hours on the other array). The restarted run has fixed a few more errors but is steadily growing in memory usage again. From what we have found while searching, heavy fsck memory usage typically points to fairly severe filesystem corruption. Any thoughts on this situation or suggestions for us to try?

            Thanks,
            Tommy
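
            For reference, taking hot spares out of the picture is typically just a matter of detaching them from the arrays; the spare device name below is hypothetical:

              mdadm --detail /dev/md14 | grep -i spare      # identify devices currently attached as hot spares
              mdadm /dev/md14 --remove /dev/sdq             # hypothetical spare; once removed, a failure degrades the array instead of starting a rebuild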

            adilger Andreas Dilger added a comment -

            Closing this as "Not a Bug", since the problem ended up being in the MD RAID/parity rebuilding.

            I'm submitting a patch for "mkfs.lustre --replace" under LU-14, which is an existing ticket for OST replacement (which thankfully was not needed in this case).

            adilger Andreas Dilger added a comment -

            This will only recover the OST objects from lost+found and will not touch any of the other data or metadata. If the old file was deleted and restored, it will have a different MDS inode with different objects, and there will be no impact from running ll_recover_lost_found_objs. If, for some reason, there are already objects with the same objid in O/0/d*/ (e.g. some kind of manual recovery of OST objects was done), then the old objects will be left in lost+found.

            minyard Tommy Minyard added a comment -

            Thanks for the information on recovering the lost+found files; this is not something we have typically done in the past. We have a maintenance scheduled for next week and are planning to attempt the recovery at that time. One question has come up from a few users we contacted about their "lost" files: some have already copied the files we had planned to recover back from the tape library, and they want to make sure the ll_recover_lost_found_objs recovery will not overwrite the new files. How will the Lustre recovery behave if the previous file has been replaced with a new copy? My suspicion is that it will not overwrite the new file, but I just wanted to get your thoughts on this scenario.

            adilger Andreas Dilger added a comment -

            While I don't think any of the problems seen here relate to Lustre specifically, I'm going to leave this ticket open for the implementation of "mkfs.lustre --replace --index=N", which just avoids setting the LDD_F_VIRGIN flag when formatting a new OST.
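
            Once that option is available, reformatting a replacement OST to reuse an existing index might look roughly like the following sketch; the fsname, index, MGS NID, and device here are placeholders for this system (292 is 0x124, i.e. scratch-OST0124):

              mkfs.lustre --ost --fsname=scratch --index=292 --replace \
                          --mgsnode=10.0.0.1@o2ib /dev/md14          # --replace skips the LDD_F_VIRGIN "brand new OST" registration
              mount -t lustre /dev/md14 /mnt/ost0124                  # the OST rejoins the filesystem under its old index
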
            adilger Andreas Dilger added a comment - edited

            It is possible to recover the files in lost+found using the "ll_recover_lost_found_objs" tool. This will move the OST objects from lost+found back to their proper location, /O/0/d{objid % 32}/{objid}, using the information stored in the "fid" xattr of each inode. Any objects that are zero-length have likely never been accessed and could be deleted. This needs to be done with the OST mounted as ldiskfs (it will eventually be done automatically when the LFSCK Phase 2 project is completed).
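
            A rough sketch of that procedure, with placeholder device and mount point names (-d points the tool at the OST's lost+found directory):

              mount -t ldiskfs /dev/md14 /mnt/ost                  # the OST must be stopped as a Lustre target and mounted as ldiskfs
              ll_recover_lost_found_objs -d /mnt/ost/lost+found    # moves objects back to /O/0/d{objid % 32}/{objid} using the "fid" xattr
              umount /mnt/ost                                      # then remount as type lustre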

            koomie Karl W Schulz (Inactive) added a comment -

            Update: the number of issues e2fsck actually reported was more substantial when we ran with fixes enabled than in the previous "-n" run. However, it completed after about 2 hours and allowed us to mount via ldiskfs. The mountdata and LAST_ID files looked reasonable, and we were subsequently able to mount the OST as a Lustre filesystem. We do have a small percentage of files in lost+found, and we are going to leave this OST inactive on the MDS until the next maintenance, but it looks like we were able to recover the majority of the data in this case. Thanks for the help and suggestions today. We definitely have not seen anything quite like this before and are rebuilding the raidset with an alternate drive.
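
            For reference, keeping an OST inactive on the MDS (so no new objects are allocated to it while existing files stay readable) is generally done along these lines; the device name should be taken from the local lctl dl output:

              lctl dl | grep OST0124                        # find the MDS-side osc device for this OST
              lctl --device scratch-OST0124-osc deactivate  # stop allocating new objects on it; "activate" reverses this later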

            koomie Karl W Schulz (Inactive) added a comment -

            Yes, it looks to be using -b 32768, as we can duplicate the results if we specify that value. Trying an actual fix now with e2fsck... fingers crossed.
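
            A repair run against that same backup superblock would look roughly like this (device name as earlier in the ticket; -B gives the 4KB block size explicitly):

              e2fsck -fy -b 32768 -B 4096 /dev/md14   # forced repair using the backup superblock at block 32768, answering "yes" to all fixes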

            People

              Assignee: adilger Andreas Dilger
              Reporter: koomie Karl W Schulz (Inactive)
              Votes: 0
              Watchers: 8
