
LU-3668: ldiskfs_check_descriptors: Block bitmap for group not in group

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.6
    • Severity: 3
    • 9453

    Description

Our $SCRATCH file system is down and we are unable to mount an OST because corrupted group descriptors are being reported.

      Symptoms:

      (1) cannot mount as normal lustre fs
      (2) also cannot mount as ldiskfs
      (3) e2fsck reports alarming number of issues

      Scenario:

      The OST is a RAID6 (8+2) config with external journals. At 18:06 yesterday, MD raid detected a disk error, evicted the failed disk, and started rebuilding the device with a hot spare. Before the rebuild finished, ldiskfs reported the error below and the device went read-only.

      Jul 29 22:16:40 oss28 kernel: [547129.288298] LDISKFS-fs error (device md14): ldiskfs_lookup: deleted inode referenced: 2463495
      Jul 29 22:16:40 oss28 kernel: [547129.298723] Aborting journal on device md24.
      Jul 29 22:16:40 oss28 kernel: [547129.304211] LustreError: 17212:0:(obd.h:1615:obd_transno_commit_cb()) scratch-OST0124: transno 176013176 commit error: 2
      Jul 29 22:16:40 oss28 kernel: [547129.316134] LustreError: 17212:0:(obd.h:1615:obd_transno_commit_cb()) scratch-OST0124: transno 176013175 commit error: 2
      Jul 29 22:16:40 oss28 kernel: [547129.316136] LDISKFS-fs error (device md14): ldiskfs_journal_start_sb: Detected aborted journal
      Jul 29 22:16:40 oss28 kernel: [547129.316139] LDISKFS-fs (md14): Remounting filesystem read-only
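
      A quick way to confirm the array state and rebuild progress at this point (array names taken from the log above) would be:

      # overall MD status, including any rebuild/recovery progress
      cat /proc/mdstat

      # per-array detail: state, failed members, and the spare being rebuilt
      mdadm --detail /dev/md14
      mdadm --detail /dev/md24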

The host was rebooted at 6am and we have been unable to mount the OST since. We would appreciate suggestions on the best approach to recovering this OST with e2fsck, journal rebuilding, etc.
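
      For the journal-rebuilding part, one common ldiskfs/ext4 approach is to drop and recreate the external journal with tune2fs/mke2fs. A rough sketch only (device names from the log above, the 4096 block size is an assumption, and this should only be attempted once e2fsck can otherwise clean the device):

      # drop the reference to the aborted external journal
      # (tune2fs normally insists that the needs_recovery flag be cleared first,
      #  e.g. by a prior e2fsck run)
      tune2fs -O ^has_journal /dev/md14

      # recreate the external journal device and attach it again
      mke2fs -O journal_dev -b 4096 /dev/md24
      tune2fs -J device=/dev/md24 /dev/md14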

I will follow up with output from e2fsck -f -n, which is running now (attempting to use a backup superblock; see the command sketch after this output). Typical entries look as follows:

      e2fsck 1.42.7.wc1 (12-Apr-2013)
      Inode table for group 3536 is not in group. (block 103079215118)
      WARNING: SEVERE DATA LOSS POSSIBLE.
      Relocate? no

      Block bitmap for group 3538 is not in group. (block 107524506255360)
      Relocate? no

      Inode bitmap for group 3538 is not in group. (block 18446612162378989568)
      Relocate? no

      Inode table for group 3539 is not in group. (block 3439182177370112)
      WARNING: SEVERE DATA LOSS POSSIBLE.
      Relocate? no

      Block bitmap for group 3541 is not in group. (block 138784755704397824)
      Relocate? no

      Inode table for group 3542 is not in group. (block 7138029487521792000)
      WARNING: SEVERE DATA LOSS POSSIBLE.
      Relocate? no

      Block bitmap for group 3544 is not in group. (block 180388626432)
      Relocate? no

      Inode table for group 3545 is not in group. (block 25769803776)
      WARNING: SEVERE DATA LOSS POSSIBLE.
      Relocate? no

      Block bitmap for group 3547 is not in group. (block 346054104973312)
      Relocate? no

Inode 503 has compression flag set on filesystem without compression support.
      Clear? no

      Inode 503 has INDEX_FL flag set but is not a directory.
      Clear HTree index? no

      HTREE directory inode 503 has an invalid root node.
      Clear HTree index? no

      HTREE directory inode 503 has an unsupported hash version (40)
      Clear HTree index? no

      HTREE directory inode 503 uses an incompatible htree root node flag.
      Clear HTree index? no

      HTREE directory inode 503 has a tree depth (16) which is too big
      Clear HTree index? no

      Inode 503, i_blocks is 842359139, should be 0. Fix? no

      Inode 504 is in use, but has dtime set. Fix? no

      Inode 504 has imagic flag set. Clear? no

      Inode 504 has a extra size (25649) which is invalid
      Fix? no

      Inode 504 has INDEX_FL flag set but is not a directory.
      Clear HTree index? no

      Inode 562 has INDEX_FL flag set but is not a directory.
      Clear HTree index? no

      HTREE directory inode 562 has an invalid root node.
      Clear HTree index? no

      HTREE directory inode 562 has an unsupported hash version (51)
      Clear HTree index? no

      HTREE directory inode 562 has a tree depth (59) which is too big
      Clear HTree index? no

      Inode 562, i_blocks is 828596838, should be 0. Fix? no

      Inode 563 is in use, but has dtime set. Fix? no

      Inode 563 has imagic flag set. Clear? no

      Inode 563 has a extra size (12387) which is invalid
      Fix? no

Block #623050609 (3039575950) causes file to be too big. IGNORED.
      Block #623050610 (3038656474) causes file to be too big. IGNORED.
      Block #623050611 (3037435566) causes file to be too big. IGNORED.
      Block #623050612 (3035215768) causes file to be too big. IGNORED.
      Block #623050613 (3031785159) causes file to be too big. IGNORED.
      Block #623050614 (3027736066) causes file to be too big. IGNORED.
      Block #623050615 (3019627313) causes file to be too big. IGNORED.
      Block #623050616 (2970766533) causes file to be too big. IGNORED.
      Block #623050617 (871157932) causes file to be too big. IGNORED.
      Block #623050618 (879167937) causes file to be too big. IGNORED.
      Block #623050619 (883249763) causes file to be too big. IGNORED.
      Block #623050620 (885943218) causes file to be too big. IGNORED.
      Too many illegal blocks in inode 1618.
      Clear inode? no

      Suppress messages? no
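
      For the backup-superblock attempt mentioned above, the backup locations can be listed and handed to e2fsck along these lines (block numbers shown are the common defaults for a 4K-block filesystem and are an assumption here):

      # list backup superblock locations without touching the device
      mke2fs -n /dev/md14

      # or pull them from an existing superblock copy
      dumpe2fs /dev/md14 | grep -i 'superblock at'

      # read-only check against the first backup superblock of a 4K-block filesystem
      e2fsck -fn -b 32768 -B 4096 /dev/md14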


          Activity

            jfc John Fuchs-Chesney (Inactive) made changes -
            Resolution: Won't Fix
            Status: Reopened -> Resolved

            jfc John Fuchs-Chesney (Inactive) added a comment -
            Customer has apparently moved on from this issue.
            ~ jfc.


            jfc2 John Fuchs-Chesney (text) (Inactive) added a comment -
            Karl and Tommy,
            Reading through this I think you may have moved on from this problem.
            Do you want us to keep this issue open, or may we mark it as resolved?
            Many thanks,
            ~ jfc.

            pjones Peter Jones made changes -
            Labels: patch (added)

            adilger Andreas Dilger added a comment -
            As Murphy would have it, you hit this problem again the day after I went on vacation for a week.

            I looked through the Lustre 2.1.5->2.1.6 changes, and while there are some changes in the ldiskfs patches, these are mostly in patch context and not functional changes. I couldn't see any other significant changes to the OST code either.

            On the kernel side, there were some changes to ext4 that appear related to freezing the filesystem for suspend/resume, and better block accounting for delayed allocation (which Lustre doesn't use). There is another change to optimize extent handling for fallocate() (which Lustre also doesn't use), but I can't see how it would relate to MD device failure/rebuilding. I'm not sure what changes went into the MD RAID code.

            Do you still have logs for the e2fsck runs from the second OST failure? I can't imagine why it would be consuming so much memory, unless there was some strange corruption that e2fsck isn't expecting. It shouldn't be using more than about 5-6GB of RAM in the worst case. If you have logs it might be possible to reverse-engineer what type of corruption was seen. Presumably, you weren't able to recover anything from this OST in the end? Nothing in lost+found?


            koomie Karl W Schulz (Inactive) added a comment -
            We did take a quick stab at building the v2.1.6 release against our older kernel (and even the 2.6.32-279.19.1.el6_lustre version that was supported with v2.1.5), but it looks like the newer ldiskfs patches for the rhel6 series are hitting conflicts which prevent a build out of the box.

            Consequently, we've decided to roll the servers back to v2.1.5 and the previous production OSS kernel (2.6.32-279.5.2.el6).

            pjones Peter Jones added a comment -

            Tommy

            Andreas is out this week but I did manage to connect with him to see whether he had any suggestions and he suggested trying to rebuild 2.1.6 against the older kernel to see whether that has any effect on this behaviour. I'm continuing to talk to other engineers with expertise in this area to see if there are any other thoughts.

            Regards

            Peter


            minyard Tommy Minyard added a comment -
            So we have not had any success getting fsck to run to completion on the corrupted OST. We let the e2fsck run on the OSS until it ran out of memory, consuming 55GB, but it did not appear to be making much progress on the repairs. We are currently out of ideas for repair; if you have any further suggestions, please let us know ASAP. We can mount the OST as ldiskfs, but it looks like there is NO data actually on the filesystem under this mount point.

            At this point, we think there will not be any way to recover the data, so we are working on the procedure to recreate the OST from scratch. Karl took some notes on how to replace the OST back where it was previously and we'll follow the instructions in LU-14.
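
            For reference, when e2fsck runs out of RAM like this it can be told to spill its large in-memory tables to disk via an [scratch_files] stanza in /etc/e2fsck.conf; it runs much more slowly but within a bounded memory footprint. A minimal sketch (the directory path is an assumption and must already exist on a filesystem with plenty of free space):

            # /etc/e2fsck.conf
            [scratch_files]
                directory = /var/cache/e2fsck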


            minyard Tommy Minyard added a comment -
            Thanks Peter, the restarted fsck is still running and consuming more than 20GB of memory right now. It has made a few more repairs since my earlier comment, so it is still progressing. One thing to add: we know that rebuilds can be successful, as we did one manually with the last OST that suffered the RAID-6 corruption. In the two cases where a rebuild has caused problems so far, the two primary differences are that the rebuilds were kicked off automatically and the OST was still active on the MDS, allowing new files to be written to it.

            I've been digging around looking for reported Linux RAID-6 issues, and one page I found does note a rebuild issue, but it indicated the problem had been fixed in 2.6.32 and later kernels.
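
            As an aside on the "OST was still active on the MDS" point: one way to stop the MDS from allocating new objects on an OST during a rebuild is to deactivate the corresponding OSC device on the MDS (OST name taken from the logs above; <devno> is a placeholder for the device number reported by lctl dl):

            # on the MDS: find the OSC device for the affected OST, then deactivate it
            lctl dl | grep scratch-OST0124
            lctl --device <devno> deactivate

            # reactivate once the array rebuild has completed
            lctl --device <devno> activate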

            pjones Peter Jones made changes -
            Resolution: Not a Bug (cleared)
            Status: Resolved -> Reopened

            People

              Assignee: adilger Andreas Dilger
              Reporter: koomie Karl W Schulz (Inactive)