
ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 23828 corrupted

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 1.8.x (1.8.0 - 1.8.5)
    • lustre-1.8.4
    • 3
    • 10118

    Description

      Last week, one of our customers got corruption messages from ldiskfs, and the OSS then remounted that OST read-only.
      The error messages we got are shown below. The situation is very similar to LU-501, but we don't know whether this is exactly the same problem. Please check the log files.

      Jan 19 18:25:46 lustre-oss-0-0 kernel: [8145936.472484] LDISKFS-fs error (device dm-5): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 23828corrupted: 4190 blocks free in bitmap, 4189 - in gd
      Jan 19 18:25:46 lustre-oss-0-0 kernel: [8145936.472677]
      Jan 19 18:25:46 lustre-oss-0-0 kernel: [8145936.472727] Aborting journal on device dm-5-8.
      Jan 19 18:25:46 lustre-oss-0-0 kernel: [8145936.473961] LDISKFS-fs (dm-5): Remounting filesystem read-only
      Jan 19 18:25:46 lustre-oss-0-0 kernel: [8145936.510914] LDISKFS-fs error (device dm-5): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 23828corrupted: 4190 blocks free in bitmap, 4189 - in gd
      Jan 19 18:25:46 lustre-oss-0-0 kernel: [8145936.511103]
      Jan 19 18:25:46 lustre-oss-0-0 kernel: [8145936.513219] LDISKFS-fs error (device dm-5): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 23828corrupted: 4190 blocks free in bitmap, 4189 - in gd
      Jan 19 18:25:46 lustre-oss-0-0 kernel: [8145936.513396]
      Jan 19 18:25:46 lustre-oss-0-0 kernel: [8145936.515388] LDISKFS-fs error (device dm-5): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 23828corrupted: 4190 blocks free in bitmap, 4189 - in gd
      Jan 19 18:25:46 lustre-oss-0-0 kernel: [8145936.515562]
      Jan 19 18:25:46 lustre-oss-0-0 kernel: [8145936.517502] LDISKFS-fs error (device dm-5): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 23828corrupted: 4190 blocks free in bitmap, 4189 - in gd
      Jan 19 18:25:46 lustre-oss-0-0 kernel: [8145936.518511]
      Jan 19 18:25:46 lustre-oss-0-0 kernel: [8145936.520586] LDISKFS-fs error (device dm-5): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 23828corrupted: 4190 blocks free in bitmap, 4189 - in gd
      Jan 19 18:25:46 lustre-oss-0-0 kernel: [8145936.520753]
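
The same error repeats many times per second, so it can help to reduce a log to the distinct groups and counts involved. The following is a minimal sketch, not part of the original report: the function name `summarize_bitmap_errors` is hypothetical, and the pattern assumes the message format shown above (including the missing space before "corrupted").

```shell
#!/bin/sh
# Summarize ldiskfs bitmap-mismatch errors read from stdin.
# Prints one line per distinct (group, bitmap count, descriptor count) triple.
# Assumes the message format quoted in this ticket; other kernels may differ.
summarize_bitmap_errors() {
    sed -nE 's/.*on-disk bitmap for group ([0-9]+) ?corrupted: ([0-9]+) blocks free in bitmap, ([0-9]+) - in gd.*/\1 \2 \3/p' |
        sort -u |
        while read -r group in_bitmap in_gd; do
            # diff > 0 means the bitmap claims more free blocks than the group descriptor
            echo "group=$group bitmap_free=$in_bitmap gd_free=$in_gd diff=$((in_bitmap - in_gd))"
        done
}
```

For example, `zcat messages.gz | summarize_bitmap_errors` would print one line for each affected block group in a compressed syslog file.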

      Attachments

        1. e2fsck_0E3030430100.log
          0.4 kB
        2. e2fsck_118631EB1200.log
          0.4 kB
        3. e2fsck_output_20130214.txt
          1 kB
        4. kernel_logs_20130214.txt
          13 kB
        5. ldiskfsck.static.sdc1.out_2012-01-30
          3.58 MB
        6. messages.1.gz
          61 kB
        7. messages.gz
          819 kB
        8. messages-another-oss.gz
          177 kB
        9. OST00bc_kern.log
          118 kB

          Activity

            [LU-1026] ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 23828 corrupted

            We may have hit this bug after a large amount of data was deleted from a nearly full filesystem and data was then written again. We are going to see if we can reproduce it on our test filesystem.

            Can we get a backport to 2.7.2fe?

            Thanks,
            Mahmoud

            mhanafi Mahmoud Hanafi added a comment

            Shilong,
            it would be great if you submitted this patch http://review.whamcloud.com/16679 (and any parts of http://review.whamcloud.com/16312 needed) to upstream ext4, so that we don't have to maintain this forever in the future. The commit comment should be updated to match the newer kernel and reference ext4 instead of ldiskfs, like:

            ext4: make bitmap corruption not fatal
            
            There can be occasional reasons for bitmap problems, which are
            detected by ext4_mb_check_ondisk_bitmap() and cause the
            filesystem to be remounted read-only due to ext4_error():
            
             EXT4-fs error (device /dev/dm-6-8): ext4_mb_generate_buddy:755:
                group 294, block 0: block bitmap and bg descriptor inconsistent:
                20180 vs 20181 free clusters
             Aborting journal on device dm-6-8.
             EXT4-fs (dm-6): Remounting filesystem read-only
            
            This might be caused by some ext4 internal bugs, which are addressed
            separately.  This patch makes ext4 more robust by the following changes:
            
            - ext4_read_block_bitmap() printed error, so do not call ext4_error() again
            - mark all bits in bitmap used so that it will not be used for allocation
            - mark block group corrupt, use ext4_warning() instead of ext4_error()
            
            Tested by following script:
            
            # WARNING: this destroys all data on TEST_DEV
            TEST_DEV="/dev/sdb"
            TEST_MNT="/mnt/ext4"
            
            mkdir -p "$TEST_MNT"
            mkfs.ext4 -F "$TEST_DEV"
            
            # Write a 2000 MiB file so that blocks are allocated
            mount -t ext4 "$TEST_DEV" "$TEST_MNT"
            dd if=/dev/zero of="$TEST_MNT/largefile" oflag=direct bs=10485760 count=200
            umount "$TEST_MNT"
            # Zero 10 blocks of the raw device starting at block 641 to corrupt the bitmaps
            dd if=/dev/zero of="$TEST_DEV" oflag=direct bs=4096 seek=641 count=10
            mount -t ext4 "$TEST_DEV" "$TEST_MNT"
            # Deleting and rewriting the file should still succeed
            rm -f "$TEST_MNT/largefile"
            dd if=/dev/zero of="$TEST_MNT/largefile" oflag=direct bs=10485760 count=200 &&
                  echo "Filesystem still usable after bitmap corruption"
            umount "$TEST_MNT"
            e2fsck -y "$TEST_DEV"
            
            Signed-off-by: Wang Shilong <wshilong@ddn.com>
            Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-1026 
            Reviewed-on: http://review.whamcloud.com/16679
            Reviewed-by: Bob Glossman <bob.glossman@intel.com>
            Reviewed-by: Yang Sheng <yang.sheng@intel.com>
            Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
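
The test script above ends with `e2fsck -y`, whose exit status is a bitmask of conditions as documented in the e2fsck man page. As a minimal sketch (the helper name `describe_e2fsck_status` is hypothetical and not part of the patch), the status could be decoded like this when scripting the test:

```shell
#!/bin/sh
# Decode an e2fsck exit status (a sum of the bit values listed in e2fsck(8)).
describe_e2fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    result=""
    # Each line below is "bit:meaning"; the loop runs in the current shell,
    # so $result is still set after the loop finishes.
    while IFS=: read -r bit text; do
        if [ $((status & bit)) -ne 0 ]; then
            result="${result:+$result; }$text"
        fi
    done <<EOF
1:errors corrected
2:errors corrected, reboot recommended
4:errors left uncorrected
8:operational error
16:usage or syntax error
32:canceled by user request
128:shared library error
EOF
    echo "$result"
}
```

After the deliberate corruption in the test, one would call it as `e2fsck -y "$TEST_DEV"; describe_e2fsck_status $?`. A result of "errors corrected" would be unsurprising there, while "errors left uncorrected" would warrant a closer look.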
            
            adilger Andreas Dilger added a comment

            Landed for 2.8

            jgmitter Joseph Gmitter (Inactive) added a comment

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16679/
            Subject: LU-1026 ldiskfs: make bitmaps corruption not fatal
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e727c383db8b2485d9e6137895136699d57ea047

            gerrit Gerrit Updater added a comment

            Wang Shilong (wshilong@ddn.com) uploaded a new patch: http://review.whamcloud.com/16679
            Subject: LU-1026 ldiskfs: more bitmaps corruption handling
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5378282c5f0f9bbd12e00cde80e5fc757a091548

            gerrit Gerrit Updater added a comment

            Hi HongChao,

            This bug is hard to reproduce; it has happened on customers' machines over several months, so we are going to apply the patch
            and will let you know the results.

            wangshilong Wang Shilong (Inactive) added a comment

            Hi Shilong,

            Did you manage to test with the new patch, and what was the result? Thanks!

            hongchao.zhang Hongchao Zhang added a comment

            Hi Shilong,

            Sorry for the delayed response, and thank you very much for creating the corresponding patch!

            hongchao.zhang Hongchao Zhang added a comment

            Shilong Wang (wshilong@ddn.com) uploaded a new patch: http://review.whamcloud.com/13992
            Subject: LU-1026 ldiskfs: make bg RO if bitmaps validations fail
            Project: fs/lustre-release
            Branch: b2_1
            Current Patch Set: 1
            Commit: 635f7aa3a562d6af4887a463900768ec872a60a9

            gerrit Gerrit Updater added a comment

            Shilong Wang (wshilong@ddn.com) uploaded a new patch: http://review.whamcloud.com/13991
            Subject: LU-1026 ldiskfs: try one more time with locked buffer
            Project: fs/lustre-release
            Branch: b2_1
            Current Patch Set: 1
            Commit: 25a9e3904313190714b4e22ad1b3f50e1cef62d5

            gerrit Gerrit Updater added a comment

            People

              hongchao.zhang Hongchao Zhang
              ihara Shuichi Ihara (Inactive)