Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.10.1, Lustre 2.11.0
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Severity: 3

    Description

      Two OSSes and three different OSTs crashed with "on-disk bitmap corrupted" messages.

      Apr  3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs error (device dm-42): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 245659corrupted: 32768 blocks free in bitmap, 0 - in gd
      Apr  3 18:38:16 nbp1-oss6 kernel: 
      Apr  3 18:38:16 nbp1-oss6 kernel: Aborting journal on device dm-3.
      Apr  3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs (dm-42): Remounting filesystem read-only
      Apr  3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs error (device dm-42): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 245660corrupted: 32768 blocks free in bitmap, 0 - in gd
      
      
      

      These errors were on 2 different backend RAID devices. Noteworthy items:
      1. The filesystem was over 90% full and half of the data was deleted.
      2. The OSTs are formatted with "-E packed_meta_blocks=1".
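
      When triaging a syslog like the attached one, it can help to pull out which device and block groups were reported. The following is a minimal sketch, assuming the message layout matches the log lines quoted above (including the missing space before "corrupted"); the function name parse_bitmap_errors is hypothetical and not part of Lustre or any tool.

```python
import re

# Pattern modeled on the log lines quoted in this ticket. The kernel message
# has no space between the group number and "corrupted", so \s* allows both.
BITMAP_ERR = re.compile(
    r"LDISKFS-fs error \(device (?P<dev>[\w-]+)\): "
    r"ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group "
    r"(?P<group>\d+)\s*corrupted: (?P<bitmap_free>\d+) blocks free in bitmap, "
    r"(?P<gd_free>\d+) - in gd"
)


def parse_bitmap_errors(lines):
    """Return (device, group, free_in_bitmap, free_in_gd) for each matching line."""
    found = []
    for line in lines:
        m = BITMAP_ERR.search(line)
        if m:
            found.append(
                (m["dev"], int(m["group"]), int(m["bitmap_free"]), int(m["gd_free"]))
            )
    return found


# The sample lines are copied from the description above.
sample = [
    "Apr  3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs error (device dm-42): "
    "ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 245659corrupted: "
    "32768 blocks free in bitmap, 0 - in gd",
    "Apr  3 18:38:16 nbp1-oss6 kernel: Aborting journal on device dm-3.",
    "Apr  3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs error (device dm-42): "
    "ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 245660corrupted: "
    "32768 blocks free in bitmap, 0 - in gd",
]

for dev, group, bitmap_free, gd_free in parse_bitmap_errors(sample):
    print(f"{dev}: group {group} bitmap_free={bitmap_free} gd_free={gd_free}")
```

      Lines that do not match the pattern, such as the journal abort message, are skipped.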

      Attachments

        1. bt.2017-07-26-02.48.00
          765 kB
        2. bt.2017-07-26-12.08.43
          808 kB
        3. foreach.out
          736 kB
        4. mballoc.c
          145 kB
        5. ost258.dumpe2fs.after.fsck.gz
          34.46 MB
        6. ost258.dumpe2fs.after.readonly.gz
          34.44 MB
        7. syslog.gp270808.error.gz
          13.37 MB
        8. vmcore-dmesg.txt
          512 kB

          Activity

            [LU-9410] on-disk bitmap corrupted

            yong.fan nasf (Inactive) added a comment - Yes, master also needs the patch 28566.

            jaylan Jay Lan (Inactive) added a comment - Do I need this patch for 2.10.0?

            yong.fan nasf (Inactive) added a comment - I think there may be something that can be improved in mke2fs, not e2fsck.

            mhanafi Mahmoud Hanafi added a comment - Does this patch require any changes to e2fsck?

            yong.fan nasf (Inactive) added a comment - mhanafi Thanks for the update.

            mhanafi Mahmoud Hanafi added a comment - Update: we applied https://review.whamcloud.com/28566 on Friday and the filesystem has been stable.

            yong.fan nasf (Inactive) added a comment -

            "Sorry, I mistyped the patch number. I wanted to say it is stable with 28550."

            Then that is reasonable. As I explained above, 28550 may do more than the necessary fixes. But since it runs stably, you can keep it until the next 'corruption'.

            mhanafi Mahmoud Hanafi added a comment - Sorry, I mistyped the patch number. I wanted to say it is stable with 28550.

            yong.fan nasf (Inactive) added a comment -

            Patch 28550 takes effect before 28566, so if 28550 is applied, then 28566 is meaningless. But 28550 may do more than the necessary fixes, and I am afraid of potential side effects.

            "The filesystem is stable with the workaround patch (/28489/). Can we run with this patch for some time without any underlying filesystem issues? Or should we replace it with 28566 ASAP?"

            That is interesting to know. Because 28489 is just a debug patch, I cannot imagine how it can resolve your issue. It may be because your system has jumped over the groups with the "BLOCK_UNINIT" flag and zero free blocks in the GDP. If that is true, then applying 28566 will not bring you any further benefit. Since your system is running stably, you can replace the patches with 28566 the next time it is 'corrupted'.
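
            The condition named in the comment above (groups carrying the "BLOCK_UNINIT" flag while the group descriptor reports zero free blocks) could be searched for in dumpe2fs output such as the attached dumps. The following is a rough sketch only. It assumes a text layout of "Group N: ... [FLAGS]" followed by a line starting with "<count> free blocks,", which may differ between e2fsprogs versions. The function name and the sample text are made up for illustration and are not taken from the ticket's attachments.

```python
import re

# Assumed dumpe2fs layout: a "Group N:" header line with optional [FLAGS],
# followed later by an indented "<count> free blocks, ..." summary line.
GROUP_HDR = re.compile(r"^Group (\d+):")
FLAGS = re.compile(r"\[([^\]]*)\]")
FREE_LINE = re.compile(r"^\s*(\d+) free blocks,")


def find_uninit_groups_with_zero_free(text):
    """Return group numbers flagged BLOCK_UNINIT whose descriptor shows 0 free blocks."""
    suspects = []
    current = None  # (group number, set of flags) for the group being read
    for line in text.splitlines():
        hdr = GROUP_HDR.match(line)
        if hdr:
            flags_match = FLAGS.search(line)
            flags = set()
            if flags_match:
                flags = {f.strip() for f in flags_match.group(1).split(",") if f.strip()}
            current = (int(hdr.group(1)), flags)
            continue
        free = FREE_LINE.match(line)
        if free and current is not None:
            group, flags = current
            if "BLOCK_UNINIT" in flags and int(free.group(1)) == 0:
                suspects.append(group)
            current = None  # ignore further lines until the next header
    return suspects


# Hypothetical sample text, not real data from this filesystem.
SAMPLE = """\
Group 0: (Blocks 0-32767) [ITABLE_ZEROED]
  Block bitmap at 1024 (+1024), Inode bitmap at 2048 (+2048)
  1000 free blocks, 100 free inodes, 2 directories, 100 unused inodes
Group 1: (Blocks 32768-65535) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Block bitmap at 1025 (bg #0 + 1025), Inode bitmap at 2049 (bg #0 + 2049)
  0 free blocks, 128 free inodes, 0 directories, 128 unused inodes
Group 2: (Blocks 65536-98303) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  32768 free blocks, 128 free inodes, 0 directories, 128 unused inodes
"""

print(find_uninit_groups_with_zero_free(SAMPLE))
```

            A non-empty result would only point at groups worth a closer look. It says nothing about whether patch 28566 or 28550 is needed.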

            mhanafi Mahmoud Hanafi added a comment - The filesystem is stable with the workaround patch (/28489/). Can we run with this patch for some time without any underlying filesystem issues? Or should we replace it with 28566 ASAP?
            jaylan Jay Lan (Inactive) added a comment (edited) -

            I did a build with #28566 and #28550 yesterday. For testing purposes, do these two conflict?
            I will undo #28550, but if the two do not collide, we can test with the builds I did yesterday.

            Never mind. I just did another build with #28550 pulled out.

            People

              yong.fan nasf (Inactive)
              mhanafi Mahmoud Hanafi