Details
Bug - Resolution: Fixed - Critical - Lustre 2.7.0 - None - 3 - 9223372036854775807
Description
We had 2 OSSes and 3 different OSTs crash with bitmap corruption messages.
Apr 3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs error (device dm-42): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 245659corrupted: 32768 blocks free in bitmap, 0 - in gd
Apr 3 18:38:16 nbp1-oss6 kernel:
Apr 3 18:38:16 nbp1-oss6 kernel: Aborting journal on device dm-3.
Apr 3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs (dm-42): Remounting filesystem read-only
Apr 3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs error (device dm-42): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 245660corrupted: 32768 blocks free in bitmap, 0 - in gd
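The error above reports a mismatch between two counts for one block group: the number of free blocks found in the on-disk bitmap (32768) and the free count recorded in the group descriptor (0). The following is only a minimal Python sketch of that kind of consistency check, not the actual ldiskfs code. The function names, the bit ordering, and the 32768 blocks-per-group constant (taken from the log message) are assumptions made for illustration.

```python
from typing import Optional

# Assumption: 32768 blocks per group, matching "32768 blocks free" in the log.
BLOCKS_PER_GROUP = 32768


def count_free_blocks(bitmap: bytes, blocks_per_group: int = BLOCKS_PER_GROUP) -> int:
    """Count clear bits (free blocks) among the first blocks_per_group bits."""
    used = 0
    for i in range(blocks_per_group):
        # Hypothetical layout: bit i lives in byte i // 8, position i % 8.
        if bitmap[i // 8] & (1 << (i % 8)):
            used += 1
    return blocks_per_group - used


def check_group(group: int, bitmap: bytes, gd_free: int) -> Optional[str]:
    """Return an error string if the bitmap and group descriptor disagree."""
    free = count_free_blocks(bitmap)
    if free != gd_free:
        return (f"on-disk bitmap for group {group} corrupted: "
                f"{free} blocks free in bitmap, {gd_free} - in gd")
    return None


# An all-zero bitmap marks every block free, while the descriptor claims none are.
print(check_group(245659, bytes(BLOCKS_PER_GROUP // 8), gd_free=0))
```

In this sketch an all-zero bitmap paired with a descriptor free count of 0 yields the same pair of numbers as the log lines: every block looks free in the bitmap while the descriptor says none are.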
These errors were on 2 different backend RAID devices. Noteworthy items:
1. The filesystem was over 90% full and half of the data was deleted.
2. OSTs are formatted with "-E packed_meta_blocks=1"
It is true that we missed that patch; Andreas pointed it out in the first comment:
https://jira.hpdd.intel.com/browse/LU-9410?focusedCommentId=193803&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-193803
But this patch mainly handles the situation after the bitmap corruption has already happened. It lets the system keep running instead of failing right away, so that users can run e2fsck during a maintenance window. As mhanafi commented:
https://jira.hpdd.intel.com/browse/LU-9410?focusedCommentId=205024&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-205024, it may not help much in the NASA case.