Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.7.0
Description
We had 2 OSSes and 3 different OSTs crash with bitmap-corrupted messages.
Apr 3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs error (device dm-42): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 245659corrupted: 32768 blocks free in bitmap, 0 - in gd
Apr 3 18:38:16 nbp1-oss6 kernel:
Apr 3 18:38:16 nbp1-oss6 kernel: Aborting journal on device dm-3.
Apr 3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs (dm-42): Remounting filesystem read-only
Apr 3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs error (device dm-42): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 245660corrupted: 32768 blocks free in bitmap, 0 - in gd
These errors were on 2 different backend RAID devices. Noteworthy items:
1. The filesystem was over 90% full and half of the data was deleted.
2. OSTs are formatted with " -E packed_meta_blocks=1 "
Attachments
- bt.2017-07-26-02.48.00
- 765 kB
- bt.2017-07-26-12.08.43
- 808 kB
- foreach.out
- 736 kB
- mballoc.c
- 145 kB
- ost258.dumpe2fs.after.fsck.gz
- 34.46 MB
- syslog.gp270808.error.gz
- 13.37 MB
- vmcore-dmesg.txt
- 512 kB
Activity
mhanafi
It looks different from the original one. Would you please show me more logs (dmesg, /var/log/messages) about the latest corruption? Is the system still accessible after the above warning?
Applied the new patch. After a full fsck, mounting the OSTs resulted in this many block groups getting corrected:
---------------- service603 ----------------
4549 dm-33):
---------------- service604 ----------------
4425 dm-32):
---------------- service606 ----------------
4658 dm-29):
---------------- service610 ----------------
4631 dm-33):
---------------- service611 ----------------
4616 dm-28):
---------------- service616 ----------------
4652 dm-35):
---------------- service617 ----------------
4501 dm-21):
---------------- service619 ----------------
4657 dm-25):
We need to rate limit the warnings.
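To illustrate what rate limiting these warnings could look like, here is a minimal userspace sketch of a windowed limiter: at most `burst` messages are let through per `interval`, and the rest are counted as suppressed. The struct and function names are hypothetical and not taken from ldiskfs; in the kernel itself one would more likely reach for the existing facilities such as printk_ratelimited().

```c
#include <stdint.h>

/* Hypothetical limiter state; zero-initialize everything except the
 * interval and burst fields before first use. */
struct warn_ratelimit {
    uint64_t interval;      /* window length, in the caller's time units */
    uint32_t burst;         /* messages allowed per window */
    uint64_t window_start;  /* start time of the current window */
    uint32_t printed;       /* messages allowed so far in this window */
    uint32_t suppressed;    /* total messages dropped */
};

/* Return 1 if a warning may be emitted at time `now`, 0 if it should be
 * dropped. `now` is assumed to be monotonically non-decreasing. */
int warn_allowed(struct warn_ratelimit *rl, uint64_t now)
{
    if (now - rl->window_start >= rl->interval) {
        /* The previous window has expired; start a new one. */
        rl->window_start = now;
        rl->printed = 0;
    }
    if (rl->printed < rl->burst) {
        rl->printed++;
        return 1;
    }
    rl->suppressed++;
    return 0;
}
```

A caller would wrap each warning in `if (warn_allowed(&rl, now))` and could periodically report `rl.suppressed` so that the number of dropped messages is not lost.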
I used systemtap to catch one of these bad groups and dump out the ldiskfs_group_desc struct.
mballoc.c:826: first_group: 274007 bg_free_blocks_count_hi: 0 bg_block_bitmap_hi: 0 bg_free_blocks_count_lo: 0
mballoc.c:826:$desc {.bg_block_bitmap_lo=328727, .bg_inode_bitmap_lo=930551, .bg_inode_table_lo=3450424,
  .bg_free_blocks_count_lo=0, .bg_free_inodes_count_lo=128, .bg_used_dirs_count_lo=0,
  .bg_flags=7, .bg_reserved=[...], .bg_itable_unused_lo=128, .bg_checksum=55256,
  .bg_block_bitmap_hi=0, .bg_inode_bitmap_hi=0, .bg_inode_table_hi=0,
  .bg_free_blocks_count_hi=0, .bg_free_inodes_count_hi=0, .bg_used_dirs_count_hi=0,
  .bg_itable_unused_hi=0, .bg_reserved2=[...]}
It also seems odd that dumpe2fs can produce different results for unused block groups. Sometimes it will show block_bitmap != free_blocks, and other times it will be OK.
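As an aid to reading the descriptor dump above, the following sketch decodes the two fields that matter here. It assumes ldiskfs uses the same group-descriptor flag values as ext4 (INODE_UNINIT 0x1, BLOCK_UNINIT 0x2, INODE_ZEROED 0x4) and that the free block count is split into 16-bit low and high halves; the struct is a hypothetical subset for illustration, not the real ldiskfs_group_desc.

```c
#include <stdint.h>

/* Flag values as defined for ext4 group descriptors; assumed here to be
 * mirrored by ldiskfs. */
#define BG_INODE_UNINIT 0x0001 /* inode table and bitmap not initialized */
#define BG_BLOCK_UNINIT 0x0002 /* block bitmap not initialized */
#define BG_INODE_ZEROED 0x0004 /* on-disk inode table has been zeroed */

/* Hypothetical subset of the descriptor: only the fields used below. */
struct gd_sample {
    uint16_t bg_free_blocks_count_lo;
    uint16_t bg_free_blocks_count_hi;
    uint16_t bg_flags;
};

/* Combine the two 16-bit halves into the group's free block count. */
uint32_t gd_free_blocks(const struct gd_sample *gd)
{
    return (uint32_t)gd->bg_free_blocks_count_lo |
           ((uint32_t)gd->bg_free_blocks_count_hi << 16);
}

/* Return 1 if the block bitmap is flagged as not yet initialized. */
int gd_block_uninit(const struct gd_sample *gd)
{
    return (gd->bg_flags & BG_BLOCK_UNINIT) != 0;
}
```

Under those assumptions, bg_flags=7 has all three flags set, including BLOCK_UNINIT, while the combined free block count is 0.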
---
In ldiskfs_valid_block_bitmap(), I don't understand this:
if (LDISKFS_HAS_INCOMPAT_FEATURE(sb, LDISKFS_FEATURE_INCOMPAT_FLEX_BG)) {
        /* with FLEX_BG, the inode/block bitmaps and itable
         * blocks may not be in the group at all
         * so the bitmap validation will be skipped for those groups
         * or it has to also read the block group where the bitmaps
         * are located to verify they are set.
         */
        return 1;
}
We have flex_bg enabled; would this apply to us?
For the OSTs that are prone to the bitmap errors, running cat /proc/fs/ldiskfs/dm*/mb_groups will reproduce the errors.
Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/balloc.c, 179): ldiskfs_init_block_bitmap: #24877: init the group 270808 of total groups 583584: group_blocks 32768, free_blocks 32768, free_blocks_in_gdp 0, ret 32768
The logs show that ldiskfs_init_block_bitmap() initialized the bitmap, but the free block count in the group descriptor is still zero, which caused the subsequent ldiskfs_mb_check_ondisk_bitmap() failure. Currently, I cannot say it is corruption; it looks more like a logic issue. The patch will set the free block count based on the real free bits in the bitmap. It may not be the perfect solution, but we can try it and see whether it resolves your trouble.
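The idea described above, taking the bitmap as the source of truth and correcting the descriptor's count to match, can be sketched in userspace as follows. This is only an illustration of the approach, not the code of the actual patch; the helper names and the flat byte-array bitmap are assumptions made for the example.

```c
#include <stdint.h>

/* Count the zero (free) bits among the first `nbits` bits of a block
 * bitmap, where a set bit means the block is in use. */
uint32_t count_free_bits(const uint8_t *bitmap, uint32_t nbits)
{
    uint32_t free_bits = 0;

    for (uint32_t i = 0; i < nbits; i++) {
        if (!(bitmap[i / 8] & (1u << (i % 8))))
            free_bits++;
    }
    return free_bits;
}

/* Hypothetical helper: if the descriptor's free count disagrees with the
 * bitmap, overwrite it with the bitmap's count.
 * Returns 1 if a correction was made, 0 if the two already matched. */
int reconcile_free_count(const uint8_t *bitmap, uint32_t nbits,
                         uint32_t *gd_free)
{
    uint32_t in_bitmap = count_free_bits(bitmap, nbits);

    if (*gd_free == in_bitmap)
        return 0;
    *gd_free = in_bitmap;
    return 1;
}
```

For the case in the log, an all-zero bitmap covering 32768 blocks with a descriptor count of 0 would be corrected to 32768.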
Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/28550
Subject: LU-9410 ldiskfs: handle unmatched bitmap
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0a4199ad21c5ac23a4a4e7e07847610ad8ec7994
Got block group debug logs with corruption. Block group is #270808. I will attach the full log file to the case: syslog.gp270808.error.gz
Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1103): ldiskfs_mb_load_buddy: load group 270808
Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1032): ldiskfs_mb_init_group: init group 270808
Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/balloc.c, 179): ldiskfs_init_block_bitmap: #24877: init the group 270808 of total groups 583584: group_blocks 32768, free_blocks 32768, free_blocks_in_gdp 0, ret 32768
Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 927): ldiskfs_mb_init_cache: put bitmap for group 270808 in page 541616/0
Aug 14 18:37:14 nbp2-oss20 kernel: on-disk bitmap for group 270808 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 14 18:37:14 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 270808
Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1103): ldiskfs_mb_load_buddy: load group 270808
Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1032): ldiskfs_mb_init_group: init group 270808
Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 927): ldiskfs_mb_init_cache: put bitmap for group 270808 in page 541616/0
Aug 14 18:37:14 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 270808 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 14 18:37:14 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 270808
Aug 14 18:37:15 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1103): ldiskfs_mb_load_buddy: load group 270808
Aug 14 18:37:15 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1032): ldiskfs_mb_init_group: init group 270808
Aug 14 18:37:15 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 927): ldiskfs_mb_init_cache: put bitmap for group 270808 in page 541616/0
Aug 14 18:37:15 nbp2-oss20 kernel: on-disk bitmap for group 270808 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 14 18:37:15 nbp2-oss20 kernel: Error in loading buddy information for 270808
Aug 14 18:37:15 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1103): ldiskfs_mb_load_buddy: load group 270808
Aug 14 18:37:15 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1032): ldiskfs_mb_init_group: init group 270808
Aug 14 18:37:15 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 927): ldiskfs_mb_init_cache: put bitmap for group 270808 in page 541616/0
Aug 14 18:37:15 nbp2-oss20 kernel: on-disk bitmap for group 270808 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 14 18:37:15 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1103): ldiskfs_mb_load_buddy: Error in loading buddy information for 270808
Aug 14 18:37:17 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1103): ldiskfs_mb_load_buddy: load group 270808
Aug 14 18:37:17 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1032): ldiskfs_mb_init_group: init group 270808
Aug 14 18:37:17 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 927): ldiskfs_mb_init_cache: put bitmap for group 270808 in page 541616/0
With the new build, are we supposed to have mballoc-debug in /proc or /sys? A find doesn't turn up anything.
Never mind, I figured this out. We need to mount debugfs for it to show up.
LU-7114 will allow the system to continue without failing right away when it finds a corrupted bitmap, but the corruption is still there. I would suggest applying the patch https://review.whamcloud.com/#/c/28489/, which will give us more information from the mb operations trace.
So we haven't put debug patch 28489 in place, but are now running with the LU-7114 patch. It has already found bitmap errors.
Aug 12 01:05:43 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:43 nbp2-oss20 kernel:
Aug 12 01:05:43 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
Aug 12 01:05:43 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:43 nbp2-oss20 kernel:
Aug 12 01:05:43 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
Aug 12 01:05:44 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:45 nbp2-oss20 kernel:
Aug 12 01:05:45 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
Aug 12 01:05:45 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:45 nbp2-oss20 kernel:
Aug 12 01:05:45 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
Aug 12 01:05:46 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:47 nbp2-oss20 kernel:
Aug 12 01:05:47 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
Aug 12 01:05:47 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:47 nbp2-oss20 kernel:
Aug 12 01:05:47 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
Aug 12 01:05:49 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:50 nbp2-oss20 kernel:
Aug 12 01:05:50 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
Aug 12 01:05:50 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:50 nbp2-oss20 kernel:
Aug 12 01:05:50 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
Aug 12 01:05:53 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:54 nbp2-oss20 kernel:
Aug 12 01:05:54 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
Aug 12 01:05:54 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:54 nbp2-oss20 kernel:
Aug 12 01:05:54 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
Aug 12 01:05:59 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:59 nbp2-oss20 kernel:
Aug 12 01:05:59 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
Aug 12 01:05:59 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:59 nbp2-oss20 kernel:
Aug 12 01:05:59 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
Aug 12 01:06:05 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:06:05 nbp2-oss20 kernel:
Aug 12 01:06:05 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
Aug 12 01:06:05 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:06:05 nbp2-oss20 kernel:
Aug 12 01:06:05 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
Aug 12 01:06:12 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Some time later
Aug 12 04:05:12 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 276684 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 04:05:12 nbp2-oss20 kernel:
Aug 12 04:05:12 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 276685 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 04:07:56 nbp2-oss20 pcp-pmie[5801]: High 1-minute load average 354load@nbp2-oss20
Aug 12 04:07:56 nbp2-oss20 - in gd
Aug 12 04:07:56 nbp2-oss20 kernel:
Aug 12 04:07:56 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 304861 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 04:07:56 nbp2-oss20 kernel:
Aug 12 04:07:56 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 304862 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 04:07:56 nbp2-oss20 kernel:
Aug 12 04:07:56 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 304863 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 04:07:56 nbp2-oss20 kernel:
Aug 12 04:07:56 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 304864 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 04:07:56 nbp2-oss20 kernel: .....
It has marked 6727 unique groups as bad for dm-21 (ost319).
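For anyone wanting to reproduce a count like this from the syslog, here is a rough sketch of one way to do it: pull the group number out of each "on-disk bitmap for group N" line and track which groups have been seen in a bit array. The function and struct names are hypothetical, and MAX_GROUPS is simply the "total groups 583584" figure from the debug log in this ticket, used here as an assumed upper bound.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed upper bound, taken from "total groups 583584" in the debug log. */
#define MAX_GROUPS 583584u

/* Extract the group number from a bitmap warning line.
 * Returns 1 and stores the number on success, 0 if the line does not
 * contain a "bitmap for group N" message. */
int parse_corrupted_group(const char *line, unsigned long *group)
{
    const char *p = strstr(line, "bitmap for group ");

    if (p == NULL)
        return 0;
    return sscanf(p, "bitmap for group %lu", group) == 1;
}

/* One bit per block group; must be zero-initialized before use. */
struct group_set {
    uint8_t seen[(MAX_GROUPS + 7) / 8];
    unsigned long unique;
};

/* Record a group. Returns 1 if it was new, 0 if it was already recorded
 * or is out of range. */
int group_set_add(struct group_set *s, unsigned long g)
{
    if (g >= MAX_GROUPS)
        return 0;
    if (s->seen[g / 8] & (1u << (g % 8)))
        return 0;
    s->seen[g / 8] |= (uint8_t)(1u << (g % 8));
    s->unique++;
    return 1;
}
```

Feeding every log line through parse_corrupted_group() and group_set_add() leaves the number of distinct bad groups in the `unique` field. The parser also copes with the older message format where the group number runs straight into the word "corrupted".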
https://review.whamcloud.com/28489 is refreshed, please try again. Thanks!
Here is part of dmesg. The high rate of messages caused the root drive SCSI device to reset. All but one server recovered. I had to turn down the printk log level to get the last one to recover.
Here is /var/log/messages