Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.10.1, Lustre 2.11.0
    • Lustre 2.7.0
    • None
    • 3
    • 9223372036854775807

    Description

      Two OSSes and three different OSTs crashed with bitmap-corrupted messages.

      Apr  3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs error (device dm-42): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 245659corrupted: 32768 blocks free in bitmap, 0 - in gd
      Apr  3 18:38:16 nbp1-oss6 kernel: 
      Apr  3 18:38:16 nbp1-oss6 kernel: Aborting journal on device dm-3.
      Apr  3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs (dm-42): Remounting filesystem read-only
      Apr  3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs error (device dm-42): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 245660corrupted: 32768 blocks free in bitmap, 0 - in gd
      
      
      

      These errors were on 2 different backend RAID devices. Noteworthy items:
      1. The filesystem was over 90% full and half of the data was deleted.
      2. OSTs are formatted with " -E packed_meta_blocks=1 "
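
      When triaging messages like the ones above, it can help to pull out the device, group number, and the two free-block counts. The following is a minimal sketch, not a tool from this ticket: the function name and the regular expression are assumptions built only from the message text quoted here, and the pattern tolerates the missing space in "245659corrupted".

```python
import re

# Pattern for the ldiskfs_mb_check_ondisk_bitmap message as quoted in this ticket.
# "\s*" after the group number tolerates the missing space ("245659corrupted").
BITMAP_ERR = re.compile(
    r"LDISKFS-fs (?:error|warning) \(device (?P<dev>[\w-]+)\): "
    r"ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group "
    r"(?P<group>\d+)\s*corrupted: (?P<bitmap_free>\d+) blocks free in bitmap, "
    r"(?P<gd_free>\d+) - in gd"
)


def parse_bitmap_errors(lines):
    """Return (device, group, free_in_bitmap, free_in_gd) for each matching line."""
    records = []
    for line in lines:
        m = BITMAP_ERR.search(line)
        if m:
            records.append(
                (
                    m.group("dev"),
                    int(m.group("group")),
                    int(m.group("bitmap_free")),
                    int(m.group("gd_free")),
                )
            )
    return records


# Sample lines copied from the description above.
sample = [
    "Apr  3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs error (device dm-42): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 245659corrupted: 32768 blocks free in bitmap, 0 - in gd",
    "Apr  3 18:38:16 nbp1-oss6 kernel: Aborting journal on device dm-3.",
    "Apr  3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs error (device dm-42): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 245660corrupted: 32768 blocks free in bitmap, 0 - in gd",
]
records = parse_bitmap_errors(sample)
```

      In both sample records the bitmap reports the whole group free (32768 blocks) while the group descriptor reports 0.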

      Attachments

        1. bt.2017-07-26-02.48.00
          765 kB
        2. bt.2017-07-26-12.08.43
          808 kB
        3. foreach.out
          736 kB
        4. mballoc.c
          145 kB
        5. ost258.dumpe2fs.after.fsck.gz
          34.46 MB
        6. ost258.dumpe2fs.after.readonly.gz
          34.44 MB
        7. syslog.gp270808.error.gz
          13.37 MB
        8. vmcore-dmesg.txt
          512 kB

        Issue Links

          Activity

            [LU-9410] on-disk bitmap corrupted
            mhanafi Mahmoud Hanafi added a comment - - edited

            Here is part of dmesg. The high rate of messages caused the root drive SCSI device to reset. All but one server recovered; I had to turn down the printk log level to get the last one to recover.

            LDISKFS-fs warning (device dm-33): ldiskfs_init_block_bitmap: Set free blocks as 32768 for group 262310
            
            LDISKFS-fs warning (device dm-33): ldiskfs_init_block_bitmap: Set free blocks as 32768 for group 262311
            
            LDISKFS-fs warning (device dm-33): ldiskfs_init_block_bitmap: Set free blocks as 32768 for group 262312
            
            LDISKFS-fs warning (device dm-33): ldiskfs_init_block_bitmap: Set free blocks as 32768 for group 262313
            
            LDISKFS-fs warning (device dm-33): ldiskfs_init_block_bitmap: Set free blocks as 32768 for group 262314
            LNet: 12178:0:(lib-move.c:1487:lnet_parse_put()) Dropping PUT from 12345-10.149.2.156@o2ib313 portal 28 match 1575300167923792 offset 0 length 520: 4
            LNet: 12178:0:(lib-move.c:1487:lnet_parse_put()) Skipped 978380 previous similar messages
            sd 0:0:1:0: attempting task abort! scmd(ffff880af433e0c0)
            sd 0:0:1:0: [sdb] CDB: Write(10): 2a 00 00 a0 08 08 00 00 08 00
            scsi target0:0:1: handle(0x000a), sas_address(0x4433221102000000), phy(2)
            scsi target0:0:1: enclosure_logical_id(0x50030480198f7e01), slot(2)
            scsi target0:0:1: enclosure level(0x0000),connector name(    ^C)
            sd 0:0:1:0: task abort: SUCCESS scmd(ffff880af433e0c0)
            sd 0:0:1:0: attempting task abort! scmd(ffff880a64ab46c0)
            sd 0:0:1:0: [sdb] CDB: Write(10): 2a 00 00 e0 08 08 00 00 08 00
            scsi target0:0:1: handle(0x000a), sas_address(0x4433221102000000), phy(2)
            scsi target0:0:1: enclosure_logical_id(0x50030480198f7e01), slot(2)
            scsi target0:0:1: enclosure level(0x0000),connector name(    ^C)
            sd 0:0:1:0: task abort: SUCCESS scmd(ffff880a64ab46c0)
            sd 0:0:1:0: attempting task abort! scmd(ffff880b21cec180)
            sd 0:0:1:0: [sdb] CDB: Write(10): 2a 00 00 c0 08 08 00 00 08 00
            scsi target0:0:1: handle(0x000a), sas_address(0x4433221102000000), phy(2)
            DISKFS-fs (dm-23): mounted filesystem with ordered data mode. quota=on. Opts: 
            LDISKFS-fs (dm-34): mounted filesystem with ordered data mode. quota=on. Opts: 
            mounted filesystem with ordered data mode. quota=on. Opts: 
            LDISKFS-fs (dm-29): mounted filesystem with ordered data mode. quota=on. Opts: 
            
            LDISKFS-fs (dm-18): mounted filesystem with ordered data mode. quota=on. Opts: 
            Lustre: nbp2-OST0081: Not available for connect from 10.151.43.107@o2ib (not set up)
            Lustre: Skipped 3 previous similar messages
            Lustre: nbp2-OST0081: Not available for connect from 10.151.29.130@o2ib (not set up)
            Lustre: Skipped 113 previous similar messages
            Lustre: nbp2-OST0081: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
            Lustre: nbp2-OST0081: Will be in recovery for at least 2:30, or until 14441 clients reconnect
            Lustre: nbp2-OST0081: Denying connection for new client 35b99837-9505-fc4d-270f-f2d1ca30372d (at 10.151.30.176@o2ib), waiting for all 14441 known clients (44 recovered, 1 in progress, and 0 evicted) to recover in 5:10
            
            
            

            Here is /var/log/messages

            Aug 11 17:58:25 nbp2-oss10 kernel: LNet: 12075:0:(lib-move.c:1487:lnet_parse_put()) Dropping PUT from 12345-10.151.30.120@o2ib portal 28 match 1575477031778096 offset 0 length 520: 4
            Aug 11 17:58:25 nbp2-oss10 kernel: LNet: 12075:0:(lib-move.c:1487:lnet_parse_put()) Skipped 1037319 previous similar messages
            Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-30):
            Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-28): mounted filesystem with ordered data mode. quota=on. Opts:
            Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-31): mounted filesystem with ordered data mode. quota=on. Opts:
            Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-18): mounted filesystem with ordered data mode. quota=on. Opts:
            Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-21): mounted filesystem with ordered data mode. quota=on. Opts:
            Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-19): mounted filesystem with ordered data mode. quota=on. Opts:
            Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-22): mounted filesystem with ordered data mode. quota=on. Opts:
            Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-20): mounted filesystem with ordered data mode. quota=on. Opts:
            Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-26): mounted filesystem with ordered data mode. quota=on. Opts:
            Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-33): mounted filesystem with ordered data mode. quota=on. Opts:
            Aug 11 18:03:35 nbp2-oss10 kernel: mounted filesystem with ordered data mode. quota=on. Opts:
            Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-23): mounted filesystem with ordered data mode. quota=on. Opts:
            Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-32): mounted filesystem with ordered data mode. quota=on. Opts:
            Aug 11 18:03:40 nbp2-oss10 kernel: LDISKFS-fs (dm-34): mounted filesystem with ordered data mode. quota=on. Opts:
            Aug 11 18:03:40 nbp2-oss10 kernel: LDISKFS-fs (dm-24): mounted filesystem with ordered data mode. quota=on. Opts:
            Aug 11 18:03:40 nbp2-oss10 kernel: LDISKFS-fs (dm-25): mounted filesystem with ordered data mode. quota=on. Opts:
            Aug 11 18:03:40 nbp2-oss10 kernel: 
            Aug 11 18:03:41 nbp2-oss10 kernel: LDISKFS-fs (dm-29):
            Aug 11 18:03:41 nbp2-oss10 kernel: LDISKFS-fs (dm-35): mounted filesystem with ordered data mode. quota=on. Opts:
            Aug 11 18:03:41 nbp2-oss10 kernel: mounted filesystem with ordered data mode. quota=on. Opts:
            Aug 11 18:03:49 nbp2-oss10 kernel: LDISKFS-fs (dm-27): mounted filesystem with ordered data mode. quota=on. Opts:
            Aug 11 18:03:50 nbp2-oss10 kernel: LustreError: 137-5: nbp2-OST0009_UUID: not available for connect from 10.151.50.143@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
            Aug 11 18:03:50 nbp2-oss10 kernel: LustreError: Skipped 314 previous similar messages
            Aug 11 18:03:51 nbp2-oss10 kernel: Lustre: nbp2-OST00d1: Not available for connect from 10.151.9.177@o2ib (not set up)
            Aug 11 18:03:51 nbp2-oss10 kernel: Lustre: Skipped 11 previous similar messages
            Aug 11 18:03:51 nbp2-oss10 kernel: LustreError: 137-5: nbp2-OST0009_UUID: not available for connect from 10.151.8.85@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
            Aug 11 18:03:51 nbp2-oss10 kernel: LustreError: Skipped 3632 previous similar messages
            Aug 11 18:03:51 nbp2-oss10 kernel: Lustre: nbp2-OST00d1: Not available for connect from 10.151.50.241@o2ib (not set up)
            Aug 11 18:03:51 nbp2-oss10 kernel: Lustre: Skipped 180 previous similar messages
            Aug 11 18:03:52 nbp2-oss10 kernel: LustreError: 137-5: nbp2-OST0135_UUID: not available for connect from 10.151.48.113@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
            Aug 11 18:03:52 nbp2-oss10 kernel: LustreError: Skipped 6273 previous similar messages
            Aug 11 18:03:52 nbp2-oss10 kernel: Lustre: nbp2-OST00d1: Not available for connect from 10.151.7.158@o2ib (not set up)
            Aug 11 18:03:52 nbp2-oss10 kernel: Lustre: Skipped 402 previous similar messages
            Aug 11 18:03:52 nbp2-oss10 kernel: Lustre: nbp2-OST00d1: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
            Aug 11 18:03:52 nbp2-oss10 kernel: Lustre: nbp2-OST00d1: Will be in recovery for at least 2:30, or until 14452 clients reconnect
            
            

            yong.fan nasf (Inactive) added a comment

            mhanafi
            It looks different from the original one. Would you please show me more logs (dmesg, /var/log/messages) from the latest corruption? Is the system still accessible after the above warning?

            mhanafi Mahmoud Hanafi added a comment

            Applied the new patch. After a full fsck, mounting the OSTs resulted in the following number of block groups being corrected on each server:

            server      device  groups corrected
            service603  dm-33   4549
            service604  dm-32   4425
            service606  dm-29   4658
            service610  dm-33   4631
            service611  dm-28   4616
            service616  dm-35   4652
            service617  dm-21   4501
            service619  dm-25   4657

            We need to rate-limit the warnings.
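
            To illustrate what rate limiting the warnings would mean, here is a minimal sketch of an interval/burst limiter, similar in spirit to the kernel's ratelimit helpers. It is not the patch from this ticket: the class name, the 5-second interval, and the burst of 10 are all assumptions chosen for illustration.

```python
class RateLimiter:
    """Allow at most `burst` messages per `interval` seconds and count the rest."""

    def __init__(self, interval, burst):
        self.interval = interval
        self.burst = burst
        self.window_start = None  # start time of the current window
        self.printed = 0          # messages allowed in the current window
        self.suppressed = 0       # total messages dropped so far

    def allow(self, now):
        # Open a new window on the first call or once the interval has elapsed.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.printed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        self.suppressed += 1
        return False


def demo():
    """Feed 4549 warnings (the service603 count above) arriving 1 ms apart."""
    limiter = RateLimiter(interval=5.0, burst=10)
    emitted = sum(limiter.allow(i * 0.001) for i in range(4549))
    return emitted, limiter.suppressed
```

            With these assumed settings, all 4549 warnings fall inside one 5-second window, so only the first 10 would reach the console and the remainder would be counted as suppressed.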
            mhanafi Mahmoud Hanafi added a comment - - edited

            I used systemtap to catch one of these bad groups and dump out the ldiskfs_group_desc struct.

            mballoc.c:826: first_group: 274007 bg_free_blocks_count_hi: 0 bg_block_bitmap_hi: 0 bg_free_blocks_count_lo: 0
            mballoc.c:826:$desc {.bg_block_bitmap_lo=328727, .bg_inode_bitmap_lo=930551, .bg_inode_table_lo=3450424, .bg_free_blocks_count_lo=0, .bg_free_inodes_count_lo=128, .bg_used_dirs_count_lo=0, .bg_flags=7, .bg_reserved=[...], .bg_itable_unused_lo=128, .bg_checksum=55256, .bg_block_bitmap_hi=0, .bg_inode_bitmap_hi=0, .bg_inode_table_hi=0, .bg_free_blocks_count_hi=0, .bg_free_inodes_count_hi=0, .bg_used_dirs_count_hi=0, .bg_itable_unused_hi=0, .bg_reserved2=[...]}
            
            
            

             

            It also seems odd that dumpe2fs can produce different results for unused block groups. Sometimes it shows block_bitmap != free_blocks and other times it is OK.

             ---

            In ldiskfs_valid_block_bitmap(), I don't understand this:

             if (LDISKFS_HAS_INCOMPAT_FEATURE(sb, LDISKFS_FEATURE_INCOMPAT_FLEX_BG)) {
             /* with FLEX_BG, the inode/block bitmaps and itable
             * blocks may not be in the group at all
             * so the bitmap validation will be skipped for those groups
             * or it has to also read the block group where the bitmaps
             * are located to verify they are set.
             */
             return 1;
             }
            
            

            We have flex_bg enabled; would this apply to us?

             

            For the OSTs that are prone to the bitmap errors, running cat /proc/fs/ldiskfs/dm*/mb_groups will reproduce the errors.

             

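
            The descriptor dump above can be read more easily by decoding bg_flags and combining the lo/hi halves of the free-block count. This is a minimal sketch that assumes ldiskfs keeps the ext4 on-disk flag values (INODE_UNINIT = 0x1, BLOCK_UNINIT = 0x2, INODE_ZEROED = 0x4) and the ext4 convention that the hi field holds the upper 16 bits; the helper names are hypothetical.

```python
# ext4 block group descriptor flag bits (assumed unchanged in ldiskfs).
BG_FLAGS = [
    (0x0001, "INODE_UNINIT"),  # inode table and bitmap not initialized
    (0x0002, "BLOCK_UNINIT"),  # block bitmap not initialized
    (0x0004, "INODE_ZEROED"),  # inode table has been zeroed
]


def decode_bg_flags(bg_flags):
    """Return the names of the flag bits set in bg_flags."""
    return [name for bit, name in BG_FLAGS if bg_flags & bit]


def free_blocks_count(lo, hi):
    """Combine the 16-bit lo/hi halves of bg_free_blocks_count."""
    return (hi << 16) | lo


# Values taken from the systemtap dump above.
flags = decode_bg_flags(7)
free_in_gd = free_blocks_count(lo=0, hi=0)
```

            Under these assumptions, bg_flags=7 includes BLOCK_UNINIT, meaning the block bitmap is built in memory rather than read from disk, while the descriptor itself records zero free blocks.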

            yong.fan nasf (Inactive) added a comment

            Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/balloc.c, 179): ldiskfs_init_block_bitmap: #24877: init the group 270808 of total groups 583584: group_blocks 32768, free_blocks 32768, free_blocks_in_gdp 0, ret 32768

            The log shows that ldiskfs_init_block_bitmap() initialized the bitmap, but the free block count in the group descriptor is still zero, which caused the subsequent ldiskfs_mb_check_ondisk_bitmap() failure. At this point I cannot say it is corruption; it looks more like a logic issue. The patch sets the free block count based on the actual free bits in the bitmap. It may not be the perfect solution, but we can try it to see whether it resolves your trouble.
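
            The idea described above (trust the free bits in the bitmap when they disagree with the descriptor) can be sketched in user space as follows. This is only an illustration of the comparison, not the code from the patch: the function names are hypothetical, and the bitmap is modeled as a byte string where a zero bit means a free block.

```python
def count_free_bits(bitmap, nbits):
    """Count zero bits (free blocks) among the first nbits of a block bitmap."""
    used = 0
    for i in range(nbits):
        # Bit i lives in byte i // 8, at bit position i % 8 (little-endian bit order).
        if bitmap[i >> 3] & (1 << (i & 7)):
            used += 1
    return nbits - used


def reconcile_free_count(bitmap, nbits, gd_free):
    """Return (free count taken from the bitmap, whether the descriptor disagreed)."""
    actual = count_free_bits(bitmap, nbits)
    return actual, actual != gd_free


# Group 270808 from the log: a freshly initialized 32768-block group with an
# all-free bitmap (one 4096-byte block), while the descriptor says 0 free blocks.
bitmap = bytes(4096)
new_free, mismatched = reconcile_free_count(bitmap, 32768, gd_free=0)
```

            In this model the reconciled count is what the bitmap says, so a later bitmap-versus-descriptor check would no longer see a difference.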

            gerrit Gerrit Updater added a comment

            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/28550
            Subject: LU-9410 ldiskfs: handle unmatched bitmap
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0a4199ad21c5ac23a4a4e7e07847610ad8ec7994

            mhanafi Mahmoud Hanafi added a comment - - edited

            Got block group debug logs with the corruption. The block group is #270808. I will attach the full log file to the case: syslog.gp270808.error.gz

            Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1103): ldiskfs_mb_load_buddy: load group 270808
            Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1032): ldiskfs_mb_init_group: init group 270808
            Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/balloc.c, 179): ldiskfs_init_block_bitmap: #24877: init the group 270808 of total groups 583584: group_blocks 32768, free_blocks 32768, free_blocks_in_gdp 0, ret 32768
            Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 927): ldiskfs_mb_init_cache: put bitmap for group 270808 in page 541616/0
            Aug 14 18:37:14 nbp2-oss20 kernel: on-disk bitmap for group 270808 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 14 18:37:14 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 270808
            Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1103): ldiskfs_mb_load_buddy: load group 270808
            Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1032): ldiskfs_mb_init_group: init group 270808
            Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 927): ldiskfs_mb_init_cache: put bitmap for group 270808 in page 541616/0
            Aug 14 18:37:14 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 270808 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 14 18:37:14 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 270808
            Aug 14 18:37:15 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1103): ldiskfs_mb_load_buddy: load group 270808
            Aug 14 18:37:15 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1032): ldiskfs_mb_init_group: init group 270808
            Aug 14 18:37:15 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 927): ldiskfs_mb_init_cache: put bitmap for group 270808 in page 541616/0
            Aug 14 18:37:15 nbp2-oss20 kernel: on-disk bitmap for group 270808 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 14 18:37:15 nbp2-oss20 kernel: Error in loading buddy information for 270808
            Aug 14 18:37:15 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1103): ldiskfs_mb_load_buddy: load group 270808
            Aug 14 18:37:15 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1032): ldiskfs_mb_init_group: init group 270808
            Aug 14 18:37:15 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 927): ldiskfs_mb_init_cache: put bitmap for group 270808 in page 541616/0
            Aug 14 18:37:15 nbp2-oss20 kernel: on-disk bitmap for group 270808 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 14 18:37:15 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1103): ldiskfs_mb_load_buddy: Error in loading buddy information for 270808
            Aug 14 18:37:17 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1103): ldiskfs_mb_load_buddy: load group 270808
            Aug 14 18:37:17 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1032): ldiskfs_mb_init_group: init group 270808
            Aug 14 18:37:17 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 927): ldiskfs_mb_init_cache: put bitmap for group 270808 in page 541616/0
            
            

             

             

            mhanafi Mahmoud Hanafi added a comment - - edited

            With the new build, are we supposed to have mballoc-debug in /proc or /sys? I ask because find doesn't locate it in either.

            Never mind, I figured this out. We need to mount debugfs for it to show up.
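
            A minimal sketch of checking whether debugfs is mounted before searching for the mballoc-debug entry. It only parses text in /proc/mounts format. The helper name debugfs_mountpoints and the sample mount table are made up for this illustration, and /sys/kernel/debug is assumed as the usual mount point; neither is stated in this ticket.

```python
def debugfs_mountpoints(mounts_text):
    """Return the mount points whose filesystem type is debugfs.

    mounts_text is text in /proc/mounts format, one mount per line:
    device mountpoint fstype options dump pass
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Field 2 is the filesystem type, field 1 the mount point.
        if len(fields) >= 3 and fields[2] == "debugfs":
            points.append(fields[1])
    return points


# Hypothetical sample; on a real server, pass the contents of /proc/mounts.
SAMPLE_MOUNTS = (
    "sysfs /sys sysfs rw,relatime 0 0\n"
    "proc /proc proc rw,relatime 0 0\n"
    "none /sys/kernel/debug debugfs rw,relatime 0 0\n"
)

found = debugfs_mountpoints(SAMPLE_MOUNTS)
if found:
    print("debugfs mounted at:", ", ".join(found))
else:
    # Assumed manual mount, run as root:
    #   mount -t debugfs none /sys/kernel/debug
    print("debugfs is not mounted")
```

            If the list comes back empty on the OSS, the debug entries will not be visible to find until debugfs is mounted.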


            yong.fan nasf (Inactive) added a comment

            LU-7114 lets the system keep running instead of failing immediately when it finds a corrupted bitmap, but the corruption is still there. I suggest applying the patch https://review.whamcloud.com/#/c/28489/; it will give us more information from the mb operations trace.
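
            The warning text in this ticket compares two numbers: the free blocks counted from the on-disk block bitmap and the free count stored in the group descriptor (gd). The following is a rough Python illustration of that comparison, not the ldiskfs code. The function names are hypothetical. It assumes a clear bit means a free block, as in ext4 block bitmaps, and 32768 blocks per group, as reported in the log messages.

```python
BLOCKS_PER_GROUP = 32768  # per-group block count seen in the log messages


def count_free_blocks(bitmap, nblocks=BLOCKS_PER_GROUP):
    """Count clear bits (free blocks) among the first nblocks bits."""
    full_bytes, rest = divmod(nblocks, 8)
    used = sum(bin(b).count("1") for b in bitmap[:full_bytes])
    if rest:
        # Bits are numbered from the least significant bit of each byte.
        used += bin(bitmap[full_bytes] & ((1 << rest) - 1)).count("1")
    return nblocks - used


def bitmap_matches_gd(bitmap, gd_free, nblocks=BLOCKS_PER_GROUP):
    """True when the bitmap's free count agrees with the descriptor's."""
    return count_free_blocks(bitmap, nblocks) == gd_free


# An all-zero bitmap has every block free, which disagrees with a
# descriptor that claims 0 free blocks: the mismatch the warnings report.
empty_bitmap = bytes(BLOCKS_PER_GROUP // 8)
print(count_free_blocks(empty_bitmap), bitmap_matches_gd(empty_bitmap, 0))
```

            With a check like this in place, continuing past a mismatch only avoids the immediate failure; the two counts still disagree on disk.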

            mhanafi Mahmoud Hanafi added a comment

            We haven't put debug patch 28489 in place yet, but we are now running with the LU-7114 patch. It has already found bitmap errors.

            Aug 12 01:05:43 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 12 01:05:43 nbp2-oss20 kernel: 
            Aug 12 01:05:43 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
            Aug 12 01:05:43 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 12 01:05:43 nbp2-oss20 kernel: 
            Aug 12 01:05:43 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
            Aug 12 01:05:44 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 12 01:05:45 nbp2-oss20 kernel: 
            Aug 12 01:05:45 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
            Aug 12 01:05:45 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 12 01:05:45 nbp2-oss20 kernel: 
            Aug 12 01:05:45 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
            Aug 12 01:05:46 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 12 01:05:47 nbp2-oss20 kernel: 
            Aug 12 01:05:47 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
            Aug 12 01:05:47 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 12 01:05:47 nbp2-oss20 kernel: 
            Aug 12 01:05:47 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
            Aug 12 01:05:49 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 12 01:05:50 nbp2-oss20 kernel: 
            Aug 12 01:05:50 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
            Aug 12 01:05:50 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 12 01:05:50 nbp2-oss20 kernel: 
            Aug 12 01:05:50 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
            Aug 12 01:05:53 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 12 01:05:54 nbp2-oss20 kernel: 
            Aug 12 01:05:54 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
            Aug 12 01:05:54 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 12 01:05:54 nbp2-oss20 kernel: 
            Aug 12 01:05:54 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
            Aug 12 01:05:59 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 12 01:05:59 nbp2-oss20 kernel: 
            Aug 12 01:05:59 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
            Aug 12 01:05:59 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 12 01:05:59 nbp2-oss20 kernel: 
            Aug 12 01:05:59 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
            Aug 12 01:06:05 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 12 01:06:05 nbp2-oss20 kernel: 
            Aug 12 01:06:05 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
            Aug 12 01:06:05 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 12 01:06:05 nbp2-oss20 kernel: 
            Aug 12 01:06:05 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790
            Aug 12 01:06:12 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
            
            
            
            
            

            Some time later

            Aug 12 04:05:12 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 276684 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 12 04:05:12 nbp2-oss20 kernel: 
            Aug 12 04:05:12 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 276685 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 12 04:07:56 nbp2-oss20 pcp-pmie[5801]: High 1-minute load average 354load@nbp2-oss20
            Aug 12 04:07:56 nbp2-oss20 - in gd
            Aug 12 04:07:56 nbp2-oss20 kernel: 
            Aug 12 04:07:56 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 304861 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 12 04:07:56 nbp2-oss20 kernel: 
            Aug 12 04:07:56 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 304862 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 12 04:07:56 nbp2-oss20 kernel: 
            Aug 12 04:07:56 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 304863 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 12 04:07:56 nbp2-oss20 kernel: 
            Aug 12 04:07:56 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 304864 corrupted: 32768 blocks free in bitmap, 0 - in gd
            Aug 12 04:07:56 nbp2-oss20 kernel: 
            .....
            

            It has marked 6727 unique groups as bad for dm-21 (ost319).
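
            A count like this can be reproduced from syslog with a short script. The sketch below is an assumption about how one might do it, not the method used here: the regular expression is written against the message format shown in this ticket, and the function name corrupted_groups is hypothetical.

```python
import re
from collections import defaultdict

# Matches the ldiskfs_mb_check_ondisk_bitmap messages shown in this ticket.
# \s* tolerates lines where the group number runs into "corrupted".
PATTERN = re.compile(
    r"LDISKFS-fs (?:warning|error) \(device (?P<dev>[^)]+)\): "
    r"ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group "
    r"(?P<group>\d+)\s*corrupted"
)


def corrupted_groups(lines):
    """Map each device name to the set of group numbers reported as corrupted."""
    groups = defaultdict(set)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            groups[match.group("dev")].add(int(match.group("group")))
    return groups


# Sample lines copied from the excerpts above; in practice, iterate over the syslog file.
SAMPLE = [
    "Aug 12 01:05:43 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd",
    "Aug 12 01:05:43 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information for 275790",
    "Aug 12 01:05:44 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd",
    "Aug 12 04:05:12 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 276684 corrupted: 32768 blocks free in bitmap, 0 - in gd",
    "Aug 12 04:05:12 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 276685 corrupted: 32768 blocks free in bitmap, 0 - in gd",
]

for dev, grps in sorted(corrupted_groups(SAMPLE).items()):
    print(dev, len(grps))
```

            Using a set per device means repeated warnings for the same group, which are frequent in these logs, are counted once.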

             


            yong.fan nasf (Inactive) added a comment

            https://review.whamcloud.com/28489 has been refreshed; please try again. Thanks!

            People

              yong.fan nasf (Inactive)
              mhanafi Mahmoud Hanafi