Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11922

mkfs.lustre in 1.44.3.wc1 causes corruption if 'metadata_csum' option enabled

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.13.0
    • Lustre 2.12.0, Lustre 2.13.0
    • 3
    • 9223372036854775807

    Description

      # mkfs.lustre --mgs --fsname=lustre /dev/sda 
      # mkfs.lustre --ost --servicenode=172.16.251.20@o2ib --mgsnode=172.16.251.20@o2ib --fsname=lustre --index=0 --reformat --mkfsoptions='-text4' /dev/sdc 
      
      # mount -t lustre /dev/sda /tmp/mgs/
      # mount -t lustre /dev/sdc /tmp/ost0/
      
      Feb  5 06:08:59 ai200-7f94-vm00 kernel: LDISKFS-fs (sdc): file extents enabled, maximum tree depth=5
      Feb  5 06:08:59 ai200-7f94-vm00 kernel: LDISKFS-fs (sdc): mounted filesystem with ordered data mode. Opts: errors=remount-ro
      Feb  5 06:08:59 ai200-7f94-vm00 kernel: LDISKFS-fs (sdc): file extents enabled, maximum tree depth=5
      Feb  5 06:08:59 ai200-7f94-vm00 kernel: LDISKFS-fs (sdc): mounted filesystem with ordered data mode. Opts: ,errors=remount-ro,no_mbcache,nodelalloc
      Feb  5 06:08:59 ai200-7f94-vm00 kernel: LDISKFS-fs error (device sdc): htree_dirblock_to_tree:1278: inode #11272193: block 721437184:
               comm mount.lustre: bad entry in directory: rec_len is too small for name_len - offset=4084(4084), inode=0, rec_len=12, name_len=0
      Feb  5 06:08:59 ai200-7f94-vm00 kernel: Aborting journal on device sdc-8.
      Feb  5 06:08:59 ai200-7f94-vm00 kernel: LDISKFS-fs (sdc): Remounting filesystem read-only
      Feb  5 06:09:04 ai200-7f94-vm00 kernel: LDISKFS-fs warning (device sdc): kmmpd:187: kmmpd being stopped since filesystem has been remounted as readonly.
      

      Without '-t ext4', it was no problem.
      Also, 1.42.13.wc6 didn't cause problem even with '-t ext4'. it seems lustre version doesn't matter. hit problem with 2.10, 2.12 and master.

      Attachments

        Issue Links

          Activity

            [LU-11922] mkfs.lustre in 1.44.3.wc1 causes corruption if 'metadata_csum' option enabled
            pjones Peter Jones added a comment -

            Landed for 2.13

            pjones Peter Jones added a comment - Landed for 2.13

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34219/
            Subject: LU-11922 ldiskfs: make dirdata work with metadata_csum
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ec7a166a498be607c3882ff11e98b625839e69d0

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34219/ Subject: LU-11922 ldiskfs: make dirdata work with metadata_csum Project: fs/lustre-release Branch: master Current Patch Set: Commit: ec7a166a498be607c3882ff11e98b625839e69d0
            dongyang Dongyang Li added a comment -

            We need to do more testing on this, especially performance tests. I've just done some simple mdtest.

            @Mahmoud correct, for now you can removing metadata_csum from /etc/mke2fs.conf.

            dongyang Dongyang Li added a comment - We need to do more testing on this, especially performance tests. I've just done some simple mdtest. @Mahmoud correct, for now you can removing metadata_csum from /etc/mke2fs.conf.

            Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/34219
            Subject: LU-11922 ldiskfs: make dirdata work with metadata_csum
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 466a6c245551fde0b414956e236a0a383a39d3c0

            gerrit Gerrit Updater added a comment - Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/34219 Subject: LU-11922 ldiskfs: make dirdata work with metadata_csum Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 466a6c245551fde0b414956e236a0a383a39d3c0

            Just ran into to this on a newly created filesystem. Lustre2.12 and e2fsprogs-1.44.5.wc1-0.el7.x86_64. Removing the metadata_csum from the /etc/mke2fs.conf the workaround?

            mhanafi Mahmoud Hanafi added a comment - Just ran into to this on a newly created filesystem. Lustre2.12 and e2fsprogs-1.44.5.wc1-0.el7.x86_64. Removing the metadata_csum from the /etc/mke2fs.conf the workaround?
            dongyang Dongyang Li added a comment -

            I made the changes mentioned in the previous comment, the targets can be mounted now.

            however when I started mdtest from the client it ran into problems:

            [439597.196196] LDISKFS-fs error (device vdb): ldiskfs_dx_csum_set:473: inode #524298: comm mdt00_003: dir seems corrupt?  Run e2fsck -D.
            [439597.199074] Aborting journal on device vdb-8.
            [439597.200800] LDISKFS-fs (vdb): Remounting filesystem read-only
            [439597.202036] LDISKFS-fs error (device vdb) in ldiskfs_reserve_inode_write:5313: Journal has aborted
            [439597.203684] LDISKFS-fs error (device vdb) in iam_txn_add:575: Journal has aborted
            [439597.204777] LustreError: 14478:0:(osd_io.c:2008:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
            [439597.205744] LustreError: 14478:0:(osd_io.c:2008:osd_ldiskfs_write_record()) Skipped 1 previous similar message
            [439597.206623] LustreError: 14478:0:(llog_cat.c:576:llog_cat_add_rec()) llog_write_rec -30: lh=ffff8ecfd7315400
            [439597.207585] LustreError: 14478:0:(tgt_lastrcvd.c:1176:tgt_add_reply_data()) testfs-MDT0000: can't update reply_data file: rc = -30
            [439597.209803] LustreError: 14478:0:(osd_handler.c:2008:osd_trans_stop()) testfs-MDT0000: failed in transaction hook: rc = -30
            [439597.211020] LustreError: 14478:0:(osd_handler.c:2018:osd_trans_stop()) testfs-MDT0000: failed to stop transaction: rc = -30
            [439597.211149] LustreError: 14367:0:(osd_handler.c:1708:osd_trans_commit_cb()) transaction @0xffff8ecfdfb52700 commit error: 2
            [439597.211151] LustreError: 14367:0:(osd_handler.c:1708:osd_trans_commit_cb()) Skipped 183 previous similar messages
            

            and mdtest on a ldiskfs mount is showing some different errors:

            [450387.326280] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
            [450485.679052] LDISKFS-fs warning (device loop0): warn_no_space_for_csum:336: no space in directory inode 524290 leaf for checksum.  Please run e2fsck -D.
            [450485.679055] LDISKFS-fs warning (device loop0): warn_no_space_for_csum:336: no space in directory inode 524290 leaf for checksum.  Please run e2fsck -D.
            [450485.679058] LDISKFS-fs error (device loop0): __ldiskfs_read_dirblock:1110: inode #524290: block 508: comm mdtest: Directory index failed checksum
            

            I need to look further into it.

            dongyang Dongyang Li added a comment - I made the changes mentioned in the previous comment, the targets can be mounted now. however when I started mdtest from the client it ran into problems: [439597.196196] LDISKFS-fs error (device vdb): ldiskfs_dx_csum_set:473: inode #524298: comm mdt00_003: dir seems corrupt? Run e2fsck -D. [439597.199074] Aborting journal on device vdb-8. [439597.200800] LDISKFS-fs (vdb): Remounting filesystem read-only [439597.202036] LDISKFS-fs error (device vdb) in ldiskfs_reserve_inode_write:5313: Journal has aborted [439597.203684] LDISKFS-fs error (device vdb) in iam_txn_add:575: Journal has aborted [439597.204777] LustreError: 14478:0:(osd_io.c:2008:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30 [439597.205744] LustreError: 14478:0:(osd_io.c:2008:osd_ldiskfs_write_record()) Skipped 1 previous similar message [439597.206623] LustreError: 14478:0:(llog_cat.c:576:llog_cat_add_rec()) llog_write_rec -30: lh=ffff8ecfd7315400 [439597.207585] LustreError: 14478:0:(tgt_lastrcvd.c:1176:tgt_add_reply_data()) testfs-MDT0000: can't update reply_data file: rc = -30 [439597.209803] LustreError: 14478:0:(osd_handler.c:2008:osd_trans_stop()) testfs-MDT0000: failed in transaction hook: rc = -30 [439597.211020] LustreError: 14478:0:(osd_handler.c:2018:osd_trans_stop()) testfs-MDT0000: failed to stop transaction: rc = -30 [439597.211149] LustreError: 14367:0:(osd_handler.c:1708:osd_trans_commit_cb()) transaction @0xffff8ecfdfb52700 commit error: 2 [439597.211151] LustreError: 14367:0:(osd_handler.c:1708:osd_trans_commit_cb()) Skipped 183 previous similar messages and mdtest on a ldiskfs mount is showing some different errors: [450387.326280] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. Opts: ( null ) [450485.679052] LDISKFS-fs warning (device loop0): warn_no_space_for_csum:336: no space in directory inode 524290 leaf for checksum. Please run e2fsck -D. [450485.679055] LDISKFS-fs warning (device loop0): warn_no_space_for_csum:336: no space in directory inode 524290 leaf for checksum. Please run e2fsck -D. [450485.679058] LDISKFS-fs error (device loop0): __ldiskfs_read_dirblock:1110: inode #524290: block 508: comm mdtest: Directory index failed checksum I need to look further into it.
            dongyang Dongyang Li added a comment -

            I like the last one which allows us to enable metadata_csum on the targets.

            and yes only checking if file_type is EXT4_FT_DIR_CSUM is not enough.

            however within ext4_get_dirent_data_len() we can not figure out if the de is the last in the block, we need the beginning of the de block to determine that.

            we can check other things instead, if it is indeed a ext4_dir_entry_tail, then the inode number should be 0, also the name len. and the rec_len should be sizeof(struct ext4_dir_entry_tail) as well as the file_type should be 0xde. I think this is enough for us to return 0 in ext4_get_dirent_data_len().

            dongyang Dongyang Li added a comment - I like the last one which allows us to enable metadata_csum on the targets. and yes only checking if file_type is EXT4_FT_DIR_CSUM is not enough. however within ext4_get_dirent_data_len() we can not figure out if the de is the last in the block, we need the beginning of the de block to determine that. we can check other things instead, if it is indeed a ext4_dir_entry_tail, then the inode number should be 0, also the name len. and the rec_len should be sizeof(struct ext4_dir_entry_tail) as well as the file_type should be 0xde. I think this is enough for us to return 0 in ext4_get_dirent_data_len().
            adilger Andreas Dilger added a comment - - edited

            I'm able to reproduce this locally. It looks like it is caused by the use of metadata_csum, which is inherited from /etc/mke2fs.conf when "-t ext4" is enabled. Removing metadata_csum from /etc/mke2fs.conf avoids the problem.

            formatting backing filesystem ldiskfs on /dev/loop0
                    target name   testfs:OST0000
                    kilobytes     400000
                    options       -t ext4 -I 512 -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg
                                  -G 256 -E resize="4290772992",lazy_journal_init -F
            mkfs_cmd = mke2fs -j -b 4096 -L testfs:OST0000 -t ext4 -I 512 -q
                       -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E resize="4290772992",lazy_journal_init -F /dev/loop0 400000k
            
            Filesystem volume name:   testfs:OST0000
            Filesystem features:      has_journal ext_attr resize_inode dir_index filetype
                               needs_recovery extent 64bit flex_bg sparse_super large_file
                               huge_file dir_nlink extra_isize quota metadata_csum
            Checksum type:            crc32c
            Checksum:                 0x4ae324f9
            

            The later "-O" features that mkfs.lustre enables to not disable the default features that /etc/mke2fs.conf enables.

            I suspect it is some kind of bad interaction between the dir_data feature (even though it is not enabled on the OST) and the metadata_csum feature. Using metadata_csum stores a checksum in a fake directory entry at the end of each directory block. The dirdata feature tries to reserve enough space in the dirent to store extra FID data, so my guess is that htree_dirblock_to_tree() is trying to get enough space for the dirdata field, even though this feature is disabled on the OST.

            Looking at the code more closely, it definitely seems that this is the case:

            /*      
             * This is a bogus directory entry at the end of each leaf block that
             * records checksums.
             */
            struct ext4_dir_entry_tail {
                    __le32  det_reserved_zero1;     /* Pretend to be unused */
                    __le16  det_rec_len;            /* 12 */
                    __u8    det_reserved_zero2;     /* Zero name length */
                    __u8    det_reserved_ft;        /* 0xDE, fake file type */
                    __le32  det_checksum;           /* crc32c(uuid+inum+dirblock) */
            };
            

            Sadly, it seems that setting det_reserved_ft = 0xde is causing EXT4_DIR_REC_LEN(de)->ext4_get_dirent_data_len() to think there are extra dirdata records in the directory, adding bogus record lengths onto the dirent and triggering this error. This is quite problematic, because the metadata_csum feature is already the default for e2fsprogs-1.44, and the kernel code using det_reserved_ft = 0xde has been in ext4 since commit v3.4-rc5-2-ge61539189606, so there is no way to fix this to work transparently with dirdata.

            I think a few options exist:

            • only check ext4_get_dirent_data_len() if the dirdata feature is enabled, which would probably be smart, but doesn't fix this problem in the future when we actually want metadata_csum enabled. This is also hard to implement without passing the superblock pointer to EXT4_DIR_REC_LEN(), or masking off the high bits of de->file_type in the caller (which should be zero if this feature is disabled, but will break metadata_csum?)
            • check in ext4_get_dirent_data_len() for the case de->file_type == EXT4_FT_DIR_CSUM and return 0 immediately. This would prevent other features from being enabled in this fake dirent, but that in itself is OK... but...
            • There may be a problem at some point in the future if there really is a filetype E and all the dirdata flags D are set (0x80, 0x40, 0x10 = EXT4_DIRENT_LUFID), so a simple check EXT4_FT_DIR_CSUM may get confused, so only skip this filetype if it is the last entry in the directory block.
            adilger Andreas Dilger added a comment - - edited I'm able to reproduce this locally. It looks like it is caused by the use of metadata_csum , which is inherited from /etc/mke2fs.conf when " -t ext4 " is enabled. Removing metadata_csum from /etc/mke2fs.conf avoids the problem. formatting backing filesystem ldiskfs on /dev/loop0 target name testfs:OST0000 kilobytes 400000 options -t ext4 -I 512 -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E resize="4290772992",lazy_journal_init -F mkfs_cmd = mke2fs -j -b 4096 -L testfs:OST0000 -t ext4 -I 512 -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E resize="4290772992",lazy_journal_init -F /dev/loop0 400000k Filesystem volume name: testfs:OST0000 Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize quota metadata_csum Checksum type: crc32c Checksum: 0x4ae324f9 The later "-O" features that mkfs.lustre enables to not disable the default features that /etc/mke2fs.conf enables. I suspect it is some kind of bad interaction between the dir_data feature (even though it is not enabled on the OST) and the metadata_csum feature. Using metadata_csum stores a checksum in a fake directory entry at the end of each directory block. The dirdata feature tries to reserve enough space in the dirent to store extra FID data, so my guess is that htree_dirblock_to_tree() is trying to get enough space for the dirdata field, even though this feature is disabled on the OST. Looking at the code more closely, it definitely seems that this is the case: /* * This is a bogus directory entry at the end of each leaf block that * records checksums. */ struct ext4_dir_entry_tail { __le32 det_reserved_zero1; /* Pretend to be unused */ __le16 det_rec_len; /* 12 */ __u8 det_reserved_zero2; /* Zero name length */ __u8 det_reserved_ft; /* 0xDE, fake file type */ __le32 det_checksum; /* crc32c(uuid+inum+dirblock) */ }; Sadly, it seems that setting det_reserved_ft = 0xde is causing EXT4_DIR_REC_LEN(de)->ext4_get_dirent_data_len() to think there are extra dirdata records in the directory, adding bogus record lengths onto the dirent and triggering this error. This is quite problematic, because the metadata_csum feature is already the default for e2fsprogs-1.44, and the kernel code using det_reserved_ft = 0xde has been in ext4 since commit v3.4-rc5-2-ge61539189606, so there is no way to fix this to work transparently with dirdata . I think a few options exist: only check ext4_get_dirent_data_len() if the dirdata feature is enabled, which would probably be smart, but doesn't fix this problem in the future when we actually want metadata_csum enabled. This is also hard to implement without passing the superblock pointer to EXT4_DIR_REC_LEN() , or masking off the high bits of de->file_type in the caller (which should be zero if this feature is disabled, but will break metadata_csum ?) check in ext4_get_dirent_data_len() for the case de->file_type == EXT4_FT_DIR_CSUM and return 0 immediately. This would prevent other features from being enabled in this fake dirent, but that in itself is OK... but... There may be a problem at some point in the future if there really is a filetype E and all the dirdata flags D are set (0x80, 0x40, 0x10 = EXT4_DIRENT_LUFID), so a simple check EXT4_FT_DIR_CSUM may get confused, so only skip this filetype if it is the last entry in the directory block.

            People

              dongyang Dongyang Li
              sihara Shuichi Ihara
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: