getting "dirdata length set incorrectly" running e2fsck

Details

    • Bug
    • Resolution: Won't Fix
    • Minor
    • Lustre 2.1.2
    • Lustre 2.1.1
    • DDN SFA10k - Dell R710 - TOSS2.0 OS release
    • 3
    • 4619

    Description

      After adding a network to the file system and adding the IP for the failover node to the MDS, it wouldn't mount. (I later found that --param failnode= is no longer valid, much to my chagrin.) I attempted to run fsck against the file system, but it responded that the e2fsprogs was out of date for the file system, so I ran fsck.ldiskfs. The fsck.ldiskfs run found some bad inodes and corrected them, but on a subsequent run with the -n option (done to make sure it was clean) I started seeing a flood of "dirdata length set incorrectly" messages. I stopped it and was able to mount the FS, but later the FS spontaneously unmounted.

      What does this mean? Fortunately this file system is in pre-production and can be recreated (which is intended), but I'd like to know if this was caused by running fsck.ldiskfs, since I did not see these messages on the first pass. The version of e2fsprogs (non-Redhat) is ldiskfsprogs-1.41.90.3chaos.wc3-0.ch5.x86_64. I have downloaded the wc4 version from the WC repo and installed it into a test image, and rebooted the node into that image. I was able to use e2fsck to check the FS and I am using the -fDy options, but the "dirdata length set incorrectly" message continues to stream and has been going for more than an hour.

      Any help would be appreciated.

          Activity

            [LU-1366] getting "dirdata length set incorrectly" running e2fsck

            adilger Andreas Dilger added a comment -

            FEATURE_I8 is "mmp" and FEATURE_I12 is "dirdata". These are not being printed by name because you are using the stock "debugfs" instead of "debugfs.ldiskfs" (or whatever the equivalent is), and the stock tool doesn't know what these features are called. That is expected when using a separate ldiskfsprogs and leaving the stock e2fsprogs installed.

            The "fsck.ldiskfs -fDy" problem will still exist, even without the extents option, unless you apply the patch from http://review.whamcloud.com/2661.
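
For reading output from a stock dumpe2fs, the mapping described in this comment can be applied by hand. The following is only a sketch: `translate_features` is a hypothetical helper, not part of e2fsprogs or ldiskfsprogs, and it knows only the two placeholders named here (FEATURE_I8 as mmp, FEATURE_I12 as dirdata, spelled as in the mkfs.ldiskfs -O list elsewhere in this ticket).

```shell
#!/bin/sh
# Hypothetical helper: rewrite the placeholder names that a stock
# dumpe2fs/debugfs prints for feature bits it has no name for.
# Only the two placeholders discussed in this ticket are handled.
translate_features() {
    out=""
    for f in $1; do
        case "$f" in
            FEATURE_I8)  f="mmp" ;;      # incompat feature bit 8
            FEATURE_I12) f="dirdata" ;;  # incompat feature bit 12
        esac
        out="${out:+$out }$f"
    done
    printf '%s\n' "$out"
}

# Feature list as printed by the stock dumpe2fs in this ticket.
translate_features "has_journal ext_attr resize_inode dir_index filetype FEATURE_I8 flex_bg FEATURE_I12 sparse_super large_file huge_file uninit_bg dir_nlink extra_isize"
```

Using the ldiskfs-aware tools (dumpe2fs from ldiskfsprogs) avoids the need for any translation.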
            jamervi Joe Mervini added a comment -

            To be thorough I created the rest of the file system after reformatting the MDT and reran the symlink test. LLNL's fsck.ldiskfs -fy passed without errors.

            jamervi Joe Mervini added a comment -

            I was able to reformat the MDT with mkfsoptions="-O ^extent" with the TOSS bits. The extent feature no longer shows up in the dumpe2fs feature list, but there are FEATURE_I8 and FEATURE_I12 entries that I haven't found a reference for:

            Filesystem features: has_journal ext_attr resize_inode dir_index filetype FEATURE_I8 flex_bg FEATURE_I12 sparse_super large_file huge_file uninit_bg dir_nlink extra_isize

            So is it your opinion that we should start from scratch again while we have the chance?


            adilger Andreas Dilger added a comment -

            In the ldiskfs code (but not in upstream ext4) it is possible to mount a filesystem with the "noextents" mount option, so that new files/directories are not created with the extent flag set. This does not affect existing files/directories. There isn't really a mechanism to "migrate" such files to non-extent files without essentially a file-level backup/restore, but that will not currently work for 2.x MDT filesystems due to the Object Index becoming inconsistent. Only block-device level backup/restore is currently functional for 2.x MDT filesystems. The OI Scrub feature is nearing completion and will be landed for the 2.3 release, which will again allow file-level backup/restore for the MDT.

            I think a combination of factors is required here, to avoid this problem for other filesystems:

            • explicitly disable "extents" for MDT filesystems in mkfs_lustre.c (should go into 2.1.x)
            • fix e2fsck so that it does not corrupt extent-mapped symlinks (this may already be fixed in newer e2fsprogs)
            • land the OI Scrub feature for 2.3 (this is likely too much of a "feature" for 2.1.x)
            jamervi Joe Mervini added a comment -

            All this being said, is there a way to back out the extent feature without reformatting the file system? As before I would prefer to deal with the pain now as opposed to down the road when there's a petabyte of data with no place to move it.


            adilger Andreas Dilger added a comment -

            Thinking about this further, I think I understand the root cause. The standard mkfs_lustre.c will call "mke2fs {lots of options}", which starts with an ext2 filesystem and enables the individual features needed to make the filesystem ext4. For the MDT filesystem, it does not turn on the "extents" feature, but it does for the OST.

            In the TOSS ldiskfsprogs, I suspect that "mkfs.ldiskfs" starts with an "ext4" filesystem, and (re)sets the same options, but for the MDT it already has "extents" enabled.

            It makes sense to explicitly disable the extents feature in mkfs_lustre.c for MDT filesystems, since extents provide absolutely no benefit there, and may instead be hurting performance. That is a simple matter of appending ",^extents" to the list of MDT features.
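
The proposed change lives in mkfs_lustre.c, which is C; the shell sketch below only illustrates the string handling involved. `disable_extents` is a hypothetical name, and the starting feature list is the one mkfs.lustre printed in verbose mode in this ticket.

```shell
#!/bin/sh
# Hypothetical helper: append ",^extents" to a comma-separated mke2fs
# feature list unless the list already disables the feature.
disable_extents() {
    case ",$1," in
        *",^extent,"*|*",^extents,"*) printf '%s\n' "$1" ;;
        *) printf '%s\n' "$1,^extents" ;;
    esac
}

# MDT feature list taken from the verbose mkfs.lustre output in this ticket.
mdt_features="dirdata,uninit_bg,mmp,dir_nlink,huge_file,flex_bg"
printf '%s\n' "-O $(disable_extents "$mdt_features")"
```

Until such a change lands, the same effect was obtained in this ticket by passing mkfsoptions="-O ^extent" when reformatting the MDT.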
            jamervi Joe Mervini added a comment -

            Andreas - good call. I just checked the file system that is running with TOSS and here are the features described for the MDT:

            Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent FEATURE_I8 flex_bg FEATURE_I12 sparse_super large_file huge_file uninit_bg dir_nlink extra_isize

            On the RHEL-6.1 created MDT these are the features:
            Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery mmp flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink

            I did not intentionally set extents. So as a sanity check I re-created the MDT on the test machine. Even though it is not included on the command line and does not appear in the options that mkfs.lustre prints in verbose mode, the subsequent dumpe2fs definitely shows it as being there.

            [root@cmds1 ~]# mkfs.lustre --mgs --mdt --reformat --verbose --fsname=scratch2 --failnode=10.196.135.143@o2ib1 /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000

            Permanent disk data:
            Target: scratch2-MDTffff
            Index: unassigned
            Lustre FS: scratch2
            Mount type: ldiskfs
            Flags: 0x75
            (MDT MGS needs_index first_time update )
            Persistent mount opts: user_xattr,errors=remount-ro
            Parameters: failover.node=10.196.135.143@o2ib1

            device size = 2858688MB
            formatting backing filesystem ldiskfs on /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000
            target name scratch2-MDTffff
            4k blocks 731824160
            options -J size=400 -I 512 -i 2048 -O dirdata,uninit_bg,mmp,dir_nlink,huge_file,flex_bg -E lazy_journal_init -F
            mkfs_cmd = mkfs.ldiskfs -j -b 4096 -L scratch2-MDTffff -J size=400 -I 512 -i 2048 -O dirdata,uninit_bg,mmp,dir_nlink,huge_file,flex_bg -E lazy_journal_init -F /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000 731824160
            cmd: mkfs.ldiskfs -j -b 4096 -L scratch2-MDTffff -J size=400 -I 512 -i 2048 -O dirdata,uninit_bg,mmp,dir_nlink,huge_file,flex_bg -E lazy_journal_init -F /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000 731824160
            mkfs.ldiskfs 1.41.90.3chaos.wc3 (28-May-2011)
            Discarding device blocks: failed - Operation not supported
            Filesystem label=scratch2-MDTffff
            OS type: Linux
            Block size=4096 (log=2)
            Fragment size=4096 (log=2)
            Stride=0 blocks, Stripe width=0 blocks
            1463654128 inodes, 731824160 blocks
            36591208 blocks (5.00%) reserved for the super user
            First data block=0
            Maximum filesystem blocks=2880079872
            44689 block groups
            16376 blocks per group, 16376 fragments per group
            32752 inodes per group
            Superblock backups stored on blocks:
            16376, 49128, 81880, 114632, 147384, 409400, 442152, 802424, 1326456,
            2047000, 3979368, 5616968, 10235000, 11938104, 35814312, 39318776,
            51175000, 107442936, 255875000, 275231432, 322328808

            Allocating group tables: done
            Writing inode tables: done
            Creating journal (102400 blocks): done
            Multiple mount protection has been enabled with update interval 5 seconds.
            Writing superblocks and filesystem accounting information: done

            This filesystem will be automatically checked every 0 mounts or
            0 days, whichever comes first. Use tunefs.ldiskfs -c or -i to override.
            Writing CONFIGS/mountdata

            [root@cmds1 ~]# dumpe2fs -h /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000
            dumpe2fs 1.41.12 (17-May-2010)
            Filesystem volume name: scratch2-MDTffff
            Last mounted on: /ram/tmp/mntBrCPMe
            Filesystem UUID: 48d57c6d-8156-4eb7-bcf8-a298fc0f7af9
            Filesystem magic number: 0xEF53
            Filesystem revision #: 1 (dynamic)
            Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent FEATURE_I8 flex_bg FEATURE_I12 sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
            Filesystem flags: signed_directory_hash
            Default mount options: user_xattr acl
            Filesystem state: clean
            Errors behavior: Continue
            Filesystem OS type: Linux
            Inode count: 1463654128
            Block count: 731824160
            Reserved block count: 36591208
            Free blocks: 548645355
            Free inodes: 1463654115
            First block: 0
            Block size: 4096
            Fragment size: 4096
            Reserved GDT blocks: 1024
            Blocks per group: 16376
            Fragments per group: 16376
            Inodes per group: 32752
            Inode blocks per group: 4094
            Flex block group size: 16
            Filesystem created: Tue May 15 15:31:37 2012
            Last mount time: Tue May 15 16:02:53 2012
            Last write time: Tue May 15 16:02:55 2012
            Mount count: 1
            Maximum mount count: 20
            Last checked: Tue May 15 15:31:37 2012
            Check interval: 0 (<none>)
            Lifetime writes: 698 GB
            Reserved blocks uid: 0 (user root)
            Reserved blocks gid: 0 (group root)
            First inode: 11
            Inode size: 512
            Required extra isize: 28
            Desired extra isize: 28
            Journal inode: 8
            Default directory hash: half_md4
            Directory Hash Seed: f6500d19-ed37-48f6-a446-e8485a2f9edf
            Journal backup: inode blocks
            Journal features: (none)
            Journal size: 400M
            Journal length: 102400
            Journal sequence: 0x00000005
            Journal start: 0
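
The two feature lists quoted at the top of this comment can be compared mechanically rather than by eye. This is a minimal sketch; `features_only_in` is a hypothetical helper that prints the words of its first list that are missing from its second.

```shell
#!/bin/sh
# Hypothetical helper: print the space-separated words of list $1
# that do not occur in list $2.
features_only_in() {
    result=""
    for f in $1; do
        case " $2 " in
            *" $f "*) ;;                           # present in both lists
            *) result="${result:+$result }$f" ;;   # only in the first list
        esac
    done
    printf '%s\n' "$result"
}

# Feature lists as quoted in this comment (TOSS MDT vs. RHEL-6.1 MDT).
toss="has_journal ext_attr resize_inode dir_index filetype needs_recovery extent FEATURE_I8 flex_bg FEATURE_I12 sparse_super large_file huge_file uninit_bg dir_nlink extra_isize"
rhel="has_journal ext_attr resize_inode dir_index filetype needs_recovery mmp flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink"

features_only_in "$toss" "$rhel"
```

FEATURE_I8 and FEATURE_I12 show up as differences only because the stock dumpe2fs does not know the names mmp and dirdata, so the features that really differ between the two lists are extent and extra_isize.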


            adilger Andreas Dilger added a comment -

            Even if the "extents" feature on the MDT is the root cause, this is still a serious issue in the e2fsck code that needs to be addressed.

            adilger Andreas Dilger added a comment -

            Is it possible that the TOSS version of mkfs.lustre is setting the "extents" feature for the MDT filesystem? For a test 2.x filesystem I have here (current git master and e2fsprogs-1.41.90.wc3-7.fc13.x86_64, but I don't think mkfs_lustre.c has changed recently) there is no "extents" feature enabled on the MDT filesystem:

            # dumpe2fs -h /tmp/lustre-mdt1  | grep feature
            dumpe2fs 1.41.90.wc3 (28-May-2011)
            Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink
            Journal features:         (none)

            Having extents enabled is not useful for the MDT, and may even hurt performance because there is more metadata overhead for each block (it is rare that directory blocks are allocated contiguously on disk).
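
The dumpe2fs check above can be turned into a yes/no test for scripting. The sketch below assumes the "Filesystem features:" line has already been captured (for example with dumpe2fs -h piped through grep, as above); `has_extents` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a "Filesystem features:" line lists the
# extent feature. Whole words are matched, so ext_attr and extra_isize
# do not count.
has_extents() {
    case " ${1#*:} " in
        *" extent "*|*" extents "*) return 0 ;;
    esac
    return 1
}

# Feature line reported for the TOSS-formatted MDT in this ticket.
toss_mdt="Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent FEATURE_I8 flex_bg FEATURE_I12 sparse_super large_file huge_file uninit_bg dir_nlink extra_isize"

if has_extents "$toss_mdt"; then
    echo "extents enabled on MDT"
else
    echo "extents not enabled on MDT"
fi
```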
            jamervi Joe Mervini added a comment -

            A little more data: No dirdata length errors with -fDy option on RHEL-6.1

            [root@cmds1 osc]# e2fsck -fDy /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000
            e2fsck 1.41.90.wc4 (01-Sep-2011)
            Pass 1: Checking inodes, blocks, and sizes
            Pass 2: Checking directory structure
            Pass 3: Checking directory connectivity
            Pass 3A: Optimizing directories
            Pass 4: Checking reference counts
            Pass 5: Checking group summary information

            scratch2-MDT0000: ***** FILE SYSTEM WAS MODIFIED *****
            scratch2-MDT0000: 2910/1463654128 files (0.2% non-contiguous), 183179282/731824160 blocks

            jamervi Joe Mervini added a comment -

            The problem appears to be TOSS2 specific.

            As an experiment I created a RHEL6.1 image and installed the WC release of lustre. Repeating the tests that I performed under TOSS did NOT produce the corrupt extent headers for linked files. I have a big concern that the problem is mkfs.lustre related.

            I will be opening a bugzilla bug with Livermore.


            People

              bobijam Zhenyu Xu
              jamervi Joe Mervini