Details

    • Bug
    • Resolution: Fixed
    • Critical
    • None
    • Lustre 2.1.3
    • None
    • 3
    • 6149

    Description

      Doing an ls gives the following error
      ls: reading directory d4_stats/: Input/output error

      client error:
      [5237686.818045] LustreError: 77522:0:(dir.c:648:ll_readdir()) error reading dir [0x4488b6ced74:0x1edb5:0x0] at 0: rc -5
      [5237686.849844] LustreError: 77522:0:(dir.c:648:ll_readdir()) Skipped 51 previous similar messages

      MDT Error:
      Jan 16 11:18:37 nbp1-mds kernel: Lustre: 15390:0:(mdd_object.c:2412:__mdd_readpage()) build page failed: -5!
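
      Both messages report the same return code. As a minimal sketch, the snippet below pulls the FID and return code out of the client message and maps the code to its errno name. The regular expression is inferred only from the single sample line above, and `decode_readdir_error` is a hypothetical helper name.

```python
import errno
import os
import re

# Pattern for the ll_readdir() message quoted above. The field layout is an
# assumption inferred from that one sample line, not from a log specification.
READDIR_ERR = re.compile(
    r"error reading dir \[(?P<fid>0x[0-9a-f]+:0x[0-9a-f]+:0x[0-9a-f]+)\] "
    r"at (?P<offset>\d+): rc (?P<rc>-?\d+)"
)


def decode_readdir_error(line):
    """Return (fid, errno_name, description) for a matching line, else None."""
    match = READDIR_ERR.search(line)
    if match is None:
        return None
    rc = int(match.group("rc"))
    # Kernel return codes are negated errno values, so rc -5 maps to errno 5.
    name = errno.errorcode.get(-rc, "UNKNOWN")
    return match.group("fid"), name, os.strerror(-rc)


sample = ("[5237686.818045] LustreError: 77522:0:(dir.c:648:ll_readdir()) "
          "error reading dir [0x4488b6ced74:0x1edb5:0x0] at 0: rc -5")
print(decode_readdir_error(sample))
```

      On Linux, errno 5 is EIO, which is the "Input/output error" that ls reports.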

      Please advise on debug flags to use to gather logs.

      Attachments

        1. fsck.2.8.2012.nbp1.out.gz
          1.63 MB
        2. mdtsnap.fsck.out.gz
          1.10 MB
        3. nbp1FSCK.out.gz
          4.56 MB

        Issue Links

          Activity

            [LU-2627] /bin/ls gets Input/output error

            At this point we have been able to run fsck on the mdt and have recovered from the errors.

            mhanafi Mahmoud Hanafi added a comment

            What is your current state? What help can we give you?

            cliffw Cliff White (Inactive) added a comment

            The "dirdata" option is enabled by default for 2.x filesystems, but I don't think it is necessarily advisable to disable it at this time. It does appear at first glance that running e2fsck after removing the dirdata feature would handle this correctly and clear the extra dirdata flag in each dirent, but we haven't tested this at all, and it would also cause the MDS to become considerably slower.

            So far I don't see any indication besides the mixup with ".." entries that there is anything seriously wrong with these directories. The bytes at the start of the directory are used for ".", "..", and the htree index on directories over 4kB in size, and not user data. e2fsck should regenerate all of the needed information from redundant information elsewhere, except being able to move the entry from lost+found back to the proper place in the tree.

            adilger Andreas Dilger added a comment

            It has been a very long time since we last ran e2fsck, and that was with the 1.8.x code. We have never run e2fsck since moving to 2.1.

            Should we remove the dirdata options?

            I will check the date and size of the directories. We may want to just archive these and restore them after the fsck or tar/delete/untar them.

            mhanafi Mahmoud Hanafi added a comment

            Looking at the test e2fsck log, one new directory is getting yet a different error related to the "." entry:

            Directory entry for '.' in /ROOT/msekula/fun/camrad (13208388) is big.
            Split? yes
            Missing '..' in directory inode 13208388.
            Fix? yes
            Setting filetype for entry '..' in /ROOT/msekula/fun/camrad (13208388) to 2.
            Entry '..' in /ROOT/msekula/fun/camrad (13208388) is duplicate '..' entry.
            Fix? yes
            

            I suspect that there is some code in e2fsck or in ldiskfs that is not handling the dirdata field correctly. It likely relates to LU-2638. There are several files moved to lost+found as I suspected, but it looks like the majority of symlinks are fine.

            It doesn't seem that a large number of directories will be repaired, so I think it makes sense to go ahead and fix the real MDT at this point. The only other thing you might want to check before the final run is that a second "e2fsck -fy" pass on the snapshot completes cleanly without any repairs. About 30 directories will be moved to lost+found, but they can be moved back to their correct location, and nothing should be lost.

            The next question is to figure out what has caused this problem. When did you upgrade to 2.1? Did these directories exist before the upgrade from 1.8, or were they created afterward? How large are the directories (number of entries = "find ${directory} -print | wc -l", size of directory = "ls -ld ${directory}")? Do you know if the directories were renamed after they were created? How long has it been since you last ran e2fsck? Have you run it since the upgrade?
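
            If many directories need checking, the two measurements can be collected in one pass. The following is a minimal sketch rather than a tested tool; `directory_stats` is a hypothetical helper name, and it assumes the directories are readable from a mounted client.

```python
import os


def directory_stats(directory):
    """Return (entry_count, dir_size_bytes) for one directory tree.

    entry_count mirrors `find DIR -print | wc -l`: the directory itself plus
    everything beneath it (symlinks are counted, not followed).
    dir_size_bytes is the size column that `ls -ld DIR` would show.
    """
    count = 1  # find prints the starting directory itself
    for _root, dirs, files in os.walk(directory):
        count += len(dirs) + len(files)
    size = os.lstat(directory).st_size
    return count, size
```

            Calling `directory_stats(path)` for each affected path gives the entry count and the on-disk directory size side by side.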

            adilger Andreas Dilger added a comment

            Uploading the output of the fsck run on the snapshot. Please review it before we run fsck on the real MDT device.

            mhanafi Mahmoud Hanafi added a comment

            We did not use the xyratex upgrade tool. But we added that dirdata option at some point. Should we remove that option?

            mhanafi Mahmoud Hanafi added a comment

            I also see in your MDT feature list that there is the "dirdata" feature enabled, but this is definitely NOT a feature that would have been enabled with a filesystem formatted with 1.8. Also, the ".." corruption is definitely not random.

            Did you perhaps run the Xyratex "upgrade" tool on the MDT filesystem?

            I believe that this would be the root cause of the ".." corruption. My understanding is that it was deleting the ".." entry to add the FID, and then re-inserting it into the directory, but ext4/e2fsck require that the ".." entry immediately follow the "." entry at the start.
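
            The ordering requirement can be illustrated with a minimal sketch that walks one raw directory block and checks that the first two in-use entries are "." and "..". It assumes the classic ext2/ext4 linear directory entry header (32-bit inode, 16-bit rec_len, 8-bit name_len, 8-bit file_type, little-endian) and ignores any dirdata payload stored after the name. `entry_names`, `dot_entries_in_place`, and `make_entry` are hypothetical helpers, not e2fsprogs functions.

```python
import struct

HEADER = "<IHBB"  # inode, rec_len, name_len, file_type
HEADER_SIZE = struct.calcsize(HEADER)  # 8 bytes


def entry_names(block):
    """Yield the names of in-use entries in one raw directory block."""
    offset = 0
    while offset + HEADER_SIZE <= len(block):
        inode, rec_len, name_len, _ftype = struct.unpack_from(HEADER, block, offset)
        if rec_len < HEADER_SIZE:
            break  # malformed record; stop rather than loop forever
        if inode != 0:  # inode 0 marks an unused slot
            start = offset + HEADER_SIZE
            yield block[start:start + name_len].decode("latin-1")
        offset += rec_len


def dot_entries_in_place(block):
    """True if '.' and '..' are the first two entries, in that order."""
    return list(entry_names(block))[:2] == [".", ".."]


def make_entry(inode, name):
    """Build one synthetic entry for demonstration, padded to 4 bytes."""
    raw = name.encode()
    rec_len = (HEADER_SIZE + len(raw) + 3) & ~3
    header = struct.pack(HEADER, inode, rec_len, len(raw), 2)
    return header + raw.ljust(rec_len - HEADER_SIZE, b"\0")


good = make_entry(2, ".") + make_entry(2, "..") + make_entry(12, "file")
bad = make_entry(2, ".") + make_entry(12, "file") + make_entry(2, "..")
print(dot_entries_in_place(good), dot_entries_in_place(bad))
```

            The `bad` block mimics a ".." entry that was deleted and re-inserted after an ordinary file entry.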

            adilger Andreas Dilger added a comment

            Looking at the e2fsck code, it appears that it will correctly remove just the EXTENT_FL flag, rather than clear the whole inode:

                            if (extent_fs && (inode->i_flags & EXT4_EXTENTS_FL) &&
                                LINUX_S_ISLNK(inode->i_mode) &&
                                !ext2fs_inode_has_valid_blocks2(fs, inode) &&
                                fix_problem(ctx, PR_1_FAST_SYMLINK_EXTENT_FL, &pctx)) {
                                    inode->i_flags &= ~EXT4_EXTENTS_FL;
                                    e2fsck_write_inode(ctx, ino, inode, "pass1");
                            }
            

            so the only confusion is that the PR_1_FAST_SYMLINK_EXTENT_FL problem code is asking "Clear", which might be confusing to some (including myself) as asking whether the inode should be cleared instead of the flag being cleared. I will submit a patch to fix this.
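
            To make the distinction concrete, here is a minimal Python sketch of the same bit operation. It mirrors only the flag test and the mask; the `extent_fs` check and the `fix_problem()` prompt are left out, and `fix_fast_symlink` is a hypothetical name. The flag value 0x00080000 is the EXT4_EXTENTS_FL value from the ext4 on-disk format.

```python
import stat

EXT4_EXTENTS_FL = 0x00080000  # "inode uses extents" flag bit


def fix_fast_symlink(i_mode, i_flags, has_valid_blocks):
    """Return i_flags with EXT4_EXTENTS_FL cleared for a fast symlink.

    A fast symlink stores its target inside the inode and owns no data
    blocks, so the extents flag is meaningless there. Only that one bit is
    cleared; every other flag, and the inode itself, is left alone.
    """
    is_fast_symlink = stat.S_ISLNK(i_mode) and not has_valid_blocks
    if is_fast_symlink and (i_flags & EXT4_EXTENTS_FL):
        return i_flags & ~EXT4_EXTENTS_FL
    return i_flags
```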

            The later errors:

            Symlink /ROOT/pheimbac/ecco/2013-01-seaice-adjoint/MITgcm_latest/mysetups/arctic210x192x50/build_forw/timeave_cumulate.F (inode #68169598) is invalid.
            Clear? no
            Symlink /ROOT/pheimbac/ecco/2013-01-seaice-adjoint/MITgcm_latest/mysetups/arctic210x192x50/build_forw/cal_compdates.F (inode #68169136) is invalid.
            Clear? no
            

            should not be hit if the earlier checks had been allowed to clear the EXT4_EXTENTS_FL flag from the short symlinks.

            There are some further errors, much later in the log. There are ~20 of the following errors in Pass 2:

            Pass 2: Checking directory structure
            Second entry 'IE_t040101_000000.log' (inode=18364943) in directory inode 18373085 should be '..'
            Fix? no
            Entry '..' in /ROOT/xjia/Saturn/run_IdealizedSW_notilt_1e275_newgrid2_highorder/RESULTS/run_all/IE (18373085) is duplicate '..' entry.
            Fix? no
            Entry '..' in /ROOT/xjia/Saturn/run_IdealizedSW_notilt_1e275_newgrid2_highorder/RESULTS/run_all/IE (18373085) is duplicate '..' entry.
            Fix? no
            Entry '..' in /ROOT/xjia/Saturn/run_IdealizedSW_notilt_1e275_newgrid2_highorder/RESULTS/run_all/IE (18373085) is a link to directory /ROOT/xjia/Saturn/run_IdealizedSW_notilt_1e275_newgrid2_highorder/RESULTS/run_all (13221653).
            Clear? no
            

            that appear a bit unusual, but are not fatally broken. There are ~20 matching errors for the unfixed ".." entries later in Pass 3:

            Pass 3: Checking directory connectivity
            '..' in /ROOT/xjia/Saturn/run_IdealizedSW_notilt_1e275_newgrid2_highorder/RESULTS/run_all/IE (18373085) is <The NULL inode> (0), should be /ROOT/xjia/Saturn/run_IdealizedSW_notilt_1e275_newgrid2_highorder/RESULTS/run_all (13221653).
            Fix? no
            

            and a few minor errors in Pass 3A:

            Pass 3A: Optimizing directories
            Duplicate entry 'c_t_f.x' in /ROOT/aiannett/NCC/Testing/Back-Face-Step (77623393) found.  Clear? no
            Entry 'c_t_f.x' in /ROOT/aiannett/NCC/Testing/Back-Face-Step (77623393) has a non-unique filename.
            Rename to c_t_f.~0? no
            Duplicate entry 'b1b2b3.x' in /ROOT/aiannett/NCC/Testing/Back-Face-Step (77623393) found.  Clear? no
            Entry 'b1b2b3.x' in /ROOT/aiannett/NCC/Testing/Back-Face-Step (77623393) has a non-unique filename.
            Rename to b1b2b3~0? no
            

            It appears that the entries that would be "fixed" in Pass 2 will likely appear in lost+found once they are fixed, and if you want to recover those files you could mount the MDT locally with mount -t ldiskfs and rename them from .../lost+found/#inode to the path given for each inode number.

            I think you could go ahead with running e2fsck -fy on the snapshot, mount the snapshot MDT filesystem locally as ldiskfs to verify a handful of the symlinks are still intact, and check lost+found for the ~20 or so inodes that would need to be fixed (you could even write a short script to rename them if downtime is critical). If that works OK, then when you take the real MDT filesystem offline for repair, please make another snapshot at that time, run the e2fsck -fy on the real MDT, mount as ldiskfs and repair the files in lost+found before unmounting and remounting it again as lustre.
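
            A short rename script could start from a dry-run plan like the sketch below. It makes several assumptions that should be checked against the real lost+found contents before anything is renamed: that each directory named in a Pass 3 "'..' in ... should be" message is reconnected as lost+found/#<inode>, that the logged paths are relative to the root of the locally mounted MDT, and that the mount point (/mnt/mdt here) is whatever was used for the ldiskfs mount. `rename_plan` is a hypothetical helper, and it only builds a list; it does not touch the filesystem.

```python
import re

# Matches Pass 3 lines such as:
#   '..' in /ROOT/a/IE (18373085) is <The NULL inode> (0), should be /ROOT/a (13221653).
PASS3_DOTDOT = re.compile(
    r"^'\.\.' in (?P<path>/.+) \((?P<ino>\d+)\) is .*, should be "
)


def rename_plan(log_lines, mountpoint):
    """Return [(source_in_lost_found, original_path), ...] from e2fsck output."""
    plan = []
    for line in log_lines:
        match = PASS3_DOTDOT.match(line.strip())
        if match:
            source = f"{mountpoint}/lost+found/#{match.group('ino')}"
            target = mountpoint + match.group("path")
            plan.append((source, target))
    return plan


example = ["'..' in /ROOT/a/IE (18373085) is <The NULL inode> (0), "
           "should be /ROOT/a (13221653).",
           "Fix? no"]
for source, target in rename_plan(example, "/mnt/mdt"):
    print(f"would rename {source} -> {target}")
```

            Reviewing the printed plan first, and only then performing the renames, keeps the repair step reversible.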

            In order to get the number of messages in the e2fsck log to a manageable number, I filtered out all of the duplicate messages:

            egrep -v "^$|^Fast symlink .* EXTENT_FL|^Inode .* missing NUL terminator|^Clear" e2fsck.log > e2fsck-filtered.log
            

            I had also filtered out "^Symlink.*is invalid" messages, but I don't think you should hit them during the repairing e2fsck run.
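
            The same filtering can be done in a script when further processing is needed. This is a minimal sketch that reuses the egrep pattern shown above unchanged; `filter_log` is a hypothetical helper, and it assumes log lines start at column 0, as they do in the raw e2fsck output.

```python
import re

# Same alternation as the egrep -v command: blank lines, the two repeated
# Pass 1 symlink messages, and stand-alone "Clear? ..." answer lines.
NOISE = re.compile(
    r"^$|^Fast symlink .* EXTENT_FL|^Inode .* missing NUL terminator|^Clear"
)


def filter_log(lines):
    """Return the lines that egrep -v with the pattern above would keep."""
    return [line for line in lines if not NOISE.search(line.rstrip("\n"))]
```

            Passing an open file object, for example `filter_log(open("e2fsck.log"))`, works because files iterate line by line.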

            adilger Andreas Dilger added a comment

            This was a 1.8.x filesystem that was upgraded. So I think the extent option is leftover from the 1.8.x format.

            mhanafi Mahmoud Hanafi added a comment

            Bobijam, I think that the problem is with e2fsck rejecting short symlinks that have EXT4_EXTENTS_FL set. It appears that the LU-1540 NUL termination problem would be fixed correctly by the current e2fsck. The EXT4_EXTENTS_FL problem appears to be a bug in the osd-ldiskfs code when "extents" is enabled, for which I've filed LU-2634. Since we never format the MDT with "extents", we have never seen such a problem in our testing.

            Inode 9482890 symlink missing NUL terminator.  Fix? no
            Inode 9482897 symlink missing NUL terminator.  Fix? no
            Fast symlink 9482914 has EXTENT_FL set.  Clear? no
            Fast symlink 9482917 has EXTENT_FL set.  Clear? no
            Fast symlink 9482921 has EXTENT_FL set.  Clear? no
            

            It makes sense to change e2fsck to accept such inodes and just clear the EXT4_EXTENTS_FL instead of considering it corrupted. That will allow recovering the filesystem without the need to restore the symlinks (which would just get EXT4_EXTENTS_FL set again, until LU-2634 is fixed).
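
            To see how many symlinks fall into each category, the Pass 1 messages can be tallied. This is a minimal sketch whose patterns are taken only from the sample lines above; the category labels and `tally_symlink_problems` are hypothetical names.

```python
import re
from collections import Counter

PROBLEMS = {
    "missing_nul": re.compile(r"^Inode (\d+) symlink missing NUL terminator"),
    "extent_fl": re.compile(r"^Fast symlink (\d+) has EXTENT_FL set"),
}


def tally_symlink_problems(lines):
    """Count e2fsck Pass 1 symlink messages by problem category."""
    counts = Counter()
    for line in lines:
        text = line.strip()
        for label, pattern in PROBLEMS.items():
            if pattern.match(text):
                counts[label] += 1
    return counts


sample_lines = [
    "Inode 9482890 symlink missing NUL terminator.  Fix? no",
    "Inode 9482897 symlink missing NUL terminator.  Fix? no",
    "Fast symlink 9482914 has EXTENT_FL set.  Clear? no",
    "Fast symlink 9482917 has EXTENT_FL set.  Clear? no",
    "Fast symlink 9482921 has EXTENT_FL set.  Clear? no",
]
print(tally_symlink_problems(sample_lines))
```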

            adilger Andreas Dilger added a comment

            People

              cliffw Cliff White (Inactive)
              mhanafi Mahmoud Hanafi