[LU-2627] /bin/ls gets Input/output error Created: 16/Jan/13 Updated: 03/Jul/13 Resolved: 21/Mar/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | Cliff White (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 6149 |
| Description |
|
Doing an ls gives the following error client error: MDT Error: Please advise on debug flags to use to gather logs. |
| Comments |
| Comment by Mahmoud Hanafi [ 16/Jan/13 ] |
|
MDS has logged these messages |
| Comment by Mahmoud Hanafi [ 16/Jan/13 ] |
|
Can we unmount and just run e2fsck on the MDT device? |
| Comment by Cliff White (Inactive) [ 16/Jan/13 ] |
|
Yes, you should umount and fsck the MDT. You do not have to umount clients; however, clients may block while the MDT is down. |
| Comment by Mahmoud Hanafi [ 16/Jan/13 ] |
|
Can you please provide the exact options to use for the fsck command? |
| Comment by Cliff White (Inactive) [ 16/Jan/13 ] |
|
First, check all your logs and see if you are having hardware failures. Is there any error logging in your disk hardware? The device under dm-2 may have an issue.
This is a read-only pass, and should give you an idea of what is going on.
|
| Comment by Mahmoud Hanafi [ 16/Jan/13 ] |
|
Read only pass has lots of errors like this |
| Comment by Cliff White (Inactive) [ 16/Jan/13 ] |
|
Can you post the full output? |
| Comment by Cliff White (Inactive) [ 16/Jan/13 ] |
|
And are you using e2fsprogs from Whamcloud? Please indicate the version of e2fsprogs you have installed. |
| Comment by Cliff White (Inactive) [ 16/Jan/13 ] |
|
Also, is there any indication of a hardware issue with the disk? |
| Comment by Mahmoud Hanafi [ 16/Jan/13 ] |
|
Long list of this:
nbp1-MDT0000 has been mounted 110 times without being checked, check forced.
Error while reading over extent tree in inode 8503011: Corrupt extent header
Error while reading over extent tree in inode 8503034: Corrupt extent header
Error while reading over extent tree in inode 8503327: Corrupt extent header
Error while reading over extent tree in inode 8503340: Corrupt extent header
Error while reading over extent tree in inode 8503345: Corrupt extent header
Error while reading over extent tree in inode 8503781: Corrupt extent header
Error while reading over extent tree in inode 8503785: Corrupt extent header
Error while reading over extent tree in inode 8503787: Corrupt extent header
Error while reading over extent tree in inode 8503801: Corrupt extent header
Error while reading over extent tree in inode 8503805: Corrupt extent header
Error while reading over extent tree in inode 8503808: Corrupt extent header
Error while reading over extent tree in inode 8503810: Corrupt extent header
Error while reading over extent tree in inode 8503956: Corrupt extent header
Error while reading over extent tree in inode 8503961: Corrupt extent header
Error while reading over extent tree in inode 8504005: Corrupt extent header
Error while reading over extent tree in inode 8510949: Corrupt extent header
Error while reading over extent tree in inode 8541695: Corrupt extent header
I just stopped it for now. |
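A long run of messages like the one above is easier to reason about as a list of inode numbers. The following is a minimal sketch, not part of the original thread: it assumes the read-only e2fsck output was saved as text, and the function name `corrupt_extent_inodes` is hypothetical.

```python
import re

# Matches e2fsck lines of the form quoted in the comment above, e.g.
#   Error while reading over extent tree in inode 8503011: Corrupt extent header
CORRUPT_EXTENT_RE = re.compile(
    r"Error while reading over extent tree in inode (\d+): Corrupt extent header"
)


def corrupt_extent_inodes(log_text):
    """Return the sorted, de-duplicated inode numbers reported in log_text."""
    return sorted({int(m.group(1)) for m in CORRUPT_EXTENT_RE.finditer(log_text)})
```

The resulting list can be compared between runs (for example before and after upgrading e2fsprogs) to see whether the same inodes are still reported.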
| Comment by Mahmoud Hanafi [ 16/Jan/13 ] |
|
FYI: this is a 1.8 filesystem upgraded to 2.x. Installed e2fsprogs: e2fsprogs-1.41.90.wc4-7.el6.x86_64 |
| Comment by Cliff White (Inactive) [ 16/Jan/13 ] |
|
That is rather bad. You need to verify that your disk hardware is healthy; you may be seeing a disk failure. Do you have a backup? |
| Comment by Cliff White (Inactive) [ 16/Jan/13 ] |
|
Can you give us your kernel version and the versions of all Lustre RPMs? Do you compile your own Lustre? |
| Comment by Mahmoud Hanafi [ 16/Jan/13 ] |
|
Hardware is healthy. We don't have backups. But I am able to remount the mdt. |
| Comment by Mahmoud Hanafi [ 16/Jan/13 ] |
|
Here is the list of server Lustre RPMs.
Linux nbp1-mds 2.6.32-279.2.1.el6.20120824.x86_64.lustre213 #1 SMP Mon Aug 27 15:02:12 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
Our source tree is out on GitHub. |
| Comment by Mahmoud Hanafi [ 16/Jan/13 ] |
|
Should I remount the MDT for now so the clients can recover, or hold off? |
| Comment by Cliff White (Inactive) [ 16/Jan/13 ] |
|
The only actual errors are the LDISKFS-fs warnings. The rest are mostly standard restart. (errors should start with LustreError) |
| Comment by Mahmoud Hanafi [ 16/Jan/13 ] |
|
Should we upgrade our e2fsprogs and try again? The hardware is definitely healthy. |
| Comment by Jay Lan (Inactive) [ 16/Jan/13 ] |
|
The git source for the server can be found at |
| Comment by Mahmoud Hanafi [ 16/Jan/13 ] |
|
Do you have any documentation on identifying and restoring the affected files? |
| Comment by Cliff White (Inactive) [ 16/Jan/13 ] |
|
Yes, you should upgrade to the latest e2fsprogs from http://downloads.whamcloud.com/public/e2fsprogs/ |
| Comment by Mahmoud Hanafi [ 16/Jan/13 ] |
|
After the new e2fsck, most of the errors are:
Fast symlink 21070423 has EXTENT_FL set. Clear? no
and
Inode 11172447 symlink missing NUL terminator. Fix? no
Is that good or bad? |
| Comment by Cliff White (Inactive) [ 16/Jan/13 ] |
|
Not horrible. We think you may lose some symlinks. I would go ahead and say 'y' |
| Comment by Mahmoud Hanafi [ 16/Jan/13 ] |
|
Hmmm, e2fsck -vn hit a SIGSEGV once it started checking directory structures. nbp1-mds ~/newrpms # e2fsck |
| Comment by Andreas Dilger [ 16/Jan/13 ] |
|
How did the extents feature get enabled on the MDT filesystem? This is not a standard formatting option, and not something that we test locally. It is likely the root cause of the problems that you are seeing. |
| Comment by Mahmoud Hanafi [ 16/Jan/13 ] |
|
Here is what we have. Should we remove the "extent" option? |
| Comment by Cliff White (Inactive) [ 16/Jan/13 ] |
|
Can you give us the output of 'tune2fs -l <device>' and 'tunefs.lustre --print <device>'? |
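Since the discussion turns on whether features such as "extent" and "dirdata" are enabled on the MDT, the feature list in the `tune2fs -l` output is the line to look at. A minimal sketch, assuming the output has been captured as text; the function name `filesystem_features` is hypothetical, and the sample feature list in the usage note is illustrative rather than taken from the ticket.

```python
def filesystem_features(tune2fs_output):
    """Return the set of feature names listed by `tune2fs -l`.

    Looks for the "Filesystem features:" line and splits the names on
    whitespace. Returns an empty set if the line is not present.
    """
    for line in tune2fs_output.splitlines():
        if line.startswith("Filesystem features:"):
            return set(line.split(":", 1)[1].split())
    return set()
```

For example, `"extent" in filesystem_features(text)` answers the question raised above about how the MDT was formatted, without scanning the whole superblock dump by eye.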
| Comment by Mahmoud Hanafi [ 16/Jan/13 ] |
|
nbp1-mds ~/newrpms # tune2fs Read previous values: Permanent disk data: Writing CONFIGS/mountdata |
| Comment by Cliff White (Inactive) [ 16/Jan/13 ] |
|
Sorry, looks like our replies crossed. I see you already supplied the tune2fs output. |
| Comment by Andreas Dilger [ 16/Jan/13 ] |
|
Can you please run e2fsck under gdb with "-n" option and paste the resulting stack trace here? I can't see enough of where the problem is above. If you have a spare SATA disk I would recommend making a full backup of the MDT device with "dd", since this would go relatively quickly (maybe at 100MB/s, so a few hours for the full backup). This may be important in case running the real e2fsck doesn't go well (depending on what corruption is being seen). |
| Comment by Mahmoud Hanafi [ 16/Jan/13 ] |
|
Program received signal SIGSEGV, Segmentation fault. |
| Comment by Mahmoud Hanafi [ 16/Jan/13 ] |
|
dd is going to take 20 hours! I have created a snapshot of the volume so we can run the fsck on it. |
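The difference between "a few hours" and 20 hours comes down to device size and sustained throughput, neither of which is stated in the ticket. As a rough sketch of the arithmetic (the function name `copy_hours` and the numbers in the note below are illustrative assumptions):

```python
def copy_hours(size_bytes, rate_mb_per_s):
    """Estimate hours needed to copy size_bytes at a sustained rate.

    rate_mb_per_s is in megabytes per second, with 1 MB = 10**6 bytes.
    """
    if rate_mb_per_s <= 0:
        raise ValueError("rate_mb_per_s must be positive")
    seconds = size_bytes / (rate_mb_per_s * 1_000_000)
    return seconds / 3600
```

At the 100 MB/s figure suggested earlier in the thread, a 20-hour copy would correspond to roughly 7.2 TB; a slower target disk would reach 20 hours with a smaller device.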
| Comment by Zhenyu Xu [ 16/Jan/13 ] |
|
The latest e2fsck has a glitch, and I uploaded a patch for it (http://review.whamcloud.com/5045). Commit message:
LU-2627 e2fsck: check_symlink() SIGSEGV
Since e2fsck_pass1_check_symlink() -> check_symlink(ctx, NULL, ino, inode, buf), we should use 'ino' instead of 'pctx->ino' in check_symlink().
This is just for the SIGSEGV issue. |
| Comment by Zhenyu Xu [ 16/Jan/13 ] |
|
As Andreas suggested, run the patched e2fsck with "-n" under gdb and paste the resulting stack trace so that we can diagnose what the problem could be. Running e2fsck with '-n' won't change the disk device. |
| Comment by Mahmoud Hanafi [ 17/Jan/13 ] |
|
Looks like the patch got us past the SIGSEGV. But looks like the fsck will remove all symlinks! It is calling out what looks like all symlinks as invalid. |
| Comment by Cliff White (Inactive) [ 17/Jan/13 ] |
|
This is why we urge a backup before you fsck -y. Will the snapshot allow you to restore the symlinks? |
| Comment by Mahmoud Hanafi [ 17/Jan/13 ] |
|
It is not clear to me why it is removing all the symlinks. Is it because of the extent option? How would we restore the symlinks from the dd backup? |
| Comment by Mahmoud Hanafi [ 17/Jan/13 ] |
|
Here is the summary of the test fsck.
nbp1-MDT0000: ********** WARNING: Filesystem still has errors **********
63829418 inodes used (23.78%, out of 268435456)
62047854 regular files
I can upload the full output.
[root@pladmin4:~/mhanafi]$ grep invalid fck.out | wc -l |
| Comment by Cliff White (Inactive) [ 17/Jan/13 ] |
|
We are not certain that the symlinks would be deleted. In a case such as this, it is always desirable to have a backup if possible. |
| Comment by Zhenyu Xu [ 17/Jan/13 ] |
|
Please compress and upload fck.out. I want to check whether the invalid symlinks are the long symlinks that are missing a NUL terminator. For example:
Pass 1: Checking inodes, blocks, and sizes
Inode 121351 symlink missing NUL terminator. Fix? no
... ...
Pass 2: Checking directory structure
Symlink /path/to/long/symlink/file (inode #121351) is invalid. Clear? no
...
If this is the case, the latest e2fsck should be capable of fixing them. (like |
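The check described above, whether each symlink reported as invalid in Pass 2 was already flagged in Pass 1, can be done mechanically on the saved log. This is a minimal sketch assuming the message formats quoted in this ticket; the function name `explain_invalid_symlinks` is hypothetical, and treating the "EXTENT_FL set" message as a Pass 1 flag alongside the NUL terminator message is an assumption based on the other e2fsck output in the thread.

```python
import re

# Pass 1 messages that flag a symlink inode.
NUL_RE = re.compile(r"Inode (\d+) symlink missing NUL terminator")
FAST_RE = re.compile(r"Fast symlink (\d+) has EXTENT_FL set")
# Pass 2 message that reports a symlink as invalid.
INVALID_RE = re.compile(r"^Symlink .* \(inode #(\d+)\) is invalid", re.MULTILINE)


def explain_invalid_symlinks(log_text):
    """Split Pass 2 'invalid' symlink inodes by whether Pass 1 flagged them.

    Returns (explained, unexplained): inodes reported invalid that were also
    flagged in Pass 1, and inodes reported invalid with no Pass 1 message.
    """
    flagged = {int(n) for n in NUL_RE.findall(log_text)}
    flagged |= {int(n) for n in FAST_RE.findall(log_text)}
    invalid = {int(n) for n in INVALID_RE.findall(log_text)}
    return invalid & flagged, invalid - flagged
```

An empty `unexplained` set would support the idea that the Pass 2 complaints are a consequence of the Pass 1 problems left unfixed by the read-only run.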
| Comment by Mahmoud Hanafi [ 17/Jan/13 ] |
|
file is uploaded |
| Comment by Andreas Dilger [ 17/Jan/13 ] |
|
Filed |
| Comment by Andreas Dilger [ 17/Jan/13 ] |
|
Bobijam, I think that the problem is with e2fsck rejecting short symlinks with the EXT4_EXTENTS_FL set. The
Inode 9482890 symlink missing NUL terminator. Fix? no
Inode 9482897 symlink missing NUL terminator. Fix? no
Fast symlink 9482914 has EXTENT_FL set. Clear? no
Fast symlink 9482917 has EXTENT_FL set. Clear? no
Fast symlink 9482921 has EXTENT_FL set. Clear? no
It makes sense to change e2fsck to accept such inodes and just clear the EXT4_EXTENTS_FL instead of considering it corrupted. That will allow recovering the filesystem without the need to restore the symlinks (which would just get EXT4_EXTENTS_FL set again, until |
| Comment by Mahmoud Hanafi [ 17/Jan/13 ] |
|
This was a 1.8.x filesystem that was upgraded. So I think the extent option is leftover from the 1.8.x format. |
| Comment by Andreas Dilger [ 17/Jan/13 ] |
|
Looking at the e2fsck code, it appears that it will correctly remove just the EXTENT_FL flag, rather than clear the whole inode:
    if (extent_fs && (inode->i_flags & EXT4_EXTENTS_FL) &&
        LINUX_S_ISLNK(inode->i_mode) &&
        !ext2fs_inode_has_valid_blocks2(fs, inode) &&
        fix_problem(ctx, PR_1_FAST_SYMLINK_EXTENT_FL, &pctx)) {
            inode->i_flags &= ~EXT4_EXTENTS_FL;
            e2fsck_write_inode(ctx, ino, inode, "pass1");
    }
so the only confusion is that the PR_1_FAST_SYMLINK_EXTENT_FL problem code is asking "Clear", which might be confusing to some (including myself) as asking whether the inode should be cleared instead of the flag being cleared. I will submit a patch to fix this.
The later errors:
Symlink /ROOT/pheimbac/ecco/2013-01-seaice-adjoint/MITgcm_latest/mysetups/arctic210x192x50/build_forw/timeave_cumulate.F (inode #68169598) is invalid. Clear? no
Symlink /ROOT/pheimbac/ecco/2013-01-seaice-adjoint/MITgcm_latest/mysetups/arctic210x192x50/build_forw/cal_compdates.F (inode #68169136) is invalid. Clear? no
should not be hit if the earlier checks to clear EXT4_EXTENTS_FL had been allowed to clear this flag from the short symlinks.
There are some further errors, much later in the log. There are ~20 of the following errors in Pass 2:
Pass 2: Checking directory structure
Second entry 'IE_t040101_000000.log' (inode=18364943) in directory inode 18373085 should be '..' Fix? no
Entry '..' in /ROOT/xjia/Saturn/run_IdealizedSW_notilt_1e275_newgrid2_highorder/RESULTS/run_all/IE (18373085) is duplicate '..' entry. Fix? no
Entry '..' in /ROOT/xjia/Saturn/run_IdealizedSW_notilt_1e275_newgrid2_highorder/RESULTS/run_all/IE (18373085) is duplicate '..' entry. Fix? no
Entry '..' in /ROOT/xjia/Saturn/run_IdealizedSW_notilt_1e275_newgrid2_highorder/RESULTS/run_all/IE (18373085) is a link to directory /ROOT/xjia/Saturn/run_IdealizedSW_notilt_1e275_newgrid2_highorder/RESULTS/run_all (13221653). Clear? no
that appear a bit unusual, but are not fatally broken.
There are ~20 matching errors for the unfixed ".." entries later in Pass 3:
Pass 3: Checking directory connectivity
'..' in /ROOT/xjia/Saturn/run_IdealizedSW_notilt_1e275_newgrid2_highorder/RESULTS/run_all/IE (18373085) is <The NULL inode> (0), should be /ROOT/xjia/Saturn/run_IdealizedSW_notilt_1e275_newgrid2_highorder/RESULTS/run_all (13221653). Fix? no
and a few minor errors in Pass 3A:
Pass 3A: Optimizing directories
Duplicate entry 'c_t_f.x' in /ROOT/aiannett/NCC/Testing/Back-Face-Step (77623393) found. Clear? no
Entry 'c_t_f.x' in /ROOT/aiannett/NCC/Testing/Back-Face-Step (77623393) has a non-unique filename. Rename to c_t_f.~0? no
Duplicate entry 'b1b2b3.x' in /ROOT/aiannett/NCC/Testing/Back-Face-Step (77623393) found. Clear? no
Entry 'b1b2b3.x' in /ROOT/aiannett/NCC/Testing/Back-Face-Step (77623393) has a non-unique filename. Rename to b1b2b3~0? no
It appears that the entries that would be "fixed" in Pass 2 will likely appear in lost+found once they are fixed, and if you want to recover those files you could mount the MDT locally with mount -t ldiskfs and rename them from .../lost+found/#inode to the path given for each inode number.
I think you could go ahead with running e2fsck -fy on the snapshot, mount the snapshot MDT filesystem locally as ldiskfs to verify that a handful of the symlinks are still intact, and check lost+found for the ~20 or so inodes that would need to be fixed (you could even write a short script to rename them if downtime is critical). If that works OK, then when you take the real MDT filesystem offline for repair, please make another snapshot at that time, run e2fsck -fy on the real MDT, mount it as ldiskfs, and repair the files in lost+found before unmounting and remounting it again as lustre.
In order to get the number of messages in the e2fsck log down to a manageable number, I filtered out all of the duplicate messages:
egrep -v "^$|^Fast symlink .* EXTENT_FL|^Inode .* missing NUL terminator|^Clear" e2fsck.log > e2fsck-filtered.log
I had also filtered out "^Symlink.*is invalid" messages, but I don't think you should hit them during the repairing e2fsck run. |
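The same log filtering can be done in Python when the result feeds further processing. This sketch reuses the egrep pattern from the comment above unchanged; the function name `filter_log` is hypothetical.

```python
import re

# Same alternation as the egrep -v pattern above: blank lines, the two
# repetitive Pass 1 symlink messages, and the bare "Clear? ..." prompts.
NOISE_RE = re.compile(
    r"^$|^Fast symlink .* EXTENT_FL|^Inode .* missing NUL terminator|^Clear"
)


def filter_log(lines):
    """Yield only the lines that `egrep -v` with the pattern above would keep."""
    for line in lines:
        if not NOISE_RE.search(line.rstrip("\n")):
            yield line
```

Because it is a generator, it can be applied directly to an open file object, for example `filter_log(open("e2fsck.log"))`, without loading a large log into memory.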
| Comment by Andreas Dilger [ 17/Jan/13 ] |
|
I also see in your MDT feature list that there is the "dirdata" feature enabled, but this is definitely NOT a feature that would have been enabled with a filesystem formatted with 1.8. Also, the ".." corruption is definitely not random. Did you perhaps run the Xyratex "upgrade" tool on the MDT filesystem? I believe that this would be the root cause of the ".." corruption. My understanding is that it was deleting the ".." entry to add the FID, and then re-inserting it into the directory, but ext4/e2fsck require that the ".." entry immediately follow the "." entry at the start. |
| Comment by Mahmoud Hanafi [ 17/Jan/13 ] |
|
We did not use the Xyratex upgrade tool, but we added the dirdata option at some point. Should we remove that option? |
| Comment by Mahmoud Hanafi [ 17/Jan/13 ] |
|
Uploading the output of the fsck run on the snapshot. Please review before we run it on the real MDT device. |
| Comment by Andreas Dilger [ 17/Jan/13 ] |
|
Looking at the test e2fsck log, one new directory is getting yet a different error related to the "." entry:
Directory entry for '.' in /ROOT/msekula/fun/camrad (13208388) is big. Split? yes
Missing '..' in directory inode 13208388. Fix? yes
Setting filetype for entry '..' in /ROOT/msekula/fun/camrad (13208388) to 2.
Entry '..' in /ROOT/msekula/fun/camrad (13208388) is duplicate '..' entry. Fix? yes
I suspect that there is some code in e2fsck or in ldiskfs that is not handling the dirdata field correctly. It likely relates to
It doesn't seem that a large number of directories will be repaired, so I think it makes sense to go ahead and fix the real MDT at this point. The only other thing you might want to check before doing the final run is that a second "e2fsck -fy" pass on the snapshot completes cleanly without any repairs. About 30 directories will be moved to lost+found, but they can be moved back to their correct location, and nothing should be lost.
The next question is to figure out what has caused this problem. When did you upgrade to 2.1? Were these directories existing before the upgrade from 1.8, or were they created afterward? How large are the directories (number of entries = "find ${directory} -print | wc -l", size of directory = "ls -ld ${directory}")? Do you know if the directories were renamed after they were created? How long has it been since you last ran e2fsck? Have you run it since the upgrade? |
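Moving the directories back out of lost+found amounts to one rename per inode, from `lost+found/#<inode>` to the path e2fsck printed for that inode. The following is a minimal sketch that only builds the `mv` commands for review and executes nothing. The function name `rename_commands`, the mount point `/mnt/mdt`, and the idea of supplying the inode-to-path mapping by hand from the log are assumptions, not something specified in the thread.

```python
import posixpath
import shlex


def rename_commands(inode_to_path, mount_point):
    """Build `mv` commands that move lost+found/#<inode> to its logged path.

    inode_to_path maps inode numbers to the paths e2fsck printed for them
    (for example "/ROOT/some/dir"), taken relative to the root of the MDT
    mounted as ldiskfs at mount_point. Returns a list of shell command
    strings, sorted by inode number, for an operator to review.
    """
    commands = []
    for inode, logged_path in sorted(inode_to_path.items()):
        source = posixpath.join(mount_point, "lost+found", "#%d" % inode)
        target = posixpath.join(mount_point, logged_path.lstrip("/"))
        commands.append("mv -- %s %s" % (shlex.quote(source), shlex.quote(target)))
    return commands
```

Printing the commands first and checking that each source exists and each target does not is a cheap safeguard before running anything against the real MDT.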
| Comment by Mahmoud Hanafi [ 17/Jan/13 ] |
|
It has been a very long time since we last ran e2fsck, and that was with the 1.8.x code. We have never run e2fsck since moving to 2.1. Should we remove the dirdata option? I will check the date and size of the directories. We may want to just archive these and restore them after the fsck, or tar/delete/untar them. |
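The two directory measurements requested earlier in the thread (entry count via `find ... | wc -l` and directory size via `ls -ld`) can be gathered together for a list of affected directories. A minimal sketch, with the hypothetical function name `directory_stats`:

```python
import os


def directory_stats(directory):
    """Return (entry_count, dir_size_bytes) for a directory.

    entry_count mirrors `find DIR -print | wc -l`: the directory itself plus
    every file and subdirectory below it. dir_size_bytes is the size of the
    directory inode itself, the value `ls -ld DIR` shows.
    """
    count = 1  # find prints the starting directory too
    for _root, dirs, files in os.walk(directory):
        count += len(dirs) + len(files)
    return count, os.lstat(directory).st_size
```

The modification time is available from the same `os.lstat` result (`st_mtime`) if the dates are needed as well.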
| Comment by Andreas Dilger [ 18/Jan/13 ] |
|
The "dirdata" option is enabled by default for 2.x filesystems, but I don't think it is necessarily advisable to disable it at this time. It does appear at first glance that running e2fsck after removing the dirdata feature would handle this correctly and clear the extra dirdata flag in each dirent, but we haven't tested this at all, and it would also cause the MDS to become considerably slower. So far I don't see any indication besides the mixup with ".." entries that there is anything seriously wrong with these directories. The bytes at the start of the directory are used for ".", "..", and the htree index on directories over 4kB in size, and not user data. e2fsck should regenerate all of the needed information from redundant information elsewhere, except being able to move the entry from lost+found back to the proper place in the tree. |
| Comment by Cliff White (Inactive) [ 21/Jan/13 ] |
|
What is your current state? What help can we give you? |
| Comment by Mahmoud Hanafi [ 21/Jan/13 ] |
|
At this point we have been able to run fsck on the mdt and have recovered from the errors. |
| Comment by Cliff White (Inactive) [ 21/Jan/13 ] |
|
Is the issue closed, or is there some other help we can give you? |
| Comment by Mahmoud Hanafi [ 08/Feb/13 ] |
|
We seem to have hit this issue again on the same filesystem. pfe1 ~ # ls -l /nobackupp1/xmeng/run_sc_anisopi/run06_dipole_semiimpl_nohyp_taug from the mdt |
| Comment by Andreas Dilger [ 08/Feb/13 ] |
|
This problem will persist for large 1.8 directories that are renamed until a version of the tune2fs -O dirdata /dev/mdtdev though this will have some negative performance impact for all newly-created files when doing name lookups and "ls -l". |
| Comment by Mahmoud Hanafi [ 08/Feb/13 ] |
|
Uploading fsck output for review before we run it for real. |
| Comment by Johann Lombardi (Inactive) [ 12/Feb/13 ] |
|
There is nothing new in the fsck output compared to last time. I think you should go ahead and run fsck. |
| Comment by Peter Jones [ 21/Mar/13 ] |
|
As per NASA ok to close ticket |