[LU-1774] fsck -fD corrupts filesystem Created: 20/Aug/12 Updated: 04/Dec/12 Resolved: 04/Dec/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Wojciech Turek (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
lustre-source-2.1.2-2.6.32_220.17.1.el6_lustre.x86_64.x86_64 e2fsprogs-libs-1.42.3.wc1-7.el6.x86_64 2.6.32-220.17.1.el6_lustre.x86_64 |
||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Severity: | 2 | ||||||||||||
| Rank (Obsolete): | 4008 | ||||||||||||
| Description |
|
I have been seeing a large number of messages like the one below on the production /scratch FS:
Aug 17 17:54:52 mds07 mds07 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_dx_add_entry: Directory index full!
The /scratch FS temporarily holds user /home directories until I install new hardware for a separate Lustre /home FS. The area of /scratch that holds the user /home directories is backed up on a daily basis.
Device dm-2 is the MDT for our production scratch FS. The file system has around 160M files at the moment, and from what I found by reading various posts, the LDISKFS message above suggests that we may have some very large directories in our /scratch FS. I decided to run fsck -fD, which supposedly optimizes directory structures and should get rid of the above problem (at least temporarily). Unfortunately this turned out to be a bad idea.
The first pass of fsck found over 3200 invalid symlinks and decided to clear them, for example:
/mnt/backup/home/dws29/sandy/InstallArea/XML/CamMapCut64.pie -> /home/dws29/sandy/Task_pkg/HL2_PowellSnakes/v00-00-020000_CVSHEAD/cmt/../XMLMODULESCHECKED//CamMapCut64.pie
Obviously the whole /scratch FS contains far more than 3K symlinks, so I am puzzled as to what criteria fsck used when it decided to clear these particular symlinks.
I ran a second pass of fsck and then mounted the MDT back. Everything seemed OK until the overnight rsync backup process started to copy files and hit many I/O errors when trying to enter some directories, for example
I can see inside these directories from the MDS by using debugfs, so I am hoping that the data are not completely gone. Again, I am able to recover the directories that are in the backed-up area of scratch, but this is not a lot, and many of the corrupted directories are not backed up. Is there any way to reverse/fix what the -D optimisation did and reconstruct the data?
I am attaching a log from fsck. It may also be worth mentioning that the FS is less than a couple of months old and was created using e2fsprogs-1.42.3.wc1-7.el6.x86_64, which already had some fixes for fsck -D issues. |
| Comments |
| Comment by Wojciech Turek (Inactive) [ 20/Aug/12 ] |
|
Some of the problems I am seeing seem to be related to |
| Comment by Cliff White (Inactive) [ 21/Aug/12 ] |
|
Your MDS logs start on August 12th and at that time the error message is already happening. Is it possible to get logs for the MDS from prior to August 12th? Can you determine when the error first appeared? |
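One way to answer the question of when the warning first appeared is to scan the saved syslog files for the message quoted in the description. The following is a minimal sketch, not a tool from this ticket: the function names `first_index_full` and `count_by_day` are hypothetical, and it assumes the log lines are fed in chronological order (classic syslog timestamps carry no year).

```python
import re
from typing import Dict, Iterable, Optional

# Message text copied from the warning quoted in the description.
PATTERN = re.compile(r"ldiskfs_dx_add_entry: Directory index full!")


def first_index_full(lines: Iterable[str]) -> Optional[str]:
    """Return the first line that contains the warning, or None if absent."""
    for line in lines:
        if PATTERN.search(line):
            return line.rstrip("\n")
    return None


def count_by_day(lines: Iterable[str]) -> Dict[str, int]:
    """Count warnings per syslog date prefix (the first two fields, e.g. 'Aug 17')."""
    counts: Dict[str, int] = {}
    for line in lines:
        if PATTERN.search(line):
            day = " ".join(line.split()[:2])
            counts[day] = counts.get(day, 0) + 1
    return counts
```

To use it, open each saved syslog file (oldest first) and pass the file object to `first_index_full`; the file names depend on the local syslog setup and are not given in this ticket.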
| Comment by Wojciech Turek (Inactive) [ 21/Aug/12 ] |
|
Lustre syslog messages from 29Jul till 12 Aug |
| Comment by Wojciech Turek (Inactive) [ 21/Aug/12 ] |
|
Hi Cliff, I attached the earlier syslogs. Please note, though, that the corruption occurred after running e2fsck with the -D option on the 17th of August. The situation got much worse today and it stops us from running the /scratch filesystem; see details below.
I decided to run e2fsck on the scratch MDT today to fix symlinks that were missing NUL terminators. I updated e2fsprogs to the latest build, see below
I first ran fsck -fvn to see what would be done, and only symlink problems were reported, so I ran fsck -fvy, which fixed the bad symlinks; nothing else was reported as fixed. Then I mounted the filesystem as normal. Unfortunately the "old" directory corruption (which occurred on 17 Aug) was still there, and new directories were corrupted as well. For example, I have found that a large number of user directories on the /lscratch fs, including my own, are corrupted and I cannot access them any more. The MDS log is also full of scary messages about corruption; see below the logs from a client on which I ran ls on corrupted directories:
MDS log
Aug 21 21:19:11 10.143.245.207 mds07 kernel: Lustre: 12085:0:(mdd_object.c:2412:__mdd_readpage()) build page failed: -5!
There are more entries like that. I am still running ls on the top directories of /lscratch to detect corrupted ones. This is very bad and I hope we can recover them. I am attaching logs from both fsck runs |
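The manual check described above (running ls on each top-level directory and watching for errors) can be scripted. The sketch below is only an illustration under stated assumptions: `find_unreadable_dirs` is a hypothetical helper, it looks one level deep only, and it reports any directory whose listing raises an OSError, not just I/O errors. On Linux the `-5` in the MDS log line is errno 5, which is EIO ("Input/output error"), so affected directories would be expected to show up as `EIO` from a client.

```python
import errno
import os
import stat
from typing import List, Tuple


def find_unreadable_dirs(top: str) -> List[Tuple[str, str]]:
    """List immediate subdirectories of `top` that cannot be stat'ed or listed.

    Returns (path, errno_name) pairs, e.g. ('/some/dir', 'EIO').
    """
    bad: List[Tuple[str, str]] = []
    for name in sorted(os.listdir(top)):
        path = os.path.join(top, name)
        try:
            # lstat so that symlinks are skipped rather than followed.
            if not stat.S_ISDIR(os.lstat(path).st_mode):
                continue
            os.listdir(path)  # the same operation a plain `ls` needs to succeed
        except OSError as exc:
            bad.append((path, errno.errorcode.get(exc.errno, str(exc.errno))))
    return bad
```

Calling `find_unreadable_dirs` on the client mount point of the filesystem returns an empty list when every top-level directory can be listed; the mount point itself must be readable, otherwise the initial listing raises.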
| Comment by Wojciech Turek (Inactive) [ 22/Aug/12 ] |
|
I was wondering if you could update me on any developments regarding this apparently critical issue. |
| Comment by Wojciech Turek (Inactive) [ 23/Aug/12 ] |
|
I am surprised that there is not much progress on this serious issue that everybody using lustre is affected by at the moment. I managed to reproduce the problem on my test filesystem, these are the steps: I hope that helps in debugging the problem. |
| Comment by Hellen (Inactive) [ 23/Aug/12 ] |
|
A potential customer is testing the 2.1.2 release and ran into this issue? |
| Comment by Zhenyu Xu [ 27/Aug/12 ] |
|
Patch tracking at http://review.whamcloud.com/3799
Patch description:
LU-1774 e2fsck: e2fsck -D does not change dirdata content
* Fix dir optimization to preserve dirdata content for dot and dotdot
entries.
* Add test case.
|
| Comment by Wojciech Turek (Inactive) [ 12/Sep/12 ] |
|
We have found a method to recover the data and copy it to a new filesystem. However, I think it would still be useful to others to be able to repair the corruption rather than having to copy the data. |
| Comment by Zhenyu Xu [ 04/Dec/12 ] |
|
landed for e2fsprogs 1.42.6 |