Details

    • Bug
    • Resolution: Fixed
    • Critical
    • None
    • Lustre 2.1.2
    • None
    • 2
    • 4008

    Description

      I have been seeing a large number of messages like the one below on the production /scratch FS.

      Aug 17 17:54:52 mds07 mds07 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_dx_add_entry: Directory index full!

      The /scratch FS temporarily holds user /home directories until I install new hardware for a separate Lustre /home FS. The area of /scratch that holds the user /home directories is backed up on a daily basis.

      Device dm-2 is the MDT for our production scratch FS. The file system has around 160M files at the moment, and from what I found by reading various posts, the LDISKFS message above suggests that we may have some very large directories in our /scratch FS. I decided to run fsck -fD, which supposedly should optimize directory structures and get rid of the above problem (at least temporarily).
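
One rough way to look for the very large directories suspected above is to walk a subtree from a client mount and report directories by entry count. This is only a sketch: the name `find_large_dirs` and the `min_entries` cut-off are illustrative assumptions, not part of Lustre or e2fsprogs, and the entry count at which ldiskfs reports "Directory index full" depends on the filesystem's configuration. With around 160M files, a walk of the whole file system would be slow, so it is probably only practical on suspect subtrees.

```python
import os


def find_large_dirs(root, min_entries):
    """Return (path, entry_count) for each directory under root that holds
    at least min_entries entries, largest first.

    min_entries is a hypothetical cut-off chosen by the operator; it is
    not the limit at which ldiskfs reports "Directory index full".
    """
    results = []
    # followlinks=False: symlinked directories are counted as entries of
    # their parent but are not descended into.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        count = len(dirnames) + len(filenames)
        if count >= min_entries:
            results.append((dirpath, count))
    results.sort(key=lambda item: item[1], reverse=True)
    return results
```

For example, `find_large_dirs("/scratch/some/subtree", 100000)` would list directories with at least 100000 entries; both the path and the number are placeholders.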

      Unfortunately this turned out to be a bad idea. The first pass of fsck found over 3200 invalid symlinks and decided to clear them, for example:
      Symlink /ROOT/new_home/dws29/sandy/InstallArea/XML/CamMapCut64.pie (inode #66608225) is invalid.
      Clear<y>? yes
      I have checked those supposedly invalid symlinks against our /home backup and the symlinks are actually correct, so fsck just removed over 3K valid symlinks.

      /mnt/backup/home/dws29/sandy/InstallArea/XML/CamMapCut64.pie -> /home/dws29/sandy/Task_pkg/HL2_PowellSnakes/v00-00-020000_CVSHEAD/cmt/../XMLMODULESCHECKED//CamMapCut64.pie

      Obviously the whole /scratch FS holds many more than 3K symlinks, so I am puzzled by what criteria fsck used to decide to clear these particular symlinks.
      I was able to recover some of them from our /home backup, but the ones in the area of /scratch that is not backed up were cleared forever (not cool).
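
The comparison against the backup described above can be sketched as follows. The function name `missing_symlinks` is an assumption for illustration; the code only reports symlinks that exist in the backup tree but not at the same relative path in the live tree, and leaves recreating them to the operator.

```python
import os


def missing_symlinks(backup_root, live_root):
    """Yield (relative_path, target) for every symlink found under
    backup_root that has no symlink at the same relative path under
    live_root."""
    for dirpath, dirnames, filenames in os.walk(backup_root):
        # os.walk lists symlinks to directories in dirnames and all other
        # symlinks (including dangling ones) in filenames, so both lists
        # have to be checked.
        for name in dirnames + filenames:
            backup_path = os.path.join(dirpath, name)
            if not os.path.islink(backup_path):
                continue
            rel = os.path.relpath(backup_path, backup_root)
            if not os.path.islink(os.path.join(live_root, rel)):
                yield rel, os.readlink(backup_path)
```

With the paths from this report, `backup_root` would be something like `/mnt/backup/home`, and `live_root` the matching area on /scratch (its client-side path is not given here, so it would have to be filled in). A reported link could then be restored with `os.symlink(target, os.path.join(live_root, rel))`.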

      I ran a second pass of fsck and then mounted the MDT back. Everything seemed OK until the overnight rsync backup process started to copy files and hit many I/O errors when trying to enter some directories, for example:
      ls /home/ad491/progs/gromacs-3.3.3/src/gmxlib/gmx_blas
      ls: reading directory /home/ad491/progs/gromacs-3.3.3/src/gmxlib/gmx_blas: Input/output error

      I can see inside these directories from the MDS by using debugfs, so I am hoping that the data are not completely gone.
      debugfs -c -R 'ls -l ROOT/new_home/ad491/progs/gromacs-3.3.3/src/gmxlib/gmx_blas/' /dev/mapper/mds08_scratch_mdt
      debugfs 1.42.3.wc1 (28-May-2012)
      /dev/mapper/mds08_scratch_mdt: catastrophic mode - not reading inode or group bitmaps
      51920683 40775 (18) 9040 9043 8192 19-Dec-2009 16:23 .
      51920609 40775 (18) 9040 9043 16384 19-Dec-2009 16:26 ..
      51921092 40775 (18) 9040 9043 4096 19-Dec-2009 16:23 .deps
      51921094 40775 (18) 9040 9043 4096 19-Dec-2009 16:23 .libs
      51923088 100664 (17) 9040 9043 0 19-Dec-2009 16:19 Makefile
      51923090 100644 (17) 9040 9043 0 2-Feb-2005 13:05 Makefile.am
      51923092 100644 (17) 9040 9043 0 28-Feb-2008 15:41 Makefile.in
      51923093 100644 (17) 9040 9043 0 24-Aug-2005 01:41 dasum.c
      51923094 100664 (17) 9040 9043 0 19-Dec-2009 16:23 dasum.lo
      51923095 100664 (17) 9040 9043 0 19-Dec-2009 16:23 dasum.o
      51923096 100644 (17) 9040 9043 0 2-Feb-2005 13:05 daxpy.c
      51923097 100664 (17) 9040 9043 0 19-Dec-2009 16:23 daxpy.lo
      51923098 100664 (17) 9040 9043 0 19-Dec-2009 16:23 daxpy.o

      Again, I am able to recover directories that are in the backed-up area of scratch, but that is not a lot, and many of the corrupted directories are not backed up. Is there any way to reverse/fix what the -D optimisation did and reconstruct the data?
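
To inventory what debugfs can still see in a damaged directory, the `ls -l` output shown above can be parsed into records. This is a minimal sketch that assumes the column layout seen in this report (inode, octal mode, a number in parentheses, uid, gid, size, date, time, name); `DebugfsEntry` and `parse_debugfs_ls` are illustrative names, and other debugfs versions may format the listing differently.

```python
import re
from typing import NamedTuple

# Matches lines such as:
#   51923088 100664 (17) 9040 9043 0 19-Dec-2009 16:19 Makefile
_LS_LINE = re.compile(
    r"^\s*(\d+)\s+([0-7]+)\s+\((\d+)\)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(.+?)\s*$"
)


class DebugfsEntry(NamedTuple):
    inode: int
    mode: int        # parsed from octal, usable with the stat module
    type_byte: int   # the number shown in parentheses
    uid: int
    gid: int
    size: int
    mtime: str       # date and time kept as printed
    name: str


def parse_debugfs_ls(text):
    """Parse 'debugfs ls -l' output; lines that do not look like entries
    (version banner, catastrophic-mode notice) are skipped."""
    entries = []
    for line in text.splitlines():
        m = _LS_LINE.match(line)
        if m is None:
            continue
        inode, mode, type_byte, uid, gid, size, date, time, name = m.groups()
        entries.append(DebugfsEntry(int(inode), int(mode, 8), int(type_byte),
                                    int(uid), int(gid), int(size),
                                    date + " " + time, name))
    return entries
```

A size of 0 for regular files is generally expected on an MDT, since file data is stored on the OSTs, so it does not by itself indicate loss. The parenthesised values 17 and 18 would be consistent with the usual file-type codes 1 (regular file) and 2 (directory) plus a 0x10 flag bit marking entries that carry Lustre dirdata, but that reading is an assumption here.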

      I am attaching a log from fsck and also a few days' worth of syslog messages from the MDS and OSS servers. Please note that on 17 Aug around 6pm we had an IB network outage, so there will be some noise related to that in the logs.

      It may also be worth mentioning that the FS is less than a couple of months old and was created using e2fsprogs-1.42.3.wc1-7.el6.x86_64, which already had some fixes for fsck -D issues.

          Activity

            [LU-1774] fsck -fD corrupts filesystem
            bobijam Zhenyu Xu added a comment -

            landed for e2fsprogs 1.42.6


            wjt27 Wojciech Turek added a comment -

            We have found a method to recover the data and copy it to a new filesystem. However, I think it would still be useful to others to be able to repair the corruption rather than have to copy the data.
            I tested the patch and it recovers access to the corrupted directories, but it does not fix them completely: an application accessing the dot or dot-dot entry still receives an I/O error.
            For example, if the /scratch/yyy directory has been corrupted and then fixed, an rsync of /scratch/yyy will fail with an I/O error, but an rsync of /scratch/yyy/* will work fine.

            bobijam Zhenyu Xu added a comment -

            patch tracking at http://review.whamcloud.com/3799

            patch description
                LU-1774 e2fsck: e2fsck -D does not change dirdata content
            
                * Fix dir optimization to preserver dirdata content for dot and dotdot
                  entries.
            
                * Add test case.
            

            hellenn Hellen (Inactive) added a comment -

            A potential customer is testing the 2.1.2 release and ran into this issue?

            wjt27 Wojciech Turek added a comment -

            I am surprised that there is not much progress on this serious issue, which affects everybody using Lustre at the moment.

            I managed to reproduce the problem on my test filesystem. These are the steps:
            1) Create a test filesystem with the latest e2fsprogs wc3 release.
            2) Mount testfs on the client, create a directory and fill it with files so that the size of the directory is bigger than 4K.
            cd /ltestfs
            ls -al /ltestfs/new_scratch1/
            total 40
            drwxr-xr-x 4 root root 4096 Aug 23 03:35 .
            drwxr-xr-x 4 root root 4096 Aug 23 03:35 ..
            drwxr-x--- 229 sjr20 sjr20 20480 Jun 22 10:25 sjr20
            drwxr-x--- 131 wjt27 wjt27 12288 Mar 15 20:19 wjt27
            3) Unmount the MDT and run fsck -fvD on it. Every time you run it, e2fsck will modify the filesystem.
            4) Mount the MDT back and, on the client, move the directories; for example, I moved them up one level:
            mv new_scratch1/* .
            ls -al
            total 48
            drwxr-xr-x 6 root root 4096 Aug 23 10:51 .
            drwxr-xr-x 32 root root 4096 Aug 23 00:58 ..
            drwxr-xr-x 2 root root 4096 Aug 22 21:06 .lustre
            drwxr-xr-x 2 root root 4096 Aug 23 10:51 new_scratch1
            drwxr-x--- 229 sjr20 sjr20 20480 Jun 22 10:25 sjr20
            drwxr-x--- 131 wjt27 wjt27 12288 Mar 15 20:19 wjt27
            5) Try to list a directory:
            ls -al wjt27/
            ls: reading directory wjt27/: Input/output error
            total 0

            I hope that helps in debugging the problem.
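
Step 2 of the reproduction (growing a directory past 4K) could be scripted along these lines. This is a sketch under assumptions: `fill_directory` is a hypothetical helper, and it relies on the directory's own `st_size` growing as entries are added, which holds for the listings shown above but is not guaranteed on every file system. Steps 3 and 4 still require unmounting the MDT and running e2fsck by hand.

```python
import os


def fill_directory(path, block_size=4096, max_files=100000):
    """Create empty files in path until the directory's own size, as
    reported by stat, exceeds block_size, or until max_files files have
    been created. Returns the number of files created."""
    os.makedirs(path, exist_ok=True)
    created = 0
    while created < max_files and os.stat(path).st_size <= block_size:
        # Long names make each directory entry larger, so the directory
        # outgrows a single block with fewer files.
        name = "file_{:06d}_{}".format(created, "x" * 40)
        open(os.path.join(path, name), "w").close()
        created += 1
    return created
```

The `max_files` cap is there so the loop terminates even where the reported directory size never grows.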

            wjt27 Wojciech Turek added a comment -

            I was wondering if you could update me on any developments on this apparently critical issue.
            After yesterday's e2fsck run, which was meant only to fix NUL termination of symlinks, we have identified 26 top-level user directories that cannot be accessed due to I/O errors. It seems that all the affected directories are those whose size is bigger than 4K.
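
Checking top-level directories for this failure can be automated instead of running ls by hand. The sketch below lists each immediate subdirectory and collects those whose listing fails with EIO (the "Input/output error" seen in this report). `unreadable_dirs` and its `lister` parameter are illustrative assumptions; `lister` exists only so the behaviour can be exercised without a damaged file system.

```python
import errno
import os


def unreadable_dirs(parent, lister=os.listdir):
    """Return the names of immediate subdirectories of parent whose
    listing fails with EIO (Input/output error)."""
    bad = []
    for name in sorted(lister(parent)):
        path = os.path.join(parent, name)
        if not os.path.isdir(path):
            continue
        try:
            lister(path)
        except OSError as exc:
            if exc.errno == errno.EIO:
                bad.append(name)
            else:
                # Other failures (for example permission errors) are not
                # the symptom described here, so they are not hidden.
                raise
    return bad
```

For example, `unreadable_dirs("/lscratch")` would return the affected top-level names, assuming the scan is run by a user allowed to list every directory.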

            wjt27 Wojciech Turek added a comment -

            Hi Cliff,

            I attached earlier syslogs. Please note, though, that the corruption occurred after running e2fsck with the -D option on 17 August.

            The situation got much worse today and it stops us from running the /scratch filesystem; see details below.

            I decided to run e2fsck on the scratch MDT today to fix symlinks that were missing NUL terminators. I updated e2fsprogs to the latest build, see below:
            e2fsprogs-1.42.3.wc3-7.el6.x86_64

            I first ran fsck -fvn to see what would be done, and only symlink problems were reported, so I ran fsck -fvy, which fixed the bad symlinks; nothing else was reported as fixed. Then I mounted the filesystem as normal. Unfortunately, the "old" directory corruption (which occurred on 17 Aug) was still there, and new directories were corrupted as well. For example, I have found that a large number of user directories on the /lscratch FS, including my own, are corrupted and I cannot access them any more. The MDS log is also full of scary messages about corruption, see below.

            Logs from the client on which I ran ls on corrupted directories:
            Aug 21 20:18:52 west-1-1 kernel: Lustre: Mounted lscratch-client
            Aug 21 20:19:45 west-1-1 kernel: Lustre: Mounted lhome-client
            Aug 21 21:19:11 west-1-1 kernel: LustreError: 9836:0:(dir.c:478:ll_get_dir_page()) read cache page: [0x20000045f:0xe16e:0x0] at 0: rc -5
            Aug 21 21:19:11 west-1-1 kernel: LustreError: 9836:0:(dir.c:649:ll_readdir()) error reading dir [0x20000045f:0xe16e:0x0] at 0: rc -5
            Aug 21 21:29:54 west-1-1 kernel: LustreError: 9882:0:(dir.c:478:ll_get_dir_page()) read cache page: [0x200000404:0x2a4:0x0] at 0: rc -5
            Aug 21 21:29:54 west-1-1 kernel: LustreError: 9882:0:(dir.c:649:ll_readdir()) error reading dir [0x200000404:0x2a4:0x0] at 0: rc -5
            Aug 21 21:32:03 west-1-1 kernel: LustreError: 9890:0:(dir.c:439:ll_get_dir_page()) dir page locate: [0x200000404:0x2a4:0x0] at 0: rc -5
            Aug 21 21:32:03 west-1-1 kernel: LustreError: 9890:0:(dir.c:649:ll_readdir()) error reading dir [0x200000404:0x2a4:0x0] at 0: rc -5

            MDS log:

            Aug 21 21:19:11 10.143.245.207 mds07 kernel: Lustre: 12085:0:(mdd_object.c:2412:__mdd_readpage()) build page failed: -5!
            Aug 21 21:27:27 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: dx entry: limit != root limit
            Aug 21 21:27:27 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 22551510, running e2fsck is recommended.
            Aug 21 21:27:27 10.143.245.207 mds07 kernel: Lustre: 12085:0:(mdd_object.c:2412:__mdd_readpage()) build page failed: -5!
            Aug 21 21:29:01 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: dx entry: limit != root limit
            Aug 21 21:29:01 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 22551603, running e2fsck is recommended.
            Aug 21 21:29:01 10.143.245.207 mds07 kernel: Lustre: 12085:0:(mdd_object.c:2412:__mdd_readpage()) build page failed: -5!
            Aug 21 21:29:54 10.143.245.207 mds07 kernel: Lustre: 12085:0:(mdd_object.c:2412:__mdd_readpage()) build page failed: -5!
            Aug 21 21:29:59 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: dx entry: limit != root limit
            Aug 21 21:29:59 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 23073075, running e2fsck is recommended.
            Aug 21 21:29:59 10.143.245.207 mds07 kernel: Lustre: 12085:0:(mdd_object.c:2412:__mdd_readpage()) build page failed: -5!
            Aug 21 21:32:15 10.143.245.207 mds07 kernel: Lustre: 12085:0:(mdd_object.c:2412:__mdd_readpage()) build page failed: -5!
            Aug 21 21:34:45 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: dx entry: limit != root limit
            Aug 21 21:34:45 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 22550971, running e2fsck is recommended.
            Aug 21 21:34:45 10.143.245.207 mds07 kernel: Lustre: 12085:0:(mdd_object.c:2412:__mdd_readpage()) build page failed: -5!
            Aug 21 21:34:56 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: dx entry: limit != root limit
            Aug 21 21:34:56 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 22551058, running e2fsck is recommended.
            Aug 21 21:35:08 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: dx entry: limit != root limit
            Aug 21 21:35:08 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 22551153, running e2fsck is recommended.
            Aug 21 21:35:08 10.143.245.207 mds07 kernel: Lustre: 12085:0:(mdd_object.c:2412:__mdd_readpage()) build page failed: -5!
            Aug 21 21:35:08 10.143.245.207 mds07 kernel: Lustre: 12085:0:(mdd_object.c:2412:__mdd_readpage()) Skipped 1 previous similar message
            Aug 21 21:35:08 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: dx entry: limit != root limit
            Aug 21 21:35:08 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 22551158, running e2fsck is recommended.
            Aug 21 21:35:09 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: Unrecognised inode hash code 182 for directory #23073097
            Aug 21 21:35:09 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 23073097, running e2fsck is recommended.
            Aug 21 21:50:15 10.143.245.207 mds07 kernel: Lustre: 12085:0:(mdd_object.c:2412:__mdd_readpage()) build page failed: -5!
            Aug 21 21:50:15 10.143.245.207 mds07 kernel: Lustre: 12085:0:(mdd_object.c:2412:__mdd_readpage()) Skipped 1 previous similar message
            Aug 21 21:50:29 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: dx entry: limit != root limit
            Aug 21 21:50:29 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 23073110, running e2fsck is recommended.
            Aug 21 21:50:29 10.143.245.207 mds07 kernel: Lustre: 12085:0:(mdd_object.c:2412:__mdd_readpage()) build page failed: -5!
            Aug 21 21:50:31 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: dx entry: limit != root limit
            Aug 21 21:50:31 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 22551292, running e2fsck is recommended.
            Aug 21 21:50:33 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: dx entry: limit != root limit
            Aug 21 21:50:33 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 23073121, running e2fsck is recommended.
            Aug 21 21:50:33 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: dx entry: limit != root limit
            Aug 21 21:50:33 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 22551319, running e2fsck is recommended.
            Aug 21 21:50:36 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: dx entry: limit != root limit
            Aug 21 21:50:36 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 23073123, running e2fsck is recommended.
            Aug 21 21:50:37 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: dx entry: limit != root limit
            Aug 21 21:50:37 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 22551353, running e2fsck is recommended.
            Aug 21 21:50:37 10.143.245.207 mds07 kernel: Lustre: 12085:0:(mdd_object.c:2412:__mdd_readpage()) build page failed: -5!
            Aug 21 21:50:37 10.143.245.207 mds07 kernel: Lustre: 12085:0:(mdd_object.c:2412:__mdd_readpage()) Skipped 4 previous similar messages
            Aug 21 21:50:38 10.143.245.207 mds07 kernel: LDISKFS-fs warning (device dm-2): dx_probe: Unrecognised inode hash code 5 for directory #22551357

            There are more entries like that. I am still running ls on the top directories of /lscratch to detect corrupted ones.

            This is very bad and I hope we can recover them.

            I am attaching logs from both fsck runs.
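
The logs above can be reduced to a list of affected directories. The sketch below pulls the inode numbers out of the MDS `dx_probe` warnings and the FIDs out of the client `ll_readdir()` errors (`rc -5` is -EIO). The function names are illustrative assumptions, and the regular expressions are written against the message formats shown in this comment only.

```python
import re

_CORRUPT_DIR = re.compile(r"dx_probe: Corrupt dir inode (\d+)")
_BAD_HASH = re.compile(
    r"dx_probe: Unrecognised inode hash code \d+ for directory #(\d+)"
)
_FID = re.compile(r"\[0x[0-9a-f]+:0x[0-9a-f]+:0x[0-9a-f]+\]")


def corrupt_dir_inodes(log_lines):
    """Return the sorted, de-duplicated MDT inode numbers named in
    LDISKFS dx_probe warnings."""
    inodes = set()
    for line in log_lines:
        for pattern in (_CORRUPT_DIR, _BAD_HASH):
            m = pattern.search(line)
            if m:
                inodes.add(int(m.group(1)))
    return sorted(inodes)


def failed_dir_fids(log_lines):
    """Return the sorted, de-duplicated FIDs from client-side
    'error reading dir' messages that ended with rc -5."""
    fids = set()
    for line in log_lines:
        if "ll_readdir()" in line and line.rstrip().endswith("rc -5"):
            m = _FID.search(line)
            if m:
                fids.add(m.group(0))
    return sorted(fids)
```

Feeding the MDS syslog through `corrupt_dir_inodes` gives the inode numbers to inspect on the MDT; mapping them back to path names could then be attempted with debugfs's `ncheck` command against the MDT device.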

            wjt27 Wojciech Turek added a comment -

            Lustre syslog messages from 29 Jul till 12 Aug

            cliffw Cliff White (Inactive) added a comment -

            Your MDS logs start on August 12th, and at that time the error message is already happening. Is it possible to get logs for the MDS from prior to August 12th? Can you determine when the error first appeared?

            wjt27 Wojciech Turek added a comment -

            Some of the problems I am seeing seem to be related to LU-1366 and LU-1540, so e2fsprogs-1.42.3.wc3 should at least sort out the symlink problem, but I cannot find anything related to directory corruption.

            People

              bobijam Zhenyu Xu
              wjt27 Wojciech Turek
              Votes: 0
              Watchers: 9