[LU-2627] /bin/ls gets Input/output error Created: 16/Jan/13  Updated: 03/Jul/13  Resolved: 21/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.3
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: Cliff White (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File fsck.2.8.2012.nbp1.out.gz     File mdtsnap.fsck.out.gz     File nbp1FSCK.out.gz    
Issue Links:
Duplicate
is duplicated by LU-3519 EIO on directory access Closed
Related
is related to LU-2634 short symlinks on MDT with "extents" ... Resolved
Sub-Tasks:
Key | Summary | Type | Status | Assignee
LU-2638 | corruption of MDT ".." entry in some ... | Technical task | Resolved | nasf
Severity: 3
Rank (Obsolete): 6149

 Description   

Doing an ls gives the following error
ls: reading directory d4_stats/: Input/output error

client error:
[5237686.818045] LustreError: 77522:0:(dir.c:648:ll_readdir()) error reading dir [0x4488b6ced74:0x1edb5:0x0] at 0: rc -5
[5237686.849844] LustreError: 77522:0:(dir.c:648:ll_readdir()) Skipped 51 previous similar messages

MDT Error:
Jan 16 11:18:37 nbp1-mds kernel: Lustre: 15390:0:(mdd_object.c:2412:__mdd_readpage()) build page failed: -5!

Please advise on debug flags to use to gather logs.



 Comments   
Comment by Mahmoud Hanafi [ 16/Jan/13 ]

MDS has logged these messages
Jan 15 15:23:11 nbp1-mds kernel: LDISKFS-fs warning (device dm-2): dx_probe: dx entry: limit != root limit
Jan 15 15:23:11 nbp1-mds kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 158428053, running e2fsck is recommended.
Jan 15 15:23:11 nbp1-mds kernel: LDISKFS-fs warning (device dm-2): dx_probe: dx entry: limit != root limit
Jan 15 15:23:11 nbp1-mds kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 158428056, running e2fsck is recommended.
Jan 15 15:23:45 nbp1-mds kernel: LDISKFS-fs warning (device dm-2): dx_probe: dx entry: limit != root limit
Jan 15 15:23:45 nbp1-mds kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 158427968, running e2fsck is recommended.
Jan 15 15:23:45 nbp1-mds kernel: LDISKFS-fs warning (device dm-2): dx_probe: dx entry: limit != root limit
Jan 15 15:23:45 nbp1-mds kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 158428043, running e2fsck is recommended.
Jan 16 03:28:55 nbp1-mds kernel: LDISKFS-fs warning (device dm-2): dx_probe: Unrecognised inode hash code 5 for directory #37273881
Jan 16 03:28:55 nbp1-mds kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 37273881, running e2fsck is recommended.
Jan 16 07:24:28 nbp1-mds kernel: LDISKFS-fs warning (device dm-2): dx_probe: Unrecognised inode hash code 4 for directory #39331752
Jan 16 07:24:28 nbp1-mds kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 39331752, running e2fsck is recommended.
Jan 16 07:25:11 nbp1-mds kernel: LDISKFS-fs warning (device dm-2): dx_probe: Unrecognised inode hash code 5 for directory #37273881
Jan 16 07:25:11 nbp1-mds kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 37273881, running e2fsck is recommended.
Jan 16 09:59:07 nbp1-mds kernel: LDISKFS-fs warning (device dm-2): dx_probe: Unrecognised inode hash code 107 for directory #15731753
Jan 16 09:59:07 nbp1-mds kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 15731753, running e2fsck is recommended.
Jan 16 11:13:11 nbp1-mds kernel: LDISKFS-fs warning (device dm-2): dx_probe: Unrecognised inode hash code 10 for directory #37272081
Jan 16 11:13:11 nbp1-mds kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 37272081, running e2fsck is recommended.
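
Editor's sketch, not part of the original thread: one way to collect the affected directory inodes from syslog lines like those above before planning an e2fsck run. The function name `corrupt_dir_inodes` is hypothetical, and the regular expression assumes the exact `dx_probe: Corrupt dir inode N` wording quoted in this ticket.

```python
import re

# Matches the ldiskfs warning wording quoted in this ticket. Other kernel
# versions may phrase the message differently (assumption).
CORRUPT_DIR_RE = re.compile(r"dx_probe: Corrupt dir inode (\d+)")


def corrupt_dir_inodes(log_lines):
    """Return the distinct corrupt directory inode numbers in first-seen order."""
    seen = []
    for line in log_lines:
        match = CORRUPT_DIR_RE.search(line)
        if match:
            inode = int(match.group(1))
            if inode not in seen:
                seen.append(inode)
    return seen
```

Feeding it the MDS messages file line by line gives a short list of directory inodes to compare against the later e2fsck output.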

Comment by Mahmoud Hanafi [ 16/Jan/13 ]

Can we unmount and just run e2fsck on the MDT device?

Comment by Cliff White (Inactive) [ 16/Jan/13 ]

Yes, you should umount and fsck the MDT. You do not have to umount clients, however clients may block while the MDT is down.

Comment by Mahmoud Hanafi [ 16/Jan/13 ]

Can you please provide the exact options to use for the fsck command?

Comment by Cliff White (Inactive) [ 16/Jan/13 ]

First, check all your logs and see if you are having hardware failures. Is there any error logging in your disk hardware? The device under dm-2 may have an issue.
For fsck, first run

# fsck -fn <yourMDTdevice>

This is a read-only pass, and should give you an idea of what is going on.
Then, you can run

# fsck -fy <yourMDTdevice>

to repair.
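
Editor's sketch, not part of the original thread: when the read-only and repair passes are scripted, the e2fsck exit status can be decoded as a bit mask. The bit meanings follow the e2fsck(8) man page; the names `E2FSCK_EXIT_BITS` and `decode_e2fsck_status` are hypothetical.

```python
# Exit-status bits as documented in the e2fsck(8) man page.
E2FSCK_EXIT_BITS = {
    1: "file system errors corrected",
    2: "file system errors corrected, system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "e2fsck canceled by user request",
    128: "shared library error",
}


def decode_e2fsck_status(status):
    """Translate an e2fsck exit status into a list of human-readable conditions."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in sorted(E2FSCK_EXIT_BITS.items()) if status & bit]
```

After a read-only `-fn` pass on a damaged filesystem, a status of 4 (errors left uncorrected) is the expected result, since nothing is written.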
Comment by Mahmoud Hanafi [ 16/Jan/13 ]

Read only pass has lots of errors like this
Error while reading over extent tree in inode 14726136: Corrupt extent header

Comment by Cliff White (Inactive) [ 16/Jan/13 ]

Can you post the full output?

Comment by Cliff White (Inactive) [ 16/Jan/13 ]

And are you using e2fsprogs from Whamcloud? Please indicate the version of e2fsprogs you have installed.

Comment by Cliff White (Inactive) [ 16/Jan/13 ]

Also, is there any indication of a hardware issue with the disk?

Comment by Mahmoud Hanafi [ 16/Jan/13 ]

long list of this

nbp1-MDT0000 has been mounted 110 times without being checked, check forced.
Pass 1: Checking inodes, blocks, and sizes
Error while reading over extent tree in inode 8502968: Corrupt extent header
Clear inode? no

Error while reading over extent tree in inode 8503011: Corrupt extent header
Clear inode? no

Error while reading over extent tree in inode 8503034: Corrupt extent header
Clear inode? no

Error while reading over extent tree in inode 8503327: Corrupt extent header
Clear inode? no

Error while reading over extent tree in inode 8503340: Corrupt extent header
Clear inode? no

Error while reading over extent tree in inode 8503345: Corrupt extent header
Clear inode? no

Error while reading over extent tree in inode 8503781: Corrupt extent header
Clear inode? no

Error while reading over extent tree in inode 8503785: Corrupt extent header
Clear inode? no

Error while reading over extent tree in inode 8503787: Corrupt extent header
Clear inode? no

Error while reading over extent tree in inode 8503801: Corrupt extent header
Clear inode? no

Error while reading over extent tree in inode 8503805: Corrupt extent header
Clear inode? no

Error while reading over extent tree in inode 8503808: Corrupt extent header
Clear inode? no

Error while reading over extent tree in inode 8503810: Corrupt extent header
Clear inode? no

Error while reading over extent tree in inode 8503956: Corrupt extent header
Clear inode? no

Error while reading over extent tree in inode 8503961: Corrupt extent header
Clear inode? no

Error while reading over extent tree in inode 8504005: Corrupt extent header
Clear inode? no

Error while reading over extent tree in inode 8510949: Corrupt extent header
Clear inode? no

Error while reading over extent tree in inode 8541695: Corrupt extent header
Clear inode? no

I just stopped it for now.

Comment by Mahmoud Hanafi [ 16/Jan/13 ]

FYI, this is a 1.8 filesystem that was upgraded to 2.x.
We have the following:

e2fsprogs-1.41.90.wc4-7.el6.x86_64
lustre-ldiskfs-3.3.0-1nasS_2.6.32_279.2.1.el6.20120824.x86_64.lustre213.x86_64

Comment by Cliff White (Inactive) [ 16/Jan/13 ]

That is rather bad. You need to verify that your disk hardware is healthy, you may be seeing a disk failure. Do you have a backup?

Comment by Cliff White (Inactive) [ 16/Jan/13 ]

Can you give us your kernel version, and the version of all Lustre RPMs? Do you compile your own Lustre?

Comment by Mahmoud Hanafi [ 16/Jan/13 ]

The hardware is healthy. We don't have backups, but I am able to remount the MDT.
However, we got these errors:
Jan 16 14:24:10 nbp1-mds kernel: Lustre: nbp1-MDT0000: disconnecting 1 stale clients
Jan 16 14:24:10 nbp1-mds kernel: LustreError: 86971:0:(mdt_handler.c:2792:mdt_recovery()) operation 35 on unconnected MDS from 12345-10.151.53.248@o2ib
Jan 16 14:24:10 nbp1-mds kernel: LustreError: 86971:0:(mdt_handler.c:2792:mdt_recovery()) Skipped 2802 previous similar messages
Jan 16 14:24:10 nbp1-mds kernel: Lustre: 87267:0:(ldlm_lib.c:946:target_handle_connect()) nbp1-MDT0000: connection from 14c5fa4f-a16b-c855-4b79-5ba10c67a331@10.151.53.248@o2ib recovering/t491111894205 exp (null) cur 1358375050 last 0
Jan 16 14:24:10 nbp1-mds kernel: Lustre: 87267:0:(ldlm_lib.c:946:target_handle_connect()) Skipped 4169 previous similar messages
Jan 16 14:24:10 nbp1-mds kernel: Lustre: nbp1-MDT0000: Client f0dbfa3e-028e-c2ce-22e3-c3f171e89ebf (at 10.151.26.25@o2ib) reconnecting, waiting for 12090 clients in recovery for 3:12
Jan 16 14:24:10 nbp1-mds kernel: Lustre: nbp1-MDT0000: Denying connection for new client 10.151.53.248@o2ib (at 14c5fa4f-a16b-c855-4b79-5ba10c67a331), waiting for 745 clients in recovery for 3:12
Jan 16 14:24:10 nbp1-mds kernel: Lustre: Skipped 47 previous similar messages
Jan 16 14:24:12 nbp1-mds kernel: Lustre: nbp1-MDT0000: Client dd76159a-cc95-ee10-7333-4317d790b9fb (at 10.151.27.24@o2ib) reconnecting, waiting for 12090 clients in recovery for 3:12
Jan 16 14:24:13 nbp1-mds kernel: Lustre: nbp1-MDT0000: sending delayed replies to recovered clients
Jan 16 14:24:15 nbp1-mds kernel: Lustre: MDS mdd_obd-nbp1-MDT0000: nbp1-OST000a_UUID now active, resetting orphans
Jan 16 14:24:15 nbp1-mds kernel: Lustre: Skipped 91 previous similar messages
Jan 16 14:24:15 nbp1-mds kernel: Lustre: MDS mdd_obd-nbp1-MDT0000: nbp1-OST0002_UUID now active, resetting orphans
Jan 16 14:24:15 nbp1-mds kernel: Lustre: Skipped 1 previous similar message
Jan 16 14:24:15 nbp1-mds kernel: Lustre: 86698:0:(mdd_orphans.c:371:orph_key_test_and_del()) Found orphan! Delete it
Jan 16 14:24:15 nbp1-mds kernel: Lustre: 86698:0:(mdd_orphans.c:371:orph_key_test_and_del()) Skipped 2579 previous similar messages
Jan 16 14:24:16 nbp1-mds kernel: Lustre: MDS mdd_obd-nbp1-MDT0000: nbp1-OST0025_UUID now active, resetting orphans
Jan 16 14:24:16 nbp1-mds kernel: Lustre: Skipped 22 previous similar messages
Jan 16 14:24:18 nbp1-mds kernel: Lustre: 86698:0:(mdd_orphans.c:371:orph_key_test_and_del()) Found orphan! Delete it
Jan 16 14:24:18 nbp1-mds kernel: Lustre: 86698:0:(mdd_orphans.c:371:orph_key_test_and_del()) Skipped 1 previous similar message
Jan 16 14:24:22 nbp1-mds kernel: Lustre: 86698:0:(mdd_orphans.c:371:orph_key_test_and_del()) Found orphan! Delete it
Jan 16 14:24:22 nbp1-mds kernel: Lustre: 86698:0:(mdd_orphans.c:371:orph_key_test_and_del()) Skipped 18 previous similar messages
Jan 16 14:24:30 nbp1-mds kernel: Lustre: 86698:0:(mdd_orphans.c:371:orph_key_test_and_del()) Found orphan! Delete it
Jan 16 14:24:30 nbp1-mds kernel: Lustre: 86698:0:(mdd_orphans.c:371:orph_key_test_and_del()) Skipped 156 previous similar messages
Jan 16 14:24:45 nbp1-mds kernel: LDISKFS-fs warning (device dm-2): dx_probe: Unrecognised inode hash code 10 for directory #37272081
Jan 16 14:24:45 nbp1-mds kernel: LDISKFS-fs warning (device dm-2): dx_probe: Corrupt dir inode 37272081, running e2fsck is recommended.
Jan 16 14:24:45 nbp1-mds kernel: Lustre: 87669:0:(mdd_object.c:2412:__mdd_readpage()) build page failed: -5!
Jan 16 14:24:45 nbp1-mds kernel: Lustre: 87669:0:(mdd_object.c:2412:__mdd_readpage()) Skipped 1 previous similar message
Jan 16 14:24:46 nbp1-mds kernel: Lustre: 86698:0:(mdd_orphans.c:371:orph_key_test_and_del()) Found orphan! Delete it
Jan 16 14:24:46 nbp1-mds kernel: Lustre: 86698:0:(mdd_orphans.c:371:orph_key_test_and_del()) Skipped 1476 previous similar messages

Comment by Mahmoud Hanafi [ 16/Jan/13 ]

Here is list of server lustre rpms
lustre-debuginfo-2.1.3-1nasS_2.6.32_279.2.1.el6.20120824.x86_64.lustre213.x86_64
lustre-2.1.3-1nasS_2.6.32_279.2.1.el6.20120824.x86_64.lustre213.x86_64
lustre-systemtap-0.9.3-2.noarch
lustre-modules-2.1.3-1nasS_2.6.32_279.2.1.el6.20120824.x86_64.lustre213.x86_64
lustre-ldiskfs-debuginfo-3.3.0-1nasS_2.6.32_279.2.1.el6.20120824.x86_64.lustre213.x86_64
lustre-ldiskfs-3.3.0-1nasS_2.6.32_279.2.1.el6.20120824.x86_64.lustre213.x86_64
lustre-source-2.1.3-1nasS_2.6.32_279.2.1.el6.20120824.x86_64.lustre213.x86_64
lustre-tools-0.7.10-2.noarch
lustre-tests-2.1.3-1nasS_2.6.32_279.2.1.el6.20120824.x86_64.lustre213.x86_64

Linux nbp1-mds 2.6.32-279.2.1.el6.20120824.x86_64.lustre213 #1 SMP Mon Aug 27 15:02:12 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

our source tree is out on github

Comment by Mahmoud Hanafi [ 16/Jan/13 ]

Should I remount the MDT for now so the clients can recover, or hold off?

Comment by Cliff White (Inactive) [ 16/Jan/13 ]

The only actual errors are the LDISKFS-fs warnings. The rest are mostly standard restart messages. (Errors should start with LustreError.)
It would have been better to have run fsck -fy prior to the remount.
We are concerned that the issues reported by fsck -fn will likely result in lost files if you run fsck -fy (the inodes with errors may be removed). You need to verify the health of the hardware prior to attempting any filesystem repair.
If the hardware is healthy, fsck -fy will likely make the filesystem operable, but you will probably have to identify and restore the affected files.

Comment by Mahmoud Hanafi [ 16/Jan/13 ]

Should we upgrade our e2fsprogs and try again? The hardware is definitely healthy.

Comment by Jay Lan (Inactive) [ 16/Jan/13 ]

The git source for the server can be found at
https://github.com/jlan/lustre-nas
with branch nas-2.1.3, tag 2.1.3-1nasS.

Comment by Mahmoud Hanafi [ 16/Jan/13 ]

Do you have any documentation on identifying and restoring the affected files?

Comment by Cliff White (Inactive) [ 16/Jan/13 ]

Yes, you should upgrade to the latest e2fsprogs from http://downloads.whamcloud.com/public/e2fsprogs/

Comment by Mahmoud Hanafi [ 16/Jan/13 ]

After running the new e2fsck, most of the errors are

Fast symlink 21070423 has EXTENT_FL set. Clear? no

and

Inode 11172447 symlink missing NUL terminator. Fix? no

is that good or bad?

Comment by Cliff White (Inactive) [ 16/Jan/13 ]

Not horrible. We think you may lose some symlinks. I would go ahead and say 'y'

Comment by Mahmoud Hanafi [ 16/Jan/13 ]

Hmm, e2fsck -vn hit a SIGSEGV once it started checking directory structures.

nbp1-mds ~/newrpms # e2fsck -vn /dev/mapper/nbp1-vg-mdt1 > initcheck.out
e2fsck 1.42.3.wc3 (15-Aug-2012)
Signal (11) SIGSEGV si_code=SEGV_MAPERR fault addr=0x8
e2fsck[0x42d1dd]
/lib64/libc.so.6[0x3a42432900]
e2fsck[0x41160a]
e2fsck(e2fsck_process_bad_inode+0x693)[0x419b13]
e2fsck[0x419cfb]
/lib64/libext2fs.so.2(ext2fs_dblist_iterate2+0x87)[0x7fffed8c9447]
e2fsck(e2fsck_pass2+0x10b)[0x418ceb]
e2fsck(e2fsck_run+0x4f)[0x40eb6f]
e2fsck(main+0xbd2)[0x40cbf2]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3a4241ecdd]
e2fsck[0x409c49]

Comment by Andreas Dilger [ 16/Jan/13 ]

How did the extents feature get enabled on the MDT filesystem? This is not a standard formatting option, and not something that we test locally. It is likely the root cause of the problems that you are seeing.

Comment by Mahmoud Hanafi [ 16/Jan/13 ]

Here is what we have. Should we remove the "extent" option?
nbp1-mds ~/newrpms # tune2fs -l /dev/mapper/nbp1-vg-mdt1
tune2fs 1.42.3.wc3 (15-Aug-2012)
Filesystem volume name: nbp1-MDT0000
Last mounted on: /
Filesystem UUID: aa5a51d9-5858-4bad-b6b6-668298ae0a7e
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 268435456
Block count: 268435456
Reserved block count: 0
Free blocks: 224938984
Free inodes: 204578188
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 960
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 32768
Inode blocks per group: 4096
Flex block group size: 16
Filesystem created: Wed Jun 8 19:54:48 2011
Last mount time: Wed Jan 16 14:17:10 2013
Last write time: Wed Jan 16 15:30:31 2013
Mount count: 112
Maximum mount count: 20
Last checked: Wed Jun 8 19:54:48 2011
Check interval: 15552000 (6 months)
Next check after: Mon Dec 5 18:54:48 2011
Lifetime writes: 7121 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 512
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 9bb09704-2b6c-4030-afe1-fcb7935216aa
Journal backup: inode blocks
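
Editor's sketch, not part of the original thread: a small parser for the "Filesystem features" line of `tune2fs -l` output, useful for checking whether features such as "extent" or "dirdata" are set on an MDT. The function name `parse_features` is hypothetical; it only reads text and runs no commands.

```python
def parse_features(tune2fs_output):
    """Return the feature names from the 'Filesystem features:' line, or []."""
    for line in tune2fs_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Filesystem features":
            return value.split()
    return []


def has_feature(tune2fs_output, name):
    """True if the named feature appears in the tune2fs feature list."""
    return name in parse_features(tune2fs_output)
```

Applied to the listing above, `has_feature(text, "extent")` and `has_feature(text, "dirdata")` would both be true.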

Comment by Cliff White (Inactive) [ 16/Jan/13 ]

Can you give us the output of 'tune2fs -l <device>' and 'tunefs.lustre --print <device>'?

Comment by Mahmoud Hanafi [ 16/Jan/13 ]

nbp1-mds ~/newrpms # tune2fs -l /dev/mapper/nbp1-vg-mdt1
tune2fs 1.42.3.wc3 (15-Aug-2012)
Filesystem volume name: nbp1-MDT0000
Last mounted on: /
Filesystem UUID: aa5a51d9-5858-4bad-b6b6-668298ae0a7e
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 268435456
Block count: 268435456
Reserved block count: 0
Free blocks: 224938984
Free inodes: 204578188
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 960
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 32768
Inode blocks per group: 4096
Flex block group size: 16
Filesystem created: Wed Jun 8 19:54:48 2011
Last mount time: Wed Jan 16 14:17:10 2013
Last write time: Wed Jan 16 15:30:31 2013
Mount count: 112
Maximum mount count: 20
Last checked: Wed Jun 8 19:54:48 2011
Check interval: 15552000 (6 months)
Next check after: Mon Dec 5 18:54:48 2011
Lifetime writes: 7121 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 512
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 9bb09704-2b6c-4030-afe1-fcb7935216aa
Journal backup: inode blocks
nbp1-mds ~/newrpms # tunefs.lustre -print /dev/mapper/nbp1-vg-mdt1
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target: nbp1-MDT0000
Index: 0
Lustre FS: nbp1
Mount type: ldiskfs
Flags: 0x401
(MDT )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=10.151.26.26@o2ib lov.stripesize=1048576 lov.stripecount=4 mdd.quota_type=u mdt.identity_info=/usr/sbin/l_getidentity

Permanent disk data:
Target: nbp1-MDT0000
Index: 0
Lustre FS: nbp1
Mount type: ldiskfs
Flags: 0x441
(MDT update )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=10.151.26.26@o2ib lov.stripesize=1048576 lov.stripecount=4 mdd.quota_type=u mdt.identity_info=/usr/sbin/l_getidentity rint

Writing CONFIGS/mountdata
nbp1-mds ~/newrpms #

Comment by Cliff White (Inactive) [ 16/Jan/13 ]

Sorry, it looks like our replies crossed. I see you already supplied the tune2fs output.

Comment by Andreas Dilger [ 16/Jan/13 ]

Can you please run e2fsck under gdb with "-n" option and paste the resulting stack trace here? I can't see enough of where the problem is above.

If you have a spare SATA disk I would recommend making a full backup of the MDT device with "dd", since this would go relatively quickly (maybe at 100MB/s, so a few hours for the full backup). This may be important in case running the real e2fsck doesn't go well (depending on what corruption is being seen).
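
Editor's sketch, not part of the original thread: the "few hours" estimate can be checked from the block count and block size in the tune2fs output. This assumes the device to copy is the same size as the filesystem and that 100MB/s means 100 * 10^6 bytes per second; `copy_hours` is a hypothetical helper.

```python
BLOCK_COUNT = 268435456  # "Block count" from the tune2fs output in this ticket
BLOCK_SIZE = 4096        # "Block size" in bytes


def copy_hours(total_bytes, bytes_per_second):
    """Hours needed to copy total_bytes at a sustained rate."""
    return total_bytes / bytes_per_second / 3600


device_bytes = BLOCK_COUNT * BLOCK_SIZE
print(device_bytes)                                     # 1099511627776 (1 TiB)
print(round(copy_hours(device_bytes, 100 * 10**6), 2))  # 3.05
```

A slower sustained rate scales the time proportionally, so the estimate is only as good as the throughput assumption.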

Comment by Mahmoud Hanafi [ 16/Jan/13 ]

Program received signal SIGSEGV, Segmentation fault.
check_symlink (ctx=0x645060, pctx=0x0, ino=8503956, inode=0x7fffffffe180, buf=<value optimized out>) at pass1.c:190
190 if (ext2fs_extent_open2(ctx->fs, pctx->ino, inode, &handle))
Missing separate debuginfos, use: debuginfo-install db4-4.7.25-16.el6.x86_64 glibc-2.12-1.47.el6.x86_64 libblkid-2.17.2-12.4.el6.x86_64 libuuid-2.17.2-12.4.el6.x86_64
(gdb) bt
#0 check_symlink (ctx=0x645060, pctx=0x0, ino=8503956, inode=0x7fffffffe180, buf=<value optimized out>) at pass1.c:190
#1 0x0000000000419b13 in e2fsck_process_bad_inode (ctx=0x645060, dir=<value optimized out>, ino=8503956, buf=0x66eda0 "") at pass2.c:1402
#2 0x0000000000419cfb in check_dir_block (fs=0x645510, db=0x7fffe11d2a68, priv_data=0x7fffffffe570) at pass2.c:1044
#3 0x00007fffed8c9447 in ext2fs_dblist_iterate2 (dblist=0x649100, func=0x419b70 <check_dir_block>, priv_data=0x7fffffffe570) at dblist.c:239
#4 0x0000000000418ceb in e2fsck_pass2 (ctx=0x645060) at pass2.c:148
#5 0x000000000040eb6f in e2fsck_run (ctx=0x645060) at e2fsck.c:226
#6 0x000000000040cbf2 in main (argc=<value optimized out>, argv=<value optimized out>) at unix.c:1852

Comment by Mahmoud Hanafi [ 16/Jan/13 ]

dd is going to take 20 hours! I have created a snapshot of the volume so we can run the fsck on it.

Comment by Zhenyu Xu [ 16/Jan/13 ]

the latest e2fsck has a glitch, and I uploaded a patch for it (http://review.whamcloud.com/5045)

commit message
LU-2627 e2fsck: check_symlink() SIGSEGV

Since e2fsck_pass1_check_symlink()->
check_symlink(ctx, NULL, ino, inode, buf), we should use 'ino' instead
of 'pctx->ino' in check_symlink().

this is just for the SIGSEGV issue.

Comment by Zhenyu Xu [ 16/Jan/13 ]

As Andreas suggested, run the patched e2fsck with "-n" under gdb and paste the resulting stack trace so that we can diagnose what the problem could be.

Running e2fsck with '-n' won't change the disk device.

Comment by Mahmoud Hanafi [ 17/Jan/13 ]

Looks like the patch got us past the SIGSEGV. But looks like the fsck will remove all symlinks!

It is calling out what looks like all symlinks as invalid.

Comment by Cliff White (Inactive) [ 17/Jan/13 ]

This is why we urge a backup before you fsck -y. Will the snapshot allow you to restore the symlinks?

Comment by Mahmoud Hanafi [ 17/Jan/13 ]

It is not clear to me why it is removing all the symlinks. Is it because of the extent option? How would we restore the symlinks from the dd backup?

Comment by Mahmoud Hanafi [ 17/Jan/13 ]

here is the summary of the test fsck.

nbp1-MDT0000: ********** WARNING: Filesystem still has errors **********

63829418 inodes used (23.78%, out of 268435456)
3242 non-contiguous files (0.0%)
35652 non-contiguous directories (0.1%)

# of inodes with ind/dind/tind blocks: 0/0/0
Extent depth histogram: 63096846/16544/13
43499835 blocks used (16.20%, out of 268435456)
0 bad blocks
21886 large files

62047854 regular files
773172 directories
0 character device files
0 block device files
20 fifos
6652 links
612271 symbolic links (349798 fast symbolic links)
65 sockets
------------
63836054 files

I can upload the full output.

[root@pladmin4:~/mhanafi]$ grep invalid fck.out | wc -l
396020

Comment by Cliff White (Inactive) [ 17/Jan/13 ]

We are not certain that the symlinks would be deleted. In a case such as this, it is always desirable to have a backup, if possible.

Comment by Zhenyu Xu [ 17/Jan/13 ]

please compress and upload fck.out.

I want to check whether those invalid symlinks are long symlinks that are missing the NUL terminator. For example:
Pass 1: Checking inodes, blocks, and sizes
Inode 121351 symlink missing NUL terminator.  Fix? no
...
...
Pass 2: Checking directory structure
Symlink /path/to/long/symlink/file (inode #121351) is invalid.		
Clear? no
...

If this is the case, the latest e2fsck should be capable of fixing them (as LU-1540 indicates).

Comment by Mahmoud Hanafi [ 17/Jan/13 ]

file is uploaded

Comment by Andreas Dilger [ 17/Jan/13 ]

Filed LU-2634 for tracking issue with EXT4_EXTENTS_FL set on symlinks for MDT with "extents" feature enabled.

Comment by Andreas Dilger [ 17/Jan/13 ]

Bobijam, I think that the problem is with e2fsck rejecting short symlinks with the EXT4_EXTENTS_FL set. The LU-1540 NUL termination problem appears that it would be fixed correctly with the current e2fsck. This EXT4_EXTENTS_FL appears to be a bug in the osd-ldiskfs code, if "extents" is enabled, for which I've filed LU-2634. Since we never format the MDT with "extents", we have never seen such a problem in our testing.

Inode 9482890 symlink missing NUL terminator.  Fix? no
Inode 9482897 symlink missing NUL terminator.  Fix? no
Fast symlink 9482914 has EXTENT_FL set.  Clear? no
Fast symlink 9482917 has EXTENT_FL set.  Clear? no
Fast symlink 9482921 has EXTENT_FL set.  Clear? no

It makes sense to change e2fsck to accept such inodes and just clear the EXT4_EXTENTS_FL instead of considering it corrupted. That will allow recovering the filesystem without the need to restore the symlinks (which would just get EXT4_EXTENTS_FL set again, until LU-2634 is fixed).

Comment by Mahmoud Hanafi [ 17/Jan/13 ]

This was a 1.8.x filesystem that was upgraded. So I think the extent option is leftover from the 1.8.x format.

Comment by Andreas Dilger [ 17/Jan/13 ]

Looking at the e2fsck code, it appears that it will correctly remove just the EXTENT_FL flag, rather than clear the whole inode:

                if (extent_fs && (inode->i_flags & EXT4_EXTENTS_FL) &&
                    LINUX_S_ISLNK(inode->i_mode) &&
                    !ext2fs_inode_has_valid_blocks2(fs, inode) &&
                    fix_problem(ctx, PR_1_FAST_SYMLINK_EXTENT_FL, &pctx)) {
                        inode->i_flags &= ~EXT4_EXTENTS_FL;
                        e2fsck_write_inode(ctx, ino, inode, "pass1");
                }

so the only confusion is that the PR_1_FAST_SYMLINK_EXTENT_FL problem code is asking "Clear", which might be confusing to some (including myself) as asking whether the inode should be cleared instead of the flag being cleared. I will submit a patch to fix this.
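
Editor's sketch, not part of the original thread: the effect of the quoted check, restated in Python to make clear that only one flag bit is dropped and the inode itself is kept. The value 0x00080000 is the EXT4_EXTENTS_FL definition from the ext4 headers; the function name is hypothetical, and the `extent_fs` test and the `fix_problem()` prompt from the C code are left out for brevity.

```python
import stat

EXT4_EXTENTS_FL = 0x00080000  # inode flag: inode uses extents


def clear_extent_flag_on_fast_symlink(mode, flags, has_valid_blocks):
    """Return the inode flags with EXT4_EXTENTS_FL cleared for a fast symlink.

    A fast symlink stores its target inside the inode, so it has no valid
    data blocks. Any other inode is returned with its flags unchanged.
    """
    if stat.S_ISLNK(mode) and (flags & EXT4_EXTENTS_FL) and not has_valid_blocks:
        return flags & ~EXT4_EXTENTS_FL
    return flags
```

All other bits in the flags word pass through untouched, which is why the symlink target survives the repair.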

The later errors:

Symlink /ROOT/pheimbac/ecco/2013-01-seaice-adjoint/MITgcm_latest/mysetups/arctic210x192x50/build_forw/timeave_cumulate.F (inode #68169598) is invalid.
Clear? no
Symlink /ROOT/pheimbac/ecco/2013-01-seaice-adjoint/MITgcm_latest/mysetups/arctic210x192x50/build_forw/cal_compdates.F (inode #68169136) is invalid.
Clear? no

should not be hit if the earlier checks to clear EXT4_EXTENTS_FL had been allowed to clear this flag from the short symlinks.

There are some further errors, much later in the log. There are ~20 of the following errors in Pass 2:

Pass 2: Checking directory structure
Second entry 'IE_t040101_000000.log' (inode=18364943) in directory inode 18373085 should be '..'
Fix? no
Entry '..' in /ROOT/xjia/Saturn/run_IdealizedSW_notilt_1e275_newgrid2_highorder/RESULTS/run_all/IE (18373085) is duplicate '..' entry.
Fix? no
Entry '..' in /ROOT/xjia/Saturn/run_IdealizedSW_notilt_1e275_newgrid2_highorder/RESULTS/run_all/IE (18373085) is duplicate '..' entry.
Fix? no
Entry '..' in /ROOT/xjia/Saturn/run_IdealizedSW_notilt_1e275_newgrid2_highorder/RESULTS/run_all/IE (18373085) is a link to directory /ROOT/xjia/Saturn/run_IdealizedSW_notilt_1e275_newgrid2_highorder/RESULTS/run_all (13221653).
Clear? no

that appear a bit unusual, but are not fatally broken. There are ~20 matching errors for the unfixed ".." entries later in Pass 3:

Pass 3: Checking directory connectivity
'..' in /ROOT/xjia/Saturn/run_IdealizedSW_notilt_1e275_newgrid2_highorder/RESULTS/run_all/IE (18373085) is <The NULL inode> (0), should be /ROOT/xjia/Saturn/run_IdealizedSW_notilt_1e275_newgrid2_highorder/RESULTS/run_all (13221653).
Fix? no

and a few minor errors in Pass 3A:

Pass 3A: Optimizing directories
Duplicate entry 'c_t_f.x' in /ROOT/aiannett/NCC/Testing/Back-Face-Step (77623393) found.  Clear? no
Entry 'c_t_f.x' in /ROOT/aiannett/NCC/Testing/Back-Face-Step (77623393) has a non-unique filename.
Rename to c_t_f.~0? no
Duplicate entry 'b1b2b3.x' in /ROOT/aiannett/NCC/Testing/Back-Face-Step (77623393) found.  Clear? no
Entry 'b1b2b3.x' in /ROOT/aiannett/NCC/Testing/Back-Face-Step (77623393) has a non-unique filename.
Rename to b1b2b3~0? no

It appears that the entries that would be "fixed" in Pass 2 will likely appear in lost+found once they are fixed, and if you want to recover those files you could mount the MDT locally with mount -t ldiskfs and rename them from .../lost+found/#inode to the path given for each inode number.

I think you could go ahead with running e2fsck -fy on the snapshot, mount the snapshot MDT filesystem locally as ldiskfs to verify a handful of the symlinks are still intact, and check lost+found for the ~20 or so inodes that would need to be fixed (you could even write a short script to rename them if downtime is critical). If that works OK, then when you take the real MDT filesystem offline for repair, please make another snapshot at that time, run the e2fsck -fy on the real MDT, mount as ldiskfs and repair the files in lost+found before unmounting and remounting it again as lustre.
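
Editor's sketch, not part of the original thread: a dry-run planner for the "short script to rename them" idea. It assumes that the directories which land in lost+found are named `#<inode>` and correspond to the Pass 3 '..' messages quoted above; both should be verified against the real lost+found listing. The mount point `/mnt/mdt` and the function name `rename_plan` are hypothetical. Nothing is renamed; the function only returns (source, destination) pairs.

```python
import posixpath
import re

# Pass 3 message format as quoted in this ticket:
# '..' in <path> (<inode>) is ..., should be <parent path> (<parent inode>).
PASS3_RE = re.compile(
    r"^'\.\.' in (?P<path>.+?) \((?P<ino>\d+)\) is .*? should be .+ \(\d+\)\.$"
)


def rename_plan(log_lines, lost_found_names, mount_point="/mnt/mdt"):
    """Pair lost+found entries with the original paths named in the e2fsck log."""
    present = set(lost_found_names)
    plan, planned = [], set()
    for line in log_lines:
        match = PASS3_RE.match(line.strip())
        if not match:
            continue
        name = "#" + match.group("ino")
        # Only plan a rename for entries that really exist in lost+found.
        if name in present and name not in planned:
            planned.add(name)
            src = posixpath.join(mount_point, "lost+found", name)
            dst = mount_point.rstrip("/") + match.group("path")
            plan.append((src, dst))
    return plan
```

Reviewing the printed pairs by hand before running any rename keeps the repair reversible while the MDT is mounted as ldiskfs.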

In order to get the number of messages in the e2fsck log to a manageable number, I filtered out all of the duplicate messages:

egrep -v "^$|^Fast symlink .* EXTENT_FL|^Inode .* missing NUL terminator|^Clear" e2fsck.log > e2fsck-filtered.log

I had also filtered out "^Symlink.*is invalid" messages, but I don't think you should hit them during the repairing e2fsck run.
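
Editor's sketch, not part of the original thread: a Python equivalent of the egrep filter above that also counts how many lines each pattern removed, which helps confirm that nothing unexpected was hidden. The patterns are copied from the egrep expression; the labels and the function name `filter_e2fsck_log` are hypothetical.

```python
import re
from collections import Counter

# Same alternatives as the egrep -v expression, one label per pattern.
NOISE_PATTERNS = {
    "blank": re.compile(r"^$"),
    "fast_symlink_extent_fl": re.compile(r"^Fast symlink .* EXTENT_FL"),
    "missing_nul": re.compile(r"^Inode .* missing NUL terminator"),
    "clear_prompt": re.compile(r"^Clear"),
}


def filter_e2fsck_log(lines):
    """Return (kept_lines, dropped_counts) for an iterable of log lines."""
    kept, dropped = [], Counter()
    for raw in lines:
        line = raw.rstrip("\n")
        for label, pattern in NOISE_PATTERNS.items():
            if pattern.search(line):
                dropped[label] += 1
                break
        else:
            # No noise pattern matched, so the line is kept.
            kept.append(line)
    return kept, dropped
```

Passing an open file object for the e2fsck log works directly, since the function strips the trailing newline from each line.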

Comment by Andreas Dilger [ 17/Jan/13 ]

I also see in your MDT feature list that there is the "dirdata" feature enabled, but this is definitely NOT a feature that would have been enabled with a filesystem formatted with 1.8. Also, the ".." corruption is definitely not random.

Did you perhaps run the Xyratex "upgrade" tool on the MDT filesystem?

I believe that this would be the root cause of the ".." corruption. My understanding is that it was deleting the ".." entry to add the FID, and then re-inserting it into the directory, but ext4/e2fsck require that the ".." entry immediately follow the "." entry at the start.

Comment by Mahmoud Hanafi [ 17/Jan/13 ]

We did not use the xyratex upgrade tool. But we added that dirdata option at some point. Should we remove that option?

Comment by Mahmoud Hanafi [ 17/Jan/13 ]

Uploading the output of the fsck run on the snapshot. Please review it before we run on the real MDT device.

Comment by Andreas Dilger [ 17/Jan/13 ]

Looking at the test e2fsck log, one new directory is getting yet a different error related to the "." entry:

Directory entry for '.' in /ROOT/msekula/fun/camrad (13208388) is big.
Split? yes
Missing '..' in directory inode 13208388.
Fix? yes
Setting filetype for entry '..' in /ROOT/msekula/fun/camrad (13208388) to 2.
Entry '..' in /ROOT/msekula/fun/camrad (13208388) is duplicate '..' entry.
Fix? yes

I suspect that there is some code in e2fsck or in ldiskfs that is not handling the dirdata field correctly. It likely relates to LU-2638. There are several files moved to lost+found as I suspected, but it looks like the majority of symlinks are fine.

It doesn't seem that a large number of directories will be repaired, so I think it makes sense to go ahead and fix the real MDT at this point. The only other thing you might want to check before doing the final repair is whether a second run of "e2fsck -fy" on the snapshot passes cleanly without any repairs. About 30 directories will be moved to lost+found, but they can be moved back to their correct location, and nothing should be lost.

The next question is to figure out what caused this problem. When did you upgrade to 2.1? Were these directories existing before the upgrade from 1.8, or were they created afterward? How large are the directories (number of entries = "find ${directory} -print | wc -l", size of directory = "ls -ld ${directory}")? Do you know if the directories were renamed after they were created? How long has it been since you last ran e2fsck? Have you run it since the upgrade?
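
Editor's sketch, not part of the original thread: the two measurements requested above, gathered in one call. The entry count mirrors `find ${directory} -print | wc -l`, which includes the starting directory itself, and the size is the directory inode's own size as shown by `ls -ld`. The function name `directory_stats` is hypothetical.

```python
import os


def directory_stats(directory):
    """Return (entry_count, directory_size_bytes) for one directory tree."""
    entries = 1  # find prints the starting directory as its first line
    for _root, dirnames, filenames in os.walk(directory):
        entries += len(dirnames) + len(filenames)
    # lstat reports the directory inode itself, like "ls -ld".
    size_bytes = os.lstat(directory).st_size
    return entries, size_bytes
```

On very large directories the walk can take a while on a Lustre client, so running it on only the directories named in the e2fsck log is probably sufficient.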

Comment by Mahmoud Hanafi [ 17/Jan/13 ]

It has been a very long time since we ran e2fsck, and that was with the 1.8.x code. We have never run e2fsck since moving to 2.1.

Should we remove the dirdata options?

I will check the date and size of the directories. We may want to just archive these and restore them after the fsck or tar/delete/untar them.

Comment by Andreas Dilger [ 18/Jan/13 ]

The "dirdata" option is enabled by default for 2.x filesystems, but I don't think it is necessarily advisable to disable it at this time. It does appear at first glance that running e2fsck after removing the dirdata feature would handle this correctly and clear the extra dirdata flag in each dirent, but we haven't tested this at all, and it would also cause the MDS to become considerably slower.

So far I don't see any indication besides the mixup with ".." entries that there is anything seriously wrong with these directories. The bytes at the start of the directory are used for ".", "..", and the htree index on directories over 4kB in size, and not user data. e2fsck should regenerate all of the needed information from redundant information elsewhere, except being able to move the entry from lost+found back to the proper place in the tree.

Comment by Cliff White (Inactive) [ 21/Jan/13 ]

What is your current state? What help can we give you?

Comment by Mahmoud Hanafi [ 21/Jan/13 ]

At this point we have been able to run fsck on the mdt and have recovered from the errors.

Comment by Cliff White (Inactive) [ 21/Jan/13 ]

Is the issue closed, or is there some other help we can give you?

Comment by Mahmoud Hanafi [ 08/Feb/13 ]

We seem to have hit this issue again on the same filesystem.

pfe1 ~ # ls -l /nobackupp1/xmeng/run_sc_anisopi/run06_dipole_semiimpl_nohyp_taugr_60000ss/SC
ls: reading directory /nobackupp1/xmeng/run_sc_anisopi/run06_dipole_semiimpl_nohyp_taugr_60000ss/SC: Input/output error
total 0

from the mdt
Feb 8 06:50:58 nbp1-mds kernel: LDISKFS-fs warning (device dm-4): dx_probe: Unrecognised inode hash code 18 for directory #17309149
Feb 8 06:50:58 nbp1-mds kernel: LDISKFS-fs warning (device dm-4): dx_probe: Corrupt dir inode 17309149, running e2fsck is recommended.
Feb 8 06:51:57 nbp1-mds kernel: LDISKFS-fs warning (device dm-4): dx_probe: Unrecognised inode hash code 8 for directory #17309159
Feb 8 06:51:57 nbp1-mds kernel: LDISKFS-fs warning (device dm-4): dx_probe: Corrupt dir inode 17309159, running e2fsck is recommended.
Feb 8 08:35:12 nbp1-mds kernel: LDISKFS-fs warning (device dm-4): dx_probe: Unrecognised inode hash code 15 for directory #130557236
Feb 8 08:35:12 nbp1-mds kernel: LDISKFS-fs warning (device dm-4): dx_probe: Corrupt dir inode 130557236, running e2fsck is recommended.
Feb 8 11:45:38 nbp1-mds kernel: LDISKFS-fs warning (device dm-4): dx_probe: Unrecognised inode hash code 3 for directory #157287952
Feb 8 11:45:39 nbp1-mds kernel: LDISKFS-fs warning (device dm-4): dx_probe: Corrupt dir inode 157287952, running e2fsck is recommended.
Feb 8 11:46:07 nbp1-mds kernel: LDISKFS-fs warning (device dm-4): dx_probe: Unrecognised inode hash code 4 for directory #157331367
Feb 8 11:46:07 nbp1-mds kernel: LDISKFS-fs warning (device dm-4): dx_probe: Corrupt dir inode 157331367, running e2fsck is recommended.

Comment by Andreas Dilger [ 08/Feb/13 ]

This problem will persist for large 1.8 directories that are renamed until a version of the LU-2638 patch http://review.whamcloud.com/5179 is applied. For the short term, until this patch is applied, it is possible to disable the dirdata feature on the unmounted MDT filesystem:

tune2fs -O ^dirdata /dev/mdtdev

though this will have some negative performance impact for all newly-created files when doing name lookups and "ls -l".

Comment by Mahmoud Hanafi [ 08/Feb/13 ]

uploading fsck output for review before we run it for real.

Comment by Johann Lombardi (Inactive) [ 12/Feb/13 ]

There is nothing new in the fsck output compared to last time. I think you should go ahead and run fsck.

Comment by Peter Jones [ 21/Mar/13 ]

As per NASA, OK to close ticket.
