[LU-4373] MDS corruption - e2fsck issues Created: 10/Dec/13  Updated: 28/Jul/14  Resolved: 28/Jul/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Daire Byrne (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 11972

 Description   

Hi,

We had some corruption appear on our MDS (possibly as a result of testing an SSD caching solution - Virident) so we took it offline to run an e2fsck. The check ran out of memory (we have 256Gigs + 90G swap) but not before it managed to do something screwy with the quotas. We could at least mount the FS with the corruption before the e2fsck but now it just bombed out instantly with errors on the quota. I then disabled the ext4 quota feature (we don't use quotas) and tried to mount but it goes read-only instantly now (originally it took a while to hit the corrupt area of the disk).

I will probably try e2fsck one more time adding another disk as swap (with ~500 million 4k inodes - how much do we need?) and then resort to reformatting (we can repopulate the fs in a week or so). I'm attaching the relevant output in case there is anything useful here for you.

First see the corruption:

Dec  5 02:55:49 bmds1 kernel: LDISKFS-fs error (device vgca0_vcache1): ldiskfs_xattr_block_get: inode 299912086: bad block 299969421
Dec  5 02:55:49 bmds1 kernel: Aborting journal on device sdb.
Dec  5 02:55:49 bmds1 kernel: LDISKFS-fs error (device vgca0_vcache1): ldiskfs_journal_start_sb: Detected aborted journal
Dec  5 02:55:49 bmds1 kernel: LDISKFS-fs (vgca0_vcache1): Remounting filesystem read-only
..
..
Dec  5 05:23:10 bmds1 kernel: VFS: cannot write quota structure on device vgca0_vcache1 (error -30). Quota may get out of sync!
Dec  5 05:23:16 bmds1 kernel: LDISKFS-fs (vgca0_vcache1): Quota write (off=99328, len=1024) cancelled because transaction is not started
Dec  5 05:23:16 bmds1 kernel: VFS: Can't insert quota data block (97) to free entry list.
..
..
Dec  5 06:39:49 bmds1 kernel: LDISKFS-fs warning (device vgca0_vcache1): dx_probe: Unrecognised inode hash code 32 for directory #244352420
Dec  5 06:39:49 bmds1 kernel: LDISKFS-fs warning (device vgca0_vcache1): dx_probe: Corrupt dir inode 244352420, running e2fsck is recommended.
Dec  5 06:39:49 bmds1 kernel: LDISKFS-fs warning (device vgca0_vcache1): dx_probe: Unrecognised inode hash code 32 for directory #244352420

Remount without e2fsck and wait until:

Dec  5 18:53:49 bmds1 kernel: LDISKFS-fs error (device md0): mb_free_blocks: double-free of inode 0's block 857326879(bit 17695 in group 26163)
Dec  5 18:53:49 bmds1 kernel: Aborting journal on device sdb.
Dec  5 18:53:49 bmds1 kernel: LDISKFS-fs error (device md0): ldiskfs_journal_start_sb: Detected aborted journal
Dec  5 18:53:49 bmds1 kernel: LDISKFS-fs (md0): Remounting filesystem read-only
Dec  5 18:53:49 bmds1 kernel: LDISKFS-fs (md0): Remounting filesystem read-only

So I guess we better run e2fsck then:

bmds1 /root # e2fsck -fy /dev/md0
e2fsck 1.42.7.wc1 (12-Apr-2013)
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 857268317 has zero dtime.  Fix? yes

Deleted inode 2613636819 has zero dtime.  Fix? yes

Deleted inode 2961739541 has zero dtime.  Fix? yes


Running additional passes to resolve blocks claimed by more than one inode...
Pass 1B: Rescanning for multiply-claimed blocks
Multiply-claimed block(s) in inode 268975905: 269046865
Multiply-claimed block(s) in inode 268975912: 269046838
Multiply-claimed block(s) in inode 268998892: 269046838
Multiply-claimed block(s) in inode 268998896: 269046865
Pass 1C: Scanning directories for inodes with multiply-claimed blocks
Pass 1D: Reconciling multiply-claimed blocks
(There are 4 inodes containing multiply-claimed blocks.)

File /ROOT/ARCHIVE/dirvish/filers/nfs11/20131203/tree/user_data/FL/rdev_tars/photoshop (inode #268975905, mod time Wed Apr 24 15:31:27 2013) 
  has 1 multiply-claimed block(s), shared with 1 file(s):
        ... (inode #268998896, mod time Wed Jan  9 12:59:03 2013)
Clone multiply-claimed blocks? yes

File /ROOT/ARCHIVE/dirvish/filers/nfs11/20131203/tree/user_data/FL/rdev_tars/REF/paperwork (inode #268975912, mod time Wed Apr 24 15:31:26 2013) 
  has 1 multiply-claimed block(s), shared with 1 file(s):
        ... (inode #268998892, mod time Wed Jan  9 12:59:03 2013)
Clone multiply-claimed blocks? yes

File ... (inode #268998892, mod time Wed Jan  9 12:59:03 2013) 
  has 1 multiply-claimed block(s), shared with 1 file(s):
        /ROOT/ARCHIVE/dirvish/filers/nfs11/20131203/tree/user_data/FL/rdev_tars/REF/paperwork (inode #268975912, mod time Wed Apr 24 15:31:26 2013)
Multiply-claimed blocks already reassigned or cloned.

File ... (inode #268998896, mod time Wed Jan  9 12:59:03 2013) 
  has 1 multiply-claimed block(s), shared with 1 file(s):
        /ROOT/ARCHIVE/dirvish/filers/nfs11/20131203/tree/user_data/FL/rdev_tars/photoshop (inode #268975905, mod time Wed Apr 24 15:31:27 2013)
Multiply-claimed blocks already reassigned or cloned.

Pass 2: Checking directory structure
Error allocating icount structure: Memory allocation failed

bravo-MDT0000: ***** FILE SYSTEM WAS MODIFIED *****
[QUOTA WARNING] Usage inconsistent for ID 0:actual (352486866944, 75720160) != expected (287050371072, 62379940)
[QUOTA WARNING] Usage inconsistent for ID 4078:actual (5853184, 1468) != expected (0, 1193)
[QUOTA WARNING] Usage inconsistent for ID 5255:actual (32276480, 7735) != expected (24416256, 5884)
[QUOTA WARNING] Usage inconsistent for ID 5305:actual (578752512, 2100806) != expected (3155456000, 3376042)
[QUOTA WARNING] Usage inconsistent for ID 4080:actual (56586240, 13815) != expected (55222272, 13750)
[QUOTA WARNING] Usage inconsistent for ID 6731:actual (241618944, 56584) != expected (263524352, 59180)
[QUOTA WARNING] Usage inconsistent for ID 3449:actual (3298160640, 679894) != expected (2682540032, 596461)
[QUOTA WARNING] Usage inconsistent for ID 4052:actual (1331068928, 273724) != expected (1242320896, 218991)
[QUOTA WARNING] Usage inconsistent for ID 4088:actual (246837248, 60468) != expected (27856896, 19173)
[QUOTA WARNING] Usage inconsistent for ID 5339:actual (1207443456, 254756) != expected (1176674304, 270763)
[QUOTA WARNING] Usage inconsistent for ID 6699:actual (140206080, 37540) != expected (10919936, 12748)
[QUOTA WARNING] Usage inconsistent for ID 3695:actual (68915200, 15943) != expected (56963072, 13213)
..
..
[QUOTA WARNING] Usage inconsistent for ID 7349:actual (64790528, 19964) != expected (52305920, 17445)
[QUOTA WARNING] Usage inconsistent for ID 188:actual (20480, 5) != expected (16384, 4)
[QUOTA WARNING] Usage inconsistent for ID 1056:actual (15908864, 3640) != expected (15458304, 3552)
[QUOTA WARNING] Usage inconsistent for ID 615:actual (180224, 44) != expected (172032, 42)
[QUOTA WARNING] Usage inconsistent for ID 513:actual (364544, 89) != expected (32768, 8)
[QUOTA WARNING] Usage inconsistent for ID 999:actual (16240640, 3965) != expected (15622144, 3814)
[QUOTA WARNING] Usage inconsistent for ID 5001:actual (21745664, 5229) != expected (21049344, 5063)
[QUOTA WARNING] Usage inconsistent for ID 1003:actual (0, 0) != expected (36864, 9)
[QUOTA WARNING] Usage inconsistent for ID 462:actual (233472, 132) != expected (225280, 123)
Update quota info for quota type 1? yes

[ERROR] quotaio_tree.c:357:free_dqentry:: Quota structure has offset to other block (0) than it should (34).
e2fsck: aborted

bravo-MDT0000: ***** FILE SYSTEM WAS MODIFIED *****

Try to mount now:

Dec  9 09:56:42 bmds1 kernel: LDISKFS-fs (md0): warning: mounting fs with errors, running e2fsck is recommended
Dec  9 09:56:42 bmds1 kernel: LDISKFS-fs (md0): Ignoring delalloc option - requested data journaling mode
Dec  9 09:56:45 bmds1 kernel: LDISKFS-fs (md0): recovery complete
Dec  9 09:56:45 bmds1 kernel: LDISKFS-fs (md0): Can't enable usage tracking on a filesystem with the QUOTA feature set
Dec  9 09:56:45 bmds1 kernel: LDISKFS-fs (md0): mount failed
Dec  9 09:56:45 bmds1 kernel: ------------[ cut here ]------------
Dec  9 09:56:45 bmds1 kernel: WARNING: at fs/proc/generic.c:847 remove_proc_entry+0x24f/0x260() (Tainted: P           ---------------   )
Dec  9 09:56:45 bmds1 kernel: Hardware name: PowerEdge R620
Dec  9 09:56:45 bmds1 kernel: remove_proc_entry: removing non-empty directory 'ldiskfs/md0', leaking at least 'prealloc_table'
Dec  9 09:56:45 bmds1 kernel: Modules linked in: ldiskfs(U) jbd2 mptctl mptbase ipmi_devintf dell_rbu nfsd exportfs autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc bonding 8021q garp stp llc uinput ipv6 raid1 power_meter sg 
vgcinit(P)(U) vgcdebug(P)(U) shpchp bnx2x libcrc32c mdio dcdbas microcode sb_edac edac_core iTCO_wdt iTCO_vendor_support ext3 jbd mbcache sr_mod cdrom sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class ahci wmi megaraid_sas 
dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Dec  9 09:56:45 bmds1 kernel: Pid: 8092, comm: mount Tainted: P           ---------------    2.6.32-358.18.1.el6_lustre.x86_64 #1
Dec  9 09:56:45 bmds1 kernel: Call Trace:
Dec  9 09:56:45 bmds1 kernel: [<ffffffff8106e3e7>] ? warn_slowpath_common+0x87/0xc0
Dec  9 09:56:45 bmds1 kernel: [<ffffffff8106e4d6>] ? warn_slowpath_fmt+0x46/0x50
Dec  9 09:56:45 bmds1 kernel: [<ffffffff811ef7bd>] ? xlate_proc_name+0x4d/0xd0
Dec  9 09:56:45 bmds1 kernel: [<ffffffff811efb1f>] ? remove_proc_entry+0x24f/0x260
Dec  9 09:56:45 bmds1 kernel: [<ffffffff8116e625>] ? pcpu_free_area+0x165/0x1e0
Dec  9 09:56:45 bmds1 kernel: [<ffffffff8116e755>] ? free_percpu+0xb5/0x140
Dec  9 09:56:45 bmds1 kernel: [<ffffffffa049acec>] ? ldiskfs_fill_super+0x23c/0x2a10 [ldiskfs]
Dec  9 09:56:45 bmds1 kernel: [<ffffffff8118477e>] ? get_sb_bdev+0x18e/0x1d0
Dec  9 09:56:45 bmds1 kernel: [<ffffffffa049aab0>] ? ldiskfs_fill_super+0x0/0x2a10 [ldiskfs]
Dec  9 09:56:45 bmds1 kernel: [<ffffffffa0495388>] ? ldiskfs_get_sb+0x18/0x20 [ldiskfs]
Dec  9 09:56:45 bmds1 kernel: [<ffffffff81183beb>] ? vfs_kern_mount+0x7b/0x1b0
Dec  9 09:56:45 bmds1 kernel: [<ffffffff81183d92>] ? do_kern_mount+0x52/0x130
Dec  9 09:56:45 bmds1 kernel: [<ffffffff811a3f52>] ? do_mount+0x2d2/0x8d0
Dec  9 09:56:45 bmds1 kernel: [<ffffffff81139ff4>] ? strndup_user+0x64/0xc0
Dec  9 09:56:45 bmds1 kernel: [<ffffffff811a45e0>] ? sys_mount+0x90/0xe0
Dec  9 09:56:45 bmds1 kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Dec  9 09:56:45 bmds1 kernel: ---[ end trace dfa074843fd85142 ]---

Okay try without the quota feature set:

Dec  9 15:47:56 bmds1 kernel: LDISKFS-fs (md0): warning: mounting fs with errors, running e2fsck is recommended
Dec  9 15:47:56 bmds1 kernel: LDISKFS-fs (md0): Ignoring delalloc option - requested data journaling mode
Dec  9 15:47:59 bmds1 kernel: LDISKFS-fs (md0): recovery complete
Dec  9 15:47:59 bmds1 kernel: LDISKFS-fs (md0): mounted filesystem with journalled data mode. quota=off. Opts: 
Dec  9 15:48:30 bmds1 kernel: LNet: HW CPU cores: 32, npartitions: 4
Dec  9 15:48:30 bmds1 kernel: alg: No test for crc32 (crc32-table)
Dec  9 15:48:30 bmds1 kernel: alg: No test for adler32 (adler32-zlib)
Dec  9 15:48:30 bmds1 kernel: alg: No test for crc32 (crc32-pclmul)
Dec  9 15:48:34 bmds1 kernel: padlock: VIA PadLock Hash Engine not detected.
Dec  9 15:48:34 bmds1 modprobe: FATAL: Error inserting padlock_sha (/lib/modules/2.6.32-358.18.1.el6_lustre.x86_64/kernel/drivers/crypto/padlock-sha.ko): No such device
Dec  9 15:48:38 bmds1 kernel: Lustre: Lustre: Build Version: 2.4.1-RC2--PRISTINE-2.6.32-358.18.1.el6_lustre.x86_64
Dec  9 15:48:38 bmds1 kernel: LNet: Added LNI 10.21.22.50@tcp [8/256/0/180]
Dec  9 15:48:38 bmds1 kernel: LNet: Accept secure, port 988
Dec  9 15:48:39 bmds1 kernel: LDISKFS-fs (md0): barriers disabled
Dec  9 15:48:39 bmds1 kernel: LDISKFS-fs (md0): warning: mounting fs with errors, running e2fsck is recommended
Dec  9 15:48:39 bmds1 kernel: LDISKFS-fs (md0): Ignoring delalloc option - requested data journaling mode
Dec  9 15:48:42 bmds1 kernel: LDISKFS-fs (md0): mounted filesystem with journalled data mode. quota=off. Opts: 
Dec  9 15:48:43 bmds1 kernel: Lustre: bravo-MDT0000: used disk, loading
Dec  9 15:48:43 bmds1 kernel: LDISKFS-fs error (device md0): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 37corrupted: 525 blocks free in bitmap, 524 - in gd
Dec  9 15:48:43 bmds1 kernel: 
Dec  9 15:48:43 bmds1 kernel: Aborting journal on device sdb.
Dec  9 15:48:43 bmds1 kernel: LDISKFS-fs (md0): Remounting filesystem read-only


 Comments   
Comment by Daire Byrne (Inactive) [ 10/Dec/13 ]

And another e2fsck run:

bmds1 /root # e2fsck -fvy /dev/md0
e2fsck 1.42.7.wc1 (12-Apr-2013)
bravo-MDT0000: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Invalid inode number for '.' in directory inode 268998892.
Fix? yes

Entry 'lineup' in /ROOT/ARCHIVE/dirvish/filers/nfs11/20131203/tree/user_data/FL/rdev_tars/REF/??? (268998892) is a link to directory /ROOT/ARCHIVE/dirvish/filers/nfs11/20131203/tree/user_data/FL/rdev_tars/REF/paperwork/lineup (268975913).
Clear? yes

Invalid inode number for '.' in directory inode 268998896.
Fix? yes

Entry '..' in .../??? (268998889) has deleted/unused inode 266936598.  Clear? yes

Entry 'prop-base' in .../??? (268998889) has deleted/unused inode 269521254.  Clear? yes

Entry 'props' in .../??? (268998889) is a link to directory /ROOT/ARCHIVE/dirvish/servers/jenkins1/20131204/tree/usr/src/debug/kernel-2.6.32-279.5.2.el6/linux-2.6.32-279.5.2.el6.x86_64/include/drm (270036736).
Clear? yes

Entry '..' in .../??? (902387214) has deleted/unused inode 882443083.  Clear? yes

ext2fs_get_mem: Cannot allocate memory
Comment by Peter Jones [ 04/Mar/14 ]

Niu

Could you please look into this one?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 05/Mar/14 ]

Looks your memory (256 + 90)G is insufficient to hold all directory structures, you need to enable the "scratch_files" to make e2fsck stores the dir structures in a file instead of memory. (see man e2fsck.conf), hope that'll complete the fsck.

Comment by Daire Byrne (Inactive) [ 05/Mar/14 ]

I believe I tried scratch_files too at the time but still ran out of memory (with a 900G drive?). I guess the question is roughly how much RAM do we need to fsck 300 million dirs/files? I'd rather not find this out by trial and error (900Gigs... 4TB?). 4k X 300 million (dirs) =~ 1.2 TB?

Comment by Niu Yawei (Inactive) [ 06/Mar/14 ]

I don't know exactly if it requires 1.2TB or more, I suggest you use the largest drive you have (4TB?), and monitor the disk usage while running fsck. ('tt' option prints the memory usage for each phase)

Comment by Peter Jones [ 28/Apr/14 ]

Hi Daire

Have you tried this procedure yet?

Peter

Comment by Daire Byrne (Inactive) [ 29/Apr/14 ]

Peter,

No I haven't had the chance yet but we still have the RAID and data so I still plan to do the work soon.

Comment by Peter Jones [ 29/Apr/14 ]

ok Daire. Thanks for the update. Please keep us posted.

Comment by Daire Byrne (Inactive) [ 24/Jul/14 ]

Hey, I finally got around to setting up some hardware (5TB disk for e2fsck scratch) to re-run the e2fsck on this corrupted RAID. I had to remove the external journal but now it looks like e2fsck isn't getting as far as it did before:

[root@localhost ~]# e2fsck -tt -fvy /dev/md0
e2fsck 1.42.7.wc1 (12-Apr-2013)
Pass 1: Checking inodes, blocks, and sizes
Signal (11) SIGSEGV si_code=SEGV_MAPERR fault addr=0x7ff5bac5fffc
e2fsck[0x42dbcd]
/lib64/libc.so.6(+0x329a0)[0x7ff4b991d9a0]
/lib64/libc.so.6(memcpy+0x35)[0x7ff4b9974985]
/lib64/libext2fs.so.2(+0x2c046)[0x7ff4ba84e046]
/lib64/libext2fs.so.2(+0x2cc3e)[0x7ff4ba84ec3e]
/lib64/libext2fs.so.2(ext2fs_tdb_store+0x747)[0x7ff4ba8510d7]
/lib64/libext2fs.so.2(+0x20c79)[0x7ff4ba842c79]
/lib64/libext2fs.so.2(ext2fs_icount_store+0x46)[0x7ff4ba842d26]
e2fsck(e2fsck_pass1+0xb58)[0x415468]
e2fsck(e2fsck_run+0x4f)[0x40efff]
e2fsck(main+0xeac)[0x40d1cc]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7ff4b9909d1d]
e2fsck[0x409eb9]
Comment by Niu Yawei (Inactive) [ 25/Jul/14 ]

It could be LU-2627, could you upgrade your e2fsprogs to the latest version? (1.42.9.wc1-7) Thanks.

Comment by Daire Byrne (Inactive) [ 28/Jul/14 ]

I guess it may have been an issue with using e2fsck's scratch files option. It ran to completion using 4TB of swap space. We are happy that we can at least fsck our MDT in future with enough swap configured. We can close this ticket now.

Pass 5: Memory used: 18014398508741340k/4240k (18014398507398050k/1343291k), time: 442.38/111.08/ 0.00
Pass 5: I/O read: 270MB, write: 1MB, rate: 0.61MB/s

bravo-MDT0000: ***** FILE SYSTEM WAS MODIFIED *****

   429147617 inodes used (13.32%, out of 3221225472)
         189 non-contiguous files (0.0%)
     4422586 non-contiguous directories (0.0%)
             # of inodes with ind/dind/tind blocks: 712347/4105/0
   833979888 blocks used (25.89%, out of 3221225472)
           0 bad blocks
           1 large file

   280344591 regular files
   146575289 directories
          21 character device files
          32 block device files
         123 fifos
  1959990622 links
     2223722 symbolic links (1652788 fast symbolic links)
        3830 sockets
------------
  2389138230 files
Memory used: 18014398508741340k/4240k (18014398507398050k/1343291k), time: 57393.48/10071.86/3111.58
I/O read: 2102044MB, write: 265MB, rate: 36.63MB/s
Comment by Peter Jones [ 28/Jul/14 ]

ok thanks Daire

Generated at Sat Feb 10 01:42:08 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.