Details
Bug - Resolution: Fixed - Minor - None - Lustre 2.4.1 - None - 3 - 11972
Description
Hi,
We had some corruption appear on our MDS (possibly as a result of testing an SSD caching solution, Virident), so we took it offline to run e2fsck. The check ran out of memory (we have 256 GB of RAM plus 90 GB of swap), but not before it managed to do something screwy with the quotas. Before the e2fsck we could at least mount the FS with the corruption, but now the mount fails instantly with quota errors. I then disabled the ext4 quota feature (we don't use quotas) and tried to mount again, but now it goes read-only instantly (originally it took a while to hit the corrupt area of the disk).
I will probably try e2fsck one more time after adding another disk as swap (with ~500 million 4k inodes, how much memory do we need?) and then resort to reformatting (we can repopulate the FS in a week or so). I'm attaching the relevant output in case anything here is useful to you.
First see the corruption:
Dec 5 02:55:49 bmds1 kernel: LDISKFS-fs error (device vgca0_vcache1): ldiskfs_xattr_block_get: inode 299912086: bad block 299969421
Dec 5 02:55:49 bmds1 kernel: Aborting journal on device sdb.
Dec 5 02:55:49 bmds1 kernel: LDISKFS-fs error (device vgca0_vcache1): ldiskfs_journal_start_sb: Detected aborted journal
Dec 5 02:55:49 bmds1 kernel: LDISKFS-fs (vgca0_vcache1): Remounting filesystem read-only
..
..
Dec 5 05:23:10 bmds1 kernel: VFS: cannot write quota structure on device vgca0_vcache1 (error -30). Quota may get out of sync!
Dec 5 05:23:16 bmds1 kernel: LDISKFS-fs (vgca0_vcache1): Quota write (off=99328, len=1024) cancelled because transaction is not started
Dec 5 05:23:16 bmds1 kernel: VFS: Can't insert quota data block (97) to free entry list.
..
..
Dec 5 06:39:49 bmds1 kernel: LDISKFS-fs warning (device vgca0_vcache1): dx_probe: Unrecognised inode hash code 32 for directory #244352420
Dec 5 06:39:49 bmds1 kernel: LDISKFS-fs warning (device vgca0_vcache1): dx_probe: Corrupt dir inode 244352420, running e2fsck is recommended.
Dec 5 06:39:49 bmds1 kernel: LDISKFS-fs warning (device vgca0_vcache1): dx_probe: Unrecognised inode hash code 32 for directory #244352420
Remount without e2fsck and wait until:
Dec 5 18:53:49 bmds1 kernel: LDISKFS-fs error (device md0): mb_free_blocks: double-free of inode 0's block 857326879(bit 17695 in group 26163)
Dec 5 18:53:49 bmds1 kernel: Aborting journal on device sdb.
Dec 5 18:53:49 bmds1 kernel: LDISKFS-fs error (device md0): ldiskfs_journal_start_sb: Detected aborted journal
Dec 5 18:53:49 bmds1 kernel: LDISKFS-fs (md0): Remounting filesystem read-only
Dec 5 18:53:49 bmds1 kernel: LDISKFS-fs (md0): Remounting filesystem read-only
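As a sanity check on the double-free message above, the block, group, and bit numbers it reports can be compared against each other. The sketch below assumes 32768 blocks per group (the usual value for a 4 KiB block size) and a first data block of 0; neither is stated in this report, and the helper names are made up for illustration. Under those assumptions the reported values are self-consistent: 26163 * 32768 + 17695 = 857326879.

```python
import re

# Assumption: 32768 blocks per group (typical for 4 KiB blocks) and a first
# data block of 0. Neither value is given in the report itself.
BLOCKS_PER_GROUP = 32768

# Matches the mb_free_blocks double-free message as it appears in the log.
DOUBLE_FREE = re.compile(
    r"double-free of inode (?P<inode>\d+)'s block (?P<block>\d+)"
    r"\(bit (?P<bit>\d+) in group (?P<group>\d+)\)"
)


def locate_block(block, blocks_per_group=BLOCKS_PER_GROUP):
    """Return (group, bit) for an absolute block number."""
    return divmod(block, blocks_per_group)


def check_double_free(line):
    """Return True/False if the line's block matches its group and bit,
    or None if the line is not a double-free message."""
    m = DOUBLE_FREE.search(line)
    if m is None:
        return None
    block, bit, group = (int(m[k]) for k in ("block", "bit", "group"))
    return locate_block(block) == (group, bit)


line = ("Dec 5 18:53:49 bmds1 kernel: LDISKFS-fs error (device md0): "
        "mb_free_blocks: double-free of inode 0's block 857326879"
        "(bit 17695 in group 26163)")
print(locate_block(857326879))
print(check_double_free(line))
```

If the check came back False for a given message, the likelier explanation would be a different blocks-per-group value on that filesystem rather than anything wrong with the log line.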
So I guess we better run e2fsck then:
bmds1 /root # e2fsck -fy /dev/md0
e2fsck 1.42.7.wc1 (12-Apr-2013)
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 857268317 has zero dtime. Fix? yes
Deleted inode 2613636819 has zero dtime. Fix? yes
Deleted inode 2961739541 has zero dtime. Fix? yes
Running additional passes to resolve blocks claimed by more than one inode...
Pass 1B: Rescanning for multiply-claimed blocks
Multiply-claimed block(s) in inode 268975905: 269046865
Multiply-claimed block(s) in inode 268975912: 269046838
Multiply-claimed block(s) in inode 268998892: 269046838
Multiply-claimed block(s) in inode 268998896: 269046865
Pass 1C: Scanning directories for inodes with multiply-claimed blocks
Pass 1D: Reconciling multiply-claimed blocks
(There are 4 inodes containing multiply-claimed blocks.)
File /ROOT/ARCHIVE/dirvish/filers/nfs11/20131203/tree/user_data/FL/rdev_tars/photoshop (inode #268975905, mod time Wed Apr 24 15:31:27 2013)
  has 1 multiply-claimed block(s), shared with 1 file(s):
    ... (inode #268998896, mod time Wed Jan 9 12:59:03 2013)
Clone multiply-claimed blocks? yes
File /ROOT/ARCHIVE/dirvish/filers/nfs11/20131203/tree/user_data/FL/rdev_tars/REF/paperwork (inode #268975912, mod time Wed Apr 24 15:31:26 2013)
  has 1 multiply-claimed block(s), shared with 1 file(s):
    ... (inode #268998892, mod time Wed Jan 9 12:59:03 2013)
Clone multiply-claimed blocks? yes
File ... (inode #268998892, mod time Wed Jan 9 12:59:03 2013)
  has 1 multiply-claimed block(s), shared with 1 file(s):
    /ROOT/ARCHIVE/dirvish/filers/nfs11/20131203/tree/user_data/FL/rdev_tars/REF/paperwork (inode #268975912, mod time Wed Apr 24 15:31:26 2013)
Multiply-claimed blocks already reassigned or cloned.
File ... (inode #268998896, mod time Wed Jan 9 12:59:03 2013)
  has 1 multiply-claimed block(s), shared with 1 file(s):
    /ROOT/ARCHIVE/dirvish/filers/nfs11/20131203/tree/user_data/FL/rdev_tars/photoshop (inode #268975905, mod time Wed Apr 24 15:31:27 2013)
Multiply-claimed blocks already reassigned or cloned.
Pass 2: Checking directory structure
Error allocating icount structure: Memory allocation failed
bravo-MDT0000: ***** FILE SYSTEM WAS MODIFIED *****
[QUOTA WARNING] Usage inconsistent for ID 0:actual (352486866944, 75720160) != expected (287050371072, 62379940)
[QUOTA WARNING] Usage inconsistent for ID 4078:actual (5853184, 1468) != expected (0, 1193)
[QUOTA WARNING] Usage inconsistent for ID 5255:actual (32276480, 7735) != expected (24416256, 5884)
[QUOTA WARNING] Usage inconsistent for ID 5305:actual (578752512, 2100806) != expected (3155456000, 3376042)
[QUOTA WARNING] Usage inconsistent for ID 4080:actual (56586240, 13815) != expected (55222272, 13750)
[QUOTA WARNING] Usage inconsistent for ID 6731:actual (241618944, 56584) != expected (263524352, 59180)
[QUOTA WARNING] Usage inconsistent for ID 3449:actual (3298160640, 679894) != expected (2682540032, 596461)
[QUOTA WARNING] Usage inconsistent for ID 4052:actual (1331068928, 273724) != expected (1242320896, 218991)
[QUOTA WARNING] Usage inconsistent for ID 4088:actual (246837248, 60468) != expected (27856896, 19173)
[QUOTA WARNING] Usage inconsistent for ID 5339:actual (1207443456, 254756) != expected (1176674304, 270763)
[QUOTA WARNING] Usage inconsistent for ID 6699:actual (140206080, 37540) != expected (10919936, 12748)
[QUOTA WARNING] Usage inconsistent for ID 3695:actual (68915200, 15943) != expected (56963072, 13213)
..
..
[QUOTA WARNING] Usage inconsistent for ID 7349:actual (64790528, 19964) != expected (52305920, 17445)
[QUOTA WARNING] Usage inconsistent for ID 188:actual (20480, 5) != expected (16384, 4)
[QUOTA WARNING] Usage inconsistent for ID 1056:actual (15908864, 3640) != expected (15458304, 3552)
[QUOTA WARNING] Usage inconsistent for ID 615:actual (180224, 44) != expected (172032, 42)
[QUOTA WARNING] Usage inconsistent for ID 513:actual (364544, 89) != expected (32768, 8)
[QUOTA WARNING] Usage inconsistent for ID 999:actual (16240640, 3965) != expected (15622144, 3814)
[QUOTA WARNING] Usage inconsistent for ID 5001:actual (21745664, 5229) != expected (21049344, 5063)
[QUOTA WARNING] Usage inconsistent for ID 1003:actual (0, 0) != expected (36864, 9)
[QUOTA WARNING] Usage inconsistent for ID 462:actual (233472, 132) != expected (225280, 123)
Update quota info for quota type 1? yes
[ERROR] quotaio_tree.c:357:free_dqentry:: Quota structure has offset to other block (0) than it should (34).
e2fsck: aborted
bravo-MDT0000: ***** FILE SYSTEM WAS MODIFIED *****
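To get a feel for how far off the quota accounting is, the [QUOTA WARNING] lines can be reduced to per-ID differences. This is a rough sketch, assuming the first number in each pair is space in bytes and the second is an inode count; the function name `quota_deltas` is made up for illustration, and the sample uses three of the lines from the output above.

```python
import re

# Assumption: each pair is (space in bytes, inode count).
QUOTA_WARNING = re.compile(
    r"\[QUOTA WARNING\] Usage inconsistent for ID (?P<id>\d+):"
    r"actual \((?P<a_space>\d+), (?P<a_inodes>\d+)\) != "
    r"expected \((?P<e_space>\d+), (?P<e_inodes>\d+)\)"
)


def quota_deltas(text):
    """Return (id, space_delta, inode_delta) tuples, computed as
    actual minus expected and sorted by the largest space difference first."""
    rows = []
    for m in QUOTA_WARNING.finditer(text):
        v = {k: int(x) for k, x in m.groupdict().items()}
        rows.append((v["id"],
                     v["a_space"] - v["e_space"],
                     v["a_inodes"] - v["e_inodes"]))
    rows.sort(key=lambda r: abs(r[1]), reverse=True)
    return rows


sample = """
[QUOTA WARNING] Usage inconsistent for ID 0:actual (352486866944, 75720160) != expected (287050371072, 62379940)
[QUOTA WARNING] Usage inconsistent for ID 188:actual (20480, 5) != expected (16384, 4)
[QUOTA WARNING] Usage inconsistent for ID 1003:actual (0, 0) != expected (36864, 9)
"""

for ident, d_space, d_inodes in quota_deltas(sample):
    print(f"ID {ident}: space {d_space:+d} bytes, inodes {d_inodes:+d}")
```

Because the pattern uses `finditer`, it also works on output where several warnings have been run together on one line.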
Try to mount now:
Dec 9 09:56:42 bmds1 kernel: LDISKFS-fs (md0): warning: mounting fs with errors, running e2fsck is recommended
Dec 9 09:56:42 bmds1 kernel: LDISKFS-fs (md0): Ignoring delalloc option - requested data journaling mode
Dec 9 09:56:45 bmds1 kernel: LDISKFS-fs (md0): recovery complete
Dec 9 09:56:45 bmds1 kernel: LDISKFS-fs (md0): Can't enable usage tracking on a filesystem with the QUOTA feature set
Dec 9 09:56:45 bmds1 kernel: LDISKFS-fs (md0): mount failed
Dec 9 09:56:45 bmds1 kernel: ------------[ cut here ]------------
Dec 9 09:56:45 bmds1 kernel: WARNING: at fs/proc/generic.c:847 remove_proc_entry+0x24f/0x260() (Tainted: P --------------- )
Dec 9 09:56:45 bmds1 kernel: Hardware name: PowerEdge R620
Dec 9 09:56:45 bmds1 kernel: remove_proc_entry: removing non-empty directory 'ldiskfs/md0', leaking at least 'prealloc_table'
Dec 9 09:56:45 bmds1 kernel: Modules linked in: ldiskfs(U) jbd2 mptctl mptbase ipmi_devintf dell_rbu nfsd exportfs autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc bonding 8021q garp stp llc uinput ipv6 raid1 power_meter sg vgcinit(P)(U) vgcdebug(P)(U) shpchp bnx2x libcrc32c mdio dcdbas microcode sb_edac edac_core iTCO_wdt iTCO_vendor_support ext3 jbd mbcache sr_mod cdrom sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class ahci wmi megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Dec 9 09:56:45 bmds1 kernel: Pid: 8092, comm: mount Tainted: P --------------- 2.6.32-358.18.1.el6_lustre.x86_64 #1
Dec 9 09:56:45 bmds1 kernel: Call Trace:
Dec 9 09:56:45 bmds1 kernel: [<ffffffff8106e3e7>] ? warn_slowpath_common+0x87/0xc0
Dec 9 09:56:45 bmds1 kernel: [<ffffffff8106e4d6>] ? warn_slowpath_fmt+0x46/0x50
Dec 9 09:56:45 bmds1 kernel: [<ffffffff811ef7bd>] ? xlate_proc_name+0x4d/0xd0
Dec 9 09:56:45 bmds1 kernel: [<ffffffff811efb1f>] ? remove_proc_entry+0x24f/0x260
Dec 9 09:56:45 bmds1 kernel: [<ffffffff8116e625>] ? pcpu_free_area+0x165/0x1e0
Dec 9 09:56:45 bmds1 kernel: [<ffffffff8116e755>] ? free_percpu+0xb5/0x140
Dec 9 09:56:45 bmds1 kernel: [<ffffffffa049acec>] ? ldiskfs_fill_super+0x23c/0x2a10 [ldiskfs]
Dec 9 09:56:45 bmds1 kernel: [<ffffffff8118477e>] ? get_sb_bdev+0x18e/0x1d0
Dec 9 09:56:45 bmds1 kernel: [<ffffffffa049aab0>] ? ldiskfs_fill_super+0x0/0x2a10 [ldiskfs]
Dec 9 09:56:45 bmds1 kernel: [<ffffffffa0495388>] ? ldiskfs_get_sb+0x18/0x20 [ldiskfs]
Dec 9 09:56:45 bmds1 kernel: [<ffffffff81183beb>] ? vfs_kern_mount+0x7b/0x1b0
Dec 9 09:56:45 bmds1 kernel: [<ffffffff81183d92>] ? do_kern_mount+0x52/0x130
Dec 9 09:56:45 bmds1 kernel: [<ffffffff811a3f52>] ? do_mount+0x2d2/0x8d0
Dec 9 09:56:45 bmds1 kernel: [<ffffffff81139ff4>] ? strndup_user+0x64/0xc0
Dec 9 09:56:45 bmds1 kernel: [<ffffffff811a45e0>] ? sys_mount+0x90/0xe0
Dec 9 09:56:45 bmds1 kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Dec 9 09:56:45 bmds1 kernel: ---[ end trace dfa074843fd85142 ]---
Okay, try again with the quota feature disabled:
Dec 9 15:47:56 bmds1 kernel: LDISKFS-fs (md0): warning: mounting fs with errors, running e2fsck is recommended
Dec 9 15:47:56 bmds1 kernel: LDISKFS-fs (md0): Ignoring delalloc option - requested data journaling mode
Dec 9 15:47:59 bmds1 kernel: LDISKFS-fs (md0): recovery complete
Dec 9 15:47:59 bmds1 kernel: LDISKFS-fs (md0): mounted filesystem with journalled data mode. quota=off. Opts:
Dec 9 15:48:30 bmds1 kernel: LNet: HW CPU cores: 32, npartitions: 4
Dec 9 15:48:30 bmds1 kernel: alg: No test for crc32 (crc32-table)
Dec 9 15:48:30 bmds1 kernel: alg: No test for adler32 (adler32-zlib)
Dec 9 15:48:30 bmds1 kernel: alg: No test for crc32 (crc32-pclmul)
Dec 9 15:48:34 bmds1 kernel: padlock: VIA PadLock Hash Engine not detected.
Dec 9 15:48:34 bmds1 modprobe: FATAL: Error inserting padlock_sha (/lib/modules/2.6.32-358.18.1.el6_lustre.x86_64/kernel/drivers/crypto/padlock-sha.ko): No such device
Dec 9 15:48:38 bmds1 kernel: Lustre: Lustre: Build Version: 2.4.1-RC2--PRISTINE-2.6.32-358.18.1.el6_lustre.x86_64
Dec 9 15:48:38 bmds1 kernel: LNet: Added LNI 10.21.22.50@tcp [8/256/0/180]
Dec 9 15:48:38 bmds1 kernel: LNet: Accept secure, port 988
Dec 9 15:48:39 bmds1 kernel: LDISKFS-fs (md0): barriers disabled
Dec 9 15:48:39 bmds1 kernel: LDISKFS-fs (md0): warning: mounting fs with errors, running e2fsck is recommended
Dec 9 15:48:39 bmds1 kernel: LDISKFS-fs (md0): Ignoring delalloc option - requested data journaling mode
Dec 9 15:48:42 bmds1 kernel: LDISKFS-fs (md0): mounted filesystem with journalled data mode. quota=off. Opts:
Dec 9 15:48:43 bmds1 kernel: Lustre: bravo-MDT0000: used disk, loading
Dec 9 15:48:43 bmds1 kernel: LDISKFS-fs error (device md0): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 37corrupted: 525 blocks free in bitmap, 524 - in gd
Dec 9 15:48:43 bmds1 kernel:
Dec 9 15:48:43 bmds1 kernel: Aborting journal on device sdb.
Dec 9 15:48:43 bmds1 kernel: LDISKFS-fs (md0): Remounting filesystem read-only