Details
- Bug
- Resolution: Won't Fix
- Critical
- None
- Lustre 2.1.6
- 3
- 9453
Description
Our $SCRATCH file system is down: we are unable to mount one OST because it reports corrupted group descriptors.
Symptoms:
(1) cannot mount as a normal Lustre fs
(2) also cannot mount as ldiskfs
(3) e2fsck reports an alarming number of issues
Scenario:
The OST is a RAID6 (8+2) config with external journals. At 18:06 yesterday, MD raid detected a disk error, evicted the failed disk, and started rebuilding the device with a hot spare. Before the rebuild finished, ldiskfs reported the error below and the device went read-only.
Jul 29 22:16:40 oss28 kernel: [547129.288298] LDISKFS-fs error (device md14): ldiskfs_lookup: deleted inode referenced: 2463495
Jul 29 22:16:40 oss28 kernel: [547129.298723] Aborting journal on device md24.
Jul 29 22:16:40 oss28 kernel: [547129.304211] LustreError: 17212:0:(obd.h:1615:obd_transno_commit_cb()) scratch-OST0124: transno 176013176 commit error: 2
Jul 29 22:16:40 oss28 kernel: [547129.316134] LustreError: 17212:0:(obd.h:1615:obd_transno_commit_cb()) scratch-OST0124: transno 176013175 commit error: 2
Jul 29 22:16:40 oss28 kernel: [547129.316136] LDISKFS-fs error (device md14): ldiskfs_journal_start_sb: Detected aborted journal
Jul 29 22:16:40 oss28 kernel: [547129.316139] LDISKFS-fs (md14): Remounting filesystem read-only
The host was rebooted at 6am and we have been unable to mount the OST since. We would appreciate suggestions on the best approach (e2fsck, journal rebuilding, etc.) to recover this OST.
I will follow up with output from e2fsck -f -n, which is running now (attempting to use a backup superblock). Typical entries look as follows:
e2fsck 1.42.7.wc1 (12-Apr-2013)
Inode table for group 3536 is not in group. (block 103079215118)
WARNING: SEVERE DATA LOSS POSSIBLE.
Relocate? no
Block bitmap for group 3538 is not in group. (block 107524506255360)
Relocate? no
Inode bitmap for group 3538 is not in group. (block 18446612162378989568)
Relocate? no
Inode table for group 3539 is not in group. (block 3439182177370112)
WARNING: SEVERE DATA LOSS POSSIBLE.
Relocate? no
Block bitmap for group 3541 is not in group. (block 138784755704397824)
Relocate? no
Inode table for group 3542 is not in group. (block 7138029487521792000)
WARNING: SEVERE DATA LOSS POSSIBLE.
Relocate? no
Block bitmap for group 3544 is not in group. (block 180388626432)
Relocate? no
Inode table for group 3545 is not in group. (block 25769803776)
WARNING: SEVERE DATA LOSS POSSIBLE.
Relocate? no
Block bitmap for group 3547 is not in group. (block 346054104973312)
Relocate? no
Inode 503 has compression flag set on filesystem without compression support. \
Clear? no
Inode 503 has INDEX_FL flag set but is not a directory.
Clear HTree index? no
HTREE directory inode 503 has an invalid root node.
Clear HTree index? no
HTREE directory inode 503 has an unsupported hash version (40)
Clear HTree index? no
HTREE directory inode 503 uses an incompatible htree root node flag.
Clear HTree index? no
HTREE directory inode 503 has a tree depth (16) which is too big
Clear HTree index? no
Inode 503, i_blocks is 842359139, should be 0. Fix? no
Inode 504 is in use, but has dtime set. Fix? no
Inode 504 has imagic flag set. Clear? no
Inode 504 has a extra size (25649) which is invalid
Fix? no
Inode 504 has INDEX_FL flag set but is not a directory.
Clear HTree index? no
Inode 562 has INDEX_FL flag set but is not a directory.
Clear HTree index? no
HTREE directory inode 562 has an invalid root node.
Clear HTree index? no
HTREE directory inode 562 has an unsupported hash version (51)
Clear HTree index? no
HTREE directory inode 562 has a tree depth (59) which is too big
Clear HTree index? no
Inode 562, i_blocks is 828596838, should be 0. Fix? no
Inode 563 is in use, but has dtime set. Fix? no
Inode 563 has imagic flag set. Clear? no
Inode 563 has a extra size (12387) which is invalid
Fix? no
Block #623050609 (3039575950) causes file to be too big. IGNORED.
Block #623050610 (3038656474) causes file to be too big. IGNORED.
Block #623050611 (3037435566) causes file to be too big. IGNORED.
Block #623050612 (3035215768) causes file to be too big. IGNORED.
Block #623050613 (3031785159) causes file to be too big. IGNORED.
Block #623050614 (3027736066) causes file to be too big. IGNORED.
Block #623050615 (3019627313) causes file to be too big. IGNORED.
Block #623050616 (2970766533) causes file to be too big. IGNORED.
Block #623050617 (871157932) causes file to be too big. IGNORED.
Block #623050618 (879167937) causes file to be too big. IGNORED.
Block #623050619 (883249763) causes file to be too big. IGNORED.
Block #623050620 (885943218) causes file to be too big. IGNORED.
Too many illegal blocks in inode 1618.
Clear inode? no
Suppress messages? no
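For scale, the block numbers in the "not in group" messages above can be compared with the range each group would normally cover. The following is only a rough sketch, assuming 4 KiB blocks (so 8 * 4096 = 32768 blocks per group), a first data block of 0, and metadata stored inside its own group. None of these values is stated in the report, and with flex_bg enabled metadata can legitimately live outside its group, so treat the result as a plausibility check rather than a diagnosis.

```python
# Assumed geometry (hypothetical, not taken from the ticket's superblock):
BLOCK_SIZE = 4096
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE  # one bitmap block tracks this many blocks
FIRST_DATA_BLOCK = 0


def group_range(group):
    """Return the (first, last) block numbers that belong to a block group."""
    first = FIRST_DATA_BLOCK + group * BLOCKS_PER_GROUP
    return first, first + BLOCKS_PER_GROUP - 1


def in_group(group, block):
    """True if the block number falls inside the group's own range."""
    first, last = group_range(group)
    return first <= block <= last


# (group, reported block) pairs copied from the e2fsck output above
reported = [
    (3536, 103079215118),
    (3538, 107524506255360),
    (3538, 18446612162378989568),
    (3539, 3439182177370112),
    (3545, 25769803776),
]

for group, block in reported:
    first, last = group_range(group)
    verdict = "ok" if in_group(group, block) else "out of range"
    print(f"group {group}: expected {first}..{last}, reported {block} ({verdict})")
```

Under these assumptions every reported value is far beyond the range of its group, which fits descriptors that contain garbage rather than slightly stale data.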
Issue Links
- is related to: LU-14 live replacement of OST (Resolved)
One quick update: we stopped the array and restarted it without the spare drive that was added last night, so it is currently running with 9 of its 10 drives. The e2fsck output now looks much better than before (see below).
Two questions from our side. First, should we let e2fsck use the default superblock, or should we specify one with the -b option? Second, should we be concerned about any of the errors e2fsck has reported so far? Most look minor, except perhaps the first one, about the resize inode not being valid.
The current e2fsck run is not making any changes. Our plan is to let it finish, see how many errors it finds, and, if the damage is not too bad, rerun it with the -p option to make repairs. We will still need to add the 10th drive back and let the array rebuild at some point, but right now we just want to make sure we have a valid MD array that will mount without error.
[root@oss28.stampede]# e2fsck -fn -B 4096 /dev/md14
e2fsck 1.42.7.wc1 (12-Apr-2013)
ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
e2fsck: Group descriptors look bad... trying backup blocks...
Resize inode not valid. Recreate? no
Pass 1: Checking inodes, blocks, and sizes
Inode 11468804 has an invalid extent node (blk 2936017803, lblk 393)
Clear? no
Inode 11468804, i_blocks is 8264, should be 5024. Fix? no
Inode 11534337 has an invalid extent node (blk 2952816317, lblk 764)
Clear? no
Inode 11534337, i_size is 4292608, should be 3129344. Fix? no
Inode 11534337, i_blocks is 8408, should be 6128. Fix? no
Inode 13092415 has an invalid extent node (blk 3523217944, lblk 0)
Clear? no
Inode 13092415, i_blocks is 544, should be 0. Fix? no
Inode 14291200 has an invalid extent node (blk 3526886078, lblk 0)
Clear? no
Inode 14291200, i_blocks is 2056, should be 0. Fix? no
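To judge "how many errors and how bad" from a read-only run, it can help to bucket the messages by kind instead of reading them one by one. A minimal sketch, assuming the e2fsck -n output has been saved as text; the `summarize` helper and its regular expressions are hypothetical and only cover the message shapes seen in this ticket:

```python
import re
from collections import Counter

# Trailing prompt answered by -n, e.g. "Fix? no" or "Clear HTree index? no"
PROMPT = re.compile(r"\s*\b[A-Z][A-Za-z ]*\? no$")
# Standalone numbers (inode, block, size) are replaced so like messages group
NUMBER = re.compile(r"\b\d+\b")


def summarize(lines):
    """Count e2fsck messages by kind, ignoring prompts and banner lines."""
    counts = Counter()
    for raw in lines:
        text = PROMPT.sub("", raw.strip())
        if not text or text.startswith(("e2fsck", "Pass ", "WARNING")):
            continue
        counts[NUMBER.sub("N", text)] += 1
    return counts


# A few lines copied from the output above
sample = """\
Resize inode not valid. Recreate? no
Pass 1: Checking inodes, blocks, and sizes
Inode 11468804 has an invalid extent node (blk 2936017803, lblk 393)
Clear? no
Inode 11468804, i_blocks is 8264, should be 5024. Fix? no
Inode 11534337 has an invalid extent node (blk 2952816317, lblk 764)
Clear? no
Inode 11534337, i_size is 4292608, should be 3129344. Fix? no
Inode 11534337, i_blocks is 8408, should be 6128. Fix? no
""".splitlines()

for message, count in summarize(sample).most_common():
    print(f"{count:6d}  {message}")
```

Running the same function over the full saved output would give one line per message kind with its count, which is easier to compare between the 10-drive and 9-drive runs.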