Details
-
Bug
-
Resolution: Duplicate
-
Minor
-
None
-
None
-
None
-
3
-
12367
Description
The customer, Yale, encountered file system corruption on one of their OST devices, dm-20, which is "scratch-OST0028". The customer ran e2fsck on that device, which fixed the corruption, but they would now like an RCA to prevent it from happening again.
The corruption was first reported on Jan 11, but there were no irregular events on the storage side that would explain it. This could indicate that the corruption occurred earlier and was only detected on the 11th.
Jan 11 11:54:33 oss7 kernel: Lustre: 2916:0:(o2iblnd_cb.c:2249:kiblnd_passive_connect()) Conn stale 10.191.133.6@o2ib [old ver: 12, new ver: 12]
Jan 11 11:54:33 oss7 kernel: Lustre: 2916:0:(o2iblnd_cb.c:2249:kiblnd_passive_connect()) Skipped 101 previous similar messages
Jan 11 11:59:52 oss7 kernel: LDISKFS-fs error (device dm-20): ldiskfs_mb_free_metadata: Double free of blocks 30208 (30208 148)
Jan 11 11:59:52 oss7 kernel: Aborting journal on device dm-20-8.
Jan 11 11:59:52 oss7 kernel: LDISKFS-fs (dm-20): Remounting filesystem read-only
Jan 11 11:59:52 oss7 kernel: LDISKFS-fs error (device dm-20) in ldiskfs_reserve_inode_write: Journal has aborted
Jan 11 11:59:52 oss7 kernel: LDISKFS-fs error (device dm-20) in ldiskfs_ext_remove_space: Journal has aborted
Jan 11 11:59:52 oss7 kernel: LDISKFS-fs error (device dm-20) in ldiskfs_reserve_inode_write: Journal has aborted
Jan 11 11:59:52 oss7 kernel: LDISKFS-fs error (device dm-20) in ldiskfs_orphan_del: Journal has aborted
Jan 11 11:59:52 oss7 kernel: LDISKFS-fs error (device dm-20) in ldiskfs_reserve_inode_write: Journal has aborted
Jan 11 11:59:52 oss7 kernel: LDISKFS-fs error (device dm-20) in ldiskfs_ext_truncate: Journal has aborted
Jan 11 11:59:52 oss7 kernel: LustreError: 22850:0:(fsfilt-ldiskfs.c:369:fsfilt_ldiskfs_start()) error starting handle for op 8 (106 credits): rc -30
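To check whether dm-20 logged anything before Jan 11, older syslog files could be scanned for the earliest LDISKFS-fs error per device. The following is only a rough sketch: the regular expression is modelled on the two message shapes in the excerpt above, the function name first_errors is hypothetical, and it assumes the input lines are already in chronological order.

```python
import re

# Matches the two shapes seen in the excerpt above:
#   ... kernel: LDISKFS-fs error (device dm-20): <function>: <message>
#   ... kernel: LDISKFS-fs error (device dm-20) in <function>: <message>
LDISKFS_ERROR = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"LDISKFS-fs error \(device (?P<device>[^)]+)\)(?::| in)\s+"
    r"(?P<function>\w+):\s+(?P<message>.*)$"
)


def first_errors(lines):
    """Return the first LDISKFS-fs error seen for each (host, device).

    Assumes `lines` is in chronological order, as in a normal syslog file.
    """
    first = {}
    for line in lines:
        match = LDISKFS_ERROR.match(line.strip())
        if match is None:
            continue  # not an LDISKFS-fs error line
        key = (match.group("host"), match.group("device"))
        if key not in first:
            first[key] = match.groupdict()
    return first


if __name__ == "__main__":
    import sys

    # Usage: python first_ldiskfs_errors.py < syslog_file
    for (host, device), err in first_errors(sys.stdin).items():
        print(f"{host} {device}: {err['stamp']} {err['function']}: {err['message']}")
```

Run over the excerpt above, this would report ldiskfs_mb_free_metadata at Jan 11 11:59:52 as the first error for dm-20. The later "Journal has aborted" lines follow from that abort and are not separate events.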
FSCK output:
e2fsck 1.42.3.wc3 (15-Aug-2012)
device /dev/mapper/ost_scratch_40 mounted by lustre per /proc/fs/lustre/obdfilter/scratch-OST0028/mntdev
Warning! /dev/mapper/ost_scratch_40 is mounted.
MMP interval is 10 seconds and total wait time is 42 seconds. Please wait...
Warning: skipping journal recovery because doing a read-only filesystem check.
scratch-OST0028 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Running additional passes to resolve blocks claimed by more than one inode...
Pass 1B: Rescanning for multiply-claimed blocks
What other debug output or data can we collect to help explain the problem?
Thanks,
Oz