[LU-4523] Need explanation for FS corruption - ldiskfs_mb_free_metadata: Double free of blocks Created: 21/Jan/14  Updated: 11/Feb/14  Resolved: 11/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Oz Rentas Assignee: Hongchao Zhang
Resolution: Duplicate Votes: 0
Labels: None

Attachments: File kern.log.1     File uname_r     File version    
Severity: 3
Rank (Obsolete): 12367

 Description   

The customer, Yale, encountered file system corruption on one of their OST devices, dm-20, which is "scratch-OST0028". The customer ran e2fsck on that device, which fixed the corruption, but they would now like a root cause analysis (RCA) to prevent it from happening again.

The corruption was first reported on Jan 11, but there were no irregular events on the storage side that would have caused it. This could indicate that the corruption occurred some time earlier and was only detected on the 11th.

Jan 11 11:54:33 oss7 kernel: Lustre: 2916:0:(o2iblnd_cb.c:2249:kiblnd_passive_connect()) Conn stale 10.191.133.6@o2ib [old ver: 12, new ver: 12]
Jan 11 11:54:33 oss7 kernel: Lustre: 2916:0:(o2iblnd_cb.c:2249:kiblnd_passive_connect()) Skipped 101 previous similar messages
Jan 11 11:59:52 oss7 kernel: LDISKFS-fs error (device dm-20): ldiskfs_mb_free_metadata: Double free of blocks 30208 (30208 148)
Jan 11 11:59:52 oss7 kernel: Aborting journal on device dm-20-8.
Jan 11 11:59:52 oss7 kernel: LDISKFS-fs (dm-20): Remounting filesystem read-only
Jan 11 11:59:52 oss7 kernel: LDISKFS-fs error (device dm-20) in ldiskfs_reserve_inode_write: Journal has aborted
Jan 11 11:59:52 oss7 kernel: LDISKFS-fs error (device dm-20) in ldiskfs_ext_remove_space: Journal has aborted
Jan 11 11:59:52 oss7 kernel: LDISKFS-fs error (device dm-20) in ldiskfs_reserve_inode_write: Journal has aborted
Jan 11 11:59:52 oss7 kernel: LDISKFS-fs error (device dm-20) in ldiskfs_orphan_del: Journal has aborted
Jan 11 11:59:52 oss7 kernel: LDISKFS-fs error (device dm-20) in ldiskfs_reserve_inode_write: Journal has aborted
Jan 11 11:59:52 oss7 kernel: LDISKFS-fs error (device dm-20) in ldiskfs_ext_truncate: Journal has aborted
Jan 11 11:59:52 oss7 kernel: LustreError: 22850:0:(fsfilt-ldiskfs.c:369:fsfilt_ldiskfs_start()) error starting handle for op 8 (106 credits): rc -30
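
Since the corruption may predate the first reported error, it can help to scan the full kernel log for the earliest LDISKFS-fs error on each device. The following is a minimal sketch, not an official Lustre tool: the function name first_errors_by_device and the regular expression are assumptions based only on the log line format shown in the excerpt above.

```python
import re

# Matches ldiskfs error lines in the two shapes seen in the excerpt above:
#   "... kernel: LDISKFS-fs error (device dm-20): <function>: <message>"
#   "... kernel: LDISKFS-fs error (device dm-20) in <function>: <message>"
# The pattern is an assumption derived from those lines only.
LDISKFS_ERROR = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"LDISKFS-fs error \(device (?P<device>[^)]+)\)(?::| in)\s+"
    r"(?P<function>\w+):\s+(?P<message>.*)$"
)


def first_errors_by_device(lines):
    """Return the first LDISKFS-fs error seen for each device, in log order."""
    first = {}
    for line in lines:
        match = LDISKFS_ERROR.match(line.strip())
        if match and match.group("device") not in first:
            first[match.group("device")] = match.groupdict()
    return first
```

As a usage example, passing the lines of a saved syslog file (for instance the attached kern.log.1, opened with open() and iterated line by line) returns a dictionary keyed by device name, whose values hold the timestamp, host, function, and message of the first error logged for that device.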

FSCK output:
e2fsck 1.42.3.wc3 (15-Aug-2012)
device /dev/mapper/ost_scratch_40 mounted by lustre per /proc/fs/lustre/obdfilter/scratch-OST0028/mntdev
Warning! /dev/mapper/ost_scratch_40 is mounted.
MMP interval is 10 seconds and total wait time is 42 seconds. Please wait...
Warning: skipping journal recovery because doing a read-only filesystem check.
scratch-OST0028 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes

Running additional passes to resolve blocks claimed by more than one inode...
Pass 1B: Rescanning for multiply-claimed blocks

What other debugging or data can be pulled to explain the problem?

Thanks,
Oz



 Comments   
Comment by Peter Jones [ 21/Jan/14 ]

Oz

We'll certainly need details about which Lustre version is in use, plus some logs: dmesg (or syslog) from this node as a start, covering, say, the preceding 24 hours.

Thanks

Peter

Comment by Oz Rentas [ 21/Jan/14 ]

Ah, yes, of course. Sorry about that. I've attached the missing files.

Lustre: 1.8.9
Kernel: 2.6.18-348.1.1.el5

Comment by Peter Jones [ 22/Jan/14 ]

Hongchao has been looking at this information

Comment by Peter Jones [ 24/Jan/14 ]

Hongchao

As per our recent discussion on this topic, I understand that you believe this issue to be a duplicate of LU-482, which periodically affected our internal testing on older Lustre releases but has not been seen on 2.4.x and newer releases. I also understand that you believe it is due to an issue in the underlying ext4 code that has been addressed in newer kernel versions.

Do I have this right? Is there anything to add/correct?

Thanks

Peter

Comment by Hongchao Zhang [ 26/Jan/14 ]

Hi Peter,

Yes, it could be a problem related to ext4 (which is lightly patched by Lustre and renamed to ldiskfs). There are some similar tickets (LU-482, LU-699, etc.).

By the way, there is a similar issue reported by Red Hat:

https://access.redhat.com/site/solutions/157393

Thanks

Comment by Oz Rentas [ 11/Feb/14 ]

Thank you for this very useful information. It has been passed on to the customer.
We have since upgraded the OS / Lustre build on the servers. This ticket can be closed.

Comment by Peter Jones [ 11/Feb/14 ]

ok thanks Oz
