Details
-
Bug
-
Resolution: Not a Bug
-
Major
-
None
-
Lustre 2.4.1
-
None
-
RHEL6.4/distro IB kernel 2.6.32-358.18.1.el6
-
3
-
14611
Description
We had a backend storage issue on 5/30 that corrupted a number of blocks on the filesystem across different OSTs. Since then we were able to recover all filesystem structures with e2fsck and identify what we thought were all the affected files. Just recently, we discovered a new scenario where inodes were corrupted and so were cleared by e2fsck. We have identified 665 such files, and an ls or stat on them returns "Cannot allocate memory". Syslog has the error
Jun 20 20:53:15 f1-oss1d5 kernel: [853846.084587] LustreError: 14391:0:(ldlm_resource.c:1165:ldlm_resource_get()) f1-OST00bc: lvbo_init failed for resource 0xd4805:0x0: rc = -2
This is expected because object 0xd4805 on f1-OST00bc is invalid (its inode on f1-OST00bc was cleared by e2fsck).
gaea9:/tmp # lfs getstripe file.F90
lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_stripe_offset: 186
obdidx objid objid group
186 871222 0xd4b36 0
187 870647 0xd48f7 0
188 870405 0xd4805 0
189 869971 0xd4653 0
We would like to attempt recovery of small files (<3MB, stripe count 4) whose layout might place the missing object after EOF. We expected that dd if=bad.file of=good.file would return success if EOF was reached before the missing object. However, this method fails with "Cannot allocate memory" even for small files, and dd reports copying only some number of kB.
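The layout reasoning above can be sketched in a few lines of Python. This is only an illustration, assuming plain round-robin (RAID0) striping with the stripe size and stripe count shown in the lfs getstripe output; the function name and the sample file sizes are hypothetical, and the sketch models only where data bytes land, not how the client obtains locks or the file size.

```python
def stripe_indices_with_data(file_size, stripe_size, stripe_count):
    """Return the stripe indices that hold at least one byte of the file.

    Assumes round-robin striping: chunk k of the file (stripe_size bytes
    each) is stored in the object at stripe index k % stripe_count.
    """
    if file_size <= 0:
        return []
    # Number of stripe-sized chunks the file occupies (ceiling division).
    chunks = (file_size + stripe_size - 1) // stripe_size
    # Once chunks >= stripe_count, every object holds some data.
    return list(range(min(chunks, stripe_count)))


STRIPE_SIZE = 1048576   # lmm_stripe_size from the getstripe output
STRIPE_COUNT = 4        # lmm_stripe_count from the getstripe output
BAD_INDEX = 2           # third object in the layout (0xd4805 on obdidx 188)

# Hypothetical file sizes, chosen to straddle the 2 MiB boundary.
for size in (16896, 1048576, 2 * 1048576, 2 * 1048576 + 1):
    touched = stripe_indices_with_data(size, STRIPE_SIZE, STRIPE_COUNT)
    print(size, touched, "bad object holds data:", BAD_INDEX in touched)
```

Under these assumptions, a file of 2 MiB or less keeps all of its data in the first two objects, so the missing third object would hold no file data.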
What is causing dd to fail to read even files smaller than 1MB when the bad object is the 3rd object in the layout?
gaea9:/tmp # dd if=file.F90 of=/tmp/good.out
dd: reading `file.F90': Cannot allocate memory
33+0 records in
33+0 records out
16896 bytes (17 kB) copied, 0.0531134 s, 318 kB/s
The resulting good.out is incomplete when opened.
Is there an alternative method to successfully read to EOF for small files?
This is not causing downtime, but it is desirable to recover these files as quickly as reasonably possible.