[LU-5239] Recovery of small files with corrupt objects Created: 20/Jun/14  Updated: 18/Sep/14  Resolved: 18/Sep/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Blake Caldwell Assignee: Zhenyu Xu
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

RHEL6.4/distro IB kernel 2.6.32-358.18.1.el6


Severity: 3
Rank (Obsolete): 14611

 Description   

We had a backend storage issue on 5/30 that corrupted a number of blocks on the filesystem across different OSTs. Since then we were able to recover all filesystem structures with e2fsck and identified what we thought were all of the affected files. Just recently, we discovered a new scenario in which inodes were corrupted and subsequently cleared by e2fsck. We have identified 665 such files; an ls or stat on them returns "Cannot allocate memory". Syslog shows the error

Jun 20 20:53:15 f1-oss1d5 kernel: [853846.084587] LustreError: 14391:0:(ldlm_resource.c:1165:ldlm_resource_get()) f1-OST00bc: lvbo_init failed for resource 0xd4805:0x0: rc = -2

This is expected because object 0xd4805 on f1-OST00bc is invalid (its inode on f1-OST00bc was cleared by e2fsck).

gaea9:/tmp # lfs getstripe file.F90
lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_stripe_offset: 186
obdidx objid objid group
186 871222 0xd4b36 0
187 870647 0xd48f7 0
188 870405 0xd4805 0
189 869971 0xd4653 0

We would like to attempt recovery of small files <3MB (stripe count 4, stripe size 1MB), where the layout may position the missing object after EOF. We expected that a dd if=bad.file of=good.file would return success if EOF was reached before the missing object. However, this method fails with "Cannot allocate memory" even for small files, where dd reports copying only some number of kB.
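The recovery criterion above can be sketched as a small helper. This is a hypothetical illustration, not Lustre code: it maps a byte offset to its stripe object for a plain RAID-0 layout using the lfs getstripe values shown below (stripe size 1048576, obdidx list 186-189, with the bad object on OST 188), to check whether any byte of a file of a given size lands on the bad OST.

```python
# Hypothetical sketch of Lustre RAID-0 stripe mapping; the constants
# come from the `lfs getstripe file.F90` output in this ticket.
STRIPE_SIZE = 1048576          # lmm_stripe_size
OBDIDX = [186, 187, 188, 189]  # obdidx column; 188 holds the cleared object

def stripe_for_offset(offset, stripe_size=STRIPE_SIZE, obdidx=OBDIDX):
    """Return (stripe index, OST index) holding the given byte offset."""
    stripe = (offset // stripe_size) % len(obdidx)
    return stripe, obdidx[stripe]

def touches_bad_ost(file_size, bad_ost=188):
    """True if any byte of a file of this size lands on the bad OST."""
    return any(stripe_for_offset(off)[1] == bad_ost
               for off in range(0, file_size, STRIPE_SIZE))
```

Under this model, a 17333-byte file lives entirely in the first stripe object (OST 186) and should be recoverable, which is what motivates the dd attempt.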

What is causing dd to fail to read even files less than 1MB, where the bad object is the 3rd stripe object?

gaea9:/tmp # dd if=file.F90 of=/tmp/good.out
dd: reading `file.F90': Cannot allocate memory
33+0 records in
33+0 records out
16896 bytes (17 kB) copied, 0.0531134 s, 318 kB/s

When opened, good.out is incomplete.

Is there an alternative method to successfully read to EOF for small files?

This is not causing a downtime, but it is desirable to recover these files as quickly as reasonably possible.



 Comments   
Comment by Peter Jones [ 21/Jun/14 ]

Bobijam

Could you please advise with this one?

Thanks

Peter

Comment by Zhenyu Xu [ 23/Jun/14 ]

How can you be sure that the file's EOF is before 3MB? If you are sure of that, would "dd if=file.F90 of=/tmp/good.out bs=1M count=2" work for it?

Comment by Blake Caldwell [ 23/Jun/14 ]

That causes the same "Cannot allocate memory" error. By setting bs=1, the whole file (17333 bytes) can be recovered to EOF. With a block size of 4, I could reproduce the error without reading the whole file. Is there an optimization where the client tries to read the next object even if EOF is reached on the first object?

gaea9:/tmp # dd if=file.F90 of=/tmp/good.out bs=1
17333+0 records in
17333+0 records out
17333 bytes (17 kB) copied, 0.0607574 s, 285 kB/s

gaea9:/tmp # dd if=file.F90 of=/tmp/good.out bs=4 count=4333
4333+0 records in
4333+0 records out
17332 bytes (17 kB) copied, 0.0237968 s, 728 kB/s
gaea9:/tmp # dd if=file.F90 of=/tmp/good.out bs=4 count=4334
dd: reading `file.F90': Cannot allocate memory
4333+0 records in
4333+0 records out
17332 bytes (17 kB) copied, 0.0223031 s, 777 kB/s

Comment by Zhenyu Xu [ 23/Jun/14 ]

From what you described, file.F90 only has 17333 bytes available to be recovered. When the block size is set to 4, dd reads 4 bytes at a time, so it can succeed only 4333 times, covering 4333 * 4 = 17332 bytes; the next read reaches the missing object on OST00bc and fails. The same explains the dd command without the bs parameter, whose default is 512 bytes: it reads 512 * 33 = 16896 bytes, then fails to read another 512 bytes, which again reaches the missing object on OST00bc.
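The arithmetic above reduces to a one-line rule, sketched here for illustration: with only `avail` bytes readable before the error, dd completes floor(avail / bs) full records and the next read fails.

```python
# Sketch of the record arithmetic above (not dd itself): dd copies
# complete bs-sized records until the next read hits the error.
def dd_records(avail, bs):
    """Number of complete bs-sized records dd can copy from `avail` bytes."""
    return avail // bs

# With 17333 recoverable bytes:
#   bs=4   -> 4333 records (17332 bytes), the 4334th read fails
#   bs=512 -> 33 records (16896 bytes), the 34th read fails
#   bs=1   -> 17333 records, the whole file
```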

Comment by Blake Caldwell [ 03/Jul/14 ]

While we were able to complete recovery of the files with bs=1, we weren't completely clear why reading 17336 bytes (4 * 4334) would return an error when 17332 bytes is fine. Lustre would have to know that the first object holds 17332 bytes and that it needs to read 4 more bytes from the 2nd object.

Why would it prefetch the 2nd object in the 17336 bytes case and not the 17332 bytes case?
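For reference, both request endpoints fall well inside the first 1MB stripe object under the layout above, so the stripe mapping alone does not explain the difference; the sketch below (an illustration, not Lustre code) just checks that.

```python
STRIPE_SIZE = 1048576  # lmm_stripe_size from the layout in this ticket

def stripe_index(offset):
    """0-based stripe object holding a byte offset in a 4-stripe file."""
    return (offset // STRIPE_SIZE) % 4

# The last byte of the successful run (offset 17331) and of the failing
# run (offset 17335) both lie in stripe 0, which is why the error looks
# like it must involve something beyond the requested range (e.g.
# readahead or glimpse of other objects) rather than the range itself.
```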

Comment by Zhenyu Xu [ 04/Jul/14 ]

I suspect it could involve the dd implementation. I don't know the details, but I guess dd does not try to determine whether EOF falls within the last 4-byte read request; it simply asks for 4 bytes, and Lustre reaches the unavailable region and returns ENOENT for the request.

Comment by Blake Caldwell [ 18/Sep/14 ]

This can be resolved. There is no practical reason to investigate this further; it may well be in the dd implementation, and using conv=sync could have helped with the investigation. We were able to recover about half of the files using this technique (with bs=1), because in those cases the cleared block was after EOF.

Generated at Sat Feb 10 01:49:43 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.