[LU-3019] Files are corrupted after OSS unmount. Created: 22/Mar/13  Updated: 09/Jan/20  Resolved: 09/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.4
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oleg Farenyuk Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Rocks 5.3 cluster distribution (CentOS5-based).


Attachments: File chklog_after_fix.gz     File chklog_before_fix.gz     File chklog_fixing.gz     File first_check_logs.tar.gz    
Severity: 4
Rank (Obsolete): 7342

 Description   

Our small lustre system recently encountered a strange problem.
Lustre version 2.1.4, precompiled binaries from the Whamcloud site (we have a CentOS5-compatible system, so we must stick to 2.1.x), upgraded from 2.0.0.1 several weeks before. The system contained one MDS and one OSS (a single OST, 8 TB in size).

After the addition of a second, empty OSS (also 8 TB), a steady and slow migration of files began. Then the problem arose. When the second OSS was unmounted (on one particular occasion), the files our users were working with at that moment, and many other unrelated, randomly scattered files untouched for months (perhaps those that were migrating when the unmount happened?), became corrupted.

These files appear to have zero size, and the only operation that still works on them is deletion.

If we try to access one of these files on the new OSS, we observe messages like this:
kernel: LustreError: 18600:0:(ldlm_resource.c:1090:ldlm_resource_get()) lvbo_init failed for resource 1501826: rc -2
(No new messages in MDS or old OSS logs.)
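The rc -2 in that message is a negative errno value; decoding it shows it is -ENOENT, consistent with the OST object being missing on disk. A quick illustration (our own snippet, not from the logs):

```python
import errno
import os

rc = -2  # return code reported by the ldlm_resource_get() message above

# Map the negative return code back to its symbolic errno name
# and its human-readable description.
print(errno.errorcode[-rc])  # ENOENT
print(os.strerror(-rc))      # No such file or directory
```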

Even a graceful unmount (MDS first, then both OSSes, then the MGS) leads to file corruption. (Or is this the wrong way to stop the system correctly?)

After a check performed according to section 27.2, "Recovering from Corruption in the Lustre File System", of the manual, we have millions of (harmless?) messages like:

[1] zero-length orphan objid 0:8371035

and hundreds of this kind:

Failed to find fid [0x20000a041:0x3f99:0x0]: DB_NOTFOUND: No matching key/data pair found
[0]: MDS FID [0x20000a041:0x3f99:0x0] object 0:984246 deleted?
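A Lustre FID such as [0x20000a041:0x3f99:0x0] is a sequence:object-id:version triple of hexadecimal fields. A small helper to split one apart (an illustrative sketch; the function name is our own):

```python
def parse_fid(fid_str):
    """Split a Lustre FID string '[0xseq:0xoid:0xver]' into its
    sequence, object-id, and version components as integers."""
    seq, oid, ver = (int(part, 16) for part in fid_str.strip("[]").split(":"))
    return seq, oid, ver

# The FID from the lfsck message above:
print(parse_fid("[0x20000a041:0x3f99:0x0]"))
```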

After correcting these errors with lfsck -l -c, I checked the filesystem one more time and received many more errors of the same type. (The system was mounted and unmounted only to perform the check; no other accesses were made except some reads via cd/ls/cat.)

Both OSSes are in failout mode.
For historical reasons the old OSS is OSS1 (half a year ago we had to migrate off a degrading OSS), so the new OSS became OSS0.

For one of the corrupted files, /lustre0/users/kglukhov/Calcs/Abinit/SPS/para/fo+Cr_par/Sn2P2S6o_DS3_WFK:

[root@compute-0-7 fo+Cr_par]# lfs getstripe Sn2P2S6o_DS3_WFK
Sn2P2S6o_DS3_WFK
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_stripe_offset: 0
obdidx objid objid group
0 1500007 0x16e367 0

On the corresponding (new) OSS node:
[root@lustre-compute-0-3 mnt]# debugfs -c -R "stat O/0/d$((1500007 % 32))/1500007" /dev/md3
debugfs 1.42.6.wc2 (10-Dec-2012)
/dev/md3: catastrophic mode - not reading inode or group bitmaps
O/0/d7/1500007: File not found by ext2_lookup
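The object path passed to debugfs follows the ldiskfs OST on-disk layout O/&lt;group&gt;/d&lt;objid mod 32&gt;/&lt;objid&gt;, where objects are spread across 32 "d*" subdirectories by objid modulo 32. Deriving the path from the lfs getstripe output above (a sketch; the helper name is our own):

```python
def ost_object_path(group, objid):
    """Build the ldiskfs on-disk path for an OST object: objects are
    hashed into 32 'd<n>' subdirectories by objid modulo 32."""
    return f"O/{group}/d{objid % 32}/{objid}"

objid = 1500007  # objid from lfs getstripe; hex form 0x16e367
print(ost_object_path(0, objid))  # O/0/d7/1500007
print(hex(objid))                 # 0x16e367
```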

Can you help us understand what is going on and how to tackle it?

I am attaching the output of all commands from the first check (per manual section 27.2), the lfsck output for the first check (-n), the error-fixing run (-l -c), and the second full check.



 Comments   
Comment by Oleg Farenyuk [ 25/Mar/13 ]

It turned out that the utility simply could not fix those errors, so the lfsck messages were all about the same files. After removing the defective files manually, these messages were gone.

Then I've made an additional experiment.

1. Following section 14.4 of the manual (Regenerating Lustre Configuration Logs), unmounted everything and regenerated the logs.
2. Mounted the file system.
3. Wrote 20 GB of files to Lustre.
4. Checked: no new defective files appeared.
5. An hour later, started copying another 20 GB of files. While the copy was running, unmounted OSS0 for 30 seconds.

Result: several hundred of the files being copied were damaged. And a few dozen files copied an hour earlier were damaged too! Fortunately, unlike last time, no older files were affected.

(Damaged means they show zero size and can only be deleted; any other operation fails.)

Comment by Oleg Farenyuk [ 28/Mar/13 ]

Switching both OSSes to failover mode reduced the likelihood of damage by 4-5 orders of magnitude, though a (simulated) outage of an OSS and/or clients sometimes still leads to file corruption: the affected files become inaccessible and cannot be repaired by lfsck.

Comment by Minh-Nghia Nguyen [ 08/May/13 ]

We encountered the same problem after restarting our Lustre filesystem, which was on 2.1.4. Our experience seems very similar to what Oleg describes in the bug report: many files untouched for months became corrupted (zero size) on several OSTs that were cleanly unmounted. I hope there is a way to recover the users' files.

Comment by Andreas Dilger [ 09/Jan/20 ]

Close old bug

Generated at Sat Feb 10 01:30:15 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.