[LU-3019] Files are corrupted after OSS unmount. - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Cannot Reproduce
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.1.4
Labels:
None
Environment:
Rocks 5.3 cluster distribution (CentOS 5 based).

Severity:
4
Rank (Obsolete):
7342

Description

Our small Lustre system recently ran into a strange problem.
Lustre version: 2.1.4, precompiled binaries from the Whamcloud site (our system is CentOS 5 compatible, so we must stick to 2.1.x), upgraded from 2.0.0.1 several weeks earlier. The system originally had one MDS and one OSS (a single 8 TB OST).

After we added a second, empty OSS (also 8 TB), a slow, steady migration of files began. Then the problem arose. When the second OSS was unmounted (on one particular occasion), the files our users were working with at that moment became corrupted, along with many unrelated, randomly scattered files that had been untouched for months (maybe those that were being migrated when the unmount happened?).

These files appear to have zero size and can only be deleted.

When one of these files is accessed, we see messages like this on the new OSS:
kernel: LustreError: 18600:0:(ldlm_resource.c:1090:ldlm_resource_get()) lvbo_init failed for resource 1501826: rc -2
(No new messages in MDS or old OSS logs.)

Even a graceful unmount (MDS, then both OSSs, then MGS) leads to file corruption. (Or is this the wrong order for stopping the system?)

After running the check described in section 27.2 "Recovering from Corruption in the Lustre File System" of the manual, we get millions of (harmless?) messages like this:

[1] zero-length orphan objid 0:8371035

and hundreds of this kind:

Failed to find fid [0x20000a041:0x3f99:0x0]: DB_NOTFOUND: No matching key/data pair found
[0]: MDS FID [0x20000a041:0x3f99:0x0] object 0:984246 deleted?
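
To get a quick sense of how many messages of each kind a log contains, a tally like the one below may help. This is only a sketch: `count_lfsck_messages` is a hypothetical helper name, and the two patterns are taken from the sample lines quoted above, so other lfsck versions may word their messages differently.

```shell
# Hypothetical helper: count the two message kinds quoted in this report.
# Reads lfsck output from the files given as arguments, or from stdin.
count_lfsck_messages() {
    awk '
        /zero-length orphan objid/ { orphans++ }   # "[1] zero-length orphan objid ..."
        /Failed to find fid/       { missing++ }   # "Failed to find fid [...]"
        END {
            printf "zero-length orphans: %d\n", orphans
            printf "missing FIDs: %d\n", missing
        }
    ' "$@"
}
```

Assuming the attached logs are gzip-compressed text, it could be used as `gzip -dc chklog_before_fix.gz | count_lfsck_messages`, and again on the log of the second check to compare the totals.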

After correcting these errors with lfsck -l -c, I checked the filesystem once more and got many more errors of the same type. (The system was mounted and unmounted to perform the check; nothing else accessed it apart from a few reads with cd/ls/cat.)

Both OSSs are in failout mode.
For historical reasons the old OSS is OSS1 (half a year ago we had to migrate off a degrading OSS), so the new OSS became OSS0.

For one of the corrupted files, /lustre0/users/kglukhov/Calcs/Abinit/SPS/para/fo+Cr_par/Sn2P2S6o_DS3_WFK:

[root@compute-0-7 fo+Cr_par]# lfs getstripe Sn2P2S6o_DS3_WFK
Sn2P2S6o_DS3_WFK
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_stripe_offset: 0
obdidx    objid      objid       group
0         1500007    0x16e367    0
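
When many files need checking, the OST index and object ID can be pulled out of this output mechanically. The following is a rough sketch that assumes the layout shown above (a header line containing "obdidx", then one row per stripe with the decimal objid in the second column); `stripe_objects` is a hypothetical helper name, and other Lustre versions may print the table differently.

```shell
# Hypothetical helper: print "obdidx objid" for each stripe row found in
# `lfs getstripe` output read from stdin.
stripe_objects() {
    awk '
        /obdidx/ { in_table = 1; next }        # header row marks the start of the table
        in_table && NF >= 4 { print $1, $2 }   # stripe row: obdidx, decimal objid
    '
}
```

For the file above, `lfs getstripe Sn2P2S6o_DS3_WFK | stripe_objects` would be expected to print the pair `0 1500007`, which are the values used for the lookup on the OSS.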

On the corresponding (new) OSS node:
[root@lustre-compute-0-3 mnt]# debugfs -c -R "stat O/0/d$((1500007 % 32))/1500007" /dev/md3
debugfs 1.42.6.wc2 (10-Dec-2012)
/dev/md3: catastrophic mode - not reading inode or group bitmaps
O/0/d7/1500007: File not found by ext2_lookup
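
The object path used in the debugfs command follows the pattern O/<group>/d<objid % 32>/<objid>. A small sketch of that calculation, taken directly from the command above, is shown here; `ost_object_path` is a hypothetical helper name, and the group defaults to 0 as in the getstripe output.

```shell
# Hypothetical helper: build the path of an OST object inside the backing
# filesystem, mirroring the O/0/d$((objid % 32))/objid form used above.
ost_object_path() {
    objid=$1
    group=${2:-0}    # object group; 0 in the getstripe output above
    echo "O/${group}/d$((objid % 32))/${objid}"
}
```

For objid 1500007 this gives O/0/d7/1500007, so the lookup becomes debugfs -c -R "stat $(ost_object_path 1500007)" /dev/md3.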

Can you help us understand what is going on and how to tackle it?

I am attaching the output of all commands from the first check (following section 27.2 of the manual), and the lfsck output for the first check (-n), the error-fixing run (-l -c), and the second full check.

Attachments

chklog_after_fix.gz        1.27 MB    22/Mar/13 2:11 PM
chklog_before_fix.gz       1.27 MB    22/Mar/13 2:11 PM
chklog_fixing.gz           1.26 MB    22/Mar/13 2:11 PM
first_check_logs.tar.gz    2 kB       22/Mar/13 2:11 PM

Activity

Andreas Dilger added a comment - 09/Jan/20 6:58 AM

Close old bug

Minh-Nghia Nguyen added a comment - 08/May/13 3:47 PM

We encountered the same problem after restarting our Lustre filesystem, which was running 2.1.4. Our experience seems very similar to what Oleg describes in this bug report. Many files that had been untouched for months became corrupted (zero size) on many OSTs that were cleanly unmounted. I hope there is a way to recover the users' files.

Oleg Farenyuk added a comment - 28/Mar/13 7:00 PM

Switching both OSSs to failover mode reduced the likelihood of damage by 4-5 orders of magnitude, though a (simulated) outage of an OSS and/or clients still sometimes leads to file corruption: the files become inaccessible and cannot be repaired by lfsck.

Oleg Farenyuk added a comment - 25/Mar/13 12:31 AM

It turned out that the utility simply could not fix those errors, so the messages from lfsck were all about the same files. After I removed the defective files manually, these messages went away.

Then I ran an additional experiment.

1. Following section 14.4 of the manual (Regenerating Lustre Configuration Logs), unmounted everything and regenerated the logs.
2. Mounted the file system.
3. Wrote 20 GB of files to Lustre.
4. Checked: no new defective files appeared.
5. An hour later, started copying another 20 GB of files. While copying, unmounted OSS0 for 30 seconds.

Result: several hundred of the files being copied were damaged, and a few dozen files copied an hour earlier were damaged too! Fortunately, unlike last time, older files were not affected.

("Damaged" means they show zero size and can only be deleted; any other operation fails.)
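
Since damaged files show up with zero size, candidates can be listed from a client with a plain find. This is only a sketch: `list_zero_size_files` is a hypothetical helper name, it assumes stat still succeeds on the damaged files, and it will also report files that are legitimately empty, so the result needs to be reviewed before anything is deleted.

```shell
# Hypothetical helper: list regular files of zero size under a directory.
# Legitimately empty files are matched too, so treat the output as a
# list of candidates rather than a list of damaged files.
list_zero_size_files() {
    find "$1" -type f -size 0 -print
}
```

Running it on the copy target before and after an experiment, and comparing the two lists, would show which files became zero-length in between.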
