[LU-3019] Files are corrupted after OSS unmount. Created: 22/Mar/13 Updated: 09/Jan/20 Resolved: 09/Jan/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Oleg Farenyuk | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Rocks 5.3 cluster distribution (CentOS 5 based). |
||
| Attachments: |
|
| Severity: | 4 |
| Rank (Obsolete): | 7342 |
| Description |
|
Our small Lustre system recently encountered a strange problem. After a second, empty OSS (also 8 TB) was added, a slow, steady migration of files began. Then the problem arose. When the second OSS was unmounted (on one particular occasion), files that our users were working with at that moment became corrupted, along with many unrelated, randomly scattered files that had not been touched for months (perhaps those that were being migrated when the unmount happened?). These files appear to be of zero size and can only be deleted. When we try to access one of these files on the new OSS, we observe messages like this:

Even a graceful unmount (MDS, then both OSSs, then MGS) leads to file corruption. (Or is this the wrong order for stopping the system?)

After the check described in section 27.2, "Recovering from Corruption in the Lustre File System", of the manual, we have millions of (harmless?) messages like:
[1] zero-length orphan objid 0:8371035
and hundreds like:
Failed to find fid [0x20000a041:0x3f99:0x0]: DB_NOTFOUND: No matching key/data pair found

After correcting these errors with lfsck -l -c, I checked the filesystem once more and received many more errors of the same type. (The system was mounted and unmounted to perform the check, with no other accesses except a few reads: cd/ls/cat.)

Both OSSs are in failout mode.

For one of the corrupted files, /lustre0/users/kglukhov/Calcs/Abinit/SPS/para/fo+Cr_par/Sn2P2S6o_DS3_WFK:
[root@compute-0-7 fo+Cr_par]# lfs getstripe Sn2P2S6o_DS3_WFK
On the corresponding (new) OSS node:

Can you help us understand what is going on and how to tackle it?

I attach the output of all commands from the first check (according to section 27.2 of the manual), and the lfsck output for the first check (-n), the error fixing (-l -c), and the second full check. |
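Since the check produces millions of lines, a small tally script can make two lfsck passes comparable. The following is only a sketch: it assumes the lfsck output was saved as text and that the two message kinds look exactly like the examples quoted above. The function name `summarize_lfsck_output` and the counter keys are made up for illustration and are not part of any Lustre tool.

```python
import re
from collections import Counter

# Patterns modeled on the two messages quoted in this report (an assumption:
# other lfsck versions may word them differently).
ORPHAN_RE = re.compile(r"zero-length orphan objid (\d+):(\d+)")
MISSING_FID_RE = re.compile(
    r"Failed to find fid \[(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]"
)


def summarize_lfsck_output(lines):
    """Count the two message kinds and collect the distinct missing FIDs."""
    counts = Counter()
    missing_fids = set()
    for line in lines:
        if ORPHAN_RE.search(line):
            counts["zero_length_orphan"] += 1
        match = MISSING_FID_RE.search(line)
        if match:
            counts["fid_not_found"] += 1
            missing_fids.add(":".join(match.groups()))
    return counts, missing_fids
```

Running it on the saved output of each pass and comparing the two `missing_fids` sets would show whether the second check reports the same FIDs as the first or new ones.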
| Comments |
| Comment by Oleg Farenyuk [ 25/Mar/13 ] |
|
It turned out that the utility simply could not fix those errors, so the messages from lfsck were all about the same files. After the defective files were removed manually, these messages went away. Then I ran an additional experiment. 1. Following section 14.4 of the manual (Regenerating Lustre Configuration Logs), I unmounted everything and regenerated the logs. Result: several hundred files that were being copied at the time were damaged, and a few dozen files copied an hour earlier were damaged too! Fortunately, unlike last time, older files were not affected. (Damaged means they show zero size and can only be deleted; any other operation fails.) |
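To get a list of candidate files after such an experiment, something like the sketch below could be run on a client. It only finds regular files of zero length under a directory; it cannot tell a damaged file from one that is legitimately empty, so the result is a list to inspect, not a verdict. The function name `find_zero_size_files` is hypothetical.

```python
import os
import stat


def find_zero_size_files(root):
    """Yield (path, mtime) for every zero-length regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                # The entry vanished or cannot be stat'ed; skip it.
                continue
            if stat.S_ISREG(st.st_mode) and st.st_size == 0:
                yield path, st.st_mtime
```

Sorting the result by modification time would separate files touched around the unmount from ones that had been idle for months.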
| Comment by Oleg Farenyuk [ 28/Mar/13 ] |
|
Switching both OSSs to failover mode reduced the likelihood of damage by 4-5 orders of magnitude, though a (simulated) outage of an OSS and/or clients still sometimes leads to file corruption: the files become inaccessible and cannot be repaired by lfsck. |
| Comment by Minh-Nghia Nguyen [ 08/May/13 ] |
|
We encountered the same problem after restarting our Lustre filesystem, which was on 2.1.4. Our experience seems very similar to what Oleg describes for his Lustre system in this bug. Many files untouched for months became corrupted (zero size) on many OSTs that were cleanly unmounted. I hope that there is a way to recover users' files. |
| Comment by Andreas Dilger [ 09/Jan/20 ] |
|
Close old bug |