[LU-8521] ZFS OST is unwritable Created: 22/Aug/16 Updated: 05/Dec/17 Resolved: 05/Dec/17 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Joe Mervini | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Environment: |
TOSS 2.4-9 |
||
| Issue Links: |
|
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
After a power outage we encountered a hardware error on one of our storage devices that essentially corrupted ~30 files on one of the OSTs. Since then the OST has been read-only and is throwing the following log messages:

[ 351.029519] LustreError: 8974:0:(ofd_obd.c:1376:ofd_create()) fscratch-OST0001: unable to precreate: rc = -5

I have scrubbed the device in question and rebooted the system to bring up the server normally, but I am still unable to create a file on that OST. zpool status -v reports the damaged files and recommends restoring from backup, and I'm inclined to simply remove the files. I know how to do this with ldiskfs but I don't know how to do it with ZFS. At this point I don't know how to proceed. |
| Comments |
| Comment by Andreas Dilger [ 22/Aug/16 ] |
|
Joe, you can mount a ZFS OST/MDT filesystem in a similar way as ldiskfs, after unmounting it from Lustre and enabling the canmount property (it is normally set to off, which prevents the dataset from being mounted automatically when the pool is imported):

oss# zfs set canmount=on <pool/ost>
oss# mount -t zfs <pool/ost> /mnt/ost
oss# <do stuff>
oss# umount /mnt/ost
oss# zfs set canmount=off <pool/ost>

It is important to know the list of files that were corrupted, since that may entail further recovery actions. |
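As a quick sanity check (a hedged aside, not from the original comment; the dataset name fscratch-OST0001/fscratch_ost01 is taken from the zpool output later in this ticket), the relevant properties can be verified before and after with:

oss# zfs get canmount,mountpoint fscratch-OST0001/fscratch_ost01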
| Comment by Joe Mervini [ 22/Aug/16 ] |
|
Thanks Andreas. I'll need to wait until after-hours to do this. Is it your expectation that eliminating the bad files will restore normal operations of the OST? |
| Comment by Andreas Dilger [ 22/Aug/16 ] |
|
It depends on which files are affected. At least one of them must be an internal Lustre file, otherwise the OST wouldn't have exhibited any problems precreating objects and would only have reported errors when a corrupted regular file object was accessed. Some files like last_rcvd can be deleted and are recreated automatically at the next mount (though this causes client recovery to fail). Other files like LAST_ID can be recreated if we know they are the problem (manually until Lustre 2.6, automatically with LFSCK in 2.6 and later). Regular file data cannot be recovered at this time (but we're working on that...). |
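As a hedged illustration of the manual LAST_ID check on ldiskfs (not part of the original comment; parameter and path names follow the commonly documented procedure and may differ by Lustre version, and for ZFS the on-disk path differs, as noted further down in this ticket):

mds# lctl get_param osp.fscratch-OST0001-osc-MDT0000.prealloc_last_id   # what the MDT thinks was last precreated (may be osc.* on older releases)
oss# od -Ax -td8 /mnt/ost/O/0/LAST_ID                                   # the value recorded on the OST itself, with the OST mounted as the backing fs

A large disagreement between the two values is one sign that LAST_ID is the file that was corrupted.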
| Comment by Joe Mervini [ 23/Aug/16 ] |
|
This looks like it's more serious than I originally believed. My assumption was that the directories that were identified contained the files that were corrupted, but now I'm not so sure. And I believe I might have gotten myself into the weeds a little bit.

I wasn't able to mount the pool without doing a "zfs set mountpoint=legacy <device>". Once mounted, I couldn't find the LAST_ID (at least it wasn't in the directory <mntpt>/O/0). I then tried running the procedure for listing the objects and got a message saying it had problems with the d0 directory and pretty much hung (unfortunately I didn't capture the message). Below is the output from zpool status -v:

[root@foss1 ~]# zpool status -v fscratch-OST0001
NAME STATE READ WRITE CKSUM
errors: Permanent errors have been detected in the following files:
        fscratch-OST0001/fscratch_ost01:<0x12b3205>

At that point I decided to back off. I did mount another OST for comparison but didn't go any further.

I was unsure how to reset the mountpoint to what it was originally. Initially I set the mountpoint for the pool to be the same as the Lustre mounted OST, but it didn't get the "inherited from" on the target. So I set it to none, and when I restarted Lustre I received "Unexpected return code from import of pool" for the OST and none of the 3 OSTs mounted Lustre automatically. Luckily I was able to manually mount the OSTs, but this will need to be resolved as well. Here is the output of zfs get mountpoint:

fscratch-OST0000 mountpoint none local |
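Since the correct property values depend on how the targets were originally formatted, a safer way back (a hedged sketch, not from the thread; the "known-good" dataset name is hypothetical) is to read the settings from an untouched OST and make the modified one match:

oss# zfs get canmount,mountpoint fscratch-OST0000/fscratch_ost00      # untouched target, for comparison
oss# zfs set mountpoint=none fscratch-OST0001/fscratch_ost01          # example: copy whatever the good target shows
oss# zfs set canmount=off fscratch-OST0001/fscratch_ost01
oss# zfs inherit mountpoint fscratch-OST0001/fscratch_ost01           # only if the good target shows the property as inherited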
| Comment by Andreas Dilger [ 23/Aug/16 ] |
|
Joe, it looks like some of the OST object directories are corrupted (O/0/d0, O/0/d1, ..., O/0/d31), which is what is causing the object precreates to fail, and would also prevent access to large numbers of OST objects. With ext4/ldiskfs these directories would be rebuilt by e2fsck, but ZFS will only do block-level reconstruction based on RAID-z parity and block checksums, and no "zfsck" exists to rebuild directory contents.

One option is to use LFSCK to do the rebuild of these directories, after renaming them (e.g. to d0.broken) so that it doesn't get errors from ZFS when trying to rebuild these directories. However, AFAIK this functionality to repair OSTs and to run OI Scrub with ZFS is only available in Lustre 2.6 and later. I've added Fan Yong, the LFSCK developer, to comment on this.

While we have tested this functionality, I'm not sure that it has been tested with persistent ZFS corruption, so some caution would be needed here. |
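For reference, a hedged sketch of how a layout LFSCK run is typically started and monitored on 2.6 and later (using the lctl lfsck_start interface; exact parameter paths may vary by release):

mds# lctl lfsck_start -M fscratch-MDT0000 -t layout              # layout check, which also engages the OSTs
mds# lctl get_param -n mdd.fscratch-MDT0000.lfsck_layout         # progress/status on the MDT
oss# lctl get_param -n obdfilter.fscratch-OST0001.lfsck_layout   # status on the OST side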
| Comment by nasf (Inactive) [ 23/Aug/16 ] |
Unfortunately, rebuilding the /O directory on the OSTs is currently only available for the ldiskfs backend, even on Lustre 2.6 or newer. For the ZFS backend, the layout LFSCK since Lustre 2.6 can verify or re-generate corrupted or lost LAST_ID files. |
| Comment by Joe Mervini [ 23/Aug/16 ] |
|
I did some examination of ZFS OSTs on a working test file system and I could not find a LAST_ID anywhere. Is it supposed to exist under ZFS? |
| Comment by Andreas Dilger [ 23/Aug/16 ] |
|
Joe, could you provide some information on your ZFS RAID configuration and how ZFS was corrupted? For the Lustre metadata directories there should be at least RAID-Z2 plus an extra copy at the ZFS level, but it seems that all of this redundancy wasn't enough. We've seen problems in the past where the same pool was imported on two different nodes at the same time, and the last_rcvd file was corrupted but could be deleted and rebuilt, but we haven't seen a case where ZFS directories were corrupted so badly as this. |
| Comment by Joe Mervini [ 23/Aug/16 ] |
|
Andreas - there's the rub: on this particular file system (our only ZFS FS, by the way) we more or less mirrored Livermore's configuration with the LSI (NetApp) 5560 storage arrays. Unlike Livermore, however, we did not have their huge number of OSS servers to assign a single OST per server, so we created our zpools with a single device, opting to rely on the storage array for reliability. |
| Comment by Andreas Dilger [ 23/Aug/16 ] |
|
Ah, so there is a single RAID LUN == VDEV per zpool? That would mean that the ZFS-level metadata redundancy (ditto blocks) would all go to the same LUN, and if that LUN becomes corrupted there is a high likelihood that the ditto copy on the same LUN would also be corrupted, unlike the common ZFS case where the ditto copy is on a separate LUN/VDEV. My understanding is that the ZFS configuration on LLNL Sequoia has 3 separate RAID LUNs (each one a separate VDEV) so that the ditto copies (2 or 3 copies depending on what metadata it is) are at least on separate devices. In that case, the corruption of one RAID LUN might cause corruption of regular file data, but the ZFS and Lustre metadata would survive.

As for moving forward with this problem, I can discuss with Fan Yong whether it is possible to implement the ability for LFSCK to fix the O/0/d* directories on ZFS during the OST object traversal. This would not necessarily be a quick fix, but might be worthwhile to wait for if the data on this OST is critical.

In the meantime, it may be possible to deactivate the OST on the MDS (lctl --device %fscratch-OST0001-osc-MDT0000 deactivate) so that it doesn't try to do the object precreate, and then mount the OST to at least make it available in a read-only mode to clients. There will still be I/O errors because at least 6 of 32 OST object directories are reporting errors and there may be problems doing object lookups, but it may also be that only a few blocks in each directory are bad and the OST may be mostly usable to get read access to files stored there. Since the OST is deactivated on the MDS, new files will not be created there. |
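A hedged sketch of the deactivate step plus a quick check that it took effect (the osp parameter name is an assumption for 2.4+ MDS-side devices and may appear as osc.* on older releases):

mds# lctl --device %fscratch-OST0001-osc-MDT0000 deactivate
mds# lctl dl | grep OST0001                                    # the device should now be listed as inactive
mds# lctl get_param osp.fscratch-OST0001-osc-MDT0000.active    # 0 = deactivated, no new objects allocated there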
| Comment by nasf (Inactive) [ 23/Aug/16 ] |
The "LAST_ID" for ldiskfs backend is named as "LAST_ID" under "/O/<seq>" directory. For ZFS backend, it is named as "0" under "/O/<seq>/d0" directory. |
| Comment by Joe Mervini [ 23/Aug/16 ] |
|
Andreas - Yes, you are correct with regard to both configurations, and we were pretty much discussing the same things here. We have kicked around the idea of perhaps trying to migrate the data off the bad OST, but the file system is at >76% full and we start running into problems when it approaches 80%, so that is problematic.

As it turns out, even if I try to force a file to be created on that OST via lfs setstripe -i 1, it will bounce it to a different OST. I have been able to access files on that OST by identifying them with lfs find, and I can at least read/delete and actually modify existing files that I have examined in my own directory space (the write operation took some time but surprisingly it completed). I have deactivated the OST in the meantime.

I look forward to any additional feedback or suggestions. Thanks Andreas.

Fan Yong - Thanks for the info. |
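For reference, a hedged sketch of enumerating the files that have objects on the affected OST (the client mount point /fscratch is an assumption):

client# lfs find /fscratch --obd fscratch-OST0001_UUID > /tmp/files_on_ost0001.txt
client# lfs getstripe -v <some file from the list>             # shows which OST object(s) back a given file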
| Comment by Andreas Dilger [ 23/Aug/16 ] |
|
OK, so it is good news at least that the OST is currently mounted and readable, and not totally offline. I assume you do not have a snapshot of the OST that would allow recovery, and it has likely been online long enough that the old uberblocks which might reference an uncorrupted version of the filesystem have been overwritten (there are only 256, and they are updated at 1s intervals). Note that deactivating the OST on the MDS will prevent the OST objects from being deleted due to |
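As an aside (a hedged sketch, not from the original comment), the uberblock history Andreas mentions can be inspected with zdb, and ZFS does have rewind-on-import options, though they only help while an older, uncorrupted txg still exists and would require the pool to be exported first:

oss# zdb -lu /dev/disk/by-id/<vdev>                            # list the uberblocks stored in the vdev labels (device path is a placeholder)
oss# zpool import -o readonly=on -F -n fscratch-OST0001        # dry-run (-n) rewind to the last usable txg, read-only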
| Comment by nasf (Inactive) [ 24/Aug/16 ] |
|
I will work on |
| Comment by nasf (Inactive) [ 27/Sep/16 ] |
|
What else can we do for this ticket in addition to the ongoing ZFS OI scrub? |
| Comment by Andreas Dilger [ 27/Sep/16 ] |
|
Besides the OI scrub to repair the corruption in the ZFS filesystem, I think the only other option is to migrate the files off this OST onto other OSTs and then reformat it. How much space this will consume on the other OSTs depends on how many OSTs there are. That isn't a great solution, but the repair tools for ZFS are somewhat less robust than for ldiskfs since ZFS itself doesn't get corrupted very easily. Once the OI Scrub functionality is available for ZFS we will be able to repair a fair amount of corruption in the backing filesystem, though it still wouldn't be possible to recover the user data in any OST objects that were corrupted. |
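A hedged sketch of the migrate-and-drain approach (the client mount point /fscratch is an assumption; with the OST deactivated on the MDS as discussed above, the rewritten copies will land on the other OSTs):

client# lfs find /fscratch --obd fscratch-OST0001_UUID -type f | lfs_migrate -y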
| Comment by Joe Mervini [ 01/Nov/16 ] |
|
I've circled back to this issue, and since we are unable to repair the target our only option is to either migrate the files on the read-only target to other OSTs or punt. One of the problems we have at the moment is that the capacity of the file system is at ~71%, and as a result migration is problematic. We concluded that the best course of action would be to purge older data from the file system to free up space, and tried using lfs find to gather file stats. Unfortunately, that method had to be abandoned because, of the ~1400 top-level directories, we have only been able to collect data on roughly 300 in the course of a month.

I was playing around on one of my test file systems and discovered that I can get all the data I want from the MDT (i.e., create, modify and access timestamps) by mounting the MDT as ldiskfs and running stat on the files in /ROOT/O. My question is, would it be prudent to take a system outage to collect this data, or am I safe having the device mounted both lustre and ldiskfs (read-only) at the same time? Thanks in advance. |
| Comment by Andreas Dilger [ 01/Nov/16 ] |
|
With ldiskfs it is relatively safe to mount the MDT as type ldiskfs and collect read-only data from the filesystem. The local MDT data will not contain file sizes, only filenames, owners, timestamps, etc. We don't test this mode officially, but I use it to poke around in my home filesystem, though that filesystem has very low activity. It is not possible to do this with ZFS.

How many OSTs do you have? I would think that if the filesystem is currently 71% full and you have more than 10-12 OSTs (each 70% full), then removing one of them would only increase space usage by 6-7% or less on the remaining OSTs, which should still be OK? I'd suspect that the OST in question already has less than 70% usage, because it can delete files but not create new ones, so there may be less data that needs to be migrated. |
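A hedged sketch of that read-only inspection (device and mount point names are placeholders; file sizes reported from the MDT will be zero or meaningless, as noted):

mds# mount -t ldiskfs -o ro /dev/mapper/<mdt_device> /mnt/mdt-ldiskfs
mds# stat /mnt/mdt-ldiskfs/ROOT/jamervi/scripts/check-key-image-files   # owner and a/m/ctime come from the MDT inode
mds# umount /mnt/mdt-ldiskfs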
| Comment by nasf (Inactive) [ 02/Nov/16 ] |
|
For a given Lustre striped regular file, its attributes are distributed across several targets (MDT and OSTs). The filename, owner, mode and permissions are stored on the MDT. Each stripe (OST-object) stores its own size/blocks attributes, and the client calculates the file's size/blocks by summing all of the related OST-objects' size/blocks attributes. As for the xtime (atime, mtime, ctime) attributes, both the MDT-object and the OST-objects store timestamps; which one is used as the file's xtime depends on which ctime is newer.

So by directly mounting the target as the backend filesystem, whether ldiskfs or ZFS, you can NOT directly get the file size and xtime attributes. But you can calculate those attributes: for a given OST-object, you can get its MDT-object's FID from the OST-object's PFID EA, and the OST-objects with the same MDT-object FID belong to the same file. You can then sum the related OST-objects' sizes and compare their ctimes, which gives you the file's size/time attributes. But considering such a process, I am not sure whether it is really more efficient than directly collecting the information from a Lustre client. |
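As a hedged illustration only (it assumes the PFID EA is exposed as the trusted.fid xattr on a locally mounted OST and that the object path is already known; the exact xattr name, visibility, and encoding vary by backend and release):

oss# getfattr -e hex -n trusted.fid /mnt/ost/O/0/d5/<objid>    # <objid> is a placeholder; the hex value embeds the parent MDT-object FID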
| Comment by Joe Mervini [ 02/Nov/16 ] |
|
So just to be clear, if I do a stat on a particular file in ROOT/O/<username> that could be wildly different than the associated object on the OST? For example I have a file similar to this: File: `jamervi/scripts/check-key-image-files' Is it possible that what is stored on the OST could be perhaps a year or more different? I am not particularly interested in the file sizes. What I need is a WAG at the age of the files on the file system. |
| Comment by nasf (Inactive) [ 03/Nov/16 ] |
Normally, when a file is closed, the file's 'atime' (access time) on the MDT is updated to the current time. But if there is some failure during the close (a very rare case), the 'atime' will not be updated. On the other hand, the utime() operation can set the 'atime' to any value on the MDT, but at the same time utime() causes the 'ctime' to be updated to the current time. In your case, it does not appear that utime() changed the xtime, so I tend to say that the file `jamervi/scripts/check-key-image-files' has not been opened since 2015-04-14 13:54:09 (the last close timestamp). |
| Comment by nasf (Inactive) [ 04/Dec/17 ] |
|
jamervi, any further feedback for this ticket? Thanks! |
| Comment by Joe Mervini [ 04/Dec/17 ] |
|
You can cancel this ticket. Sorry to not have closed it before now. We decided to chuck ZFS on this file system due to other issues and go back to LDISKFS. All's good with that. |