Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version: Lustre 2.5.3
    • Environment: TOSS 2.4-9
    • 2
    • 9223372036854775807

    Description

      After a power outage we encountered a hardware error on one of our storage devices that essentially corrupted ~30 files on one of the OSTs. Since then the OST has been read-only and is throwing the following log messages:

      [ 351.029519] LustreError: 8974:0:(ofd_obd.c:1376:ofd_create()) fscratch-OST0001: unable to precreate: rc = -5
      [ 360.762505] LustreError: 8963:0:(ofd_obd.c:1376:ofd_create()) fscratch-OST0001: unable to precreate: rc = -5
      [ 370.784372] LustreError: 8974:0:(ofd_obd.c:1376:ofd_create()) fscratch-OST0001: unable to precreate: rc = -5

      I have scrubbed the device in question and rebooted the system to bring the server up normally, but I am still unable to create a file on that OST.

      zpool status -v reports the damaged files and recommends restoring from backup, and I'm inclined to simply remove the files. I know how to do this with ldiskfs, but I don't know how with ZFS. At this point I don't know how to proceed.
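
      For reference, the ZFS-level commands involved on the OSS look roughly like the following (the pool name fscratch-ost1 is made up; substitute the real name from zpool list):

        # show which files/objects ZFS has flagged as permanently damaged
        zpool status -v fscratch-ost1

        # re-run a full scrub, then clear the error counters once it completes
        zpool scrub fscratch-ost1
        zpool status fscratch-ost1
        zpool clear fscratch-ost1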

    Attachments

    Issue Links

    Activity

            [LU-8521] ZFS OST is unwritable

            adilger Andreas Dilger added a comment -

            Besides the OI scrub to repair the corruption in the ZFS filesystem, I think the only other option is to migrate the files off this OST onto other OSTs and then reformat it. How much space this will consume on the other OSTs depends on how many OSTs there are.

            That isn't a great solution, but the repair tools for ZFS are somewhat less robust than for ldiskfs since ZFS itself doesn't get corrupted very easily. Once the OI Scrub functionality is available for ZFS we will be able to repair a fair amount of corruption in the backing filesystem, though it still wouldn't be possible to recover the user data in any OST objects that were corrupted.
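
            A rough sketch of such a drain, assuming the filesystem is mounted on a client at /mnt/fscratch (that path is not given in this ticket), might be:

              # on the MDS: stop new objects from being allocated on the bad OST
              lctl --device %fscratch-OST0001-osc-MDT0000 deactivate

              # on a client: find files with objects on that OST and migrate them off
              lfs find /mnt/fscratch --ost fscratch-OST0001_UUID | lfs_migrate -y

            Files whose objects on this OST are themselves corrupted would still fail with I/O errors during the copy and would have to be restored from backup instead.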

            yong.fan nasf (Inactive) added a comment -

            What else can we do for this ticket in addition to the ongoing ZFS OI scrub work?

            yong.fan nasf (Inactive) added a comment -

            I will work on LU-7585 to implement ZFS OI scrub, which will allow us to rebuild the "/O" directory on the OST. To some degree, such functionality may be helpful for this issue. But while the OI scrub can rebuild the OI mappings, it cannot repair corrupted data. If the related objects are themselves corrupted (quite possible), rather than (only) the related entries being missing from the "/O" tree, then the ZFS OI scrub cannot help much. In any case, such work will not be ready in the short term, so your current solution is quite necessary.

            adilger Andreas Dilger added a comment -

            OK, so it is good news at least that the OST is currently mounted and readable, and not totally offline. I assume you do not have a snapshot of the OST that would allow recovery, and it has likely been online long enough that the old uberblocks which might reference an uncorrupted version of the filesystem have been overwritten (there are only 256, and they are updated at 1s intervals).
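
            (For reference, I believe the uberblock arrays in the vdev labels can be dumped with zdb; the device path below is purely illustrative:)

              zdb -ul /dev/mapper/ost1-lun | grep -A3 Uberblock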

            Note that deactivating the OST on the MDS will prevent the OST objects from being deleted due to LU-4295, but that may be a benefit if the OST is corrupt and deleting files may cause more problems, and it is also not an issue if you would be draining the whole OST and reformatting it. Since the OST can't precreate objects, it may be that marking it deactivated on the MDS is redundant, but it is not otherwise harmful.
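
            A minimal sketch of that deactivation (run on the MDS; the device name follows the fscratch-OST0001 naming from the logs above):

              lctl --device %fscratch-OST0001-osc-MDT0000 deactivate
              # confirm the device is now marked inactive (IN) in the device list
              lctl dl | grep OST0001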

            jamervi Joe Mervini added a comment -

            Andreas - Yes, you are correct with regard to both configurations, and we were pretty much discussing the same things here. We have kicked around the idea of trying to migrate the data off the bad OST, but the file system is >76% full and we start running into problems when it approaches 80%, so that is problematic.

            As it turns out, even if I try to force a file to be created on that OST via lfs setstripe -i 1, it will bounce to a different OST. I have been able to access files on that OST by identifying them with lfs find, and I can at least read/delete and even modify existing files that I have examined in my own directory space (the write operation took some time but surprisingly completed). I have deactivated the OST in the meantime.
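
            (Concretely, those operations look roughly like the following; /mnt/fscratch is an assumed client mount point, not taken from this ticket:)

              # try to pin a new file to OST index 1 and check where it actually landed
              lfs setstripe -i 1 /mnt/fscratch/testfile
              lfs getstripe /mnt/fscratch/testfile

              # list files that have objects on the damaged OST
              lfs find /mnt/fscratch --ost fscratch-OST0001_UUID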

            I look forward to any additional feedback or suggestions. Thanks Andreas

            Fan Yong - Thanks for the info.

            yong.fan nasf (Inactive) added a comment -

            I did some examination of ZFS OSTs on a working test file system and I could not find a LAST_ID anywhere. Is it supposed to exist under ZFS?

            For the ldiskfs backend, the "LAST_ID" file is named "LAST_ID" under the "/O/<seq>" directory. For the ZFS backend, it is named "0" under the "/O/<seq>/d0" directory.
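
            (A sketch of how one might poke at this from the OSS with zdb; the dataset name fscratch-ost1/ost1 is invented, and the object numbers to follow differ per filesystem:)

              # dump all objects in the OST dataset, including ZAP name => object-number entries,
              # then follow "O" -> "<seq>" -> "d0" -> "0" through the output
              zdb -dddd fscratch-ost1/ost1 | less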

            adilger Andreas Dilger added a comment - - edited

            Ah, so there is a single RAID LUN == VDEV per zpool? That would mean that the ZFS-level metadata redundancy (ditto blocks) would all go to the same LUN, and if that LUN becomes corrupted there is a high likelihood that the ditto copy on the same LUN would also be corrupted, unlike the common ZFS case where the ditto copy is on a separate LUN/VDEV.

            My understanding is that the ZFS configuration on LLNL Sequoia has 3 separate RAID LUNs (each one a separate VDEV) so that the ditto copies (2 or 3 copies depending on what metadata it is) are at least on separate devices. In that case, the corruption of one RAID LUN might cause corruption of regular file data, but the ZFS and Lustre metadata would survive.

            As for moving forward with this problem, I can discuss with Fan Yong whether it is possible to implement the ability for LFSCK to fix the O/0/d* directories on ZFS during the OST object traversal. This would not necessarily be a quick fix, but it might be worthwhile to wait for if the data on this OST is critical. In the meantime, it may be possible to deactivate the OST on the MDS (lctl --device %fscratch-OST0001-osc-MDT0000 deactivate) so that it doesn't try to do the object precreate, and then mount the OST to at least make it available in a read-only mode to clients. There will still be I/O errors because at least 6 of 32 OST object directories are reporting errors and there may be problems doing object lookups, but it may also be that only a few blocks in each directory are bad, so the OST may be mostly usable to get read access to the files stored there. Since the OST is deactivated on the MDS, new files will not be created there.

            jamervi Joe Mervini added a comment -

            Andreas - there's the rub: On this particular file system (our only ZFS FS, by the way) we more or less mirrored Livermore's configuration with the LSI (NetApp) 5560 storage arrays, but unlike Livermore we did not have the huge number of OSS servers they did to assign a single OST per server, so we created our zpools with a single device, opting to rely on the storage array for reliability.
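
            (Purely illustrative zpool layouts, with invented device names, to show the difference being discussed: with a single RAID LUN per pool all ZFS ditto copies land on that one LUN, while with several LUNs per pool the metadata copies can be spread across vdevs:)

              # single RAID LUN per pool (our configuration)
              zpool create ost1pool /dev/mapper/lun0

              # multiple RAID LUNs per pool (closer to the LLNL layout described above)
              zpool create ost1pool /dev/mapper/lun0 /dev/mapper/lun1 /dev/mapper/lun2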

            adilger Andreas Dilger added a comment -

            Joe, could you provide some information on your ZFS RAID configuration and how ZFS was corrupted? For the Lustre metadata directories there should be at least RAID-Z2 plus an extra copy at the ZFS level, but it seems that all of this redundancy wasn't enough. We've seen problems in the past where the same pool was imported on two different nodes at the same time, and the last_rcvd file was corrupted but could be deleted and rebuilt, but we haven't seen a case where ZFS directories were corrupted so badly as this.

            jamervi Joe Mervini added a comment -

            I did some examination of ZFS OSTs on a working test file system and I could not find a LAST_ID anywhere. Is it supposed to exist under ZFS?

            yong.fan nasf (Inactive) added a comment -

            One option is to use LFSCK to do the rebuild of these directories, after renaming them (e.g. to d0.broken) so that it doesn't get errors from ZFS when trying to rebuild these directories. However, AFAIK this functionality to repair OSTs and to run OI Scrub with ZFS is only available in Lustre 2.6 and later. I've added Fan Yong, the LFSCK developer, to comment on this. While we have tested this functionality, I'm not sure that it has been tested with persistent ZFS corruption, so some caution would be needed here.

            Unfortunately, rebuilding the /O directory on the OSTs is currently only available for the ldiskfs backend, even on Lustre 2.6 or newer. For the ZFS backend, the layout LFSCK since Lustre 2.6 can verify or re-generate corrupted or lost LAST_ID files.
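
            (For whenever the servers are on Lustre 2.6 or later, a minimal sketch of starting and monitoring the layout LFSCK from the MDS; the MDT name simply follows the fscratch naming used above:)

              lctl lfsck_start -M fscratch-MDT0000 -t layout
              lctl get_param -n mdd.fscratch-MDT0000.lfsck_layout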

    People

      Assignee: yong.fan nasf (Inactive)
      Reporter: jamervi Joe Mervini
      Votes: 1
      Watchers: 5
