Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.5.3
    • None
    • TOSS 2.4-9
    • 2
    • 9223372036854775807

    Description

      After a power outage we encountered a hardware error on one of our storage devices that essentially corrupted ~30 files on one of the OSTs. Since then the OST has been read-only and is throwing the following log messages:

      [ 351.029519] LustreError: 8974:0:(ofd_obd.c:1376:ofd_create()) fscratch-OST0001: unable to precreate: rc = -5
      [ 360.762505] LustreError: 8963:0:(ofd_obd.c:1376:ofd_create()) fscratch-OST0001: unable to precreate: rc = -5
      [ 370.784372] LustreError: 8974:0:(ofd_obd.c:1376:ofd_create()) fscratch-OST0001: unable to precreate: rc = -5

      I have scrubbed the device in question and rebooted the system to bring the server up normally, but I am still unable to create a file on that OST.

      zpool status -v reports the damaged files and recommends restoring from backup, and I'm inclined to simply remove the files. I know how to do this with ldiskfs but I don't know how to do it with ZFS. At this point I'm not sure how to proceed.

    Attachments

    Issue Links

    Activity

            [LU-8521] ZFS OST is unwritable

            yong.fan nasf (Inactive) added a comment -

            I did some examination of ZFS OSTs on a working test file system and I could not find a LAST_ID anywhere. Is it supposed to exist under ZFS?

            The "LAST_ID" for the ldiskfs backend is named "LAST_ID" under the "/O/<seq>" directory. For the ZFS backend, it is named "0" under the "/O/<seq>/d0" directory.
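            For comparison with the ldiskfs procedure, a minimal sketch of inspecting that object on a mounted ZFS OST, reusing the canmount mount steps shown further down in this ticket; the dataset name and the od flags are illustrative assumptions:

            oss# zfs set canmount=on fscratch-OST0001/fscratch_ost01
            oss# mount -t zfs fscratch-OST0001/fscratch_ost01 /mnt/ost
            oss# ls /mnt/ost/O/0/d0                  # the ZFS counterpart of LAST_ID is the object named "0"
            oss# od -Ax -td8 /mnt/ost/O/0/d0/0       # dump the stored last-allocated object id
            oss# umount /mnt/ost
            oss# zfs set canmount=off fscratch-OST0001/fscratch_ost01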
            adilger Andreas Dilger added a comment - edited

            Ah, so there is a single RAID LUN == VDEV per zpool? That would mean that the ZFS-level metadata redundancy (ditto blocks) would all go to the same LUN, and if that LUN becomes corrupted there is a high likelihood that the ditto copy on the same LUN would also be corrupted, unlike the common ZFS case where the ditto copy is on a separate LUN/VDEV.

            My understanding is that the ZFS configuration on LLNL Sequoia has 3 separate RAID LUNs (each one a separate VDEV) so that the ditto copies (2 or 3 copies depending on what metadata it is) are at least on separate devices. In that case, the corruption of one RAID LUN might cause corruption of regular file data, but the ZFS and Lustre metadata would survive.

            As for moving forward with this problem, I can discuss with Fan Yong whether it is possible to implement the ability for LFSCK to fix the O/0/d* directories on ZFS during the OST object traversal. This would not necessarily be a quick fix, but might be worthwhile to wait for if the data on this OST is critical. In the meantime, it may be possible to deactivate the OST on the MDS (lctl --device %fscratch-OST0001-osc-MDT0000 deactivate) so that it doesn't try to do the object precreate, and then mount the OST to at least make it available in a read-only mode to clients. There will still be I/O errors because at least 6 of 32 OST object directories are reporting errors and there may be problems doing object lookups, but it may also be that only a few blocks in each directory are bad and the OST may be mostly usable to get read access to files stored there. Since the OST is deactivated on the MDS, new files will not be created there.

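            In command form, a minimal sketch of that deactivation run on the MDS; the check of the active parameter is an assumption and its exact path varies by release:

            mds# lctl dl | grep OST0001                            # confirm the exact device name
            mds# lctl --device %fscratch-OST0001-osc-MDT0000 deactivate
            mds# lctl get_param osp.fscratch-OST0001*.active       # assumed path; 0 means deactivated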
            jamervi Joe Mervini added a comment -

            Andreas - there's the rub: on this particular file system (our only ZFS file system, by the way) we more or less mirrored Livermore's configuration with the LSI (NetApp) 5560 storage arrays. Unlike Livermore, though, we did not have their huge number of OSS servers to assign a single OST per server, so we created our zpools with a single device, opting to rely on the storage array for reliability.

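            To illustrate the single-VDEV versus multi-VDEV distinction being discussed (device names are placeholders, not the actual hardware):

            # One RAID LUN as the only VDEV: all ZFS ditto copies of metadata land on that one LUN.
            oss# zpool create fscratch-OST0001 mpatha
            # Several RAID LUNs as separate VDEVs: ditto copies can be placed on different devices.
            oss# zpool create fscratch-OSTxxxx mpatha mpathb mpathc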

            adilger Andreas Dilger added a comment -

            Joe, could you provide some information on your ZFS RAID configuration and how ZFS was corrupted? For the Lustre metadata directories there should be at least RAID-Z2 plus an extra copy at the ZFS level, but it seems that all of this redundancy wasn't enough. We've seen problems in the past where the same pool was imported on two different nodes at the same time and the last_rcvd file was corrupted but could be deleted and rebuilt; we haven't seen a case where ZFS directories were corrupted as badly as this.
            jamervi Joe Mervini added a comment -

            I did some examination of ZFS OSTs on a working test file system and I could not find a LAST_ID anywhere. Is it supposed to exist under ZFS?


            yong.fan nasf (Inactive) added a comment -

            One option is to use LFSCK to do the rebuild of these directories, after renaming them (e.g. to d0.broken) so that it doesn't get errors from ZFS when trying to rebuild these directories. However, AFAIK this functionality to repair OSTs and to run OI Scrub with ZFS is only available in Lustre 2.6 and later. I've added Fan Yong, the LFSCK developer, to comment on this. While we have tested this functionality, I'm not sure that it has been tested with persistent ZFS corruption, so some caution would be needed here.

            Unfortunately, rebuilding the /O directory on the OSTs is currently only available for the ldiskfs backend, even on Lustre 2.6 or newer. For the ZFS backend, the layout LFSCK since Lustre 2.6 can verify or regenerate corrupted or lost LAST_ID files.
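            For reference, a minimal sketch of the layout LFSCK invocation on Lustre 2.6 and later that is being referred to (not applicable to the 2.5.3 servers in this ticket without an upgrade; verify flags against your release):

            mds# lctl lfsck_start -M fscratch-MDT0000 -t layout
            mds# lctl get_param -n mdd.fscratch-MDT0000.lfsck_layout    # watch status and repaired counters
            mds# lctl lfsck_stop -M fscratch-MDT0000                    # stop it early if needed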

            adilger Andreas Dilger added a comment -

            Joe, it looks like some of the OST object directories are corrupted (O/0/d0, O/0/d1, ..., O/0/d31) which is what is causing the object precreates to fail, and would also prevent access to large numbers of OST objects. With ext4/ldiskfs these directories would be rebuilt by e2fsck, but ZFS will only do block-level reconstruction based on RAID-z parity and block checksums, and no "zfsck" exists to rebuild directory contents.

            One option is to use LFSCK to do the rebuild of these directories, after renaming them (e.g. to d0.broken) so that it doesn't get errors from ZFS when trying to rebuild these directories. However, AFAIK this functionality to repair OSTs and to run OI Scrub with ZFS is only available in Lustre 2.6 and later. I've added Fan Yong, the LFSCK developer, to comment on this. While we have tested this functionality, I'm not sure that it has been tested with persistent ZFS corruption, so some caution would be needed here.
            jamervi Joe Mervini added a comment -

            This looks like it's more serious than I originally believed. My assumption was that the directories that were identified contained the files that were corrupted, but now I'm not so sure. And I believe I might have gotten myself into the weeds a little bit.

            I wasn't able to mount the pool without doing a "zfs set mountpoint=legacy <device>". Once mounted, I couldn't find the LAST_ID (at least it wasn't in the <mntpt>/O/0 directory). I then tried running the procedure for listing the objects and got a message saying it had problems with the d0 directory, and it pretty much hung (unfortunately I didn't capture the message).

            Below is the output from zpool status -v:

            [root@foss1 ~]# zpool status -v fscratch-OST0001
            pool: fscratch-OST0001
            state: ONLINE
            status: One or more devices has experienced an error resulting in data
            corruption. Applications may be affected.
            action: Restore the file in question if possible. Otherwise restore the
            entire pool from backup.
            see: http://zfsonlinux.org/msg/ZFS-8000-8A
            scan: scrub repaired 84.5K in 126h42m with 50 errors on Sat Aug 20 15:58:43 2016
            config:

            NAME                                   STATE   READ WRITE CKSUM
            fscratch-OST0001                       ONLINE     0     0    63
              360080e5000297410000008c5543fd7ef    ONLINE     0     0   252

            errors: Permanent errors have been detected in the following files:

            fscratch-OST0001/fscratch_ost01:<0x12b3205>
            fscratch-OST0001/fscratch_ost01:<0x3da4a0a>
            fscratch-OST0001/fscratch_ost01:<0x8a340f>
            fscratch-OST0001/fscratch_ost01:<0x286a10>
            fscratch-OST0001/fscratch_ost01:<0x11b691f>
            fscratch-OST0001/fscratch_ost01:<0x2425d25>
            fscratch-OST0001/fscratch_ost01:<0x2331526>
            fscratch-OST0001/fscratch_ost01:<0x5c24038>
            fscratch-OST0001/fscratch_ost01:<0x1292849>
            fscratch-OST0001/fscratch_ost01:<0x3e23e4a>
            fscratch-OST0001/fscratch_ost01:<0x1e1eb4a>
            fscratch-OST0001/fscratch_ost01:<0x112bf4b>
            fscratch-OST0001/fscratch_ost01:<0x1e1eb4e>
            fscratch-OST0001/fscratch_ost01:<0x112bf4f>
            fscratch-OST0001/fscratch_ost01:<0x737f56>
            fscratch-OST0001/fscratch_ost01:<0x1e33371>
            fscratch-OST0001/fscratch_ost01:<0x232e72>
            fscratch-OST0001/fscratch_ost01:<0x333887b>
            fscratch-OST0001/fscratch_ost01:<0x3e23f8a>
            fscratch-OST0001/fscratch_ost01:<0x12b9699>
            fscratch-OST0001/fscratch_ost01:<0x31c099b>
            fscratch-OST0001/fscratch_ost01:/O/0/d0
            fscratch-OST0001/fscratch_ost01:/O/0/d1
            fscratch-OST0001/fscratch_ost01:<0x34b6dc4>
            fscratch-OST0001/fscratch_ost01:<0x12b31c5>
            fscratch-OST0001/fscratch_ost01:<0xe334d0>
            fscratch-OST0001/fscratch_ost01:<0xe334d3>
            fscratch-OST0001/fscratch_ost01:/O/0/d28
            fscratch-OST0001/fscratch_ost01:/O/0/d29
            fscratch-OST0001/fscratch_ost01:/O/0/d30
            fscratch-OST0001/fscratch_ost01:/O/0/d31
            fscratch-OST0001/fscratch_ost01:<0x1e818f3>

            At that point I decided to back off. I did mount another OST for comparison but didn't go any further. I was unsure how to reset the mountpoint to what it was originally. Initially I set the mountpoint for the pool to be the same as the Lustre-mounted OST, but it didn't show as "inherited from" on the target. So I set it to none, and when I restarted Lustre I received "Unexpected return code from import of pool" for the OST and none of the 3 OSTs mounted. Luckily I was able to manually mount the OSTs, but this will need to be resolved as well.

            Here is the output to zfs get mountpoint:

            fscratch-OST0000                 mountpoint  none                                    local
            fscratch-OST0000/fscratch_ost00  mountpoint  /mnt/lustre/local/zpool/fscratch_ost00  local
            fscratch-OST0001                 mountpoint  none                                    local
            fscratch-OST0001/fscratch_ost01  mountpoint  /mnt/lustre/local/zpool/fscratch_ost01  local
            fscratch-OST0002                 mountpoint  /mnt/lustre/local/zpool                 local
            fscratch-OST0002/fscratch_ost02  mountpoint  /mnt/lustre/local/zpool/fscratch_ost02  inherited from fscratch-OST0002

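            The <0x...> entries in the zpool status output above are ZFS object numbers that no longer resolve to a path name; a hedged sketch of inspecting one of them with zdb (object numbers are given in decimal here, and the level of detail in zdb output varies by version):

            oss# printf '%d\n' 0x12b3205                     # 19608069
            oss# zdb -dddd fscratch-OST0001/fscratch_ost01 19608069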

            adilger Andreas Dilger added a comment -

            It depends on which files are affected. At least one of them must be an internal Lustre file, or it wouldn't have exhibited any problems precreating files and would only have reported errors when a corrupted regular file object was accessed. Some files like last_rcvd can be deleted and are recreated automatically at the next mount (but this causes client recovery to fail). Other files like LAST_ID can be recreated if we know they are the problem (manually until Lustre 2.6, automatically with LFSCK after 2.6). Regular file data cannot be recovered at this time (but we're working on that...).
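            If last_rcvd turns out to be one of the damaged objects, a cautious sketch of deleting it so it is recreated at the next mount, reusing the canmount mount procedure shown further down in this ticket (expect clients to be evicted rather than recovered):

            oss# zfs set canmount=on fscratch-OST0001/fscratch_ost01
            oss# mount -t zfs fscratch-OST0001/fscratch_ost01 /mnt/ost
            oss# cp -a /mnt/ost/last_rcvd /root/last_rcvd.bak    # keep a copy if it is still readable
            oss# rm /mnt/ost/last_rcvd
            oss# umount /mnt/ost
            oss# zfs set canmount=off fscratch-OST0001/fscratch_ost01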
            jamervi Joe Mervini added a comment -

            Thanks Andreas. I'll need to wait until after-hours to do this. Is it your expectation that eliminating the bad files will restore normal operations of the OST?


            adilger Andreas Dilger added a comment -

            Joe, you can mount a ZFS OST/MDT filesystem in a similar way as ldiskfs, after unmounting it from Lustre and enabling the canmount property (which is normally off so that ZFS does not automatically mount the dataset when the pool is imported):

            oss# zfs set canmount=on <pool/ost>
            oss# mount -t zfs <pool/ost> /mnt/ost
            oss# <do stuff>
            oss# umount /mnt/ost
            oss# zfs set canmount=off <pool/ost>
            

            It is important to know the list of files that were corrupted, since that may entail further recovery actions.

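            A small usage example of capturing that list while the dataset is mounted as above (paths follow the zpool status output earlier in this ticket):

            oss# zpool status -v fscratch-OST0001 > /root/fscratch-OST0001-errors.txt
            oss# ls -l /mnt/ost/O/0/                             # see which d* directories are still readable
            oss# ls /mnt/ost/O/0/d0 | head                       # a corrupted directory may return I/O errors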

            People

              yong.fan nasf (Inactive)
              jamervi Joe Mervini
              Votes: 1
              Watchers: 5
