Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.5.3
    • None
    • TOSS 2.4-9
    • 2
    • 9223372036854775807

    Description

      After a power outage we encountered a hardware error on one of our storage devices that essentially corrupted ~30 files on one of the OSTs. Since then the OST has been read-only and is throwing the following log messages:

      [ 351.029519] LustreError: 8974:0:(ofd_obd.c:1376:ofd_create()) fscratch-OST0001: unable to precreate: rc = -5
      [ 360.762505] LustreError: 8963:0:(ofd_obd.c:1376:ofd_create()) fscratch-OST0001: unable to precreate: rc = -5
      [ 370.784372] LustreError: 8974:0:(ofd_obd.c:1376:ofd_create()) fscratch-OST0001: unable to precreate: rc = -5

      I have scrubbed the device in question and rebooted the system to bring the server up normally, but I am still unable to create a file on that OST.

      zpool status -v reports the damaged files and recommends restoring from backup, and I'm inclined to simply remove the files. I know how to do this with ldiskfs but I don't know how to do it with ZFS. At this point I'm not sure how to proceed.

    Attachments

    Issue Links

    Activity

            [LU-8521] ZFS OST is unwritable

            yong.fan nasf (Inactive) added a comment -

            I did some examination of ZFS OSTs on a working test file system and I could not find a LAST_ID anywhere. Is it supposed to exist under ZFS?

            The "LAST_ID" for the ldiskfs backend is named "LAST_ID" under the "/O/<seq>" directory. For the ZFS backend, it is named "0" under the "/O/<seq>/d0" directory.
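            For comparison with the ldiskfs procedure, a minimal sketch of inspecting that object on a mounted ZFS OST, reusing the canmount mount steps shown further down in this ticket; the dataset name and the od flags are illustrative assumptions:

            oss# zfs set canmount=on fscratch-OST0001/fscratch_ost01
            oss# mount -t zfs fscratch-OST0001/fscratch_ost01 /mnt/ost
            oss# ls /mnt/ost/O/0/d0                  # the ZFS counterpart of LAST_ID is the object named "0"
            oss# od -Ax -td8 /mnt/ost/O/0/d0/0       # dump the stored last-allocated object id
            oss# umount /mnt/ost
            oss# zfs set canmount=off fscratch-OST0001/fscratch_ost01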
            adilger Andreas Dilger added a comment - edited

            Ah, so there is a single RAID LUN == VDEV per zpool? That would mean that the ZFS-level metadata redundancy (ditto blocks) would all go to the same LUN, and if that LUN becomes corrupted there is a high likelihood that the ditto copy on the same LUN would also be corrupted, unlike the common ZFS case where the ditto copy is on a separate LUN/VDEV.

            My understanding is that the ZFS configuration on LLNL Sequoia has 3 separate RAID LUNs (each one a separate VDEV) so that the ditto copies (2 or 3 copies depending on what metadata it is) are at least on separate devices. In that case, the corruption of one RAID LUN might cause corruption of regular file data, but the ZFS and Lustre metadata would survive.

            As for moving forward with this problem, I can discuss with Fan Yong whether it is possible to implement the ability for LFSCK to fix the O/0/d* directories on ZFS during the OST object traversal. This would not necessarily be a quick fix, but might be worthwhile to wait for if the data on this OST is critical. In the meantime, it may be possible to deactivate the OST on the MDS (lctl --device %fscratch-OST0001-osc-MDT0000 deactivate) so that it doesn't try to do the object precreate, and then mount the OST to at least make it available in a read-only mode to clients. There will still be I/O errors because at least 6 of 32 OST object directories are reporting errors and there may be problems doing object lookups, but it may also be that only a few blocks in each directory are bad and the OST may be mostly usable to get read access to files stored there. Since the OST is deactivated on the MDS, new files will not be created there.

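            In command form, a minimal sketch of that deactivation run on the MDS; the check of the active parameter is an assumption and its exact path varies by release:

            mds# lctl dl | grep OST0001                            # confirm the exact device name
            mds# lctl --device %fscratch-OST0001-osc-MDT0000 deactivate
            mds# lctl get_param osp.fscratch-OST0001*.active       # assumed path; 0 means deactivated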
            jamervi Joe Mervini added a comment -

            Andreas - there's the rub: on this particular file system (our only ZFS file system, by the way) we more or less mirrored Livermore's configuration with the LSI (NetApp) 5560 storage arrays. Unlike Livermore, though, we did not have their huge number of OSS servers to assign a single OST per server, so we created our zpools with a single device, opting to rely on the storage array for reliability.

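            To illustrate the single-VDEV versus multi-VDEV distinction being discussed (device names are placeholders, not the actual hardware):

            # One RAID LUN as the only VDEV: all ZFS ditto copies of metadata land on that one LUN.
            oss# zpool create fscratch-OST0001 mpatha
            # Several RAID LUNs as separate VDEVs: ditto copies can be placed on different devices.
            oss# zpool create fscratch-OSTxxxx mpatha mpathb mpathc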

            adilger Andreas Dilger added a comment -

            Joe, could you provide some information on your ZFS RAID configuration and how ZFS was corrupted? For the Lustre metadata directories there should be at least RAID-Z2 plus an extra copy at the ZFS level, but it seems that all of this redundancy wasn't enough. We've seen problems in the past where the same pool was imported on two different nodes at the same time and the last_rcvd file was corrupted but could be deleted and rebuilt; we haven't seen a case where ZFS directories were corrupted as badly as this.
            jamervi Joe Mervini added a comment -

            I did some examination of ZFS OSTs on a working test file system and I could not find a LAST_ID anywhere. Is it supposed to exist under ZFS?


            yong.fan nasf (Inactive) added a comment -

            One option is to use LFSCK to do the rebuild of these directories, after renaming them (e.g. to d0.broken) so that it doesn't get errors from ZFS when trying to rebuild these directories. However, AFAIK this functionality to repair OSTs and to run OI Scrub with ZFS is only available in Lustre 2.6 and later. I've added Fan Yong, the LFSCK developer, to comment on this. While we have tested this functionality, I'm not sure that it has been tested with persistent ZFS corruption, so some caution would be needed here.

            Unfortunately, rebuilding the /O directory on the OSTs is currently only available for the ldiskfs backend, even on Lustre 2.6 or newer. For the ZFS backend, the layout LFSCK since Lustre 2.6 can verify or regenerate corrupted or lost LAST_ID files.
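            For reference, a minimal sketch of the layout LFSCK invocation on Lustre 2.6 and later that is being referred to (not applicable to the 2.5.3 servers in this ticket without an upgrade; verify flags against your release):

            mds# lctl lfsck_start -M fscratch-MDT0000 -t layout
            mds# lctl get_param -n mdd.fscratch-MDT0000.lfsck_layout    # watch status and repaired counters
            mds# lctl lfsck_stop -M fscratch-MDT0000                    # stop it early if needed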

            adilger Andreas Dilger added a comment -

            Joe, it looks like some of the OST object directories are corrupted (O/0/d0, O/0/d1, ..., O/0/d31) which is what is causing the object precreates to fail, and would also prevent access to large numbers of OST objects. With ext4/ldiskfs these directories would be rebuilt by e2fsck, but ZFS will only do block-level reconstruction based on RAID-z parity and block checksums, and no "zfsck" exists to rebuild directory contents.

            One option is to use LFSCK to do the rebuild of these directories, after renaming them (e.g. to d0.broken) so that it doesn't get errors from ZFS when trying to rebuild these directories. However, AFAIK this functionality to repair OSTs and to run OI Scrub with ZFS is only available in Lustre 2.6 and later. I've added Fan Yong, the LFSCK developer, to comment on this. While we have tested this functionality, I'm not sure that it has been tested with persistent ZFS corruption, so some caution would be needed here.
            jamervi Joe Mervini added a comment -

            This looks like it's more serious than I originally believed. My assumption was that the directories that were identified contained the files that were corrupted, but now I'm not so sure. And I believe I might have gotten myself into the weeds a little bit.

            I wasn't able to mount the pool without doing a "zfs set mountpoint=legacy <device>". Once mounted, I couldn't find the LAST_ID (at least it wasn't in the <mntpt>/O/0 directory). I then tried running the procedure for listing the objects and got a message saying it had problems with the d0 directory, and it pretty much hung (unfortunately I didn't capture the message).

            Below is the output from zpool status -v:

            [root@foss1 ~]# zpool status -v fscratch-OST0001
            pool: fscratch-OST0001
            state: ONLINE
            status: One or more devices has experienced an error resulting in data
            corruption. Applications may be affected.
            action: Restore the file in question if possible. Otherwise restore the
            entire pool from backup.
            see: http://zfsonlinux.org/msg/ZFS-8000-8A
            scan: scrub repaired 84.5K in 126h42m with 50 errors on Sat Aug 20 15:58:43 2016
            config:

            NAME                                   STATE   READ WRITE CKSUM
            fscratch-OST0001                       ONLINE     0     0    63
              360080e5000297410000008c5543fd7ef    ONLINE     0     0   252

            errors: Permanent errors have been detected in the following files:

            fscratch-OST0001/fscratch_ost01:<0x12b3205>
            fscratch-OST0001/fscratch_ost01:<0x3da4a0a>
            fscratch-OST0001/fscratch_ost01:<0x8a340f>
            fscratch-OST0001/fscratch_ost01:<0x286a10>
            fscratch-OST0001/fscratch_ost01:<0x11b691f>
            fscratch-OST0001/fscratch_ost01:<0x2425d25>
            fscratch-OST0001/fscratch_ost01:<0x2331526>
            fscratch-OST0001/fscratch_ost01:<0x5c24038>
            fscratch-OST0001/fscratch_ost01:<0x1292849>
            fscratch-OST0001/fscratch_ost01:<0x3e23e4a>
            fscratch-OST0001/fscratch_ost01:<0x1e1eb4a>
            fscratch-OST0001/fscratch_ost01:<0x112bf4b>
            fscratch-OST0001/fscratch_ost01:<0x1e1eb4e>
            fscratch-OST0001/fscratch_ost01:<0x112bf4f>
            fscratch-OST0001/fscratch_ost01:<0x737f56>
            fscratch-OST0001/fscratch_ost01:<0x1e33371>
            fscratch-OST0001/fscratch_ost01:<0x232e72>
            fscratch-OST0001/fscratch_ost01:<0x333887b>
            fscratch-OST0001/fscratch_ost01:<0x3e23f8a>
            fscratch-OST0001/fscratch_ost01:<0x12b9699>
            fscratch-OST0001/fscratch_ost01:<0x31c099b>
            fscratch-OST0001/fscratch_ost01:/O/0/d0
            fscratch-OST0001/fscratch_ost01:/O/0/d1
            fscratch-OST0001/fscratch_ost01:<0x34b6dc4>
            fscratch-OST0001/fscratch_ost01:<0x12b31c5>
            fscratch-OST0001/fscratch_ost01:<0xe334d0>
            fscratch-OST0001/fscratch_ost01:<0xe334d3>
            fscratch-OST0001/fscratch_ost01:/O/0/d28
            fscratch-OST0001/fscratch_ost01:/O/0/d29
            fscratch-OST0001/fscratch_ost01:/O/0/d30
            fscratch-OST0001/fscratch_ost01:/O/0/d31
            fscratch-OST0001/fscratch_ost01:<0x1e818f3>

            At that point I decided to back off. I did mount another OST for comparison but didn't go any further. I was unsure how to reset the mountpoint to what it was originally. Initially I set the mountpoint for the pool to be the same as the Lustre-mounted OST, but it didn't show as "inherited from" on the target. So I set it to none, and when I restarted Lustre I received "Unexpected return code from import of pool" for the OST and none of the 3 OSTs mounted. Luckily I was able to manually mount the OSTs, but this will need to be resolved as well.

            Here is the output to zfs get mountpoint:

            fscratch-OST0000                 mountpoint  none                                    local
            fscratch-OST0000/fscratch_ost00  mountpoint  /mnt/lustre/local/zpool/fscratch_ost00  local
            fscratch-OST0001                 mountpoint  none                                    local
            fscratch-OST0001/fscratch_ost01  mountpoint  /mnt/lustre/local/zpool/fscratch_ost01  local
            fscratch-OST0002                 mountpoint  /mnt/lustre/local/zpool                 local
            fscratch-OST0002/fscratch_ost02  mountpoint  /mnt/lustre/local/zpool/fscratch_ost02  inherited from fscratch-OST0002

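            The <0x...> entries in the zpool status output above are ZFS object numbers that no longer resolve to a path name; a hedged sketch of inspecting one of them with zdb (object numbers are given in decimal here, and the level of detail in zdb output varies by version):

            oss# printf '%d\n' 0x12b3205                     # 19608069
            oss# zdb -dddd fscratch-OST0001/fscratch_ost01 19608069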

            adilger Andreas Dilger added a comment -

            It depends on which files are affected. At least one of them must be an internal Lustre file, or it wouldn't have exhibited any problems precreating files and would only have reported errors when a corrupted regular file object was accessed. Some files like last_rcvd can be deleted and are recreated automatically at the next mount (but this causes client recovery to fail). Other files like LAST_ID can be recreated if we know they are the problem (manually until Lustre 2.6, automatically with LFSCK after 2.6). Regular file data cannot be recovered at this time (but we're working on that...).
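            If last_rcvd turns out to be one of the damaged objects, a cautious sketch of deleting it so it is recreated at the next mount, reusing the canmount mount procedure shown further down in this ticket (expect clients to be evicted rather than recovered):

            oss# zfs set canmount=on fscratch-OST0001/fscratch_ost01
            oss# mount -t zfs fscratch-OST0001/fscratch_ost01 /mnt/ost
            oss# cp -a /mnt/ost/last_rcvd /root/last_rcvd.bak    # keep a copy if it is still readable
            oss# rm /mnt/ost/last_rcvd
            oss# umount /mnt/ost
            oss# zfs set canmount=off fscratch-OST0001/fscratch_ost01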
            jamervi Joe Mervini added a comment -

            Thanks Andreas. I'll need to wait until after-hours to do this. Is it your expectation that eliminating the bad files will restore normal operations of the OST?


            adilger Andreas Dilger added a comment -

            Joe, you can mount a ZFS OST/MDT filesystem in a similar way as ldiskfs, after unmounting it from Lustre and enabling the canmount property (which is normally off so that ZFS does not automatically mount the dataset when the pool is imported):

            oss# zfs set canmount=on <pool/ost>
            oss# mount -t zfs <pool/ost> /mnt/ost
            oss# <do stuff>
            oss# umount /mnt/ost
            oss# zfs set canmount=off <pool/ost>
            

            It is important to know the list of files that were corrupted, since that may entail further recovery actions.

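            A small usage example of capturing that list while the dataset is mounted as above (paths follow the zpool status output earlier in this ticket):

            oss# zpool status -v fscratch-OST0001 > /root/fscratch-OST0001-errors.txt
            oss# ls -l /mnt/ost/O/0/                             # see which d* directories are still readable
            oss# ls /mnt/ost/O/0/d0 | head                       # a corrupted directory may return I/O errors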

            People

              yong.fan nasf (Inactive)
              jamervi Joe Mervini
              Votes: 1
              Watchers: 5
