Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.3
    • Labels: None
    • Environment: TOSS 2.4-9
    • Severity: 2
    • 9223372036854775807

    Description

      After a power outage we encountered a hardware error on one of our storage devices that essentially corrupted ~30 files on one of the OSTs. Since then the OST has been read-only and is throwing the following log messages:

      [ 351.029519] LustreError: 8974:0:(ofd_obd.c:1376:ofd_create()) fscratch-OST0001: unable to precreate: rc = -5
      [ 360.762505] LustreError: 8963:0:(ofd_obd.c:1376:ofd_create()) fscratch-OST0001: unable to precreate: rc = -5
      [ 370.784372] LustreError: 8974:0:(ofd_obd.c:1376:ofd_create()) fscratch-OST0001: unable to precreate: rc = -5

      I have scrubbed the device in question and rebooted the system to bring up the server normally, but I am still unable to create a file on that OST.

      zpool status -v reports the damaged files and recommends restoring them from backup, and I'm inclined to simply remove the files. I know how to do this with ldiskfs, but I don't know how with ZFS. At this point I don't know how to proceed.

    Activity

            [LU-8521] ZFS OST is unwritable
            yong.fan nasf (Inactive) made changes -
            Resolution: Fixed
            Status: Open → Closed
            jamervi Joe Mervini added a comment -

            You can cancel this ticket. Sorry to not have closed it before now.

            We decided to chuck ZFS on this file system due to other issues and go back to LDISKFS. All's good with that.


            yong.fan nasf (Inactive) added a comment -

            jamervi, any further feedback for this ticket? Thanks!
            yong.fan nasf (Inactive) added a comment - edited

            (Quoting Joe Mervini's comment below:)
            File: `jamervi/scripts/check-key-image-files'
            Size: 0 Blocks: 0 IO Block: 4096 regular empty file
            Device: fd00h/64768d Inode: 844107533 Links: 1
            Access: (0700/-rwx------) Uid: ( 0/ root) Gid: ( 0/ root)
            Access: 2015-04-14 13:54:09.000000000 -0600
            Modify: 2015-04-14 11:31:58.000000000 -0600
            Change: 2015-04-14 11:31:58.000000000 -0600
            Is it possible that what is stored on the OST could be perhaps a year or more different?
            I am not particularly interested in the file sizes. What I need is a WAG at the age of the files on the file system.

            Normally, when a file is closed, the file's 'atime' (access time) on the MDT is updated to the current time. But if there is some failure during the close (a very rare case), the 'atime' will not be updated.

            On the other hand, the utime() operation can set the 'atime' to any value on the MDT, but at the same time utime() causes the 'ctime' to be updated to the current time. In your case, it does not appear that utime() changed the xtime.

            So I tend to say that the file `jamervi/scripts/check-key-image-files' has not been opened since 2015-04-14 13:54:09 (the last close timestamp).
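            For illustration only, that generic utime()/ctime behaviour can be seen on any local filesystem (a minimal sketch; the path is made up and nothing here is Lustre-specific):

            touch /tmp/xtime-demo
            stat -c 'atime=%x  ctime=%z' /tmp/xtime-demo
            touch -a -d '1 year ago' /tmp/xtime-demo       # sets atime back a year via utimensat()...
            stat -c 'atime=%x  ctime=%z' /tmp/xtime-demo   # ...but ctime is bumped to the current time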

            jamervi Joe Mervini added a comment -

            So just to be clear, if I do a stat on a particular file in ROOT/O/<username>, could that be wildly different from the associated object on the OST? For example, I have a file similar to this:

            File: `jamervi/scripts/check-key-image-files'
            Size: 0 Blocks: 0 IO Block: 4096 regular empty file
            Device: fd00h/64768d Inode: 844107533 Links: 1
            Access: (0700/-rwx------) Uid: ( 0/ root) Gid: ( 0/ root)
            Access: 2015-04-14 13:54:09.000000000 -0600
            Modify: 2015-04-14 11:31:58.000000000 -0600
            Change: 2015-04-14 11:31:58.000000000 -0600

            Is it possible that what is stored on the OST could be perhaps a year or more different?

            I am not particularly interested in the file sizes. What I need is a WAG at the age of the files on the file system.

            yong.fan nasf (Inactive) added a comment - edited

            For a given striped Lustre regular file, its attributes are distributed across several targets (the MDT and OSTs). The filename, owner, mode and permissions are stored on the MDT. Each stripe (OST-object) stores its own size/blocks attribute, and the client calculates the file's size/blocks by summing the size/blocks of all related OST-objects. As for the xtime attributes (atime, mtime, ctime), both the MDT-object and the OST-objects store timestamps; which one is used as the file's xtime depends on which ctime is newer.

            So if you directly mount the target as the backend filesystem, whether ldiskfs or ZFS, you cannot directly get the file's size and xtime attributes. But you can calculate them. For a given OST-object, you can get its MDT-object's FID from the OST-object's PFID EA; OST-objects with the same MDT-object FID belong to the same file. You can then sum the related OST-objects' sizes and compare their ctimes to obtain the file's size/time attributes. But considering the effort of such a process, I am not sure it is really more efficient than collecting the information directly from a Lustre client.
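            For reference, a rough sketch of that per-object lookup on an ldiskfs-mounted OST (the object path and mount point are hypothetical, and this assumes the PFID EA is exposed as the trusted.fid xattr; as noted in the comment below, the backend cannot be inspected this way on ZFS):

            obj=/mnt/ost-ldiskfs/O/0/d5/1000005             # hypothetical OST-object path
            getfattr -e hex -n trusted.fid "$obj"           # PFID EA: the parent MDT-object's FID
            stat -c 'size=%s blocks=%b ctime=%z' "$obj"     # this stripe's size/blocks/ctime

            Summing size/blocks across the objects that share a parent FID, and taking the newest ctime, reproduces what a client would report for that file.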


            adilger Andreas Dilger added a comment -

            With ldiskfs it is relatively safe to mount the MDT as type ldiskfs and collect read-only data from the filesystem. The local MDT data will not contain file sizes, only filenames, owners, timestamps, etc. We don't test this mode officially, but I use it to poke around in my home filesystem, though that filesystem has very low activity. It is not possible to do this with ZFS.
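            As a rough sketch of that read-only inspection (the device and mount point names are hypothetical):

            mount -t ldiskfs -o ro /dev/mdt_device /mnt/mdt-ldiskfs
            stat /mnt/mdt-ldiskfs/ROOT/some/dir/somefile    # hypothetical path; the client-visible namespace lives under ROOT/
            umount /mnt/mdt-ldiskfs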

            How many OSTs do you have? I would think that if the filesystem is currently at ~71% space usage and you have more than 10-12 OSTs (each about 70% full), then removing one of them would only increase space usage by 6-7% or less on the remaining OSTs, which should still be OK. I'd suspect that the OST in question is already below 70% usage, because it can delete files but not create new ones, so there may be less data that needs to be migrated.
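            To make that arithmetic concrete (assuming, for illustration, 12 equal-sized OSTs each 70% full): emptying one OST spreads 70% of one OST's capacity across the remaining 11, raising each of them by roughly 70/11 ≈ 6.4 percentage points, to about 76% full.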

            jamervi Joe Mervini added a comment -

            I've circled back to this issue, and since we are unable to repair the target, our only option is either to migrate the files on the read-only target to other OSTs or punt. One of the problems we have at the moment is that the capacity of the file system is at ~71%, and as a result migration is problematic.

            We concluded that the best course of action would be to purge older data from the file system to free up space, and tried using lfs find to gather file stats. Unfortunately, this method had to be abandoned: of the ~1400 top-level directories, we have only been able to collect data on roughly 300 in the course of a month.

            I was playing around on one of my test file systems and discovered that I can get all the data I want from the MDT (i.e., create, modify and access timestamps) by mounting the MDT as ldiskfs and running stat on the files in /ROOT/O. My question is: would it be prudent to take a system outage to collect this data, or am I safe having the device mounted as both Lustre and ldiskfs (read-only) at the same time?

            Thanks in advance.


            adilger Andreas Dilger added a comment -

            Besides the OI scrub to repair the corruption in the ZFS filesystem, I think the only other option is to migrate the files off this OST onto other OSTs and then reformat it. How much space this will consume on the other OSTs depends on how many OSTs there are.

            That isn't a great solution, but the repair tools for ZFS are somewhat less robust than for ldiskfs since ZFS itself doesn't get corrupted very easily. Once the OI Scrub functionality is available for ZFS we will be able to repair a fair amount of corruption in the backing filesystem, though it still wouldn't be possible to recover the user data in any OST objects that were corrupted.
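            For the migration route, a rough sketch of the usual client-side approach (the client mount point /mnt/fscratch is an assumption; the OST name comes from the log messages above):

            # find regular files with a stripe on the affected OST and rewrite them onto other OSTs
            lfs find /mnt/fscratch --ost fscratch-OST0001_UUID -type f | lfs_migrate -y

            Since lfs_migrate copies each matching file, this is constrained by the ~71% space usage mentioned above.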


            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: jamervi Joe Mervini
              Votes: 1
              Watchers: 5
