Details
- Bug
- Resolution: Fixed
- Major
- Lustre 2.5.3
- TOSS 2.4-9
Description
After a power outage we encountered a hardware error on one of our storage devices that essentially corrupted ~30 files on one of the OSTs. Since then the OST has been read-only and is throwing the following log messages:
[ 351.029519] LustreError: 8974:0:(ofd_obd.c:1376:ofd_create()) fscratch-OST0001: unable to precreate: rc = -5
[ 360.762505] LustreError: 8963:0:(ofd_obd.c:1376:ofd_create()) fscratch-OST0001: unable to precreate: rc = -5
[ 370.784372] LustreError: 8974:0:(ofd_obd.c:1376:ofd_create()) fscratch-OST0001: unable to precreate: rc = -5
I have scrubbed the device in question and rebooted the system to bring up the server normally, but I am still unable to create a file on that OST.
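For reference, one way to confirm that a specific OST still rejects creates is to force a new file's first stripe onto it with lfs setstripe. This is a minimal sketch: the helper name is hypothetical, and the mount point and index would be /fscratch and 1 for the fscratch-OST0001 in the log lines above.

```shell
# Sketch: check whether a given OST index accepts object creates.
# check_ost_create is a hypothetical helper, not an lfs subcommand.
check_ost_create() {
    mnt="$1"    # Lustre client mount point, e.g. /fscratch
    idx="$2"    # OST index to test, e.g. 1 for OST0001
    f="$mnt/.ost_create_test.$$"
    # Force the file's first (and only) stripe onto the given OST index.
    if lfs setstripe -i "$idx" -c 1 "$f" 2>/dev/null; then
        echo "OST $idx: create OK"
        rm -f "$f"
    else
        echo "OST $idx: create FAILED"
    fi
}
```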
zpool status -v reports the damaged files and recommends restoring from backup, and I'm inclined to simply remove the files. I know how to do this with ldiskfs, but I don't know how with ZFS. At this point I don't know how to proceed.
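For the ZFS side, the rough procedure under consideration can be sketched as follows. This is a hedged outline, not a verified fix: the pool name is hypothetical, and on an OST the paths reported by zpool status -v are internal object files rather than user-visible filenames.

```shell
# Hedged sketch of clearing permanent errors on a ZFS pool after the
# damaged files are disposed of. Pool name "fscratch-ost1" is hypothetical.
purge_damaged() {
    pool="$1"
    # List the permanently damaged files/objects.
    zpool status -v "$pool"
    # After removing each damaged file reported above (rm <path>),
    # re-scrub so ZFS re-verifies the pool...
    zpool scrub "$pool"
    # ...then clear the recorded error counters.
    zpool clear "$pool"
}
```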
Issue Links
- is related to: LU-7585 Implement OI Scrub for ZFS (Resolved)
-
I've circled back to this issue, and since we are unable to repair the target, our only options are to migrate the files on the read-only target to other OSTs or to punt. One problem at the moment is that the file system is at ~71% capacity, which makes migration problematic.
We concluded that the best course of action would be to purge older data from the file system to free up space, and tried using lfs find to gather file stats. Unfortunately, this method had to be abandoned: of the ~1400 top-level directories, we were only able to collect data on roughly 300 in the course of a month.
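On a plain POSIX tree, the per-file data we were after from lfs find can be sketched with GNU find's -printf. This is an illustration of the scan, not the exact command we ran; scan_tree is a hypothetical helper.

```shell
# Sketch: collect mtime/atime/ctime (epoch seconds) and path for every
# regular file under a directory, oldest modification time first.
scan_tree() {
    find "$1" -type f -printf '%T@ %A@ %C@ %p\n' | sort -n
}
```

Typical use would be one directory per run, e.g. `scan_tree /some/dir > dir.stats`, so partial results survive if the scan is interrupted.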
I was playing around on one of my test file systems and discovered that I can get all the data I want from the MDT (i.e., create, modify, and access timestamps) by mounting the MDT as ldiskfs and running stat on the files in /ROOT/O. My question is: would it be prudent to take a system outage to collect this data, or am I safe having the device mounted as both Lustre and ldiskfs (read-only) at the same time?
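The timestamp harvest itself can be sketched as below. The device path and mount point are hypothetical, and the mount line is shown commented out since it only applies on the MDS.

```shell
# Sketch: stat every file under a directory and print its
# access/modify/change times (epoch seconds) plus the path.
# On the MDS the directory would be the read-only ldiskfs mount of
# the MDT, e.g.:
#   mount -t ldiskfs -o ro /dev/mdt_device /mnt/mdt   # names hypothetical
mdt_timestamps() {
    root="$1"
    find "$root" -type f -exec stat -c '%X %Y %Z %n' {} +
}
```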
Thanks in advance.