Comment from Andreas:
> Your worry about is not unnecessary, because in really use cases, the file deleting is random, nobody can guarantee the deleting operations will cause related OI blocks to be empty.
Exactly. It may be that there are only a few entries in each block (e.g. an output file saved after some thousands of temporary files are written and then deleted), and there are few or no empty blocks.
> But on the other hand, if there are no empty OI blocks in the OI files, on some how, that means the OI space utilization in such system is not so bad.
That is not always clear. If the blocks are sparsely used, then the hash wrapping scheme would definitely help.
> Because the starting point for OI file is performance, several single OI files needs to support all the OI operations on the server. So the original policy for OI design was that using more space for more performance. In the real world, the MDT device is often TB sized, nobody will mind the OI files use GB space.
True, but we've increased the number of inodes for 2.x releases to use up more of that excess space than in the past. I agree that for large filesystems it should be less of a risk, and for small test filesystems we can hope that it helps enough under test loads to avoid problems.
I wouldn't object to combining both solutions for 2.3 so that we can be sure this problem does not hit us again in the future. I also like your idea that OI scrub could fix this problem, but would it require significant effort to backport this code to 2.1? It is definitely more of a feature than I would like to include in 2.1, but it is also one of the major holes in the ability to support 2.1 for the long term if any OI problem results in an unusable filesystem.
> My current patch can reuse new empty OI blocks (against any Lustre-2.x release), the existing OI block will be kept there without reusing.
I haven't looked at your patch yet, but I need to know more about how the solution works. Does it keep a persistent list of empty blocks on disk, or only in memory, or does it just delete the free blocks from the file?
Does the file size/offset of the OI file continue to grow during its lifetime? If it does, will it hit the 16TB size limit in heavy usage within, say, 5 years?
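As a rough back-of-envelope check of that question, here is the sustained allocation rate that would be needed to reach the limit. This is only a sketch: the 4 KiB block size, reading "16TB" as 16 TiB, and the assumption that the offset only ever grows are mine, not measured numbers.

```python
# Back-of-envelope: how fast must the OI file grow to hit 16 TiB in 5 years?
# Assumptions (not measurements): 4 KiB blocks, file offset is never reused.
BLOCK_SIZE = 4096                 # bytes per OI block (assumed)
SIZE_LIMIT = 16 * 2**40           # 16 TiB file size limit
YEARS = 5
SECONDS = YEARS * 365 * 24 * 3600


def blocks_per_second_to_hit_limit(limit=SIZE_LIMIT, block=BLOCK_SIZE,
                                   seconds=SECONDS):
    """Sustained rate of new block allocations needed to reach the limit."""
    return (limit // block) / seconds


rate = blocks_per_second_to_hit_limit()
print(f"{rate:.1f} new OI blocks/s sustained for {YEARS} years")
```

Under those assumptions the limit is 2^32 blocks, so it takes roughly 27 newly allocated (never reused) OI blocks per second, continuously, to hit it within 5 years. Whether a heavily loaded MDT can come close to that depends on how many entries land in each new block.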
> We can implement new tool to find out all the existing empty OI blocks by traveling the OI file. But I just wonder whether it is worth to do that or not. Because we will have OI scrub in Lustre-2.3. We can back port OI scrub to Lustre-2.1, which may be more easy than implement new tools to find out empty OI blocks. And rebuilding OI files can take back more space than only reuse empty OI blocks.
It would be better to re-use the OI scrub code than to spend time developing a new tool for this. The OI scrub has more uses, and could be done online.
What might be needed at some point in the future is to allow a "mirrored OI" mode where the new OI file can be built while the old one is used for reference. That would avoid any threads hanging while the FID is not in the new OI file.
The issue isn't about recovering from a crash. Rather, if this is a "garbage collection" action that needs to be done on a regular basis, and the only way to do it is by deleting the OI file(s) and running an urgent scan, then it will have a serious performance impact and will block threads that are doing by-FID lookups.
My goal is to allow this "maintenance" action to be done without significant performance impact or delay. I agree that this would be more complex, but I don't know how much more. If we always create a new OI file when running LFSCK, it will also solve the problem of stale FID entries in the OI file. But to do this, it is better to do it at the "background scrub" speed, and allow the cases of lookup-by-FID not being found in the new OI file to be handled from the backup OI file.
For unlinked files, there is only a need to delete the FID from the new OI file (if it is there yet). The old FID should no longer be referenced by any files, so I think there is no harm in leaving it in the old OI file?
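To make the "mirrored OI" idea concrete, here is a toy model of the lookup and unlink paths while the new OI is being built. It is only a sketch: the class and method names are hypothetical, and plain dicts stand in for the on-disk OI files, so it says nothing about locking or the real scrub code.

```python
class MirroredOI:
    """Toy model of FID lookup during an online OI rebuild (hypothetical)."""

    def __init__(self, old_oi):
        self.old = dict(old_oi)   # existing OI, kept only for reference
        self.new = {}             # OI being rebuilt at background-scrub speed

    def scrub_step(self, fid, inode):
        # Background scrub copies one live mapping into the new OI.
        self.new[fid] = inode

    def insert(self, fid, inode):
        # Files created during the rebuild go only into the new OI.
        self.new[fid] = inode

    def lookup(self, fid):
        # Prefer the new OI; fall back to the old one instead of blocking.
        if fid in self.new:
            return self.new[fid]
        return self.old.get(fid)

    def unlink(self, fid):
        # Delete from the new OI only (if the scrub has reached it yet).
        # The old entry is left behind, so a lookup of this FID would still
        # find it there; this is only harmless if nothing references the FID.
        self.new.pop(fid, None)


oi = MirroredOI({"fid1": 10, "fid2": 20})
oi.scrub_step("fid1", 10)
print(oi.lookup("fid1"), oi.lookup("fid2"))   # fid2 is served from the old OI
oi.unlink("fid1")
print(oi.lookup("fid1"))                      # stale entry still in the old OI
```

The last lookup shows the open question above: the stale entry is still reachable through the fallback path, so leaving it in the old OI is only safe if no by-FID lookup for an unlinked file can arrive before the old OI is dropped.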