Comment from Andreas:
> Your worry is not unnecessary, because in real use cases file deletion is random; nobody can guarantee that the delete operations will leave the related OI blocks empty.
Exactly. It may be that there are only a few entries in each block (e.g. an output file saved after some thousands of temporary files are written and then deleted), and there are few or no empty blocks.
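To make that scenario concrete, here is a rough sketch in Python. It is only a toy model under stated assumptions: entries are packed into blocks in creation order (the real OI is hash-indexed, so placement differs), and `entries_per_block=64` and `keep_every=50` are made-up numbers for illustration, not measured values.

```python
def simulate_blocks(num_files, entries_per_block, keep_every):
    """Toy model: fill blocks sequentially, then delete all but every
    keep_every-th entry (the surviving "output files").

    Returns (total_blocks, empty_blocks, utilization).
    """
    blocks = {}
    # Create phase: entry n lands in block n // entries_per_block (assumption).
    for n in range(num_files):
        blocks.setdefault(n // entries_per_block, set()).add(n)
    # Delete phase: temporary files go away, output files stay.
    for n in range(num_files):
        if n % keep_every != 0:
            blocks[n // entries_per_block].discard(n)
    empty = sum(1 for entries in blocks.values() if not entries)
    used = sum(len(entries) for entries in blocks.values())
    utilization = used / (len(blocks) * entries_per_block)
    return len(blocks), empty, utilization


total, empty, utilization = simulate_blocks(64000, 64, 50)
print(f"blocks={total} empty={empty} utilization={utilization:.1%}")
```

With these numbers every block keeps at least one live entry, so a scheme that only reclaims completely empty blocks frees nothing, even though about 98% of the slots are unused.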
> But on the other hand, if there are no empty OI blocks in the OI files, then in some sense that means the OI space utilization in such a system is not so bad.
That is not always clear. If the blocks are sparsely used, then the hash wrapping scheme would definitely help.
> Because the starting point for the OI file is performance, several OI files need to support all the OI operations on the server. So the original policy for the OI design was to use more space for more performance. In the real world, the MDT device is often TB sized, and nobody will mind the OI files using GBs of space.
True, but we've increased the number of inodes for 2.x releases to use up more of that excess space than in the past. I agree that for large filesystems it should be less of a risk, and for small test filesystems we can hope that it helps enough under test loads to avoid problems.
I wouldn't object to combining both solutions for 2.3 so that we can be sure this problem does not hit us again in the future. I also like your idea that OI scrub could fix this problem, but would it require significant effort to back port this code to 2.1? It is definitely more of a feature than I would like to include in 2.1, but it is also one of the major holes in the ability to support 2.1 for the long term if any OI problem results in an unusable filesystem.
> My current patch can reuse newly emptied OI blocks (against any Lustre-2.x release); the existing empty OI blocks will be left in place without being reused.
I haven't looked at your patch yet, but I need to know more about how the solution works. Does it keep a persistent list of empty blocks on disk, or only in memory, or does it just delete the free blocks from the file?
Does the file size/offset of the OI file continue to grow during its lifetime? If it does, will it hit the 16TB size limit in heavy usage within, say, 5 years?
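As a back-of-the-envelope check on the 5-year question, the following Python sketch estimates how long a single OI file would take to reach 16TB in the worst case. The assumptions are mine, not taken from the patch: the file offset only ever grows (no block is reused), all creates go to one OI file, blocks are 4096 bytes, and `entries_per_block=64` is a hypothetical packing figure.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def years_until_limit(creates_per_sec, entries_per_block=64,
                      block_size=4096, limit_bytes=16 * 2**40):
    """Worst-case years until the OI file offset reaches limit_bytes,
    assuming every create consumes a fresh slot and nothing is reused."""
    max_blocks = limit_bytes // block_size
    max_entries = max_blocks * entries_per_block
    return max_entries / creates_per_sec / SECONDS_PER_YEAR


# Sustained create rates to compare (hypothetical workloads).
for rate in (100, 1000, 10000):
    print(f"{rate:>6} creates/s -> {years_until_limit(rate):.1f} years")
```

Under these assumptions a sustained 1000 creates/s would reach the limit in roughly 8.7 years, and ten times that rate in under a year, so the answer depends heavily on whether freed space is ever reused.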
> We can implement a new tool to find all the existing empty OI blocks by traversing the OI file. But I wonder whether it is worth doing. Because we will have OI scrub in Lustre-2.3, we can back port OI scrub to Lustre-2.1, which may be easier than implementing new tools to find empty OI blocks. And rebuilding the OI files can reclaim more space than only reusing empty OI blocks.
It would be better to re-use the OI scrub code than to spend time developing a new tool for this. The OI scrub has more uses, and could be done online.
What might be needed at some point in the future is to allow a "mirrored OI" mode where the new OI file can be built while the old one is used for reference. That would avoid any threads hanging while the FID is not in the new OI file.
The "backup OI" mode for OI scrub to rebuild the OI file would introduce more complexity. Because there may be concurrent create/unlink operations during the OI scrub, it would need to process both the old and the new OI file and keep them consistent somehow, which would change the normal logic for lookup/create/unlink. Such changes may introduce race bugs.
In fact, we do not need to worry about a system crash during OI scrub, because OI scrub can resume from its breakpoint. We can guarantee that the OI file is eventually rebuilt correctly, even if the system crashes many times.
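The resume-from-breakpoint idea can be sketched roughly as follows. This is an illustrative Python model only, not the actual OI scrub code: the checkpoint file format, the `scrub` function, the `interval` parameter, and the dict standing in for the on-disk OI file are all hypothetical.

```python
import json
import os
import tempfile


class SimulatedCrash(Exception):
    """Stands in for a system crash in the middle of the scan."""


def load_checkpoint(path):
    # No checkpoint yet means the scrub starts from the first inode.
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_index"]


def save_checkpoint(path, next_index):
    # Write to a temporary file and rename, so a crash never leaves a
    # half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def scrub(inodes, oi, checkpoint_path, interval=100, crash_at=None):
    """Rebuild FID -> inode mappings in `oi`, resuming from the checkpoint."""
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(inodes)):
        if crash_at is not None and i == crash_at:
            raise SimulatedCrash(i)
        fid, ino = inodes[i]
        oi[fid] = ino  # idempotent, so redoing work after a crash is safe
        if (i + 1) % interval == 0:
            save_checkpoint(checkpoint_path, i + 1)
    save_checkpoint(checkpoint_path, len(inodes))
    return oi


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "scrub.ckpt")
    inodes = [(f"fid{i}", 1000 + i) for i in range(1050)]
    oi = {}
    try:
        scrub(inodes, oi, ckpt, crash_at=537)
    except SimulatedCrash:
        pass
    resumed_from = load_checkpoint(ckpt)
    scrub(inodes, oi, ckpt)  # second run picks up at the checkpoint
    print(f"resumed from index {resumed_from}, rebuilt {len(oi)} entries")
```

The point of the sketch is that each insert is idempotent, so the entries between the last checkpoint and the crash can simply be reprocessed, and repeated crashes only cost repeated work.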
Oleg, what is your suggestion for back-porting OI scrub to Lustre-2.1.x?