So, I hit a similar problem on my test system just now, but something strange appears to be happening. The oi.16.16 file is large, along with a few other OIs, while the rest are tiny:
oi.16.0, oi.16.1, oi.16.17, oi.16.32, and oi.16.33 are the only ones that appear to be in use.
This is running with a 200MB MDT for "SLOW=no sh acceptance-small.sh" and an additional change to runtests to create 10000 files. It also appears that sanity.sh test_51b is trying to create 70000 subdirectories, but there aren't very many files in the filesystem:
It would seem to me that the OI usage is imbalanced, even though the selection function itself looks reasonable. The osd_fid2oi() code appears to select the OI index based on (seq % oi_count), which should be OK. The seq is only updated every LUSTRE_SEQ_MAX_WIDTH (0x20000 = 131072) objects, so with only ~10000 files created here, just a handful of sequences exist and only a few OI files ever receive entries; the inter-OI distribution should become relatively well balanced on even a slightly larger filesystem.
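To make the arithmetic concrete, here is a minimal sketch of that seq-based selection; the struct and helper name are simplified stand-ins, not the real osd_fid2oi() code:

```c
/*
 * Minimal sketch of seq-based OI selection. The struct and names are
 * simplified stand-ins, not the actual osd_fid2oi() implementation.
 */
#include <stdint.h>

#define OI_COUNT              64       /* number of oi.16.* files */
#define LUSTRE_SEQ_MAX_WIDTH  0x20000  /* 131072 objects per sequence */

struct lu_fid {
	uint64_t f_seq;  /* sequence, advances every LUSTRE_SEQ_MAX_WIDTH objects */
	uint32_t f_oid;  /* object id within the sequence */
	uint32_t f_ver;  /* version, unused here */
};

/* Select the OI file index purely from the FID sequence number. */
static inline unsigned int oi_index(const struct lu_fid *fid)
{
	return (unsigned int)(fid->f_seq % OI_COUNT);
}
```

With ~10000 files, only a handful of sequences are ever allocated, so f_seq % 64 lands on only a handful of the 64 OI files, which would explain the skewed oi.16.* sizes above.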
I don't think it is a huge problem that the OI itself doesn't release space, so long as the space that is already allocated is re-used. That means the internal hashing function needs to re-use buckets over time, rather than always allocating new blocks for new buckets.
A related problem with having many OI files in a small filesystem is that the space already allocated to one OI is never used again; instead, new space is allocated for each new OI. A workaround for the test filesystems is to create fewer OI files when the MDT is small, and only allocate all 64 OIs for large MDTs. This is not the original problem seen here, since multi-OI support is only in 2.2, but it can be a major contributor, since the total space used by the OI would increase by 64x compared to the single-OI case.
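A hedged sketch of what that workaround could look like; the thresholds and the helper name oi_count_for_mdt() are purely illustrative, not from the Lustre source:

```c
/*
 * Hypothetical helper: scale the number of OI files with the MDT size
 * instead of always creating 64. Thresholds are illustrative only.
 */
#include <stdint.h>

static unsigned int oi_count_for_mdt(uint64_t mdt_bytes)
{
	if (mdt_bytes < (1ULL << 30))    /* < 1 GB: tiny test MDTs */
		return 1;
	if (mdt_bytes < (128ULL << 30))  /* < 128 GB: mid-sized MDTs */
		return 16;
	return 64;                       /* large MDTs get all 64 OIs */
}
```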
Fan Yong, I can't believe that there is NO LIMIT on the size of the OI file? Surely there must be some upper bound on the use of the OID as the hash index before it begins to wrap? It is impossible to fit a 128-bit value into a smaller hash space without any risk of collision, and it is impossible to store a linear index for even a reasonable number of files created in the filesystem over time, so there HAS to be some code to take this into account? Was the OI/IAM code implemented with so little foresight that it will just grow without limit to fill the MDT as new entries are inserted?
I would expect that at least a simple modulus would provide an upper limit on the OI size; at that point we need to size the MDT taking this into account, and limit the OI count to ensure that these files do not fill the MDT.
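As a sketch of the kind of bound I mean (reusing the simplified struct lu_fid from the earlier sketch): fold the FID into a fixed number of buckets, accept the resulting collisions, and resolve them inside the bucket. The bucket count and hash constant here are illustrative only, not the IAM hash actually used by the OI code:

```c
/*
 * Illustration of a "simple modulus" upper bound: the index can never
 * grow past OI_HASH_BUCKETS, so collisions are inevitable and must be
 * resolved inside the bucket (e.g. by chained records). Uses the
 * simplified struct lu_fid from the earlier sketch.
 */
#define OI_HASH_BUCKETS (1U << 20)  /* fixed upper bound on index size */

static inline uint32_t fid_hash_bounded(const struct lu_fid *fid)
{
	/* fold seq and oid together; multiplier is an arbitrary odd constant */
	uint64_t h = fid->f_seq * 0x9e3779b97f4a7c15ULL + fid->f_oid;

	return (uint32_t)(h % OI_HASH_BUCKETS);
}
```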
Your worry is not unnecessary: in real use cases, file deletion is random, and nobody can guarantee that delete operations will leave the related OI blocks completely empty.
But on the other hand, if there are no empty OI blocks in the OI files, that in some sense means the OI space utilization in such a system is not so bad. The starting point for the OI design was performance: a few OI files need to support all the OI operations on the server, so the original design policy was to use more space for more performance. In the real world, the MDT device is often TB-sized, and nobody will mind the OI files using GB of space.
My current patch can reuse newly emptied OI blocks (against any Lustre 2.x release); existing empty OI blocks are kept there without being reused. We could implement a new tool to find all the existing empty OI blocks by traversing the OI file, but I wonder whether that is worth doing, because we will have OI scrub in Lustre 2.3. We can back-port OI scrub to Lustre 2.1, which may be easier than implementing a new tool to find empty OI blocks, and rebuilding the OI files can reclaim more space than only reusing empty blocks.
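If we did build such a tool, the traversal itself would be simple in shape, something like the sketch below. A real tool would have to parse the IAM block headers rather than just checking for all-zero blocks; all names here are hypothetical:

```c
/*
 * Rough sketch of the "traverse the OI file" tool idea: read the OI
 * file block by block and report blocks that are entirely zero. A real
 * IAM leaf block has a header, so an actual tool would parse that
 * instead; this only shows the traversal shape.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define OI_BLOCK_SIZE 4096

static int scan_oi_for_empty_blocks(const char *path)
{
	char buf[OI_BLOCK_SIZE], zero[OI_BLOCK_SIZE] = { 0 };
	off_t off = 0;
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;

	while ((n = pread(fd, buf, sizeof(buf), off)) == (ssize_t)sizeof(buf)) {
		if (memcmp(buf, zero, sizeof(buf)) == 0)
			printf("empty block at offset %lld\n", (long long)off);
		off += sizeof(buf);
	}
	close(fd);
	return 0;
}
```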
What do you think?