Details
- Bug
- Resolution: Fixed
- Major
- Lustre 2.3.0, Lustre 2.1.3, Lustre 2.1.6
- b2_1 g636ddbf
- 3
- 4236
Description
I have a smallish filesystem to which I only allocated a 5GB MDT since the overall dataset was always intended to be very small. This filesystem is simply being used to add and remove files in a loop with something along the lines of:
while true; do
    cp -a /lib /mnt/lustre/foo
    rm -rf /mnt/lustre/foo
done
It seems in doing this I have filled up my MDT with an "oi.16" file that is now 94% of the space of the MDT:
# stat /mnt/lustre/mdt/oi.16
  File: `/mnt/lustre/mdt/oi.16'
  Size: 4733702144   Blocks: 9254568   IO Block: 4096   regular file
Device: fd05h/64773d   Inode: 13   Links: 1
Access: (0644/-rw-r--r--)   Uid: ( 0/ root)   Gid: ( 0/ root)
Access: 2012-05-27 11:55:00.175323551 +0000
Modify: 2012-05-27 11:55:00.175323551 +0000
Change: 2012-05-27 11:55:00.175323551 +0000
# df -k /mnt/lustre/mdt/
Filesystem                1K-blocks    Used  Available Use% Mounted on
/dev/mapper/LustreVG-mdt0   5240128 5240128          0 100% /mnt/lustre/mdt
# ls -ls /mnt/lustre/mdt/oi.16
4627284 -rw-r--r-- 1 root root 4733702144 May 27 11:55 /mnt/lustre/mdt/oi.16
It seems the OI is leaking and not being reaped when files are removed.
Attachments
Activity
The issue isn't about recovering from a crash. Rather, if this is a "garbage collection" action that needs to be done on a regular basis, but the only way to do it is by deleting the OI file(s) and running an urgent scan, then it will have a serious performance impact and block threads that are doing by-FID lookups.
My goal is to allow this "maintenance" action to be done without significant performance impact or delay. I agree that this would be more complex, but I don't know how much more. If we always create a new OI file when running LFSCK, it would also solve the problem of stale FID entries in the OI file. But it is better to do this at "background scrub" speed, and handle the case of a lookup-by-FID missing in the new OI file by falling back to the backup OI file.
For unlinked files, there is only a need to delete the FID from the new OI file (if it is there yet). The old FID should no longer be referenced by any files, so there is no harm in leaving it in the old OI file, I think?
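The unlink rule described above can be sketched in a few lines. This is a minimal in-memory model, not the Lustre implementation: the flat-array `oi_table` and all the function names are hypothetical stand-ins for the on-disk IAM index, chosen only to make the rule concrete.

```c
#include <assert.h>

/* Toy in-memory OI table: a flat array of FID-to-inode mappings.
 * Real Lustre OI files are on-disk IAM/htree indexes. */
#define OI_SLOTS 16

struct oi_map   { unsigned long long fid; unsigned long ino; int used; };
struct oi_table { struct oi_map slot[OI_SLOTS]; };

static int oi_insert(struct oi_table *oi, unsigned long long fid,
                     unsigned long ino)
{
    for (int i = 0; i < OI_SLOTS; i++) {
        if (!oi->slot[i].used) {
            oi->slot[i] = (struct oi_map){ fid, ino, 1 };
            return 0;
        }
    }
    return -1;                      /* table full */
}

static int oi_delete(struct oi_table *oi, unsigned long long fid)
{
    for (int i = 0; i < OI_SLOTS; i++) {
        if (oi->slot[i].used && oi->slot[i].fid == fid) {
            oi->slot[i].used = 0;
            return 0;
        }
    }
    return -1;                      /* mapping not (yet) present */
}

/* Unlink while a backup-OI rebuild is in progress: delete the FID
 * from the NEW table only.  A stale entry left in the OLD table is
 * harmless, since no live file references that FID any more. */
static void oi_unlink_during_scrub(struct oi_table *new_oi,
                                   unsigned long long fid)
{
    oi_delete(new_oi, fid);         /* -1 just means the scrub had not
                                     * copied this mapping over yet */
}
```

The point of the sketch is that the unlink path never has to touch the old OI file at all, which keeps the concurrent-modification surface small.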
The "backup OI" mode for OI scrub to rebuild the OI file would introduce more complexity, because there may be concurrent creates/unlinks during the OI scrub. It would need to process both the old OI file and the new OI file, and keep them consistent somehow, which would change the normal lookup/create/unlink logic. Such changes may introduce race bugs.
In fact, we do not care about a system crash during OI scrub, because we support resuming OI scrub from the breakpoint. We can guarantee that the OI file is eventually rebuilt correctly, even if the system crashes many times.
Oleg, what is your suggestion for back-porting OI scrub to Lustre-2.1.x?
My preferred path would be for OI scrub to be backported to 2.1. This would allow fixing this issue (though not in an ideal manner, currently), and also improve maintenance/support for 2.1 itself (allowing recovery from all sorts of OI corruption, backup/restore, etc.).
First, however, please ask Oleg if he would also be in favour of landing this code onto b2_1. It is rather large for a maintenance release, though it could be argued for the above reasons that it is really necessary for making 2.1 more supportable in the future.
The one reason I say that this doesn't really resolve the OI size problem very well is that it requires manually deleting the OI file(s) and then running OI scrub in urgent mode, which will block threads if they cannot find the FID they are looking for, and cause high load on the MDS.
It would be better to have some kind of "backup OI" mode where the new OI file is created while the old one is used to find any missing FIDs. If the old OI file were kept around, it would also help during OI scrub in case the primary were lost or corrupted. Only in the backup/restore case, where the old file is useless, would it make sense to delete it right away.
I think that porting the OI scrub code to Lustre-2.1 is simpler than implementing new tools to find existing empty OI blocks. OI scrub is a complete solution for the OI file size issue, because it can shrink the OI file and reclaim the unused space. Re-using empty OI blocks and wrapping the FID hash can only slow down the growth of the OI file, not shrink it. So we need OI scrub to resolve the OI file size issue anyway.
As for wrapping the FID hash: it can reuse some idle OI mapping slots, but it depends on the hash function distributing new FID mappings onto those idle slots properly. A good hash function also means more OI mapping inserts instead of appends, which will hurt create performance. On the other hand, it would introduce a compatibility issue with the old OI file format, so it cannot be used to resolve the OI file size issue on Lustre-2.1.
As for the patch that re-uses empty OI blocks: the empty OI blocks are recorded in a special on-disk block list. It does not actually release the empty OI blocks.
Comment from Andreas:
> Your worry is not unnecessary, because in real use cases file deletion is random; nobody can guarantee that the delete operations will leave the related OI blocks empty.
Exactly. It may be that there are only a few entries in each block (e.g. an output file saved after some thousands of temporary files are written and then deleted), and there are few or no empty blocks.
> But on the other hand, if there are no empty OI blocks in the OI files, that somehow means the OI space utilization in such a system is not so bad.
That is not always clear. If the blocks are sparsely used, then the hash wrapping scheme would definitely help.
> Because the starting point for the OI file is performance, a few single OI files need to support all the OI operations on the server. So the original policy for the OI design was to use more space for more performance. In the real world, the MDT device is often TB-sized; nobody will mind the OI files using GB of space.
True, but we've increased the number of inodes for 2.x releases to use up more of that excess space than in the past. I agree that for large filesystems it should be less of a risk, and for small test filesystems we can hope that it helps enough under test loads to avoid problems.
I wouldn't object to combining both solutions for 2.3 so that we can be sure this problem does not hit us again in the future. I also like your ideas that OI scrub could fix this problem, but would it require significant effort to back port this code to 2.1? It is definitely more of a feature than I would like to include into 2.1, but it is also one of the major holes in the ability to support 2.1 for the long term if any OI problem results in an unusable filesystem.
> My current patch can reuse newly emptied OI blocks (against any Lustre-2.x release); existing empty OI blocks will be kept there without being reused.
I haven't looked at your patch yet, but need to know more about how the solution works. Does it keep a persistent list of empty blocks on disk, or only in memory, or does it just delete the free blocks from the file?
Does the file size/offset of the OI file continue to grow during its lifetime? If it does, will it hit the 16TB size limit in heavy usage within, say, 5 years?
> We could implement a new tool to find all the existing empty OI blocks by traversing the OI file. But I wonder whether it is worth doing, because we will have OI scrub in Lustre-2.3. We can back-port OI scrub to Lustre-2.1, which may be easier than implementing a new tool to find empty OI blocks. And rebuilding the OI files can reclaim more space than only reusing empty blocks.
It would be better to re-use the OI scrub code than to spend time developing a new tool for this. The OI scrub has more uses, and could be done online.
What might be needed at some point in the future is to allow a "mirrored OI" mode where the new OI file can be built while the old one is used for reference. That would avoid any threads hanging while a FID is not yet in the new OI file.
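The lookup path in such a "mirrored OI" mode could look roughly like the sketch below. This is an illustrative in-memory model under assumed names (`oi_table`, `oi_lookup_mirrored`, etc.), not the Lustre API: try the new OI first, fall back to the old one on a miss, and opportunistically migrate the mapping so later lookups hit the new OI directly.

```c
#include <assert.h>

/* Toy in-memory OI table standing in for the on-disk IAM index. */
#define OI_SLOTS 16

struct oi_map   { unsigned long long fid; unsigned long ino; int used; };
struct oi_table { struct oi_map slot[OI_SLOTS]; };

static long oi_lookup(const struct oi_table *oi, unsigned long long fid)
{
    for (int i = 0; i < OI_SLOTS; i++)
        if (oi->slot[i].used && oi->slot[i].fid == fid)
            return (long)oi->slot[i].ino;
    return -1;                         /* miss */
}

static int oi_insert(struct oi_table *oi, unsigned long long fid,
                     unsigned long ino)
{
    for (int i = 0; i < OI_SLOTS; i++) {
        if (!oi->slot[i].used) {
            oi->slot[i] = (struct oi_map){ fid, ino, 1 };
            return 0;
        }
    }
    return -1;                         /* table full */
}

/* Lookup-by-FID during the rebuild: a miss in the new table is not
 * an error yet -- consult the old table, and migrate the mapping so
 * the thread never has to block on the scrub catching up. */
static long oi_lookup_mirrored(struct oi_table *new_oi,
                               const struct oi_table *old_oi,
                               unsigned long long fid)
{
    long ino = oi_lookup(new_oi, fid);

    if (ino < 0) {
        ino = oi_lookup(old_oi, fid);
        if (ino >= 0)
            oi_insert(new_oi, fid, (unsigned long)ino);
    }
    return ino;                        /* -1 only if missing in both */
}
```

Only a miss in both tables is a real "object not found", which is exactly what removes the thread-blocking problem of urgent-mode scrub.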
Your worry is not unnecessary, because in real use cases file deletion is random; nobody can guarantee that the delete operations will leave the related OI blocks empty.
But on the other hand, if there are no empty OI blocks in the OI files, that somehow means the OI space utilization in such a system is not so bad. Because the starting point for the OI file is performance, a few single OI files need to support all the OI operations on the server. So the original policy for the OI design was to use more space for more performance. In the real world, the MDT device is often TB-sized; nobody will mind the OI files using GB of space.
My current patch can reuse newly emptied OI blocks (against any Lustre-2.x release); existing empty OI blocks will be kept there without being reused. We could implement a new tool to find all the existing empty OI blocks by traversing the OI file. But I wonder whether it is worth doing, because we will have OI scrub in Lustre-2.3. We can back-port OI scrub to Lustre-2.1, which may be easier than implementing a new tool to find empty OI blocks. And rebuilding the OI files can reclaim more space than only reusing empty blocks.
What do you think?
This is a comment from Andreas:
This will help in our limited test case of creating and deleting files in a loop. The real question is whether there will be so many empty OI blocks in real life, when all files are not deleted in strict sequence?
I like the idea that this can be applied to fix the problem even on 2.1 releases that have already seen the problem, but it is important to know whether it will really help. This is especially true if it adds complexity to the code and doesn't actually help much in the end.
One path forward is to create a debug patch that can be included into 2.1.3 that will print out (at mount time or via /proc?) how many empty blocks there really are in the OIs. The one drawback is that this may cause a LOT of seeking to read large OI files at mount, which may be unacceptable in production. This could be used by CEA and/or LLNL on their production to report the state of the OI file(s).
Cheers, Andreas
The patch contains a sanity update, test_228, which verifies whether the OI file size increases when new files are created while some empty OI blocks exist.
Patch for reusing empty OI blocks:
http://review.whamcloud.com/#change,3153,set4
For old Lustre-2.x releases, this patch only affects creates/unlinks performed after the patch is applied; it will not affect pre-existing empty OI blocks.
Andreas, is it necessary to introduce a tool to find all the empty OI blocks in existing OI files for reuse, or should we leave them to be rebuilt by OI scrub in Lustre-2.3?
After some testing, I found that wrapping the FID hash to reuse idle OI slots may not be an efficient solution for the OI file size issue. The positions of the idle OI slots are random, depending on which files are removed, and it is almost impossible to find a suitable hash function that hashes new OI mappings evenly onto those random idle slots.
On the other hand, wrapping the FID hash is inefficient for OI slot insertion because it causes more memmove() in the related OI blocks, whereas with the original flat hash most OI slot insertions are append() operations within the related block. So create performance may be worse with wrapping.
In fact, the most serious cause of OI file growth is the empty but non-released OI blocks. As long as we can reuse those blocks, we can greatly slow down the growth of the OI file.
My current idea is to introduce inode::i_idle_blocks to record these non-released OI blocks when they become empty, and to adjust the OI block allocation strategy: reuse an empty block from inode::i_idle_blocks with priority, and allocate a new block from the system volume only when no idle OI block can be reused.
Another advantage is that such a change introduces no OI compatibility issues: a new OI file can be accessed by an old MDT, and a new MDT can access an old OI file.
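The allocation strategy described in that comment can be sketched as follows. This is a simplified model under assumed names (`oi_inode`, `oi_block_alloc`, a fixed-size idle array standing in for the on-disk idle-block list), not the actual patch.

```c
#include <assert.h>

/* Toy model of the inode::i_idle_blocks idea: empty OI leaf blocks
 * are recorded on a per-inode list and handed out again before the
 * OI file is grown.  Sizes and names are illustrative only. */
#define OI_MAX_IDLE 8

struct oi_inode {
    unsigned long next_new_block;        /* the OI file grows from here */
    unsigned long idle[OI_MAX_IDLE];     /* recycled empty OI blocks    */
    int           nidle;
};

/* Allocation policy: prefer a recycled empty OI block; only extend
 * the OI file with a fresh block when the idle list is empty. */
static unsigned long oi_block_alloc(struct oi_inode *inode)
{
    if (inode->nidle > 0)
        return inode->idle[--inode->nidle];
    return inode->next_new_block++;
}

/* When the last mapping in a block is deleted, the block is recorded
 * as idle rather than released back to the filesystem, so no on-disk
 * format change (and no compatibility issue) is needed. */
static void oi_block_free(struct oi_inode *inode, unsigned long blk)
{
    if (inode->nidle < OI_MAX_IDLE)
        inode->idle[inode->nidle++] = blk;
}
```

Because blocks are recycled in place and never punched out of the file, the OI file layout stays readable by an unpatched MDT, which is the compatibility property claimed above.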
> For unlinked files, there is only a need to delete the FID from the new OI file (if it is there yet). The old FID should no longer be referenced by any files, so there is no harm in leaving it in the old OI file, I think?
It is not so simple. If we only delete the OI mapping in the new OI file and leave it in the old one, what will happen if someone does a lookup-by-FID after the unlink operation? He/she will find the stale OI mapping in the old OI file, but the related object does not exist. In that case it is not easy to distinguish the normal case from the abnormal case where the object was lost because of disk issues or system errors.