Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.3.0, Lustre 2.1.3, Lustre 2.1.6
-
b2_1 g636ddbf
-
3
-
4236
Description
I have a smallish filesystem to which I only allocated a 5GB MDT since the overall dataset was always intended to be very small. This filesystem is simply being used to add and remove files in a loop with something along the lines of:
while true; do
    cp -a /lib /mnt/lustre/foo
    rm -rf /mnt/lustre/foo
done
It seems in doing this I have filled up my MDT with an "oi.16" file that is now 94% of the space of the MDT:
# stat /mnt/lustre/mdt/oi.16
  File: `/mnt/lustre/mdt/oi.16'
  Size: 4733702144   Blocks: 9254568   IO Block: 4096   regular file
Device: fd05h/64773d   Inode: 13   Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2012-05-27 11:55:00.175323551 +0000
Modify: 2012-05-27 11:55:00.175323551 +0000
Change: 2012-05-27 11:55:00.175323551 +0000
# df -k /mnt/lustre/mdt/
Filesystem                1K-blocks    Used Available Use% Mounted on
/dev/mapper/LustreVG-mdt0   5240128 5240128         0 100% /mnt/lustre/mdt
# ls -ls /mnt/lustre/mdt/oi.16
4627284 -rw-r--r-- 1 root root 4733702144 May 27 11:55 /mnt/lustre/mdt/oi.16
It seems the OI file is leaking entries that are not being reaped when files are removed.
Attachments
Activity
The current idea for "backup mode" OI scrub is as follows:
For create: the OI mapping will be inserted into the old OI file first. If the target inode is in front of the OI scrub's current position, then OI scrub can add the mapping to the new OI file as it passes; otherwise the OI mapping should be inserted into the new OI file by the creator.
For unlink: the OI mapping will be deleted from the new OI file first (if it is there).
For lookup: only the old OI file will be checked. If there is no related OI mapping, return -ENOENT; if a related OI mapping is found but the related inode fails to load, return -EIO; if a related OI mapping is found but the loaded inode is not the expected one, return -ENOENT.
When should we do that? Now or LFSCK phase IV?
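The create/unlink/lookup rules above can be modeled with a minimal Python sketch. This is illustrative only: the OI files are plain dicts mapping FID to inode number, and all names (`BackupModeOI`, `scrub_pos`, `load_inode`) are invented for the sketch, not the actual osd-ldiskfs code.

```python
# Toy model of the "backup mode" OI scrub rules: old OI is updated
# first on create, new OI is updated first on unlink, and lookup
# goes through the old OI with inode validation.

ENOENT, EIO = 2, 5

class BackupModeOI:
    def __init__(self):
        self.old_oi = {}       # existing (backup) OI file
        self.new_oi = {}       # OI file being rebuilt by scrub
        self.scrub_pos = 0     # highest inode number already scanned

    def create(self, fid, ino):
        # Insert into the old OI file first.
        self.old_oi[fid] = ino
        # If scrub has already passed this inode it will never see it,
        # so the creator must populate the new OI file itself;
        # otherwise OI scrub adds the mapping when it reaches ino.
        if ino <= self.scrub_pos:
            self.new_oi[fid] = ino

    def unlink(self, fid):
        # Delete from the new OI file first (if it is there),
        # then from the old OI file.
        self.new_oi.pop(fid, None)
        self.old_oi.pop(fid, None)

    def lookup(self, fid, load_inode):
        # Check the old OI file only.
        if fid not in self.old_oi:
            return -ENOENT
        inode = load_inode(self.old_oi[fid])
        if inode is None:
            return -EIO            # mapping exists, inode unreadable
        if inode.get("fid") != fid:
            return -ENOENT         # stale mapping, wrong inode
        return inode
```

The key property is that a lookup never needs the (incomplete) new OI file while the scrub is still running.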
I see two options in that case. One would be to also delete FID entries from the backup OI, but this would hurt performance during OI scrub. The other would be to detect this (hopefully rare) error during lookup, where a FID entry exists in the backup OI but the inode is deleted or does not have a matching LMA FID, and return ENOENT or ESTALE just as if no such entry existed in the first place.
Since the FID entry would have been lost anyway during OI rebuild, this by-FID lookup is just a rare race condition that only happens during scrub.
> For unlinked files, there is only a need to delete the FID from the new OI file (if it is there yet). The old FID should no longer be referenced by any files, so there is no harm to leave it in the old OI file I think?
It is not so simple. If we only delete the OI mapping from the new OI file and leave it in the old one, what happens if someone does a lookup-by-FID after the unlink operation? He/she will find the stale OI mapping in the old OI file, but the related object no longer exists. In that case it is not easy to distinguish the normal case from the abnormal case of an object lost because of disk issues or system errors.
The issue isn't about recovering from a crash, but rather that if this is a "garbage collection" action that needs to be done on a regular basis, and the only way to do it is by deleting the OI file(s) and running an urgent scan, it will have a serious performance impact and block threads that are doing by-FID lookups.
My goal is to allow this "maintenance" action to be done without significant performance impact or delay. I agree that this would be more complex, but I don't know by how much. If we always create a new OI file when running LFSCK, it will also solve the problem of stale FID entries in the OI file. But if we do this, it is better to run at the "background scrub" speed, and to allow a lookup-by-FID that misses the new OI file to be handled from the backup OI file.
For unlinked files, there is only a need to delete the FID from the new OI file (if it is there yet). The old FID should no longer be referenced by any files, so there is no harm to leave it in the old OI file I think?
The "backup OI" mode for OI scrub to rebuild the OI file will introduce more complexity, because there may be concurrent create/unlink operations during the OI scrub. It needs to process both the old OI file and the new OI file and keep them consistent somehow, which changes the normal logic for lookup/create/unlink. Such changes may introduce race bugs.
In fact, we do not care about a system crash during OI scrub, because we can resume OI scrub from the breakpoint. We can guarantee the OI file is rebuilt correctly eventually, even if the system crashes many times.
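The crash-resume behaviour described here can be sketched as a checkpointed scan loop. This is a minimal model, not the real scrub code: the class and field names are invented, and the "persisted" checkpoint is just an attribute standing in for the on-disk scrub bookmark.

```python
# Minimal model of OI scrub resuming from a breakpoint: the scan
# position is saved after each batch, so a crash at any point only
# repeats the last batch, and re-inserting a mapping is idempotent.

class ResumableScrub:
    def __init__(self, total_inodes, batch=8):
        self.total = total_inodes
        self.batch = batch
        self.checkpoint = 0     # persisted on disk in the real code
        self.rebuilt = set()    # stands in for the new OI file

    def run(self, crash_after=None):
        # A restart always begins at the saved checkpoint.
        pos = self.checkpoint
        steps = 0
        while pos < self.total:
            end = min(pos + self.batch, self.total)
            for ino in range(pos, end):
                self.rebuilt.add(ino)   # idempotent re-insert
            pos = end
            self.checkpoint = pos       # persist progress
            steps += 1
            if crash_after is not None and steps >= crash_after:
                return False            # simulated crash mid-scrub
        return True
```

Because each batch is idempotent, any number of crash/restart cycles still converges to a fully rebuilt OI.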
Oleg, what is your suggestion for back-porting OI scrub to Lustre-2.1.x?
My preferred path would be for OI scrub to be backported to 2.1. This would allow fixing this issue (though not in an ideal manner, currently), and also improve maintenance/support for 2.1 itself by allowing recovery from all sorts of OI corruption, backup/restore, etc.
First, however, please ask Oleg if he would also be in favour of landing this code onto b2_1 as well. It is rather large for a maintenance release, though it could be argued for the above reasons that this is really necessary to make 2.1 more supportable in the future.
The one reason I say this doesn't really resolve the OI size problem very well is that it requires manually deleting the OI file(s) and then running OI scrub in urgent mode, which will block threads if they cannot find the FID they are looking for, and cause high load on the MDS.
It would be better to have some kind of "backup OI" mode where a new OI file is created while the old one is used to find any missing FIDs. If the old OI file were kept around, this would also help during OI scrub in case the primary were lost or corrupted. Only in the case of backup/restore, where the old file is useless, would it make sense to delete it right away.
I think that porting the OI scrub code to Lustre-2.1 is simpler than implementing new tools to find existing empty OI blocks. OI scrub is a complete solution for the OI file size issue, because it can shrink the OI file and reclaim the unused space. The work of re-using empty OI blocks and wrapping the FID hash can only slow down the growth of the OI file, but cannot shrink it. So we need OI scrub to resolve the OI file size issue in any case.
As for the method of wrapping the FID hash, it can reuse some idle OI mapping slots, but it depends on the hash function to place new FID mappings into idle slots properly. A good hash function also means more OI mapping inserts instead of appends, which will hurt create performance. On the other hand, it would introduce a compatibility issue with the old OI file format, so it cannot be used to resolve the OI file size issue on Lustre-2.1.
As for the patch to re-use empty OI blocks, the empty OI blocks are recorded in a special on-disk block list. It does not actually release the empty blocks back to the filesystem.
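The free-list scheme described here can be sketched in a few lines. This is a toy model under stated assumptions: block numbers are abstract, and `OIBlockAllocator` and its fields are invented names, not the patch's actual data structures.

```python
# Toy model of the empty-OI-block reuse scheme: blocks that become
# empty are recorded in a free list (kept on disk in the real patch)
# and handed out again before the file is extended.  Note the file
# size never decreases; empty blocks are only remembered, not freed.

class OIBlockAllocator:
    def __init__(self):
        self.next_block = 0    # append position (file only grows)
        self.free_list = []    # recorded empty blocks

    def alloc(self):
        # Prefer a recorded empty block; otherwise grow the file.
        if self.free_list:
            return self.free_list.pop()
        blk = self.next_block
        self.next_block += 1
        return blk

    def block_emptied(self, blk):
        # The block is not released back to the filesystem, only
        # remembered for later reuse.
        self.free_list.append(blk)
```

This captures why the approach can only slow OI growth: `next_block` (the file size) is monotonic, so reclaiming space still requires an OI rebuild.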
Comment from Andreas:
> Your worry is not unnecessary, because in real use cases file deletion is random; nobody can guarantee that the delete operations will leave the related OI blocks empty.
Exactly. It may be that there are only a few entries in each block (e.g. an output file saved after some thousands of temporary files are written and then deleted), and there are few or no empty blocks.
> But on the other hand, if there are no empty OI blocks in the OI files, that somewhat implies the OI space utilization in such a system is not so bad.
That is not always clear. If the blocks are sparsely used, then the hash wrapping scheme would definitely help.
> Because the starting point for the OI file is performance, a few single OI files need to support all the OI operations on the server. So the original policy for the OI design was to use more space for more performance. In the real world, the MDT device is often TB-sized; nobody will mind the OI files using GB of space.
True, but we've increased the number of inodes for 2.x releases to use up more of that excess space than in the past. I agree that for large filesystems it should be less of a risk, and for small test filesystems we can hope that it helps enough under test loads to avoid problems.
I wouldn't object to combining both solutions for 2.3 so that we can be sure this problem does not hit us again in the future. I also like your ideas that OI scrub could fix this problem, but would it require significant effort to back port this code to 2.1? It is definitely more of a feature than I would like to include into 2.1, but it is also one of the major holes in the ability to support 2.1 for the long term if any OI problem results in an unusable filesystem.
> My current patch can reuse newly emptied OI blocks (against any Lustre-2.x release); existing empty OI blocks will be kept there without being reused.
I haven't looked at your patch yet, but need to know more about how the solution works. Does it keep a persistent list of empty blocks on disk, or only in memory, or does it just delete the free blocks from the file?
Does the file size/offset of the OI file continue to grow during its lifetime? If it does, will it hit the 16TB size limit in heavy usage within, say, 5 years?
> We could implement a new tool to find all the existing empty OI blocks by traversing the OI file. But I wonder whether that is worth doing, because we will have OI scrub in Lustre-2.3. We can backport OI scrub to Lustre-2.1, which may be easier than implementing a new tool to find empty OI blocks. And rebuilding the OI files can reclaim more space than only reusing empty blocks.
It would be better to re-use the OI scrub code than to spend time developing a new tool for this. The OI scrub has more uses, and could be done online.
What might be needed at some point in the future is a "mirrored OI" mode where the new OI file can be built while the old one is used for reference. That would avoid threads hanging when a FID is not yet in the new OI file.
Your worry is not unnecessary, because in real use cases file deletion is random; nobody can guarantee that the delete operations will leave the related OI blocks empty.
But on the other hand, if there are no empty OI blocks in the OI files, that somewhat implies the OI space utilization in such a system is not so bad. Because the starting point for the OI file is performance, a few single OI files need to support all the OI operations on the server. So the original policy for the OI design was to use more space for more performance. In the real world, the MDT device is often TB-sized; nobody will mind the OI files using GB of space.
My current patch can reuse newly emptied OI blocks (against any Lustre-2.x release); existing empty OI blocks will be kept there without being reused. We could implement a new tool to find all the existing empty OI blocks by traversing the OI file, but I wonder whether that is worth doing, because we will have OI scrub in Lustre-2.3. We can backport OI scrub to Lustre-2.1, which may be easier than implementing a new tool to find empty OI blocks. And rebuilding the OI files can reclaim more space than only reusing empty blocks.
What do you think?
This is a comment from Andreas:
This will help in our limited test case of creating and deleting files in a loop. The real question is whether there will be so many empty OI blocks in real life, when files are not deleted in strict sequence.
I like the idea that this can be applied to fix the problem even on 2.1 releases that have already seen the problem, but it is important to know whether it will really help. This is especially true if it adds complexity to the code and doesn't actually help much in the end.
One path forward is to create a debug patch that can be included into 2.1.3 that will print out (at mount time or via /proc?) how many empty blocks there really are in the OIs. The one drawback is that this may cause a LOT of seeking to read large OI files at mount, which may be unacceptable in production. This could be used by CEA and/or LLNL on their production systems to report the state of the OI file(s).
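A very rough offline sketch of such a diagnostic is shown below. It naively treats an all-zero 4 KB block as "empty"; real IAM leaf and index blocks carry headers, so an actual tool would have to parse the IAM on-disk format instead. The function name and block-size constant are assumptions for illustration.

```python
# Count candidate empty 4KB blocks in an OI file (offline, on an
# unmounted or snapshotted MDT).  "Empty" here means all-zero,
# which is only an approximation of a truly unused IAM block.

BLOCK_SIZE = 4096

def count_empty_blocks(path):
    total = empty = 0
    with open(path, "rb") as f:
        while True:
            blk = f.read(BLOCK_SIZE)
            if not blk:
                break
            total += 1
            if blk == b"\0" * len(blk):
                empty += 1
    return total, empty
```

Run against a copy of `oi.16`, this would give a quick upper bound on how much space block reuse could ever recover.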
Cheers, Andreas
For backup OI, I think it makes more sense to do the opposite: update only the new OI, and leave the old OI as only a backup. For creates, only add newly created FIDs into the new OI. For normal lookups by name, the existing OI rebuild will add the FID into the new OI already. Only in the case of a by-FID lookup that misses in the new OI do we need to do a lookup in the backup OI. For unlinked files, only delete the FID from the new OI.
If there is an old (invalid) lookup by FID for a deleted file that misses in the new OI but is found in the old OI, there will still be a chance to return an error if the inode is not found.
I think this will reduce the number of updates to disk, with changes only being made to the new OI file; the old OI file will not be modified at all.
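This alternative scheme can be sketched the same way as the earlier one. Again a toy model: OI files are dicts mapping FID to inode number, and `ReadOnlyBackupOI` and `load_inode` are invented names, not real Lustre interfaces.

```python
# Sketch of the "update only the new OI" proposal: the old OI is a
# read-only backup, every create/unlink touches only the new OI,
# and a by-FID miss in the new OI falls back to the backup with
# inode validation so stale backup entries still return an error.

ENOENT = 2

class ReadOnlyBackupOI:
    def __init__(self, old_oi):
        self.old_oi = dict(old_oi)  # never modified after this point
        self.new_oi = {}

    def create(self, fid, ino):
        self.new_oi[fid] = ino      # new FIDs go to the new OI only

    def unlink(self, fid):
        self.new_oi.pop(fid, None)  # old OI entry is left stale

    def lookup_by_fid(self, fid, load_inode):
        if fid in self.new_oi:
            return load_inode(self.new_oi[fid])
        # Miss in the new OI: fall back to the backup, validating the
        # inode's LMA FID so a stale entry behaves like a missing one.
        if fid in self.old_oi:
            inode = load_inode(self.old_oi[fid])
            if inode is not None and inode.get("fid") == fid:
                return inode
        return -ENOENT
```

Compared with the earlier "old OI first" scheme, only one file ever sees writes, which is the reduction in disk updates described above.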
As for when to do this, I think the OI rebuild should be ported to b2_1 first (subject to pre-approval from Oleg), and the "backup OI" handling can be done in Phase IV, since this is largely only a performance/usability improvement after the base OI scrub is available.