Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.3.0, Lustre 2.4.0
    • Affects Version/s: Lustre 2.3.0, Lustre 2.1.3, Lustre 2.1.6
    • b2_1 g636ddbf
    • 3
    • 4236

    Description

      I have a smallish filesystem to which I only allocated a 5GB MDT since the overall dataset was always intended to be very small. This filesystem is simply being used to add and remove files in a loop with something along the lines of:

      while true; do
          cp -a /lib /mnt/lustre/foo
          rm -rf /mnt/lustre/foo
      done
      

      It seems in doing this I have filled up my MDT with an "oi.16" file that is now 94% of the space of the MDT:

      # stat /mnt/lustre/mdt/oi.16 
        File: `/mnt/lustre/mdt/oi.16'
        Size: 4733702144	Blocks: 9254568    IO Block: 4096   regular file
      Device: fd05h/64773d	Inode: 13          Links: 1
      Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
      Access: 2012-05-27 11:55:00.175323551 +0000
      Modify: 2012-05-27 11:55:00.175323551 +0000
      Change: 2012-05-27 11:55:00.175323551 +0000
      
      # df -k /mnt/lustre/mdt/
      Filesystem           1K-blocks      Used Available Use% Mounted on
      /dev/mapper/LustreVG-mdt0
                             5240128   5240128         0 100% /mnt/lustre/mdt
      
      # ls -ls /mnt/lustre/mdt/oi.16 
      4627284 -rw-r--r-- 1 root root 4733702144 May 27 11:55 /mnt/lustre/mdt/oi.16
      

      It seems the OI is leaking and not being reaped when files are removed.

        Activity

          [LU-1512] OI leaks

          yong.fan nasf (Inactive) added a comment:

          Your worry is not unnecessary: in real use cases, file deletion is random, and nobody can guarantee that the delete operations will leave the related OI blocks empty.

          On the other hand, if there are no empty OI blocks in the OI files, that in a way means the OI space utilization on such a system is not so bad. The starting point for the OI files is performance: a small number of OI files need to support all the OI operations on the server, so the original design policy was to trade more space for more performance. In the real world, the MDT device is often TB-sized, and nobody will mind the OI files using a few GB of space.

          My current patch can reuse newly emptied OI blocks (against any Lustre 2.x release); existing empty OI blocks will be kept there without being reused. We could implement a new tool to find all the existing empty OI blocks by traversing the OI file, but I wonder whether that is worth doing, because we will have OI scrub in Lustre 2.3. We could back-port OI scrub to Lustre 2.1, which may be easier than implementing a new tool to find empty OI blocks, and rebuilding the OI files can reclaim more space than only reusing empty blocks.

          What do you think?


          yong.fan nasf (Inactive) added a comment:

          This is a comment from Andreas:

          This will help in our limited test case of creating and deleting files in a loop. The real question is whether there will be so many empty OI blocks in real life, when all files are not deleted in strict sequence?

          I like the idea that this can be applied to fix the problem even on 2.1 releases that have already seen the problem, but it is important to know whether it will really help. This is especially true if this adds complexity to the code and doesn't actually help much in the end.

          One path forward is to create a debug patch that can be included in 2.1.3 that will print out (at mount time or via /proc?) how many empty blocks there really are in the OIs. The one drawback is that this may cause a LOT of seeking to read large OI files at mount, which may be unacceptable in production. This could be used by CEA and/or LLNL on their production systems to report the state of the OI file(s).

          Cheers, Andreas


          yong.fan nasf (Inactive) added a comment:

          The patch contains a sanity.sh update, test_228, which will verify whether the OI file size increases when new files are created while some empty OI blocks exist.

          yong.fan nasf (Inactive) added a comment (edited):

          Patch for reusing empty OI blocks:

          http://review.whamcloud.com/#change,3153,set4

          For old Lustre 2.x releases, this patch only affects creates and unlinks performed after it is applied; it does not affect the existing empty OI blocks.

          Andreas, is it necessary to introduce a tool to find all the empty OI blocks in existing OI files so they can be reused, or should we leave them to be rebuilt by OI scrub in Lustre 2.3?


          yong.fan nasf (Inactive) added a comment:

          After some testing, I found that wrapping the FID hash to reuse idle OI slots may not be an efficient solution to the OI file size issue. The positions of the idle OI slots are random, depending on which files were removed, and it is almost impossible to find a suitable hash function that maps new OI mappings evenly onto those random idle slots.

          On the other hand, wrapping the FID hash is inefficient for OI slot insertion, because it causes more memmove() work within the affected OI blocks; with the original flat hash, most OI slot insertions are appends within their blocks. So create performance may become worse.

          In fact, the most serious cause of OI file growth is the empty but non-released OI blocks. As long as we can reuse those blocks, we can greatly slow down the growth of the OI files.

          My current idea is to introduce inode::i_idle_blocks to record these non-released OI blocks when they become empty, and to adjust the OI block allocation strategy: reuse an empty block from inode::i_idle_blocks by preference, and allocate a new block from the volume only when no idle OI block can be reused.

          Another advantage is that such a change introduces no OI compatibility issues: a new OI file can be accessed by an old MDT, and a new MDT can access an old OI file.
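
          As a rough illustration of the allocation policy described above (only a minimal sketch; apart from the inode::i_idle_blocks name, the types and functions here are hypothetical stand-ins, not the actual osd-ldiskfs code):

          /* Sketch: track OI blocks that became empty and hand them back out
           * before growing the OI file.  Hypothetical, simplified structures. */
          #include <stdlib.h>

          struct oi_idle_block {
                  struct oi_idle_block *next;
                  unsigned long blkno;                /* block already owned by the OI file */
          };

          struct oi_file {
                  struct oi_idle_block *idle_blocks;  /* stands in for inode::i_idle_blocks */
                  unsigned long next_new_block;       /* next block to take from the volume */
          };

          /* Record a block whose last OI mapping was deleted, instead of releasing it. */
          static void oi_block_mark_idle(struct oi_file *oi, unsigned long blkno)
          {
                  struct oi_idle_block *idle = malloc(sizeof(*idle));

                  if (idle == NULL)
                          return;                     /* on failure the block simply goes untracked */
                  idle->blkno = blkno;
                  idle->next = oi->idle_blocks;
                  oi->idle_blocks = idle;
          }

          /* Prefer reusing an idle block; only grow the file when none is available. */
          static unsigned long oi_block_alloc(struct oi_file *oi)
          {
                  struct oi_idle_block *idle = oi->idle_blocks;

                  if (idle != NULL) {
                          unsigned long blkno = idle->blkno;

                          oi->idle_blocks = idle->next;
                          free(idle);
                          return blkno;
                  }
                  return oi->next_new_block++;
          }

          int main(void)
          {
                  struct oi_file oi = { .idle_blocks = NULL, .next_new_block = 100 };

                  oi_block_mark_idle(&oi, 42);        /* a block just went empty        */
                  (void)oi_block_alloc(&oi);          /* returns 42: reused, not grown  */
                  (void)oi_block_alloc(&oi);          /* returns 100: file must grow    */
                  return 0;
          }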


          adilger Andreas Dilger added a comment:

          We need to consider this patch for 2.1.3.

          While it is an incompatible change to the OI format, it should only affect newly formatted filesystems, and is backward compatible with existing 2.1 filesystems. Since 2.1 does not have OI scrub, there would be no way to handle any problems hit with the OI growing too large.


          liang Liang Zhen (Inactive) added a comment:

          I think we should have this fix in 2.3. It is really important, because users have already started to complain about this; please see LU-1648.


          chris Chris Gearing (Inactive) added a comment:

          Are we going to update the test scripts to include a set of tests that would catch this and similar issues in the future?

          yong.fan nasf (Inactive) added a comment:

          This is the patch: http://review.whamcloud.com/#change,3153

          yong.fan nasf (Inactive) added a comment (edited):

          I do not think there was any OI file size limit before. If we never delete files from Lustre, but only create them, the OI file size should only ever increase and should not hit any upper bound.

          Anyway, I agree that we should introduce some FID hash function that makes the OI mapping hash value wrap back at some point. Then it becomes possible for new OI mappings to reuse formerly idle OI mapping slots.

          My current idea is to wrap the FID hash back every 1K sequences. For example, [1 - 1000] is the first sequence range. Then sequence 1001 will be hashed to a value between seq[1]'s and seq[2]'s, sequence 1002 to a value between seq[2]'s and seq[3]'s, and so on. If some files belonging to seq[1] are removed before new files belonging to seq[1001] are created, the new files' OI mappings can reuse the idle OI mapping slots that were occupied by seq[1]'s old files. Sequence 2001 will be hashed to a value between seq[1001]'s and seq[2]'s, and in general sequence (1000 * N + M) will be hashed to a value between seq[1000 * (N - 1) + M]'s and seq[M + 1]'s.
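
          A minimal sketch of the wrapping idea described above, under the assumption of a simplified integer hash position rather than the real IAM hash (the function name, the SEQ_WRAP constant, and the spacing factor are purely illustrative):

          /* Map a sequence number to a hash position so that later "rounds"
           * of 1000 sequences interleave between the slots of the first round,
           * letting freed slots be reused instead of always extending the OI. */
          #include <stdint.h>
          #include <stdio.h>

          #define SEQ_WRAP 1000ULL

          static uint64_t seq_hash(uint64_t seq)
          {
                  uint64_t round = (seq - 1) / SEQ_WRAP;  /* which wrap-around pass   */
                  uint64_t slot  = (seq - 1) % SEQ_WRAP;  /* position within the pass */

                  /* Spread the first round SEQ_WRAP apart, then offset later
                   * rounds so they land between the earlier positions. */
                  return slot * SEQ_WRAP + round;
          }

          int main(void)
          {
                  printf("seq 1    -> %llu\n", (unsigned long long)seq_hash(1));    /* 0    */
                  printf("seq 2    -> %llu\n", (unsigned long long)seq_hash(2));    /* 1000 */
                  printf("seq 1001 -> %llu\n", (unsigned long long)seq_hash(1001)); /* 1: between seq 1 and seq 2    */
                  printf("seq 2001 -> %llu\n", (unsigned long long)seq_hash(2001)); /* 2: between seq 1001 and seq 2 */
                  return 0;
          }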

          Andreas, any suggestions?


          adilger Andreas Dilger added a comment:

          So, I hit a similar problem on my test system just now, but it appears something strange is happening. The oi.16.16 file is large, along with a few other OIs, and the rest are tiny:

          total 10188
             4 capa_keys            8 oi.16.19     8 oi.16.37     8 oi.16.55
             4 CATALOGS*            8 oi.16.2      8 oi.16.38     8 oi.16.56
             4 CONFIGS/             8 oi.16.20     8 oi.16.39     8 oi.16.57
             8 fld                  8 oi.16.21     8 oi.16.4      8 oi.16.58
             8 last_rcvd            8 oi.16.22     8 oi.16.40     8 oi.16.59
            16 lost+found/          8 oi.16.23     8 oi.16.41     8 oi.16.6
             4 lov_objid            8 oi.16.24     8 oi.16.42     8 oi.16.60
             4 NIDTBL_VERSIONS/     8 oi.16.25     8 oi.16.43     8 oi.16.61
             4 OBJECTS/             8 oi.16.26     8 oi.16.44     8 oi.16.62
           108 oi.16.0              8 oi.16.27     8 oi.16.45     8 oi.16.63
           392 oi.16.1              8 oi.16.28     8 oi.16.46     8 oi.16.7
             8 oi.16.10             8 oi.16.29     8 oi.16.47     8 oi.16.8
             8 oi.16.11             8 oi.16.3      8 oi.16.48     8 oi.16.9
             8 oi.16.12             8 oi.16.30     8 oi.16.49     4 OI_scrub
             8 oi.16.13             8 oi.16.31     8 oi.16.5      4 PENDING/
             8 oi.16.14          6224 oi.16.32     8 oi.16.50    16 ROOT/
             8 oi.16.15          1844 oi.16.33     8 oi.16.51     4 seq_ctl
          1060 oi.16.16             8 oi.16.34     8 oi.16.52     4 seq_srv
             8 oi.16.17             8 oi.16.35     8 oi.16.53
             8 oi.16.18             8 oi.16.36     8 oi.16.54
          

          So, oi.16.0, oi.16.1, oi.16.16, oi.16.32, and oi.16.33 are the only ones that appear to be in use.

          This is running with a 200MB MDT for "SLOW=no sh acceptance-small.sh" and an additional change to runtests to create 10000 files. It also appears that sanity.sh test_51b is trying to create 70000 subdirectories, but there aren't very many files in the filesystem:

          # ../utils/lfs df -i
          UUID                      Inodes       IUsed       IFree IUse% Mounted on
          testfs-MDT0000_UUID       114688       34398       80290  30% /mnt/lustre[MDT:0]
          testfs-OST0000_UUID        57344         143       57201   0% /mnt/lustre[OST:0]
          testfs-OST0001_UUID        57344         296       57048   1% /mnt/lustre[OST:1]
          testfs-OST0002_UUID        57344         138       57206   0% /mnt/lustre[OST:2]
          
          filesystem summary:       114688       34398       80290  30% /mnt/lustre
          

          It would seem to me that the OI selection function is imbalanced. The osd_fid2oi() code appears to be selecting the OI index based on (seq % oi_count), which should be OK. The seq should be updated every LUSTRE_SEQ_MAX_WIDTH (0x20000 = 131072 objects), so the inter-OI distributions should be relatively well balanced on even a slightly larger filesystem.
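
          For reference, a toy illustration of the (seq % oi_count) selection described above (the struct and function here are simplified stand-ins, not the actual osd_fid2oi() implementation):

          /* Pick one of the oi.16.N files from the FID sequence number.
           * Hypothetical, simplified types for illustration only. */
          #include <stdint.h>
          #include <stdio.h>

          #define OI_COUNT 64     /* number of oi.16.* files on this MDT */

          struct fid_sketch {
                  uint64_t f_seq; /* bumped every LUSTRE_SEQ_MAX_WIDTH (0x20000) objects */
                  uint32_t f_oid;
                  uint32_t f_ver;
          };

          static unsigned int fid2oi_index(const struct fid_sketch *fid)
          {
                  return (unsigned int)(fid->f_seq % OI_COUNT);
          }

          int main(void)
          {
                  struct fid_sketch fid = { .f_seq = 0x200000401ULL, .f_oid = 1, .f_ver = 0 };

                  printf("FID seq %#llx maps to oi.16.%u\n",
                         (unsigned long long)fid.f_seq, fid2oi_index(&fid));
                  return 0;
          }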

          I don't think there is a huge problem with the OI itself not releasing space, so long as the space that is allocated is re-used. That means the internal hashing function needs to re-use buckets after some time, rather than always allocating blocks for new buckets.

          It seems another related problem of having many OI files in a small filesystem is that the space allocated to each OI is not being used again, but rather new space is allocated to each new OI. A workaround for the test filesystems is to create fewer OI files in the case of smaller MDT size, and only allocate all 64 OIs for the case of large MDTs. This is not the original problem seen here, since multi-OI support is only in 2.2, but it can be a major contributor, since the total space used by the OI would increase by 64x compared to the single-OI case.

          Fan Yong, I can't believe that there is NO LIMIT on the size of the OI file? Surely there must be some upper bound on the use of the OID as the hash index before it begins to wrap? It is impossible for a 128-bit value to fit into a smaller hash space without any risk of collision, and it is impossible to store a linear index with even a reasonable number of files created in the filesystem over time, so there HAS to be some code to take this into account? Was the OI/IAM code implemented with so little foresight that it will just grow without limit to fill the MDT as new entries are inserted?

          I would expect at least some simple modulus would provide an upper limit to the OI size, at which point we need to size the MDT taking this into account, and limit the OI count to ensure that these files do not fill the MDT.


          People

            Assignee: yong.fan nasf (Inactive)
            Reporter: brian Brian Murrell (Inactive)
            Votes: 0
            Watchers: 15
