[LU-5726] MDS buffer not freed when deleting files Created: 11/Oct/14 Updated: 03/Mar/15 Resolved: 05/Feb/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | Lustre 2.7.0, Lustre 2.5.4 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Rick Mohr | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | CentOS 6.5 |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 16083 |
| Description |
|
When deleting large numbers of files, memory usage on the MDS server grows significantly. Attempts to reclaim memory by dropping caches only result in some of the memory being freed. The buffer usage continues to grow until eventually the MDS server starts OOMing. The rate at which the buffer usage grows seems to vary, but it looks like it might depend on the number of clients that are deleting files and the speed at which the files are deleted. |
| Comments |
| Comment by Rick Mohr [ 11/Oct/14 ] |
|
Below are some test results that I originally sent to the hpdd-discuss mailing list. In these tests, I created directory trees populated with empty files (stripe_count=1) and then used various methods to delete the files. Before and after each test, I ran "echo 1 > /proc/sys/vm/drop_caches" on the MDS and recorded the "base" buffer usage. Lustre servers are running 2.4.3.

Test #1) On Lustre 2.5.0 client, used "rm -rf" to remove approx 700K files. Buffer usage: before = 14.3 GB, after = 15.4 GB

Test #2) On Lustre 1.8.9 client, used "rm -rf" to remove approx 730K files. Buffer usage: before = 15.4 GB, after = 15.8 GB

Test #3) On Lustre 1.8.9 client, used "clean_dirs.sh" script (thanks to Steve Ayers) to remove approx 650K files. Buffer usage: before = 15.8 GB, after = 16.0 GB

Test #4) On Lustre 2.5.0 client, used "clean_dirs.sh" script to remove approx 730K files. Buffer usage: before = 16.0 GB, after = 16.25 GB

Test #5) On one Lustre 2.5.0 client and two Lustre 1.8.9 clients, used "rm -rf" to delete approx 330K files on each host simultaneously. Buffer usage: before = 16.26 GB, after = 17.63 GB

Test #6) Similar to test #5, but used "rm -rf" to delete files in groups of approx 110K. The deletion of these groups was staggered in time across the three nodes. At some points two nodes were deleting simultaneously and at other times only one node was deleting files. Buffer usage: before = 17.63 GB, after = 17.8 GB

Test #7) On Lustre 2.5.0 client, deleted 9 groups of 110K files each. The groups were deleted sequentially with some pauses between groups. Buffer usage: before = 17.8 GB, after = 17.9 GB

Test #8) On Lustre 1.8.9 client, used "find $DIRNAME -delete" to remove approx 1M files. Buffer usage: before = 17.9 GB, after = 19.4 GB

The tests showed quite a bit of variance between nodes and also between the tools used to delete the files. The lowest increase in buffer usage seemed to occur when files were deleted sequentially in smaller batches. I don't see any consistent pattern except for the fact that the base buffer usage always seems to increase when files are deleted in large numbers.

The setup was not completely ideal since other users were actively using the file system at the same time. However, there are a couple of things to note:

1) Over the course of the week, the MDS base buffer usage increased from 14.3 GB to 19.4 GB. These increases only occurred during my file removal tests, and there was never any decrease in base buffer usage at any point.

2) Other file system activity did not seem to contribute to the base buffer usage increase. During the nights/weekend when I did not do testing, the overall buffer usage did increase. However, when I would drop the caches to measure the base buffer usage, it always returned to the same (or at least very nearly the same) value as it was the day before. I also observed an application doing millions of file open/read/close operations, and none of this increased the base buffer usage. |
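A minimal shell sketch of the measurement procedure described above; the test directory path is a placeholder, and the delete itself is run from a client, so this only automates the before/after bookkeeping on the MDS:

#!/bin/sh
# Record "base" MDS buffer usage before and after a delete test,
# following the procedure described in the comment above.
snapshot() {
    sync                                  # flush pending writes first
    echo 1 > /proc/sys/vm/drop_caches     # drop reclaimable page cache
    grep -E '^(Buffers|Inactive\(file\))' /proc/meminfo
}

echo "=== before delete ==="
snapshot
echo "Run the delete from a client now (e.g. rm -rf /lustre/scratch/testdir), then press Enter."
read dummy
echo "=== after delete ==="
snapshot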
| Comment by Rick Mohr [ 11/Oct/14 ] |
|
At the suggestion of Andreas, I ran another test with the lustre debug flag +malloc enabled and captured the debug log on the MDS server. The test involved running "rm -rf" to remove a directory tree containing approx 1 million files. I have attached this log file along with the contents of the slabinfo and meminfo files captured before and after the test. The slabinfo/meminfo was gathered after running "echo 1 > /proc/sys/vm/drop_caches" so I could see how much memory was unreclaimable. |
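For reference, a hedged sketch of how such a capture could be scripted; lctl set_param debug=+malloc and lctl dk are standard Lustre debug controls, but the exact commands used for this test are not shown in the ticket, and the file paths are arbitrary:

# On the MDS, before the test:
lctl set_param debug=+malloc
echo 1 > /proc/sys/vm/drop_caches
cp /proc/slabinfo /tmp/slabinfo.before
cp /proc/meminfo  /tmp/meminfo.before

# ... run "rm -rf" of the test directory from a client ...

# After the test:
echo 1 > /proc/sys/vm/drop_caches        # so the snapshots show unreclaimable memory
cp /proc/slabinfo /tmp/slabinfo.after
cp /proc/meminfo  /tmp/meminfo.after
lctl dk /tmp/lustre-debug.log            # dump the kernel debug buffer to a file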
| Comment by Niu Yawei (Inactive) [ 15/Oct/14 ] |
|
Did you sync the filesystem after the "rm" and before drop_caches? I think the buffers may still be held by uncommitted requests. |
| Comment by Dmitry Eremin (Inactive) [ 17/Oct/14 ] |
|
The command "echo 1 > /proc/sys/vm/drop_caches" is free page cache only. But most of allocated Lustre memory is associated with dentries and inodes. It can be freed by command "echo 3 > /proc/sys/vm/drop_caches". |
| Comment by Rick Mohr [ 20/Oct/14 ] |
|
I tried using a sync before dropping caches, and it did not make a difference. I also tried echoing "3" into the drop_caches file, and that did not make a difference either. |
| Comment by Rick Mohr [ 21/Oct/14 ] |
|
Tried some tests using Lustre 2.5.3, and the problem still exists there. Also, a coworker was deleting a large number of files across 10 different Lustre 2.5.3 client nodes, and this caused the MDS (w/ 64 GB RAM) to OOM in less than an hour. So this bug is easy to trigger, and in principle, any user could crash the file system just by deleting enough files. |
| Comment by Rick Mohr [ 22/Oct/14 ] |
|
Some more info: If we unmount the MDT, all of the unreclaimable buffer memory is freed. We don't need to unload any kernel modules in order to get the memory back. |
| Comment by Niu Yawei (Inactive) [ 22/Oct/14 ] |
|
I'm wondering if it's related to another existing ticket.

Hi, Rick
How many files were deleted in total? |
| Comment by Rick Mohr [ 22/Oct/14 ] |
|
I deleted about 1 million files. The *.before files were collected after I dropped caches but before the files were deleted. The *.after files were collected after the files were deleted and the cache had been dropped. For this small test, the difference is not huge. From the meminfo files, the main difference is that Buffers increased by about 0.5 GB, and that increase also roughly equals the increase in "Inactive(file)". Other memory numbers stayed about the same, and a few even decreased. But this trend continues as more and more files are deleted, so the buffer usage keeps growing.

At first I thought that there might be a lot of locks consuming memory, but the slab usage doesn't seem to increase. In fact, if you look at the before/after slabinfo and sort by the number of active objects, you'll see that several of those categories (ldlm_locks, size-192, selinux_inode_security, ldiskfs_inode_cache, lod_obj, mdt_obj, mdd_obj, osp_obj, ldlm_resources, size-32) have roughly 1 million fewer active objects after the files were deleted. So the memory usage keeps increasing, but I can't seem to tie it to any particular slab structures. I had also come across another ticket that looked like it might be related.

I should note that in some other tests, we tried dropping locks and caches on the client side to see if that would free up memory on the MDS. We also tried unmounting the lustre file system. None of those approaches freed up any MDS memory. |
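A small bash sketch of the kind of before/after slab comparison described above, assuming the two snapshots were saved to files; it relies on the standard /proc/slabinfo column order (name, active_objs, num_objs, ...):

# Print the per-slab change in active objects, most-decreased slabs first
join <(awk 'NR>2 {print $1, $2}' slabinfo.before | sort) \
     <(awk 'NR>2 {print $1, $2}' slabinfo.after  | sort) |
awk '{d = $3 - $2; if (d != 0) print d, $1}' | sort -n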
| Comment by Niu Yawei (Inactive) [ 23/Oct/14 ] |
You mean umount MDT or client? |
| Comment by Rick Mohr [ 23/Oct/14 ] |
|
Sorry for not being clear. Unmounting the file system on the client did not free up the MDS memory. However, unmounting the MDT did free up the memory. |
| Comment by Rick Mohr [ 24/Oct/14 ] |
|
After doing some more testing yesterday and looking back over the slabinfo/meminfo data, a coworker and I found what appears to be a connection with the buffer_head objects. Prior to the test, the "Buffer(Inactive file)" usage was 867,196 kB and the number of active buffer_head objects was 216,313. After the test, "Buffer(Inactive file)" was 1,386,604 kB and the active buffer_head objects was 347,257. If the buffer_head objects accounted for the increased buffer usage, this would work out to about 4KB per buffer_head (basically, one page of memory). I don't know a lot about the Linux buffer_head structure or how it is used in I/O, but based on some online reading I was wondering if the file deletions were possibly creating a bunch of small disk I/O requests which results in buffer_head structures that point to small 4KB I/O buffers. If those I/O requests get completed, but for some reason the buffer_head structures aren't released, then maybe there is a continuous increase in memory usage in 4KB chunks. This wouldn't matter so much for a small number of file deletions, but when several million files are deleted, it starts to consume significant amounts of memory. Is something like that possible? |
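The per-object arithmetic above holds up; a quick sanity check with the numbers from this comment (shell arithmetic, purely illustrative):

# growth in kB (1,386,604 - 867,196), converted to bytes, divided by the new buffer_head objects (347,257 - 216,313)
echo $(( (1386604 - 867196) * 1024 / (347257 - 216313) ))   # ~4062 bytes, i.e. about one 4 KB page per buffer_head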
| Comment by Niu Yawei (Inactive) [ 27/Oct/14 ] |
|
Deleting a file will definitely use buffer_heads (for deleting the file itself and for writing the unlink log record). However, I think those buffers should be reclaimed by "echo 3 > drop_caches" once the unlink operation has been committed (synced) to disk; otherwise, I think the MDT wouldn't unmount cleanly. |
| Comment by Niu Yawei (Inactive) [ 27/Oct/14 ] |
|
BTW, I see lots of "mdt_mfd_new()" and "mdt_mfd_set_mode()" in the debug log. I think these belong to open operations; were there lots of ongoing open operations while you were doing the "rm"? |
| Comment by Rick Mohr [ 27/Oct/14 ] |
|
I was capturing the data from our production file system, so it was actively being used by our user community. It is entirely possible that some user was issuing a lot of open calls. |
| Comment by Niu Yawei (Inactive) [ 29/Oct/14 ] |
|
I did some local testing against the master branch. It looks like the "buffer memory" did increase after an "rm 5000 files" operation, and it can't be reclaimed by "echo 3 > /proc/sys/vm/drop_caches". I haven't found where the "buffer memory" is used yet, but "chown 5000 files" and "create 5000 files" don't seem to have the problem. |
| Comment by Niu Yawei (Inactive) [ 29/Oct/14 ] |
|
Hi, Rick

I think it could be the same problem as the one reported in another ticket. |
| Comment by Rick Mohr [ 29/Oct/14 ] |
|
Niu,

There are some similarities with the other ticket. We saw messages like this:

Sep 13 08:03:40 medusa-mds1 kernel: LNetError: 2702:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 172.16.30.1@o2ib (141): c: 62, oc: 0, rc: 63

This was followed immediately with a stack trace like this:

Sep 13 08:03:40 medusa-mds1 kernel: INFO: task kswapd0:178 blocked for more than 120 seconds.

There were some more of these (I will attach a more complete log later), and while they weren't all identical, they did all seem to contain the same two lines:

Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffff8119d7a4>] shrink_icache_memory+0x274/0x2e0

So I would assume that all of those instances were caused by the system trying to free up memory but being unable to do so. The main difference I see compared with the stack traces from the other ticket involves this line:

[<ffffffff81134b99>] ? zone_reclaim+0x349/0x400

Shortly after we upgraded from 1.8 to 2.4 in May, we saw cases where the MDS would seem to grind to a halt, and based on those stack traces, it looked like the system was having problems allocating memory. Our early stack traces in May contained that "zone_reclaim" line. However, we soon realized that we forgot to reapply the sysctl setting "vm.zone_reclaim_mode = 0" on the MDS server after the upgrade. Once we did that, we didn't have any further problems until just recently. If the systems in the other ticket have not set vm.zone_reclaim_mode=0, that might be worth checking. |
| Comment by Rick Mohr [ 29/Oct/14 ] |
|
Attached log messages around the time of first incident. |
| Comment by Rick Mohr [ 29/Oct/14 ] |
|
Niu, There are a couple of things about the log that I should point out. The first events were around 8AM on Sept 13 and you'll notice that the oom-killer fired to kill off the nslcd process. However, the server seems to have remained functional for at least some time after that. Our Nagios monitoring checks did not report any problems until almost 5PM that same day, and it wasn't until about 6 or 7PM that the system became completely unresponsive. So it seems like it was a slow decline. Also, we ran into the same (or at least we assume the same) problem on Sep 15 around 1PM. We noticed the memory usage kept increasing until the oom-killer fired and killed several processes, after which point the system became unresponsive. However, in that incident, the oom-killer was not preceded by any lustre stack traces in the log file. |
| Comment by Niu Yawei (Inactive) [ 30/Oct/14 ] |
Is it possible that min_free_kbytes is too low for your MDS? The following is quoted from the RH manual:

"Setting min_free_kbytes too low prevents the system from reclaiming memory. This can result in system hangs and OOM-killing multiple processes. However, setting this parameter too high (5% - 10% of total system memory) will cause your system to become out-of-memory immediately."

I'm wondering if you can try increasing min_free_kbytes and see if it alleviates the situation? |
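A hedged example of the tuning being suggested; the value shown is purely illustrative and not a recommendation from this ticket:

sysctl vm.min_free_kbytes                  # check the current value
sysctl -w vm.min_free_kbytes=262144        # trial value; persist in /etc/sysctl.conf only if it helps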
| Comment by Rick Mohr [ 30/Oct/14 ] |
|
We have vm.min_free_kbytes set to 131072. This is about 0.2% of the total memory. Is there a recommended value for this? |
| Comment by Niu Yawei (Inactive) [ 03/Nov/14 ] |
|
I don't have experience tuning these parameters, so I'm afraid I can't give any suggestions on this. Probably you can increase it little by little and see how it works. |
| Comment by Robin Humble [ 03/Nov/14 ] |
|
a couple of thoughts... maybe not relevant, but hey...

modern lustre sets memory affinity for a lot of its threads (which IMHO is an over-optimisation). forced affinity means that if allocations are imbalanced then it is possible for one zone (eg. normal1) on a server to run out of ram, whilst there is still heaps of memory free in another zone (eg. normal0). we have seen OOMs on our OSS's due to this (although only when aggressive inode caching is turned on with vfs_cache_pressure=0, which probably isn't a common setting) and now set "options libcfs cpu_npartitions=1" on our OSS's for this reason. my testing hasn't shown such a severe problem with affinity on MDS's, but still, you could check "egrep -i 'zone|slab|present|free' /proc/zoneinfo" and make sure the big zones have approximately equal free pages and equal used slab.

also, have you tried dropping ldlm locks on the client before drop_caches? eg. "lctl set_param ldlm.namespaces.fsName-'*'.lru_size=clear". without ldlm locks being dropped on the client, I suspect drop_caches on the client won't drop many inodes at all, and in turn that may mean that drop_caches on the MDS can't free much either? |
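Robin's two checks, collected into one snippet for convenience; the commands are taken from the comment above, and the modprobe path is only an example location:

# On each server: confirm the big NUMA zones have roughly equal free pages and slab usage
egrep -i 'zone|slab|present|free' /proc/zoneinfo

# Workaround used on their OSS nodes (e.g. in /etc/modprobe.d/lustre.conf, requires a module reload):
# options libcfs cpu_npartitions=1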
| Comment by Andreas Dilger [ 05/Nov/14 ] |
|
I would agree with Robin that I think the majority of the memory pressure is coming from MDT/MDD/LOD/OSD objects->DLM locks->ldiskfs inodes being cached/pinned in memory on the MDT. I wonder if we are caching the locks too aggressively? In theory, the LDLM pools should be pushing memory pressure from the lock server back to the clients, so the clients cancel DLM locks to relieve memory pressure on the server, but it is entirely possible that this isn't working as well as it should. |
| Comment by Minh Diep [ 05/Nov/14 ] |
|
Do we know if 2.5.3 servers have this issue? |
| Comment by Rick Mohr [ 05/Nov/14 ] |
|
Niu: I can try increasing vm.min_free_kbytes a little, but since the documentation says that there can be problems if this value is too high, I am reluctant to play around with it too much.

Robin: We ran into some issues with our MDS server slowly grinding to a halt earlier this year and resolved those by setting vm.zone_reclaim_mode=0. I have not looked at /proc/zoneinfo, but I will check that out to see if I notice anything unusual. We do not set vfs_cache_pressure=0, but I may look into "options libcfs cpu_npartitions" since I am not familiar with that option.

As far as ldlm locks go, that is one of the things I suspected too. However, we ran a test where we dropped all client locks on the system that was deleting files. Unfortunately, even after dropping caches on the server, it did not seem to have any effect at all on the MDS memory usage growth. We did of course see a decrease in memory used by the locks, but in the course of our testing, the total amount of memory used by locks seemed to pretty consistently hover around 2-3 GB. So while the MDS memory usage steadily grew, the ldlm lock memory usage did not. I took this to mean that locks were not really the cause of the memory pressure. Nonetheless, we have applied some limits to lru_size and lru_max_age on our Lustre clients on the off chance that it might help. Unfortunately, this attempt is somewhat complicated by a bug affecting the lru_size limit. |
| Comment by Rick Mohr [ 05/Nov/14 ] |
|
Minh: In our testing, we have seen this on 2.5.3 servers. |
| Comment by Andreas Dilger [ 05/Nov/14 ] |
|
If you are hitting this problem of the available memory running out on the MDS, it would be useful to run lctl set_param ldlm.namespaces.*MDT*.lru_size=clear on all (or some subset) of the clients, and then run sysctl -w vm.drop_caches=3 on the MDS to drop the page and inode caches, to see if this reduces the MDS memory usage. Without dropping the locks on the clients, the MDS inodes cannot be freed from cache. This includes one of each of { mdt_obj, mdd_obj, lod_obj, osp_obj, ldlm_lock * N, ldlm_resource, ldiskfs_inode_cache } per inode, so it definitely adds up (about 3KB per inode by my quick calculations). If this helps reduce MDS memory usage, the next question is why the LDLM pool is not shrinking the client DLM lock cache under memory pressure. |
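A sketch of the procedure Andreas describes, run from an admin node; pdsh, the client host list, and the mds1 hostname are assumptions, and the namespace glob may need adjusting for your filesystem name:

# 1. On all (or a subset of) clients: cancel cached MDT locks
pdsh -w 'client[001-100]' "lctl set_param ldlm.namespaces.*MDT*.lru_size=clear"

# 2. On the MDS: drop page and inode caches, then re-check memory
ssh mds1 "sysctl -w vm.drop_caches=3; grep -E '^(Buffers|Inactive)' /proc/meminfo"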
| Comment by Niu Yawei (Inactive) [ 06/Nov/14 ] |
|
Hi, Andreas

I don't think the LDLM pool is able to shrink client dlm locks when the MDS is under memory pressure: the server ldlm shrinker is called when the MDS is under memory pressure, but it can't reclaim memory directly; it can only adjust the SLV and wait for the clients to notice the new SLV and start cancelling locks. I think there are several problems in the above scheme.

Probably it's time to reconsider the whole ldlm pool mechanism?

Hi, Rick

Could you try applying the fix from the other ticket and see if it makes any difference? |
| Comment by Andreas Dilger [ 08/Nov/14 ] |
|
Niu, I think we need to make the ldlm pool kick in much earlier to avoid memory pressure on the MDS. It doesn't make sense to have so many locks on the MDS that it is running out of memory. Since each lock is also keeping lots of other memory pinned (inode, buffer, MDT, MDD, LOD, OSD objects) we need to start shrinking the ldlm pool sooner. I haven't looked at this code in a long time, but is there an upper limit that can be imposed on the number of locks on the server? What is used to calculate this limit, and is it reasonable? |
| Comment by Niu Yawei (Inactive) [ 10/Nov/14 ] |
Indeed, the kernel always assumes a shrinker can reclaim memory immediately, so the approach of not reclaiming memory but only decreasing the SLV in the shrinker (then hoping clients start to cancel locks once they receive the decreased SLV) looks inappropriate to me. The SLV recalculation thread is supposed to kick off lock cancellation at an early stage, so the server shrinker looks unnecessary to me.

The upper limit is 50 locks per 1MB of MDS memory:

/*
 * 50 ldlm locks for 1MB of RAM.
 */
#define LDLM_POOL_HOST_L ((NUM_CACHEPAGES >> (20 - PAGE_CACHE_SHIFT)) * 50)

But this isn't a hard limit, it's just a factor for the SLV calculation.

I ran some local unlink testing; it showed that "Buffers/Inactive(file)" grows after unlink (same as Rick's test result), and that didn't happen on create or on unlink from ldiskfs directly. I haven't figured out the reason yet, and I'm not sure if it's really related to the OOM problem. While investigating this problem further, I'm also trying to find machines with large memory which can reproduce the OOM problem.

Hi, Rick
Do you know roughly how many locks were cached on the clients when the OOM happened? |
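To put that factor in perspective, a rough back-of-the-envelope calculation for a 64 GB MDS with 4 KB pages (so PAGE_CACHE_SHIFT = 12); NUM_CACHEPAGES is roughly total RAM in pages, so the shift converts pages to MB:

# (pages >> (20 - 12)) ~= RAM in MB; times 50 locks per MB
echo $(( ((64 * 1024 * 1024 / 4) >> (20 - 12)) * 50 ))    # = 3276800 locks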
| Comment by Rick Mohr [ 11/Nov/14 ] |
|
(Sorry for taking a while to respond. I have been busy putting out some other fires.)

In response to some of the comments:

1) I have not tried increasing vm.min_free_kbytes yet, but will do so this week. (Although I noticed that it has come up in another ticket as well.)

2) We have already tried dropping client locks before clearing the MDS server cache. This did not help free up any memory.

3) A coworker and I did some digging into how the MDS server determines how many locks it can support (and how many locks a client can cache). We came across the SLV, but struggled to understand how it was supposed to work. We were also unable to find any way to cap the total number of locks an MDS server would grant. (We tried setting ldlm.namespaces.<mdt>.pool.limit but that didn't work.) If such a way does not exist, it might be a very good thing to include. It would be handy to have a simple way to ensure that the server's memory usage for locks can be controlled.

4) I can try to apply the fix from the other ticket.

5) I did not capture the numbers of locks on our Lustre clients when we had the MDS crashes. We did look at the number of locks on one client that was purging large numbers of files while we observed the MDS memory usage increasing. That client had over a million locks. (We had placed a limit of 2000 locks on the client, and this behavior is what brought our attention to the lru_size bug mentioned above.) |
| Comment by Niu Yawei (Inactive) [ 12/Nov/14 ] |
I was wondering if the OOM isn't actually caused by the growing "Buffers" (which won't be decreased by cancelling locks, as you mentioned)? That's why I want to know how many locks were cached on the clients when the OOM happened. |
| Comment by Rick Mohr [ 14/Nov/14 ] |
|
Niu, Just wanted to let you know that I increased vm.min_free_kbytes by a factor of 10 to 1310720 (~1.2 GB) which is about 2% of total memory. I will let you know what happens. |
| Comment by Rick Mohr [ 14/Nov/14 ] |
|
Niu, If I want to know how many client locks are cached on the MDS server when it OOMs, is that info in the /proc/fs/lustre/ldlm/namespaces/mdt-<fsname>-MDT0000_UUID/lock_count file? Since it isn't really feasible to query the lock counts on each client individually, I wanted to verify that I could get the same info from the server-side. (Although even that might not work if the MDS server becomes unresponsive when it OOMs.) I also wanted to let you know that we have started to run some purges from Lustre 2.4.3 clients, and it looks like maybe the server memory usage doesn't grow as fast compared to using a Lustre 2.5 client. We don't have quantitative info yet, but if we are able to run some tests and gather numbers, I will pass them along to you. |
| Comment by Niu Yawei (Inactive) [ 17/Nov/14 ] |
Probably you can write a script to read the proc file from all clients remotely? As you mentioned, the MDS will be unresponsive when it OOMs, so reading the proc file on the MDS isn't practical.
It could be because of the change tracked in another ticket. |
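A minimal sketch of such a script, assuming passwordless ssh and a hypothetical clients.txt host list; reading lock_count through lctl get_param is the standard way to query these counters on the clients:

#!/bin/sh
# Sum the MDT locks cached by each client listed in ./clients.txt
total=0
while read host; do
    n=$(ssh -n "$host" "lctl get_param -n ldlm.namespaces.*MDT*.lock_count" | paste -sd+ - | bc)
    echo "$host: ${n:-0}"
    total=$(( total + ${n:-0} ))
done < clients.txt
echo "total client-side MDT locks: $total"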
| Comment by Rick Mohr [ 25/Nov/14 ] |
|
Niu,

We had another MDS crash last Friday while running purges. The vm.min_free_kbytes value had been increased to 1.2 GB from 128 MB, but this did not prevent the crash (and most likely caused it to crash slightly sooner because it was trying to keep more memory in reserve). As a result, we have reverted to our previous value for that parameter.

Prior to the crash, I had run some tests to see if there was a difference in the rate of memory usage between clients running Lustre 2.5.3 and 2.4.3. Based on my tests of deleting ~1M files, it looked like the 2.5.3 client caused the buffer usage to grow by about 2.7 GB and the 2.4.3 client caused usage to grow by about 2 GB. (It should be noted that these tests were run on a file system that was in production. This might skew the absolute value of the numbers, but I think the relative difference is still reasonably accurate.) Based on this info, we started using Lustre 2.4.3 clients to run our file system purges, since these clients also do not suffer from the lru_size bug mentioned above.

The MDS was rebooted after the crash 4 days ago. Since then we have not run any more purges. When I checked the MDS this morning, the memory usage looked like this:

MemTotal: 66053640 kB

After running "echo 1 > /proc/sys/vm/drop_caches", the memory stats were like this:

MemTotal: 66053640 kB

So during normal usage, the buffer usage grew to about 14.5 GB (as would be expected), but dropping caches easily reclaimed the memory and buffer usage dropped to 0.25 GB. As far as I can tell, it is only when we delete large numbers of files that we get into situations where the buffer usage will not drop. This pretty much guarantees that the MDS will crash whenever we run purges.

For the moment, we are closely monitoring memory usage during purges, and if it grows too much, we will preemptively unmount/remount the MDT to free up memory. This makes it a little easier for clients to recover, but we always run the risk that some client(s) will not handle the MDT disappearance well and end up getting evicted (which can cause IO errors for end users). This is definitely a regression from Lustre 1.8.9, where we could go months without any sort of Lustre failure. |
| Comment by Rick Mohr [ 25/Nov/14 ] |
|
Is this issue related to another existing ticket? |
| Comment by Niu Yawei (Inactive) [ 26/Nov/14 ] |
|
Thank you for the information, Rick. I was investigating the problem of the "grown Buffers" but have had no luck so far. I'll keep investigating further. As for the other ticket you mentioned, I haven't determined yet whether it is related. |
| Comment by Rick Mohr [ 17/Dec/14 ] |
|
Is there any update on this issue? |
| Comment by Haisong Cai (Inactive) [ 17/Dec/14 ] |
|
SDSC, too, is desperately waiting on a fix for this problem. We have had several filesystem downtimes caused by this problem recently. Thank you, |
| Comment by Niu Yawei (Inactive) [ 18/Dec/14 ] |
Rick, I'm still working on this but haven't made a dent so far. As a temporary workaround, could you modify your purge program to delete files step by step instead of deleting all the files in one shot? You mentioned earlier in this ticket that deleting a small number of files won't increase the buffer footprint, so I'm wondering if that would help in your situation. |
| Comment by Haisong Cai (Inactive) [ 18/Dec/14 ] |
|
Would like to add 2 comments: 1) Filesystem deletion often happens in user land. If a user chooses to delete 1 million files with one "rm" command, there is little a sysadmin can do to control it. This is true at least in our environment. Thanks, |
| Comment by Rick Mohr [ 19/Dec/14 ] |
|
I think Haisong's comments are spot on. This problem was originally brought to our attention because a user triggered the OOM by deleting files. Slowing down the deletions only postpones the inevitable, and we still are forced to carefully monitor the system and eventually do a controlled MDS reboot. Plus, when we have a user that pushes the file system's inode usage up to 80%, we need to delete files faster, not slower. --Rick |
| Comment by Haisong Cai (Inactive) [ 07/Jan/15 ] |
|
Over the last 2 weeks, we have had 3 MDS crashes because of this bug - the pattern is easily distinguishable and the problem is reproducible. We request that Intel engineers please provide a fix for the bug. Thanks, |
| Comment by Minh Diep [ 07/Jan/15 ] |
|
Niu, have you been able to reproduce this problem? According to SDSC, this can be very easy to reproduce: all you do is create and remove over 1 million files. Please let me know if you need any help. |
| Comment by Niu Yawei (Inactive) [ 08/Jan/15 ] |
I haven't reproduced the OOM problem. I reproduced the problem of "Buffers" not shrinking after "rm" with a smaller number of files, but unfortunately I haven't found the reason yet. I'm now wondering if that is the real cause of the OOM. If you could help me book some real machines to verify this (rm 1 million files to trigger the OOM), that would be helpful (all my tests were done in my local VM). Thank you. |
| Comment by Minh Diep [ 08/Jan/15 ] |
|
I could be wrong, but if the buffer is not freed after 'rm', eventually it will cause the system to OOM. So I think the OOM is the result of the 'buffer not freed' problem. |
| Comment by Niu Yawei (Inactive) [ 08/Jan/15 ] |
Well, I'm not 100 percent sure about this. As far as I can see (in my local VM testing), those "Buffers" should be the buffer_heads against the MDT block device, and I'm not sure whether they could be reclaimed for other purposes once the buffer usage reaches some point. |
| Comment by Haisong Cai (Inactive) [ 08/Jan/15 ] |
|
Hi Niu, I also think that the OOM is a byproduct of buffer memory not being freed up. On our MDS servers, we have reserved 10% of total physical RAM and set an aggressive policy for flushing dirty cache (i.e. vm.dirty_background_ratio=5 & vm.dirty_ratio=5), but none of these measures helped. We watch memory buffer usage increase to the point where MDS threads start to hang, kernel traces dump, clients disconnect, and eventually the MDS itself becomes unresponsive to its own shell commands. Sometimes it ends with a kernel panic and sometimes it ends with a complete lockup of the machine. In most of the cases we don't see an OOM. Haisong |
| Comment by Rick Mohr [ 08/Jan/15 ] |
|
Niu,

A couple of things I wanted to clarify:

1) Removing 1M files is not necessarily enough to cause an OOM. I just used that many files in my test because it was large enough that there was a noticeable change in the MDS memory usage, and I could verify that the memory never seemed to be reclaimed. In production, the MDS memory usage seems to grow continuously over time until it eventually reaches a point where it causes problems.

2) When I initially created this ticket, we had observed an OOM event on the MDS. However, in subsequent events, this has not always been the case. Several times, the situation was more like Haisong described: the MDS server "loses" memory until it reaches a point where Lustre threads slow to a crawl and eventually the system becomes completely unresponsive. So I don't think reproducing the OOM is necessarily needed to investigate the issue.

--Rick |
| Comment by Niu Yawei (Inactive) [ 09/Jan/15 ] |
|
Thank you for the information, Haisong & Rick. |
| Comment by Minh Diep [ 14/Jan/15 ] |
|
Niu, do you have any update on this? |
| Comment by Tommy Minyard [ 15/Jan/15 ] |
|
We've been watching this ticket at TACC as we've noticed similar behavior with the Lustre 2.5.2 version MDS for our /scratch filesystem where we have to perform occasional purges. We also have had it crash with what looks like an OOM condition, especially after we've run a purge removing millions of files. I mentioned it to Peter Jones during a call yesterday and he may have relayed some additional details.

We took the opportunity during our maintenance on Tuesday to try a few things and have some additional information that might be helpful to track down this issue. From what we found, it appears that something in the kernel is not allowing the Inactive(file) portion of the memory to get released and used when needed, which is what the kernel should do. Before we did anything to the MDS during the maintenance, we looked at the memory and had 95GB in Buffers (according to /proc/meminfo, 128GB total memory in the MDS box) but also had 94GB of memory in the Inactive(file) portion of the memory. To see if it could release this buffer cache, we issued vm.drop_caches=3, and while it released some cached file memory, it did not release the buffer memory like it usually does. We then unmounted the MDT and removed the Lustre modules, and then the Buffers portion of the memory dropped to a very low value, but there was still 94GB of memory in Inactive(file). We then tried to run some programs that would use the memory; however, none of them could ever get back any of the 94GB used by the Inactive(file) portion of memory. The only way we found to recover this portion of memory was to reboot the server.

So even though the usage is shown in Buffers, it seems that the Inactive(file) memory is the portion that the kernel can't recover after many files have been removed. Not sure if you have noticed the same behavior, but we thought this might help in tracking down this issue. We're running some tests on another testbed filesystem, so if there is some additional information you would like to have, let us know. We definitely need to get this resolved as it is requiring us to reboot the MDS after every purge to prevent it from running out of memory. |
| Comment by Niu Yawei (Inactive) [ 19/Jan/15 ] |
|
After quite a lot of testing & debugging with Lai, we found that a brelse() is missing in the ldiskfs large EA patch. I'll post a patch soon. |
| Comment by Gerrit Updater [ 19/Jan/15 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/13452 |
| Comment by Niu Yawei (Inactive) [ 19/Jan/15 ] |
|
patch to master: http://review.whamcloud.com/13452 |
| Comment by Andreas Dilger [ 20/Jan/15 ] |
|
Niu, Lai, excellent work finding and fixing this bug. A question for the users hitting this problem - is the ea_inode (also named large_xattr) feature enabled on the MDT filesystem? Running dumpe2fs -h /dev/{mdtdev} | grep features on the MDT device would list ea_inode in the Filesystem features: output. This feature is needed if there are more than 160 OSTs in the filesystem, or if many and/or large xattrs are being stored (e.g. lots of ACLs, user xattrs, etc). While I hope that is the case and we can close this bug, if the ea_inode feature is not enabled on your MDT, then this patch is unlikely to solve your problem. |
| Comment by Haisong Cai (Inactive) [ 20/Jan/15 ] |
|
We are running 2.4.3 and 2.5.3 with default MDT settings, so ea_inode is not enabled (here is output from one of our MDTs):

[root@puma-mds-10-5 ~]# dumpe2fs -h /dev/md0 | grep features

In addition, all of our filesystems that hit this bug have fewer than 160 OSTs. Haisong |
| Comment by Niu Yawei (Inactive) [ 20/Jan/15 ] |
|
Andreas, ea_inode/large_xattr isn't enabled in my testing, but I also observed the "growing buffers" problem. I think this bug will be triggered as long as the inode has ea_in_inode.

int ldiskfs_xattr_delete_inode(handle_t *handle, struct inode *inode,
                               struct ldiskfs_xattr_ino_array **lea_ino_array)
{
        struct buffer_head *bh = NULL;
        struct ldiskfs_xattr_ibody_header *header;
        struct ldiskfs_inode *raw_inode;
        struct ldiskfs_iloc iloc;
        struct ldiskfs_xattr_entry *entry;
        int error = 0;

        if (!ldiskfs_test_inode_state(inode, LDISKFS_STATE_XATTR))
                goto delete_external_ea;

        error = ldiskfs_get_inode_loc(inode, &iloc);

As long as LDISKFS_STATE_XATTR is set on the inode, it'll get the bh. |
| Comment by Gerrit Updater [ 20/Jan/15 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/13464 |
| Comment by Niu Yawei (Inactive) [ 20/Jan/15 ] |
|
Port to b2_5: http://review.whamcloud.com/13464 |
| Comment by Gerrit Updater [ 22/Jan/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13452/ |
| Comment by Rick Mohr [ 22/Jan/15 ] |
|
In response to Andreas' question:

dumpe2fs 1.42.12.wc1 (15-Sep-2014)

Our file system has 90 OSTs. |
| Comment by Niu Yawei (Inactive) [ 23/Jan/15 ] |
|
Rick, could you verify whether the patch fixes your problem? It works for me; after applying the patch, I didn't see the "growing buffers" problem anymore. |
| Comment by Rick Mohr [ 23/Jan/15 ] |
|
Niu, Our testbed is currently down, but we are trying to get it back up and running again. Once that is done, we will work on applying your patch and testing it. (Although this might not happen for another week.) |
| Comment by Gerrit Updater [ 26/Jan/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13464/ |
| Comment by Rick Mohr [ 30/Jan/15 ] |
|
Niu, We were able to get a test file system setup and applied the patch. Initial testing seems to show that the problem is fixed. We have applied the patch to Lustre 2.4.3 and plan to roll it out to our production file system next week. After that, we will run some further tests and let you know if there are any problems. |
| Comment by Peter Jones [ 30/Jan/15 ] |
|
That's great news - thanks Rick! |
| Comment by Rick Mohr [ 04/Feb/15 ] |
|
Niu,

We just applied the patch today to our production file system (Lustre 2.4.3) and are running some heavy purges right now. I collected some info about the memory usage. Prior to the patch, it seemed like the memory growth was dominated by the "Inactive(file)" line in /proc/meminfo. I dropped the cache on the MDS server (echo 3 > /proc/sys/vm/drop_caches) and collected Inactive(file) usage every minute:

Inactive(file): 1146656 kB

The number initially ramped up fast, but then leveled off a bit. Just to double check, I dropped the cache again:

Inactive(file): 401152 kB

We got the same behavior, and more importantly, we seem to be reclaiming the memory from Inactive(file). I also checked MemFree and Buffers before/after dropping caches, and buffer usage dropped below 200 MB. Given the rate at which we are purging, that never would have happened prior to applying the patch. I feel 90% confident this patch solved the problem. If we can continue purging at this rate over the next couple of days without increased memory usage, then I think I will be 100% confident. |
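For completeness, a simple sketch of the per-minute sampling described above (hypothetical script; the log file path is arbitrary):

# Sample Buffers and Inactive(file) once a minute while a purge runs
while sleep 60; do
    date '+%F %T'
    grep -E '^Buffers|Inactive\(file\)' /proc/meminfo
done | tee -a /tmp/mds-meminfo.log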
| Comment by Gerrit Updater [ 04/Feb/15 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: http://review.whamcloud.com/13655 |
| Comment by Rick Mohr [ 05/Feb/15 ] |
|
Niu,

I checked the MDS memory usage again this morning:

MemTotal: 66053640 kB
MemFree: 5568288 kB
Buffers: 55504980 kB
Active: 22374284 kB
Inactive: 33260116 kB

After I dropped caches:

MemTotal: 66053640 kB
MemFree: 63146420 kB
Buffers: 59788 kB
Active: 57960 kB
Inactive: 93452 kB

Looks like the patch is successful. |
| Comment by Peter Jones [ 05/Feb/15 ] |
|
Great news. Landed for 2.5.4 and 2.7 |