[LU-5726] MDS buffer not freed when deleting files Created: 11/Oct/14  Updated: 03/Mar/15  Resolved: 05/Feb/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: Lustre 2.7.0, Lustre 2.5.4

Type: Bug Priority: Blocker
Reporter: Rick Mohr Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 6.5
Kernel 2.6.32-358.23.2


Attachments: Text File lustre-debug-malloc.gz     HTML File mds-crash-log-20140913     File meminfo.after     File meminfo.before     File slabinfo.after     File slabinfo.before    
Issue Links:
Related
is related to LU-5333 rm cause MDS to complain hung tasks a... Resolved
is related to LU-5841 Lustre 2.4.2 MDS, hitting OOM errors Resolved
Severity: 3
Rank (Obsolete): 16083

 Description   

When deleting large numbers of files, memory usage on the MDS server grows significantly. Attempts to reclaim memory by dropping caches free only some of the memory. The buffer usage continues to grow until eventually the MDS server starts OOMing.

The rate at which the buffer usage grows varies, but it appears to depend on the number of clients deleting files and the speed at which the files are deleted.



 Comments   
Comment by Rick Mohr [ 11/Oct/14 ]

Below are some test results that I originally sent to the hpdd-discuss mailing list:

In these tests, I created directory trees populated with empty files (stripe_count=1) and then used various methods to delete the files. Before and after each test, I ran "echo 1 > /proc/sys/vm/drop_caches" on the MDS and recorded the "base" buffer usage. Lustre servers are running 2.4.3.
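
As a rough sketch, the measurement procedure described above amounts to something like the following on the MDS (run as root; the meminfo fields are the standard Linux ones):

echo 1 > /proc/sys/vm/drop_caches
grep -E 'MemFree|Buffers' /proc/meminfo      # record the "base" buffer usage
# ... run one of the deletion tests from a client ...
echo 1 > /proc/sys/vm/drop_caches
grep -E 'MemFree|Buffers' /proc/meminfo      # compare against the baseline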

Test #1) On Lustre 2.5.0 client, used "rm -rf" to remove approx 700K files. Buffer usage: before = 14.3 GB, after = 15.4 GB

Test #2) On Lustre 1.8.9 client, used "rm -rf" to remove approx 730K files. Buffer usage: before = 15.4 GB, after = 15.8 GB

Test #3) On Lustre 1.8.9 client, used "clean_dirs.sh" script (thanks to Steve Ayers) to remove approx 650K files. Buffer usage: before = 15.8 GB, after = 16.0 GB

Test #4) On Lustre 2.5.0 client, used "clean_dirs.sh" script to remove approx 730K files. Buffer usage: before = 16.0 GB, after = 16.25 GB

Test #5) On one Lustre 2.5.0 client and two Lustre 1.8.9 clients, used "rm -rf" to delete approx 330K files on each host simultaneously. Buffer usage: before = 16.26 GB, after = 17.63 GB

Test #6) Similar to test #5, but used "rm -rf" to delete files in groups of approx 110K. The deletion of these groups was staggered in time across the three nodes. At some points two nodes were deleting simultaneously and at other times only one node was deleting files. Buffer usage: before = 17.63 GB, after = 17.8 GB.

Test #7) On Lustre 2.5.0 client, deleted 9 groups of 110K files each. The groups were deleted sequentially with some pauses between groups. Buffer usage: before = 17.8 GB, after = 17.9 GB

Test #8) On Lustre 1.8.9 client, used "find $DIRNAME -delete" to remove approx 1M files. Buffer usage: before = 17.9 GB, after = 19.4 GB

The tests showed quite a bit of variance between nodes and also the tools used to delete the files. The lowest increase in buffer usage seemed to occur when files were deleted sequentially in smaller batches. I don't see any consistent pattern except for the fact that the base buffer usage always seems to increase when files are deleted in large numbers. The setup was not completely ideal since other users were actively using the file system at the same time. However, there are a couple of things to note:

1) Over the course of the week, the MDS base buffer usage increased from 14.3 GB to 19.4 GB. These increases only occurred during my file removal tests, and there was never any decrease in base buffer usage at any point.

2) Other file system activity did not seem to contribute to the base buffer usage increase. During the nights/weekend when I did not do testing, the overall buffer usage did increase. However, when I would drop the caches to measure the base buffer usage, it always returned to the same (or at least very nearly the same) value as it was the day before. I also observed an application doing millions of file open/read/close operations, and none of this increased the base buffer usage.

Comment by Rick Mohr [ 11/Oct/14 ]

At the suggestion of Andreas, I ran another test with the lustre debug flag +malloc enabled and captured the debug log on the MDS server. The test involved running "rm -rf" to remove a directory tree containing approx 1 million files. I have attached this log file along with the contents of the slabinfo and meminfo files captured before and after the test. The slabinfo/meminfo was gathered after running "echo 1 > /proc/sys/vm/drop_caches" so I could see how much memory was unreclaimable.
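
For reference, a hedged sketch of how such a capture can be done with standard Lustre tooling (the dump file name is illustrative):

lctl set_param debug=+malloc            # enable the +malloc debug flag on the MDS
lctl clear                              # start from an empty debug buffer
cp /proc/slabinfo slabinfo.before; cp /proc/meminfo meminfo.before
# ... run "rm -rf" on the client ...
lctl dk > lustre-debug-malloc           # dump the kernel debug log
echo 1 > /proc/sys/vm/drop_caches
cp /proc/slabinfo slabinfo.after; cp /proc/meminfo meminfo.after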

Comment by Niu Yawei (Inactive) [ 15/Oct/14 ]

Did you sync the filesystem after the "rm -f" and before drop_caches? I think maybe the buffers are still being held by uncommitted requests.

Comment by Dmitry Eremin (Inactive) [ 17/Oct/14 ]

The command "echo 1 > /proc/sys/vm/drop_caches" is free page cache only. But most of allocated Lustre memory is associated with dentries and inodes. It can be freed by command "echo 3 > /proc/sys/vm/drop_caches".

Comment by Rick Mohr [ 20/Oct/14 ]

I tried using a sync before dropping caches, and it did not make a difference. I also tried echoing "3" into the drop_caches file, and that did not make a difference either.

Comment by Rick Mohr [ 21/Oct/14 ]

Tried some tests using Lustre 2.5.3, and the problem still exists there. Also, a coworker was deleting a larger number of files across 10 different Lustre 2.5.3 client nodes, and this caused the MDS (w/ 64 GB RAM) to OOM in less than an hour. So this bug is easy to trigger, and in principle, any user could crash the file system just by deleting enough files.

Comment by Rick Mohr [ 22/Oct/14 ]

Some more info: If we unmount the MDT, all of the unreclaimable buffer memory is freed. We don't need to unload any kernel modules in order to get the memory back.

Comment by Niu Yawei (Inactive) [ 22/Oct/14 ]

I'm wondering if it's related to LU-4053 (the client acquires a layout lock when unlinking a file, and that lock is cached on the client).

Hi, Rick
Is the meminfo.before & slabinfo.before captured before unlink files? I didn't see much difference between before & after.

Tried some tests using Lustre 2.5.3, and the problem still exists there. Also, a coworker was deleting a larger number of files across 10 different Lustre 2.5.3 client nodes, and this caused the MDS (w/ 64 GB RAM) to OOM in less than an hour. So this bug is easy to trigger, and in principle, any user could crash the file system just by deleting enough files.

How many files were deleted in total?

Comment by Rick Mohr [ 22/Oct/14 ]

I deleted about 1 million files. The *.before files were collected after I dropped caches but before the files were deleted. The *.after files were collected after the files were deleted and the cache had been dropped.

For this small test, the difference is not huge. From the meminfo files, the main difference is that Buffers increased by about 0.5 GB and that increase also roughly equals the increase in "Inactive(file)". Other memory numbers stayed about the same, and a few even decreased. But this trend continues as more and more files are deleted so that the buffer usage keeps growing. At first I thought that there might be a lot of locks consuming memory, but the slab usage doesn't seem to increase. In fact, if you look at the before/after slabinfo and sort by the number of active objects, you'll see that several of those categories (ldlm_locks, size-192, selinux_inode_security, ldiskfs_inode_cache, lod_obj, mdt_obj, mdd_obj, osp_obj, ldlm_resources, size-32) have roughly 1 million fewer active objects after the files were deleted. So the memory usage keeps increasing, but I can't seem to tie it to any particular slab structures.

I had come across LU-4053 before filing this ticket thinking that it might be related, but upon looking at the details, I am not sure they are the same. That ticket seemed to indicate that the memory increase was due to increased slab usage (which is not the case here). That ticket also mentioned that dropping caches released the memory, which does not work in my case.

I should note that in some other tests, we tried dropping locks and caches on the client side to see if that would free up memory on the MDS. We also tried unmounting the lustre file system. None of those approaches freed up any MDS memory.

Comment by Niu Yawei (Inactive) [ 23/Oct/14 ]

We also tried unmounting the lustre file system. None of those approaches freed up any MDS memory.

You mean umount MDT or client?

Comment by Rick Mohr [ 23/Oct/14 ]

Sorry for not being clear. Unmounting the file system on the client did not free up the MDS memory. However, unmounting the MDT did free up the memory.

Comment by Rick Mohr [ 24/Oct/14 ]

After doing some more testing yesterday and looking back over the slabinfo/meminfo data, a coworker and I found what appears to be a connection with the buffer_head objects. Prior to the test, the "Buffer(Inactive file)" usage was 867,196 kB and the number of active buffer_head objects was 216,313. After the test, "Buffer(Inactive file)" was 1,386,604 kB and the active buffer_head objects was 347,257. If the buffer_head objects accounted for the increased buffer usage, this would work out to about 4KB per buffer_head (basically, one page of memory).
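
For reference, the per-object figure quoted above is just arithmetic on the numbers in this comment:

# increase in "Buffer(Inactive file)" divided by increase in active buffer_head objects
awk 'BEGIN { printf "%.2f kB per buffer_head\n", (1386604 - 867196) / (347257 - 216313) }'   # ~3.97 kB, i.e. about one 4 kB page each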

I don't know a lot about the Linux buffer_head structure or how it is used in I/O, but based on some online reading I was wondering if the file deletions were possibly creating a bunch of small disk I/O requests which results in buffer_head structures that point to small 4KB I/O buffers. If those I/O requests get completed, but for some reason the buffer_head structures aren't released, then maybe there is a continuous increase in memory usage in 4KB chunks. This wouldn't matter so much for a small number of file deletions, but when several million files are deleted, it starts to consume significant amounts of memory.

Is something like that possible?

Comment by Niu Yawei (Inactive) [ 27/Oct/14 ]

Deleting a file will definitely use buffer_heads (for deleting the file itself and writing the unlink log record); however, I think those buffers should be reclaimed by "echo 3 > drop_caches" once the unlink operation is committed (synced) to disk, otherwise the MDT wouldn't unmount cleanly.

Comment by Niu Yawei (Inactive) [ 27/Oct/14 ]

BTW, I see lots of "mdt_mfd_new()" and "mdt_mfd_set_mode()" in the debug log. I think these belong to open operations; were there a lot of ongoing open operations while you were doing the "rm"?

Comment by Rick Mohr [ 27/Oct/14 ]

I was capturing the data from our production file system, so it was actively being used by our user community. It is entirely possible that some user was issuing a lot of open calls.

Comment by Niu Yawei (Inactive) [ 29/Oct/14 ]

I did some local testing against the master branch. It looks like the "buffer memory" did increase after an "rm 5000 files" operation, and it can't be reclaimed by "echo 3 > /proc/sys/vm/drop_caches". I haven't found where the "buffer memory" is being used so far, but it looks like "chown 5000 files" or "create 5000 files" doesn't have the problem.

Comment by Niu Yawei (Inactive) [ 29/Oct/14 ]

Hi, Rick
The memory shown as "Buffers"/"Inactive(file)" is available and can be reclaimed for other purposes, so I tend to think it's not the real cause of the OOM. Though I don't know how to force the kernel to immediately reclaim it as "MemFree", I believe it can be reclaimed when necessary.

I think it could be the same problem as LU-5333 & LU-5503. Did you ever observe stack traces similar to those reported in those two tickets?

Comment by Rick Mohr [ 29/Oct/14 ]

Niu,

There are some similarities with LU-5333 and LU-5503 but also some differences. Our stack traces do seem to point to the cause being memory related. Before one of the first crashes, we saw errors like this:

Sep 13 08:03:40 medusa-mds1 kernel: LNetError: 2702:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 172.16.30.1@o2ib (141): c: 62, oc: 0, rc: 63

This was followed immediately with a stack trace like this:

Sep 13 08:03:40 medusa-mds1 kernel: INFO: task kswapd0:178 blocked for more than 120 seconds.
Sep 13 08:03:40 medusa-mds1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 13 08:03:40 medusa-mds1 kernel: kswapd0 D 0000000000000006 0 178 2 0x00000000
Sep 13 08:03:40 medusa-mds1 kernel: ffff8808322a1a80 0000000000000046 ffffea0001cc0f98 ffff8808322a1a50
Sep 13 08:03:40 medusa-mds1 kernel: ffffea0000fe4ed8 ffff8808322a1b50 0000000000000020 000000000000001f
Sep 13 08:03:40 medusa-mds1 kernel: ffff88083226fab8 ffff8808322a1fd8 000000000000fb88 ffff88083226fab8
Sep 13 08:03:40 medusa-mds1 kernel: Call Trace:
Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffff8109708e>] ? prepare_to_wait+0x4e/0x80
Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffffa0b7518a>] start_this_handle+0x27a/0x4a0 [jbd2]
Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffffa0b755b0>] jbd2_journal_start+0xd0/0x110 [jbd2]
Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffffa0c36546>] ldiskfs_journal_start_sb+0x56/0xe0 [ldiskfs]
Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffffa0c368c4>] ldiskfs_dquot_drop+0x34/0x80 [ldiskfs]
Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffff811e04e2>] vfs_dq_drop+0x52/0x60
Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffff8119d363>] clear_inode+0x93/0x140
Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffff8119d450>] dispose_list+0x40/0x120
Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffff8119d7a4>] shrink_icache_memory+0x274/0x2e0
Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffff81131ffa>] shrink_slab+0x12a/0x1a0
Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffff811351ea>] balance_pgdat+0x59a/0x820
Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffff811355a4>] kswapd+0x134/0x3c0
Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffff81135470>] ? kswapd+0x0/0x3c0
Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffff81096a36>] kthread+0x96/0xa0
Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffff810969a0>] ? kthread+0x0/0xa0
Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

There were some more of these (I will attach a more complete log later), and while they weren't all identical, they did all seem to contain the same two lines:

Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffff8119d7a4>] shrink_icache_memory+0x274/0x2e0
Sep 13 08:03:40 medusa-mds1 kernel: [<ffffffff81131ffa>] shrink_slab+0x12a/0x1a0

So I would assume that all of those instances were caused by the system trying to free up memory but being unable to do so.

The main difference I see with the stack traces from LU-5333/LU-5503 is that those stack traces contain this line:

[<ffffffff81134b99>] ? zone_reclaim+0x349/0x400

Shortly after we upgraded from 1.8 to 2.4 in May, we saw cases where the MDS would seem to grind to a halt, and based on those stack traces, it looked like the system was having problems allocating memory. Our early stack traces in May contained that "zone_reclaim" line. However, we soon realized that we forgot to reapply the sysctl setting "vm.zone_reclaim_mode = 0" on the MDS server after the upgrade. Once we did that, we didn't have any further problems until just recently.

If the systems in LU-5333/LU-5503 have zone_reclaim_mode enabled, perhaps those problems are the same as mine. My "solution" of disabling zone_reclaim_mode may have just made it easier for the MDS server to find free memory, so it took longer for the same underlying problem to become evident again.

Comment by Rick Mohr [ 29/Oct/14 ]

Attached log messages around the time of first incident.

Comment by Rick Mohr [ 29/Oct/14 ]

Niu,

There are a couple of things about the log that I should point out. The first events were around 8AM on Sept 13 and you'll notice that the oom-killer fired to kill off the nslcd process. However, the server seems to have remained functional for at least some time after that. Our Nagios monitoring checks did not report any problems until almost 5PM that same day, and it wasn't until about 6 or 7PM that the system became completely unresponsive. So it seems like it was a slow decline.

Also, we ran into the same (or at least we assume the same) problem on Sep 15 around 1PM. We noticed the memory usage kept increasing until the oom-killer fired and killed several processes, after which point the system became unresponsive. However, in that incident, the oom-killer was not preceded by any lustre stack traces in the log file.

Comment by Niu Yawei (Inactive) [ 30/Oct/14 ]

There are a couple of things about the log that I should point out. The first events were around 8AM on Sept 13 and you'll notice that the oom-killer fired to kill off the nslcd process. However, the server seems to have remained functional for at least some time after that. Our Nagios monitoring checks did not report any problems until almost 5PM that same day, and it wasn't until about 6 or 7PM that the system became completely unresponsive. So it seems like it was a slow decline.

Is it possible that min_free_kbytes is too low for your MDS? The following is quoted from the RH manual:

Setting min_free_kbytes too low prevents the system from reclaiming memory. This can result in system hangs and OOM-killing multiple processes.
However, setting this parameter too high (5% - 10% of total system memory) will cause your system to run out of memory immediately.

I'm wondering if you can try to increase the min_free_kbytes and see if it can alleviate the situation?
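
A minimal sketch of checking and raising that tunable (the value below is only a placeholder, not a recommendation):

sysctl vm.min_free_kbytes                  # current reserve, in kB
sysctl -w vm.min_free_kbytes=262144        # example: reserve 256 MB; persist in /etc/sysctl.conf if it helps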

Comment by Rick Mohr [ 30/Oct/14 ]

We have vm.min_free_kbytes set to 131072. This is about 0.2% of the total memory. Is there a recommended value for this?

Comment by Niu Yawei (Inactive) [ 03/Nov/14 ]

I don't have experience tuning these parameters, so I'm afraid I can't give any suggestions on this. Perhaps you can increase it little by little and see how it works.

Comment by Robin Humble [ 03/Nov/14 ]

a couple of thoughts... maybe not relevant, but hey...

Modern lustre sets memory affinity for a lot of its threads (which IMHO is an over-optimisation). Forced affinity means that if allocations are imbalanced, it is possible for one zone (eg. normal1) on a server to run out of ram whilst there is still heaps of memory free in another zone (eg. normal0). We have seen OOMs on our OSS's due to this (although only when aggressive inode caching is turned on with vfs_cache_pressure=0, which probably isn't a common setting) and now set "options libcfs cpu_npartitions=1" on our OSS's for this reason. My testing hasn't shown such a severe problem with affinity on MDS's, but still, you could check "egrep -i 'zone|slab|present|free' /proc/zoneinfo" and make sure the big zones have approximately equal free pages and equal used slab.

Also, have you tried dropping ldlm locks on the client before drop_caches? e.g. "lctl set_param ldlm.namespaces.fsName-'*'.lru_size=clear". Without ldlm locks being dropped on the client, I suspect drop_caches on the client won't drop many inodes at all, and in turn that may mean that drop_caches on the MDS can't free much either?

Comment by Andreas Dilger [ 05/Nov/14 ]

I would agree with Robin that I think the majority of the memory pressure is coming from MDT/MDD/LOD/OSD objects->DLM locks->ldiskfs inodes being cached/pinned in memory on the MDT. I wonder if we are caching the locks too aggressively? In theory, the LDLM pools should be pushing memory pressure from the lock server back to the clients, so the clients cancel DLM locks to relieve memory pressure on the server, but it is entirely possible that this isn't working as well as it should.

Comment by Minh Diep [ 05/Nov/14 ]

Do we know if 2.5.3 server have this issue?

Comment by Rick Mohr [ 05/Nov/14 ]

Niu:

I can try increasing vm.min_free_kbytes a little, but since the documentation says that there can be problems if this value is too high, I am reluctant to play around with it too much.

Robin:

We ran into some issues with our MDS server slowly grinding to a halt earlier this year and resolved those by setting vm.zone_reclaim_mode=0. I have not looked at /proc/zoneinfo, but I will check that out to see if I notice anything unusual. We do not set vfs_cache_pressure=0, but I may look into "options libcfs cpu_npartitions" since I am not familiar with that option. As far as ldlm locks go, that is one of the things I suspected too. However, we ran a test where we dropped all client locks on the system that was deleting files. Unfortunately, even after dropping caches on the server, it did not seem to have any effect at all on the MDS memory usage growth. We did of course see a decrease in memory used by the locks, but in the course of our testing, the total amount of memory used by locks seemed to pretty consistently hover around 2-3 GB. So while the MDS memory usage steadily grew, the ldlm lock memory usage did not. I took this to mean that locks were not really the cause of the memory pressure. Nonetheless, we have applied some limits to lru_size and lru_max_age on our Lustre clients on the off chance that it might help. Unfortunately, this attempt is somewhat complicated by bug LU-5727, so the limits we apply on the client are not actually being honored.

Comment by Rick Mohr [ 05/Nov/14 ]

Minh: In our testing, we have seen this on 2.5.3 servers.

Comment by Andreas Dilger [ 05/Nov/14 ]

If you are hitting this problem of the available memory running out on the MDS, it would be useful to run lctl set_param ldlm.namespaces.MDT.lru_size=clear on all (or some subset) of clients, and then run sysctl -w vm.drop_caches=3 on the MDS to drop the page and inode caches, to see if this reduces the MDS memory usage. Without dropping the locks on the clients, the MDS inodes cannot be freed from cache (this includes one of each of { mdt_obj, mdd_obj, lod_obj, osp_obj, ldlm_lock * N, ldlm_resource, ldiskfs_inode_cache } per inode), so it definitely adds up (about 3KB per inode by my quick calculations).
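
A hedged sketch of that sequence (the namespace wildcard is a placeholder; the exact name depends on the filesystem name):

# on each client (or a subset of them), drop cached MDT DLM locks
lctl set_param 'ldlm.namespaces.*MDT*.lru_size=clear'

# then on the MDS, drop page/dentry/inode caches and re-check memory
sysctl -w vm.drop_caches=3
grep -E 'MemFree|Buffers|Inactive' /proc/meminfo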

If this helps reduce MDS memory usage, the next question is why the LDLM pool is not shrinking the client DLM lock cache under memory pressure.

Comment by Niu Yawei (Inactive) [ 06/Nov/14 ]

Hi, Andreas

I don't think the LDLM pool is able to shrink client dlm locks when the MDS is under memory pressure: the server ldlm shrinker is called when the MDS is under memory pressure, but it can't reclaim memory directly; it can only bump the SLV and wait for clients to become aware of the new SLV and then start cancelling locks. I think there are several problems with the above scheme:

  • The time span from [server under memory pressure] to [client starts cancelling locks] could be very long (an idle client only sees the bumped SLV on its next ping).
  • The SLV isn't changed much in this situation (see ldlm_srv_pool_shrink()), so I'm afraid the client probably won't cancel locks as expected even if it receives the new SLV in time.
  • The SLV could be overwritten by the SLV recalculation thread immediately after it's bumped by the shrinker.

Perhaps it's time to reconsider the whole ldlm pool mechanism?

Hi, Rick

Could you try applying the fix from LU-5727 to your 2.5 clients, set lru_size on the clients, and then retry your test to see if there is any difference? Thanks.

Comment by Andreas Dilger [ 08/Nov/14 ]

Niu, I think we need to make the ldlm pool kick in much earlier to avoid memory pressure on the MDS. It doesn't make sense to have so many locks on the MDS that it is running out of memory. Since each lock is also keeping lots of other memory pinned (inode, buffer, MDT, MDD, LOD, OSD objects) we need to start shrinking the ldlm pool sooner.

I haven't looked at this code in a long time, but is there an upper limit that can be imposed on the number of locks on the server? What is used to calculate this limit, and is it reasonable?

Comment by Niu Yawei (Inactive) [ 10/Nov/14 ]

Niu, I think we need to make the ldlm pool kick in much earlier to avoid memory pressure on the MDS. It doesn't make sense to have so many locks on the MDS that it is running out of memory. Since each lock is also keeping lots of other memory pinned (inode, buffer, MDT, MDD, LOD, OSD objects) we need to start shrinking the ldlm pool sooner.

Indeed, the kernel always assumes that a shrinker can reclaim memory immediately, so not reclaiming memory in the shrinker but only decreasing the SLV (and then hoping clients start to cancel locks once they receive the decreased SLV) looks inappropriate to me. The SLV recalculation thread is supposed to kick off lock cancellation at an early stage, so the server shrinker doesn't look necessary to me.

I haven't looked at this code in a long time, but is there an upper limit that can be imposed on the number of locks on the server? What is used to calculate this limit, and is it reasonable?

The upper limit is 50 locks per 1 MB of MDS memory.

/*
 * 50 ldlm locks for 1MB of RAM.
 */
#define LDLM_POOL_HOST_L ((NUM_CACHEPAGES >> (20 - PAGE_CACHE_SHIFT)) * 50)

But this isn't a hard limit; it's just a factor used in SLV calculation.
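
For a sense of scale, plugging the 64 GB MDS from this report into that formula (assuming 4 kB pages and NUM_CACHEPAGES roughly equal to total RAM in pages) gives on the order of:

# 64 GB / 4 kB = 16,777,216 pages; ">> (20 - PAGE_CACHE_SHIFT)" divides by 256 (pages per MB)
awk 'BEGIN { printf "%.0f locks\n", 64 * 1024 * 1024 * 1024 / 4096 / 256 * 50 }'   # ~3.3 million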

I ran some local unlink testing; it showed that "Buffers"/"Inactive(file)" grows after unlink (same as Rick's test result), and that this doesn't happen on create or on unlink from ldiskfs directly. I haven't figured out the reason yet, and I'm not sure whether it's really related to the OOM problem. While investigating this further, I'm also trying to find machines with a large amount of memory that can reproduce the OOM problem.

Hi, Rick
I forgot to ask: did you observe how many locks were cached on the clients when the MDS was running out of memory?

Comment by Rick Mohr [ 11/Nov/14 ]

(Sorry for taking a while to respond. I have been busy putting out some other fires.)

In response to some of the comments:

1) I have not tried increasing vm.min_free_kbytes yet, but will do so this week. (Although I noticed that in LU-5841 a similar thing was tried and didn't seem to have any effect.)

2) We have already tried dropping client locks before clearing the MDS server cache. This did not help free up any memory.

3) A coworker and I did some digging into how the MDS server determines how many locks it can support (and how many locks a client can cache). We came across the SLV, but struggled to understand how it was supposed to work. We were also unable to find any way to cap the total number of locks an MDS server would grant. (We tried setting ldlm.namespaces.<mdt>.pool.limit but that didn't work.) If such a way does not exist, it might be a very good thing to include. It would be handy to have a simple way to ensure that the server's memory usage for locks can be controlled.

4) I can try to apply the LU-5727 patch to a client, but is there a reason to expect that this should make a difference? Since we have manually dropped client locks and seen no effect, I am not sure how this patch would change anything.

5) I did not capture the numbers of locks on our Lustre clients when we had the MDS crashes. We did look at the number of locks on one client that was purging large numbers of files while we observed the MDS mem usage increasing. That client had over a million locks. (We had placed a limit of 2000 locks on the client, and this behavior is what brought our attention to LU-5727.)

Comment by Niu Yawei (Inactive) [ 12/Nov/14 ]

4) I can try to apply the LU-5727 patch to a client, but is there a reason to expect that this should make a difference? Since we have manually dropped client locks and seen no effect, I am not sure how this patch would change anything.

I was wondering whether the OOM is actually caused by the growing "Buffers" (which, as you mentioned, won't be decreased by cancelling locks). That's why I want to know how many locks were cached on the clients when the OOM happened.

Comment by Rick Mohr [ 14/Nov/14 ]

Niu,

Just wanted to let you know that I increased vm.min_free_kbytes by a factor of 10 to 1310720 (~1.2 GB) which is about 2% of total memory. I will let you know what happens.

Comment by Rick Mohr [ 14/Nov/14 ]

Niu,

If I want to know how many client locks are cached on the MDS server when it OOMs, is that info in the /proc/fs/lustre/ldlm/namespaces/mdt-<fsname>-MDT0000_UUID/lock_count file? Since it isn't really feasible to query the lock counts on each client individually, I wanted to verify that I could get the same info from the server-side. (Although even that might not work if the MDS server becomes unresponsive when it OOMs.)

I also wanted to let you know that we have started to run some purges from Lustre 2.4.3 clients, and it looks like maybe the server memory usage doesn't grow as fast compared to using a Lustre 2.5 client. We don't have quantitative info yet, but if we are able to run some tests and gather numbers, I will pass them along to you.

Comment by Niu Yawei (Inactive) [ 17/Nov/14 ]

If I want to know how many client locks are cached on the MDS server when it OOMs, is that info in the /proc/fs/lustre/ldlm/namespaces/mdt-<fsname>-MDT0000_UUID/lock_count file? Since it isn't really feasible to query the lock counts on each client individually, I wanted to verify that I could get the same info from the server-side. (Although even that might not work if the MDS server becomes unresponsive when it OOMs.)

Perhaps you can write a script to read the proc file from all clients remotely? As you mentioned, the MDS will be unresponsive when it OOMs, so reading the proc file on the MDS isn't practical.
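
A minimal sketch of that idea, assuming passwordless ssh to the clients and a (hypothetical) clients.txt listing one hostname per line; the exact namespace name depends on the filesystem name:

while read host; do
    printf '%s: ' "$host"
    ssh "$host" "lctl get_param -n 'ldlm.namespaces.*mdc*.lock_count'" 2>/dev/null
done < clients.txt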

I also wanted to let you know that we have started to run some purges from Lustre 2.4.3 clients, and it looks like maybe the server memory usage doesn't grow as fast compared to using a Lustre 2.5 client. We don't have quantitative info yet, but if we are able to run some tests and gather numbers, I will pass them along to you.

Could that be because of LU-5727?

Comment by Rick Mohr [ 25/Nov/14 ]

Niu,

We had another MDS crash last Friday while running purges. The vm.min_free_kbytes value had been increased to 1.2 GB from 128 MB, but this did not prevent the crash (and most likely caused it to crash slightly sooner because it was trying to keep more memory in reserve). As a result, we have reverted to using our previous value for that parameter.

Prior to the crash, I had run some tests to see if there was a difference in the rate of mem usage between clients running Lustre 2.5.3 and 2.4.3. Based on my tests of deleting ~1M files, it looked like the 2.5.3 client caused the buffer usage to grow by about 2.7 GB and the 2.4.3 client caused usage to grow by about 2 GB. (It should be noted that these tests were run on a file system that was in production. This might skew the absolute value of the numbers, but I think the relative difference is still reasonably accurate.) Based on this info, we started using Lustre 2.4.3 clients to run our file system purges. Since these clients also do not suffer from LU-5727, we were able to apply lower limits on the number of cached locks (2000 per client) and verified that the clients were honoring the new limits. Nevertheless, buffer usage on the MDS continued to grow while the purges were happening, and this resulted in the latest MDS crash.

The MDS was rebooted after the crash 4 days ago. Since then we have not run any more purges. When I checked the MDS this morning, the memory usage looked like this:

MemTotal: 66053640 kB
MemFree: 41155360 kB
Buffers: 14507900 kB
Cached: 402160 kB
SwapCached: 0 kB
Active: 5331216 kB
Inactive: 9637736 kB
Active(anon): 149548 kB
Inactive(anon): 42128 kB
Active(file): 5181668 kB
Inactive(file): 9595608 kB

After running "echo 1 > /proc/sys/vm/drop_caches", the memory stats were like this:

MemTotal: 66053640 kB
MemFree: 56045260 kB
Buffers: 249556 kB
Cached: 144876 kB
SwapCached: 0 kB
Active: 66128 kB
Inactive: 387976 kB
Active(anon): 59276 kB
Inactive(anon): 132400 kB
Active(file): 6852 kB
Inactive(file): 255576 kB

So during normal usage, the buffer usage grew to about 14.5 GB (as would be expected) but dropping caches easily reclaimed the memory and buffer usage dropped to 0.25 GB. As far as I can tell, it is only when we delete large numbers of files that we get into situations where the buffer usage will not drop.

This pretty much guarantees that the MDS will crash whenever we run purges. For the moment, we are closely monitoring mem usage during purges, and if it grows too much, we will preemptively unmount/remount the MDT to free up memory. This makes it a little easier for clients to recover, but we always run the risk that some client(s) will not handle the MDT disappearance well and end up getting evicted (which can cause IO errors for end users). This is definitely a regression from Lustre 1.8.9 where we could go months without any sort of Lustre failure.

Comment by Rick Mohr [ 25/Nov/14 ]

Is this issue related to LU-4740? I came across that bug report when I was first researching my problem. The failure looked almost identical. It looked like someone came up with a potential patch at one point, but then it was suggested that the patch was not ready for production use. Has there been any progress made on a new version of that patch?

Comment by Niu Yawei (Inactive) [ 26/Nov/14 ]

Thank you for the information, Rick. I have been investigating the problem of the growing "Buffers", but have had no luck so far. I'll keep investigating.

As for LU-4740, it looks like the same problem, and the patch mentioned in LU-4740 is actually aimed at LU-4053 (which was introduced in 2.5 but is not present in 2.4).

Comment by Rick Mohr [ 17/Dec/14 ]

Is there any update on this issue?

Comment by Haisong Cai (Inactive) [ 17/Dec/14 ]

SDSC, too, is desperately waiting on a fix for this problem. We have had several filesystem downtimes caused by it recently.

thank you,
Haisong

Comment by Niu Yawei (Inactive) [ 18/Dec/14 ]

Is there any update on this issue?

Rick, I'm still working on this, but I haven't made a dent so far. As a temporary workaround, could you modify your purge program to delete files step by step instead of deleting them all in one shot? You mentioned earlier in this ticket that deleting a small number of files doesn't increase the buffers footprint, so I'm wondering if that would help your situation.

Comment by Haisong Cai (Inactive) [ 18/Dec/14 ]

Would like to add 2 comments:

1) Filesystem deletion often happens in user land. If users choose to delete 1 million files with one "rm" command, there is little a sysadmin can do to control it. This is true at least in our environment.
2) My experience tells me that slow deletion only slows down the dying process.

thanks,
Haisong

Comment by Rick Mohr [ 19/Dec/14 ]

I think Haisong's comments are spot on. This problem was originally brought to our attention because a user triggered the OOM by deleting files. Slowing down the deletions only postpones the inevitable, and we still are forced to carefully monitor the system and eventually do a controlled MDS reboot. Plus, when we have a user that pushes the file system's inode usage up to 80%, we need to delete files faster, not slower.

--Rick

Comment by Haisong Cai (Inactive) [ 07/Jan/15 ]

Over the last 2 weeks, we have had 3 MDS crashes because of this bug - the pattern is easily distinguishable and the problem is reproducible. We request that Intel engineers please provide a fix for the bug.

thanks,
Haisong

Comment by Minh Diep [ 07/Jan/15 ]

Niu, have you been able to reproduce this problem? According to SDSC, it can be very easy to reproduce: all you do is create and remove over 1 million files. Please let me know if you need any help.

Comment by Niu Yawei (Inactive) [ 08/Jan/15 ]

Niu, have you been able to reproduce this problem. according to SDSC, this can be very easy to reproduce. All you do is create and remove over 1 million files. Please let me know if you need any help.

I didn't reproduce the OOM problem itself; I reproduced the problem of "Buffers" not shrinking after "rm" with fewer files, but unfortunately I haven't found the reason yet. I'm now wondering whether that is the real cause of the OOM. If you could help me book some real machines to verify this (rm 1 million files to trigger the OOM), that would be helpful (all my tests were done in my local VM). Thank you.

Comment by Minh Diep [ 08/Jan/15 ]

I could be wrong, but if the buffer is not freed after 'rm', it will eventually cause the system to OOM. So I think the OOM is the result of the buffer not being freed.

Comment by Niu Yawei (Inactive) [ 08/Jan/15 ]

I could be wrong but if the buffer is not freed after 'rm', eventually, it will cause the system to OOM. so I think OOM is the result from 'buffer not freed'

Well, I'm not 100 percent sure about this. As far as I can see (in my local VM testing), those "Buffers" should be the buffer_heads against the MDT block device, and I'm not sure whether they can be reclaimed for other purposes once the buffer size reaches a certain point.

Comment by Haisong Cai (Inactive) [ 08/Jan/15 ]

Hi Niu,

I also think the OOM is a byproduct of buffer memory not being freed up. On our MDS servers, we have reserved 10% of total physical RAM and set an aggressive policy for flushing the dirty cache (i.e. vm.dirty_background_ratio=5 and vm.dirty_ratio=5), but none of these measures helped. We watch the buffer memory increase to the point where MDS threads start to hang, kernel traces are dumped, clients disconnect, and eventually the MDS itself becomes unresponsive to its own shell commands. Sometimes it ends with a kernel panic and sometimes with a complete lockup of the machine. In most cases we don't see an OOM.

Haisong

Comment by Rick Mohr [ 08/Jan/15 ]

Niu,

A couple of things I wanted to clarify:

1) Removing 1M files is not necessarily enough to cause an OOM. I just used that many files in my test because it was large enough that there was a noticeable change in the MDS memory usage, and I could verify that the memory never seemed to be reclaimed. In production, the MDS memory usage seems to grow continuously over time until it eventually reaches a point where it causes problems.

2) When I initially created this ticket, we had observed an OOM event on the MDS. However, in subsequent events, this has not always been the case. Several times, the situation was more like Haisong described: The MDS server "loses" memory until it reaches a point where Lustre threads slow to a crawl and eventually the system becomes completely unresponsive.

So I don't think reproducing the OOM is necessarily needed to investigate the issue.

--Rick

Comment by Niu Yawei (Inactive) [ 09/Jan/15 ]

Thank you for the information, Haisong & Rick.

Comment by Minh Diep [ 14/Jan/15 ]

Niu, do you have any update on this?

Comment by Tommy Minyard [ 15/Jan/15 ]

We've been watching this ticket at TACC as we've noticed similar behavior with the Lustre 2.5.2 version MDS for our /scratch filesystem, where we have to perform occasional purges. We also have had it crash with what looks like an OOM condition, especially after we've run a purge removing millions of files. I mentioned it to Peter Jones during a call yesterday and he may have relayed some additional details.

We took the opportunity during our maintenance on Tuesday to try a few things and have some additional information that might be helpful in tracking down this issue. From what we found, it appears that something in the kernel is not allowing the Inactive(file) portion of the memory to be released and used when needed, which is what the kernel should do. Before we did anything to the MDS during the maintenance, we looked at the memory and had 95GB in Buffers (according to /proc/meminfo, 128GB total memory in the MDS box) but also had 94GB of memory in the Inactive(file) portion of the memory. To see if it could release this buffer cache, we issued vm.drop_caches=3, and while it released some cached file memory, it did not release the buffer memory like it usually does. We then unmounted the MDT and removed the Lustre modules, and then the Buffers portion of the memory dropped to a very low value, but there was still 94GB of memory in Inactive(file). We then tried to run some programs that would use the memory; however, none of them could ever get back any of the 94GB used by the Inactive(file) portion of memory. The only way we found to recover this portion of memory was to reboot the server.

So even though the usage is shown in Buffers, it seems that the Inactive(file) memory is the portion that the kernel can't recover after many files have been removed. Not sure if you have noticed the same behavior, but we thought this might help in tracking down this issue.

We're running some tests on another testbed filesystem so if there is some additional information you would like to have, let us know. We definitely need to get this resolved as it is requiring us to reboot the MDS after every purge to prevent it from running out of memory.

Comment by Niu Yawei (Inactive) [ 19/Jan/15 ]

After quite a lot of testing & debugging with Lai, we found that a brelse() is missing in the ldiskfs large EA patch. I'll post a patch soon.

Comment by Gerrit Updater [ 19/Jan/15 ]

Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/13452
Subject: LU-5726 ldiskfs: missed brelse() in large EA patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1eb46ffbec85016db1054594094abde6d09a3616

Comment by Niu Yawei (Inactive) [ 19/Jan/15 ]

patch to master: http://review.whamcloud.com/13452

Comment by Andreas Dilger [ 20/Jan/15 ]

Niu, Lai, excellent work finding and fixing this bug.

A question for the users hitting this problem - is the ea_inode (also named large_xattr) feature enabled on the MDT filesystem? Running dumpe2fs -h /dev/{mdtdev} | grep features on the MDT device would list ea_inode in the Filesystem features: output. This feature is needed if there are more than 160 OSTs in the filesystem, or if many and/or large xattrs are being stored (e.g. lots of ACLs, user xattrs, etc).

While I hope that is the case and we can close this bug, if the ea_inode feature is not enabled on your MDT, then this patch is unlikely to solve your problem.

Comment by Haisong Cai (Inactive) [ 20/Jan/15 ]

We are running 2.4.3 and 2.5.3 with default MDT settings, so ea_inode is not enabled (here is the output from one of our MDTs):

[root@puma-mds-10-5 ~]# dumpe2fs -h /dev/md0 | grep features
dumpe2fs 1.42.7.wc1 (12-Apr-2013)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery mmp flex_bg sparse_super large_file huge_file uninit_bg dir_nlink quota
Journal features: journal_incompat_revoke

In addition, all our filesystems hit this bug have less than 160 OSTs.

Haisong

Comment by Niu Yawei (Inactive) [ 20/Jan/15 ]

Andreas, ea_inode/large_xattr isn't enabled in my testing, but I also observed the "growing buffers" problem. I think this bug will be triggered as long as the inode has an EA stored in the inode body.

int
ldiskfs_xattr_delete_inode(handle_t *handle, struct inode *inode,
                        struct ldiskfs_xattr_ino_array **lea_ino_array)
{
        struct buffer_head *bh = NULL;
        struct ldiskfs_xattr_ibody_header *header;
        struct ldiskfs_inode *raw_inode;
        struct ldiskfs_iloc iloc;
        struct ldiskfs_xattr_entry *entry;
        int error = 0;

        if (!ldiskfs_test_inode_state(inode, LDISKFS_STATE_XATTR))
                goto delete_external_ea;

        error = ldiskfs_get_inode_loc(inode, &iloc);

As long as LDISKFS_STATE_XATTR is set on the inode, this path will grab the bh via ldiskfs_get_inode_loc().

Comment by Gerrit Updater [ 20/Jan/15 ]

Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/13464
Subject: LU-5726 ldiskfs: missed brelse() in large EA patch
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 516a0cf6020fa169b0890ba6a51dc8295c1a44cd

Comment by Niu Yawei (Inactive) [ 20/Jan/15 ]

Port to b2_5: http://review.whamcloud.com/13464

Comment by Gerrit Updater [ 22/Jan/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13452/
Subject: LU-5726 ldiskfs: missed brelse() in large EA patch
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ffd42ff529f5823b5a04529e1db2ea3b32a9f59f

Comment by Rick Mohr [ 22/Jan/15 ]

In response to Andreas' question:

dumpe2fs 1.42.12.wc1 (15-Sep-2014)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent mmp flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
Journal features: journal_incompat_revoke

Our file system has 90 OSTs.

Comment by Niu Yawei (Inactive) [ 23/Jan/15 ]

Rick, could you verify whether the patch fixes your problem? It works for me; after applying the patch, I no longer saw the "growing buffers" problem.

Comment by Rick Mohr [ 23/Jan/15 ]

Niu,

Our testbed is currently down, but we are trying to get it back up and running again. Once that is done, we will work on applying your patch and testing it. (Although this might not happen for another week.)

Comment by Gerrit Updater [ 26/Jan/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13464/
Subject: LU-5726 ldiskfs: missed brelse() in large EA patch
Project: fs/lustre-release
Branch: b2_5
Current Patch Set:
Commit: 8f7181ee5553aa22ecfe51202f3db1a4162361e7

Comment by Rick Mohr [ 30/Jan/15 ]

Niu,

We were able to get a test file system setup and applied the patch. Initial testing seems to show that the problem is fixed. We have applied the patch to Lustre 2.4.3 and plan to roll it out to our production file system next week. After that, we will run some further tests and let you know if there are any problems.

Comment by Peter Jones [ 30/Jan/15 ]

That's great news - thanks Rick!

Comment by Rick Mohr [ 04/Feb/15 ]

Niu,

We just applied the patch today to our production file system (Lustre 2.4.3) and are running some heavy purges right now. I collected some info about the memory usage. Prior to the patch, it seemed like the memory growth was dominated by the "Inactive(file)" in /proc/meminfo. I dropped the cache in the MDS server (echo 3 > /proc/sys/vm/drop_caches) and collected Inactive(file) usage every minute:

Inactive(file): 1146656 kB
Inactive(file): 3426128 kB
Inactive(file): 5510484 kB
Inactive(file): 6634728 kB
Inactive(file): 7514500 kB
Inactive(file): 8075948 kB
Inactive(file): 8662528 kB
Inactive(file): 9210796 kB
Inactive(file): 9576412 kB
Inactive(file): 9974336 kB
Inactive(file): 10400772 kB
Inactive(file): 10710464 kB
Inactive(file): 10964180 kB
Inactive(file): 11280900 kB
Inactive(file): 11591336 kB
Inactive(file): 11731164 kB
Inactive(file): 11817340 kB
Inactive(file): 11920016 kB
Inactive(file): 12040800 kB
Inactive(file): 12196232 kB
Inactive(file): 12148272 kB
Inactive(file): 12269224 kB
Inactive(file): 12251768 kB
Inactive(file): 12263596 kB

The number initially ramped up fast, but then leveled off a bit. Just to double check, I dropped the cache again:

Inactive(file): 401152 kB
Inactive(file): 2724788 kB
Inactive(file): 4409916 kB
Inactive(file): 6003208 kB
Inactive(file): 6532220 kB
Inactive(file): 7319768 kB
Inactive(file): 8154560 kB
Inactive(file): 8769084 kB
Inactive(file): 9271760 kB
Inactive(file): 9650020 kB
Inactive(file): 9918932 kB
Inactive(file): 10170456 kB
Inactive(file): 10303404 kB
Inactive(file): 10602256 kB
Inactive(file): 10972760 kB
Inactive(file): 11509680 kB
Inactive(file): 11986980 kB
Inactive(file): 12436528 kB
Inactive(file): 12770672 kB
Inactive(file): 13195352 kB
Inactive(file): 13463276 kB
Inactive(file): 13807816 kB
Inactive(file): 14029160 kB
Inactive(file): 14749976 kB
Inactive(file): 14879704 kB
Inactive(file): 14908984 kB
Inactive(file): 14988196 kB
Inactive(file): 15123316 kB
Inactive(file): 15240824 kB
Inactive(file): 15341328 kB
Inactive(file): 15464332 kB

We got the same behavior, and more importantly, we seem to be reclaiming the memory from Inactive(file). I also checked MemFree and Buffers before/after dropping caches:

(Before)
MemTotal: 66053640 kB
MemFree: 51291028 kB
Buffers: 10685976 kB

(After)
MemTotal: 66053640 kB
MemFree: 63239432 kB
Buffers: 198148 kB

Buffer usage dropped below 200 MB. Given the rate at which we are purging, that never would have happened prior to applying the patch.

I feel 90% confident this patch solved the problem. If we can continue purging at this rate over the next couple of days without increased memory usage, then I think I will be 100% confident.

Comment by Gerrit Updater [ 04/Feb/15 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: http://review.whamcloud.com/13655
Subject: LU-5726 ldiskfs: missed brelse() in large EA patch
Project: fs/lustre-release
Branch: b2_4
Current Patch Set: 1
Commit: 8e29ef136f426fef66b6008d379afc5e5ddc4ab5

Comment by Rick Mohr [ 05/Feb/15 ]

Niu,

I checked the MDS mem usage again this morning:

MemTotal:       66053640 kB
MemFree:         5568288 kB
Buffers:        55504980 kB
Active:         22374284 kB
Inactive:       33260116 kB

After I dropped caches:

MemTotal:       66053640 kB
MemFree:        63146420 kB
Buffers:           59788 kB
Active:            57960 kB
Inactive:          93452 kB

Looks like the patch is successful.

Comment by Peter Jones [ 05/Feb/15 ]

Great news. Landed for 2.5.4 and 2.7
