[LU-2084] Kernel freeze allocating more memory than there is RAM Created: 03/Oct/12 Updated: 27/Oct/21 Resolved: 27/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.2.0, Lustre 2.3.0, Lustre 2.4.0, Lustre 2.1.3, Lustre 1.8.8 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Doug Oucharek (Inactive) | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 4350 |
| Description |
|
While working with router buffers, I set the number of large buffers to a value beyond the amount of memory I had assigned to the VM running Lustre. Number of large buffers: 1024; amount of memory: 1G. The VM froze with all 3 virtual CPUs running at 100%. Looking deeper into this, I found that the Linux memory allocation system will keep trying to free up memory to satisfy the request. However, even after waiting 15 minutes, the VM did not "unfreeze". I changed the default flags we use for memory allocation to include __GFP_NORETRY to stop the memory allocator from looping. When re-running the above test, I found the system no longer froze but returned -ENOMEM to the caller as expected. This bug is to track a discussion as to whether we should start using __GFP_NORETRY and, if so, how widely. |
| Comments |
| Comment by Keith Mannthey (Inactive) [ 03/Oct/12 ] |
|
In general: I have seen near-OOM situations take hours (think overnight to days) to work themselves out, if that is possible at all; overcommittal by kernel-side code fares badly. 15 minutes is just getting started with these things. In your "frozen" state your kernel was likely not broken in any way, just busy trying to accomplish what you asked it to do. There are going to be some critical sections of code that cannot fail and should block until fulfilled. It is definitely a per-allocation-class question whether it should block or not. If you are allocating a whole system's worth of memory, it should use __GFP_NORETRY and check for -ENOMEM; I would also hope that resource users this large would grow carefully as needed. Where did you add the __GFP_NORETRY flag? |
| Comment by Doug Oucharek (Inactive) [ 03/Oct/12 ] |
|
Given the many layers of "abstraction" in libcfs for OS portability, I ended up just hardcoding the addition of __GFP_NORETRY in routine cfs_alloc_flags_to_gfp() in linux-mem.c. Since we don't preallocate everything and do have some level of dynamic allocation, the possibility exists that a happily running server could all of a sudden appear to freeze. I myself thought I had crashed the kernel, since the terminal was no longer responsive; only by watching the CPU meter on the host system did I notice that the VM was still running at 100%. From a user's perspective, I would rather have an error message saying "task X could not be done because of a lack of memory" than a freeze. |
| Comment by Keith Mannthey (Inactive) [ 03/Oct/12 ] |
|
Yes, working to keep the system out of OOM is a much better user experience. cfs_alloc_flags_to_gfp() seems to be pretty low level, though; I would think a huge amount of code would be affected. What are you seeing as your -ENOMEM indication? |
| Comment by Andreas Dilger [ 04/Oct/12 ] |
|
Doug, wouldn't it make sense to limit the number of router buffers to some amount less than the total amount of RAM? Using __GFP_NORETRY in a blanket fashion seems like it could cause gratuitous system failures for cases where there is low memory, but the allocation is not absurd like in your case. |
| Comment by Isaac Huang (Inactive) [ 04/Oct/12 ] |
|
1. I think __GFP_NORETRY is reasonable for router buffers. Routers should be dedicated nodes, where there's nothing else running - i.e. there's nothing like dirty pages to be flushed or idle process pages to be swapped out, so it makes little sense to make the VM retry. 2. I don't think we should make it foolproof by limiting large_router_buffers. System administrators should understand what large_router_buffers does; if they ask for too much, they are asking for trouble and should get it. Such failures happen only once at router startup, and such routers would be avoided by clients and servers via their router pingers, so the consequence should not be catastrophic. Then the admin should notice it and learn his lesson. |
| Comment by Doug Oucharek (Inactive) [ 04/Oct/12 ] |
|
This becomes more complicated when looking forward to the Dynamic LNet Config project, which will make the router buffer pools changeable. With the code as it is today, if a user tells a running router to increase the size of a pool beyond available memory, we will see the router lock up for potentially hours. That is unacceptable. If we use __GFP_NORETRY, it may return -ENOMEM in cases where memory could have been freed to satisfy the request; however, I would rather see this than a live router lockup. Checking ahead of time to see whether there is RAM available does not sound easy given how the Linux memory manager works, and I feel this would be doing the OS's job for it. I heard somewhere that work was done on the memory manager in the Linux 3.x series to address this sort of issue. None of that was back-ported to 2.6. |
| Comment by Isaac Huang (Inactive) [ 04/Oct/12 ] |
|
I tend to think __GFP_NORETRY is sufficient. On dedicated routers, where could the VM free much memory from? |
| Comment by Doug Oucharek (Inactive) [ 04/Oct/12 ] |
|
Good point. Ok, I can add CFS_ALLOC_NORETRY to our own set of memory allocation flags and map this to __GFP_NORETRY when present. This way it can be added on a case by case basis. I will only add this flag when allocating router buffers. |
| Comment by Andreas Dilger [ 05/Oct/12 ] |
|
As much as we could wish everyone using Lustre understood it as well as the developers, I don't think this is at all realistic. Users need to be told that something they are trying to do is unrealistic, rather than causing failures or hanging/crashing the node. Having a check like the following seems reasonable: if (router_buffer_pages > cfs_num_physpages * 7 / 8) { CERROR("too much router memory requested: max %u\n", cfs_num_physpages * 7 / 8); RETURN(-EINVAL); } with allowances for printing messages with proper units, etc. We still need to keep some memory for other things as well, which we may not get with a simple -ENOMEM case. |
| Comment by Gerrit Updater [ 09/Oct/21 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45174 |
| Comment by Gerrit Updater [ 27/Oct/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45174/ |
| Comment by Peter Jones [ 27/Oct/21 ] |
|
Landed for 2.15 |