[LU-5179] Reading files from lustre results in stuck anonymous memory when JOBID is enabled Created: 12/Jun/14 Updated: 12/Aug/14 Resolved: 20/Jun/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0, Lustre 2.4.3 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.3 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Herbert Yeung | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB |
| Environment: | Clients: |
| Attachments: | |
| Issue Links: | |
| Severity: | 2 |
| Rank (Obsolete): | 14374 |
| Description |
|
We have been seeing our SLES11SP2 and SLES11SP3 clients accumulate stuck anonymous memory that cannot be cleared without a reboot. We have three test cases which can replicate the problem reliably, and we have been able to replicate it on different clients on all of our Lustre file systems. We have not been able to reproduce the problem when using NFS, ext3, CXFS, or tmpfs.

We have been working with SGI on tracking down this problem. Unfortunately, they have been unable to reproduce it on their systems. On our systems, they have simplified the test case to mmapping a file along with an equally sized anonymous region and reading the contents of the mmapped file into the anonymous region. This test case can be provided to see if you can reproduce the problem.

To determine whether the problem is occurring, reboot the system to ensure that memory is clean, check /proc/meminfo for the amount of Active(anon) memory in use, and then run the test case. During the test case, the amount of anonymous memory will increase. At the end of the test case, the amount would be expected to drop back to pre-test levels. To confirm that the anonymous memory is stuck, we have been using memhog to attempt to allocate memory. If the node has 32 GB of memory, with 2 GB of anonymous memory used, we attempt to allocate 31 GB. If memhog completes and you are then left with only 1 GB of anonymous memory, you have not reproduced the problem; if memhog is killed, you have.

SGI would like to get information about how to gather debug data to track down this problem. |
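For reference, a minimal sketch of the simplified test case described above: mmap a file, mmap an equally sized anonymous region, and copy the file contents into the anonymous region. This is only an illustration of the described reproducer, not the attached test program; the file path is a placeholder for a file on a Lustre mount.

```c
/*
 * Minimal sketch of the described reproducer (not the attached test
 * program).  The path argument is a placeholder for a file on Lustre.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : "/lustre/testfile";
        struct stat st;
        int fd = open(path, O_RDONLY);

        if (fd < 0 || fstat(fd, &st) < 0) {
                perror("open/fstat");
                return 1;
        }

        /* File-backed mapping of the whole file. */
        void *file_map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        /* Equally sized anonymous mapping. */
        void *anon_map = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (file_map == MAP_FAILED || anon_map == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Read the mmapped file into the anonymous region. */
        memcpy(anon_map, file_map, st.st_size);

        munmap(anon_map, st.st_size);
        munmap(file_map, st.st_size);
        close(fd);
        /* After exit, Active(anon) in /proc/meminfo would normally drop
         * back to its pre-test level; on affected clients it does not. */
        return 0;
}
```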
| Comments |
| Comment by Oleg Drokin [ 12/Jun/14 ] |
|
Please upload your test case. You list multiple client node versions; is this observed on all of them? |
| Comment by Herbert Yeung [ 12/Jun/14 ] |
|
Yes, we have been able to reproduce the problem on all of the clients. Though I am not positive that we have tested every permutation, I believe the problem has been reproduced with all of the Lustre versions on all of the clients. |
| Comment by Oleg Drokin [ 12/Jun/14 ] |
|
From the call: this was also reproduced on a RHEL 6.4 kernel with the same test case. The reproducer only works on NASA systems and is not 100% reliable, but it still triggers with high frequency. This was originally investigated as a result of strange OOM issues. |
| Comment by Herbert Yeung [ 13/Jun/14 ] |
|
Second test case that reproduces the problem. |
| Comment by Herbert Yeung [ 13/Jun/14 ] |
|
Normally, we run the test script through PBS; to do so, you can use My_run.ivy. If you want to run interactively or without PBS, use My_run_I_ivy. Sample output from PBS is at kdgordon.o328785, though some of the data has been sanitized.

The script prints out a variety of information, including the amount of anonymous memory being used by the system. The runit script is then called, which produces the stuck anonymous memory. After that, the amount of used anonymous memory is checked again, memhog is called to try to clear up the memory, and the amount of anonymous memory is displayed once more. You will need to tweak the amount of memory that memhog attempts to allocate based on how much memory your test system has.

Generally, about 800 MB of memory gets stuck after running the job. runit can be called several times to increase the amount of stuck memory. After running the test, the file checkpoint.mtcp is created; on some of our systems, this file may need to be deleted before the problem can be reproduced again. Note that after running memhog, the stuck memory can instead be pushed out to swap and remain in use there; this still indicates that the memory is not being freed. |
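The attached scripts are not reproduced here. As a rough illustration of the check they perform, the following hypothetical helper (not part of the attached scripts; the name and output format are assumptions) prints the Active(anon) line from /proc/meminfo, which is the value the procedure compares before and after running runit and memhog.

```c
/*
 * Hypothetical helper (not part of the attached scripts): print the
 * Active(anon) line from /proc/meminfo, the value compared before and
 * after running runit and memhog.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[256];
        FILE *fp = fopen("/proc/meminfo", "r");

        if (fp == NULL) {
                perror("fopen /proc/meminfo");
                return 1;
        }

        while (fgets(line, sizeof(line), fp) != NULL) {
                if (strncmp(line, "Active(anon):", 13) == 0) {
                        fputs(line, stdout);
                        break;
                }
        }

        fclose(fp);
        return 0;
}
```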
| Comment by Oleg Drokin [ 13/Jun/14 ] |
|
I would just like to note that there is no source for timeout and test_ckpt, so it's hard to see what they are doing. |
| Comment by Herbert Yeung [ 16/Jun/14 ] |
|
Source for test_ckpt has been uploaded; timeout comes from coreutils-8.22. |
| Comment by Jinshan Xiong (Inactive) [ 19/Jun/14 ] |
|
can you show me the output of /proc/meminfo when you see this problem? |
| Comment by Jay Lan (Inactive) [ 19/Jun/14 ] |
|
Below is an example of a system stuck with this memory. The system has 4 TB of memory, with 1.5 TB stuck in Active(anon) that cannot be released. There are 126 nodes in that system, and applications request a number of nodes for their testing. After the memory leak, those nodes no longer have enough memory for other jobs: the application fails, and resubmission of the job then fails to start because the requested nodes do not have enough memory.
|
| Comment by Jinshan Xiong (Inactive) [ 19/Jun/14 ] |
|
Just to narrow down the problem, will you please try it again with huge pages disabled? To be precise, I mean transparent huge pages. |
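On these kernels, transparent huge pages are usually disabled by writing "never" to /sys/kernel/mm/transparent_hugepage/enabled. A tiny stand-alone helper doing the same from C is sketched below, purely for completeness; it assumes the standard sysfs path and must be run as root.

```c
/*
 * One-off helper (assumption: standard sysfs path, run as root):
 * disable transparent huge pages for the duration of a test, the
 * same effect as writing "never" to the sysfs control file.
 */
#include <stdio.h>

int main(void)
{
        const char *path = "/sys/kernel/mm/transparent_hugepage/enabled";
        FILE *fp = fopen(path, "w");

        if (fp == NULL) {
                perror(path);
                return 1;
        }
        fputs("never\n", fp);
        fclose(fp);
        return 0;
}
```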
| Comment by Scott Emery [ 19/Jun/14 ] |
|
ldan2 and ldan3 below are fairly ordinary self-contained systems running

Log into ldan2. (I've mostly used a qsub session, but I have reproduced the problem outside of PBS.)

Log into ldan3. (I have special permission to log into an ldan I don't

On both systems, cd to the test (lustre) directory

2. a 1g file created with: #

On ldan3 run:

On ldan2 run:

After the mmap program terminates, notice that the anonymous memory is not freed. The problem does not reproduce if the dd "interference" is not run. |
| Comment by Jinshan Xiong (Inactive) [ 19/Jun/14 ] |
|
We may have found the problem; Oleg will push a patch soon. |
| Comment by Oleg Drokin [ 19/Jun/14 ] |
|
I did an audit of all the mmput calls in the code; there are only two functions that use it. My proposed patch is at http://review.whamcloud.com/10759, please give it a try. |
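For context, get_task_mm() and mmput() are the kernel reference-counting pair being audited here: every get_task_mm() must be balanced by an mmput(), otherwise the mm_struct (and the anonymous pages it maps) stays pinned after the process exits. The sketch below is purely illustrative of that bug class; the helper names and the work done against the mm are hypothetical, and this is neither the Lustre code nor the actual patch.

```c
/*
 * Illustrative sketch only -- neither the Lustre code being audited
 * nor the patch at http://review.whamcloud.com/10759.  It shows the
 * bug class: a get_task_mm() that is not balanced by mmput() pins the
 * mm_struct and its anonymous pages after the task exits.
 */
#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/sched.h>

/* Hypothetical stand-in for whatever work is done against the mm. */
static int do_something_with(struct mm_struct *mm)
{
	return mm ? 0 : -EINVAL;
}

static int read_from_task_mm(struct task_struct *task)
{
	struct mm_struct *mm = get_task_mm(task);
	int rc;

	if (!mm)
		return -EINVAL;

	rc = do_something_with(mm);

	/*
	 * Returning early here without calling mmput() would leak the
	 * reference taken by get_task_mm() and leave anonymous memory
	 * stuck -- the pattern this ticket is about.
	 */
	mmput(mm);
	return rc;
}
```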
| Comment by Oleg Drokin [ 19/Jun/14 ] |
|
Also additional info. |
| Comment by Jay Lan (Inactive) [ 19/Jun/14 ] |
|
The patch affects both server and client. If I only update the client, would it solve the problem at the client side? |
| Comment by Jinshan Xiong (Inactive) [ 19/Jun/14 ] |
|
yes, applying it on the client side only will fix the problem. |
| Comment by Scott Emery [ 19/Jun/14 ] |
|
Jay built a copy of the client for the ldan test system. Initial testing is positive: this fixes the test cases reported in this LU on the type of system I am using for testing. |
| Comment by Jodi Levi (Inactive) [ 20/Jun/14 ] |
|
Patch landed to Master. Please reopen ticket if more work is needed. |
| Comment by Swapnil Pimpale (Inactive) [ 27/Jun/14 ] |
|
b2_4 backport: http://review.whamcloud.com/10868 |