Details
Type: Bug
Resolution: Fixed
Priority: Critical
Affects Versions: Lustre 2.5.0, Lustre 2.4.3
Environment:
Clients: Endeavour 2.4.3, ldan 2.4.1, Pleiades compute nodes 2.1.5 or 2.4.1
Servers: 2.1.5, 2.4.1, 2.4.3
Severity: 2
Rank: 14374
Description
We have been seeing our SLES11SP2 and SLES11SP3 clients accumulate stuck anonymous memory that cannot be cleared without a reboot. We have three test cases that reliably reproduce the problem. We have been able to reproduce the problem on different clients on all of our Lustre file systems, but we have not been able to reproduce it when using NFS, ext3, CXFS, or tmpfs.
We have been working with SGI on tracking down this problem. Unfortunately, they have been unable to reproduce it on their systems. On our systems, they have simplified the test case to mmapping a file along with an equally sized anonymous region, and reading the contents of the mmapped file into the anonymous region. This test case can be provided to see if you can reproduce the problem; a minimal sketch of it follows.
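For reference, here is a minimal C sketch of what we understand the simplified test case to do; the file argument and error handling are ours, not SGI's actual test program. It maps a file on Lustre, maps an equally sized anonymous region, and copies the file contents into the anonymous region.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file-on-lustre>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
    size_t len = (size_t)st.st_size;

    /* File-backed mapping of the test file on Lustre. */
    void *src = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (src == MAP_FAILED) { perror("mmap file"); return 1; }

    /* Equally sized anonymous mapping. */
    void *dst = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (dst == MAP_FAILED) { perror("mmap anon"); return 1; }

    /* Read the mmapped file contents into the anonymous region. */
    memcpy(dst, src, len);

    munmap(dst, len);
    munmap(src, len);
    close(fd);
    return 0;
}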
To determine whether the problem is occurring, reboot the system to ensure that memory is clean, then check /proc/meminfo for the amount of Active(anon) memory in use. Run the test case; while it runs, the amount of anonymous memory will increase. At the end of the test case, the amount should drop back to pre-test levels.
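The check itself is just reading /proc/meminfo (the same as "grep 'Active(anon)' /proc/meminfo"); a small C equivalent, in case it is useful for scripting the before/after comparison:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Print the Active(anon) line from /proc/meminfo. */
    FILE *f = fopen("/proc/meminfo", "r");
    char line[256];

    if (!f) {
        perror("fopen /proc/meminfo");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "Active(anon):", 13) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}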
To confirm that the anonymous memory is stuck, we have been using memhog to attempt to allocate memory. If the node has 32 GB of memory, with 2 GB of anonymous memory still in use, we attempt to allocate 31 GB. If memhog completes and you are then left with only 1 GB of anonymous memory, you have not reproduced the problem; if memhog is killed, you have.
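A rough C stand-in for what the memhog step does, assuming the size in GiB is passed on the command line (e.g. 31); this is our sketch, not the actual memhog tool:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    /* Size to allocate, in GiB, from the command line. */
    size_t gib = (argc > 1) ? strtoul(argv[1], NULL, 10) : 1;
    size_t len = gib << 30;

    char *buf = malloc(len);
    if (!buf) { perror("malloc"); return 1; }

    /* Touch every byte so the allocation is actually backed by pages.
     * If the stuck anonymous memory cannot be reclaimed, the OOM
     * killer ends this process before the memset completes. */
    memset(buf, 0xa5, len);

    puts("allocation completed");
    free(buf);
    return 0;
}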
SGI would like guidance on how to gather debug information to track down this problem.
Below is an example from a system with this stuck memory. The system has 4 TB of memory, with 1.5 TB stuck in Active(anon) that cannot be released. There are 126 nodes in that system, and applications would request a number of nodes for their testing. After the memory leak, those nodes would not have enough memory for other jobs. The application would fail, and resubmission of the job would then fail to start because the requested nodes did not have enough memory.
MemTotal: 4036524872 kB
MemFree: 2399583516 kB
Buffers: 243504 kB
Cached: 5204560 kB
SwapCached: 678520 kB
Active: 1544908812 kB
Inactive: 56619188 kB
Active(anon): 1543105636 kB
Inactive(anon): 53018772 kB
Active(file): 1803176 kB
Inactive(file): 3600416 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 10239996 kB
SwapFree: 0 kB
Dirty: 554504 kB
Writeback: 26128 kB
AnonPages: 1595359296 kB
Mapped: 143708 kB
Shmem: 98772 kB
Slab: 11485844 kB
SReclaimable: 161660 kB
SUnreclaim: 11324184 kB
KernelStack: 87016 kB
PageTables: 6747560 kB
NFS_Unstable: 31856 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 2027403680 kB
Committed_AS: 1262704572 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 18824052 kB
VmallocChunk: 25954600480 kB
HardwareCorrupted: 0 kB
AnonHugePages: 1264787456 kB
HugePages_Total: 1073
HugePages_Free: 468
HugePages_Rsvd: 468
HugePages_Surp: 1073
Hugepagesize: 2048 kB
DirectMap4k: 335872 kB
DirectMap2M: 134963200 kB
DirectMap1G: 3958374400 kB