
WRF runs causing Lustre clients to lose memory

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Minor

    Description

      At our center, we are running a Lustre 2.1.2 file system with Lustre 2.1.2 clients on all of the compute nodes of our Penguin cluster. Recently, a user has been performing WRF runs in which he uses a special feature of WRF to offload all of the I/O onto a single node. This improves his I/O performance dramatically, but it results in that node losing ~1 GB of memory to "Inactive" after each run. Our epilogue runs a script that checks whether available free memory is above a specified percentage, and every job this user runs ends with the node being set offline because of this 1 GB of Inactive memory.

      Here is example output showing the memory state on one of these nodes before and after the epilogue drops caches:

      Before:
        MemTotal:  15.681 GB
        MemFree:    6.495 GB
        Cached:     6.206 GB
        Active:     1.395 GB
        Inactive:   6.247 GB
        Dirty:      0.000 GB
        Mapped:     0.003 GB
        Slab:       1.391 GB

      After:
        MemTotal:  15.681 GB
        MemFree:   14.003 GB
        Cached:     0.007 GB
        Active:     0.134 GB
        Inactive:   1.309 GB
        Dirty:      0.000 GB
        Mapped:     0.003 GB
        Slab:       0.082 GB
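      For context, here is a minimal sketch of the kind of check the epilogue performs: it drops caches, re-reads /proc/meminfo, and compares free memory against a percentage of MemTotal before deciding whether to return the node to service. The threshold and retry count below are illustrative placeholders, not the values from our actual epilogue script.

      #!/usr/bin/env python
      # Illustrative sketch only: the threshold and retry count are made up,
      # not our production values. Writing drop_caches requires root.

      def meminfo_kb():
          """Parse /proc/meminfo into a dict of {field: size in kB}."""
          info = {}
          with open("/proc/meminfo") as f:
              for line in f:
                  key, rest = line.split(":", 1)
                  info[key] = int(rest.split()[0])  # second token is the kB value
          return info

      def drop_caches():
          """Ask the kernel to drop page cache, dentries and inodes."""
          with open("/proc/sys/vm/drop_caches", "w") as f:
              f.write("3\n")

      def node_ok(min_free_pct=90.0, attempts=2):
          """Drop caches up to `attempts` times, then check free memory."""
          for _ in range(attempts):
              drop_caches()
          info = meminfo_kb()
          free_pct = 100.0 * info["MemFree"] / info["MemTotal"]
          print("MemFree %.3f GB (%.1f%%), Inactive %.3f GB" % (
              info["MemFree"] / 1048576.0, free_pct, info["Inactive"] / 1048576.0))
          return free_pct >= min_free_pct

      if __name__ == "__main__":
          import sys
          sys.exit(0 if node_ok() else 1)  # non-zero exit would offline the node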

      While looking for possible solutions to this problem, I stumbled upon a recent HPDD-Discuss thread entitled "Possible file page leak in Lustre 2.1.2" that describes a problem very similar to ours. It was suggested there that the issue had already been discovered and resolved in http://jira.whamcloud.com/browse/LU-1576.

      This ticket suggests that the resolution was included as part of Lustre 2.1.3, so we tested this by installing the Lustre 2.1.3 client packages on some of our compute nodes and allowing the WRF job to run on these nodes. However, even after the upgrade to Lustre 2.1.3, we still saw the inactive memory at the end of the job. Do we need to upgrade our Lustre installation on the OSSes and MDS to Lustre 2.1.3 to fix this problem, or do you have any other suggestions?

      Any help that you could provide us with would be appreciated!

        Activity


          cliffw Cliff White (Inactive) added a comment -

          Thanks, let us know how it goes.
          adizon Archie Dizon added a comment -

          Yes, we had tested installing 2.1.3 on a couple of our client systems to
          see if that would fix the problem, but we were still seeing the issue on
          those nodes with the Lustre 2.1.3 client installed. Thanks for clarifying
          that; this code does not appear to perform a great deal of readdirs, so it
          is probably not the same memory leak.

          Correct, dropping cache does not free the 1 GB of memory. Our epilogue
          script attempts to drop cache twice, and after the second time it compares
          the amount of free memory before determining if it can return the compute
          node to service.

          We are going to run the WRF job with Lustre at a higher logging level and
          using the leak_finder.pl script provided by WhamCloud. We will send
          whatever we find along to you.

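          The procedure we plan to follow is roughly the usual leak-check recipe:
          enable allocation tracing, run the job, dump the kernel debug log, and run
          leak_finder.pl over it. A rough sketch of those steps, wrapped in Python
          (the WRF job command and the log path below are placeholders):

          #!/usr/bin/env python
          # Rough sketch of the planned leak-check steps; the job command and the
          # log path are placeholders, and the lctl invocations assume the usual
          # debug-log workflow (enable +malloc tracing, dump, analyze).
          import subprocess

          DEBUG_LOG = "/tmp/lustre-debug.log"  # placeholder path for the ASCII debug dump

          subprocess.check_call(["lctl", "set_param", "debug=+malloc"])  # trace kernel allocations
          subprocess.check_call(["bash", "run_wrf_job.sh"])              # placeholder for the WRF run
          subprocess.check_call(["lctl", "dk", DEBUG_LOG])               # dump + convert the debug log
          subprocess.check_call(["perl", "leak_finder.pl", DEBUG_LOG])   # report unmatched allocations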

          cliffw Cliff White (Inactive) added a comment -

          You indicated that you had installed 2.1.3, which contains the fix for
          LU-1576; that was our main indication. The LU-1576 fix mostly deals with
          readdir pages, so unless your workload includes a lot of readdirs, you
          likely have a different problem.

          Are you saying that dropping cache does not free the 1 GB of memory?
          adizon Archie Dizon added a comment -

          Regarding the question of waiting for a few minutes, the answer is no.
          Even if we wait for hours, the inactive memory is never given back to the
          system; we are forced to reboot these nodes to restore their full memory.
          As you can see from the output in my last message, we start off with
          > 6 GB of inactive memory at the beginning of the epilogue and are down to
          ~1 GB of inactive memory after the epilogue has waited approximately
          30 seconds. However, no matter how long we wait, that last 1 GB of memory
          is never returned to the system.

          We had planned to set up a run of WRF to test the memory usage on our test
          cluster, but this has gotten delayed as all of us were busy during the
          week. We will have to wait until next week to get you some data on memory
          usage.

          Having talked with someone much more familiar with WRF and its
          dependencies than I am, it sounds like running the WRF software the way it
          is being run here would be a fairly big hassle. In other words, getting it
          running for you locally may be fairly difficult. We will have to see
          whether going down that road is necessary once we give you more data.

          In the meantime, I'm curious as to how WhamCloud has determined that our
          problem does not match up with http://jira.whamcloud.com/browse/LU-1576.
          The symptoms are identical, and it was suggested on the HPDD discussion
          list that this occurred in Lustre 2.1.2 with certain irregular I/O
          patterns. What do they see as different between our problem and the one
          described by LLNL on the list? For my own reference, I would be interested
          to know how they determined that, so I can use their methods to better
          diagnose Lustre problems in the future.

          I'll have more to share with you next week.

          Thanks


          cliffw Cliff White (Inactive) added a comment -

          Can you update us on your status?

          cliffw Cliff White (Inactive) added a comment -

          It is possible you are seeing uncommitted writes to the OSTs - if you wait
          5-10 minutes, can the cache be cleaned? It is possible you are seeing a
          leak; however, this does not appear to match LU-1576. The 'slabtop' tool
          may provide some additional data on memory consumption.
          If this is a leak, is there any way you can run this on a single node,
          and/or provide us with the workload? We may need to reproduce this one
          locally to fix the issue.
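          Regarding slabtop: for a quick, scriptable view of slab usage, a sketch
          like the following can be run on the node. It reads /proc/slabinfo (which
          may require root on some kernels) and simply ranks caches by approximate
          size, similar in spirit to slabtop's default view.

          #!/usr/bin/env python
          # Minimal sketch: rank slab caches by approximate memory use, roughly
          # what slabtop shows by default. Reads /proc/slabinfo.

          def top_slabs(n=15):
              rows = []
              with open("/proc/slabinfo") as f:
                  for line in f:
                      if line.startswith(("slabinfo", "#")):
                          continue  # skip the version line and the column header
                      fields = line.split()
                      name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
                      rows.append((num_objs * objsize, name, num_objs))
              rows.sort(reverse=True)
              for size, name, num_objs in rows[:n]:
                  print("%-28s %10d objs %10.1f MB" % (name, num_objs, size / 1048576.0))

          if __name__ == "__main__":
              top_slabs()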

          People

            Assignee: green Oleg Drokin
            Reporter: adizon Archie Dizon
            Votes: 0
            Watchers: 9
