Details
- Bug
- Resolution: Not a Bug
- Minor
- None
- None
- None
- 3
- 6768
Description
At our center, we are running a Lustre 2.1.2 file system with Lustre 2.1.2 clients on all of the compute nodes of our Penguin cluster. Recently, a user has been performing WRF runs in which he uses a special feature of WRF to offload all of the I/O onto a single node. This improves his I/O performance dramatically, but it leaves that node with ~1 GB of memory stuck in "Inactive" after each run. Our epilogue script checks that available free memory is above a specified percentage, and every job this user runs results in the node being set offline because of this 1 GB of Inactive memory.
Here is an example of the memory statistics from one of these nodes, before and after the epilogue runs drop_caches:
- Before:
- MemTotal: 15.681 GB
- MemFree: 6.495 GB
- Cached: 6.206 GB
- Active: 1.395 GB
- Inactive: 6.247 GB
- Dirty: 0.000 GB
- Mapped: 0.003 GB
- Slab: 1.391 GB
- After:
- MemTotal: 15.681 GB
- MemFree: 14.003 GB
- Cached: 0.007 GB
- Active: 0.134 GB
- Inactive: 1.309 GB
- Dirty: 0.000 GB
- Mapped: 0.003 GB
- Slab: 0.082 GB
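For reference, the before/after numbers above are gathered by the epilogue with something along these lines (a simplified sketch; the real script checks additional fields and uses a site-specific report format):
# snapshot the fields of interest, drop caches, wait, snapshot again
grep -E 'MemTotal|MemFree|Cached|Active|Inactive|Dirty|Mapped|Slab' /proc/meminfo > /tmp/meminfo.before
sync
echo 3 > /proc/sys/vm/drop_caches      # free page cache, dentries and inodes
sleep 30                               # give the kernel a moment to settle
grep -E 'MemTotal|MemFree|Cached|Active|Inactive|Dirty|Mapped|Slab' /proc/meminfo > /tmp/meminfo.after
diff -u /tmp/meminfo.before /tmp/meminfo.after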
While looking for possible solutions to this problem, I came across a recent HPDD-Discuss thread titled "Possible file page leak in Lustre 2.1.2" that described a problem very similar to ours. It was suggested there that the issue had already been identified and resolved in http://jira.whamcloud.com/browse/LU-1576.
That ticket indicates that the fix was included in Lustre 2.1.3, so we tested this by installing the Lustre 2.1.3 client packages on some of our compute nodes and letting the WRF job run on them. However, even after the upgrade to Lustre 2.1.3, we still saw the inactive memory at the end of the job. Do we need to upgrade the Lustre installation on our OSSes and MDS to 2.1.3 to fix this problem, or do you have any other suggestions?
Any help that you could provide us with would be appreciated!
Attachments
Activity
So, just to reconfirm: when you run this application several times on the same client, each run adds another ~1 GB of inactive memory, so eventually the node will die with an OOM?
If you unmount the Lustre filesystem on this client after the run and then mount it again, instead of rebooting, is the memory reclaimed?
I would not put too much weight on the leaks you see reported; those readings are useless unless taken after the unmount, since every bit of memory that is allocated but not yet freed (because it is still in use) will show up as leaked.
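Something along these lines on the affected client would answer that (a rough sketch; the mount point and MGS NID below are placeholders for your actual ones):
grep -E 'MemFree|Cached|Active|Inactive|Slab' /proc/meminfo   # after the run
umount /mnt/lustre                                            # fails with EBUSY if files are still open
grep -E 'MemFree|Cached|Active|Inactive|Slab' /proc/meminfo   # after unmount
mount -t lustre mgsnode@tcp0:/fsname /mnt/lustre              # remount instead of rebooting
grep -E 'MemFree|Cached|Active|Inactive|Slab' /proc/meminfo   # after remount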
Thank you, I am engaging further engineering resources now.
NOTE: The debug log is too large to attach to this case. Here is a link instead.
https://www.dropbox.com/s/vwuklioioytcl7e/lustre_debug
Thanks
The customer ran their WRF job with Lustre debugging set to gather malloc information, and it does appear that we have found a leak in Lustre. Here are the steps we followed:
1) sudo lctl set_param debug=+malloc
2) sudo lctl set_param debug_mb=512
3) * let the WRF job run *
Epilogue sets the node offline (1.62 GB of memory set to inactive)
4) sudo lctl dk /tmp/lustre_debug
5) perl leak_finder.pl /tmp/lustre_debug 2>&1 | grep "Leak"
From that last command, here is what we found:
- Leak: 1080 bytes allocated at ffff8101d5eae140 (super25.c:ll_alloc_inode:56, debug file line 1745506)
- Leak: 104 bytes allocated at ffff8101cecdbdc0 (dcache.c:ll_set_dd:192, debug file line 1745508)
- Leak: 1080 bytes allocated at ffff810214eb3ac0 (super25.c:ll_alloc_inode:56, debug file line 1745551)
- Leak: 104 bytes allocated at ffff8101d7523840 (dcache.c:ll_set_dd:192, debug file line 1745553)
The Lustre documentation states that this is a cyclic log, so I would imagine that if a small leak shows up here, small amounts of memory could have been lost repeatedly throughout the run, adding up to our overall large leak.
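For reference, the capture sequence above amounts to the following (a sketch; debug_mb may need to be larger than 512 so the cyclic buffer does not wrap during a long run):
#!/bin/bash
# Sketch of the debug-log capture used above; run as root on the client.
lctl set_param debug=+malloc        # also log memory allocations/frees
lctl set_param debug_mb=512         # enlarge the cyclic debug buffer (may need to be bigger)
# ... run the WRF job here ...
lctl dk /tmp/lustre_debug           # dump the kernel debug log to a file
perl leak_finder.pl /tmp/lustre_debug 2>&1 | grep "Leak" > /tmp/lustre_leaks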
We will attach the lustre_debug log to this case for you to analyze as well. It does look as though we may be closing in on the problem now.
Additionally, I am going to attach the /proc/slabinfo for the end of the
WRF run as you had previously requested, along with the /proc/meminfo
before, during, and after the WRF run.
cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
ll_qunit_cache 0 0 112 34 1 : tunables 120 60 8
: slabdata 0 0 0
lmv_objects 0 0 96 40 1 : tunables 120 60 8
: slabdata 0 0 0
ccc_req_kmem 0 0 40 92 1 : tunables 120 60 8
: slabdata 0 0 0
ccc_session_kmem 105 132 176 22 1 : tunables 120 60 8
: slabdata 6 6 0
ccc_thread_kmem 112 121 336 11 1 : tunables 54 27 8
: slabdata 11 11 0
ccc_object_kmem 0 0 256 15 1 : tunables 120 60 8
: slabdata 0 0 0
ccc_lock_kmem 0 0 40 92 1 : tunables 120 60 8
: slabdata 0 0 0
vvp_session_kmem 105 148 104 37 1 : tunables 120 60 8
: slabdata 4 4 0
vvp_thread_kmem 112 126 440 9 1 : tunables 54 27 8
: slabdata 14 14 0
vvp_page_kmem 0 0 80 48 1 : tunables 120 60 8
: slabdata 0 0 0
ll_rmtperm_hash_cache 0 0 256 15 1 : tunables 120 60
8 : slabdata 0 0 0
ll_remote_perm_cache 0 0 40 92 1 : tunables 120 60
8 : slabdata 0 0 0
ll_file_data 0 0 192 20 1 : tunables 120 60 8
: slabdata 0 0 0
lustre_inode_cache 3 21 1088 7 2 : tunables 24 12 8
: slabdata 3 3 0
lov_oinfo 0 0 320 12 1 : tunables 54 27 8
: slabdata 0 0 0
lov_lock_link_kmem 0 0 32 112 1 : tunables 120 60 8
: slabdata 0 0 0
lovsub_req_kmem 0 0 40 92 1 : tunables 120 60 8
: slabdata 0 0 0
lovsub_object_kmem 0 0 240 16 1 : tunables 120 60 8
: slabdata 0 0 0
lovsub_lock_kmem 0 0 64 59 1 : tunables 120 60 8
: slabdata 0 0 0
lovsub_page_kmem 0 0 40 92 1 : tunables 120 60 8
: slabdata 0 0 0
lov_req_kmem 0 0 40 92 1 : tunables 120 60 8
: slabdata 0 0 0
lov_session_kmem 105 120 384 10 1 : tunables 54 27 8
: slabdata 12 12 0
lov_thread_kmem 112 121 336 11 1 : tunables 54 27 8
: slabdata 11 11 0
lov_object_kmem 0 0 200 19 1 : tunables 120 60 8
: slabdata 0 0 0
lov_lock_kmem 0 0 104 37 1 : tunables 120 60 8
: slabdata 0 0 0
lov_page_kmem 0 0 48 77 1 : tunables 120 60 8
: slabdata 0 0 0
osc_req_kmem 0 0 40 92 1 : tunables 120 60 8
: slabdata 0 0 0
osc_session_kmem 105 130 296 13 1 : tunables 54 27 8
: slabdata 10 10 0
osc_thread_kmem 112 126 216 18 1 : tunables 120 60 8
: slabdata 7 7 0
osc_object_kmem 0 0 136 28 1 : tunables 120 60 8
: slabdata 0 0 0
osc_lock_kmem 0 0 184 21 1 : tunables 120 60 8
: slabdata 0 0 0
osc_page_kmem 0 0 264 15 1 : tunables 54 27 8
: slabdata 0 0 0
llcd_cache 0 0 3952 1 1 : tunables 24 12 8
: slabdata 0 0 0
interval_node 22 90 128 30 1 : tunables 120 60 8
: slabdata 3 3 0
ldlm_locks 43 63 576 7 1 : tunables 54 27 8
: slabdata 9 9 0
ldlm_resources 41 72 320 12 1 : tunables 54 27 8
: slabdata 6 6 0
cl_page_kmem 0 0 184 21 1 : tunables 120 60 8
: slabdata 0 0 0
cl_lock_kmem 0 0 216 18 1 : tunables 120 60 8
: slabdata 0 0 0
cl_env_kmem 105 132 176 22 1 : tunables 120 60 8
: slabdata 6 6 0
capa_cache 0 0 184 21 1 : tunables 120 60 8
: slabdata 0 0 0
ll_import_cache 0 0 1424 5 2 : tunables 24 12 8
: slabdata 0 0 0
ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8
: slabdata 0 0 0
ll_obd_dev_cache 17 17 7048 1 2 : tunables 8 4 0
: slabdata 17 17 0
SDP 0 0 1792 2 1 : tunables 24 12 8
: slabdata 0 0 0
fib6_nodes 7 118 64 59 1 : tunables 120 60 8
: slabdata 2 2 0
ip6_dst_cache 7 36 320 12 1 : tunables 54 27 8
: slabdata 3 3 0
ndisc_cache 1 15 256 15 1 : tunables 120 60 8
: slabdata 1 1 0
RAWv6 11 12 960 4 1 : tunables 54 27 8
: slabdata 3 3 0
UDPv6 0 0 896 4 1 : tunables 54 27 8
: slabdata 0 0 0
tw_sock_TCPv6 0 0 192 20 1 : tunables 120 60 8
: slabdata 0 0 0
request_sock_TCPv6 0 0 192 20 1 : tunables 120 60 8
: slabdata 0 0 0
TCPv6 0 0 1728 4 2 : tunables 24 12 8
: slabdata 0 0 0
nfs_direct_cache 0 0 136 28 1 : tunables 120 60 8
: slabdata 0 0 0
nfs_write_data 36 36 832 9 2 : tunables 54 27 8
: slabdata 4 4 0
nfs_read_data 32 36 832 9 2 : tunables 54 27 8
: slabdata 4 4 0
nfs_inode_cache 123 195 1032 3 1 : tunables 24 12 8
: slabdata 65 65 0
nfs_page 0 0 128 30 1 : tunables 120 60 8
: slabdata 0 0 0
rpc_buffers 8 8 2048 2 1 : tunables 24 12 8
: slabdata 4 4 0
rpc_tasks 20 20 384 10 1 : tunables 54 27 8
: slabdata 2 2 0
rpc_inode_cache 30 30 768 5 1 : tunables 54 27 8
: slabdata 6 6 0
scsi_cmd_cache 5 10 384 10 1 : tunables 54 27 8
: slabdata 1 1 2
sgpool-128 32 32 4096 1 1 : tunables 24 12 8
: slabdata 32 32 0
sgpool-64 32 32 2048 2 1 : tunables 24 12 8
: slabdata 16 16 0
sgpool-32 32 32 1024 4 1 : tunables 54 27 8
: slabdata 8 8 0
sgpool-16 32 32 512 8 1 : tunables 54 27 8
: slabdata 4 4 0
sgpool-8 32 60 256 15 1 : tunables 120 60 8
: slabdata 3 4 0
scsi_io_context 0 0 112 34 1 : tunables 120 60 8
: slabdata 0 0 0
ib_mad 2048 2296 448 8 1 : tunables 54 27 8
: slabdata 287 287 0
ip_fib_alias 14 59 64 59 1 : tunables 120 60 8
: slabdata 1 1 0
ip_fib_hash 14 59 64 59 1 : tunables 120 60 8
: slabdata 1 1 0
UNIX 9 33 704 11 2 : tunables 54 27 8
: slabdata 3 3 0
flow_cache 0 0 128 30 1 : tunables 120 60 8
: slabdata 0 0 0
msi_cache 9 59 64 59 1 : tunables 120 60 8
: slabdata 1 1 0
cfq_ioc_pool 13 60 128 30 1 : tunables 120 60 8
: slabdata 2 2 0
cfq_pool 11 54 216 18 1 : tunables 120 60 8
: slabdata 3 3 0
crq_pool 4 96 80 48 1 : tunables 120 60 8
: slabdata 1 2 0
deadline_drq 0 0 80 48 1 : tunables 120 60 8
: slabdata 0 0 0
as_arq 0 0 96 40 1 : tunables 120 60 8
: slabdata 0 0 0
mqueue_inode_cache 1 4 896 4 1 : tunables 54 27 8
: slabdata 1 1 0
isofs_inode_cache 0 0 608 6 1 : tunables 54 27 8
: slabdata 0 0 0
hugetlbfs_inode_cache 1 7 576 7 1 : tunables 54 27
8 : slabdata 1 1 0
ext2_inode_cache 91 145 720 5 1 : tunables 54 27 8
: slabdata 29 29 0
ext2_xattr 0 0 88 44 1 : tunables 120 60 8
: slabdata 0 0 0
dnotify_cache 0 0 40 92 1 : tunables 120 60 8
: slabdata 0 0 0
dquot 0 0 256 15 1 : tunables 120 60 8
: slabdata 0 0 0
eventpoll_pwq 5 106 72 53 1 : tunables 120 60 8
: slabdata 2 2 0
eventpoll_epi 5 40 192 20 1 : tunables 120 60 8
: slabdata 2 2 0
inotify_event_cache 0 0 40 92 1 : tunables 120 60
8 : slabdata 0 0 0
inotify_watch_cache 0 0 72 53 1 : tunables 120 60
8 : slabdata 0 0 0
kioctx 0 0 320 12 1 : tunables 54 27 8
: slabdata 0 0 0
kiocb 0 0 256 15 1 : tunables 120 60 8
: slabdata 0 0 0
fasync_cache 0 0 24 144 1 : tunables 120 60 8
: slabdata 0 0 0
shmem_inode_cache 1360 1370 768 5 1 : tunables 54 27 8
: slabdata 274 274 0
posix_timers_cache 0 0 128 30 1 : tunables 120 60 8
: slabdata 0 0 0
uid_cache 2 30 128 30 1 : tunables 120 60 8
: slabdata 1 1 0
ip_mrt_cache 0 0 128 30 1 : tunables 120 60 8
: slabdata 0 0 0
tcp_bind_bucket 28 448 32 112 1 : tunables 120 60 8
: slabdata 4 4 0
inet_peer_cache 0 0 128 30 1 : tunables 120 60 8
: slabdata 0 0 0
secpath_cache 0 0 64 59 1 : tunables 120 60 8
: slabdata 0 0 0
xfrm_dst_cache 0 0 384 10 1 : tunables 54 27 8
: slabdata 0 0 0
ip_dst_cache 107 180 384 10 1 : tunables 54 27 8
: slabdata 18 18 0
arp_cache 53 75 256 15 1 : tunables 120 60 8
: slabdata 5 5 0
RAW 9 10 768 5 1 : tunables 54 27 8
: slabdata 2 2 0
UDP 10 15 768 5 1 : tunables 54 27 8
: slabdata 3 3 0
tw_sock_TCP 20 40 192 20 1 : tunables 120 60 8
: slabdata 1 2 0
request_sock_TCP 0 0 128 30 1 : tunables 120 60 8
: slabdata 0 0 0
TCP 32 35 1600 5 2 : tunables 24 12 8
: slabdata 7 7 0
blkdev_ioc 13 118 64 59 1 : tunables 120 60 8
: slabdata 2 2 0
blkdev_queue 17 20 1576 5 2 : tunables 24 12 8
: slabdata 4 4 0
blkdev_requests 7 14 272 14 1 : tunables 54 27 8
: slabdata 1 1 2
biovec-256 7 7 4096 1 1 : tunables 24 12 8
: slabdata 7 7 0
biovec-128 7 8 2048 2 1 : tunables 24 12 8
: slabdata 4 4 0
biovec-64 7 8 1024 4 1 : tunables 54 27 8
: slabdata 2 2 0
biovec-16 7 30 256 15 1 : tunables 120 60 8
: slabdata 2 2 0
biovec-4 7 118 64 59 1 : tunables 120 60 8
: slabdata 2 2 0
biovec-1 7 404 16 202 1 : tunables 120 60 8
: slabdata 2 2 0
bio 262 300 128 30 1 : tunables 120 60 8
: slabdata 10 10 2
utrace_engine_cache 0 0 64 59 1 : tunables 120 60
8 : slabdata 0 0 0
utrace_cache 0 0 64 59 1 : tunables 120 60 8
: slabdata 0 0 0
sock_inode_cache 90 108 640 6 1 : tunables 54 27 8
: slabdata 18 18 0
skbuff_fclone_cache 14 14 512 7 1 : tunables 54 27
8 : slabdata 2 2 0
skbuff_head_cache 2847 3060 256 15 1 : tunables 120 60 8
: slabdata 204 204 0
file_lock_cache 1 22 176 22 1 : tunables 120 60 8
: slabdata 1 1 0
Acpi-Operand 1848 2360 64 59 1 : tunables 120 60 8
: slabdata 40 40 0
Acpi-ParseExt 0 0 64 59 1 : tunables 120 60 8
: slabdata 0 0 0
Acpi-Parse 0 0 40 92 1 : tunables 120 60 8
: slabdata 0 0 0
Acpi-State 0 0 80 48 1 : tunables 120 60 8
: slabdata 0 0 0
Acpi-Namespace 839 896 32 112 1 : tunables 120 60 8
: slabdata 8 8 0
delayacct_cache 379 531 64 59 1 : tunables 120 60 8
: slabdata 9 9 0
taskstats_cache 19 53 72 53 1 : tunables 120 60 8
: slabdata 1 1 0
proc_inode_cache 146 180 592 6 1 : tunables 54 27 8
: slabdata 30 30 0
sigqueue 53 96 160 24 1 : tunables 120 60 8
: slabdata 4 4 0
radix_tree_node 9320 15316 536 7 1 : tunables 54 27 8
: slabdata 2188 2188 0
bdev_cache 6 12 832 4 1 : tunables 54 27 8
: slabdata 3 3 0
sysfs_dir_cache 5366 5412 88 44 1 : tunables 120 60 8
: slabdata 123 123 0
mnt_cache 42 60 256 15 1 : tunables 120 60 8
: slabdata 4 4 0
inode_cache 1231 1274 560 7 1 : tunables 54 27 8
: slabdata 182 182 0
dentry_cache 3139 4140 216 18 1 : tunables 120 60 8
: slabdata 230 230 0
filp 200 570 256 15 1 : tunables 120 60 8
: slabdata 38 38 0
names_cache 9 9 4096 1 1 : tunables 24 12 8
: slabdata 9 9 0
avc_node 30 106 72 53 1 : tunables 120 60 8
: slabdata 2 2 0
selinux_inode_security 3124 4032 80 48 1 : tunables 120 60
8 : slabdata 84 84 0
key_jar 4 20 192 20 1 : tunables 120 60 8
: slabdata 1 1 0
idr_layer_cache 199 238 528 7 1 : tunables 54 27 8
: slabdata 34 34 0
buffer_head 148 320 96 40 1 : tunables 120 60 8
: slabdata 8 8 0
mm_struct 24 32 896 4 1 : tunables 54 27 8
: slabdata 8 8 0
vm_area_struct 428 1430 176 22 1 : tunables 120 60 8
: slabdata 65 65 1
fs_cache 50 177 64 59 1 : tunables 120 60 8
: slabdata 3 3 0
files_cache 35 60 768 5 1 : tunables 54 27 8
: slabdata 12 12 0
signal_cache 367 378 832 9 2 : tunables 54 27 8
: slabdata 42 42 0
sighand_cache 357 360 2112 3 2 : tunables 24 12 8
: slabdata 120 120 0
task_struct 368 370 1920 2 1 : tunables 24 12 8
: slabdata 185 185 0
anon_vma 294 1008 24 144 1 : tunables 120 60 8
: slabdata 7 7 0
pid 393 531 64 59 1 : tunables 120 60 8
: slabdata 9 9 0
shared_policy_node 0 0 48 77 1 : tunables 120 60 8
: slabdata 0 0 0
numa_policy 72 432 24 144 1 : tunables 120 60 8
: slabdata 3 3 0
size-131072(DMA) 0 0 131072 1 32 : tunables 8 4 0
: slabdata 0 0 0
size-131072 2 2 131072 1 32 : tunables 8 4 0
: slabdata 2 2 0
size-65536(DMA) 0 0 65536 1 16 : tunables 8 4 0
: slabdata 0 0 0
size-65536 6 6 65536 1 16 : tunables 8 4 0
: slabdata 6 6 0
size-32768(DMA) 0 0 32768 1 8 : tunables 8 4 0
: slabdata 0 0 0
size-32768 7 7 32768 1 8 : tunables 8 4 0
: slabdata 7 7 0
size-16384(DMA) 0 0 16384 1 4 : tunables 8 4 0
: slabdata 0 0 0
size-16384 2070 2070 16384 1 4 : tunables 8 4 0
: slabdata 2070 2070 0
size-8192(DMA) 0 0 8192 1 2 : tunables 8 4 0
: slabdata 0 0 0
size-8192 2026 2026 8192 1 2 : tunables 8 4 0
: slabdata 2026 2026 0
size-4096(DMA) 0 0 4096 1 1 : tunables 24 12 8
: slabdata 0 0 0
size-4096 911 911 4096 1 1 : tunables 24 12 8
: slabdata 911 911 0
size-2048(DMA) 0 0 2048 2 1 : tunables 24 12 8
: slabdata 0 0 0
size-2048 1080 1120 2048 2 1 : tunables 24 12 8
: slabdata 560 560 83
size-1024(DMA) 0 0 1024 4 1 : tunables 54 27 8
: slabdata 0 0 0
size-1024 1429 1756 1024 4 1 : tunables 54 27 8
: slabdata 439 439 83
size-512(DMA) 0 0 512 8 1 : tunables 54 27 8
: slabdata 0 0 0
size-512 1607 2024 512 8 1 : tunables 54 27 8
: slabdata 253 253 2
size-256(DMA) 0 0 256 15 1 : tunables 120 60 8
: slabdata 0 0 0
size-256 3144 3495 256 15 1 : tunables 120 60 8
: slabdata 233 233 0
size-128(DMA) 0 0 128 30 1 : tunables 120 60 8
: slabdata 0 0 0
size-64(DMA) 0 0 64 59 1 : tunables 120 60 8
: slabdata 0 0 0
size-64 8683 22243 64 59 1 : tunables 120 60 8
: slabdata 377 377 0
size-32(DMA) 0 0 32 112 1 : tunables 120 60 8
: slabdata 0 0 0
size-128 3423 7410 128 30 1 : tunables 120 60 8
: slabdata 247 247 1
size-32 54883 59024 32 112 1 : tunables 120 60 8
: slabdata 527 527 0
kmem_cache 182 182 2688 1 1 : tunables 24 12 8
: slabdata 182 182 0
cat /proc/meminfo <before job begins>
MemTotal: 16442916 kB
MemFree: 15650204 kB
Buffers: 200 kB
Cached: 303428 kB
SwapCached: 0 kB
Active: 331180 kB
Inactive: 206008 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 16442916 kB
LowFree: 15650204 kB
SwapTotal: 4225084 kB
SwapFree: 4225084 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 235168 kB
Mapped: 8452 kB
Slab: 77148 kB
PageTables: 2740 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 12446540 kB
Committed_AS: 857620 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 80412 kB
VmallocChunk: 34359657895 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
cat /proc/meminfo <during the WRF run>
MemTotal: 16442916 kB
MemFree: 360168 kB
Buffers: 160 kB
Cached: 5678292 kB
SwapCached: 2230028 kB
Active: 8199640 kB
Inactive: 6118380 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 16442916 kB
LowFree: 360168 kB
SwapTotal: 4225084 kB
SwapFree: 1557472 kB
Dirty: 9940 kB
Writeback: 7072 kB
AnonPages: 6545728 kB
Mapped: 9612 kB
Slab: 1349160 kB
PageTables: 19704 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 12446540 kB
Committed_AS: 9478760 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 80412 kB
VmallocChunk: 34359657895 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
cat /proc/meminfo <after the WRF run>
MemTotal: 16442916 kB
MemFree: 14180204 kB
Buffers: 208 kB
Cached: 14928 kB
SwapCached: 1788636 kB
Active: 122928 kB
Inactive: 1682064 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 16442916 kB
LowFree: 14180204 kB
SwapTotal: 4225084 kB
SwapFree: 2213112 kB
Dirty: 68 kB
Writeback: 0 kB
AnonPages: 26628 kB
Mapped: 2668 kB
Slab: 86572 kB
PageTables: 672 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 12446540 kB
Committed_AS: 279580 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 80412 kB
VmallocChunk: 34359657895 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
Yes, we had tested installing 2.1.3 on a couple of our client systems to
see if that would fix the problem, but we were still seeing the issue on
those nodes with the Lustre 2.1.3 client installed. Thanks for clarifying
that; this code does not appear to perform a great deal of readdirs, so it
is probably not the same memory leak.
Correct, dropping cache does not free the 1 GB of memory. Our epilogue
script attempts to drop cache twice, and after the second time it compares
the amount of free memory before determining if it can return the compute
node to service.
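The check itself boils down to something like this (a simplified sketch; the real epilogue uses a site-specific threshold and scheduler command):
#!/bin/bash
# Simplified sketch of our epilogue memory check (details differ in the real script).
THRESHOLD_PCT=85                        # example threshold, not our actual value
for pass in 1 2; do                     # drop caches twice, as described above
    sync
    echo 3 > /proc/sys/vm/drop_caches
    sleep 15
done
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
pct=$(( free_kb * 100 / total_kb ))
if [ "$pct" -lt "$THRESHOLD_PCT" ]; then
    echo "only ${pct}% of memory free after dropping caches; taking node offline"
    # e.g. pbsnodes -o "$(hostname)"    # scheduler-specific offlining command
fi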
We are going to run the WRF job with Lustre at a higher logging level and
using the leak_finder.pl script provided by WhamCloud. We will send
whatever we find along to you.
You indicated that you had installed 2.1.3, which contains the fix for LU-1576; that was our main indication. The LU-1576 fix mostly deals with readdir pages, so unless your workload performs a lot of readdirs, you likely have a different problem.
Are you saying that dropping cache does not free the 1GB of memory?
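One quick way to check how readdir-heavy the workload is would be to clear the client llite stats before a run and look at them afterwards (a sketch; the exact stat names can vary between Lustre versions):
# on the client, before the job
lctl set_param llite.*.stats=clear
# ... run the WRF job ...
# after the job: any significant readdir activity will show up here
lctl get_param llite.*.stats | grep -i readdir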
In regards to the question of waiting for a few minutes, the answer is no.
Even if we wait for hours, the inactive memory is never given back to the
system; we are forced to reboot these nodes to restore their full memory.
However, as you can see from the output in my last message, we start off
with > 6 GB of inactive memory at the beginning of the epilogue and ~1 GB
of inactive memory after the epilogue has waited approximately 30 seconds.
No matter how long we wait, that last 1 GB of memory is never returned to
the system.
We had planned to set up a run of WRF to test the memory usage on our test
cluster, but this has gotten delayed as all of us were busy during the
week. We will have to wait until next week to get you some data on memory
usage.
Having talked with someone much more familiar with WRF and its dependencies
than I am, it sounds like running the WRF software the way it is being run
here may be a fairly big hassle. In other words, getting it running for you
locally may be fairly difficult. We will have to see whether going down that
road is necessary once we have given you some more data.
In the meantime, I'm curious as to how WhamCloud has determined that our
problem does not match up with http://jira.whamcloud.com/browse/LU-1576.
The symptoms are identical, and it was suggested in the HPDD discussion
list that this was an occurrence in Lustre 2.1.2 for some irregular I/O
patterns. What do they see as different between our problem and the one
described by LLNL on the list? For my future reference, I would be
interested to know how they determined that so I could use their methods
for better diagnosing Lustre problems in the future.
I'll have more to share with you next week.
Thanks
Closing this old ticket.
Just because memory is not "Free" doesn't mean that it is "leaked". The kernel will cache pages even if they are unused, until all of the free memory is consumed, and then old data will be freed.
The main concern would be if the node actually runs out of memory and applications start failing (OOM killer, or -ENOMEM=-12 memory allocation errors).
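If there is any doubt on a given node, checking for actual allocation failures is more telling than looking at the "Inactive" number (a sketch):
# look for OOM-killer activity or failed allocations in the kernel log
dmesg | grep -iE 'out of memory|oom-killer|page allocation failure'
# compare committed memory against what the kernel is merely caching
grep -E 'MemFree|Cached|Committed_AS|CommitLimit' /proc/meminfo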