Details
- Type: Bug
- Resolution: Not a Bug
- Priority: Minor
Description
At our center, we are running a Lustre 2.1.2 file system with Lustre 2.1.2 clients on all of the compute nodes of our Penguin cluster. Recently, a user has been performing WRF runs in which he uses a special WRF feature to offload all of the I/O onto a single node. This improves his I/O performance dramatically, but it leaves the node with ~1 GB of memory stuck in "Inactive" after each run. Our epilogue includes a script that checks that available free memory is above a specified percentage, and every job this user runs ends with the node being set offline because of this 1 GB of Inactive memory.
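For context, here is a minimal sketch of the kind of check the epilogue performs; the 10% threshold and the use of bare MemFree are illustrative assumptions, not our exact script or policy:

```python
#!/usr/bin/env python
"""Illustrative sketch of an epilogue-style free-memory check.
The threshold and the definition of "free" are assumptions here."""

THRESHOLD_PCT = 10.0  # illustrative value only


def read_meminfo():
    """Parse /proc/meminfo into a dict of integer values in kB."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.strip().split()[0])
    return info


if __name__ == "__main__":
    mem = read_meminfo()
    pct_free = 100.0 * mem["MemFree"] / mem["MemTotal"]
    if pct_free < THRESHOLD_PCT:
        # In the real epilogue this is where the node would be set offline.
        print("free memory %.1f%% is below threshold; node would be offlined"
              % pct_free)
    else:
        print("free memory %.1f%% -- OK" % pct_free)
```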
Here is an example of the memory state (from /proc/meminfo, converted to GB) on one of these nodes, before and after the epilogue runs drop_caches (a sketch of how these numbers are collected follows the listing):
- Before:
- MemTotal: 15.681 GB
- MemFree: 6.495 GB
- Cached: 6.206 GB
- Active: 1.395 GB
- Inactive: 6.247 GB
- Dirty: 0.000 GB
- Mapped: 0.003 GB
- Slab: 1.391 GB
- After:
- MemTotal: 15.681 GB
- MemFree: 14.003 GB
- Cached: 0.007 GB
- Active: 0.134 GB
- Inactive: 1.309 GB
- Dirty: 0.000 GB
- Mapped: 0.003 GB
- Slab: 0.082 GB
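For reference, the before/after numbers above can be gathered along these lines; this is a sketch (writing 3 to /proc/sys/vm/drop_caches requires root and frees the page cache, dentries, and inodes), not our exact epilogue:

```python
#!/usr/bin/env python
"""Illustrative: collect /proc/meminfo before and after dropping caches."""
import os

FIELDS = ["MemTotal", "MemFree", "Cached", "Active",
          "Inactive", "Dirty", "Mapped", "Slab"]


def snapshot():
    """Return selected /proc/meminfo fields, converted from kB to GB."""
    gb = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            if key in FIELDS:
                gb[key] = int(rest.strip().split()[0]) / (1024.0 * 1024.0)
    return gb


def drop_caches():
    """Equivalent to: sync; echo 3 > /proc/sys/vm/drop_caches"""
    os.system("sync")
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")


if __name__ == "__main__":
    before = snapshot()
    drop_caches()
    after = snapshot()
    for key in FIELDS:
        print("%-10s before: %7.3f GB   after: %7.3f GB"
              % (key + ":", before[key], after[key]))
```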
While looking for possible solutions, I stumbled upon a recent HPDD-Discuss thread entitled "Possible file page leak in Lustre 2.1.2" that described a very similar problem. It was suggested there that the issue had already been identified and resolved in http://jira.whamcloud.com/browse/LU-1576
That ticket indicates the fix was included in Lustre 2.1.3, so we tested this by installing the Lustre 2.1.3 client packages on some of our compute nodes and letting the WRF job run on those nodes. However, even with the 2.1.3 clients, we still see the inactive memory at the end of the job. Do we need to upgrade the Lustre installation on our OSSes and MDS to 2.1.3 as well to fix this problem, or do you have any other suggestions?
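In case it is useful, this is roughly how we confirm which Lustre client version is actually loaded on a node, by reading /proc/fs/lustre/version; that path is what we see on our 2.1.x clients and should be treated as an assumption elsewhere:

```python
#!/usr/bin/env python
"""Illustrative: report the Lustre client version a node is running."""


def lustre_client_version(path="/proc/fs/lustre/version"):
    """Return the version string from /proc/fs/lustre/version, or None."""
    try:
        with open(path) as f:
            for line in f:
                # On our clients the first line looks like "lustre: 2.1.3"
                if line.startswith("lustre:"):
                    return line.split(":", 1)[1].strip()
    except IOError:
        return None
    return None


if __name__ == "__main__":
    ver = lustre_client_version()
    print("Lustre client version: %s" % (ver or "not found"))
```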
Any help that you could provide us with would be appreciated!