LU-2795: WRF runs causing Lustre clients to lose memory


Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Minor

    Description

      At our center, we are running a Lustre 2.1.2 file system with Lustre 2.1.2 clients on all of the compute nodes of our Penguin cluster. Recently, a user has been performing WRF runs in which he uses a special WRF feature to offload all of the I/O onto a single node. This improves his I/O performance dramatically, but it leaves that node with ~1 GB of memory stuck in "Inactive" after each run. Our epilogue includes a script that checks whether available free memory is above a specified percentage, so every job this user runs ends with the node being marked offline because of this 1 GB of Inactive memory.
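
      Roughly, the epilogue check amounts to the following sketch (the threshold value and the offlining step shown here are placeholders, not our actual script):

      #!/usr/bin/env python
      # Minimal sketch of the free-memory check in the epilogue.  The real
      # script, the threshold (MIN_FREE_PCT is made up), and the action taken
      # when a node fails the check are all site-specific.
      MIN_FREE_PCT = 80.0  # hypothetical threshold

      def read_meminfo():
          """Return /proc/meminfo as a dict mapping field name to value in kB."""
          info = {}
          with open("/proc/meminfo") as f:
              for line in f:
                  name, value = line.split(":", 1)
                  info[name] = int(value.split()[0])  # values are reported in kB
          return info

      def enough_free_memory():
          mem = read_meminfo()
          free_pct = 100.0 * mem["MemFree"] / mem["MemTotal"]
          return free_pct >= MIN_FREE_PCT

      if __name__ == "__main__":
          if not enough_free_memory():
              # At this point the epilogue would mark the node offline.
              print("free memory below threshold; node would be marked offline")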

      Here is an example of the memory usage on one of these nodes before and after the caches are dropped in the epilogue (a sketch of how these snapshots can be collected follows the output):

      Before:
        MemTotal:  15.681 GB
        MemFree:    6.495 GB
        Cached:     6.206 GB
        Active:     1.395 GB
        Inactive:   6.247 GB
        Dirty:      0.000 GB
        Mapped:     0.003 GB
        Slab:       1.391 GB

      After:
        MemTotal:  15.681 GB
        MemFree:   14.003 GB
        Cached:     0.007 GB
        Active:     0.134 GB
        Inactive:   1.309 GB
        Dirty:      0.000 GB
        Mapped:     0.003 GB
        Slab:       0.082 GB
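
      For reference, the numbers above can be collected with something along these lines (a sketch only; writing "3" to /proc/sys/vm/drop_caches requires root and frees clean pagecache plus reclaimable dentries and inodes):

      #!/usr/bin/env python
      # Snapshot the /proc/meminfo fields shown above, drop the caches, then
      # snapshot them again.  The field list and the kB-to-GB conversion simply
      # mirror the output format used in this ticket.
      FIELDS = ["MemTotal", "MemFree", "Cached", "Active",
                "Inactive", "Dirty", "Mapped", "Slab"]

      def meminfo_gb():
          """Return the selected /proc/meminfo fields converted from kB to GB."""
          values = {}
          with open("/proc/meminfo") as f:
              for line in f:
                  name, rest = line.split(":", 1)
                  if name in FIELDS:
                      values[name] = int(rest.split()[0]) / (1024.0 * 1024.0)
          return values

      def show(label, values):
          print(label)
          for name in FIELDS:
              print("  %-10s %.3f GB" % (name + ":", values[name]))

      if __name__ == "__main__":
          show("Before:", meminfo_gb())
          with open("/proc/sys/vm/drop_caches", "w") as f:
              f.write("3\n")
          show("After:", meminfo_gb())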

      While looking for possible solutions to this problem, I came across a recent HPDD-Discuss thread entitled "Possible file page leak in Lustre 2.1.2" that described a problem very similar to ours. It was suggested there that the issue had already been discovered and resolved in LU-1576 (http://jira.whamcloud.com/browse/LU-1576).

      This ticket suggests that the resolution was included as part of Lustre 2.1.3, so we tested this by installing the Lustre 2.1.3 client packages on some of our compute nodes and allowing the WRF job to run on these nodes. However, even after the upgrade to Lustre 2.1.3, we still saw the inactive memory at the end of the job. Do we need to upgrade our Lustre installation on the OSSes and MDS to Lustre 2.1.3 to fix this problem, or do you have any other suggestions?
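
      For completeness, here is a small check that can confirm a node is really running the upgraded client modules (the /proc path below is what the 2.1.x clients expose; "lctl get_param version" should report the same information):

      #!/usr/bin/env python
      # Report the Lustre client version the node is currently running.  The
      # path below is an assumption based on the 2.1.x clients; other releases
      # may expose the version elsewhere.
      def lustre_client_version(path="/proc/fs/lustre/version"):
          with open(path) as f:
              for line in f:
                  if line.startswith("lustre:"):
                      return line.split(":", 1)[1].strip()
          return None

      if __name__ == "__main__":
          print("running Lustre client version: %s" % lustre_client_version())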

      Any help that you could provide us with would be appreciated!


          People

            Assignee: Oleg Drokin (green)
            Reporter: Archie Dizon (adizon)
            Votes: 0
            Watchers: 9
