
WRF runs causing Lustre clients to lose memory

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Minor

    Description

      At our center, we are running a Lustre 2.1.2 file system with Lustre 2.1.2 clients on all of the compute nodes of our Penguin cluster. Recently, a user has been performing WRF runs in which he uses a special feature of WRF to offload all of the I/O onto a single node. This improves his I/O performance dramatically, but it results in that node losing ~1 GB of memory to "Inactive" after each run. Our epilogue runs a script that checks whether available free memory is above a specified percentage, and every job this user runs ends with the node being set offline because of this 1 GB of Inactive memory.

      Here is example output showing the memory state on one of these nodes before and after the epilogue drops caches:

      Before:
        MemTotal:  15.681 GB
        MemFree:    6.495 GB
        Cached:     6.206 GB
        Active:     1.395 GB
        Inactive:   6.247 GB
        Dirty:      0.000 GB
        Mapped:     0.003 GB
        Slab:       1.391 GB

      After:
        MemTotal:  15.681 GB
        MemFree:   14.003 GB
        Cached:     0.007 GB
        Active:     0.134 GB
        Inactive:   1.309 GB
        Dirty:      0.000 GB
        Mapped:     0.003 GB
        Slab:       0.082 GB
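      For context, here is a minimal sketch of the kind of check the epilogue performs: it drops caches, re-reads /proc/meminfo, and compares free memory against a percentage of MemTotal before deciding whether to return the node to service. The threshold and retry count below are illustrative placeholders, not the values from our actual epilogue script.

      #!/usr/bin/env python
      # Illustrative sketch only: the threshold and retry count are made up,
      # not our production values. Writing drop_caches requires root.

      def meminfo_kb():
          """Parse /proc/meminfo into a dict of {field: size in kB}."""
          info = {}
          with open("/proc/meminfo") as f:
              for line in f:
                  key, rest = line.split(":", 1)
                  info[key] = int(rest.split()[0])  # second token is the kB value
          return info

      def drop_caches():
          """Ask the kernel to drop page cache, dentries and inodes."""
          with open("/proc/sys/vm/drop_caches", "w") as f:
              f.write("3\n")

      def node_ok(min_free_pct=90.0, attempts=2):
          """Drop caches up to `attempts` times, then check free memory."""
          for _ in range(attempts):
              drop_caches()
          info = meminfo_kb()
          free_pct = 100.0 * info["MemFree"] / info["MemTotal"]
          print("MemFree %.3f GB (%.1f%%), Inactive %.3f GB" % (
              info["MemFree"] / 1048576.0, free_pct, info["Inactive"] / 1048576.0))
          return free_pct >= min_free_pct

      if __name__ == "__main__":
          import sys
          sys.exit(0 if node_ok() else 1)  # non-zero exit would offline the node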

      While looking for possible solutions to this problem, I stumbled upon a recent HPDD-Discuss thread entitled "Possible file page leak in Lustre 2.1.2" that describes a problem very similar to ours. It was suggested there that the issue had already been discovered and resolved in http://jira.whamcloud.com/browse/LU-1576.

      This ticket suggests that the resolution was included as part of Lustre 2.1.3, so we tested this by installing the Lustre 2.1.3 client packages on some of our compute nodes and allowing the WRF job to run on these nodes. However, even after the upgrade to Lustre 2.1.3, we still saw the inactive memory at the end of the job. Do we need to upgrade our Lustre installation on the OSSes and MDS to Lustre 2.1.3 to fix this problem, or do you have any other suggestions?

      Any help that you could provide us with would be appreciated!

        Activity


          cliffw Cliff White (Inactive) added a comment -

          Thanks, let us know how it goes.
          adizon Archie Dizon added a comment -

          Yes, we had tested installing 2.1.3 on a couple of our client systems to
          see if that would fix the problem, but we were still seeing the issue on
          those nodes with the Lustre 2.1.3 client installed. Thanks for clarifying
          that; this code does not appear to perform a great deal of readdirs, so it
          is probably not the same memory leak.

          Correct, dropping cache does not free the 1 GB of memory. Our epilogue
          script attempts to drop cache twice, and after the second time it compares
          the amount of free memory before determining if it can return the compute
          node to service.

          We are going to run the WRF job with Lustre at a higher logging level and
          using the leak_finder.pl script provided by WhamCloud. We will send
          whatever we find along to you.

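          The procedure we plan to follow is roughly the usual leak-check recipe:
          enable allocation tracing, run the job, dump the kernel debug log, and run
          leak_finder.pl over it. A rough sketch of those steps, wrapped in Python
          (the WRF job command and the log path below are placeholders):

          #!/usr/bin/env python
          # Rough sketch of the planned leak-check steps; the job command and the
          # log path are placeholders, and the lctl invocations assume the usual
          # debug-log workflow (enable +malloc tracing, dump, analyze).
          import subprocess

          DEBUG_LOG = "/tmp/lustre-debug.log"  # placeholder path for the ASCII debug dump

          subprocess.check_call(["lctl", "set_param", "debug=+malloc"])  # trace kernel allocations
          subprocess.check_call(["bash", "run_wrf_job.sh"])              # placeholder for the WRF run
          subprocess.check_call(["lctl", "dk", DEBUG_LOG])               # dump + convert the debug log
          subprocess.check_call(["perl", "leak_finder.pl", DEBUG_LOG])   # report unmatched allocations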

          cliffw Cliff White (Inactive) added a comment -

          You indicated that you had installed 2.1.3, which contains the fix for
          LU-1576; that was our main indication. The LU-1576 fix mostly deals with
          readdir pages, so unless your workload includes a lot of readdirs, you
          likely have a different problem.

          Are you saying that dropping cache does not free the 1 GB of memory?
          adizon Archie Dizon added a comment -

          Regarding the question of waiting for a few minutes, the answer is no.
          Even if we wait for hours, the inactive memory is never given back to the
          system; we are forced to reboot these nodes to restore their full memory.
          As you can see from the output in my last message, we start off with
          > 6 GB of inactive memory at the beginning of the epilogue and are down to
          ~1 GB of inactive memory after the epilogue has waited approximately
          30 seconds. However, no matter how long we wait, that last 1 GB of memory
          is never returned to the system.

          We had planned to set up a run of WRF to test the memory usage on our test
          cluster, but this has gotten delayed as all of us were busy during the
          week. We will have to wait until next week to get you some data on memory
          usage.

          Having talked with someone much more familiar with WRF and its
          dependencies than I am, it sounds like running the WRF software the way it
          is being run here would be a fairly big hassle. In other words, getting it
          running for you locally may be fairly difficult. We will have to see
          whether going down that road is necessary once we give you more data.

          In the meantime, I'm curious as to how WhamCloud has determined that our
          problem does not match up with http://jira.whamcloud.com/browse/LU-1576.
          The symptoms are identical, and it was suggested on the HPDD discussion
          list that this occurred in Lustre 2.1.2 with certain irregular I/O
          patterns. What do they see as different between our problem and the one
          described by LLNL on the list? For my own reference, I would be interested
          to know how they determined that, so I can use their methods to better
          diagnose Lustre problems in the future.

          I'll have more to share with you next week.

          Thanks


          cliffw Cliff White (Inactive) added a comment -

          Can you update us on your status?

          cliffw Cliff White (Inactive) added a comment -

          It is possible you are seeing uncommitted writes to the OSTs - if you wait
          5-10 minutes, can the cache be cleaned? It is possible you are seeing a
          leak; however, this does not appear to match LU-1576. The 'slabtop' tool
          may provide some additional data on memory consumption.
          If this is a leak, is there any way you can run this on a single node,
          and/or provide us with the workload? We may need to reproduce this one
          locally to fix the issue.
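          Regarding slabtop: for a quick, scriptable view of slab usage, a sketch
          like the following can be run on the node. It reads /proc/slabinfo (which
          may require root on some kernels) and simply ranks caches by approximate
          size, similar in spirit to slabtop's default view.

          #!/usr/bin/env python
          # Minimal sketch: rank slab caches by approximate memory use, roughly
          # what slabtop shows by default. Reads /proc/slabinfo.

          def top_slabs(n=15):
              rows = []
              with open("/proc/slabinfo") as f:
                  for line in f:
                      if line.startswith(("slabinfo", "#")):
                          continue  # skip the version line and the column header
                      fields = line.split()
                      name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
                      rows.append((num_objs * objsize, name, num_objs))
              rows.sort(reverse=True)
              for size, name, num_objs in rows[:n]:
                  print("%-28s %10d objs %10.1f MB" % (name, num_objs, size / 1048576.0))

          if __name__ == "__main__":
              top_slabs()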

          People

            Assignee: green Oleg Drokin
            Reporter: adizon Archie Dizon
            Votes: 0
            Watchers: 9
