[LU-5179] Reading files from Lustre results in stuck anonymous memory when JOBID is enabled

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.3
    • Affects Version/s: Lustre 2.5.0, Lustre 2.4.3
    • Environment:
      Clients: Endeavour: 2.4.3, ldan: 2.4.1, Pleiades compute nodes: 2.1.5 or 2.4.1
      Servers: 2.1.5, 2.4.1, 2.4.3
    • 2
    • 14374

    Description

      Our SLES11SP2 and SLES11SP3 clients accumulate stuck anonymous memory that cannot be freed without a reboot. We have three test cases that reproduce the problem reliably, on different clients and on all of our Lustre file systems. We have not been able to reproduce it with NFS, ext3, CXFS, or tmpfs.

      We have been working with SGI to track down this problem. Unfortunately, they have been unable to reproduce it on their own systems. On our systems, they have reduced the test case to mmapping a file together with an equally sized anonymous region, then reading the contents of the mmapped file into the anonymous region. We can provide this test case so you can try to reproduce the problem.
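
For readers without access to the attachments, the reduced test case described above could look roughly like the following sketch. It is not the attached mmap.c; the function name copy_file_into_anon and its return convention are assumptions made for illustration.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from the ticket): map the file at `path`,
 * map an anonymous region of the same size, and copy the file contents
 * into the anonymous region. Returns bytes copied, or -1 on error. */
long copy_file_into_anon(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) { /* mmap() rejects zero-length mappings */
        close(fd);
        return 0;
    }
    size_t len = (size_t)st.st_size;

    /* File-backed source mapping; it stays valid after close(). */
    void *src = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (src == MAP_FAILED)
        return -1;

    /* Equally sized anonymous destination mapping. */
    void *dst = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (dst == MAP_FAILED) {
        munmap(src, len);
        return -1;
    }

    /* Reading the file pages is what goes through the Lustre client. */
    memcpy(dst, src, len);

    munmap(dst, len);
    munmap(src, len);
    return (long)len;
}
```

Calling copy_file_into_anon() on a large file that lives on a Lustre mount, from a small main(), would approximate the scenario; on an unaffected system the anonymous memory is released as soon as the mappings are torn down.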

      To determine whether the problem is occurring, reboot the system so that memory starts clean. Check /proc/meminfo for the amount of Active(anon) memory in use, then run the test case. While the test runs, the amount of anonymous memory increases. When the test ends, the amount should drop back to its pre-test level.
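
The before/after check can be scripted instead of read by eye. The sketch below parses one field out of a meminfo-style file; the helper name meminfo_kb is hypothetical, and it assumes the usual "Name:   value kB" line layout of /proc/meminfo.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the value in kB of `field` (for example
 * "Active(anon)") from a meminfo-style file, or -1 if the file cannot
 * be read or the field is missing. */
long meminfo_kb(const char *path, const char *field)
{
    assert(path != NULL && field != NULL);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    char line[256];
    size_t n = strlen(field);
    long value = -1;

    while (fgets(line, sizeof line, f)) {
        /* Require the colon right after the name so that "Active"
         * does not match the "Active(anon)" line. */
        if (strncmp(line, field, n) == 0 && line[n] == ':') {
            if (sscanf(line + n + 1, "%ld", &value) != 1)
                value = -1;
            break;
        }
    }
    fclose(f);
    return value;
}
```

Recording meminfo_kb("/proc/meminfo", "Active(anon)") before the test and again after it exits gives the two numbers to compare.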

      To confirm that the anonymous memory is stuck, we use memhog to try to allocate memory. For example, on a node with 32 GB of memory and 2 GB of anonymous memory in use, we try to allocate 31 GB. If memhog completes and only 1 GB of anonymous memory remains in use, you have not reproduced the problem. If memhog is killed, you have.
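
Where memhog is not installed, a stand-in along the following lines should exercise the same pressure. This is only a sketch and not memhog itself; the function name hog_memory is hypothetical. It allocates the requested number of bytes and touches one byte per page so that the kernel has to back the allocation with real memory.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for memhog: allocate `bytes` and touch every
 * page. Returns 0 on success, -1 if the allocation itself fails. With
 * memory overcommit enabled, a shortage usually shows up as the process
 * being killed while touching pages rather than as a failed malloc(). */
int hog_memory(size_t bytes)
{
    assert(bytes > 0);

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        page = 4096; /* conservative fallback */

    /* volatile keeps the compiler from optimizing the writes away. */
    volatile unsigned char *buf = malloc(bytes);
    if (!buf)
        return -1;

    for (size_t off = 0; off < bytes; off += (size_t)page)
        buf[off] = 1;

    free((void *)buf);
    return 0;
}
```

In the example above, the equivalent call would request 31 GB; if the process is killed while touching pages, the stuck anonymous memory could not be reclaimed.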

      SGI would like guidance on how to collect debug information to track down this problem.

      Attachments

        1. mmap.c
          2 kB
        2. mmap4.c
          2 kB
        3. T2.tgz
          0.2 kB
        4. test_ckpt.cpp
          1 kB

        Activity

          spimpale Swapnil Pimpale (Inactive) added a comment - b2_4 backport: http://review.whamcloud.com/10868

          jlevi Jodi Levi (Inactive) added a comment - Patch landed to Master. Please reopen ticket if more work is needed.
          yobbo Scott Emery added a comment -

          Jay built a copy of the client for the ldan test system. Initial testing is positive: this fixes the test cases reported in this LU on the type of system I am using to test.

          jay Jinshan Xiong (Inactive) added a comment - Yes, applying it on the client side only will fix the problem.

          jaylan Jay Lan (Inactive) added a comment - The patch affects both server and client. If I only update the client, would it solve the problem at the client side?
          green Oleg Drokin added a comment -

          Some additional info: for the problem to manifest, you must have jobid functionality enabled in the "read env variable" mode. Once I enabled that, I could reproduce the problem immediately. I then tested the patch, and it fixes the problem for me.

          green Oleg Drokin added a comment -

          I audited all mmput calls in the code; only two functions use it.
          One of them has an mm_struct leak that would produce symptoms very similar to what you see.

          My proposed patch is at http://review.whamcloud.com/10759. Please give it a try.

          People

            green Oleg Drokin
            hyeung Herbert Yeung