[LU-5179] Reading files from lustre results in stuck anonymous memory when JOBID is enabled

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.3
    • Affects Version/s: Lustre 2.5.0, Lustre 2.4.3
    • Environment:
      Clients:
      Endeavour: 2.4.3, ldan: 2.4.1, Pleiades compute nodes: 2.1.5 or 2.4.1
      Servers:
      2.1.5, 2.4.1, 2.4.3
    • Severity: 2
    • Rank: 14374

    Description

      We have been seeing our SLES11SP2 and SLES11SP3 clients accumulate stuck anonymous memory that cannot be freed without a reboot. We have three test cases which can replicate the problem reliably, and we have been able to replicate it on different clients on all of our Lustre file systems. We have not been able to reproduce the problem when using NFS, ext3, CXFS, or tmpfs.

      We have been working with SGI on tracking down this problem. Unfortunately, they have been unable to reproduce it on their systems. On our systems, they have simplified the test case to mmapping a file along with an equally sized anonymous region, and reading the contents of the mmapped file into the anonymous region. This test case can be provided to see if you can reproduce this problem.
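
      For illustration, a minimal sketch of what such a reproducer could look like, inferred from the description above (this is an assumption, not the attached mmap.c; the file name and error handling are placeholders):

      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <unistd.h>

      int main(int argc, char **argv)
      {
              int fd;
              struct stat st;
              char *file, *anon;

              if (argc < 2)
                      return 1;
              fd = open(argv[1], O_RDONLY);   /* e.g. the 1g test file */
              if (fd < 0 || fstat(fd, &st) < 0)
                      return 1;

              /* Map the Lustre file... */
              file = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
              /* ...and an equally sized anonymous region. */
              anon = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              if (file == MAP_FAILED || anon == MAP_FAILED)
                      return 1;

              /* Read the contents of the mmapped file into the anonymous region. */
              memcpy(anon, file, st.st_size);

              munmap(anon, st.st_size);
              munmap(file, st.st_size);
              close(fd);
              return 0;
      }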

      To determine whether the problem is occurring, reboot the system to ensure that memory is clean, and check /proc/meminfo for the amount of Active(anon) memory in use. Then run the test case. While it runs, the amount of anonymous memory will increase; after it finishes, the amount would be expected to drop back to pre-test levels.

      To confirm that the anonymous memory is stuck, we have been using memhog to attempt to allocate memory. If the node has 32 GB of memory, with 2 GB of anonymous memory in use, we attempt to allocate 31 GB. If memhog completes and you are then left with only 1 GB of anonymous memory, you have not reproduced the problem. If memhog is killed, you have.
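
      As a sketch of that check (assuming the memhog utility from the numactl package, the 32 GB example above, and the attached test compiled as ./mmap):

      grep 'Active(anon)' /proc/meminfo   # baseline after a clean reboot
      ./mmap 1g                           # run the test case
      grep 'Active(anon)' /proc/meminfo   # remains elevated on an affected client
      memhog 31g                          # killed by the OOM killer on an affected node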

      SGI would like to get information about how to get debug information to track down this problem.

      Attachments

        1. mmap.c
          2 kB
        2. mmap4.c
          2 kB
        3. T2.tgz
          0.2 kB
        4. test_ckpt.cpp
          1 kB

        Activity


          jay Jinshan Xiong (Inactive) added a comment -

          Yes, applying it on the client side only will fix the problem.

          jaylan Jay Lan (Inactive) added a comment -

          The patch affects both server and client. If I only update the client, will it solve the problem on the client side?
          green Oleg Drokin added a comment -

          Some additional information: for the problem to manifest, you must have the jobid functionality enabled in "read environment variable" mode. Once I enabled that, I could reproduce the problem immediately, so I tested the patch, and it fixes the problem for me.

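          For reference, "read environment variable" mode refers to pointing the jobid_var tunable at the name of an environment variable. On the MGS this is typically set with something like the following, where the file system name "testfs" and the variable PBS_JOBID are examples only:

          lctl conf_param testfs.sys.jobid_var=PBS_JOBID

          A client's current setting can be checked with "lctl get_param jobid_var".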
          green Oleg Drokin added a comment -

          I did an audit of all mmput() calls in the code; there are only two functions that use it.
          One of them has a mm_struct leak that would produce symptoms very similar to what you are seeing.

          My proposed patch is at http://review.whamcloud.com/10759 - please give it a try.

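          To illustrate the class of bug being described (a hypothetical sketch of the pattern, not the actual Lustre code or patch): a kernel path that takes a reference on a task's mm_struct with get_task_mm() must drop it with mmput() on every exit path; an early return that skips mmput() pins the address space, and with it the process's anonymous pages, even after the process exits:

          /* Hypothetical illustration of an mm_struct leak; not the actual code. */
          static int read_task_environ(struct task_struct *task)
          {
                  struct mm_struct *mm = get_task_mm(task);
                  int rc;

                  if (!mm)
                          return -EINVAL;

                  rc = do_something_with(mm);   /* hypothetical helper */
                  if (rc)
                          return rc;            /* BUG: mm reference leaked here */

                  mmput(mm);                    /* only reached on success */
                  return 0;
          }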

          jay Jinshan Xiong (Inactive) added a comment -

          We may have found the problem; Oleg will push a patch soon.
          yobbo Scott Emery added a comment -

          ldan2 and ldan3 below are fairly ordinary self-contained systems running
          the Lustre 2.4.1-6nas_ofed154 client. This problem has been reproduced on several versions of the NAS Lustre client and server software.

          Log into ldan2. (I've mostly used a qsub session, but I have reproduced the problem outside of PBS.)

          Log into ldan3. (I have special permission to log into an ldan I don't have a PBS job running on.)

          On both systems, cd to the test (Lustre) directory.
          In this directory exist the following:
          1. A copy of hedi's test; I've tested with the first one he wrote and a fairly
          late version (mmap4.c). The later version (attached) is more flexible.
          2. A 1 GB file created with:
          dd if=/dev/zero of=1g bs=4096 count=262144

          Then set up your favorite anonymous-memory monitor on ldan2; I've been using nodeinfo in a separate window.

          On ldan3 run:
          dd count=1 bs=1 conv=notrunc of=1g if=/dev/zero

          On ldan2 run:
          ./mmap 1g

          After the mmap program terminates, notice that the anonymous memory
          used by mmap remains in memory. I've never been able to force
          persistent anonymous memory out of the greater Linux virtual memory
          system (memory + swap). Anonymous memory swapped to disk is not
          accounted for as "Anonymous Memory", but it is accounted for as swap.

          The problem does not reproduce if the dd "interference" is not run. It does not reproduce if the dd "interference" is run and then lflush is run, or if the file is read (without mmap) from a third system. It also does not reproduce if the dd "interference" is run on ldan2 itself and the mmap test is then run on ldan2.

          jay Jinshan Xiong (Inactive) added a comment - edited

          Just to narrow down the problem, will you please try it again with huge pages disabled?

          I really meant to say transparent huge pages.

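          For reference, transparent huge pages can typically be disabled at runtime with the following (the exact sysfs path can vary by distribution and kernel version):

          echo never > /sys/kernel/mm/transparent_hugepage/enabled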

          jaylan Jay Lan (Inactive) added a comment -

          Below is an example of a system with memory stuck in this state. The system has 4 TB of memory, with 1.5 TB stuck in Active(anon) that cannot be released. There are 126 nodes in that system, and applications request a number of nodes for their testing. After the memory leak, those nodes no longer have enough memory for other jobs: the application fails, and resubmission of the job then fails to start because the requested nodes do not have enough memory.

          cat /proc/meminfo
            MemTotal: 4036524872 kB
            MemFree: 2399583516 kB
            Buffers: 243504 kB
            Cached: 5204560 kB
            SwapCached: 678520 kB
            Active: 1544908812 kB
            Inactive: 56619188 kB
            Active(anon): 1543105636 kB
            Inactive(anon): 53018772 kB
            Active(file): 1803176 kB
            Inactive(file): 3600416 kB
            Unevictable: 0 kB
            Mlocked: 0 kB
            SwapTotal: 10239996 kB
            SwapFree: 0 kB
            Dirty: 554504 kB
            Writeback: 26128 kB
            AnonPages: 1595359296 kB
            Mapped: 143708 kB
            Shmem: 98772 kB
            Slab: 11485844 kB
            SReclaimable: 161660 kB
            SUnreclaim: 11324184 kB
            KernelStack: 87016 kB
            PageTables: 6747560 kB
            NFS_Unstable: 31856 kB
            Bounce: 0 kB
            WritebackTmp: 0 kB
            CommitLimit: 2027403680 kB
            Committed_AS: 1262704572 kB
            VmallocTotal: 34359738367 kB
            VmallocUsed: 18824052 kB
            VmallocChunk: 25954600480 kB
            HardwareCorrupted: 0 kB
            AnonHugePages: 1264787456 kB
            HugePages_Total: 1073
            HugePages_Free: 468
            HugePages_Rsvd: 468
            HugePages_Surp: 1073
            Hugepagesize: 2048 kB
            DirectMap4k: 335872 kB
            DirectMap2M: 134963200 kB
            DirectMap1G: 3958374400 kB

          jay Jinshan Xiong (Inactive) added a comment -

          Can you show me the output of /proc/meminfo when you see this problem?

          hyeung Herbert Yeung added a comment -

          Source for test_ckpt uploaded. timeout comes from coreutils-8.22.
          green Oleg Drokin added a comment -

          I would just like to note that there is no source for timeout and test_ckpt, so it is hard to see what they are doing.


          People

            Assignee: green Oleg Drokin
            Reporter: hyeung Herbert Yeung
            Votes: 0
            Watchers: 10
