Lustre / LU-5179

Reading files from Lustre results in stuck anonymous memory when JOBID is enabled

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.3
    • Affects Version/s: Lustre 2.5.0, Lustre 2.4.3
    • Environment:
      Clients:
      Endeavour: 2.4.3, ldan: 2.4.1, Pleiades compute nodes: 2.1.5 or 2.4.1
      Servers:
      2.1.5, 2.4.1, 2.4.3
    • Severity: 2
    • 14374

    Description

      We have been seeing stuck anonymous memory on our SLES11SP2 and SLES11SP3 clients that cannot be cleared without a reboot. We have three test cases that reliably replicate the problem. We have been able to replicate the problem on different clients against all of our Lustre file systems. We have not been able to reproduce the problem using NFS, ext3, CXFS, or tmpfs.

      We have been working with SGI on tracking down this problem. Unfortunately, they have been unable to reproduce it on their systems. On our systems, they have simplified the test case to mmapping a file along with an equally sized anonymous region and reading the contents of the mmapped file into the anonymous region. This test case can be provided to see if you can reproduce the problem.
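
      For reference, here is a minimal sketch of that simplified test case as described above: a file on the Lustre mount is mmapped read-only next to an equally sized anonymous mapping, and the file contents are copied into the anonymous region. The path and the use of memcpy() are illustrative assumptions; the attached mmap.c/mmap4.c are the actual reproducers.

          /* Sketch of the simplified reproducer: mmap a Lustre file plus an
           * equally sized anonymous region and read the file contents into
           * the anonymous region.  Path and copy method are assumptions. */
          #define _GNU_SOURCE
          #include <fcntl.h>
          #include <stdio.h>
          #include <string.h>
          #include <sys/mman.h>
          #include <sys/stat.h>
          #include <unistd.h>

          int main(int argc, char **argv)
          {
              const char *path = argc > 1 ? argv[1] : "/mnt/lustre/testfile";
              int fd = open(path, O_RDONLY);
              if (fd < 0) { perror("open"); return 1; }

              struct stat st;
              if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
              size_t len = st.st_size;

              /* File-backed mapping of the Lustre file. */
              void *file_map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
              if (file_map == MAP_FAILED) { perror("mmap file"); return 1; }

              /* Equally sized anonymous mapping. */
              void *anon_map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              if (anon_map == MAP_FAILED) { perror("mmap anon"); return 1; }

              /* Read the mmapped file contents into the anonymous region;
               * Active(anon) reportedly stays elevated after this. */
              memcpy(anon_map, file_map, len);

              munmap(anon_map, len);
              munmap(file_map, len);
              close(fd);
              return 0;
          }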

      To determine whether the problem is occurring, reboot the system to ensure that memory is clean, then check /proc/meminfo for the amount of Active(anon) memory in use. Run the test case. During the test, the amount of anonymous memory will increase; at the end of the test, the amount would be expected to drop back to pre-test levels.
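
      A small sketch of that before/after check, equivalent to pulling the Active(anon) line out of /proc/meminfo (the helper name is ours):

          /* Print the current Active(anon) value from /proc/meminfo so it can
           * be compared before and after running the test case. */
          #include <stdio.h>
          #include <string.h>

          static long active_anon_kb(void)
          {
              FILE *f = fopen("/proc/meminfo", "r");
              char line[256];
              long kb = -1;

              if (!f)
                  return -1;
              while (fgets(line, sizeof(line), f)) {
                  if (strncmp(line, "Active(anon):", 13) == 0) {
                      sscanf(line + 13, "%ld", &kb);
                      break;
                  }
              }
              fclose(f);
              return kb;   /* in kB, or -1 on error */
          }

          int main(void)
          {
              printf("Active(anon): %ld kB\n", active_anon_kb());
              return 0;
          }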

      To confirm that the anonymous memory is stuck, we have been using memhog to attempt to allocate memory. If the node has 32 GB of memory with 2 GB of anonymous memory in use, we attempt to allocate 31 GB. If memhog completes and only 1 GB of anonymous memory then remains in use, you have not reproduced the problem; if memhog is killed, you have.
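
      memhog comes from the numactl package; the idea is simply to allocate a chunk of anonymous memory slightly smaller than what should be reclaimable and touch every page, so the OOM killer fires if the "stuck" memory cannot actually be freed. A rough sketch of that check (the size argument and defaults are illustrative; we use memhog itself in practice):

          /* memhog-style check: allocate N GiB of anonymous memory and touch
           * every page.  If the process is OOM-killed even though N GiB
           * should be reclaimable, the anonymous memory is stuck. */
          #include <stdio.h>
          #include <stdlib.h>
          #include <unistd.h>

          int main(int argc, char **argv)
          {
              size_t gib = argc > 1 ? strtoul(argv[1], NULL, 10) : 31;
              size_t len = gib << 30;
              long page = sysconf(_SC_PAGESIZE);
              size_t off;

              char *buf = malloc(len);
              if (!buf) { perror("malloc"); return 1; }

              /* Touch each page to force real allocation; the OOM killer
               * fires here if the memory cannot actually be provided. */
              for (off = 0; off < len; off += (size_t)page)
                  buf[off] = 1;

              printf("allocated and touched %zu GiB\n", gib);
              free(buf);
              return 0;
          }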

      SGI would like guidance on how to gather debug information to help track down this problem.

      Attachments

        1. mmap.c
          2 kB
        2. mmap4.c
          2 kB
        3. T2.tgz
          0.2 kB
        4. test_ckpt.cpp
          1 kB

        Activity


          hyeung Herbert Yeung added a comment -

          Source for test_ckpt uploaded. The timeout binary comes from coreutils-8.22.
          green Oleg Drokin added a comment -

          I would just like to note that there's no source for timeout and test_ckpt, so it's kind of hard to see what they are doing.


          hyeung Herbert Yeung added a comment -

          Normally, we run the test script through PBS. To do so, you can use My_run.ivy. If you want to run interactively or without PBS, use My_run_I_ivy. Sample output from PBS is in kdgordon.o328785, though some of the data has been sanitized.

          The script prints out a variety of information, including the amount of anonymous memory in use on the system. The runit script is then called, which produces the stuck anonymous memory. After that, the amount of anonymous memory in use is checked again. memhog is called to try to free the memory, and the amount of anonymous memory is displayed one more time.

          You will need to tweak the amount of memory that memhog attempts to allocate based on how much memory your test system has. Generally, about 800 MB of memory gets stuck after running the job. runit can be called several times to increase the amount of stuck memory.

          After running the test, the file checkpoint.mtcp is created. On some of our systems, this file may need to be deleted before the problem can be reproduced again.

          After running memhog, the memory can instead be pushed out to swap and remain in use there. This still indicates that the memory is not being freed.


          hyeung Herbert Yeung added a comment -

          Second test case that reproduces the problem.
          green Oleg Drokin added a comment -

          From the call: this was also reproduced on a RHEL 6.4 kernel with the same test case.
          Unmounting the file system either fails (fs busy) or, when it succeeds, does not free the memory either.

          The reproducer only works on NASA systems and is not 100% reliable, but it still triggers at a high frequency.

          This was originally investigated because of strange OOM issues.


          hyeung Herbert Yeung added a comment -

          Yes, we have been able to reproduce the problem on all of the clients. Though I am not positive that we have tested every permutation, I believe the problem has been reproduced with all of the Lustre versions on all of the clients.
          green Oleg Drokin added a comment -

          Please upload your testcase.

          You list multiple client node versions; is this observed on all of them?


          People

            Assignee:
            green Oleg Drokin
            Reporter:
            hyeung Herbert Yeung
            Votes:
            0
            Watchers:
            10
