LU-4841: Performance regression in master for more than 2 threads

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Affects Version: Lustre 2.6.0
    • Fix Version: Lustre 2.6.0
    • Severity: 3
    • 13338

    Description

      After commit 586e95a5b3f7b9525d78e7efc9f2949387fc9d54 we see a significant performance degradation on master. The attached graphs show the regression measured on Xeon Phi.

      Attachments

        1. lu-4841.png (42 kB)
        2. lu-4841-1.png (46 kB)
        3. lu-4841-5.png (24 kB)
        4. perf_2.5.png (15 kB)
        5. perf_master.png (13 kB)

        Activity


          jay Jinshan Xiong (Inactive) added a comment:

          New ticket filed at LU-6842.

          morrone Christopher Morrone (Inactive) added a comment:

          > 1. The client used to register a cache shrinker for cl_page, but destroying cl_page was too slow, so memory could easily run out when the application was I/O intensive. max_cached_mb on the OSC layer was revived to solve this problem. Now that destroying cl_page has become significantly faster, we can probably revisit the cache shrinker option.

          Yes, that is a serious memory management bug. We hit it at LLNL quite recently. Please open a separate ticket on that bug.
          jay Jinshan Xiong (Inactive) added a comment (edited):

          There are a few things to look at for this topic:

          1. The client used to register a cache shrinker for cl_page, but destroying cl_page was too slow, so memory could easily run out when the application was I/O intensive. max_cached_mb on the OSC layer was revived to solve this problem. Now that destroying cl_page has become significantly faster, we can probably revisit the cache shrinker option (see the shrinker sketch after this comment);

          2. Investigate the efficiency of SOFT_SYNC. The idea of SOFT_SYNC is good, but, probably due to its policy, it caused the problem of saturating OSTs. This is why patch 10003 was introduced, to let users disable unstable page tracking at their discretion;

          3. Memory cache and readahead buffer. The readahead code has no way to know the current state of memory pressure, so useful pages get evicted by readahead, or readahead pages themselves get evicted by newer readahead pages. We need a feedback mechanism to throttle the readahead window size when memory is under pressure (see the readahead sketch below this comment).
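
          To make point 1 above concrete, here is a minimal sketch, using the newer (3.12+) Linux shrinker API, of what re-enabling a cl_page cache shrinker could look like. It is an illustration only: cl_page_idle_count and the scan logic are hypothetical placeholders, not existing Lustre code.

          #include <linux/atomic.h>
          #include <linux/shrinker.h>

          /* Hypothetical counter of idle, reclaimable cl_pages (placeholder only). */
          static atomic_long_t cl_page_idle_count = ATOMIC_LONG_INIT(0);

          static unsigned long cl_page_shrink_count(struct shrinker *sk,
                                                    struct shrink_control *sc)
          {
                  /* Tell the VM how many objects we could free right now. */
                  return atomic_long_read(&cl_page_idle_count);
          }

          static unsigned long cl_page_shrink_scan(struct shrinker *sk,
                                                   struct shrink_control *sc)
          {
                  /* A real implementation would walk the per-OSC LRU lists here
                   * and destroy up to sc->nr_to_scan idle cl_pages, returning the
                   * number actually freed.  Returning SHRINK_STOP keeps this
                   * placeholder honest: nothing was freed, stop asking. */
                  return SHRINK_STOP;
          }

          static struct shrinker cl_page_shrinker = {
                  .count_objects  = cl_page_shrink_count,
                  .scan_objects   = cl_page_shrink_scan,
                  .seeks          = DEFAULT_SEEKS,
          };

          /* register_shrinker(&cl_page_shrinker) at client setup,
           * unregister_shrinker(&cl_page_shrinker) at teardown. */

          Whether a shrinker, the revived max_cached_mb limit, or both is the right long-term answer is exactly the open question in point 1.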

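
          For point 3 above, the feedback mechanism could be as simple as consulting the VM's free-page counters before growing the readahead window. The sketch below uses the kernel interfaces of that era and is purely illustrative: ll_ra_throttle(), the 5% watermark, and the halving policy are assumptions, not the actual Lustre readahead algorithm.

          #include <linux/kernel.h>
          #include <linux/mm.h>
          #include <linux/mmzone.h>
          #include <linux/vmstat.h>

          /* Hypothetical helper: clamp a readahead window (in pages) when the
           * node is low on free memory.  Policy and threshold are illustrative. */
          static unsigned long ll_ra_throttle(unsigned long ra_window_pages)
          {
                  unsigned long free_pages = global_page_state(NR_FREE_PAGES);
                  unsigned long watermark  = totalram_pages / 20;  /* ~5% of RAM */

                  if (free_pages < watermark)
                          /* Under pressure: shrink the window instead of growing it. */
                          return max(ra_window_pages / 2, 1UL);

                  return ra_window_pages;  /* no pressure, leave the window alone */
          }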

          adilger Andreas Dilger added a comment:

          Jinshan, we need to revisit the changes in patch http://review.whamcloud.com/10003 and the original patches to see what can be done to avoid excessive memory throttling when the client is not under memory pressure (e.g. on interactive nodes), while still throttling IO on clients that are short of memory.
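
          One way to read this request is that the unstable-page throttle should only kick in when the node is genuinely short of memory. A minimal sketch of such a gate follows; should_throttle_writer(), the unstable_pages counter, and the 10% free-memory threshold are hypothetical illustrations, not what patch 10003 actually does.

          #include <linux/atomic.h>
          #include <linux/mm.h>
          #include <linux/mmzone.h>
          #include <linux/types.h>
          #include <linux/vmstat.h>

          /* Hypothetical count of pages sent to an OST but not yet committed. */
          static atomic_long_t unstable_pages = ATOMIC_LONG_INIT(0);

          static bool should_throttle_writer(long max_unstable_pages)
          {
                  /* Plenty of free memory (e.g. an interactive node): never block. */
                  if (global_page_state(NR_FREE_PAGES) > totalram_pages / 10)
                          return false;

                  /* Short of memory: hold the writer until commits catch up. */
                  return atomic_long_read(&unstable_pages) > max_unstable_pages;
          }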

          morrone Christopher Morrone (Inactive) added a comment:

          This is not a problem that can possibly be fixed on the server side. Memory is fundamentally much faster than disk. We can easily exhaust memory on the client side, and no one can afford a system so fast that both the network and the servers can always keep up.

          The 5 second txg timeout in ZFS is the typical txg maximum. Under normal load, ZFS will sync to disk much faster; it only waits the full 5 seconds when load is light. Under typical HPC loads, the speed at which servers can land data is bound by disk speed, not by the txg limit.

          So the only reasonable solution is for the client to be well behaved in the very normal real-world situation where data is generated faster than it can be written to disk. Part of dealing with that situation is tracking "unstable" pages. Tracking those pages allows the Linux kernel to pause IO in low-memory situations so that memory contention does not go into pathological modes.
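
          For readers unfamiliar with the mechanism Christopher is referring to, the general shape of unstable-page tracking is to account bulk-write pages from the moment the RPC is sent until the server's commit callback runs, so the kernel's writeback throttling can see them. The sketch below is an outline under that assumption; brw_account_unstable() is a made-up name, and whether Lustre uses exactly the NR_UNSTABLE_NFS counter or its own accounting is not settled here.

          #include <linux/mm.h>
          #include <linux/mmzone.h>
          #include <linux/types.h>
          #include <linux/vmstat.h>

          /* Hypothetical helper: count bulk-write pages as "unstable" while the
           * server commit is outstanding, so dirty/writeback throttling can
           * pause writers under memory pressure. */
          static void brw_account_unstable(struct page **pages, int npages, bool add)
          {
                  int i;

                  for (i = 0; i < npages; i++) {
                          if (add)
                                  inc_zone_page_state(pages[i], NR_UNSTABLE_NFS);
                          else
                                  dec_zone_page_state(pages[i], NR_UNSTABLE_NFS);
                  }
          }

          /* Called with add=true when the write RPC is sent, and with add=false
           * once the matching transaction commit callback runs on the client. */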

          jay Jinshan Xiong (Inactive) added a comment:

          > In the real world, people buy client memory sized to fit their application. No one has the budget to buy double or triple the amount of RAM for all their clients just to leave Lustre more buffer space.

          I didn't mean that the memory is for the Lustre client to use as buffer space. For write RPCs, clients have to hold the written pages until the OST commits the corresponding transaction. Therefore, clients need extra memory to pin those pages, and applications cannot use it. For ZFS, the typical txg timeout is 5 seconds, which means clients will pin up to 5 seconds' worth of written data in memory; depending on the client's write throughput, this can be a lot (for example, a client writing at 1 GB/s would pin roughly 5 GB until the commit).

          There is really nothing we can do on the client side. We can probably do some tuning on ZFS: the I/O generated by Lustre is different from a generic workload, so we may look into the txg timeout or restrict the memory used for the write cache.

          morrone Christopher Morrone (Inactive) added a comment (edited):

          > By plenty, I meant there is plenty of available memory.

          Yes, but how much memory is "plenty"? In the real world, memory is a finite resource. We cannot program with the assumption that there is always free memory. Lustre must behave reasonably by default when memory is under contention.

          > Unfortunately, the exact amount of the extra memory highly depends on the performance and configuration of the OST.

          No, it does not. Client memory is far faster than disk on a remote OST over the network, pretty much by definition. Client memory under normal use is also under contention from actual applications, which are not represented by the naive tests used to create the graphs in this ticket.

          In the real world, people buy client memory sized to fit their application. No one has the budget to buy double or triple the amount of RAM for all their clients just to leave Lustre more buffer space.

          Memory contention is normal in the real world, and Lustre's defaults should be selected so that it behaves reasonably under real-world usage.

          gerrit Gerrit Updater added a comment (edited):

          N/A

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: dmiter Dmitry Eremin (Inactive)
              Votes: 0
              Watchers: 15
