
LU-4841: Performance regression in master with more than 2 threads

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Affects Version/s: Lustre 2.6.0
    • Fix Version/s: Lustre 2.6.0
    • Severity: 3
    • 13338

    Description

      After commit 586e95a5b3f7b9525d78e7efc9f2949387fc9d54 we see a significant performance regression in master. The attached plots show the regression as measured on Xeon Phi:

      Attachments

        1. lu-4841.png (42 kB)
        2. lu-4841-1.png (46 kB)
        3. lu-4841-5.png (24 kB)
        4. perf_2.5.png (15 kB)
        5. perf_master.png (13 kB)

        Activity

          Jinshan Xiong made changes:
            Resolution: set to Duplicate
            Status: Reopened → Closed
          Christopher Morrone made changes:
            Labels: topllnl → llnl
          Jinshan Xiong made changes:
            Link: added "This issue is related to DDN-218"

          Jinshan Xiong added a comment:

          New ticket filed at LU-6842.

          Christopher Morrone added a comment, replying to point 1 of the list below (the cl_page cache shrinker):

          Yes, that is a serious memory management bug. We hit it at LLNL quite recently. Please open a separate ticket on that bug.

          Jinshan Xiong added a comment (edited):

          There are a few things to look at for this topic:

          1. Lustre used to register a cache shrinker for cl_page, but destroying cl_page objects was so slow that an I/O-intensive application could easily run the client out of memory. max_cached_mb on the OSC layer was revived to solve this problem. Now that cl_page destruction is significantly faster, we can probably revisit the cache shrinker option (see the first sketch after this list);

          2. Investigate the efficiency of SOFT_SYNC. The idea behind SOFT_SYNC is good, but, probably due to its policy, it ended up saturating the OSTs. This is why patch 10003 was introduced, letting users disable unstable page tracking at their discretion;

          3. Memory cache and readahead buffer. The readahead code has no way to learn the current memory pressure, so useful pages get evicted by readahead, or readahead pages themselves get evicted by newer readahead pages. We need a feedback mechanism that throttles the readahead window size when memory is under pressure (see the second sketch after this list).
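
          For point 1, a revived shrinker would presumably follow the standard Linux shrinker API (split into count_objects/scan_objects since kernel 3.12). A minimal sketch, assuming a per-client LRU of idle cl_page objects; cl_lru_count() and cl_lru_reclaim() are hypothetical helpers, not actual Lustre symbols:

          #include <linux/mm.h>
          #include <linux/shrinker.h>

          /* Report how many idle cl_pages could be freed right now. */
          static unsigned long cl_page_shrink_count(struct shrinker *shrink,
                                                    struct shrink_control *sc)
          {
                  return cl_lru_count();                 /* hypothetical helper */
          }

          /* Free up to sc->nr_to_scan idle cl_pages; return how many were freed. */
          static unsigned long cl_page_shrink_scan(struct shrinker *shrink,
                                                   struct shrink_control *sc)
          {
                  return cl_lru_reclaim(sc->nr_to_scan); /* hypothetical helper */
          }

          static struct shrinker cl_page_shrinker = {
                  .count_objects = cl_page_shrink_count,
                  .scan_objects  = cl_page_shrink_scan,
                  .seeks         = DEFAULT_SEEKS,
          };

          /* Registered once at module init with register_shrinker(&cl_page_shrinker),
           * removed with unregister_shrinker() at module exit. */

          The count/scan split lets the VM cheaply poll how much is reclaimable before committing to a scan, which matters here precisely because cl_page destruction used to be expensive.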
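
          For point 3, the feedback mechanism could be as simple as clamping the readahead window against free memory before extending it. A hypothetical sketch against a kernel of this era; the 1/32 low-water mark is an arbitrary illustration:

          #include <linux/kernel.h>
          #include <linux/mm.h>
          #include <linux/vmstat.h>

          /* Shrink a requested readahead window when free memory is low. */
          static unsigned long ras_clamp_window(unsigned long ra_pages)
          {
                  unsigned long free = global_page_state(NR_FREE_PAGES);

                  /* Halve the window below the low-water mark, but never
                   * shrink it to zero so sequential reads still progress. */
                  if (free < totalram_pages / 32)
                          return max(ra_pages / 2, 1UL);
                  return ra_pages;
          }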

          Andreas Dilger added a comment:

          Jinshan, we need to revisit the changes in patch http://review.whamcloud.com/10003 and the original patches to see what can be done to avoid excessive memory throttling when the client is not under memory pressure (e.g. interactive nodes), while still throttling I/O on clients that are short of memory.

          Christopher Morrone added a comment:

          This is not a problem that can possibly be fixed on the server side. Memory is fundamentally much faster than disk. We can easily exhaust memory on the client side, and nobody can afford a system so fast that the network and the servers always keep up.

          The 5 second txg timeout in ZFS is the typical txg maximum. Under normal load, ZFS will sync to disk much faster; it only waits the full 5 seconds when load is light. Under typical HPC loads, the speed at which servers can land data is bound by disk speed, not by the txg limit.

          So the only reasonable solution is for the client to be well behaved in the very common real-world situation where data is generated faster than it can be written to disk. Part of dealing with that situation is tracking "unstable" pages. Tracking those pages allows the Linux kernel to pause I/O under low-memory conditions so that memory contention does not degenerate into pathological behavior.
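
          The accounting described above is what NFS already does with the NR_UNSTABLE_NFS zone counter, which the kernel's dirty throttling folds into its limits. A sketch modeled on the NFS usage, not the exact Lustre client code:

          #include <linux/mm.h>
          #include <linux/vmstat.h>

          /* A page's write RPC has been sent, but the server has not yet
           * committed it to stable storage: count it as unstable. */
          static void account_page_unstable(struct page *page)
          {
                  inc_zone_page_state(page, NR_UNSTABLE_NFS);
          }

          /* The server has confirmed the page is safely on disk. */
          static void account_page_stable(struct page *page)
          {
                  dec_zone_page_state(page, NR_UNSTABLE_NFS);
          }

          Because balance_dirty_pages() adds NR_UNSTABLE_NFS to the dirty totals it checks, writers block naturally once too much data sits uncommitted, which is exactly the pause behavior described above.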

          People

            Assignee: Jinshan Xiong
            Reporter: Dmitry Eremin
            Votes: 0
            Watchers: 15
