Lustre / LU-1695

Demonstrate MDS performance with increasing client load for SMP Affinity

Details

    • Bug
    • Resolution: Incomplete
    • Critical
    • None
    • None
    • None
    • Cray XE6 with Lustre 2.1.1 MDS/OSS
    • 3
    • 2186

    Description

      Cray has found that plotting MDS performance against the number of clients (or ranks), using either mdtest or metabench, shows create/stat/unlink rates rising from 1 to 64 clients, where they peak, and then declining as additional clients are added to the test. Historically, more than 64 clients were not needed to show MDS performance saturation. The problem is that using more than 64 clients leads to a decline in performance rather than the plateau that would be expected given the limitation of a single MDS.

      The following data were gathered using metabench to measure create/stat/unlink rates for a fixed number of files spread over a growing number of clients. The Lustre servers ran Lustre 2.1.1 plus patches, and the clients ran Lustre 1.8.6 on Cray XE6. The data are for 1M files, but the degradation of create and unlink rates as the number of clients increases is consistent across a broad range of file counts. Furthermore, the degradation is greater when all files are in a single directory (as expected).

      Individual directories:
      1M files
      Ranks  Nodes  Creates  Stats  Unlinks
        512     32    18868  52702    13730
       1024     64    20615  55660    15427
       2048    128    19583  54987    11249
       4096    256    16587  54386     9586
       8192    512    13807  52892     7910

      Shared directory:
      1M files
      Ranks  Nodes  Creates  Stats  Unlinks
        512     32    19636  56030     9905
       1024     64    20149  56807    10190
       1024     64    16610  58880     9339
       2048    128    19890  57257     9343
       4096    256     6906  55991     4338
       8192    512     6348  59329     2761
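
To make the size of the decline explicit, the rates in the two tables can be compared against their own peak. The following is a minimal sketch, not part of the original test tooling: the function name `decline_from_peak` and the tuple layout are assumptions introduced here for illustration, and the numbers are copied from the tables above.

```python
# Rates copied from the two tables above: (ranks, nodes, creates, stats, unlinks).
INDIVIDUAL = [
    (512, 32, 18868, 52702, 13730),
    (1024, 64, 20615, 55660, 15427),
    (2048, 128, 19583, 54987, 11249),
    (4096, 256, 16587, 54386, 9586),
    (8192, 512, 13807, 52892, 7910),
]
SHARED = [
    (512, 32, 19636, 56030, 9905),
    (1024, 64, 20149, 56807, 10190),
    (1024, 64, 16610, 58880, 9339),  # second 64-node run, kept as reported
    (2048, 128, 19890, 57257, 9343),
    (4096, 256, 6906, 55991, 4338),
    (8192, 512, 6348, 59329, 2761),
]

# Column index of each operation within a row (hypothetical layout for this sketch).
OPS = {"creates": 2, "stats": 3, "unlinks": 4}


def decline_from_peak(rows, op):
    """Return (nodes at peak, peak rate, percent drop from peak at the largest node count)."""
    col = OPS[op]
    peak = max(rows, key=lambda r: r[col])   # row with the highest rate
    last = max(rows, key=lambda r: r[1])     # row with the most nodes
    drop = 100.0 * (peak[col] - last[col]) / peak[col]
    return peak[1], peak[col], drop


if __name__ == "__main__":
    for name, rows in (("individual", INDIVIDUAL), ("shared", SHARED)):
        for op in OPS:
            nodes, rate, drop = decline_from_peak(rows, op)
            print(f"{name:10s} {op:8s} peak {rate} at {nodes} nodes, "
                  f"{drop:.1f}% lower at 512 nodes")
```

On these figures, create rates at 512 nodes are roughly 33% below the 64-node peak with individual directories and roughly 68% below it in the shared directory, while stat rates stay within a few percent of their peak.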

      The DoD's HPCMOD office first reported this "behavior" to Sun and Cray several years ago following a test they funded to compare Lustre and GPFS metadata performance. For a small range of clients, Lustre outperformed GPFS, but then, instead of hitting a plateau with increasing client load, the Lustre MDS performance declined significantly (beyond 64 or 128 nodes, depending on the test run). At the time, Sun told Cray and its customer that making the MDS SMP-aware would resolve the problem.

      As a result, we need to add a test of create/stat/unlink rates across a wide range of client counts to the qualification of the SMP affinity feature. We need to show results before and after the SMP patches. If there is no effect, then these results will provide a baseline for comparison with future investigations.
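
One way such a sweep could be planned is sketched below. It keeps the total file count fixed and divides it across a growing number of ranks, as in the tables above, and builds one mdtest command line per node count. Several things here are assumptions rather than details from this ticket: the helper name `plan_sweep`, the working directory `/mnt/lustre/mdtest`, a total of 2^20 files standing in for "1M", and 16 ranks per node (inferred from 512 ranks on 32 nodes). The mdtest options used are `-F` (files only), `-n` (items per rank), `-d` (test directory) and `-u` (unique directory per task). The site's MPI launcher still has to be prepended to each command.

```python
def plan_sweep(total_files, ranks_per_node, node_counts, shared_dir,
               workdir="/mnt/lustre/mdtest"):  # hypothetical mount point
    """Build one mdtest invocation per node count for a fixed total file count."""
    plans = []
    for nodes in node_counts:
        ranks = nodes * ranks_per_node
        # Fixed total spread over more ranks means fewer files per rank.
        files_per_rank = total_files // ranks
        cmd = ["mdtest", "-F", "-n", str(files_per_rank), "-d", workdir]
        if not shared_dir:
            cmd.append("-u")  # give each rank its own directory
        plans.append({
            "nodes": nodes,
            "ranks": ranks,
            "files_per_rank": files_per_rank,
            "cmd": cmd,
        })
    return plans


if __name__ == "__main__":
    node_counts = [32, 64, 128, 256, 512]
    for shared in (False, True):
        label = "shared directory" if shared else "individual directories"
        print(label)
        for plan in plan_sweep(1 << 20, 16, node_counts, shared):
            print(f"  {plan['nodes']:4d} nodes, {plan['ranks']:5d} ranks: "
                  + " ".join(plan["cmd"]))
```

Running the same plan once on servers without the SMP patches and once with them would give the before/after comparison described above, with the unpatched run serving as the baseline.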

            People

              liang Liang Zhen (Inactive)
              jcarrier John Carrier (Inactive)