
[LU-1514] Lower than expected Metadata performance

Details

    • Type: Epic
    • Resolution: Unresolved
    • Priority: Major
    • Labels: None
    • Affects Version/s: Lustre 2.2.0
    • Fix Version/s: None
    • Environment: See below.

    Description

      Over the past 12 months we at CSCS have been carefully benchmarking Lustre metadata performance from the compute nodes of our Cray XE6 system. In this timeframe we have had four distinct configurations of the scratch filesystem, namely:
      a) Cray XE6 (Gemini interconnect) connected via 12 QDR IB connections from 12 Service Nodes acting as routers to the Compute Nodes, and a QDR switch to an external Lustre v2.2 filesystem hosted by 12 Intel SandyBridge OSSes (Intel Xeon E5-2670, 2.60 GHz, 64 GBytes RAM, dual socket, 8 cores/socket, i.e. 16 threads/socket with hyperthreading enabled) and 2 AMD Opteron MDSes (AMD Opteron 6128, 2.0 GHz, 64 GBytes RAM, dual socket, 8 cores/socket), with 6 LSI7900 controller couplets and 768 10K rpm 2 TB SATA drives in 48 enclosures, formatted as 72 8+2PQ RAID6 LUNs and connected to the servers via 12 8 Gbit/sec FC connections;

      b) Cray XE6 (Gemini interconnect) connected via 4 QDR IB connections from 4 Service Nodes acting as routers to the Compute Nodes, and a QDR switch to an external Lustre v1.8.4 filesystem hosted by 4 AMD Opteron OSSes (AMD Opteron 6134, 2.3 GHz, 32 GBytes RAM, dual socket, 8 cores/socket) and 2 AMD Opteron MDSes (AMD Opteron 2218, dual socket, 2 cores/socket), with one DDN SFA10K controller couplet and 290 7200 rpm 1863 GByte SATA drives in 5 SA4601 enclosures;

      c) Cray XE6 (Gemini interconnect) hosting an internal Lustre v1.8.4 filesystem via 12 Service Nodes acting as OSSes (AMD Opteron, 2.2 GHz, single socket, 6 cores/socket) and 1 Service Node acting as MDS (same spec as the OSSes), with the same back-end storage hardware as in a) above, direct-attached via 8 Gbit/sec FC;

      d) Cray XT5 (SeaStar interconnect) hosting an internal Lustre v1.6.3 filesystem via 20 Service Nodes acting as OSSes (AMD Opteron, 2.6 GHz, single socket, dual core, 8 GBytes RAM) and 1 Service Node acting as MDS (same spec as the OSSes), with 5 LSI7900 controller couplets and 800 512 GB 7200 rpm SATA drives in 50 enclosures, formatted as 80 8+2PQ RAID6 LUNs and connected to the OSSes via 4 Gbit/sec FC.

      We used the mdtest application to gather the following results (in operations/second) for each of the filesystems described above. There were 1,400 clients spread over 175, 88, and 44 nodes. The tests were repeated multiple times in each case and the best numbers are presented (a graph of the results is also attached).
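
      For reference, invocations of the following shape produce these four metrics; the MPI launcher, per-task item count, iteration count, and path shown here are illustrative assumptions, not the exact commands used:

        # directory operations only (directory create/remove phases)
        mpirun -np 1400 mdtest -D -n 100 -i 3 -d /scratch/mdtest

        # file operations only (file create/remove phases)
        mpirun -np 1400 mdtest -F -n 100 -i 3 -d /scratch/mdtest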

      Operation            (a)         (b)         (c)         (d)
      Directory create   12457.32     9797.77    10610.32     5113.98
      Directory remove    7342.54    11586.95    10906.81     3504.84
      File create         7842.85    11685.58    18069.77     5255.05
      File remove          932.045   11836.41    13043.88     4644.71

      As can be seen from the results, the latest filesystem, based on Lustre v2.2 with Intel SandyBridge OSSes and AMD Opteron MDSes, has almost the worst performance, except in the case of directory create, where it is marginally better than the internal Lustre v1.8.4 filesystem. It was our expectation that the new v2.2 Lustre filesystem would perform, in all cases, at least as well as the old internal v1.8.4 filesystem (config (c) above). Can Whamcloud please offer advice and tuning options to improve the performance of the Lustre v2.2 filesystem?

Attachments

Activity

colinmcmurtrie Colin McMurtrie added a comment -

Thanks for the feedback. I will confirm the stripe-count details for you shortly (the person who ran the tests is currently away). Likewise for the question regarding the use of "uniq" versus "shared directory" mode.

WRT the SMP scalability issues (due to lock contention and thread context-switching overheads), we wondered whether either of the following would help (see the sketch after this list):
1) Confining the Lustre processes to a small number of cores using cpusets (will this work for the Lustre processes?).
2) Removing or otherwise disabling one of the 8-core Opteron processors in the MDS so that only one 8-core processor is active.
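
A minimal sketch of option 1), assuming a cgroup-v1 cpuset mount, that cores 0-7 sit on NUMA node 0, and that the MDS service threads can be matched by name (the "mdt" pattern below is an assumption, and some kernel threads cannot be moved into a cpuset):

  # create a cpuset covering only the first 8 cores (one socket)
  mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset
  mkdir /sys/fs/cgroup/cpuset/lustre
  echo 0-7 > /sys/fs/cgroup/cpuset/lustre/cpuset.cpus
  echo 0   > /sys/fs/cgroup/cpuset/lustre/cpuset.mems
  # move the MDS service threads into the cpuset, one pid at a time
  for pid in $(pgrep mdt); do
      echo $pid > /sys/fs/cgroup/cpuset/lustre/tasks
  done

Option 2) can also be approximated without removing hardware by booting the MDS with the maxcpus=8 kernel parameter, which limits the kernel to bringing up the first 8 CPUs.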

liang Liang Zhen (Inactive) added a comment -

Are these tests running with the same number of clients and with the same stripe count for the files?

Except for the directory-create and file-remove performance of 2.2, I would say the other numbers are no big surprise to me, because current Lustre versions have SMP scalability issues, which means you might suffer from high lock contention and thread context-switch overhead on fat CPU servers. Here is the CPU information for your MDSes:
a) 16 cores (2 sockets), Lustre 2.2
b) 4 cores (2 sockets), Lustre 1.8.4
c) 6 cores (1 socket), Lustre 1.8.4
d) 2 cores (1 socket), Lustre 1.6.3

We are in the process of landing many SMP improvement patches to 2.3; I hope they will fix some of your issues.
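
On the stripe-count question, one way to check and pin the striping of the test directory so runs are comparable across filesystems (the path is an assumption):

  # show the default stripe settings inherited by new files in the test directory
  lfs getstripe -d /scratch/mdtest
  # pin new files to a single stripe
  lfs setstripe -c 1 /scratch/mdtest

As we understand it, the 2.3 SMP work also exposes the number of CPU partitions as a libcfs module option, which would be set on the servers along these lines (2.3+ only; the value 2 is illustrative):

  # /etc/modprobe.d/lustre.conf
  options libcfs cpu_npartitions=2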
green Oleg Drokin added a comment -

Did you run the tests in "uniq" mode? Try running them in shared-directory mode; this is where the bulk of the improvements are (since it is a much more common use case).
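
For reference, the two mdtest modes being contrasted differ only in the -u flag (the launcher, counts, and path are assumptions):

  # shared-directory mode: all tasks create/remove in one directory
  mpirun -np 1400 mdtest -n 100 -i 3 -d /scratch/mdtest
  # "uniq" mode: -u gives each task its own working directory
  mpirun -np 1400 mdtest -n 100 -i 3 -u -d /scratch/mdtest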

People

  Assignee: Liang Zhen (Inactive)
  Reporter: Colin McMurtrie
  Votes: 0
  Watchers: 7
