
LU-66: obdfilter-survey performance issue on NUMA system

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.1.0
    • Fix Version/s: Lustre 2.1.0
    • Component/s: None
    • Bugzilla: 22,980
    • 8541

    Description

      This is just a copy of bug 22980, but I think it's better to track & discuss it here:

      Hello,

      While testing our new IO servers we have run into an issue with obdfilter-survey. Our OSSs are based
      on 4 Nehalem-EX processors connected to a Boxboro chipset. Every socket has 6 cores. On every OST we
      have several FC channels connected to our storage array.

      When we perform raw tests with sgpdd-survey over 24 LUNs, we get ~4400 MB/s on write and more than
      5500 MB/s on read.

      Then, if we start a Lustre filesystem and test these 24 OSTs with obdfilter-survey (size=24192
      rszlo=1024 rszhi=1024 nobjlo=1 nobjhi=2 thrlo=1 thrhi=16 case=disk tests_str="write read" sh
      obdfilter-survey), we always hit a performance limit of 1200 MB/s for both write and read.

      If we perform IOzone tests from five clients (2 threads per client, connected to the server over
      InfiniBand) we get more than 2500 MB/s.

      Then we disconnected two sockets by running the command "echo 0 > /sys/devices/system/cpu/cpu5/online"
      for every CPU belonging to these two sockets, and we get the expected results from obdfilter-survey
      (4600 MB/s on write and 5500 MB/s on read). If we only disconnect one socket, obdfilter-survey gives
      us a maximum of 1600 MB/s. Using only one socket, results are slightly worse than with two sockets.

      We also ran these tests with Lustre 1.6, with other storage arrays, and on similar platforms (4
      sockets and 8 CPUs per socket), always with the same kind of problem. If we activate hyper-threading
      on every socket, then performance is even worse.

      It's as if obdfilter-survey hits some kind of saturation when there are many sockets. What do you
      think? Thanks,

      Attachments

        1. affinity_map
          0.2 kB
        2. affinity_results.tgz
          465 kB
        3. bull_obdfilter_survey_chart_110309.pdf
          65 kB
        4. bull_obdfilter_survey_chart_110319.pdf
          66 kB
        5. full_results_kmalloc.tgz
          346 kB
        6. full_results.tgz
          727 kB
        7. lctl_setaffinity_v2.patch
          4 kB
        8. new_results_kmalloc.tgz
          78 kB
        9. obdfilter-survey_results.txt
          17 kB
        10. remove_vmalloc.patch
          3 kB

        Activity


          Niu Yawei (Inactive) added a comment:

          I don't think we need to run the affinity tests, thank you.

          Sebastien Buisson (Inactive) added a comment:

          You're welcome.

          The storage array we are attached to should not give us more than 5 GB/s (read and write). So I
          think the figures reported by obdfilter-survey are inaccurate because the test does not run long
          enough. Maybe I should increase size.

          Do you still need me to run the affinity tests?

          Cheers,
          Sebastien.

          Niu Yawei (Inactive) added a comment:

          Thanks for your testing, Sebastien.

          The results show a huge improvement in both read and write performance, and the oprofile data
          looks normal this time. So I think the degradation was caused by contention on the BKL and on
          vmap_area_lock. What I don't understand is why the read throughput is extremely high in some
          cases (more than 10000 MB/s); what's the raw bandwidth of each OST?

          Sebastien Buisson (Inactive) added a comment:

          Full obdfilter-survey results with the unlocked_ioctl and remove_vmalloc patches. In the tarball
          please find:

          • summary.txt: array that sums up the test results
          • result_*.txt: results for a specific test, along with 'numastat' output
          • opreport_*.txt: associated oprofile data

          Niu Yawei (Inactive) added a comment:

          Change the vmalloc to kmalloc in the ioctl path. (The previous patch isn't correct; it is
          replaced by this one.)

          Niu Yawei (Inactive) added a comment:

          Hi, Sebastien

          The oprofile data you provided is very helpful. In the unpatched tests, thread_return() ranks
          extremely high, which I think is caused by contention on the BKL; in the patched (with
          unlocked_ioctl) tests, alloc_vmap_area() and find_vmap_area() rank very high, which I think is
          caused by contention on vmap_area_lock.

          I made a patch (remove_vmalloc.patch) which changes vmalloc() to kmalloc() in the ioctl path,
          which should eliminate the contention on vmap_area_lock. Before you run the tests I suggested in
          my last comment, I would really like you to run with this patch (together with the unlocked_ioctl
          patch) first to see what happens (and of course, please enable oprofile while running the tests).
          Thank you.
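          For illustration, a minimal sketch of the kind of change these two patches make, assuming a
          generic character device; the demo_* names, the 8192-byte bound, and the overall structure are
          assumptions for the example, not the actual Lustre code:

            /*
             * Sketch only: serve ioctls through .unlocked_ioctl so the Big
             * Kernel Lock is never taken on the ioctl path, and copy the
             * ioctl argument into a kmalloc()'ed buffer instead of a
             * vmalloc()'ed one, so concurrent ioctls do not contend on the
             * global vmap_area_lock.
             */
            #include <linux/fs.h>
            #include <linux/module.h>
            #include <linux/slab.h>
            #include <linux/uaccess.h>

            #define DEMO_MAX_IOCTL_LEN 8192    /* hypothetical upper bound */

            static long demo_unlocked_ioctl(struct file *file, unsigned int cmd,
                                            unsigned long arg)
            {
                void *buf;
                long rc = 0;

                /* kmalloc() avoids the vmap_area_lock taken by vmalloc() */
                buf = kmalloc(DEMO_MAX_IOCTL_LEN, GFP_KERNEL);
                if (buf == NULL)
                    return -ENOMEM;

                if (copy_from_user(buf, (void __user *)arg, DEMO_MAX_IOCTL_LEN)) {
                    rc = -EFAULT;
                    goto out;
                }

                /* ... dispatch on cmd using the copied buffer ... */
            out:
                kfree(buf);
                return rc;
            }

            static const struct file_operations demo_fops = {
                .owner          = THIS_MODULE,
                /* the old .ioctl entry (run under the BKL) is replaced */
                .unlocked_ioctl = demo_unlocked_ioctl,
            };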

          Niu Yawei (Inactive) added a comment:

          Change vmalloc() to kmalloc() in the ioctl path.

          Niu Yawei (Inactive) added a comment:

          Hi, Sebastien

          I'd like you to run two tests, one in 'thread' mode and another in 'objid' mode:

          • with the 'unlocked_ioctl' patch;
          • 24 cores activated (cores distributed over 4 sockets);
          • with oprofile and numastat;
          • please also provide /tmp/obdfilter_survey_xxxx.detail, where the thread/object CPU mapping is
            logged.

          In 'objid' mode you have to provide the objid-to-CPU-core map in the config file, so you should
          know the object ids and try to map them to appropriate CPUs (so that each CPU always accesses its
          local IOH) before the test. Thanks.
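          For illustration, a minimal userspace sketch of the idea behind such an objid-to-CPU map,
          assuming made-up object ids and core numbers; it only demonstrates the general CPU-pinning
          technique and is not the lctl_setaffinity patch or the obdfilter-survey config format:

            #define _GNU_SOURCE
            #include <pthread.h>
            #include <sched.h>
            #include <stdio.h>

            /* Hypothetical objid-to-core map: each object is served from a core
             * next to the IOH driving its OST, keeping buffers and DMA NUMA-local. */
            struct obj_binding {
                unsigned long objid;
                int cpu;
            };

            static const struct obj_binding bindings[] = {
                { 1001, 0 },    /* example values only */
                { 1002, 6 },
                { 1003, 12 },
            };

            /* Pin the calling thread to the core chosen for this object id. */
            static int bind_to_objid_cpu(unsigned long objid)
            {
                cpu_set_t set;
                size_t i;

                for (i = 0; i < sizeof(bindings) / sizeof(bindings[0]); i++) {
                    if (bindings[i].objid == objid) {
                        CPU_ZERO(&set);
                        CPU_SET(bindings[i].cpu, &set);
                        return pthread_setaffinity_np(pthread_self(),
                                                      sizeof(set), &set);
                    }
                }
                return -1;    /* no mapping for this object */
            }

            int main(void)
            {
                if (bind_to_objid_cpu(1002) != 0)
                    fprintf(stderr, "no binding applied\n");
                /* ... issue I/O to object 1002 from this thread ... */
                return 0;
            }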

          Sebastien Buisson (Inactive) added a comment:

          The test system we are dedicating to you is not ready yet, so I will have to run the tests
          myself.

          I will try lctl_setaffinity.patch, but could you please tell me what kind of tests you need?
          Only in 'thread' mode, right? Still with oprofile and numastat? How many cores/sockets activated?
          With or without the 'unlocked_ioctl' patch?

          TIA,
          Sebastien.

          Sebastien Buisson (Inactive) added a comment:

          We use a custom kernel, based on RHEL6 GA (2.6.32-71.el6.x86_64).

          Niu Yawei (Inactive) added a comment:

          Thanks a lot, Sebastien. What kernel version did you run the test on?

          People

            Assignee: Niu Yawei (Inactive)
            Reporter: Liang Zhen (Inactive)
            Votes: 0
            Watchers: 8

            Dates

              Created:
              Updated:
              Resolved: