
obdfilter-survey performance issue on NUMA system

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.1.0
    • Fix Version/s: Lustre 2.1.0
    • None
    • Bugzilla bug: 22,980
    • 8541

    Description

      This is just a copy of bug 22980, but I think it is better to track and discuss it here:

      Hello,

      While testing our new I/O servers we have run into an issue with obdfilter-survey. Our OSSs are based on 4
      Nehalem-EX processors connected to a Boxboro chipset. Every socket has 6 cores. On every OST we
      have several FC channels connected to our storage bay.

      When we perform raw tests with sgpdd-survey over 24 LUNs, we get ~4400 MB/s on write and more than
      5500 MB/s on read.

      Then, if we start a Lustre filesystem and test these 24 OSTs with obdfilter-survey (size=24192
      rszlo=1024 rszhi=1024 nobjlo=1 nobjhi=2 thrlo=1 thrhi=16 case=disk tests_str="write read" sh
      obdfilter-survey), we always hit a performance limit of about 1200 MB/s for both write and read.
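
      For readability, here is the same invocation with one parameter per line; the comments reflect our
      reading of the obdfilter-survey script and should be re-checked against the lustre-iokit version in use:

        export size=24192             # data to transfer per OST, in MB
        export rszlo=1024 rszhi=1024  # record size fixed at 1024 KB (1 MB I/Os)
        export nobjlo=1 nobjhi=2      # 1 and 2 objects per OST
        export thrlo=1 thrhi=16       # 1 up to 16 threads per OST
        export case=disk              # exercise the local obdfilter devices, no network
        export tests_str="write read" # run the write test, then the read test
        sh obdfilter-survey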

      If we perform IOzone tests from five clients (2 threads per client, connected to the server over
      InfiniBand), we get more than 2500 MB/s.

      Then we took two sockets offline, running "echo 0 > /sys/devices/system/cpu/cpu5/online" (and the
      equivalent for every CPU belonging to those two sockets), and we got the expected results from
      obdfilter-survey (4600 MB/s on write and 5500 MB/s on read). If we take only one socket offline,
      obdfilter-survey gives us a maximum of 1600 MB/s. With only one socket enabled, results are slightly
      worse than with two sockets.
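
      A minimal sketch of that procedure, assuming the standard sysfs topology files (the socket IDs to take
      offline are placeholders):

        for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
            pkg=$(cat "$cpu/topology/physical_package_id" 2>/dev/null)
            case "$pkg" in
                2|3) echo 0 > "$cpu/online" ;;   # take every core of sockets 2 and 3 offline
            esac
        done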

      We also ran these tests with Lustre 1.6, with other storage bays, and on similar platforms (4
      sockets and 8 cores per socket), and always saw the same kind of problem. If we enable
      hyper-threading on every socket, performance is even worse.

      It looks as if obdfilter-survey hits some kind of saturation when many sockets are in use. What do you
      think? Thanks,

      Attachments

        1. affinity_map
          0.2 kB
        2. affinity_results.tgz
          465 kB
        3. bull_obdfilter_survey_chart_110309.pdf
          65 kB
        4. bull_obdfilter_survey_chart_110319.pdf
          66 kB
        5. full_results_kmalloc.tgz
          346 kB
        6. full_results.tgz
          727 kB
        7. lctl_setaffinity_v2.patch
          4 kB
        8. new_results_kmalloc.tgz
          78 kB
        9. obdfilter-survey_results.txt
          17 kB
        10. remove_vmalloc.patch
          3 kB

        Activity

          [LU-66] obdfilter-survey performance issue on NUMA system

          I think this can be closed now.

          niu Niu Yawei (Inactive) added a comment

          latest testing results graphs

          liang Liang Zhen (Inactive) added a comment

          performance graphs (adding data for non-patched version)

          liang Liang Zhen (Inactive) added a comment

          Sorry, some data in the previous file is wrong.

          liang Liang Zhen (Inactive) added a comment

          Performance charts for obdfilter-survey & sgpdd-survey.
          BTW, non-patched data is not here because it is too difficult for us to reinstall unpatched RPMs remotely, so we do not have data for unpatched results on the same machine.

          liang Liang Zhen (Inactive) added a comment
          niu Niu Yawei (Inactive) added a comment (edited)

          I ran a bunch of sgpdd-survey tests on berlin6, and the oprofile results show more than 50% of samples in copy_user_generic_string(). Since sgpdd-survey uses read/write syscalls to perform I/O against the device, there are lots of copy_from/to_user() calls transferring data between userspace and the kernel, and that consumes a lot of CPU time. However, I don't think copy_from_user() is the major bottleneck of sgpdd-survey write, because when I ran sgpdd-survey read-only, the copy_user_generic_string() samples were still very high (more than 60%), yet sgpdd-survey read performance is comparable with obdfilter-survey's.

          sgp_dd calls its own sg_write()/sg_read(), and sg_read()/sg_write() simply generate an I/O request for the underlying device and then unplug it (which explains why iostat does not work for sgp_dd tests: it bypasses the kernel I/O statistics code). I think the bottleneck should be in sg_write() (maybe it does not behave well under multi-core/multi-thread conditions), though the exact root cause is still unknown.
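
          As a quick illustration of the iostat point, something along these lines can be used (a sketch only: it assumes /dev/sg1 and /dev/sdb refer to the same LUN with 512-byte sectors, and uses the sgp_dd options from sg3_utils):

            # block-layer counters for /dev/sdb stay flat while sgp_dd drives the
            # same LUN through the sg device, since sg_write() bypasses them
            iostat -x 1 sdb > iostat.log &
            IOSTAT_PID=$!
            sgp_dd if=/dev/zero of=/dev/sg1 bs=512 bpt=2048 count=2097152 thr=8
            kill $IOSTAT_PID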


          Hi Sebastien,

          I agree with your conclusion that NUMIOA affinity does not affect the test results much. To double-check it, I think we should run obdfilter-survey (without the affinity patch) once more on the new system to see whether there is any difference.

          As for sgpdd-survey write being worse than obdfilter-survey, one possible reason is that sgpdd-survey used too few threads; however, the results show that 128 threads is very close to 64 threads, so maybe there is some bottleneck in the sgpdd-survey test tool itself. I will look into the code to get more details. In the meantime, I think two quick tests could be helpful for this investigation (a rough sketch of both is given below):

          • run sgpdd-survey on only one device, to get the raw bandwidth of each single device;
          • collect oprofile data while running sgpdd-survey over 16 devices, to see if there is any contention.

          I got access to the test system today, but it will take me some time to learn how to run these tests on it; could you help me run these three tests this time? Thanks a lot.
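
          For reference, a minimal sketch of the two suggested runs (sizes, device names, thread counts and the vmlinux path are placeholders; the parameter names are taken from the lustre-iokit sgpdd-survey script, and the legacy opcontrol-based oprofile workflow is assumed):

            # (1) raw bandwidth of a single device
            size=8192 crglo=1 crghi=1 thrlo=1 thrhi=16 \
                scsidevs="/dev/sg0" sh sgpdd-survey

            # (2) the same survey over 16 devices, profiled with oprofile
            opcontrol --vmlinux=/usr/lib/debug/lib/modules/$(uname -r)/vmlinux  # needed for kernel symbols
            opcontrol --reset
            opcontrol --start
            size=8192 crglo=1 crghi=2 thrlo=16 thrhi=128 \
                scsidevs="$(echo /dev/sg{0..15})" sh sgpdd-survey   # bash brace expansion
            opcontrol --stop
            opreport --symbols | head -30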

          niu Niu Yawei (Inactive) added a comment

          Results with unlocked_ioctl and remove_vmalloc patches, plus affinity patch for obdfilter-survey. In the tarball please find:

          • summary.txt: array that sums up test results
          • result_*.txt: results for a specific obdfilter-survey test, along with 'numastat' output
          • obdfilter_survey*.detail: associated obdfilter-survey detailed data
          • affinity_map.*: associated affinity mapping
          • sgpdd_res_new.txt: sgpdd-survey results

          I am sorry to post these results only today, but Jira was not accessible yesterday afternoon (French time). I took the opportunity to run some more tests, with deliberately bad affinities.
          I also ran sgpdd-survey tests, because the hardware on which these tests were launched is not the same as before. Due to a software reconfiguration, we now have access to 16 LUNs (instead of 15) through 8 FC links (instead of 4). So raw performance is better than before, as we are no longer limited by FC bandwidth.

          In the results, several points are surprising:

          • obdfilter-survey results are better than sgpdd-survey results in write;
          • affinity mapping has little impact on performance; a good mapping is better than a bad mapping only with a high number of threads.
          sebastien.buisson Sebastien Buisson (Inactive) added a comment

          Hi, Sebastien

          Ah, right. Binding object ID to CPU doesn't make sense for this test, so I've changed the patch to bind devno to CPU (lctl_setaffinity_v2.patch), and I also updated the example affinity_map. Please use the new patch to run the test that I mentioned in my previous comment. Thanks for your effort!

          niu Niu Yawei (Inactive) added a comment
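
          For reference, a minimal sketch of how one might check which NUMA node each SCSI device's HBA sits on when filling in such a device-to-CPU map (this assumes the standard sysfs layout and is not part of the attached patch):

            for sg in /sys/class/scsi_generic/sg*; do
                dev=$(readlink -f "$sg/device")
                # walk up the sysfs hierarchy to the HBA's PCI device,
                # which exposes a numa_node attribute
                while [ "$dev" != "/" ] && [ ! -e "$dev/numa_node" ]; do
                    dev=$(dirname "$dev")
                done
                printf '%s -> NUMA node %s\n' "${sg##*/}" "$(cat "$dev/numa_node" 2>/dev/null)"
            done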

          If I understand correctly, in order to know the objids in advance I should look at the obdfilter_survey_xxxx.detail file and assume the next run will do '+1' on the IDs.
          The problem is that the obdfilter_survey_xxxx.detail file contains the following:

          =======================> ost 15 sz 314572800K rsz 1024K obj 15 thr 15
          =============> Create 1 on localhost:quartcel-OST0005_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST0008_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST0007_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST000c_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST0000_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST000d_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST0004_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST0006_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST0002_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST000a_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST0009_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST000b_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST000e_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST0001_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST0003_ecc
          create: 1 objects
          create: #1 is object id 0x29

          So, as you can see, all objects have the same IDs on the OSTs... In that case, I am afraid the 'objid to core' mapping is useless.
          Unless I manually create new objects on the OSTs so that the objids are different everywhere?

          Sebastien.
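
          As an aside, the per-OST object IDs can be pulled out of a .detail file like the one above with something along these lines (the file name is a placeholder):

            awk '/> Create .* on/ { tgt = $NF }
                 /is object id/   { print tgt, $NF }' obdfilter_survey_xxxx.detail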

          sebastien.buisson Sebastien Buisson (Inactive) added a comment

          People

            niu Niu Yawei (Inactive)
            liang Liang Zhen (Inactive)
            Votes: 0
            Watchers: 8

            Dates

              Created:
              Updated:
              Resolved: