Lustre / LU-29

obdfilter-survey doesn't work well if cpu_cores (/w hyperT) > 16

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 1.8.6
    • Fix Version/s: Lustre 1.8.6
    • Bugzilla bug: 22980

    Description

      It seems obdfilter-survey does not work well on a 12-core system (the OSS sees 24 logical cores when hyper-threading is on).
      Here are quick results for 12, 6 and 8 cores on the same OSSs. For the 6- and 8-core runs, I turned CPUs off with "echo 0 > /sys/devices/system/cpu/cpuX/online" on the 12-core system (X5670, Westmere, 6 cores x 2 sockets).
      Testing with 16 or fewer CPU cores appears to be no problem, but with 24 cores it does not work well.
      This has been discussed in bug 22980, but there is still no solution for running obdfilter-survey on a current Westmere box.

      #TEST-1 4xOSSs, 56 OSTs (14 OSTs per OSS), 12 physical cores (24 logical CPUs with HT)
      ost 56 sz 469762048K rsz 1024K obj 56 thr 56 write 3323.91 [ 39.96, 71.93] read 5967.91 [ 94.91, 127.93]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 112 write 5807.10 [ 72.93, 120.77] read 6182.79 [ 96.91, 140.86]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 224 write 6377.41 [ 75.93, 176.83] read 6193.18 [ 81.98, 139.86]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 448 write 6279.64 [ 69.93, 185.83] read 6162.43 [ 77.88, 162.86]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 896 write 6114.28 [ 9.99, 226.79] read 6017.08 [ 14.98, 220.80]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 1792 write 6078.08 [ 8.99, 285.73] read 5923.64 [ 16.98, 161.85]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 3584 write 6168.36 [ 76.92, 250.75] read 5828.33 [ 85.95, 174.77]

      #TEST-2 4xOSSs, 56 OSTs (14 OSTs per OSS), 6 physical cores (12 logical CPUs; all CPUs with physical id 1, i.e. the second socket, turned off)
      ost 56 sz 469762048K rsz 1024K obj 56 thr 56 write 3677.43 [ 36.97, 75.93] read 8355.91 [ 137.87, 168.85]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 112 write 7045.25 [ 89.92, 141.87] read 10672.33 [ 153.87, 212.80]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 224 write 9909.58 [ 116.88, 217.78] read 10235.82 [ 140.87, 203.83]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 448 write 9796.21 [ 106.90, 214.80] read 10803.78 [ 142.87, 348.93]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 896 write 9377.85 [ 54.95, 265.75] read 10700.27 [ 126.76, 279.74]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 1792 write 9257.48 [ 0.00, 384.63] read 10726.18 [ 121.87, 291.74]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 3584 write 9162.01 [ 0.00, 242.78] read 10627.94 [ 115.89, 271.74]

      #TEST-3 4xOSSs, 56 OSTs (14 OSTs per OSS), 8 physical cores (16 logical CPUs; core_id {2, 10} turned off on both sockets)
      ost 56 sz 469762048K rsz 1024K obj 56 thr 56 write 3614.92 [ 43.96, 75.93] read 7919.40 [ 122.88, 169.84]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 112 write 6703.91 [ 71.94, 135.87] read 9899.53 [ 156.87, 201.81]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 224 write 9901.78 [ 123.88, 233.78] read 10401.05 [ 151.85, 202.81]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 448 write 9721.29 [ 115.89, 212.80] read 10812.26 [ 151.86, 241.54]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 896 write 9330.51 [ 94.91, 257.50] read 10672.22 [ 112.90, 342.66]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 1792 write 9053.42 [ 22.98, 263.75] read 10657.08 [ 95.91, 286.73]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 3584 write 9081.75 [ 45.96, 239.57] read 10562.43 [ 78.93, 270.75]

      Attachments

        Activity

          pjones Peter Jones added a comment -

          Marking as resolved, as the fix landed upstream in Oracle Lustre 1.8.6.


          ihara Shuichi Ihara (Inactive) added a comment -

          I believe so. In my case, obdfilter-survey didn't work well on the 24-core system (actually 12 cores, but 24 with HT=on), and this is NOT a NUMIOA platform.
          I confirmed this problem was fixed by Niu's patches here, and obdfilter-survey worked well with 24 cores on the non-NUMIOA platform.

          By the way, I also have a NUMIOA system, but it is currently used with VMs, splitting the CPU and IO chips via KVM (Kernel-based Virtual Machine). However, I'm really interested in LU-66; I'll keep an eye on it and would like to test patches on my test system when they become available.

          Thanks again!

          pjones Peter Jones added a comment -

          Am I right in thinking that the NUMIOA issue is being tracked under LU-66 and this ticket can be marked as resolved? Bugzilla seems to be available again btw.

          liang Liang Zhen (Inactive) added a comment - edited

          As Niu said, the major difference between this issue and b22980 is that b22980 is running on a NUMIOA system.
          Diego mentioned (on b22980) that a Lustre client can drive an OSS harder than obdfilter-survey. I think that could be because the OST IO threads have NUMA affinity (ptlrpc_service::srv_cpu_affinity), while the lctl threads of obdfilter-survey have no affinity, so they probably generate much more cross-node traffic (a rough sketch of such pinning follows below).

          I hope Bull can collect some data for us (e.g. if each NUMA node has 8 cores):

          • enable only 1/2/3/4 NUMA nodes and run obdfilter-survey
          • enable 8/16/24/32 cores, but distribute those cores across different NUMA nodes

          If we get quite different performance, then our assumption is correct; otherwise we are pointing at the wrong place. Of course, we should collect oprofile data while running these tests.
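
          For illustration only (this is not Lustre or lctl code), a minimal user-space sketch of what pinning a worker thread to the cores of one NUMA node could look like; the core list is hard-coded here as an assumption, whereas a real tool would read it from /sys/devices/system/node/nodeN/cpulist:

          #define _GNU_SOURCE
          #include <sched.h>
          #include <stdio.h>

          /* restrict the calling thread to an explicit list of logical CPUs */
          static int bind_to_cpus(const int *cpus, int ncpus)
          {
                  cpu_set_t set;
                  int i;

                  CPU_ZERO(&set);
                  for (i = 0; i < ncpus; i++)
                          CPU_SET(cpus[i], &set);
                  /* pid 0 == calling thread; it may now run only on these cores */
                  return sched_setaffinity(0, sizeof(set), &set);
          }

          int main(void)
          {
                  /* assumed layout: NUMA node 0 owns logical CPUs 0-5 (illustrative) */
                  int node0_cpus[] = { 0, 1, 2, 3, 4, 5 };

                  if (bind_to_cpus(node0_cpus, 6) != 0)
                          perror("sched_setaffinity");
                  return 0;
          }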

          niu Niu Yawei (Inactive) added a comment -

          Thank you, Andreas.

          Yes, I agree that we'd better collect some oprofile data in the next step; I will have a meeting with Bull's people tonight about b22980.

          The test result shows that the shmem lock isn't a major factor, so I think we can put it aside for a while. One thing that confuses me is that the 'unlocked_ioctl' patch works for Ihara's test but doesn't work well for b22980. Liang discussed this issue with me yesterday, and we identified three differences between the two tests:

          • Ihara's test is against 1.8, b22980 is against 2.0 (I checked the obdecho code; there seems to be no major difference between 1.8 and 2.0 in 'case=disk' mode).
          • With and without the patch applied, Ihara compared 12 cores with 24 cores, while b22980 compared 8 cores with 32 cores (I will ask Bull's people to run more tests with the patch applied: 16 cores, 24 cores, ...).
          • This test is running on SMP (Ihara, please correct me if I'm wrong), but b22980 is running on NUMA (Liang mentioned that without CPU affinity the performance degradation can be huge on a NUMA architecture; I will ask them to run more tests to see whether it is NUMA-dependent).

          I will talk to Bull's people about what we found and ask them to supply some oprofile data in the next test. If you have any comments, please let me know.


          adilger Andreas Dilger added a comment -

          I'm currently unable to post to bugzilla...

          Inspection template(s):
          Bug: 22980
          Developer: niu@whamcloud.com
          Size: 7 Lines of Change
          Date: 2011-2-8
          Defects: 1
          Type: CODE
          Inspector: adilger@whamcloud.com

          --------------
          >@@ -1681,24 +1680,15 @@ int jt_obd_test_brw(int argc, char **argv)
          > } else if (be_verbose(verbose, &next_time,i, &next_count,count)) {
          >- shmem_lock ();
          > printf("%s: %s number %d @ "LPD64":"LPU64" for %d\n",
          > jt_cmdname(argv[0]), write ? "write" : "read", i,
          > data.ioc_obdo1.o_id, data.ioc_offset,
          > (int)(pages * getpagesize()));
          >- shmem_unlock ();

          I would be surprised if the locking here affects the performance. be_verbose()
          should be true at most every few seconds, and otherwise the shmem_lock/unlock()
          is never hit. I think this was put in place to avoid all of the printf()
          statements from overlapping, which ruins the whole result from the test. If
          there actually IS overhead from this locking, it just means that the message
          rate is too high and needs to be reduced.
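
          For illustration only (this is not the lctl be_verbose() code, and should_report() is a hypothetical helper), a minimal user-space sketch of such time-based throttling, so that any lock taken around printf() is reached at most once per interval:

          #include <stdio.h>
          #include <time.h>

          /* hypothetical helper: returns 1 at most once per interval_sec seconds */
          static int should_report(time_t *next_report, int interval_sec)
          {
                  time_t now = time(NULL);

                  if (now < *next_report)
                          return 0;               /* too soon, stay quiet */
                  *next_report = now + interval_sec;
                  return 1;
          }

          int main(void)
          {
                  time_t next_report = 0;
                  long i;

                  for (i = 0; i < 100000000L; i++) {
                          /* ... per-iteration I/O work would go here ... */
                          if (should_report(&next_report, 10))
                                  printf("progress: iteration %ld\n", i);
                  }
                  return 0;
          }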

          >@@ -1622,20 +1622,19 @@ int jt_obd_test_brw(int argc, char **argv)
          >
          > #ifdef MAX_THREADS
          >         if (thread) {
          >-                shmem_lock ();
          >                 if (nthr_per_obj != 0) {
          >                         /* threads interleave */
          >                         obj_idx = (thread - 1)/nthr_per_obj;
          >                         objid += obj_idx;
          >                         stride *= nthr_per_obj;
          >-                        if ((thread - 1) % nthr_per_obj == 0)
          >-                                shared_data->offsets[obj_idx] = stride + thr_offset;
          >                         thr_offset += ((thread - 1) % nthr_per_obj) * len;
          >                 } else {
          >                         /* threads disjoint */
          >                         thr_offset += (thread - 1) * len;
          >                 }
          >
          >+                shmem_lock ();
          >+
          >                 shared_data->barrier--;
          >                 if (shared_data->barrier == 0)
          >                         l_cond_broadcast(&shared_data->cond);
          >                 if (!repeat_offset) {
          > #ifdef MAX_THREADS
          >-                        if (stride == len) {
          >-                                data.ioc_offset += stride;
          >-                        } else if (i < count) {
          >-                                shmem_lock ();
          >-                                data.ioc_offset = shared_data->offsets[obj_idx];
          >-                                shared_data->offsets[obj_idx] += len;
          >-                                shmem_unlock ();
          >-                        }
          >+                        data.ioc_offset += stride;

          (defect) I don't think this is going to result in the same test load at all.
          It means that only "len/stride" fraction of each object is written, and in
          fact it looks like there will be holes in every object because the common
          data->ioc_offset is being incremented by every thread in a racy manner so the
          offset will get large too quickly.

          What about adding per-object shmem_locks to protect offsets[] values? That
          would avoid most of the contention on this lock, if that is the overhead.
          However, like previously stated, I think it is best to get some real data
          (e.g. oprofile for kernel and gprof for userspace, collected over two different
          test runs to avoid too much overhead).
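
          As a hedged sketch of the per-object locking idea suggested above (the structure layout and names below are illustrative, not the actual lctl shared_data code), one mutex per object lets threads working on different objects avoid contending on a single global shmem lock when bumping offsets[]:

          #include <pthread.h>
          #include <stdio.h>

          #define NUM_OBJECTS 16                          /* illustrative bound */

          struct shared_offsets {
                  pthread_mutex_t obj_lock[NUM_OBJECTS];  /* one lock per object */
                  unsigned long long offsets[NUM_OBJECTS];
          };

          /* atomically hand out the next offset for a single object */
          static unsigned long long next_offset(struct shared_offsets *sd,
                                                int obj_idx, unsigned long long len)
          {
                  unsigned long long off;

                  pthread_mutex_lock(&sd->obj_lock[obj_idx]);
                  off = sd->offsets[obj_idx];
                  sd->offsets[obj_idx] += len;
                  pthread_mutex_unlock(&sd->obj_lock[obj_idx]);
                  return off;
          }

          int main(void)
          {
                  static struct shared_offsets sd;
                  int i;

                  /* in the real multi-process case the mutexes would also need the
                   * PTHREAD_PROCESS_SHARED attribute and would live in shared memory */
                  for (i = 0; i < NUM_OBJECTS; i++)
                          pthread_mutex_init(&sd.obj_lock[i], NULL);

                  printf("obj 3 first offset: %llu\n", next_offset(&sd, 3, 1048576ULL));
                  printf("obj 3 next offset:  %llu\n", next_offset(&sd, 3, 1048576ULL));
                  return 0;
          }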


          ihara Shuichi Ihara (Inactive) added a comment -

          Niu, it was my fault, sorry. The problem was NOT caused by these patches; it came from the CPU affinity settings on the VM. Maybe many context switches happened?
          Anyway, once I set the CPU affinity correctly on the VM, the patches work as well as on the 24-core system.

          Thanks!


          niu Niu Yawei (Inactive) added a comment -

          Ihara, that's interesting. I don't have any ideas on that so far; let's see what happened when your results come out. Thanks.


          ihara Shuichi Ihara (Inactive) added a comment -

          Niu, I'm setting up a test infrastructure on VMs (KVM: Kernel-based Virtual Machine). Once I apply your patch and run obdfilter-survey on a VM, the performance gets bad. Without the patch, I get reasonable numbers even on VMs. So the patch seems to have some impact when I run obdfilter-survey on a VM.

          I will file the results and more information (I will collect oprofile data on the VM) in a couple of days.

          Ihara


          niu Niu Yawei (Inactive) added a comment -

          Yes, I meant 22980, thanks Peter.

          Andreas, HAVE_UNLOCKED_IOCTL is defined for kernels that have the 'unlocked_ioctl' method.


          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: ihara Shuichi Ihara (Inactive)
            Votes: 0
            Watchers: 3
