Lustre / LU-29

obdfilter-survey doesn't work well if cpu_cores (/w hyperT) > 16

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 1.8.6
    • Fix Version/s: Lustre 1.8.6
    • None
    • 3
    • 22,980
    • 8550

    Description

      It seems obdfilter-survey is not working well on a 12-core system (the OSS sees 24 cores when hyper-threading is on).
      Here are quick results for 12 cores, 6 cores, and 8 cores on the same OSSs. For the 6- and 8-core runs, I turned CPUs off with "echo 0 > /sys/devices/system/cpu/cpuX/online" on the 12-core system. (X5670, Westmere, 6 cores x 2 sockets)
      Testing with "# of CPU cores <= 16" seems to be no problem, but with 24 cores it doesn't work well.
      This has been discussed in bug 22980, but there is still no solution for running obdfilter-survey on a current Westmere box.

      #TEST-1 4xOSSs, 56OSTs(14 OSTs per OSS), 12 cores (# of CPU cores is 24)
      ost 56 sz 469762048K rsz 1024K obj 56 thr 56 write 3323.91 [ 39.96, 71.93] read 5967.91 [ 94.91, 127.93]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 112 write 5807.10 [ 72.93, 120.77] read 6182.79 [ 96.91, 140.86]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 224 write 6377.41 [ 75.93, 176.83] read 6193.18 [ 81.98, 139.86]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 448 write 6279.64 [ 69.93, 185.83] read 6162.43 [ 77.88, 162.86]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 896 write 6114.28 [ 9.99, 226.79] read 6017.08 [ 14.98, 220.80]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 1792 write 6078.08 [ 8.99, 285.73] read 5923.64 [ 16.98, 161.85]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 3584 write 6168.36 [ 76.92, 250.75] read 5828.33 [ 85.95, 174.77]

      #TEST-2 4xOSSs, 56OSTs(14 OSTs per OSS), 6 cores (# of CPU cores is 12, all physical cpu_id=1 are turned off)
      ost 56 sz 469762048K rsz 1024K obj 56 thr 56 write 3677.43 [ 36.97, 75.93] read 8355.91 [ 137.87, 168.85]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 112 write 7045.25 [ 89.92, 141.87] read 10672.33 [ 153.87, 212.80]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 224 write 9909.58 [ 116.88, 217.78] read 10235.82 [ 140.87, 203.83]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 448 write 9796.21 [ 106.90, 214.80] read 10803.78 [ 142.87, 348.93]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 896 write 9377.85 [ 54.95, 265.75] read 10700.27 [ 126.76, 279.74]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 1792 write 9257.48 [ 0.00, 384.63] read 10726.18 [ 121.87, 291.74]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 3584 write 9162.01 [ 0.00, 242.78] read 10627.94 [ 115.89, 271.74]

      #TEST-3 4xOSSs, 56OSTs(14 OSTs per OSS), 8 cores (# of CPU cores is 16, core_id={2, 10} from both sockets are turned off)
      ost 56 sz 469762048K rsz 1024K obj 56 thr 56 write 3614.92 [ 43.96, 75.93] read 7919.40 [ 122.88, 169.84]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 112 write 6703.91 [ 71.94, 135.87] read 9899.53 [ 156.87, 201.81]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 224 write 9901.78 [ 123.88, 233.78] read 10401.05 [ 151.85, 202.81]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 448 write 9721.29 [ 115.89, 212.80] read 10812.26 [ 151.86, 241.54]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 896 write 9330.51 [ 94.91, 257.50] read 10672.22 [ 112.90, 342.66]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 1792 write 9053.42 [ 22.98, 263.75] read 10657.08 [ 95.91, 286.73]
      ost 56 sz 469762048K rsz 1024K obj 56 thr 3584 write 9081.75 [ 45.96, 239.57] read 10562.43 [ 78.93, 270.75]

      Attachments

        Activity


          niu Niu Yawei (Inactive) added a comment -

          Thank you, Andreas.

          Yes, I agree with you that we'd better collect some oprofile data in the next step; I will have a meeting with Bull's people tonight about b22980.

          The test result shows that the shmem lock isn't a major factor, so I think we can put it aside for a while. One thing that confuses me is that the 'unlocked_ioctl' patch works for Ihara's test but doesn't work well for b22980. Liang discussed this issue with me yesterday, and we identified three differences between the two tests:

          • Ihara's test is against 1.8, b22980 is against 2.0. (I checked the obdecho code; there seems to be no major difference between 1.8 and 2.0 in 'case=disk' mode.)
          • With and without the patch applied, Ihara compared 12 cores with 24 cores, while b22980 compared 8 cores with 32 cores. (Will ask Bull's people to do more tests with the patch applied, at 16 cores, 24 cores, ...)
          • This test is running on SMP (Ihara, please correct me if I'm wrong), but b22980 is running on NUMA. (Liang mentioned that without CPU affinity the performance degradation can be huge on a NUMA architecture; will ask them to do more tests to see whether it's NUMA-dependent. See the affinity sketch below.)

          Will talk to Bull's people about what we found, and ask them to supply some oprofile data in the next test. If you have any comments, please let me know.
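          To make the CPU-affinity point above concrete, here is a minimal user-space sketch (illustrative only, not code from the Lustre tree) that pins the calling thread to an explicit list of cores with sched_setaffinity(); the core IDs in main() are placeholders and would have to match the NUMA node the I/O threads are meant to stay on:

          /* illustrative sketch: pin the calling thread to a fixed core set */
          #define _GNU_SOURCE
          #include <sched.h>
          #include <stdio.h>

          static int pin_to_cores(const int *cores, int ncores)
          {
                  cpu_set_t mask;
                  int i;

                  CPU_ZERO(&mask);
                  for (i = 0; i < ncores; i++)
                          CPU_SET(cores[i], &mask);

                  /* pid 0 means "the calling thread" */
                  if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
                          perror("sched_setaffinity");
                          return -1;
                  }
                  return 0;
          }

          int main(void)
          {
                  int node0_cores[] = { 0, 1, 2, 3, 4, 5 };  /* placeholder core IDs */

                  return pin_to_cores(node0_cores, 6) ? 1 : 0;
          }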


          adilger Andreas Dilger added a comment -

          I'm currently unable to post to bugzilla...

          Inspection template(s):
          Bug: 22980
          Developer: niu@whamcloud.com
          Size: 7 Lines of Change
          Date: 2011-2-8
          Defects: 1
          Type: CODE
          Inspector: adilger@whamcloud.com

          --------------
          >@@ -1681,24 +1680,15 @@ int jt_obd_test_brw(int argc, char **argv)
          >                } else if (be_verbose(verbose, &next_time,i, &next_count,count)) {
          >-                        shmem_lock ();
          >                         printf("%s: %s number %d @ "LPD64":"LPU64" for %d\n",
          >                                jt_cmdname(argv[0]), write ? "write" : "read", i,
          >                                data.ioc_obdo1.o_id, data.ioc_offset,
          >                                (int)(pages * getpagesize()));
          >-                        shmem_unlock ();

          I would be surprised if the locking here affects the performance. be_verbose()
          should be true at most every few seconds, and otherwise the shmem_lock/unlock()
          is never hit. I think this was put in place to avoid all of the printf()
          statements from overlapping, which ruins the whole result from the test. If
          there actually IS overhead from this locking, it just means that the message
          rate is too high and needs to be reduced.
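          For reference, the pattern under discussion is roughly the following (a simplified sketch; the real be_verbose() and shmem_lock() in lctl differ in detail): the lock is only taken on the rare iterations where the rate-limit check fires, so it can only become a contention point if the message rate is far too high.

          /* simplified sketch of rate-limited, serialized progress output */
          #include <pthread.h>
          #include <stdio.h>
          #include <time.h>

          static pthread_mutex_t print_lock = PTHREAD_MUTEX_INITIALIZER;

          /* returns non-zero at most once every 'interval' seconds */
          static int be_verbose_sketch(time_t *next_time, int interval)
          {
                  time_t now = time(NULL);

                  if (now < *next_time)
                          return 0;
                  *next_time = now + interval;
                  return 1;
          }

          static void report_progress(time_t *next_time, int thread, long i,
                                      long long offset)
          {
                  if (!be_verbose_sketch(next_time, 10))
                          return;          /* common case: no lock, no printf */

                  /* rare case: serialize so lines from threads don't interleave */
                  pthread_mutex_lock(&print_lock);
                  printf("thread %d: op %ld @ %lld\n", thread, i, offset);
                  pthread_mutex_unlock(&print_lock);
          }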

          >@@ -1622,20 +1622,19 @@ int jt_obd_test_brw(int argc, char **argv)
          >
          > #ifdef MAX_THREADS
          >         if (thread) {
          >-                shmem_lock ();
          >                 if (nthr_per_obj != 0) {
          >                         /* threads interleave */
          >                         obj_idx = (thread - 1)/nthr_per_obj;
          >                         objid += obj_idx;
          >                         stride *= nthr_per_obj;
          >-                        if ((thread - 1) % nthr_per_obj == 0)
          >-                                shared_data->offsets[obj_idx] = stride + thr_offset;
          >                         thr_offset += ((thread - 1) % nthr_per_obj) * len;
          >                 } else {
          >                         /* threads disjoint */
          >                         thr_offset += (thread - 1) * len;
          >                 }
          >
          >+                shmem_lock ();
          >+
          >                 shared_data->barrier--;
          >                 if (shared_data->barrier == 0)
          >                         l_cond_broadcast(&shared_data->cond);
          >                 if (!repeat_offset) {
          > #ifdef MAX_THREADS
          >-                        if (stride == len) {
          >-                                data.ioc_offset += stride;
          >-                        } else if (i < count) {
          >-                                shmem_lock ();
          >-                                data.ioc_offset = shared_data->offsets[obj_idx];
          >-                                shared_data->offsets[obj_idx] += len;
          >-                                shmem_unlock ();
          >-                        }
          >+                        data.ioc_offset += stride;
          (defect) I don't think this is going to result in the same test load at all.
          It means that only "len/stride" fraction of each object is written, and in
          fact it looks like there will be holes in every object because the common
          data->ioc_offset is being incremented by every thread in a racy manner so the
          offset will get large too quickly.

          What about adding per-object shmem_locks to protect offsets[] values? That
          would avoid most of the contention on this lock, if that is the overhead.
          However, like previously stated, I think it is best to get some real data
          (e.g. oprofile for kernel and gprof for userspace, collected over two different
          test runs to avoid too much overhead).
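          As a rough illustration of the per-object locking idea (a sketch only; the structure and names below are invented for the example, and the real shmem_lock() in obd.c is not pthread-based), each thread would claim the next chunk of its object under that object's own lock instead of the single global lock:

          /* sketch: per-object locks protecting shared offsets[] */
          #include <pthread.h>

          #define MAX_OBJECTS 512   /* placeholder bound */

          struct shared_data_sketch {
                  pthread_mutex_t    obj_lock[MAX_OBJECTS];   /* one lock per object */
                  unsigned long long offsets[MAX_OBJECTS];
          };

          static void shared_data_sketch_init(struct shared_data_sketch *sd)
          {
                  pthread_mutexattr_t attr;
                  int i;

                  /* PROCESS_SHARED so the locks still work when the test threads
                   * are separate processes attached to the same shmem segment */
                  pthread_mutexattr_init(&attr);
                  pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
                  for (i = 0; i < MAX_OBJECTS; i++)
                          pthread_mutex_init(&sd->obj_lock[i], &attr);
                  pthread_mutexattr_destroy(&attr);
          }

          /* claim the next offset of one object under that object's lock only */
          static unsigned long long claim_next_offset(struct shared_data_sketch *sd,
                                                      int obj_idx,
                                                      unsigned long long len)
          {
                  unsigned long long off;

                  pthread_mutex_lock(&sd->obj_lock[obj_idx]);
                  off = sd->offsets[obj_idx];
                  sd->offsets[obj_idx] += len;
                  pthread_mutex_unlock(&sd->obj_lock[obj_idx]);
                  return off;
          }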


          ihara Shuichi Ihara (Inactive) added a comment -

          Niu, it was my fault, sorry. The problem was NOT caused by these patches; it came from the CPU affinity settings on the VM. Maybe many context switches happened? Anyway, once I set the correct CPU affinity on the VM, the patches work as well as on the 24-core system.

          Thanks!


          niu Niu Yawei (Inactive) added a comment -

          Ihara, that's interesting. I don't have any ideas on that so far; let's see what happens when your results come out. Thanks.


          ihara Shuichi Ihara (Inactive) added a comment -

          Niu, I'm investigating a test infrastructure on VMs (KVM: Kernel-based Virtual Machine). Once I apply your patch and run obdfilter-survey on a VM, the performance goes bad. Without the patch, I'm getting reasonable numbers even on VMs. So the patch seems to have some impact when I run obdfilter-survey on a VM.

          I will file results and more information (will get oprofile on the VM) in a couple of days.

          Ihara


          niu Niu Yawei (Inactive) added a comment -

          Yes, I meant 22980, thanks Peter.

          Andreas, HAVE_UNLOCKED_IOCTL is defined for kernels that have the 'unlocked_ioctl' method.


          adilger Andreas Dilger added a comment -

          I don't think there are any ioctls that depend on the BKL, but I haven't looked through them closely. In particular, I'm not sure if there is proper serialization around the configuration ioctls or not.

          That said, since the configuration is almost always done by mount/unmount and not by the old lctl commands, I don't think this will be a serious risk, so I think it makes sense to move the Lustre ioctl handling over to ->unlocked_ioctl(). That should be done only for kernels which support the ->unlocked_ioctl() method, which means a configure check is needed to set HAVE_UNLOCKED_IOCTL if that method is present in struct file_operations.
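          A sketch of what that could look like (illustrative only; obd_class_handle_ioctl() is a placeholder for whatever common handler the real patch would call, and HAVE_UNLOCKED_IOCTL would come from the configure test described above):

          /* sketch: pick the ioctl entry point based on a configure check */
          #include <linux/fs.h>
          #include <linux/module.h>

          int obd_class_handle_ioctl(struct file *filp, unsigned int cmd,
                                     unsigned long arg);   /* placeholder handler */

          #ifdef HAVE_UNLOCKED_IOCTL
          static long obd_class_unlocked_ioctl(struct file *filp, unsigned int cmd,
                                               unsigned long arg)
          {
                  /* no BKL is taken for ->unlocked_ioctl, so the handler must
                   * provide its own serialization where it needs it */
                  return obd_class_handle_ioctl(filp, cmd, arg);
          }
          #else
          static int obd_class_old_ioctl(struct inode *inode, struct file *filp,
                                         unsigned int cmd, unsigned long arg)
          {
                  /* legacy path: the kernel wraps this call in the BKL */
                  return obd_class_handle_ioctl(filp, cmd, arg);
          }
          #endif

          static const struct file_operations obd_psdev_fops_sketch = {
                  .owner          = THIS_MODULE,
          #ifdef HAVE_UNLOCKED_IOCTL
                  .unlocked_ioctl = obd_class_unlocked_ioctl,
          #else
                  .ioctl          = obd_class_old_ioctl,
          #endif
          };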

          pjones Peter Jones added a comment -

          As per Andreas, you probably mean bz 22980, rather than 22890. Yes please, can you attach your patch to the bz - thanks!


          niu Niu Yawei (Inactive) added a comment -

          Thanks for your good news, Ihara. It looks like the patch works as we expected.

          Hi, Andreas,

          The user-space semaphore used to protect the shmem is another contention source; however, it doesn't look as severe as the BKL taken for each ioctl. Should we post the patch to bug 22890 to see if it resolves the problem?

          BTW, I thought there aren't any Lustre ioctls that depend on the BKL, and that it's safe to introduce 'unlocked_ioctl'. Could you confirm that?

          ihara Shuichi Ihara (Inactive) added a comment - - edited

          Niu, sorry, it looks like something was wrong on the storage side when I ran the benchmark yesterday. Once I fixed the storage, I tried obdfilter-survey with your patches applied. The patches seem to fix the problem on the 24-core system, and the numbers are close to those with HT=off. Here are the results for 12 cores (HT=off) and 24 cores (HT=on).

          # 12 cores (HT=off), 4 OSSs, 56 OSTs (14OSTs per OSS)
          ost 56 sz 469762048K rsz 1024K obj   56 thr   56 write 3546.88 [  37.96,  70.86] read 7633.11 [ 124.88, 156.85] 
          ost 56 sz 469762048K rsz 1024K obj   56 thr  112 write 6420.31 [  91.91, 130.75] read 10121.79 [ 159.70, 202.60] 
          ost 56 sz 469762048K rsz 1024K obj   56 thr  224 write 9576.76 [ 125.84, 216.80] read 10444.91 [ 167.84, 216.79] 
          ost 56 sz 469762048K rsz 1024K obj   56 thr  448 write 10264.63 [  98.95, 207.61] read 10972.26 [ 150.68, 232.78] 
          ost 56 sz 469762048K rsz 1024K obj   56 thr  896 write 9842.69 [  91.91, 305.69] read 10896.16 [ 121.89, 330.57] 
          ost 56 sz 469762048K rsz 1024K obj   56 thr 1792 write 9613.51 [  28.96, 251.70] read 10792.37 [ 123.88, 277.50] 
          ost 56 sz 469762048K rsz 1024K obj   56 thr 3584 write 9597.46 [   0.00, 253.78] read 10698.87 [ 118.89, 271.75] 
          
          # 24 cores (HT=on), 4 OSSs, 56 OSTs (14OSTs per OSS)
          ost 56 sz 469762048K rsz 1024K obj   56 thr   56 write 3345.48 [  42.96,  66.94] read 6981.70 [ 102.91, 153.86] 
          ost 56 sz 469762048K rsz 1024K obj   56 thr  112 write 6327.40 [  88.92, 128.89] read 9826.28 [ 156.85, 208.80] 
          ost 56 sz 469762048K rsz 1024K obj   56 thr  224 write 9792.45 [ 139.87, 218.77] read 10409.23 [ 173.84, 303.70] 
          ost 56 sz 469762048K rsz 1024K obj   56 thr  448 write 10262.20 [ 106.90, 235.78] read 10903.93 [ 157.86, 253.79] 
          ost 56 sz 469762048K rsz 1024K obj   56 thr  896 write 9905.94 [  98.91, 233.78] read 10829.35 [ 127.88, 266.75] 
          ost 56 sz 469762048K rsz 1024K obj   56 thr 1792 write 9656.78 [   6.99, 251.79] read 10761.36 [ 115.89, 333.68] 
          ost 56 sz 469762048K rsz 1024K obj   56 thr 3584 write 9596.28 [   0.00, 261.76] read 10742.13 [ 119.89, 324.68] 
          
          

          adilger Andreas Dilger added a comment -

          See also https://bugzilla.lustre.org/show_bug.cgi?id=22980#c18 for a similar issue. I suspect that the performance bottleneck may be in userspace, but we can only find out with some oprofile and/or lockmeter data.


          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: ihara Shuichi Ihara (Inactive)
            Votes: 0
            Watchers: 3
