[LU-29] obdfilter-survey doesn't work well if cpu_cores (/w hyperT) > 16 Created: 22/Dec/10  Updated: 28/Jun/11  Resolved: 14/Feb/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: Lustre 1.8.6

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: Text File LU-29.patch     Text File bug22980-for-1.8.x.patch    
Severity: 3
Bugzilla ID: 22980
Rank (Obsolete): 8550

 Description   

It seems obdfilter-survey does not work well on a 12-core system (the OSS sees 24 cores when hyper-threading is on).
Here are quick results with 12, 6, and 8 cores on the same OSSs. For the 6- and 8-core runs, I turned CPUs off with "echo 0 > /sys/devices/system/cpu/cpuX/online" on the 12-core system (X5670, Westmere, 6 cores x 2 sockets).
Testing with 16 or fewer CPU cores seems fine, but with 24 cores it does not work well.
This has been discussed in bug 22980, but there is still no solution for running obdfilter-survey on a current Westmere box.

#TEST-1 4xOSSs, 56OSTs(14 OSTs per OSS), 12 cores (# of CPU cores is 24)
ost 56 sz 469762048K rsz 1024K obj 56 thr 56 write 3323.91 [ 39.96, 71.93] read 5967.91 [ 94.91, 127.93]
ost 56 sz 469762048K rsz 1024K obj 56 thr 112 write 5807.10 [ 72.93, 120.77] read 6182.79 [ 96.91, 140.86]
ost 56 sz 469762048K rsz 1024K obj 56 thr 224 write 6377.41 [ 75.93, 176.83] read 6193.18 [ 81.98, 139.86]
ost 56 sz 469762048K rsz 1024K obj 56 thr 448 write 6279.64 [ 69.93, 185.83] read 6162.43 [ 77.88, 162.86]
ost 56 sz 469762048K rsz 1024K obj 56 thr 896 write 6114.28 [ 9.99, 226.79] read 6017.08 [ 14.98, 220.80]
ost 56 sz 469762048K rsz 1024K obj 56 thr 1792 write 6078.08 [ 8.99, 285.73] read 5923.64 [ 16.98, 161.85]
ost 56 sz 469762048K rsz 1024K obj 56 thr 3584 write 6168.36 [ 76.92, 250.75] read 5828.33 [ 85.95, 174.77]

#TEST-2 4xOSSs, 56OSTs(14 OSTs per OSS), 6 cores (# of CPU cores is 12, all physical cpu_id=1 are turned off)
ost 56 sz 469762048K rsz 1024K obj 56 thr 56 write 3677.43 [ 36.97, 75.93] read 8355.91 [ 137.87, 168.85]
ost 56 sz 469762048K rsz 1024K obj 56 thr 112 write 7045.25 [ 89.92, 141.87] read 10672.33 [ 153.87, 212.80]
ost 56 sz 469762048K rsz 1024K obj 56 thr 224 write 9909.58 [ 116.88, 217.78] read 10235.82 [ 140.87, 203.83]
ost 56 sz 469762048K rsz 1024K obj 56 thr 448 write 9796.21 [ 106.90, 214.80] read 10803.78 [ 142.87, 348.93]
ost 56 sz 469762048K rsz 1024K obj 56 thr 896 write 9377.85 [ 54.95, 265.75] read 10700.27 [ 126.76, 279.74]
ost 56 sz 469762048K rsz 1024K obj 56 thr 1792 write 9257.48 [ 0.00, 384.63] read 10726.18 [ 121.87, 291.74]
ost 56 sz 469762048K rsz 1024K obj 56 thr 3584 write 9162.01 [ 0.00, 242.78] read 10627.94 [ 115.89, 271.74]

#TEST-3 4xOSSs, 56OSTs(14 OSTs per OSS), 8 cores (# of CPU cores is 16, core_id={2, 10} turned off on both sockets)
ost 56 sz 469762048K rsz 1024K obj 56 thr 56 write 3614.92 [ 43.96, 75.93] read 7919.40 [ 122.88, 169.84]
ost 56 sz 469762048K rsz 1024K obj 56 thr 112 write 6703.91 [ 71.94, 135.87] read 9899.53 [ 156.87, 201.81]
ost 56 sz 469762048K rsz 1024K obj 56 thr 224 write 9901.78 [ 123.88, 233.78] read 10401.05 [ 151.85, 202.81]
ost 56 sz 469762048K rsz 1024K obj 56 thr 448 write 9721.29 [ 115.89, 212.80] read 10812.26 [ 151.86, 241.54]
ost 56 sz 469762048K rsz 1024K obj 56 thr 896 write 9330.51 [ 94.91, 257.50] read 10672.22 [ 112.90, 342.66]
ost 56 sz 469762048K rsz 1024K obj 56 thr 1792 write 9053.42 [ 22.98, 263.75] read 10657.08 [ 95.91, 286.73]
ost 56 sz 469762048K rsz 1024K obj 56 thr 3584 write 9081.75 [ 45.96, 239.57] read 10562.43 [ 78.93, 270.75]



 Comments   
Comment by Liang Zhen (Inactive) [ 22/Dec/10 ]

Hi Ihara,

I'm a little confused by these data. I think your box has 2 x 6-core CPUs (seen as 24 cores with Hyper-threading), right? Could you please give me a simple list of performance data like this:

  • Hyper-threading OFF
    1) 2 cores
    2) 4 cores
    3) 6 cores
    4) 8 cores
    5) 10 cores
    6) 12 cores
  • Hyper-threading ON
    1) 4 cores
    2) 6 cores
    3) 8 cores
    4) 12 cores
    5) 16 cores
    6) 18 cores
    7) 20 cores
    8) 24 cores

I don't need many data samples; just an average value should be good enough. I somewhat suspect this could be an issue in the utility.

Thanks
Liang

Comment by Shuichi Ihara (Inactive) [ 22/Dec/10 ]

Liang,

Yes, I'm testing on an Intel 6-core x 2-socket box.
I can't turn Hyper-Threading off because I can't reboot right now, but I collected data with 4, 6, 8, 12... cores while HT is enabled. Here are the results.

Ran obdfilter-survey on a single OSS with 14 OSTs (obj=1, thr=64).

#core write read
4 2728.94 2672.93
6 2679.19 2669.31
8 2677.86 2663.86
12 2658.00 2660.31
16 2633.77 2650.97
18 2626.06 2653.20
20 2618.16 2649.80
22 2586.50 2620.15
24 1685.99 1575.12

The numbers drop sharply only at 24 cores.
Let me run the same test with HT=off later.

Comment by Shuichi Ihara (Inactive) [ 22/Dec/10 ]

Here are the results of the same test on the same box, but with HT=off.

#core write read
2 3019.27 2798.00
4 2983.61 2754.41
6 2914.59 2748.15
8 2897.25 2731.10
10 2877.80 2724.94
12 2896.62 2711.43

There is no big change as the number of cores varies, but another interesting thing is that the write numbers are better than the results with HT=on. Does HT generally hurt Lustre performance?

Comment by Liang Zhen (Inactive) [ 23/Dec/10 ]

Ihara, thanks for the data. Yes, I think hyper-threading will not help Lustre performance (on the server side), at least for any of the current releases.

I actually have a simple question: I assume you disabled those cores symmetrically, right? (i.e., for the 8-core test, 2 cores were disabled on the first socket and 2 on the second socket)

Comment by Liang Zhen (Inactive) [ 23/Dec/10 ]

Reassigning to Niu for the next step of the survey.

Comment by Niu Yawei (Inactive) [ 24/Dec/10 ]

obdfilter-survey calls 'lctl test_brw', which issues ioctls to the kernel; however, ioctl needs the BKL! Although we release the BKL in echo_client_iocontrol() before the I/O starts (and reacquire it after the I/O is done), the overhead of lock contention could be huge in our test scenario (dozens of cores and hundreds of processes).

I think we'd better support 'unlocked_ioctl' in the Lustre file_operations and then move all the performance-sensitive ioctls to 'unlocked_ioctl', OBD_IOC_BRW_READ/WRITE for example.
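As a rough illustration of that direction (a minimal sketch only, not the actual Lustre code; obd_ioctl_dispatch() and obd_class_old_ioctl() are hypothetical names standing in for the existing handler), the character device would register an unlocked_ioctl method on kernels that provide it, so OBD_IOC_BRW_READ/WRITE no longer serialize on the Big Kernel Lock:

    #include <linux/fs.h>
    #include <linux/module.h>

    /* Sketch: dispatch the command without lock_kernel()/unlock_kernel(). */
    static long obd_class_unlocked_ioctl(struct file *file, unsigned int cmd,
                                         unsigned long arg)
    {
            return obd_ioctl_dispatch(file, cmd, arg);   /* hypothetical helper */
    }

    static const struct file_operations obd_class_fops = {
            .owner          = THIS_MODULE,
    #ifdef HAVE_UNLOCKED_IOCTL
            .unlocked_ioctl = obd_class_unlocked_ioctl,  /* BKL-free path */
    #else
            .ioctl          = obd_class_old_ioctl,       /* legacy path, BKL held */
    #endif
    };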

Hi, Shuichi

I'll make a patch based on the above analysis. If it's convenient for you, could you collect some statistics with oprofile (or, even better, lockmeter) to confirm my analysis? Thank you.

Comment by Shuichi Ihara (Inactive) [ 24/Dec/10 ]

Yes, for the 8-core test I turned off two cores on each socket.
Niu, I'm happy to test your patches on our test box; please let me know what profile you want.

Ihara

Comment by Niu Yawei (Inactive) [ 29/Dec/10 ]

Adding "unlocked_ioctl" for preformance sensitive ioctls, such as "OBD_IOC_BRW_READ/WRITE"

Comment by Niu Yawei (Inactive) [ 29/Dec/10 ]

Hi, Ihara

Sorry for the late response, I took a few days off for personal reasons.

I've made a patch that tries to resolve this problem; it's available at http://review.whamcloud.com/163 (I also attached it here for your convenience). Please try this patch.

Comment by Shuichi Ihara (Inactive) [ 30/Dec/10 ]

Niu, thanks for the patch. Let me try it at the end of next week because of the New Year holiday.
I will let you know whether it works well or not.

Comment by Niu Yawei (Inactive) [ 30/Dec/10 ]

The original patch had a defect; I've updated it with a new one.

Comment by Shuichi Ihara (Inactive) [ 06/Jan/11 ]

Niu,

I just tested your latest patch, but the obdfilter-survey result is still low on 24 cores. Here are the results.

12 cores (HT=disabled)
ost 56 sz 469762048K rsz 1024K obj 56 thr 896 write 9871.99 [ 85.75, 229.56] read 10802.02 [ 125.88, 309.74]

24 cores (HT=enabled)
ost 56 sz 469762048K rsz 1024K obj 56 thr 896 write 6076.08 [ 21.98, 557.93] read 5614.03 [ 12.98, 748.07]

Comment by Shuichi Ihara (Inactive) [ 06/Jan/11 ]

btw, I've been testing this on lustre-1.8.4, so I made some code adjustments to http://review.whamcloud.com/163 for lustre-1.8.

Ihara

Comment by Shuichi Ihara (Inactive) [ 06/Jan/11 ]

adjusted patch for 1.8.x

Comment by Niu Yawei (Inactive) [ 06/Jan/11 ]

Thank you, Ihara.

Could you run a full test and post all the output (like you did in the first comment) to see if there are any differences?

I suspect some other contention is dragging the performance down; could you use oprofile to collect some data while running the test?

btw, what's the kernel version?

Comment by Andreas Dilger [ 07/Jan/11 ]

See also https://bugzilla.lustre.org/show_bug.cgi?id=22980#c18 for a similar issue. I suspect that the performance bottleneck may be in userspace, but we can only find out with some oprofile and/or lockmeter data.

Comment by Shuichi Ihara (Inactive) [ 07/Jan/11 ]

Niu, sorry, it looks like something was wrong on the storage side when I ran the benchmark yesterday. Once I fixed the storage, I tried obdfilter-survey with your patches applied. The patches seem to fix the problem on the 24-core system, and the numbers are close to the HT=off case. Here are the results on 12 cores (HT=off) and 24 cores (HT=on).

# 12 cores (HT=off), 4 OSSs, 56 OSTs (14OSTs per OSS)
ost 56 sz 469762048K rsz 1024K obj   56 thr   56 write 3546.88 [  37.96,  70.86] read 7633.11 [ 124.88, 156.85] 
ost 56 sz 469762048K rsz 1024K obj   56 thr  112 write 6420.31 [  91.91, 130.75] read 10121.79 [ 159.70, 202.60] 
ost 56 sz 469762048K rsz 1024K obj   56 thr  224 write 9576.76 [ 125.84, 216.80] read 10444.91 [ 167.84, 216.79] 
ost 56 sz 469762048K rsz 1024K obj   56 thr  448 write 10264.63 [  98.95, 207.61] read 10972.26 [ 150.68, 232.78] 
ost 56 sz 469762048K rsz 1024K obj   56 thr  896 write 9842.69 [  91.91, 305.69] read 10896.16 [ 121.89, 330.57] 
ost 56 sz 469762048K rsz 1024K obj   56 thr 1792 write 9613.51 [  28.96, 251.70] read 10792.37 [ 123.88, 277.50] 
ost 56 sz 469762048K rsz 1024K obj   56 thr 3584 write 9597.46 [   0.00, 253.78] read 10698.87 [ 118.89, 271.75] 

# 24 cores (HT=on), 4 OSSs, 56 OSTs (14OSTs per OSS)
ost 56 sz 469762048K rsz 1024K obj   56 thr   56 write 3345.48 [  42.96,  66.94] read 6981.70 [ 102.91, 153.86] 
ost 56 sz 469762048K rsz 1024K obj   56 thr  112 write 6327.40 [  88.92, 128.89] read 9826.28 [ 156.85, 208.80] 
ost 56 sz 469762048K rsz 1024K obj   56 thr  224 write 9792.45 [ 139.87, 218.77] read 10409.23 [ 173.84, 303.70] 
ost 56 sz 469762048K rsz 1024K obj   56 thr  448 write 10262.20 [ 106.90, 235.78] read 10903.93 [ 157.86, 253.79] 
ost 56 sz 469762048K rsz 1024K obj   56 thr  896 write 9905.94 [  98.91, 233.78] read 10829.35 [ 127.88, 266.75] 
ost 56 sz 469762048K rsz 1024K obj   56 thr 1792 write 9656.78 [   6.99, 251.79] read 10761.36 [ 115.89, 333.68] 
ost 56 sz 469762048K rsz 1024K obj   56 thr 3584 write 9596.28 [   0.00, 261.76] read 10742.13 [ 119.89, 324.68] 

Comment by Niu Yawei (Inactive) [ 07/Jan/11 ]

Thanks for the good news, Ihara. It looks like the patch works as we expected.

Hi, Andreas

The user-space semaphore used to protect the shmem is another source of contention; however, it does not look as severe as the BKL taken on each ioctl. Should we post the patch to bug 22890 to see if it resolves the problem?

BTW, I think no Lustre ioctl depends on the BKL, so it should be safe to introduce 'unlocked_ioctl'. Could you confirm that?

Comment by Peter Jones [ 19/Jan/11 ]

As per Andreas, you probably mean bz 22980, rather than 22890. Yes please, can you attach your patch to the bz - thanks!

Comment by Andreas Dilger [ 19/Jan/11 ]

I don't think there are any ioctls that depend on BKL, but I haven't looked through them closely. In particular, I'm not sure if there is proper serialization around the configuration ioctls or not.

That said, since the configuration is almost always done by mount/unmount and not by the old lctl commands, I don't think this will be a serious risk, so I think it makes sense to move the Lustre ioctl handling over to ->unlocked_ioctl(). That should be done only for kernels which support the ->unlocked_ioctl() method, which means a configure check is needed to set HAVE_UNLOCKED_IOCTL if that method is present in struct file_operations.
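A sketch of the kind of compile test such a configure check might build against the target kernel headers (purely illustrative, not the actual Lustre autoconf macro): if this fragment compiles, struct file_operations has an unlocked_ioctl member and HAVE_UNLOCKED_IOCTL can be defined.

    #include <linux/fs.h>

    /* Configure-time probe: compiles only if struct file_operations
     * provides the unlocked_ioctl member. */
    static long conftest_ioctl(struct file *file, unsigned int cmd,
                               unsigned long arg)
    {
            return 0;
    }

    static const struct file_operations conftest_fops = {
            .unlocked_ioctl = conftest_ioctl,
    };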

Comment by Niu Yawei (Inactive) [ 19/Jan/11 ]

Yes, I meant 22980, thanks Peter.

Andreas, HAVE_UNLOCKED_IOCTL is defined by the kernel itself when it has the 'unlocked_ioctl' method.

Comment by Shuichi Ihara (Inactive) [ 25/Jan/11 ]

Niu, I'm setting up a test infrastructure on VMs (KVM: Kernel-based Virtual Machine). Once I apply your patch and run obdfilter-survey on a VM, the performance gets worse. Without the patch, I get reasonable numbers even on VMs. So the patch seems to have some impact when I run obdfilter-survey on a VM.

I will post the results and more information (I will get oprofile data on the VM) in a couple of days.

Ihara

Comment by Niu Yawei (Inactive) [ 25/Jan/11 ]

Ihara, that's interesting. I don't have any ideas on that so far; let's see what happened once your results come out. Thanks.

Comment by Shuichi Ihara (Inactive) [ 07/Feb/11 ]

Niu, it was my fault, sorry. The problem was NOT caused by these patches. It came from the CPU affinity settings on the VM; maybe a lot of context switching happened?
Anyway, once I set the CPU affinity correctly on the VM, the patches work just as well as on the 24-core system.

Thanks!

Comment by Andreas Dilger [ 09/Feb/11 ]

I'm currently unable to post to bugzilla...

Inspection template(s):
Bug: 22980
Developer: niu@whamcloud.com
Size: 7 Lines of Change
Date: 2011-2-8
Defects: 1
Type: CODE
Inspector: adilger@whamcloud.com

--------------
>@@ -1681,24 +1680,15 @@ int jt_obd_test_brw(int argc, char **argv)
> } else if (be_verbose(verbose, &next_time,i, &next_count,count)) {
>- shmem_lock ();
> printf("%s: %s number %d @ "LPD64":"LPU64" for %d\n",
> jt_cmdname(argv[0]), write ? "write" : "read", i,
> data.ioc_obdo1.o_id, data.ioc_offset,
> (int)(pages * getpagesize()));
>- shmem_unlock ();

I would be surprised if the locking here affects performance. be_verbose()
should be true at most once every few seconds, and otherwise the shmem_lock()/unlock()
is never hit. I think this was put in place to keep the printf()
statements from overlapping, which would ruin the whole output of the test. If
there actually IS overhead from this locking, it just means that the message
rate is too high and needs to be reduced.

>@@ -1622,20 +1622,19 @@ int jt_obd_test_brw(int argc, char **argv)
>
> #ifdef MAX_THREADS
> if (thread) {
>- shmem_lock ();
> if (nthr_per_obj != 0) {
> /* threads interleave */
> obj_idx = (thread - 1)/nthr_per_obj;
> objid += obj_idx;
> stride *= nthr_per_obj;
>- if ((thread - 1) % nthr_per_obj == 0)
>- shared_data->offsets[obj_idx] = stride + thr_offset;
> thr_offset += ((thread - 1) % nthr_per_obj) * len;
> } else {
> /* threads disjoint */
> thr_offset += (thread - 1) * len;
> }
>
>+ shmem_lock ();
>+
> shared_data->barrier--;
> if (shared_data->barrier == 0)
> l_cond_broadcast(&shared_data->cond);
> if (!repeat_offset) {
> #ifdef MAX_THREADS
>- if (stride == len) {
>- data.ioc_offset += stride;
>- } else if (i < count) {
>- shmem_lock ();
>- data.ioc_offset = shared_data->offsets[obj_idx];
>- shared_data->offsets[obj_idx] += len;
>- shmem_unlock ();
>- }
>+ data.ioc_offset += stride;

(defect) I don't think this is going to result in the same test load at all.
It means that only "len/stride" fraction of each object is written, and in
fact it looks like there will be holes in every object because the common
data->ioc_offset is being incremented by every thread in a racy manner so the
offset will get large too quickly.

What about adding per-object shmem_locks to protect offsets[] values? That
would avoid most of the contention on this lock, if that is the overhead.
However, like previously stated, I think it is best to get some real data
(e.g. oprofile for kernel and gprof for userspace, collected over two different
test runs to avoid too much overhead).
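A minimal user-space sketch of the per-object locking idea (the structure layout, MAX_OBJECTS, and function names are illustrative assumptions, not the actual lctl shared-memory code): each object gets its own process-shared mutex, so only threads writing the same object contend when they claim the next offset.

    #include <pthread.h>

    #define MAX_OBJECTS 1024    /* illustrative upper bound on objects */

    struct shared_data_sketch {
            pthread_mutex_t    obj_locks[MAX_OBJECTS];  /* one lock per object */
            unsigned long long offsets[MAX_OBJECTS];    /* next offset per object */
    };

    /* Initialise the per-object locks as process-shared, since the
     * worker threads live in forked processes sharing this segment. */
    static void init_obj_locks(struct shared_data_sketch *sd)
    {
            pthread_mutexattr_t attr;
            int i;

            pthread_mutexattr_init(&attr);
            pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
            for (i = 0; i < MAX_OBJECTS; i++)
                    pthread_mutex_init(&sd->obj_locks[i], &attr);
            pthread_mutexattr_destroy(&attr);
    }

    /* Reserve the next chunk of 'len' bytes in object 'obj_idx' and return
     * its starting offset; contention is limited to threads on that object. */
    static unsigned long long claim_offset(struct shared_data_sketch *sd,
                                           int obj_idx, unsigned long long len)
    {
            unsigned long long off;

            pthread_mutex_lock(&sd->obj_locks[obj_idx]);
            off = sd->offsets[obj_idx];
            sd->offsets[obj_idx] += len;
            pthread_mutex_unlock(&sd->obj_locks[obj_idx]);
            return off;
    }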

Comment by Niu Yawei (Inactive) [ 09/Feb/11 ]

Thank you, Andreas.

Yes, I agree that we'd better collect some oprofile data as the next step; I will have a meeting with Bull's people tonight about b22980.

The test results show that the shmem lock isn't a major factor, so I think we can put it aside for a while. One thing that confuses me is that the 'unlocked_ioctl' patch works for Ihara's test but doesn't work well for b22980. Liang discussed this issue with me yesterday, and we identified three differences between the two tests:

  • Ihara's test is against 1.8, b22980 is against 2.0 (I checked the obdecho code; there seems to be no major difference between 1.8 and 2.0 in 'case=disk' mode);
  • With/without the patch applied, Ihara compared 12 cores with 24 cores, while b22980 compared 8 cores with 32 cores (we will ask Bull's people to do more tests with the patch applied: 16 cores, 24 cores...);
  • This test is running on SMP (Ihara, please correct me if I'm wrong), but b22980 is running on NUMA (Liang mentioned that without CPU affinity the performance degradation can be huge on a NUMA architecture; we will ask them to do more tests to see if it's NUMA-dependent).

We will talk to Bull's people about what we found and ask them to supply some oprofile data in the next test. If you have any comments, please let me know.

Comment by Liang Zhen (Inactive) [ 09/Feb/11 ]

As Niu said, the major difference between this issue and b22980 is that b22980 is running on a NUMIOA system.
Diego mentioned (on b22980) that a Lustre client can drive the OSS harder than obdfilter-survey can. I think that could be because the OST I/O threads are NUMA-affine (ptlrpc_service::srv_cpu_affinity), while the lctl threads of obdfilter-survey have no affinity at all, so they probably generate much more cross-node traffic.
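For reference, a minimal sketch of what pinning an obdfilter-survey worker to the cores of a single NUMA node could look like (the contiguous cores-per-node layout is an assumption, and this is plain sched_setaffinity, not existing obdfilter-survey code):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Bind the calling thread/process to the cores of one NUMA node,
     * assuming nodes own contiguous core ranges (e.g. 8 cores per node). */
    static int bind_to_numa_node(int node, int cores_per_node)
    {
            cpu_set_t set;
            int c;

            CPU_ZERO(&set);
            for (c = node * cores_per_node; c < (node + 1) * cores_per_node; c++)
                    CPU_SET(c, &set);

            if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* 0 = self */
                    perror("sched_setaffinity");
                    return -1;
            }
            return 0;
    }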

I hope Bull can collect some data for us (e.g., if each NUMA node has 8 cores):

  • only enable 1/2/3/4 NUMA nodes and run obdfilter-survey
  • enable 8/16/24/32 cores, but distribute these cores across different NUMA nodes

If we get quite different performance, then I think our assumption is correct; otherwise we are pointing at the wrong place. Of course, we should collect oprofile data while running these tests.

Comment by Peter Jones [ 09/Feb/11 ]

Am I right in thinking that the NUMIOA issue is being tracked under LU-66 and this ticket can be marked as resolved? Bugzilla seems to be available again btw.

Comment by Shuichi Ihara (Inactive) [ 09/Feb/11 ]

I believe so. In my case, obdfilter-survey didn't work well on a 24-core system (actually 12 cores, but 24 with HT=on), and this is NOT a NUMIOA platform.
I confirmed this problem was fixed by Niu's patches here, and obdfilter-survey worked well with 24 cores on the non-NUMIOA platform.

btw, I also have a NUMIOA system, but it is currently used for VMs, splitting the CPUs and I/O chips with KVM (Kernel-based Virtual Machine). However, I'm really interested in LU-66. I'll keep an eye on it, and I want to test patches on my system when they are available.

Thanks again!

Comment by Peter Jones [ 14/Feb/11 ]

Marking as resolved; the fix landed upstream for Oracle Lustre 1.8.6.
