
LU-66: obdfilter-survey performance issue on NUMA system

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.1.0
    • Affects Version/s: Lustre 2.1.0
    • Labels: None
    • Bugzilla bug: 22980
    • 8541

    Description

      This is just a copy of bug 22980, but I think it is better to track and discuss it here:

      Hello,

      While testing our new I/O servers we have run into an issue with obdfilter-survey. Our OSSs are based
      on 4 Nehalem-EX processors connected to a Boxboro chipset; every socket has 6 cores. For every OST we
      have several FC channels connected to our storage bay.

      When we perform raw tests with sgpdd-survey over 24 LUNs, we get ~4400 MB/s on write and more than
      5500 MB/s on read.

      Then, if we start a Lustre filesystem and test these 24 OSTs with obdfilter-survey (size=24192
      rszlo=1024 rszhi=1024 nobjlo=1 nobjhi=2 thrlo=1 thrhi=16 case=disk tests_str="write read" sh
      obdfilter-survey), we always hit a performance limit of 1200 MB/s for both write and read.

      If we perform IOzone tests from five clients (2 threads per client, connected to the server over
      InfiniBand), we get more than 2500 MB/s.

      Then we took two sockets offline, using the command "echo 0 > /sys/devices/system/cpu/cpu5/online" on
      every CPU belonging to those two sockets, and we got the expected results from obdfilter-survey
      (4600 MB/s on write and 5500 MB/s on read). If we only take one socket offline, obdfilter-survey gives
      us a maximum of 1600 MB/s. Using only one socket, the results are slightly worse than with two sockets.

      We have also run these tests with Lustre 1.6, with other storage bays and on similar platforms (4
      sockets and 8 CPUs per socket), always with the same kind of problem. If we enable hyper-threading on
      every socket, performance is even worse.

      It looks as if obdfilter-survey hits some kind of saturation when there are many sockets. What do you
      think? Thanks,

      Attachments

        1. affinity_map
          0.2 kB
        2. affinity_results.tgz
          465 kB
        3. bull_obdfilter_survey_chart_110309.pdf
          65 kB
        4. bull_obdfilter_survey_chart_110319.pdf
          66 kB
        5. full_results_kmalloc.tgz
          346 kB
        6. full_results.tgz
          727 kB
        7. lctl_setaffinity_v2.patch
          4 kB
        8. new_results_kmalloc.tgz
          78 kB
        9. obdfilter-survey_results.txt
          17 kB
        10. remove_vmalloc.patch
          3 kB

        Activity


          niu Niu Yawei (Inactive) added a comment:

          Hi, Sebastien

          Ah, right. Binding the object id to a CPU doesn't make sense for this test, so I've changed the patch to bind devno to a CPU (lctl_setaffinity_v2.patch), and I also updated the example affinity_map. Please use the new patch to run the test that I mentioned in my previous comment. Thanks for your effort!


          sebastien.buisson Sebastien Buisson (Inactive) added a comment:

          If I understand correctly, in order to know the objids in advance I should look at the obdfilter_survey_xxxx.detail file and assume that the next run will do '+1' on the ids.
          The problem is that obdfilter_survey_xxxx.detail contains the following:

          =======================> ost 15 sz 314572800K rsz 1024K obj 15 thr 15
          =============> Create 1 on localhost:quartcel-OST0005_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST0008_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST0007_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST000c_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST0000_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST000d_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST0004_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST0006_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST0002_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST000a_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST0009_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST000b_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST000e_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST0001_ecc
          create: 1 objects
          create: #1 is object id 0x29
          =============> Create 1 on localhost:quartcel-OST0003_ecc
          create: 1 objects
          create: #1 is object id 0x29

          So, as you can see, all the objects have the same id on every OST... In that case, I am afraid the 'objid to core' mapping is useless.
          Unless I manually create new objects on the OSTs so that the objids are different everywhere?

          Sebastien.


          niu Niu Yawei (Inactive) added a comment:

          Hi, Sebastien

          The results look really good; I think this is basically what we expected, thank you.

          One thing still unexplained is why the write performance dropped a lot at 960 threads. To measure how CPU affinity affects the test results, could you help us run a few more tests? I think it will be useful for our further performance tuning work.

          What I want to test is:

          • apply the "remove BKL" + "kmalloc" + "lctl_setaffinity" patches;
          • run the test in "objid" mode, with 4 sockets enabled and without oprofile enabled;
          • provide the results, the numastat output and the /tmp/obdfilter_survey_xxxx.detail file (where the thread/object CPU mapping is logged).

          In "objid" mode, each lctl thread will be mapped to a specified CPU, so you need to know all the objids before running the tests and set the objid-to-CPU mapping in /tmp/affinity_map (please refer to the affinity_map example); of course, each objid should be on the local IOH of its mapped CPU.
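          As a rough illustration of what such an affinity patch presumably does, here is a minimal user-space sketch of pinning the calling thread to one CPU before it starts issuing test I/O. The helper name is hypothetical and the real devno/objid-to-CPU logic lives in the attached lctl_setaffinity_v2.patch:

          #define _GNU_SOURCE
          #include <sched.h>
          #include <stdio.h>

          /* Pin the calling thread to a single CPU so that its buffers and I/O
           * submissions stay on the NUMA node local to the target device.
           * The cpu argument would come from the /tmp/affinity_map lookup. */
          static int bind_self_to_cpu(int cpu)
          {
                  cpu_set_t mask;

                  CPU_ZERO(&mask);
                  CPU_SET(cpu, &mask);
                  if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
                          perror("sched_setaffinity");
                          return -1;
                  }
                  return 0;
          }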


          sebastien.buisson Sebastien Buisson (Inactive) added a comment:

          New results with the unlocked_ioctl and remove_vmalloc patches. In the tarball please find:

          • result_*.txt: results for a specific obdfilter-survey test, together with 'numastat' output
          • opreport_*.txt: associated oprofile data
          • sgpdd_res.txt: sgpdd_survey results

          I am sorry, I was not able to get results for 3, 2 and 1 sockets. I launched the tests several times, and each time the server crashed. It seems the system does not appreciate running oprofile with only some of the sockets enabled...

          The sgpdd_survey results clearly show a limit of around 3 GB/s. This limitation is due to the available bandwidth to the storage, because we use only 4 FC links.
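          As a rough sanity check, under the assumption of 8 Gbit/s FC links (the link speed is not stated above): 4 links x roughly 800 MB/s of usable payload bandwidth each gives about 3.2 GB/s, which is consistent with the ~3 GB/s ceiling reported by sgpdd_survey.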


          liang Liang Zhen (Inactive) added a comment:

          I agree that we don't need to run the affinity tests, because numastat shows that foreign memory access is not a big issue (< 5%). However, I do think that we should increase the size (probably 5x) so we can get a better picture.
          Sebastien, could you please help us run:

          • increase size (5X)
          • only run with the patches applied (kmalloc patch and remove-BKL patch)
          • only run with 1, 2, 3, 4 sockets (no need to iterate over 8, 16, 24 cores)
          • if possible, could you give us sgp-dd results on the same hardware, so we can see whether there is anything else we can improve.

          Thanks
          Liang


          niu Niu Yawei (Inactive) added a comment:

          I don't think we need to run the affinity tests, thank you.


          sebastien.buisson Sebastien Buisson (Inactive) added a comment:

          You're welcome.

          The storage array we are attached to should not give us more than 5 GB/s (read and write). So I think the figures reported by obdfilter-survey are inaccurate because the test does not run long enough. Maybe I should increase size.

          Do you still need me to run affinity tests?

          Cheers,
          Sebastien.


          niu Niu Yawei (Inactive) added a comment:

          Thanks for your testing, Sebastien.

          The results show that both read and write performance improved hugely, and the oprofile data looks normal this time. So I think the degradation was caused by contention on the BKL and on vmap_area_lock. What I don't understand is why the read throughput is extremely high in some cases (more than 10000 MB/s); what is the raw bandwidth of each OST?


          sebastien.buisson Sebastien Buisson (Inactive) added a comment:

          Full obdfilter-survey results with the unlocked_ioctl and remove_vmalloc patches. In the tarball please find:

          • summary.txt: a table summing up the test results
          • result_*.txt: results for a specific test, together with 'numastat' output
          • opreport_*.txt: associated oprofile data

          niu Niu Yawei (Inactive) added a comment:

          Changed vmalloc to kmalloc in the ioctl path. (The previous patch isn't correct; it has been updated with this one.)
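          For illustration only, a sketch of the kind of change this refers to; it is not the attached remove_vmalloc.patch, and the helper names here are hypothetical. The idea is that the ioctl buffer comes from the slab allocator instead of the vmalloc area, so concurrent ioctls no longer serialize on the global vmap_area_lock:

          #include <linux/slab.h>

          /* Hypothetical ioctl buffer helpers.  kmalloc() works out of per-cpu
           * slab caches, whereas vmalloc() takes the global vmap_area_lock on
           * every allocation, which becomes a bottleneck when many lctl threads
           * issue ioctls at once.  The trade-off is that kmalloc() needs
           * physically contiguous pages, so very large buffers may fail under
           * memory fragmentation. */
          static void *ioc_buf_alloc(size_t len)
          {
                  return kmalloc(len, GFP_KERNEL);  /* may sleep; fine in ioctl context */
          }

          static void ioc_buf_free(void *buf)
          {
                  kfree(buf);
          }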


          niu Niu Yawei (Inactive) added a comment:

          Hi, Sebastien

          The oprofile data you provided is very helpful. In the unpatched tests we can see that thread_return() ranks extremely high, which I think is caused by contention on the BKL; in the patched tests (with unlocked_ioctl) we can see that alloc_vmap_area() and find_vmap_area() rank very high, which I think is caused by contention on vmap_area_lock.

          I made a patch (remove_vmalloc.patch) which changes vmalloc() to kmalloc() in the ioctl path, which should eliminate the contention on vmap_area_lock. Before you run the tests I suggested in my last comment, I would really like you to run with this patch first (together with the unlocked_ioctl patch) to see what happens (of course, please enable oprofile while running the tests). Thank you.
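          To make the BKL point concrete, here is a generic sketch of the .ioctl to .unlocked_ioctl conversion being discussed; this is not the actual Lustre patch and the handler names are hypothetical. The legacy .ioctl entry point is called with the Big Kernel Lock held, so every concurrent lctl ioctl serializes on it, while .unlocked_ioctl is entered without the BKL:

          #include <linux/fs.h>
          #include <linux/module.h>

          /* Hypothetical common handler that does the real work. */
          static long obd_handle_ioctl(struct file *file, unsigned int cmd,
                                       unsigned long arg)
          {
                  return 0;       /* stub for the sketch */
          }

          /* Before (shown only for contrast): the legacy .ioctl entry point is
           * invoked with the Big Kernel Lock held on older 2.6 kernels, so all
           * ioctls are serialized system-wide. */
          static int obd_ioctl_bkl(struct inode *inode, struct file *file,
                                   unsigned int cmd, unsigned long arg)
          {
                  return obd_handle_ioctl(file, cmd, arg);
          }

          /* After: .unlocked_ioctl is entered without the BKL, so independent
           * lctl threads can enter the driver concurrently and it takes only
           * the locks it really needs. */
          static long obd_ioctl_nolock(struct file *file, unsigned int cmd,
                                       unsigned long arg)
          {
                  return obd_handle_ioctl(file, cmd, arg);
          }

          static const struct file_operations obd_fops = {
                  .owner          = THIS_MODULE,
                  .unlocked_ioctl = obd_ioctl_nolock,
          };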


          People

            niu Niu Yawei (Inactive)
            liang Liang Zhen (Inactive)
            Votes: 0
            Watchers: 8

            Dates

              Created:
              Updated:
              Resolved: