[LU-11415] ksocklnd performance improvement on 40Gbps ethernet Created: 21/Sep/18 Updated: 08/Apr/19 Resolved: 04/Jan/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.1 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Jinshan Xiong | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Description |
|
Recently I have been benchmarking newly set-up Lustre servers on a 40Gbps Ethernet network. Jumbo frames are enabled and the MTU is set to 9000 on both the client and server NICs. The connection between client and server is very simple: they sit under the same TOR switch, with no routers in between. First I used iperf3 to verify the throughput between client and server; it is stable at 30~32 Gb/s in either direction. However, when I launch LNet selftest, I usually see much lower throughput than with iperf3, about ~2500 MiB/s. After speaking with Amir and Doug, I monitored the ksocklnd threads on both the client and the server. The problem we are seeing is that when LNet selftest performs a read test, only one ksocklnd thread consumes 100% CPU time while the other threads take no workload at all; the write test is similar, except that there it is a single server ksocklnd thread that is busy. The workload does not seem to spread out across all threads in the pool.
It is possible that a single thread is enough to handle all the traffic, so there is no need to push work out to the other threads, but it is also possible that there is a scheduling problem in the ksocklnd implementation. Doug mentioned that o2iblnd spreads its workload well.
|
| Comments |
| Comment by Jinshan Xiong [ 21/Sep/18 ] |
|
Typically this is what I have seen from the server side when doing a write test: {{top - 21:45:16 up 21:57, 1 user, load average: 11.02, 7.19, 5.00}} |
| Comment by Amir Shehata (Inactive) [ 17/Nov/18 ] |
|
I looked at the socklnd scheduling code and it is very similar to the o2iblnd code. There is a scheduler created per CPT, with a thread created for each CPU in the CPT (if the number of threads is not explicitly configured). When creating a connection, the CPT is derived from the peer NID using the lnet_cpt_of_nid() hashing function. The CPT is used to grab the appropriate scheduler, which is then assigned to the connection. All operations on the connection use the assigned scheduler, which means that if we are running a 1-1 test, the same scheduler will always be used. If there are multiple threads in the scheduler then we should round-robin over them. In the MLX o2iblnd case most of the work is offloaded to the hardware; in the OPA case I believe the HFI driver has its own set of threads that do the work. But in socklnd all the work is done in the socklnd scheduler thread, which causes that thread to consume a lot of CPU if it is the only thread in the scheduler. In the test above, how many CPUs are in each CPT? |
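To make the scheduling model above concrete, here is a minimal standalone sketch. It is not the actual Lustre code: the struct names, constants, and the cpt_of_nid()/choose_sched_thread() helpers are simplified stand-ins for lnet_cpt_of_nid() and ksocknal_choose_scheduler_locked(), showing how a peer NID maps to a CPT and how a new connection is then bound to the least-loaded scheduler thread in that CPT.

```c
#include <stdio.h>
#include <stdint.h>

#define NCPTS            2    /* e.g. one CPT per NUMA node */
#define THREADS_PER_CPT  12   /* e.g. one scheduler thread per CPU in the CPT */

/* Simplified stand-in for the per-CPT ksocklnd scheduler info. */
struct sched_info {
	int nthreads;
	int nconns[THREADS_PER_CPT];  /* connections bound to each thread */
};

static struct sched_info scheds[NCPTS];

/* Stand-in for lnet_cpt_of_nid(): hash the peer NID onto a CPT. */
static int cpt_of_nid(uint64_t nid)
{
	return (int)(nid % NCPTS);
}

/*
 * Stand-in for ksocknal_choose_scheduler_locked(): within the CPT's
 * scheduler, pick the thread that currently has the fewest connections.
 * Note that this choice happens once, at connection creation time.
 */
static int choose_sched_thread(int cpt)
{
	struct sched_info *info = &scheds[cpt];
	int best = 0;

	for (int i = 1; i < info->nthreads; i++)
		if (info->nconns[i] < info->nconns[best])
			best = i;
	return best;
}

int main(void)
{
	uint64_t peer_nid = 0x0a0b0c0dULL;  /* hypothetical peer NID */

	for (int cpt = 0; cpt < NCPTS; cpt++)
		scheds[cpt].nthreads = THREADS_PER_CPT;

	int cpt = cpt_of_nid(peer_nid);
	int thread = choose_sched_thread(cpt);
	scheds[cpt].nconns[thread]++;

	/* A 1-1 selftest creates a single connection, so all of its
	 * traffic is serviced by this one scheduler thread. */
	printf("peer %#llx -> CPT %d, scheduler thread %d\n",
	       (unsigned long long)peer_nid, cpt, thread);
	return 0;
}
```

Because the choice is made only once per connection, a 1-1 selftest run drives all of its traffic through the single thread selected here.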
| Comment by Jinshan Xiong [ 17/Nov/18 ] |
|
Everything is at its defaults; there are no settings in `lnet.conf` other than specifying the NIC for Lustre. This node has 2 NUMA nodes with 24 cores. What would be the recommended CPT configuration? |
| Comment by Amir Shehata (Inactive) [ 19/Nov/18 ] |
|
How many socklnd threads were started? |
| Comment by Jinshan Xiong [ 20/Nov/18 ] |
|
These are all the threads: # ps ax | grep sock |
| Comment by Amir Shehata (Inactive) [ 21/Nov/18 ] |
|
I think I found the issue in the code, in socklnd.c:ksocknal_choose_scheduler_locked():

```c
select_sched:
	sched = &info->ksi_scheds[0];
	/*
	 * NB: it's safe so far, but info->ksi_nthreads could be changed
	 * at runtime when we have dynamic LNet configuration, then we
	 * need to take care of this.
	 */
	CDEBUG(D_NET, "info->ksi_nthreads = %d \n", info->ksi_nthreads);
	for (i = 1; i < info->ksi_nthreads; i++) {
		CDEBUG(D_NET, "sched->kss_nconns = %d info->ksi_scheds[%d].kss_nconns = %d\n",
		       sched->kss_nconns, i, info->ksi_scheds[i].kss_nconns);
		if (sched->kss_nconns > info->ksi_scheds[i].kss_nconns)
			sched = &info->ksi_scheds[i];
	}

	return sched;
```

When running an lnet_selftest script, the connection is created once and used for the duration of the run, so the same scheduler is used continuously for that connection. This algorithm works when there are multiple connections, since each connection will land on a different scheduler thread, but if most of the traffic is between the same two peers we never change the scheduler thread and we run into this bottleneck. It seems we ought to balance the traffic not only at connection creation time, but also when receiving a new message. I'm going to create a patch within the next day or so. Would you be able to test it and see if it performs better? |
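Purely to illustrate the idea of balancing at receive time, here is a minimal standalone sketch; it is not the patch that was later pushed to Gerrit, and the struct and function names are hypothetical. A per-message round-robin over the CPT's scheduler threads would look roughly like this:

```c
#include <stdio.h>

#define NTHREADS 12  /* scheduler threads in the connection's CPT */

/* Hypothetical per-CPT scheduler state. */
struct cpt_sched {
	unsigned int rr_cursor;            /* round-robin cursor */
	unsigned long handled[NTHREADS];   /* messages handled per thread */
};

/*
 * Instead of binding a connection to one thread forever at creation
 * time, pick the next thread in round-robin order for each incoming
 * message, so even a single busy connection spreads its work.
 */
static int pick_thread_for_msg(struct cpt_sched *sched)
{
	return sched->rr_cursor++ % NTHREADS;
}

int main(void)
{
	struct cpt_sched sched = { 0 };

	/* Simulate 1000 messages arriving on one connection. */
	for (int i = 0; i < 1000; i++)
		sched.handled[pick_thread_for_msg(&sched)]++;

	for (int i = 0; i < NTHREADS; i++)
		printf("thread %2d handled %lu messages\n", i, sched.handled[i]);
	return 0;
}
```

Run as-is, the sketch spreads the 1000 simulated messages evenly over the 12 threads instead of piling them onto one, which is the behavior observed in the 1-1 selftest.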
| Comment by Jinshan Xiong [ 21/Nov/18 ] |
|
I will be happy to try that out. Thanks. |
| Comment by Amir Shehata (Inactive) [ 23/Nov/18 ] |
|
Summarized the issue and the proposed solution here: Will try and get a patch in soon. |
| Comment by Gerrit Updater [ 28/Nov/18 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33740 |
| Comment by Amir Shehata (Inactive) [ 28/Nov/18 ] |
|
Hey Jinshan, can you try this patch? Run a 1-1 selftest and monitor the socknal_sd_* threads' CPU usage. |
| Comment by Gerrit Updater [ 04/Jan/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33740/ |
| Comment by Peter Jones [ 04/Jan/19 ] |
|
Landed for 2.13
|
| Comment by Gerrit Updater [ 25/Feb/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34299 |
| Comment by Gerrit Updater [ 08/Apr/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34299/ |