[LU-2379] 40GigE LNet performance Created: 23/Nov/12  Updated: 11/Jun/20  Resolved: 11/Jun/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major
Reporter: Liang Zhen (Inactive) Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None

Rank (Obsolete): 5648

 Description   

Warpmech reported an LNet performance issue over 40GigE:

  • between two 40GigE nodes, selftest can only achieve half of the bandwidth
    • this is not a big surprise to me, because each socklnd connection is bound to exactly one thread and the receiving side has no zero-copy, which means only one core is receiving from a 40GigE link when there is only one peer, so I suspect 1:1 40GigE performance is CPU bound (see the sketch after this list)
    • one possible solution is to reuse the CONTROL connection to transfer bulk data as well, under some special policy
    • I know that enabling kiov vmap can help receive performance on some NICs, but I am not sure it will help in this case.
  • while running BRW tests between 10GigE clients and a 40GigE server, one direction works well (saturates the bandwidth), but the other direction only reaches half.
    • reads from 8 clients can saturate the link pretty well, but I don't know whether 4 clients can do the same.
    • writes from 8 clients can't saturate the link; they reach only half of the bandwidth. I actually found that a single 10GigE client only gets 250-300MB/sec on writes against the 40GigE server, which matches the aggregate write performance of 8 clients (2GB/sec)
    • network latency is quite low (I don't remember the number)
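
A minimal userspace sketch of the single-core receive bottleneck described above (an illustration, not socklnd code; the port and buffer size are arbitrary): one thread servicing one TCP connection copies every incoming byte out of the socket, so single-stream receive throughput is capped by what one core can do.

    /* Hypothetical single-threaded TCP sink. Every byte of the
     * 40GigE stream passes through the one recv() loop below,
     * which is why a single peer/connection is CPU bound. */
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        if (lfd < 0) { perror("socket"); return 1; }

        int one = 1;
        setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(5001);     /* arbitrary example port */

        if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(lfd, 1) < 0) {
            perror("bind/listen");
            return 1;
        }

        int cfd = accept(lfd, NULL, NULL);
        if (cfd < 0) { perror("accept"); return 1; }

        static char buf[1 << 20];                /* 1 MiB receive buffer */
        unsigned long long total = 0;
        ssize_t n;

        /* One thread, one connection: all incoming data is copied
         * out of the socket here, on a single core. */
        while ((n = recv(cfd, buf, sizeof(buf), 0)) > 0)
            total += (unsigned long long)n;

        printf("received %llu bytes\n", total);
        close(cfd);
        close(lfd);
        return 0;
    }

With several peers (several such receivers on different cores) the copy cost is spread out, which is consistent with the 1:1 case being the worst.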


 Comments   
Comment by Isaac Huang (Inactive) [ 04/Dec/12 ]

1. point to point 40GE tests:
I heard that at least a fully utilized 5GHz CPU would be needed to saturate one direction of a 10GE link if nothing is offloaded from the host CPU; that also depends on message size. Have they tuned jumbo frames and the TCP send/recv buffers? (A generic buffer-tuning sketch follows.)
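
For reference, a generic sketch of requesting larger TCP send/receive buffers with setsockopt(); the 16 MiB value is only an example, and socklnd configures its buffers through its own module tunables rather than this exact code.

    /* Illustrative only: enlarge per-socket send/receive buffers.
     * Values and function name are assumptions for this sketch. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int tune_socket_buffers(int fd)
    {
        int sz = 16 * 1024 * 1024;               /* 16 MiB, example value */

        if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sz, sizeof(sz)) < 0 ||
            setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &sz, sizeof(sz)) < 0) {
            perror("setsockopt");
            return -1;
        }
        return 0;
    }

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        tune_socket_buffers(fd);

        int sz;
        socklen_t len = sizeof(sz);
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &sz, &len);
        printf("effective SO_RCVBUF: %d\n", sz);   /* kernel may clamp/double it */

        close(fd);
        return 0;
    }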

2. BRW tests between 10GigE clients and 40GigE server:

I assumed that it was 8 clients vs. one server. If so, then when the 8 clients are reading, the overhead of memory copying on incoming data is spread over the 8 clients, while in the write case the single server has to handle all of the memory-copy overhead. If that's the cause, it'd be interesting to see the results with more servers, e.g. 8 clients writing to 2 or more servers (see the arithmetic sketch below).
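
A back-of-the-envelope sketch of that argument (the figures are illustrative assumptions, not measurements from this ticket): with N clients reading from one server, each client copies roughly 1/N of the aggregate stream, while with N clients writing to one server, the single server copies all of it.

    /* Illustrative arithmetic only; ignores framing/protocol overhead. */
    #include <stdio.h>

    int main(void)
    {
        const double link_gbps = 40.0;             /* 40GigE line rate        */
        const double agg_gBps  = link_gbps / 8.0;  /* ~5 GB/s of raw payload  */
        const int    nclients  = 8;

        printf("read : each of %d clients copies ~%.2f GB/s\n",
               nclients, agg_gBps / nclients);
        printf("write: the single server copies ~%.2f GB/s\n", agg_gBps);
        return 0;
    }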
