
[LU-14293] Poor lnet/ksocklnd(?) performance on 2x100G bonded ethernet

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.6
    • Severity: 3

    Description

      During performance testing of a new Lustre file system, we discovered that read/write performance isn't where we would expect. As an example, the block-level read performance for the system is just over 65 GB/s. In scaling tests, we can only get to around 30 GB/s for reads. Writes are slightly better, but still in the 35 GB/s range. At single-node scale, we seem to cap out at a few GB/s.

      After going through all the tunings and everything else we can find, we're slightly better, but still far behind where performance should be. We've played with various ksocklnd parameters (nconnds, nscheds, tx/rx buffer sizes, etc.), but with little effect. Current tunings that may be relevant: credits 2560, peer_credits 63, max_rpcs_in_flight 32.
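      (For reference, a minimal sketch of how tunables like these are typically applied: the LND settings go in a modprobe.d file read before the lnet/ksocklnd modules are loaded, while max_rpcs_in_flight is set per OSC on the clients. The values shown are just the ones quoted above, not recommendations.)

      # /etc/modprobe.d/ksocklnd.conf -- sketch using the values quoted above
      options ksocklnd credits=2560 peer_credits=63

      # on the clients: per-OSC RPC concurrency
      lctl set_param osc.*.max_rpcs_in_flight=32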

      Network configuration on the servers is 2x 100G ethernet bonded together (active/active) using kernel bonding (not ksocklnd bonding).
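      (For completeness, the LNet side of a setup like this is presumably a single tcp network declared over the bond device; a sketch, assuming the bond interface is named bond0:)

      lnetctl lnet configure
      lnetctl net add --net tcp --if bond0
      lnetctl net show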

      iperf between two nodes gets nearly line rate at ~98Gb/s and iperf from two nodes to a single node can push ~190Gb/s, consistent with what would be expected from the kernel bonding.

      lnet_selftest shows rates of about 2.5GB/s (20Gb/s) for node-to-node tests. I'm not sure if this is a bug in lnet_selftest or a real reflection of the performance.
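      (For reference, a node-to-node lnet_selftest run along these lines; a sketch with placeholder NIDs and a 1M bulk write test:)

      export LST_SESSION=$$
      lst new_session bulk_test
      lst add_group clients 10.0.0.1@tcp
      lst add_group servers 10.0.0.2@tcp
      lst add_batch bulk
      lst add_test --batch bulk --from clients --to servers brw write check=simple size=1M
      lst run bulk
      lst stat servers & sleep 30; kill $!
      lst end_session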

      We found the following related tickets/mailing list discussions which seem to be very similar to what we're seeing, but with no resolutions:

      http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2019-August/016630.html

      https://jira.whamcloud.com/browse/LU-11415

      https://jira.whamcloud.com/browse/LU-12815 (maybe performance limiting, but I doubt it for what we're seeing)

       

      Any help or suggestions would be awesome.

      Thanks!

      - Jeff

    Activity

            simmonsja James A Simmons added a comment -

            Talking to Peter Jones, this is treated as a new feature, so it will not be landed to 2.12 LTS. We can close this ticket.

            simmonsja James A Simmons added a comment -

            The patches for LU-12815 have been backported to 2.12 LTS. Will they land, or should we close this ticket?

            nilesj Jeff Niles added a comment -

            Sounds good. The new issue is LU-14320.

            Thanks everyone!

            pjones Peter Jones added a comment -

            Yup, I agree - new ticket for the latest issues, and we can leave this open until the LU-12815 patches are landed to b2_12.


            adilger Andreas Dilger added a comment -

            It might make sense to keep this issue open to track the socklnd conns_per_peer feature for your use in 2.12.x, since LU-12815 will be closed once the patches are landed on master for 2.15 (though Peter may have other methods for tracking this). In the meantime, pending final review, testing, and landing of the LU-12815 patch series, there isn't a particular reason for you not to use the conns_per_peer patch on your system, since you are presumably not using the use_tcp_bonding feature yourself.

            adilger Andreas Dilger added a comment -

            Jeff, I definitely have some comments related to ZFS performance, but they should really go into a separate ticket. If I file that ticket, it will not be tracked correctly as a customer issue, so it is best if you do that.

            As for including conns_per_peer into 2.12, that is a bit tricky in the short term since that patch depends on another one that is removing the socklnd-level TCP bonding feature. While the LNet Multi-Rail provides better functionality, use_tcp_bonding may be in use at customer sites and shouldn't be removed in an LTS release without any warning. A patch will go into the next 2.12.7 LTS and 2.14.0 releases to announce that this option is deprecated, which will allow sites to become aware of this change and move over to LNet Multi-Rail. I've asked in LU-12815 for an email to be sent out to lustre-discuss and lustre-devel asking if anyone is using this feature, and maybe it can be removed from 2.12.8 if there is no feedback on its usage.

            nilesj Jeff Niles added a comment -

            As a side note, it may make sense for us to close this particular issue and open a new, tailored one for the problems we're seeing now, since the issue described in this ticket (slow lnet performance) has been resolved with the `conns_per_peer` patch.

            Along those lines, since that patch resolved our network performance issue and we'd like to keep running 2.12 (LTS), could we lobby to get James' backport of it included in the next 2.12 point release so that we don't have to keep carrying that patch?

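            (For anyone following along: with the LU-12815 patches applied, conns_per_peer is a ksocklnd-level tunable, so a sketch of enabling it is simply a module option; the value 4 is only an example, see the scaling table in a later comment below.)

            # /etc/modprobe.d/ksocklnd.conf -- sketch; value is an example only
            options ksocklnd conns_per_peer=4
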
            nilesj Jeff Niles added a comment -

            Update on where we are:

            When we stood up the system we ran some benchmarks on the raw block storage, so we're confident that the block storage can provide ~7GB/s read per LUN, with ~65GB/s read across the 12 LUNs in aggregate. What we did not do, however, was run any benchmarks on ZFS after the zpools were created on top of the LUNs. Since LNET was no longer our bottleneck, we figured it would make sense to verify the stack from the bottom up, starting with the zpools. We set the zpools to `canmount=on`, changed the mountpoints, then mounted them and ran fio on them. Performance is terrible.
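            (The canmount/mountpoint change described above would look roughly like the following; the pool/dataset name and mountpoint are placeholders.)

            # sketch: temporarily mount an OST dataset directly for fio testing
            zfs set canmount=on ostpool/ost0
            zfs set mountpoint=/mnt/zfs-test ostpool/ost0
            zfs mount ostpool/ost0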

            Given that we have another file system running with the exact same tunings and general layout, we also checked that file system in the same manner, with much the same results. Since we have past benchmarking results from that file system, we're fairly confident that at some point in the past ZFS was functioning correctly. With that knowledge (and after looking at various ZFS GitHub issues) we decided to roll back from ZFS 0.8.5 to 0.7.13 to test the performance there. It seems that 0.7.13 also gives the same results.

            There may be value in rolling back our kernel to match what it was when we initialized the other file system, in case there's some odd interaction with the kernel version we're running, but I'm not sure.

            Here are the results of our testing on a single LUN with ZFS. Keep in mind this LUN can do ~7GB/s at the block level.

              files    | read     | write
              1 file   | 396 MB/s | 4.2 GB/s
              4 files  | 751 MB/s | 4.7 GB/s
              12 files | 1.6 GB/s | 4.7 GB/s

            And here's the really simple fio we're running to get these numbers:

            fio --rw=read --size 20G --bs=1M --name=something --ioengine=libaio --runtime=60s --numjobs=12
            

            We're also noticing some issues where Lustre is eating into those numbers significantly when layered on top. We're going to hold off on debugging that until ZFS is stable, though, as it may just be due to the same ZFS issues.

            nilesj Jeff Niles added a comment -

            Here's a table of performance values using lnet_selftest at various conns_per_peer values. I assume that this changes with CPU clock speed and other factors, but it at least shows the scaling for the patch commit message.

            conns_per_peer | speed
            1              | 1.7 GiB/s
            2              | 3.3 GiB/s
            4              | 6.4 GiB/s
            8              | 11.5 GiB/s
            16             | 11.5 GiB/s

            We did some more troubleshooting on our end yesterday and are suspecting some serious zfs issues. Currently testing an older ZFS version, and will comment again later after some more testing.


            adilger Andreas Dilger added a comment -

            nilesj could you share the performance results for different conns_per_peer values? It would be useful to include a table with this information in the commit message for the patch.

            As for the patch Amir mentioned, that was speculation regarding high CPU usage in osc_page_gang_lookup(). I can't definitively say whether that patch will help improve performance or not. Getting the flamegraphs for this would be very useful, along with what the test workload/parameters are (I'd assume IOR, but the options used are critical).
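            (A typical way to capture such a flamegraph on a server during a test run, assuming Brendan Gregg's FlameGraph scripts are available, is roughly:)

            # sketch: sample all CPUs at 99 Hz for 60 s, then render a flamegraph
            perf record -F 99 -a -g -- sleep 60
            perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > oss-profile.svg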


            People

              Assignee: Amir Shehata (Inactive)
              Reporter: Jeff Niles
              Votes: 0
              Watchers: 11
