Details
Improvement
Resolution: Unresolved
Major
Lustre 2.10.8
Description
At Cyfronet we use RDMA over Ethernet to access a Lustre filesystem located 14 km away in a secondary data center.
RoCEv2 is an RDMA over Ethernet implementation that can be used over a lossy network. To minimize the effect of frame drops caused by network congestion, RoCEv2 uses the ECN mechanism.
For RoCEv2 ECN congestion control to work properly, congestion marking has to be enabled on every device along the path. Traffic subject to ECN marking on the network side must be properly tagged by the HCA.
For this purpose the DSCP field (part of the ToS field) of the IP packet is used to differentiate RDMA and RDMA-CNP traffic from other flows. ECN marking can then be enabled and applied only to RDMA traffic when congestion is detected.
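The relationship between the ToS byte and the DSCP and ECN fields can be sketched as follows. This is a minimal illustration, assuming the standard layout in which DSCP occupies the upper six bits of the ToS byte and ECN the lower two; the helper names `split_tos` and `make_tos` are hypothetical and not part of Lustre or mOFED.

```python
from typing import Tuple


def split_tos(tos: int) -> Tuple[int, int]:
    """Split an 8-bit ToS byte into (dscp, ecn)."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("ToS must fit in one byte: %r" % (tos,))
    # DSCP is the upper 6 bits, ECN the lower 2 bits.
    return tos >> 2, tos & 0x3


def make_tos(dscp: int, ecn: int = 0) -> int:
    """Build a ToS byte from a 6-bit DSCP and a 2-bit ECN value."""
    if not 0 <= dscp <= 0x3F:
        raise ValueError("DSCP must be in 0..63: %r" % (dscp,))
    if not 0 <= ecn <= 0x3:
        raise ValueError("ECN must be in 0..3: %r" % (ecn,))
    return (dscp << 2) | ecn


if __name__ == "__main__":
    for tos in (48, 0x18, 12):
        dscp, ecn = split_tos(tos)
        print("tos=%d -> dscp=%d ecn=%d" % (tos, dscp, ecn))
```

A ToS value passed to the HCA therefore selects a DSCP class that switches can match on; for example, a ToS of 48 corresponds to DSCP 12 with the ECN bits cleared.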
Lustre LNet does not support setting the ToS value in ko2iblnd.
Currently the only way to enable ToS marking of RDMA traffic is to set the default ToS in the mlx4/mlx5 drivers using the cma_roce_tos script, which is part of the mOFED distribution. The script uses configfs to set the desired value and must be executed before the ko2iblnd module is loaded.
The drawback of this approach is that it does not allow different ToS values when more than one o2ib network is configured on one HCA (in separate VLANs). It is also difficult to verify that the intended ToS has actually been set for the ko2iblnd QPs.
A more convenient and flexible way would be a ko2iblnd module option for setting ToS on a per-network basis, together with ToS support in lnetctl for dynamic configuration.
From a technical point of view, RDMA ToS can be set per QP at the API level using rdma_set_option (RDMA_OPTION_ID_TOS option).
Please consider enabling per-network ToS setting for Lustre o2ib networks in the ko2iblnd driver.
Example ko2iblnd parameter could look like this:
modprobe ko2iblnd tos2nets="o2ib80(48),o2ib81(0x18)" tos=12
where
tos - default tos, applied when no explicit mapping is given
tos2nets - set tos value on per-network basis
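To make the intended semantics of the proposed parameters concrete, here is a minimal Python sketch of how such a string could be interpreted. It is only an illustration of the proposal, not ko2iblnd code: `tos2nets` and `tos` do not exist in the driver today, the helper names `parse_tos2nets` and `tos_for_net` are hypothetical, and the accepted syntax (decimal or `0x` hexadecimal values, range 0..255) is an assumption based on the example above.

```python
import re
from typing import Dict

# One entry looks like "o2ib80(48)" or "o2ib81(0x18)" (assumed syntax).
_ENTRY = re.compile(r"^\s*([A-Za-z0-9]+)\(\s*(0[xX][0-9a-fA-F]+|[0-9]+)\s*\)\s*$")


def parse_tos2nets(spec: str) -> Dict[str, int]:
    """Parse a 'net(tos),net(tos)' string into a {net: tos} mapping."""
    mapping: Dict[str, int] = {}
    if not spec.strip():
        return mapping
    for entry in spec.split(","):
        match = _ENTRY.match(entry)
        if match is None:
            raise ValueError("malformed tos2nets entry: %r" % (entry,))
        net, raw = match.group(1), match.group(2)
        # Accept hexadecimal with a 0x prefix, otherwise plain decimal.
        tos = int(raw, 16) if raw.lower().startswith("0x") else int(raw, 10)
        if tos > 0xFF:
            raise ValueError("ToS out of range for %s: %d" % (net, tos))
        if net in mapping:
            raise ValueError("duplicate network in tos2nets: %s" % (net,))
        mapping[net] = tos
    return mapping


def tos_for_net(net: str, mapping: Dict[str, int], default_tos: int) -> int:
    """Return the per-network ToS, falling back to the default."""
    return mapping.get(net, default_tos)


if __name__ == "__main__":
    nets = parse_tos2nets("o2ib80(48),o2ib81(0x18)")
    for name in ("o2ib80", "o2ib81", "o2ib82"):
        print(name, tos_for_net(name, nets, default_tos=12))
```

With the example values, o2ib80 and o2ib81 would use their explicit ToS values, while any other o2ib network on the same HCA would fall back to the default given by `tos`.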
We have suitable infrastructure in place that we can use for testing and verification, if that helps with development.
Best Regards
–
Lukasz Flis