Details
- Improvement
- Resolution: Fixed
- Blocker
- None
- None
- 15136
Description
Lustre has OBD_FAIL checks that can simulate the loss of an RPC request or reply, but they are mostly used for unit tests and are not flexible enough to inject random message losses while a workload is running.
Combining OBD_FAIL_PTLRPC_DROP_RPC with CFS_FAIL_RAND can drop RPCs randomly. However, it always drops the request before it is sent and leaves the RPC in the same state, so it cannot simulate LNet message loss, which may trigger more complex RPC states: for example, loss of the LNet ACK/REPLY of a ptlrpc bulk request, or loss of a ptlrpc reply.
We therefore need a new mechanism that supports random, silent message loss in the network of small test systems. A straightforward solution is to let the user control message drops in LNet. This requires new user interfaces to add and remove message Drop Rules, and internal handlers for these rules in core LNet.
To simplify the implementation, an LNet Drop Rule is applied only on the receive side of a connection (this still covers all message paths). Each Drop Rule has the following attributes:
- Source NID
- Destination NID
- Drop Rate Factor: if the factor is N, LNet randomly drops one out of every N incoming messages that match this rule.
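The three attributes above can be modelled in a few lines. The following is only an illustrative Python sketch, not the LNet implementation: the class name `DropRule`, its methods, and the use of plain glob matching on NID strings are all assumptions made for the example. It interprets "one out of every N" as picking one random position inside each window of N matching messages.

```python
import random
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class DropRule:
    """Hypothetical model of one Drop Rule (not the LNet data structure)."""
    source: str   # source NID pattern, e.g. "*@o2ib0"
    dest: str     # destination NID pattern, e.g. "*@tcp2"
    rate: int     # Drop Rate Factor N: drop 1 of every N matching messages
    rng: random.Random = field(default_factory=random.Random)
    _seen: int = 0    # matching messages seen so far in the current window
    _victim: int = 0  # position inside the window that will be dropped

    def matches(self, src_nid: str, dst_nid: str) -> bool:
        # Assumption: simple glob matching; real NID matching may differ.
        return (fnmatchcase(src_nid, self.source)
                and fnmatchcase(dst_nid, self.dest))

    def should_drop(self, src_nid: str, dst_nid: str) -> bool:
        """Receive-side check: True if this incoming message is dropped."""
        if not self.matches(src_nid, dst_nid):
            return False
        if self._seen == 0:
            # New window of N messages: choose which one to drop.
            self._victim = self.rng.randrange(self.rate)
        drop = self._seen == self._victim
        self._seen = (self._seen + 1) % self.rate
        return drop


rule = DropRule("*@o2ib0", "*@tcp2", rate=1000)
dropped = sum(rule.should_drop("10.0.0.1@o2ib0", "10.0.0.2@tcp2")
              for _ in range(5000))
unmatched = rule.should_drop("10.0.0.1@tcp0", "10.0.0.2@tcp2")
```

Under this windowed interpretation, 5000 matching messages lose exactly one message per window of 1000, while messages that do not match the rule are never dropped.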
The user can add a new Drop Rule by running the command:
lctl net_drop_add --source SOURCE_NID --dest DESTINATION_NID --rate DROP_RATE
Here are some examples:
$ lctl net_drop_add --source *@o2ib0 --dest *@tcp2 --rate 1000
Randomly drop 1 message in every 1000 messages from o2ib0 to tcp2.
$ lctl net_drop_add --source 192.168.1.100@tcp0 --dest *@o2ib3 --rate 500
Randomly drop 1 message in every 500 messages from 192.168.1.100@tcp0 to any node of o2ib3.
$ lctl net_drop_add --source *@o2ib2 --dest * --rate 2000
Randomly drop 1 message in every 2000 incoming messages from o2ib2.
The user can remove a Drop Rule by running the command:
lctl net_drop_del --source SOURCE_NID --dest DESTINATION_NID
All rules are removed if the user simply runs "lctl net_drop_del --all".
To show all LNet Drop Rules, run the command:
lctl net_drop_list
With LNet Drop Rules, we can simulate an unreliable network using a simple environment and a small number of machines. The user can add Drop Rules on either endpoint of the cluster (client or server), or on LNet routers.
The major benefit of adding Drop Rules only on LNet routers is that the same router pool can be used to test any Lustre version, because a router only needs LNet, which has no compatibility issues. It also means this feature does not need to be backported.