Details
Bug
Resolution: Fixed
Minor
Description
As shown in the log below:
00000800:00000100:6.0:1600464412.044215:0:40:0:(o2iblnd_cb.c:2906:kiblnd_rejected()) 172.16.0.44@o2ib rejected: no listener at 987
00000800:00000100:6.0:1600464412.045753:0:40:0:(o2iblnd_cb.c:2880:kiblnd_check_reconnect()) 172.16.0.44@o2ib: reconnect (invalid service id), 12, 12, msg_size: 4096, queue_depth: 8/-1, max_frags: 256/-1
00000800:00000100:6.0:1600464412.045755:0:40:0:(o2iblnd_cb.c:2906:kiblnd_rejected()) 172.16.0.44@o2ib rejected: no listener at 987
00000800:00000100:6.0:1600464412.047336:0:40:0:(o2iblnd_cb.c:2880:kiblnd_check_reconnect()) 172.16.0.44@o2ib: reconnect (invalid service id), 12, 12, msg_size: 4096, queue_depth: 8/-1, max_frags: 256/-1
The o2iblnd can get into a loop attempting to reconnect to a node that is not up, and it only stops when the connection timeout kicks in.
There are two potential solutions. One is to add a new module parameter that controls how many times to attempt a reconnect before we fail.
Another option, which I prefer, is to use the existing retry_count o2iblnd module parameter to limit the number of connection retries.
It's currently used for:
retry_count: The maximum number of times that a data transfer operation should be retried on the connection when an error occurs. This setting controls the number of times to retry send, RDMA, and atomic operations when timeouts occur. Applies only to RDMA_PS_TCP.
I believe it can be used for reconnection attempts performed by the iblnd as well.
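To make the preferred option concrete, here is a minimal user-space sketch of bounding reconnect attempts by retry_count. It is not the actual o2iblnd code: struct peer_state, peer_may_reconnect() and peer_connected() are hypothetical names invented for illustration, and the limit is passed in as a plain argument rather than read from the real module parameter.

```c
#include <stdbool.h>

/* Hypothetical per-peer state; NOT the real o2iblnd peer structure. */
struct peer_state {
    int reconnect_attempts;   /* reconnects tried since the last success */
};

/*
 * Hypothetical helper, called when a connection attempt is rejected.
 * retry_count stands in for the existing module parameter.
 * Returns true if one more reconnect may be attempted, false if the
 * caller should give up and fail the connection.
 */
bool peer_may_reconnect(struct peer_state *peer, int retry_count)
{
    if (peer->reconnect_attempts >= retry_count)
        return false;

    peer->reconnect_attempts++;
    return true;
}

/* Hypothetical helper: a successful connection clears the counter. */
void peer_connected(struct peer_state *peer)
{
    peer->reconnect_attempts = 0;
}
```

With a check like this in the reject path, the "no listener" loop in the log above would end after retry_count attempts instead of running until the connection timeout. Resetting the counter on success keeps a later, unrelated failure from inheriting a stale count.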