[LU-13972] kiblnd can continue attempting to reconnect indefinitely. Created: 18/Sep/20 Updated: 19/Mar/21 Resolved: 19/Oct/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.7 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Amir Shehata (Inactive) | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Description |
|
As shown in the log below:

00000800:00000100:6.0:1600464412.044215:0:40:0:(o2iblnd_cb.c:2906:kiblnd_rejected()) 172.16.0.44@o2ib rejected: no listener at 987
00000800:00000100:6.0:1600464412.045753:0:40:0:(o2iblnd_cb.c:2880:kiblnd_check_reconnect()) 172.16.0.44@o2ib: reconnect (invalid service id), 12, 12, msg_size: 4096, queue_depth: 8/-1, max_frags: 256/-1
00000800:00000100:6.0:1600464412.045755:0:40:0:(o2iblnd_cb.c:2906:kiblnd_rejected()) 172.16.0.44@o2ib rejected: no listener at 987
00000800:00000100:6.0:1600464412.047336:0:40:0:(o2iblnd_cb.c:2880:kiblnd_check_reconnect()) 172.16.0.44@o2ib: reconnect (invalid service id), 12, 12, msg_size: 4096, queue_depth: 8/-1, max_frags: 256/-1

The o2iblnd can get into a loop attempting to reconnect to a node that is not up. The loop only ends when the connection timeout kicks in.

There are two potential solutions. One is to add a new module parameter that controls the number of times to attempt a reconnect before we fail. The other option, which I prefer, is to use the existing retry_count o2iblnd module parameter to limit the number of connection retries. It is currently documented as follows:

retry_count: The maximum number of times that a data transfer operation should be retried on the connection when an error occurs. This setting controls the number of times to retry send, RDMA, and atomic operations when timeouts occur. Applies only to RDMA_PS_TCP.

I believe it can be used for reconnection attempts performed by the iblnd as well. |
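The preferred option above, capping reconnect attempts with a value such as retry_count, can be sketched roughly as follows. This is a minimal standalone illustration and not the code from the landed patch: the struct reconnect_state type and the reconnect_state_init(), reconnect_allowed() and reconnect_succeeded() helpers are hypothetical names invented for this example and are not part of the o2iblnd source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-peer bookkeeping; NOT the real kiblnd peer structure. */
struct reconnect_state {
    unsigned int attempts;      /* reconnects tried since the last success */
    unsigned int max_attempts;  /* cap, assumed to come from retry_count */
};

/* Set up the state with a cap on consecutive reconnect attempts. */
void reconnect_state_init(struct reconnect_state *rs, unsigned int max_attempts)
{
    assert(rs != NULL);
    rs->attempts = 0;
    rs->max_attempts = max_attempts;
}

/*
 * Intended to be called each time the peer rejects a connection request.
 * Returns true and counts the attempt if another reconnect is permitted,
 * false once the cap has been reached so the caller can fail the peer.
 */
bool reconnect_allowed(struct reconnect_state *rs)
{
    assert(rs != NULL);
    if (rs->attempts >= rs->max_attempts)
        return false;
    rs->attempts++;
    return true;
}

/* A successful connection clears the counter for future failures. */
void reconnect_succeeded(struct reconnect_state *rs)
{
    assert(rs != NULL);
    rs->attempts = 0;
}
```

With a cap of 3, three consecutive rejections would each be allowed to trigger a reconnect and the fourth would be refused, instead of the rejected/reconnect pair repeating until the connection timeout as in the log above.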
| Comments |
| Comment by Gerrit Updater [ 19/Sep/20 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39981 |
| Comment by Chris Hunter (Inactive) [ 21/Sep/20 ] |
|
Currently when a client loses connection to a server, it will retry indefinitely.
|
| Comment by Gerrit Updater [ 19/Oct/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39981/ |
| Comment by Peter Jones [ 19/Oct/20 ] |
|
Landed for 2.14 |
| Comment by Gerrit Updater [ 11/Mar/21 ] |
|
Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/42011 |
| Comment by Gerrit Updater [ 17/Mar/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/42011/ |