[LU-13972] kiblnd can continue attempting to reconnect indefinitely. Created: 18/Sep/20  Updated: 19/Mar/21  Resolved: 19/Oct/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0, Lustre 2.12.7

Type: Bug Priority: Minor
Reporter: Amir Shehata (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

As shown in the log below:

00000800:00000100:6.0:1600464412.044215:0:40:0:(o2iblnd_cb.c:2906:kiblnd_rejected()) 172.16.0.44@o2ib rejected: no listener at 987
00000800:00000100:6.0:1600464412.045753:0:40:0:(o2iblnd_cb.c:2880:kiblnd_check_reconnect()) 172.16.0.44@o2ib: reconnect (invalid service id), 12, 12, msg_size: 4096, queue_depth: 8/-1, max_frags: 256/-1
00000800:00000100:6.0:1600464412.045755:0:40:0:(o2iblnd_cb.c:2906:kiblnd_rejected()) 172.16.0.44@o2ib rejected: no listener at 987
00000800:00000100:6.0:1600464412.047336:0:40:0:(o2iblnd_cb.c:2880:kiblnd_check_reconnect()) 172.16.0.44@o2ib: reconnect (invalid service id), 12, 12, msg_size: 4096, queue_depth: 8/-1, max_frags: 256/-1

The o2iblnd can get into a loop attempting to reconnect to a node which is not up, and the loop only ends when the connection timeout kicks in.

There are two potential solutions. One is to add a new module parameter to control the number of times to attempt a reconnect before we fail.

Another option, which I prefer, is to use the existing retry_count o2iblnd module parameter to limit the number of connection retries.

It's currently used for:

retry_count
The maximum number of times that a data transfer operation
should be retried on the connection when an error occurs. This setting
controls the number of times to retry send, RDMA, and atomic operations
when timeouts occur. Applies only to RDMA_PS_TCP.

I believe it can be used for reconnection attempts performed by the o2iblnd as well.
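
To illustrate the preferred option, here is a minimal userspace sketch of capping reconnect attempts with a retry limit. It is not the code from the patch: struct peer_state, peer_may_reconnect() and peer_connected() are hypothetical names, and retry_limit merely stands in for the retry_count module parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-peer state; not the real o2iblnd peer structure. */
struct peer_state {
	int reconnect_attempts;	/* consecutive reconnects since last success */
};

/*
 * Decide whether another reconnect is allowed after a rejection.
 * retry_limit stands in for the retry_count module parameter (assumption).
 * Returns true and consumes one attempt, or false once the budget is spent.
 */
static bool peer_may_reconnect(struct peer_state *peer, int retry_limit)
{
	assert(peer != NULL);

	if (peer->reconnect_attempts >= retry_limit)
		return false;	/* give up and let the connection fail */

	peer->reconnect_attempts++;
	return true;
}

/* A successful connection resets the reconnect budget. */
static void peer_connected(struct peer_state *peer)
{
	assert(peer != NULL);
	peer->reconnect_attempts = 0;
}
```

With a check like this in the rejection path, a peer that keeps answering "no listener" would be retried at most retry_limit times instead of looping until the connection timeout.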



 Comments   
Comment by Gerrit Updater [ 19/Sep/20 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39981
Subject: LU-13972 o2iblnd: Don't retry indefinitely
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c0ce7179f98ab2bef0566afdbb9a8f58db6a9c7c

Comment by Chris Hunter (Inactive) [ 21/Sep/20 ]

Currently when a client loses connection to a server, it will retry indefinitely.
With this change, it appears the client will eventually fail the connection.

 

Comment by Gerrit Updater [ 19/Oct/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39981/
Subject: LU-13972 o2iblnd: Don't retry indefinitely
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7c8ad11ef08f0f2f886004ae4a56f67722c16d5c

Comment by Peter Jones [ 19/Oct/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 11/Mar/21 ]

Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/42011
Subject: LU-13972 o2iblnd: Don't retry indefinitely
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 30ce9d9700e82e8317709c8ffa5a4ab754e6544e

Comment by Gerrit Updater [ 17/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/42011/
Subject: LU-13972 o2iblnd: Don't retry indefinitely
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 6d2aae7396cfcc37873effa137f8e0cc437132ff
