Details
Bug
Resolution: Low Priority
Major
None
Lustre 2.3.0, Lustre 2.1.1
None
3
4035
Description
I have two nodes, "oss1" and "mds1". I can lctl ping from oss1 to mds1 without problems, which shows that the basic network setup is correct.
If I kill mds1 (i.e. pull the power cord), boot it back up and start lnet, the first lctl ping from oss1 to it fails with EIO. A subsequent lctl ping succeeds.
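To make the failure easier to reproduce, here is a minimal Python sketch that issues several lctl pings back to back and records which ones succeed. It is only an illustration: the helper names `ping_sequence` and `shows_first_ping_failure` are hypothetical, the NID `mds1@tcp` is an assumed example, and the script assumes that lctl exits with a non-zero status when the ping fails.

```python
import subprocess
from typing import Callable, List


def ping_sequence(
    nid: str,
    count: int = 2,
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> List[bool]:
    """Run `lctl ping <nid>` `count` times in a row; return success per attempt."""
    results = []
    for _ in range(count):
        # Assumption: lctl exits non-zero when the ping fails (e.g. with EIO).
        proc = runner(["lctl", "ping", nid], capture_output=True, text=True)
        results.append(proc.returncode == 0)
    return results


def shows_first_ping_failure(results: List[bool]) -> bool:
    """True when the first attempt failed and a later attempt succeeded."""
    return len(results) >= 2 and not results[0] and any(results[1:])
```

Run on oss1 right after mds1 has been rebooted and lnet restarted, `ping_sequence("mds1@tcp")` would be expected to return `[False, True]` if the behaviour described here is present. The `runner` parameter exists only so the helper can be exercised without a Lustre installation.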
In terms of network traffic when this happens, three TCP sessions from oss1 to mds1 are established serially: the first is opened and then closed, the second is opened and then closed, and the third is opened and left open. The first two exchange a few packets each and close down gracefully. The third is much longer lived: it carries both the failed and the successful lctl ping, and it stays open afterwards.
Since the TCP sessions themselves are established and closed cleanly, this does not look like a TCP or IP problem but like an artefact of LNET itself, so it seems that LNET ought to be able to handle this situation more gracefully.