Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.3.0
-
None
-
3
-
4542
Description
While LNet is shutting down, lnet_nid2peer_locked()->lnet_find_peer_locked() returns NULL, so lnet_nid2peer_locked() goes on to allocate a new peer and tries to insert it into the peer table. If one thread drops the lock to allocate the new peer, another thread may finalize all of LNet in the meantime. The first thread then crashes the system, because after the allocation it tries to take the lock and access the peer table again, and both have already been finalized.
The simple solution is to take an extra refcount on the peer table (its number of peers) before allocating the new peer, because the shutdown thread always has to wait until the number of peers drops to zero before moving on to the next step.
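A minimal user-space sketch of this idea follows. It is not the actual LNet patch: struct peer, struct peer_table, nid2peer(), peer_table_shutdown() and the pthread-based locking are all hypothetical stand-ins, and per-peer reference counting is left out. It only illustrates the pattern of bumping the peer count before dropping the lock, so that shutdown cannot finish while an allocation is in flight.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the LNet peer and peer table. */
struct peer {
    unsigned long nid;
    struct peer  *next;
};

struct peer_table {
    pthread_mutex_t lock;
    pthread_cond_t  drained;       /* signalled when npeers drops to zero */
    int             npeers;        /* peers in table + allocations in flight */
    int             shutting_down;
    struct peer    *head;
};

void peer_table_init(struct peer_table *t)
{
    pthread_mutex_init(&t->lock, NULL);
    pthread_cond_init(&t->drained, NULL);
    t->npeers = 0;
    t->shutting_down = 0;
    t->head = NULL;
}

/* Caller holds t->lock. */
static struct peer *find_peer_locked(struct peer_table *t, unsigned long nid)
{
    struct peer *p;

    for (p = t->head; p != NULL; p = p->next)
        if (p->nid == nid)
            return p;
    return NULL;
}

struct peer *nid2peer(struct peer_table *t, unsigned long nid)
{
    struct peer *p, *result = NULL;
    int inserted = 0;

    pthread_mutex_lock(&t->lock);
    if (!t->shutting_down)
        result = find_peer_locked(t, nid);
    if (t->shutting_down || result != NULL) {
        pthread_mutex_unlock(&t->lock);
        return result;
    }
    t->npeers++;                   /* extra count: shutdown must wait for us */
    pthread_mutex_unlock(&t->lock);

    p = calloc(1, sizeof(*p));     /* allocate without holding the lock */

    pthread_mutex_lock(&t->lock);  /* safe: the table cannot be gone yet */
    if (p != NULL && !t->shutting_down) {
        result = find_peer_locked(t, nid);   /* did another thread win? */
        if (result == NULL) {
            p->nid = nid;
            p->next = t->head;
            t->head = p;
            result = p;
            inserted = 1;          /* the extra count now covers this peer */
        }
    }
    if (!inserted && --t->npeers == 0)
        pthread_cond_broadcast(&t->drained);
    pthread_mutex_unlock(&t->lock);

    if (!inserted)
        free(p);                   /* free(NULL) is a no-op */
    return result;
}

void peer_table_shutdown(struct peer_table *t)
{
    pthread_mutex_lock(&t->lock);
    t->shutting_down = 1;
    while (t->head != NULL) {      /* drop every peer still in the table */
        struct peer *p = t->head;

        t->head = p->next;
        free(p);
        t->npeers--;
    }
    while (t->npeers > 0)          /* wait for in-flight allocations */
        pthread_cond_wait(&t->drained, &t->lock);
    assert(t->npeers == 0 && t->head == NULL);
    pthread_mutex_unlock(&t->lock);
}
```

In this sketch, a caller that loses the race with shutdown gets NULL back instead of touching freed state, and anything that tears down the lock or the table itself would only run after peer_table_shutdown() returns.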
This bug was not introduced by the new LNet, but the new LNet makes it easy to expose.