Details
Bug
Resolution: Not a Bug
Minor
Lustre 2.12.0
Description
I found this issue when testing Amir's new patches (see https://review.whamcloud.com/#/c/33651/9), but I believe the issue exists in current master.
LNetNIInit() calls lnet_monitor_thr_start() -> lnet_router_post_mt_start() -> lnet_wait_known_routerstate()
lnet_wait_known_routerstate() will wait indefinitely until all gateways have been discovered.
However, the discovery thread is not started until after lnet_monitor_thr_start() returns. Thus, LNet never finishes starting.
Logs slowly fill with:
[7073564.123980] LNetError: 31952:0:(router.c:873:lnet_check_routers()) Failed to discover router 192.168.2.26@tcp4
Reproduced on a simple three-node VM setup.
The LNet configuration:
sles15build01:/etc/modprobe.d # pdsh -w sles15build01,sles15c01,sles15s01 cat /etc/modprobe.d/lnet.conf | dshbak -c
----------------
sles15s01
----------------
options lnet networks="tcp(eth0)"
options lnet routes="tcp4 192.168.2.26@tcp"
options lnet lnet_peer_discovery_disabled=0
options lnet check_routers_before_use=1
----------------
sles15c01
----------------
options lnet ip2nets="tcp4(eth0) 192.168.*.*; tcp99(eth0) 192.168.*.*"
options lnet routes="tcp 192.168.2.26@tcp4"
options lnet lnet_peer_discovery_disabled=0
options lnet check_routers_before_use=1
----------------
sles15build01
----------------
options lnet ip2nets="tcp(eth0) 192.168.*.*; tcp4(eth1) 192.168.*.*; tcp99(eth1) 192.168.*.*"
options lnet forwarding=enabled
options lnet lnet_peer_discovery_disabled=0
sles15build01:/etc/modprobe.d #
Attempt to start LNet:
sles15build01:/etc/modprobe.d # pdsh -w sles15build01,sles15c01,sles15s01
pdsh> modprobe lnet
pdsh> lctl net up
sles15build01: LNET configured
LNet starts on the node acting as a router, but it hangs indefinitely on the other two nodes.