Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version: Lustre 2.15.0
- Environment: lustre-2.15.0_3.llnl-3.t4.x86_64, TOSS 4.4-5
Description
When starting LNet, we observe NIs going down after about 110 seconds. The nodes on which we've observed this issue are router nodes. We are setting check_routers_before_use=1 in our lnet module parameters. We do not see this issue with check_routers_before_use=0.
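For reference, the parameter is set as an lnet module option; a minimal sketch of how such a setting is typically applied (the file path is illustrative, not necessarily our exact config):

```
# /etc/modprobe.d/lnet.conf (path illustrative)
# check_routers_before_use=1 tells LNet to verify router health
# before sending traffic through the routers
options lnet check_routers_before_use=1
```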
The network is
o2ib18 <-> tcp129 <-> o2ib100
The routers of interest are opal[187-190]
opal[187,188] route between o2ib18 and tcp129
opal[189,190] route between tcp129 and o2ib100
o2ib100 includes many non-opal nodes, including LNet routers and MDS and OSS nodes.
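A router node in this topology has NIs on both networks with forwarding enabled; a hypothetical module-option sketch for opal[187,188] (the interface names ib0 and eth0 are assumptions, not taken from our actual config):

```
# /etc/modprobe.d/lnet.conf on opal[187,188] (illustrative)
# One NI on each network; forwarding="enabled" makes the node an LNet router
options lnet networks="o2ib18(ib0),tcp129(eth0)" forwarding="enabled"
```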
The issue was first observed on the tcp network. However, stopping lnet on all nodes in o2ib18 then starting it on opal187 showed the same symptoms on the infiniband NI.
When the NI status was down, traffic was unable to flow between compute nodes on o2ib18 and a filesystem on o2ib100. LNet pings between nodes with down NIs also fail.
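The NI status and ping failures can be checked with lnetctl; commands of this shape show them (the NID below is a placeholder, not one of our actual addresses):

```
# Show local NI state; affected NIs report "status: down"
lnetctl net show

# LNet-level ping of a peer NID (placeholder NID)
lnetctl ping 192.168.129.187@tcp129
```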
However, when we start opal[188,190] with check_routers_before_use unset, then start opal[187,189] with check_routers_before_use=1, opal[187,189] can ping and be pinged by opal[188,190], but cannot ping each other or themselves.
We noticed this when booting the opal cluster. All the non-opal nodes on o2ib100 were up and LNet was running on them. The opal router nodes listed above were powered on first; after they were up and LNet was started, the rest of the opal nodes (about 60 Lustre clients) were booted. We found that the opal routers' NIs were down and the opal clients could not ping through the opal routers to the MDS and OSS nodes on o2ib100. This is a concern because the same scenario occurs when we update operating system versions or recover from power outages.