[LU-16106] lnet network NIs go down when they have no peers and check_routers_before_use=1 - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.16.0, Lustre 2.15.2
Affects Version/s: Lustre 2.15.0
Labels:
- llnl
Environment:
lustre-2.15.0_3.llnl-3.t4.x86_64
TOSS 4.4-5

Severity:
3

Description

When starting lnet, we observe NIs going down after about 110 seconds. The nodes on which we've observed this issue are router nodes. We are setting check_routers_before_use=1 in our lnet module parameters. We do not see this issue with check_routers_before_use=0.

The network is
o2ib18 <-> tcp129 <-> o2ib100

The routers of interest are opal[187-190]
opal[187,188] route between o2ib18 and tcp129
opal[189,190] route between tcp129 and o2ib100

o2ib100 includes many non-opal nodes, including LNet routers and MDS and OSS nodes.

The issue was first observed on the tcp network. However, stopping lnet on all nodes in o2ib18 and then starting it on opal187 showed the same symptoms on the InfiniBand NI.

When the NI status was down, traffic was unable to flow between compute nodes on o2ib18 and a filesystem on o2ib100. Also, pings don't work between nodes with down NIs.
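The topology described above can be modeled as a small graph to show why down router NIs cut off o2ib18 from o2ib100. This is only an illustrative sketch: the router-to-network mapping comes from the description, but `find_route`, `down_nis`, and the rule that a router forwards only when both of its NIs are up are assumptions made for the example, not LNet's actual routing logic.

```python
from collections import deque

# Router -> the two LNet networks it bridges, as listed in the description.
ROUTERS = {
    "opal187": ("o2ib18", "tcp129"),
    "opal188": ("o2ib18", "tcp129"),
    "opal189": ("tcp129", "o2ib100"),
    "opal190": ("tcp129", "o2ib100"),
}


def find_route(src_net, dst_net, down_nis=frozenset()):
    """Return a list of (router, next_net) hops from src_net to dst_net, or None.

    down_nis is a set of (router, net) pairs whose NI is treated as down.
    Assumption for this sketch: a router forwards only if both NIs are up.
    """
    queue = deque([(src_net, [])])
    seen = {src_net}
    while queue:
        net, hops = queue.popleft()
        if net == dst_net:
            return hops
        for router, nets in sorted(ROUTERS.items()):
            if net not in nets:
                continue  # this router has no NI on the current network
            if any((router, n) in down_nis for n in nets):
                continue  # skip routers with a down NI
            other = nets[0] if nets[1] == net else nets[1]
            if other not in seen:
                seen.add(other)
                queue.append((other, hops + [(router, other)]))
    return None


if __name__ == "__main__":
    # All NIs up: traffic crosses one router per tier.
    print(find_route("o2ib18", "o2ib100"))
    # tcp129 NIs down on both first-tier routers: no route remains.
    down = {("opal187", "tcp129"), ("opal188", "tcp129")}
    print(find_route("o2ib18", "o2ib100", down))
```

With every NI up, the search finds a two-hop path through one router in each tier. Marking the tcp129 NIs down on both opal187 and opal188 leaves no path, which matches the observation that compute nodes on o2ib18 could not reach the filesystem on o2ib100.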

However, when we start opal[188,190] with check_routers_before_use not set and then start opal[187,189] with check_routers_before_use=1, opal[187,189] can ping and be pinged by opal[188,190], but cannot ping each other or themselves.

We noticed this when booting the opal cluster. All the non-opal nodes on o2ib100 were up and LNet was running on those non-opal nodes. The opal router nodes listed above were powered on first, and after they were up and LNet was started, the rest of the opal nodes (about 60 Lustre clients) were booted. We found that the opal routers' NIs were down and the opal clients could not ping through the opal routers to the MDS and OSS nodes on o2ib100. This is a concern because this scenario occurs when we update operating system versions or recover from power outages.

Attachments

startup_router_nodes_systemctl.gz
1.27 MB
25/Aug/22 12:42 AM

People

Assignee: Serguei Smirnov

Reporter: Gian-Carlo Defazio

Votes: 0

Watchers: 6

Dates

Created: 25/Aug/22 12:34 AM

Updated: 05/Apr/23 8:09 PM

Resolved: 17/Sep/22 1:01 PM