[LU-16106] lnet network NIs go down when they have no peers and check_routers_before_use=1 Created: 25/Aug/22 Updated: 05/Apr/23 Resolved: 17/Sep/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.2 |
| Type: | Bug | Priority: | Major |
| Reporter: | Gian-Carlo Defazio | Assignee: | Serguei Smirnov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
lustre-2.15.0_3.llnl-3.t4.x86_64 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
When starting lnet, we observe NIs going down after about 110 sec. The nodes where we've observed this issue are router nodes. We are setting check_routers_before_use=1 in our lnet module parameters. We do not see this issue with check_routers_before_use=0.

The routers of interest are opal[187-190]. o2ib100 includes many non-opal nodes, including LNet routers and MDS and OSS nodes.

The issue was first observed on the tcp network. However, stopping lnet on all nodes in o2ib18 and then starting it on opal187 showed the same symptoms on the InfiniBand NI. When the NI status was down, traffic was unable to flow between compute nodes on o2ib18 and a filesystem on o2ib100. Also, pings don't work between nodes with down NIs. However, when starting some nodes, opal[188,190], with check_routers_before_use not set, and then starting opal[187,189] with check_routers_before_use=1, opal[187,189] are able to ping and be pinged by opal[188,190], but can't ping each other or themselves.

We noticed this when booting the opal cluster. All the non-opal nodes on o2ib100 were up and LNet was running on those non-opal nodes. The opal router nodes listed above were powered on first, and after they were up and LNet was started, the rest of the opal nodes (about 60 Lustre clients) were booted. We found that the opal routers' NIs were down and the opal clients could not ping through the opal routers to the MDS and OSS nodes on o2ib100. This is a concern because this scenario occurs when we update operating system versions or recover from power outages. |
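For reference, a minimal sketch of how the NI state and LNet-level connectivity described above can be watched on a router node; these are standard lnetctl/lctl commands, and the NID shown is just one of the router NIDs from this report:

# Start LNet using the configured module parameters, then watch the local NI state.
modprobe lnet && lnetctl lnet configure
watch -n 10 'lnetctl net show | grep -E "nid|status"'   # per this report, state flips to "down" after ~110 sec

# Once an NI reports "down", LNet pings to/from the affected node fail as well.
lnetctl ping 192.168.129.188@tcp129
lctl ping 192.168.129.188@tcp129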
| Comments |
| Comment by Gian-Carlo Defazio [ 25/Aug/22 ] |
|
The nodes seem to have differing opinions about up/down:

[root@opal193:~]# pdsh -w eopal[187-190] 'lnetctl peer show --verbose | grep -E -A 1 "\- nid.*tcp129"' | dshbak -c
----------------
eopal187
----------------
- nid: 192.168.129.188@tcp129
state: up
--
- nid: 192.168.129.190@tcp129
state: up
--
- nid: 192.168.129.189@tcp129
state: down
--
- nid: 192.168.129.187@tcp129
state: down
----------------
eopal188
----------------
- nid: 192.168.129.190@tcp129
state: up
--
- nid: 192.168.129.188@tcp129
state: up
--
- nid: 192.168.129.189@tcp129
state: up
--
- nid: 192.168.129.187@tcp129
state: up
----------------
eopal189
----------------
- nid: 192.168.129.190@tcp129
state: up
--
- nid: 192.168.129.188@tcp129
state: up
--
- nid: 192.168.129.187@tcp129
state: down
--
- nid: 192.168.129.189@tcp129
state: down
----------------
eopal190
----------------
- nid: 192.168.129.188@tcp129
state: up
--
- nid: 192.168.129.190@tcp129
state: up
--
- nid: 192.168.129.187@tcp129
state: up
--
- nid: 192.168.129.189@tcp129
state: up
|
| Comment by Gian-Carlo Defazio [ 25/Aug/22 ] |
|
I've uploaded startup_router_nodes_systemctl.gz which shows starting lnet via systemctl for opal[187-190]. All 4 routers have check_routers_before_use=1. debug=+net. |
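A rough sketch of how a debug log like that can be captured with the net debug flag enabled; this is generic lctl usage and the file name is only an example, not necessarily how the attached log was produced:

lctl set_param debug=+net                 # add LNet networking messages to the debug mask
systemctl start lnet                      # start LNet the same way the routers do
sleep 150                                 # give the NIs time to go down (~110 sec in this report)
lctl dk /tmp/startup_router_debug.log     # dump the kernel debug buffer to a file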
| Comment by Olaf Faaland [ 25/Aug/22 ] |
|
For our reference, our local issue is TOSS5754 |
| Comment by Gian-Carlo Defazio [ 25/Aug/22 ] |
|
Here's the module params we set for lnet on opal[187,188]:

options lnet forwarding="enabled" \
    networks="o2ib18(hsi0),tcp129(lnet0)" \
    routes="o2ib600 192.168.129.[189-190]@tcp129; \
            tcp0 192.168.129.[189-190]@tcp129; \
            o2ib100 192.168.129.[189-190]@tcp129"
options lnet lnet_peer_discovery_disabled=1

# Common to lustre 2.8, lustre 2.10, lustre 2.12
options libcfs libcfs_panic_on_lbug=1
options libcfs libcfs_debug=0x3060580
options ptlrpc at_min=45
options ptlrpc at_max=600
options ksocklnd keepalive_count=100
options ksocklnd keepalive_idle=30
options lnet check_routers_before_use=1
options lnet lnet_health_sensitivity=0
# lustre 2.12 default for keepalive_intvl is 5 (secs)
# lustre 2.12 default for avoid_asym_router_failure is 1 (enabled)

# Below settings are set via module options ONLY for Lustre <= 2.10
# For later versions of Lustre, they are set via lnetctl YAML files.
options lnet forwarding="enabled"
options lnet tiny_router_buffers=2048
options lnet small_router_buffers=16384
options lnet large_router_buffers=2048
options ko2iblnd credits=1024
options ksocklnd credits=512

and for opal[189,190] only the nets and routes differ:

options lnet forwarding="enabled" \
    networks="o2ib100(san0),tcp129(lnet0)" \
    routes="o2ib600 172.19.2.[22-25]@o2ib100; \
            tcp0 172.19.2.[22-25]@o2ib100; \
            o2ib18 192.168.129.[187-188]@tcp129"
options lnet lnet_peer_discovery_disabled=1

# Common to lustre 2.8, lustre 2.10, lustre 2.12
options libcfs libcfs_panic_on_lbug=1
options libcfs libcfs_debug=0x3060580
options ptlrpc at_min=45
options ptlrpc at_max=600
options ksocklnd keepalive_count=100
options ksocklnd keepalive_idle=30
#options lnet check_routers_before_use=1
options lnet lnet_health_sensitivity=0
# lustre 2.12 default for keepalive_intvl is 5 (secs)
# lustre 2.12 default for avoid_asym_router_failure is 1 (enabled)

# Below settings are set via module options ONLY for Lustre <= 2.10
# For later versions of Lustre, they are set via lnetctl YAML files.
options lnet forwarding="enabled"
options lnet tiny_router_buffers=2048
options lnet small_router_buffers=16384
options lnet large_router_buffers=2048
options ko2iblnd credits=1024
options ksocklnd credits=512
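As a side note on the embedded comment about lnetctl YAML: a sketch, assuming the standard lnetctl "set" subcommands, of the runtime equivalents of the router buffer settings above; values mirror the module options and are not an LLNL-confirmed procedure:

lnetctl set routing 1               # runtime equivalent of forwarding="enabled"
lnetctl set tiny_buffers 2048
lnetctl set small_buffers 16384
lnetctl set large_buffers 2048
lnetctl export > /etc/lnet.conf     # capture the running config as YAML for lnet.service to import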
|
| Comment by Olaf Faaland [ 25/Aug/22 ] |
|
In the above description of the module parameters, the "# Below settings are set via module options ONLY for Lustre <= 2.10 ..." comment is not correct; we set all of that via modprobe.d files. The embedded comment is outdated and never got cleaned up. |
| Comment by Peter Jones [ 25/Aug/22 ] |
|
Serguei, can you please advise? Thanks, Peter |
| Comment by Serguei Smirnov [ 25/Aug/22 ] |
|
Hi,

Are there actually any differences between module parameters for opal[187,189] vs. [188,190]? You mention that "the nets and routes differ", but I don't see that.

Also, could you please clarify which version of lustre is used?

lctl --version

Thanks,
Serguei.
|
| Comment by Gian-Carlo Defazio [ 25/Aug/22 ] |
|
@ssmirnov I had the router numbers wrong. I've updated them in the description and where I posted the module parameters. The correct groupings are:

opal[187,188] are routers between o2ib18 and tcp129
opal[189,190] are routers between o2ib100 and tcp129

I set check_routers_before_use=0 for one router in each group for testing purposes.

opal[188,190] have check_routers_before_use=0 (and can ping, have NIs up)
opal[187,189] have check_routers_before_use=1 (NIs down, have pinging issues)

~# lctl --version
lctl 2.15.0_3.llnl
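A quick sketch of how the per-node setting can be confirmed across the routers; this just reads the loaded module parameter via sysfs and assumes the same pdsh/dshbak tooling used earlier in this ticket:

pdsh -w eopal[187-190] 'cat /sys/module/lnet/parameters/check_routers_before_use' | dshbak -c
# Expected with this test setup: eopal[188,190] report 0, eopal[187,189] report 1.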
|
| Comment by Etienne Aujames [ 26/Aug/22 ] |
|
Hello, am I missing something? check_routers_before_use=1 considers all router peers and routes as down when starting ("=0" -> considers them up). To use a route, a ping (a discovery ping on 2.15) has to succeed. You can check the following lnet module parameters too:

parm: alive_router_check_interval:Seconds between live router health checks (<= 0 to disable) <-- default 60s
parm: router_ping_timeout:Seconds to wait for the reply to a router health query (int) <-- default 50s |
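For completeness, a sketch of how those parameters (and the one under discussion) can be inspected on a running router; sysfs paths for loaded module parameters are standard, and the values noted are just the defaults mentioned above:

grep . /sys/module/lnet/parameters/check_routers_before_use \
       /sys/module/lnet/parameters/alive_router_check_interval \
       /sys/module/lnet/parameters/router_ping_timeout
# e.g. alive_router_check_interval: 60 and router_ping_timeout: 50 by default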
| Comment by Etienne Aujames [ 26/Aug/22 ] |
|
Sorry for my comment above; I did not see your update. |
| Comment by Serguei Smirnov [ 26/Aug/22 ] |
|
Gian-Carlo, Could you please let me know how to get the source with the "2.15.0_3.llnl" tag? I wasn't able to find this tag in the LLNL Lustre repo. Thanks, Serguei. |
| Comment by Gian-Carlo Defazio [ 26/Aug/22 ] |
|
Serguei, Sorry, it wasn't pushed. It's on github now at https://github.com/LLNL/lustre/tree/2.15.0_3.llnl |
| Comment by Gian-Carlo Defazio [ 26/Aug/22 ] |
|
Serguei, I've seen the same issue now for our local lustre 2.14 and 2.12 when all other router nodes are down (or have lnet down) and a single node starts lnet. I need to collect more data on which combinations of nodes, startup orders, and module parameters cause the problem. I'll post that info early next week. |
| Comment by Gerrit Updater [ 28/Aug/22 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48355 |
| Comment by Gian-Carlo Defazio [ 30/Aug/22 ] |
|
The patch seems to solve the issues we were having. Starting the routers in any order works now. This is with check_routers_before_use=1, of course. The routers can ping each other, and compute nodes can get through the routers to the file system. |
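A sketch of the kind of checks that would confirm this, assuming standard lnetctl/lctl usage; the server NID is hypothetical, used only to illustrate pinging through the routers from an o2ib18 client:

# On a patched router, after starting LNet in an arbitrary order:
lnetctl net show                    # local NIs should stay "up" past the ~110 sec mark
lnetctl route show --verbose        # routes via the peer routers should show state "up"

# From an o2ib18 compute node, verify reachability of a server on o2ib100 via the routers
# (172.19.2.10@o2ib100 is a hypothetical MDS NID, for illustration only):
lctl ping 172.19.2.10@o2ib100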
| Comment by Olaf Faaland [ 12/Sep/22 ] |
|
Gian, Have we tested with the final version of the patch (set 3)? thanks, |
| Comment by Gerrit Updater [ 12/Sep/22 ] |
|
"Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/48529 |
| Comment by Gian-Carlo Defazio [ 12/Sep/22 ] |
|
We have not tested patch set 3. |
| Comment by Gerrit Updater [ 17/Sep/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48355/ |
| Comment by Peter Jones [ 17/Sep/22 ] |
|
This fix has now landed for 2.16. We still need to track the b2_15 port until it is merged and to confirm that LLNL's testing is successful with the latest version. |
| Comment by Gerrit Updater [ 26/Sep/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48529/ |
| Comment by Gian-Carlo Defazio [ 26/Sep/22 ] |
|
I tested patch set 3 and it looks good. Sorry for the late notification; the testing resources with the correct setup for that test have been unavailable lately.