[LU-16106] lnet network NIs go down when they have no peers and check_routers_before_use=1 Created: 25/Aug/22  Updated: 05/Apr/23  Resolved: 17/Sep/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.16.0, Lustre 2.15.2

Type: Bug Priority: Major
Reporter: Gian-Carlo Defazio Assignee: Serguei Smirnov
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

lustre-2.15.0_3.llnl-3.t4.x86_64
TOSS 4.4-5


Attachments: File startup_router_nodes_systemctl.gz    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When starting LNet, we observe NIs going down after about 110 seconds. The nodes on which we've observed this issue are router nodes. We are setting check_routers_before_use=1 in our lnet module parameters. We do not see this issue with check_routers_before_use=0.

The network is
o2ib18 <-> tcp129 <-> o2ib100

The routers of interest are opal[187-190]
opal[187,188] route between o2ib18 and tcp129
opal[189,190] route between tcp129 and o2ib100

o2ib100 includes many non-opal nodes, including LNet routers and MDS and OSS nodes.
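
For reference, the routes configured on a router and their state can be inspected with standard lnetctl commands (a minimal sketch; output layout may vary by Lustre version):

# On a router, show the configured routes and whether they are up
lnetctl route show --verbose

# Show the local NIs on both networks the router is attached to
lnetctl net show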

The issue was first observed on the tcp network. However, stopping LNet on all nodes in o2ib18 and then starting it on opal187 showed the same symptoms on the InfiniBand NI.

When the NI status was down, traffic was unable to flow between compute nodes on o2ib18 and a filesystem on o2ib100. Also, LNet pings did not work between nodes with down NIs.
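
The down NI status and the failing pings can be checked roughly as follows (a sketch using standard lnetctl commands; the NID is just an example from this cluster):

# Look for "status: down" under the o2ib18/tcp129 NIs
lnetctl net show --verbose

# LNet-level ping of another router's tcp129 NID (example NID)
lnetctl ping 192.168.129.188@tcp129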

However, when starting nodes opal[188,190] with check_routers_before_use not set, and then starting opal[187,189] with check_routers_before_use=1, opal[187,189] are able to ping and be pinged by opal[188,190], but cannot ping each other or themselves.

We noticed this when booting the opal cluster. All the non-opal nodes on o2ib100 were up and LNet was running on them. The opal router nodes listed above were powered on first, and after they were up and LNet was started, the rest of the opal nodes (about 60 Lustre clients) were booted. We found that the opal routers' NIs were down and the opal clients could not ping through the opal routers to the MDS and OSS nodes on o2ib100. This is a concern because this scenario occurs when we update operating system versions or recover from power outages.



 Comments   
Comment by Gian-Carlo Defazio [ 25/Aug/22 ]

The nodes seem to have differing opinions about up/down status.
In this case, opal[188,190] were started with check_routers_before_use not set,
and opal[187,189] were started with check_routers_before_use=1.

[root@opal193:~]# pdsh -w eopal[187-190] 'lnetctl peer show --verbose | grep -E -A 1 "\- nid.*tcp129"' | dshbak -c
----------------
eopal187
----------------
        - nid: 192.168.129.188@tcp129
          state: up
--
        - nid: 192.168.129.190@tcp129
          state: up
--
        - nid: 192.168.129.189@tcp129
          state: down
--
        - nid: 192.168.129.187@tcp129
          state: down
----------------
eopal188
----------------
        - nid: 192.168.129.190@tcp129
          state: up
--
        - nid: 192.168.129.188@tcp129
          state: up
--
        - nid: 192.168.129.189@tcp129
          state: up
--
        - nid: 192.168.129.187@tcp129
          state: up
----------------
eopal189
----------------
        - nid: 192.168.129.190@tcp129
          state: up
--
        - nid: 192.168.129.188@tcp129
          state: up
--
        - nid: 192.168.129.187@tcp129
          state: down
--
        - nid: 192.168.129.189@tcp129
          state: down
----------------
eopal190
----------------
        - nid: 192.168.129.188@tcp129
          state: up
--
        - nid: 192.168.129.190@tcp129
          state: up
--
        - nid: 192.168.129.187@tcp129
          state: up
--
        - nid: 192.168.129.189@tcp129
          state: up

Comment by Gian-Carlo Defazio [ 25/Aug/22 ]

I've uploaded startup_router_nodes_systemctl.gz which shows starting lnet via systemctl for opal[187-190]. All 4 routers have check_routers_before_use=1. debug=+net.

Comment by Olaf Faaland [ 25/Aug/22 ]

For our reference, our local issue is TOSS5754

Comment by Gian-Carlo Defazio [ 25/Aug/22 ]

Here are the module params we set for lnet on opal[187,188]:

options lnet forwarding="enabled" \
             networks="o2ib18(hsi0),tcp129(lnet0)" \
             routes="o2ib600  192.168.129.[189-190]@tcp129; \
                     tcp0     192.168.129.[189-190]@tcp129; \
                     o2ib100  192.168.129.[189-190]@tcp129"
options lnet lnet_peer_discovery_disabled=1
# Common to lustre 2.8, lustre 2.10, lustre 2.12
options libcfs libcfs_panic_on_lbug=1
options libcfs libcfs_debug=0x3060580
options ptlrpc at_min=45
options ptlrpc at_max=600
options ksocklnd keepalive_count=100
options ksocklnd keepalive_idle=30
options lnet check_routers_before_use=1
options lnet lnet_health_sensitivity=0
# lustre 2.12 default for keepalive_intvl is 5 (secs)
# lustre 2.12 default for avoid_asym_router_failure is 1 (enabled)
# Below settings are set via module options ONLY for Lustre <= 2.10
# For later versions of Lustre, they are set via lnetctl YAML files.
options lnet forwarding="enabled"
options lnet tiny_router_buffers=2048
options lnet small_router_buffers=16384
options lnet large_router_buffers=2048
options ko2iblnd credits=1024
options ksocklnd credits=512

and for opal[189,190], only the nets and routes differ:

options lnet forwarding="enabled" \
             networks="o2ib100(san0),tcp129(lnet0)" \
             routes="o2ib600  172.19.2.[22-25]@o2ib100; \
                     tcp0     172.19.2.[22-25]@o2ib100; \
                     o2ib18   192.168.129.[187-188]@tcp129"
options lnet lnet_peer_discovery_disabled=1
# Common to lustre 2.8, lustre 2.10, lustre 2.12
options libcfs libcfs_panic_on_lbug=1
options libcfs libcfs_debug=0x3060580
options ptlrpc at_min=45
options ptlrpc at_max=600
options ksocklnd keepalive_count=100
options ksocklnd keepalive_idle=30
#options lnet check_routers_before_use=1
options lnet lnet_health_sensitivity=0
# lustre 2.12 default for keepalive_intvl is 5 (secs)
# lustre 2.12 default for avoid_asym_router_failure is 1 (enabled)
# Below settings are set via module options ONLY for Lustre <= 2.10
# For later versions of Lustre, they are set via lnetctl YAML files.
options lnet forwarding="enabled"
options lnet tiny_router_buffers=2048
options lnet small_router_buffers=16384
options lnet large_router_buffers=2048
options ko2iblnd credits=1024
options ksocklnd credits=512
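
As a sanity check, the values the running lnet module actually picked up can be read back from sysfs (a minimal sketch assuming the standard /sys/module parameter layout):

# Verify the parameters the lnet module was loaded with
cat /sys/module/lnet/parameters/check_routers_before_use
cat /sys/module/lnet/parameters/lnet_peer_discovery_disabled
cat /sys/module/lnet/parameters/lnet_health_sensitivity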

 

Comment by Olaf Faaland [ 25/Aug/22 ]

In the above description of the module parameters, the comment "# Below settings are set via module options ONLY for Lustre <= 2.10 ..." is not correct; we set all of that via modprobe.d files. The embedded comment is outdated and never got cleaned up.

Comment by Peter Jones [ 25/Aug/22 ]

Serguei

Can you please advise

Thanks

Peter

Comment by Serguei Smirnov [ 25/Aug/22 ]

Hi,

Are there actually any differences between module parameters for opal[187,189] vs. [188,190]?

You mention that "the nets and routes differ", but I don't see that.

Also, could you please clarify which version of lustre is used?

lctl --version 

Thanks,

Serguei.

 

Comment by Gian-Carlo Defazio [ 25/Aug/22 ]

@ssmirnov  I had the router numbers wrong. I've updated them in the description and where I posted the module parameters.

The correct groupings are:

opal[187,188] are routers between o2ib18 and tcp129

opal[189,190] are routers between o2ib100 and tcp129

I set check_routers_before_use=0 for one router in each group for testing purposes.

opal[188,190] have check_routers_before_use=0 (and can ping, have NIs up)

opal[187,189] have check_routers_before_use=1 (NIs down, have pinging issues)

 

 

~# lctl --version
lctl 2.15.0_3.llnl

 

 

Comment by Etienne Aujames [ 26/Aug/22 ]

Hello,
Your route for opal[187,189] should be: "o2ib100 192.168.129.[188,190]@tcp129"
And for opal[188,190]: "o2ib18 192.168.129.[187,189]@tcp129"
And you don't need the route for tcp0 and o2ib600 because you don't have the interfaces on opal[187,190] for these networks.

Am I missing something?

check_routers_before_use=1 considers all router peers and routes as down when starting ("=0" -> considers them up). To use a route, a ping (a discovery ping on 2.15) has to succeed.
My guess (not verified) is that the router peer stays down if no correct network interface is found on the remote router.

You can check the following lnet module parameters too:

parm:           alive_router_check_interval:Seconds between live router health checks (<= 0 to disable)   <-- default 60s
parm:           router_ping_timeout:Seconds to wait for the reply to a router health query (int) <-- default 50s
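
Both can be read back at runtime, e.g. (a sketch assuming the standard module parameter layout):

cat /sys/module/lnet/parameters/alive_router_check_interval
cat /sys/module/lnet/parameters/router_ping_timeout
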
Comment by Etienne Aujames [ 26/Aug/22 ]

Sorry for my comment above, I did not see your update.

Comment by Serguei Smirnov [ 26/Aug/22 ]

Gian-Carlo,

Could you please let me know how to get the source with "2.15.0_3.llnl" tag?

I wasn't able to find this tag in LLNL Lustre repo.

Thanks,

Serguei.

Comment by Gian-Carlo Defazio [ 26/Aug/22 ]

Serguei,

Sorry, it wasn't pushed. It's on github now at https://github.com/LLNL/lustre/tree/2.15.0_3.llnl

Comment by Gian-Carlo Defazio [ 26/Aug/22 ]

Serguei,

I've now seen the same issue with our local Lustre 2.14 and 2.12 when all other router nodes are down (or have LNet down) and a single node starts LNet. I need to collect more data on which combinations of nodes, startup orders, and module parameters cause the problem. I'll post that info early next week.

Comment by Gerrit Updater [ 28/Aug/22 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48355
Subject: LU-16106 lnet: ignore peer ni down status if it was never up
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 13e7664eef06d081366c7be2e8b43c186f70a429

Comment by Gian-Carlo Defazio [ 30/Aug/22 ]

The patch seems to solve the issues we were having. Starting the routers in any order works now. This is with check_routers_before_use=1, of course. The routers can ping each other, and compute nodes can get through the routers to the file system.
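
Roughly the kind of checks involved, for the record (a sketch only; the router NIDs are examples from this cluster and <mds_nid> is a hypothetical placeholder):

# From one router: LNet ping the other routers on tcp129 (example NIDs)
lnetctl ping 192.168.129.187@tcp129
lnetctl ping 192.168.129.189@tcp129

# From a compute node on o2ib18: ping a server NID on o2ib100 via the routers
# (<mds_nid> is a placeholder; substitute a real MDS/OSS NID)
lnetctl ping <mds_nid>@o2ib100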

Comment by Olaf Faaland [ 12/Sep/22 ]

Gian,

Have we tested with the final version of the patch (set 3)?

thanks,

Comment by Gerrit Updater [ 12/Sep/22 ]

"Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/48529
Subject: LU-16106 lnet: allow direct messages regardless of peer NI status
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: d747b9c24b9c8366f0551a7b790aad30b3a80786

Comment by Gian-Carlo Defazio [ 12/Sep/22 ]

We have not tested patch set 3.

Comment by Gerrit Updater [ 17/Sep/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48355/
Subject: LU-16106 lnet: allow direct messages regardless of peer NI status
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3345a8a54e89c342a4ce2d8d4bcb04ee919bcd52

Comment by Peter Jones [ 17/Sep/22 ]

This fix has now landed for 2.16. We still need to track the b2_15 port until it is merged, and confirm that LLNL's testing is successful with the latest version.

Comment by Gerrit Updater [ 26/Sep/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48529/
Subject: LU-16106 lnet: allow direct messages regardless of peer NI status
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 9ae1fc3e0e4507c242c5f379e6364ad270d865c0

Comment by Gian-Carlo Defazio [ 26/Sep/22 ]

I tested patch set 3 and it looks good.

Sorry for the late notification; the testing resources with the correct setup for that test have been unavailable lately.
