[LU-14026] symptoms of message loss or corruption after upgrading routers to lustre 2.12.5 Created: 14/Oct/20  Updated: 12/Sep/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Olaf Faaland Assignee: Serguei Smirnov
Resolution: Unresolved Votes: 0
Labels: llnl, topllnl
Environment:

lustre 2.10.8 and 2.12.5
mixed OFED, MOFED, Omnipath, and tcp


Attachments: File 2020-oct-14-copper1.tgz     File 2020-oct-14-orelic.tgz     File 2020-oct-14-zrelic.tgz     File 2022-sep-07.lu14026.tgz     File lu-14026.2021-04-28.tgz    
Issue Links:
Related
is related to LU-15453 MDT shutdown hangs on mutex_lock, po... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We have two router clusters, which we call RELICs; they connect the InfiniBand SAN in one building to the InfiniBand SAN in another building, with Ethernet between the routers.  All the servers and clients in both buildings, and the router nodes within the clusters which connect to the SAN, are already at Lustre 2.12.5.  The routers in the RELIC clusters are at Lustre 2.10.8.  In this configuration, the system is stable.

When we power cycle the RELIC routers and boot them from an image with lustre 2.12.5, the router nodes themselves think everything is fine.  There are no errors or warnings on the console, nor indications of failure in the debug log with +net.  However, we begin to see symptoms on server nodes which seem to indicate corrupt, dropped, or delayed messages:

LNetError: PPPP:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, X seconds
LNetError: PPPP:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, X seconds
LNetError: PPPP:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with ZZZ@o2ib600 (0): c: X, oc: Y, rc: Z
LustreError: PPPP:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffffXXXX
LustreError: PPPP:0:(ldlm_lib.c:3279:target_bulk_io()) @@@ Reconnect on bulk READ...
LustreError: PPPP:0:(ldlm_lib.c:3285:target_bulk_io()) @@@ network error on bulk READ
LustreError: PPPP:0:(ldlm_lib.c:3294:target_bulk_io()) @@@ truncated bulk READ 0(1048576) XXX
Lustre: PPPP:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: ...
Lustre: ls1-OST000e: Bulk IO read error with XXX (at ZZZ@o2ib36), client will retry: rc -110

along with side effects such as reconnect attempts.
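
For quick triage, a hedged sketch of scanning a server's kernel log for these messages (the pattern and invocation are illustrative, not the exact commands used here):

# scan the kernel log for the LNet/Lustre symptoms listed above
dmesg -T | grep -E 'kiblnd_check_txs_locked|kiblnd_check_conns|server_bulk_callback|target_bulk_io|Bulk IO read error|network error'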



 Comments   
Comment by Olaf Faaland [ 14/Oct/20 ]

A more detailed explanation of the topology from a Lustre perspective:

network topology:
(each name represents a cluster)

          o2ib100      /    o2ib600
syrah----+            /         +---quartz
surface--+           /          +---ruby
corona---+-orelic--------zrelic-+---copper(lustre1)
catalyst-+         /            +---zinc(lustre2)
...------+        /             +---...

The clusters on the left, including orelic, are on the o2ib100 Infiniband SAN.
The clusters on the right, including zrelic, are on the o2ib600 Infiniband SAN.

orelic and zrelic are router clusters, each with 4 LNet router nodes.
Orelic and zrelic communicate with each other over a 40GigE network.
Copper and zinc are lustre server clusters, each serving a lustre file system.
The server nodes are attached directly to o2ib600.
All the other clusters are structured with some number of compute nodes on an
internal fabric (IB or OmniPath) which connects them to one or more LNet router
nodes. Those router nodes connect to the internal fabric and also to the SAN
(either o2ib100 or o2ib600).

compute1----+
compute2----+--router1--SAN
compute3----+--router2--SAN
compute...--+

Both file systems, lustre1 and lustre2, are mounted on all compute clusters.
All clusters except orelic and zrelic are at Lustre lustre-2.12.5_5.llnl.
Orelic and zrelic are at lustre-2.10.8_10.chaos.
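
To illustrate the routing described above, a hedged sketch of what a route entry might look like on an orelic router (the gateway NID is hypothetical, not taken from the actual configuration):

# on an orelic node: reach the o2ib600 SAN via a zrelic gateway across tcp0
lnetctl route add --net o2ib600 --gateway 192.168.120.13@tcp
# verify the configured routes and their state
lnetctl route show -v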

Comment by Olaf Faaland [ 14/Oct/20 ]

Amir asked:

In your upgrade procedure do you bring down a router, upgrade to 2.12 and then bring up the router? And that's when you start seeing timeout issues?

Either we stop lnet and then power cycle the router, or just power cycle the router; it then boots into an image with Lustre 2.12.

Have you tried disabling discovery on the router as you bring it up?

Discovery is disabled on all our machines right now, including these routers.

Would we be able to setup a debugging session to get to the bottom of this?

Yes
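
For reference, a minimal sketch of how discovery is disabled and verified on a 2.12 node (generic commands, not the exact site configuration):

# disable LNet peer discovery at runtime and confirm the setting
lnetctl set discovery 0
lnetctl global show
# or persistently via a module parameter (illustrative modprobe.d line)
options lnet lnet_peer_discovery_disabled=1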

Comment by Olaf Faaland [ 14/Oct/20 ]

One oddity from this is that if orelic and zrelic are running 2.12.5 but are each reduced to a single router node (for example, [oz]relic5 is on but [oz]relic[2-4] are off), the symptoms go away.  We originally tested that configuration thinking that one of the routers had a bad NIC, cable, switch port, etc.  We tried it with only [oz]relic2 and had success, and then with only [oz]relic5, and also had success.

Comment by Peter Jones [ 14/Oct/20 ]

Serguei

Can you please assist on this one?

Thanks

Peter

Comment by Serguei Smirnov [ 14/Oct/20 ]

Hi Olaf,

I'll retrace a bit to make sure I understand the problem correctly. My understanding is that you have the following topology:

A <-o2ibA-> R1 <-tcp0-> R2 <-o2ibB-> B

and you experience issues when there are multiple nodes in R1 and R2.

1) When |R1|>1 and |R2|>1, can the problem be reproduced simply with repeated "lnetctl ping" (between A <-> B)? Or only under load?

2) Have you tried with |R1|=1 and |R2|>1 and reproduced the problem?

Thanks,

Serguei.

 

Comment by Olaf Faaland [ 14/Oct/20 ]

Hi Serguei,

Sounds good. The problematic topology is:

A <-o2ibA-> R1 <-o2ibB-> R2 <-tcp0-> R3 <-o2ibC-> C

where o2ibA may be OFED, MOFED, or OmniPath (we have some of each). There are several compute clusters (each of which includes clients A and routers R1), so |R1| is as low as 1 for some clusters and as high as 12 for others.

can the problem be reproduced simply with repeated "lnetctl ping" (between A <-> B)? Or only under load?

When (|R2|>1 and |R3|>1) and we are seeing the symptoms, we see intermittent "lnetctl ping" failures.

I'm not sure if we've tried both (|R2|==1 and |R3|>1) and (|R2|>1 and |R3|==1).  I'll try that and report back.

edit: I originally typed |R1| but meant |R2| in the "I'm not sure" sentence.
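
A hedged sketch of the kind of repeated-ping check used for these combinations (the target NID and count are hypothetical):

# run 1000 pings to a far-side NID and report any failures with a timestamp
for i in $(seq 1 1000); do
    lnetctl ping 172.19.1.1@o2ib600 > /dev/null 2>&1 || echo "$(date +%T) ping $i failed"
done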

Comment by Olaf Faaland [ 14/Oct/20 ]

Hi Serguei,

We've made some changes since our original experiment and now see somewhat different symptoms.  I set up both R2 and R3 to run Lustre 2.12.5, with just a single router in each of those clusters.

I saw intermittent failures with "lctl ping" from orelic to copper, with this topology (orelic == R2, zrelic == R3, and copper == C):

orelic3 <- tcp0 -> zrelic3 <- o2ib600 -> copper1 

The logs are 2020-oct-14-(orelic,zrelic,copper1).tgz

The logs include dmesg, debug logs (with +net), config files, and the output of "lctl ping" with timestamps in case it helps correlate ping failures with debug logs.

Comment by Serguei Smirnov [ 15/Oct/20 ]

Hi Olaf,

As far as I can tell from the logs, failed pings initiated from orelic3 are not even sent, probably because there are no resources available at the time. If it is a general issue, you should be able to see the same problem if you lnetctl-ping any other node from orelic3. If it is somehow tcp-specific, then you should be able to see this issue by lnetctl-pinging zrelic.

Orelic3 export dumps indicate drops on both interfaces during the test. I wonder what makes the router node so "busy" that it can't be bothered to send pings. If you manage to reproduce the issue with lnetctl-pinging from orelic to anything, could you please provide the output of 

lnetctl stats show
perfquery
ip -s link show

before and after the test? Could you please also run 

lnetctl global show 

How does orelic3 cpu usage look in top?
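
(A sketch of capturing these before and after a test run; the output file names are illustrative:)

# capture router-side counters before the test; repeat with ".after" when done
lnetctl stats show  > /tmp/lnet_stats.before
lnetctl global show > /tmp/lnet_global.before
perfquery           > /tmp/perfquery.before    # InfiniBand port counters
ip -s link show     > /tmp/iplink.before       # Ethernet interface counters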

Thanks,

Serguei.

Comment by Olaf Faaland [ 29/Apr/21 ]

Hi Serguei,

I simplified the experiment a bit: I rebooted one of the "relic" router nodes, zrelic5, into Lustre 2.12.6. I began to see the same symptoms in the console logs of the clients and servers. As before, zrelic5 thinks everything is fine.

1000 lctl pings from zrelic5 to orelic5 (across the ethernet) were successful, and 1000 lctl pings from zrelic5 to zinc1 (across the SAN zrelic5 is connected to) were successful.

CPU usage was very low on zrelic5. I'm attaching lu-14026.2021-04-28.tgz, a tarball with the information you asked for, although as I mentioned, pings were fine this time.

The first debug log was dumped while the issue was occurring. I did not have +net set. Note the lnet_attempt_msg_resend retry count messages. The second debug log was dumped after I'd attempted to stop lnet.
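
(For reference, a sketch of how such a debug log capture typically looks; the output path is illustrative:)

# enable LNet network debugging, then dump and clear the kernel debug buffer
lctl set_param debug=+net
lctl dk /tmp/zrelic5.debug.$(date +%s)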

thanks

Comment by Serguei Smirnov [ 15/Jun/21 ]

Here's a summary of today's online session with Olaf:

zrelic5 was upgraded to 2.12. 

lnetctl ping was reliable from zinc to zrelic5, to one of the orelic routers, to slug (a node behind orelic), and to surface84 (another node from an IB routing cluster behind orelic): we were able to ping multiple times without failing. Debug logs confirmed that zrelic5 was selected for routing (among the other zrelics).

lnet selftest failed between zinc and surface84. It looked like a load-induced failure: the 30-second test appeared to be fine at the beginning, then the reported bandwidth started to go down. After the node was rebooted, the issue couldn't be reproduced. Selftest between orelic and zinc, slug and zinc, and slug and zrelic5 didn't fail anymore. Debug logs confirmed that zrelic5 was selected for routing.

There were occasional bulk transfer errors reported by different nodes in the system (not involved in selftest runs). It looked like they started appearing in the logs after zrelic5 got upgraded. Some errors could still be seen after zrelic5 got rebooted.

Olaf, could you please provide some more detail on the ref count dump you did on zrelic5? It showed high counts compared to other nodes, but I don't remember the exact command you used to dump the counts.

 

Comment by Olaf Faaland [ 15/Jun/21 ]

Yes, the ref counts were from

lctl get_param peers 
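
(A hedged sketch of sampling those refcounts over time, in case it is useful; the interval and path are arbitrary:)

# append a timestamped snapshot of the peer table every 60 seconds
while true; do date; lctl get_param peers; sleep 60; done >> /tmp/peer_refcounts.log
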
Comment by Olaf Faaland [ 31/Aug/22 ]

Hi Serguei,

As I mentioned in LU-15234, 2.12.9 + change 48190 resolved the climbing peer reference counts, so we've added that patch to our current 2.12 branch, https://github.com/LLNL/lustre/commits/2.12.9-llnl. The tag is 2.12.9_3.llnl.

All the orelic nodes have been running 2.12.9_3.llnl for 6 days now and the system is stable.  At this point our clusters are all running lustre-2.12.9_2.llnl, 2.12.9_3.llnl (orelic), or lustre-2.14.0_17.llnl, except for zrelic, which is running lustre-2.10.8_11.chaos.

When I updated one zrelic node from 2.10 to 2.12.9_3.llnl (zrelic2), the peer refcounts did not rise with time, but I still saw the console log messages indicating Lustre timeouts and reconnects. I did not observe obvious indications of problems on zrelic2.  For the time being, I've reverted that so all the zrelic nodes are back to running 2.10.

So it seems there is at least one other issue. 

I'm still thinking about what to try or look for next.  Ideas would be welcome.

thanks

Comment by Olaf Faaland [ 07/Sep/22 ]

Hi Serguei,

Logs in attached file 2022-sep-07.lu14026.tgz

I think that's everything you said you wanted, but if not let me know.

thanks,

Olaf

Comment by Serguei Smirnov [ 12/Sep/22 ]

Hi Olaf,

Did you get the chance to experiment with removing related routes prior to upgrading the router and adding them back after the upgrade?
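
(A sketch of that remove/re-add sequence, with a hypothetical net name and gateway NID:)

# before upgrading the router: remove routes that go through it
lnetctl route del --net o2ib600 --gateway 192.168.120.13@tcp
# after the upgraded router is back in service: restore the routes
lnetctl route add --net o2ib600 --gateway 192.168.120.13@tcp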

On the subject of setting o2iblnd parameters (peer_credits/peer_credits_hiw/concurrent_sends) per o2ib LNet: indeed, it is currently not possible to do this via modparams or lnetctl. However, in theory it should be possible to rely on the per-connection credits negotiation process to achieve the desired effect.

Consider the following topology:

C1 <-o2ib0-> R <-o2ib1-> C2 

For example, if initially (peer_credits/peer_credits_hiw/concurrent_sends) is (8/4/8) on all nodes, and you want to gradually upgrade to (32/16/64), you can do it in the following order:

  1. Change config on router R to use (32/16/64). Connections initiated by C1 and C2 will negotiate down to (8/7/8)
  2. Change config on node C1 to use (32/16/64). Connections between C1 and R will use (32/16/64). C2 to R will still be at (8/7/8)
  3. Change config on node C2 to use (32/16/64). All connections will use (32/16/64)
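
For reference, a hedged sketch of how the target values might be set via modparams (a generic modprobe.d example using the values from the list above):

# /etc/modprobe.d/ko2iblnd.conf
options ko2iblnd peer_credits=32 peer_credits_hiw=16 concurrent_sends=64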

Thanks,

Serguei

Comment by Olaf Faaland [ 12/Sep/22 ]

Hi Serguei,

I have not yet tried the experiment of removing related routes prior to upgrading the router and adding them back after the upgrade.  I'll try to do it soon.

thanks
