[LU-14026] symptoms of message loss or corruption after upgrading routers to lustre 2.12.5 Created: 14/Oct/20 Updated: 12/Sep/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Olaf Faaland | Assignee: | Serguei Smirnov |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | llnl, topllnl |
| Environment: |
lustre 2.10.8 and 2.12.5 |
||
| Attachments: |
|
||
| Issue Links: |
|
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We have two router clusters, which we call RELICs, connecting the InfiniBand SAN in one building to the InfiniBand SAN in another building, with Ethernet between the routers. All the servers and clients in both buildings, and the router nodes within the clusters which connect to the SAN, are already at lustre 2.12.5. The routers in the RELIC clusters are at lustre 2.10.8. In this configuration, the system is stable.
When we power cycle the RELIC routers and boot them from an image with lustre 2.12.5, the router nodes themselves think everything is fine. There are no errors or warnings on the console, nor indications of failure in the debug log with +net. However, we begin to see symptoms on server nodes which seem to indicate corrupt, dropped, or delayed messages:
LNetError: PPPP:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, X seconds
LNetError: PPPP:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, X seconds
LNetError: PPPP:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with ZZZ@o2ib600 (0): c: X, oc: Y, rc: Z
LustreError: PPPP:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffffXXXX
LustreError: PPPP:0:(ldlm_lib.c:3279:target_bulk_io()) @@@ Reconnect on bulk READ...
LustreError: PPPP:0:(ldlm_lib.c:3285:target_bulk_io()) @@@ network error on bulk READ
LustreError: PPPP:0:(ldlm_lib.c:3294:target_bulk_io()) @@@ truncated bulk READ 0(1048576) XXX
Lustre: PPPP:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: ...
Lustre: ls1-OST000e: Bulk IO read error with XXX (at ZZZ@o2ib36), client will retry: rc -110
along with side effects such as reconnect attempts. |
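For reference, the "+net" debug logs mentioned above are typically gathered with lctl; a minimal sketch, assuming the dump file path is just an example:

    # Enable LNet network debugging, reproduce the problem, then dump the kernel debug buffer.
    lctl set_param debug=+net
    # ... reproduce the message-loss symptoms ...
    lctl dk /tmp/lustre-debug.$(hostname).$(date +%s)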
| Comments |
| Comment by Olaf Faaland [ 14/Oct/20 ] |
|
A more detailed explanation of the topology from a Lustre perspective:
(each name represents a cluster)
o2ib100 / o2ib600
syrah----+ / +---quartz
surface--+ / +---ruby
corona---+-orelic--------zrelic-+---copper(lustre1)
catalyst-+ / +---zinc(lustre2)
...------+ / +---...
The clusters on the left, including orelic, are on the o2ib100 InfiniBand SAN. orelic and zrelic are router clusters, each with 4 LNet router nodes. Within each compute cluster, the clients reach the SAN through the cluster's routers:
compute1----+
compute2----+--router1--SAN
compute3----+--router2--SAN
compute...--+
Both file systems, lustre1 and lustre2, are mounted on all compute clusters. |
| Comment by Olaf Faaland [ 14/Oct/20 ] |
|
Amir asked:
Either we stop lnet and then power cycle the router, or just power cycle the router; it then boots into an image with Lustre 2.12.
Discovery is disabled on all our machines right now, including these routers.
Yes |
| Comment by Olaf Faaland [ 14/Oct/20 ] |
|
One oddity from this is that if orelic and zrelic are running 2.12.5, but each is reduced to a single router node (for example, [oz]relic5 is on, but [oz]relic[2-4] are off), the symptoms go away. We tested that configuration originally thinking that one of the routers had a bad NIC, cable, switch port, etc. So we tried it with only [oz]relic2 and had success, and then with only [oz]relic5, and also had success. |
| Comment by Peter Jones [ 14/Oct/20 ] |
|
Serguei, can you please assist on this one? Thanks, Peter |
| Comment by Serguei Smirnov [ 14/Oct/20 ] |
|
Hi Olaf, I'll retrace a bit to make sure I understand the problem correctly. My understanding is that you have the following topology: A <-o2ibA-> R1 <-tcp0-> R2 <-o2ibB-> B, and you experience issues when there are multiple nodes in R1 and R2.
1) When |R1|>1 and |R2|>1, can the problem be reproduced simply with repeated "lnetctl ping" (between A <-> B)? Or only under load?
2) Have you tried with |R1|=1 and |R2|>1 and reproduced the problem?
Thanks, Serguei.
|
| Comment by Olaf Faaland [ 14/Oct/20 ] |
|
Hi Serguei, Sounds good. The problematic topology is: A <-o2ibA-> R1 <-o2ibB-> R2 <-tcp0-> R3 <-o2ibC-> C, where o2ibA may be OFED, MOFED, or OmniPath (we have some of each). There are several compute clusters (which include client A and router R1), so |R1| is as low as 1 for some clusters and as high as 12 for others.
When (|R2| > 1 and |R3| > 1) and we are seeing the symptoms, we see intermittent "lnetctl ping" failures. I'm not sure if we've tried both (|R2| == 1 and |R3| > 1) and (|R2| > 1 and |R3| == 1). I'll try that and report back. edit: I originally typed |R1| but meant |R2| in the "I'm not sure" sentence. |
| Comment by Olaf Faaland [ 14/Oct/20 ] |
|
Hi Serguei, We've made some changes since our original experiment and now see somewhat different symptoms. I set up both R2 and R3 to run lustre 2.12.5, with just a single router in each of those clusters. I saw intermittent failures with "lctl ping" from orelic to copper, with this topology (orelic == R2, zrelic == R3, and copper == C): orelic3 <- tcp0 -> zrelic3 <- o2ib600 -> copper1. The logs are 2020-oct-14-(orelic,zrelic,copper1).tgz. They include dmesg, debug logs (with +net), config files, and the output of "lctl ping" with timestamps, in case it helps correlate ping failures with debug logs. |
| Comment by Serguei Smirnov [ 15/Oct/20 ] |
|
Hi Olaf, As far as I can tell from the logs, failed pings initiated from orelic3 are not even sent, probably because there are no resources at the time. If it is a general issue, you should be able to see the same problem if you lnetctl-ping any other node from orelic3. If it is tcp-specific somehow, then you should be able to see this issue by lnetctl-pinging zrelic. Orelic3 export dumps indicate drops on both interfaces during the test. I wonder what makes the router node so "busy" that it can't be bothered to send pings. If you manage to reproduce the issue with lnetctl-pinging from orelic to anything, could you please provide the output of the following, before and after the test?
lnetctl stats show
perfquery
ip -s link show
Could you please also run "lnetctl global show". How does orelic3 CPU usage look in top? Thanks, Serguei. |
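A minimal sketch of collecting the requested before/after data on orelic3; the snapshot helper and the output file names are hypothetical, and perfquery is run with its defaults for the local HCA port:

    # Hypothetical helper: snapshot the counters requested above into /tmp.
    snapshot() {
        tag=$1
        lnetctl stats show  > /tmp/lu14026.$tag.lnet_stats
        lnetctl global show > /tmp/lu14026.$tag.lnet_global
        perfquery           > /tmp/lu14026.$tag.perfquery
        ip -s link show     > /tmp/lu14026.$tag.iplink
    }

    snapshot before
    # ... run the lnetctl pings that reproduce the failures ...
    snapshot after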
| Comment by Olaf Faaland [ 29/Apr/21 ] |
|
Hi Serguei, I simplified the experiment a bit. I simply rebooted one of the "relic" router nodes, zrelic5, into Lustre 2.12.6, and began to see the same symptoms in the console logs of the clients and servers. As before, zrelic5 thinks everything is fine. 1000 lctl pings from zrelic5 to orelic5 (across the ethernet) were successful, and 1000 lctl pings from zrelic5 to zinc1 (across the SAN zrelic5 is connected to) were also successful. CPU usage was very low on zrelic5. I'm attaching lu-14026.2021-04-28.tgz. The first debug log was dumped while the issue was occurring; I did not have +net set. Note the lnet_attempt_msg_resend retry count messages. The second debug log was dumped after I'd attempted to stop lnet. thanks |
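For reference, a minimal sketch of the kind of repeated-ping test described above; the peer NID is a placeholder:

    # Hypothetical loop: 1000 pings from zrelic5 to one peer, counting failures.
    fail=0
    for i in $(seq 1 1000); do
        lctl ping <peer-nid> >/dev/null 2>&1 || fail=$((fail+1))
    done
    echo "failed pings: $fail / 1000"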
| Comment by Serguei Smirnov [ 15/Jun/21 ] |
|
Here's a summary of today's online session with Olaf:
zrelic5 was upgraded to 2.12.
lnetctl ping was reliable from zinc to zrelic5, to one of the orelic routers, to slug (a node behind orelic), and to surface84 (another node from an IB routing cluster behind orelic): able to ping multiple times without failing. Debug logs confirmed that zrelic5 was selected for routing (among other zrelics).
lnet selftest failed between zinc and surface84. It looked like a load-induced failure: the 30-second test appeared to be fine in the beginning, then the reported bandwidth started to go down.
After the node got rebooted, the issue couldn't be reproduced. Selftest between orelic and zinc, slug and zinc, slug and zrelic5 didn't fail anymore. Debug logs confirmed that zrelic5 was selected for routing.
There were occasional bulk transfer errors reported by different nodes in the system (not involved in selftest runs). It looked like they started appearing in the logs after zrelic5 got upgraded. Some errors could still be seen after zrelic5 got rebooted.
Olaf, could you please provide some more detail on the refcount dump you did on zrelic5? It showed high counts compared to other nodes, but I don't remember the exact command you used to dump the counts.
|
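For reference, a minimal lnet_selftest sketch of the kind of bulk-read test run during the session; the group NIDs are placeholders and the 30-second duration matches the test described above:

    # Hypothetical lnet_selftest run between a zinc client and surface84.
    modprobe lnet_selftest
    export LST_SESSION=$$
    lst new_session lu14026
    lst add_group zinc_grp    <zinc-nid>@o2ib600
    lst add_group surface_grp <surface84-nid>@o2ib100
    lst add_batch bulk_read
    lst add_test --batch bulk_read --from zinc_grp --to surface_grp brw read size=1M
    lst run bulk_read
    lst stat zinc_grp surface_grp &    # watch reported bandwidth while the batch runs
    sleep 30
    kill %1
    lst stop bulk_read
    lst end_session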
| Comment by Olaf Faaland [ 15/Jun/21 ] |
|
Yes, the ref counts were from lctl get_param peers |
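For comparison across routers, the same dump might be gathered in one shot; a sketch assuming pdsh/dshbak are available and the router host list is as described earlier:

    # Hypothetical: dump the LNet peer table (which includes the refcounts) on every router.
    pdsh -w orelic[2-5],zrelic[2-5] 'lctl get_param -n peers' | dshbak -c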
| Comment by Olaf Faaland [ 31/Aug/22 ] |
|
Hi Serguei, As I mentioned in All the orelic nodes have been running 2.12.9_3.llnl for 6 days now and the system is stable. At this point our clusters are all running lustre-2.12.9_2.llnl, 2.12.9_3.llnl (orelic), or lustre-2.14.0_17.llnl, except for zrelic, which is running lustre-2.10.8_11.chaos. When I updated one zrelic node from 2.10 to 2.12.9_3.llnl (zrelic2), the peer refcounts did not rise with time, but I still saw the console log messages indicating Lustre timeouts and reconnects. I did not observe obvious indications of problems on zrelic2. For the time being, I've reverted that so all the zrelic nodes are back to running 2.10. So it seems there is at least one other issue. I'm still thinking about what to try or look for next. Ideas would be welcome. thanks |
| Comment by Olaf Faaland [ 07/Sep/22 ] |
|
Hi Serguei, Logs in attached file 2022-sep-07.lu14026.tgz I think that's everything you said you wanted, but if not let me know. thanks, Olaf |
| Comment by Serguei Smirnov [ 12/Sep/22 ] |
|
Hi Olaf, Did you get the chance to experiment with removing related routes prior to upgrading the router and adding them back after the upgrade? On the subject of setting o2iblnd parameters (peer_credits/peer_credits_hiw/concurrent_sends) per o2ib lnet: indeed, currently it is not possible to do this via modparams or lnetctl. However, in theory it should be possible to rely on the per-connection credits negotiation process in order to achieve the desired effect. Consider the following topology: C1 <-o2ib0-> R <-o2ib1-> C2. For example, if initially (peer_credits/peer_credits_hiw/concurrent_sends) is (8/4/8) on all nodes, and you want to gradually upgrade to (32/16/64), you can do it in the following order:
Thanks, Serguei |
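For reference, the o2iblnd credit settings discussed above are node-wide module parameters; a minimal sketch of where they would be set, using the example target values from the comment (not a recommendation):

    # /etc/modprobe.d/ko2iblnd.conf (hypothetical example; values from the (32/16/64) example above)
    # The settings apply to all o2ib networks on the node; per-network values are not supported,
    # so mixed values across peers are reconciled by the per-connection credits negotiation.
    options ko2iblnd peer_credits=32 peer_credits_hiw=16 concurrent_sends=64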
| Comment by Olaf Faaland [ 12/Sep/22 ] |
|
Hi Serguei, I have not yet done experiment with removing related routes prior to upgrading the router and adding them back after the upgrade. I'll try to do it soon. thanks |