
symptoms of message loss or corruption after upgrading routers to lustre 2.12.5

Details

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Critical
    • Environment: Lustre 2.10.8 and 2.12.5; mixed OFED, MOFED, Omnipath, and tcp
    • Severity: 3

    Description

      We have two router clusters, which we call RELICs, that connect the InfiniBand SAN in one building to the InfiniBand SAN in another building, with Ethernet between the routers.  All the servers and clients in both buildings, and the router nodes within the clusters which connect to the SAN, are already at Lustre 2.12.5.  The routers in the RELIC clusters are at Lustre 2.10.8.  In this configuration, the system is stable.

      When we power cycle the RELIC routers and boot them from an image with lustre 2.12.5, the router nodes themselves think everything is fine.  There are no errors or warnings on the console, nor indications of failure in the debug log with +net.  However, we begin to see symptoms on server nodes which seem to indicate corrupt, dropped, or delayed messages:

      LNetError: PPPP:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, X seconds
      LNetError: PPPP:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, X seconds
      LNetError: PPPP:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with ZZZ@o2ib600 (0): c: X, oc: Y, rc: Z
      LustreError: PPPP:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffffXXXX
      LustreError: PPPP:0:(ldlm_lib.c:3279:target_bulk_io()) @@@ Reconnect on bulk READ...
      LustreError: PPPP:0:(ldlm_lib.c:3285:target_bulk_io()) @@@ network error on bulk READ
      LustreError: PPPP:0:(ldlm_lib.c:3294:target_bulk_io()) @@@ truncated bulk READ 0(1048576) XXX
      Lustre: PPPP:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: ...
      Lustre: ls1-OST000e: Bulk IO read error with XXX (at ZZZ@o2ib36), client will retry: rc -110
      

      along with side effects such as reconnect attempts.
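      For reference, a minimal sketch of how the +net debug log mentioned above can be captured on a router node while reproducing the symptoms; the output path is illustrative, not the one actually used:

          lctl set_param debug=+net      # add LNet tracing to the debug mask
          lctl clear                     # clear the kernel debug buffer before the test
          # ... reproduce the problem ...
          lctl dk /tmp/lnet-debug.log    # dump the kernel debug buffer to a file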

      Attachments

        1. 2020-oct-14-copper1.tgz
          12.62 MB
        2. 2020-oct-14-orelic.tgz
          36.73 MB
        3. 2020-oct-14-zrelic.tgz
          22.32 MB
        4. 2022-sep-07.lu14026.tgz
          287 kB
        5. lu-14026.2021-04-28.tgz
          3.80 MB

        Issue Links

          Activity

            [LU-14026] symptoms of message loss or corruption after upgrading routers to lustre 2.12.5

            ofaaland Olaf Faaland added a comment -

            Our routers are now all on 2.12 or (in some cases) 2.15. I believe these symptoms were likely due to a combination of issues, and we fixed enough of them to allow the 2.12 upgrade. Closing.

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            > Did you get the chance to experiment with removing related routes prior to upgrading
            > the router and adding them back after the upgrade?

            This turned out to be harder than I anticipated. We have enough systems and routers that at any given time there are often one or more routers whose routes have been altered to work around some problem being fixed or troubleshot, leading to some difficult corner cases.

            At this point we've switched entirely off of Lustre 2.10, and now exclusively use Lustre 2.12 and Lustre 2.15 on the relics. We've seen some of the same symptoms, intermittently, which Gian is investigating. Given that the versions involved are different, and the underlying cause(s) may well be different, I'll close this ticket, and we'll open a new one when there are details.

            thanks

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            I have not yet done the experiment of removing related routes prior to upgrading the router and adding them back after the upgrade.  I'll try to do it soon.

            thanks


            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            Did you get the chance to experiment with removing related routes prior to upgrading the router and adding them back after the upgrade?

            On the subject of setting o2iblnd parameters (peer_credits/peer_credits_hiw/concurrent_sends) per o2ib lnet: indeed, currently it is not possible to do this via modparams or lnetctl. However, in theory it should be possible to rely on per-connection credits negotiation process in order to achieve the desired effect.

            Consider the following topology:

            C1 <-o2ib0-> R <-o2ib1-> C2 

            For example, if initially (peer_credits/peer_credits_hiw/concurrent_sends) is (8/4/8) on all nodes, and you want to gradually upgrade to (32/16/64), you can do it in the following order:

            1. Change config on router R to use (32/16/64). Connections initiated by C1 and C2 will negotiate down to (8/7/8)
            2. Change config on node C1 to use (32/16/64). Connections between C1 and R will use (32/16/64). C2 to R will still be at (8/7/8)
            3. Change config on node C2 to use (32/16/64). All connections will use (32/16/64)
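            As a sketch of what step 1 above might look like in practice, assuming the standard ko2iblnd module parameters and an illustrative config file path (not taken from this ticket), followed by a reload of LNet on router R:

                # /etc/modprobe.d/ko2iblnd.conf (illustrative path)
                options ko2iblnd peer_credits=32 peer_credits_hiw=16 concurrent_sends=64

                lustre_rmmod          # unload Lustre/LNet modules so the new options take effect
                modprobe lnet
                lctl network up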

            Thanks,

            Serguei


            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            Logs in attached file 2022-sep-07.lu14026.tgz

            I think that's everything you said you wanted, but if not let me know.

            thanks,

            Olaf

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            As I mentioned in LU-15234, 2.12.9 + change 48190 resolved the climbing peer reference counts, so we've added that patch to our current 2.12 branch, https://github.com/LLNL/lustre/commits/2.12.9-llnl. The tag is 2.12.9_3.llnl.

            All the orelic nodes have been running 2.12.9_3.llnl for 6 days now and the system is stable.  At this point our clusters are all running lustre-2.12.9_2.llnl, 2.12.9_3.llnl (orelic), or lustre-2.14.0_17.llnl, except for zrelic, which is running lustre-2.10.8_11.chaos.

            When I updated one zrelic node from 2.10 to 2.12.9_3.llnl (zrelic2), the peer refcounts did not rise with time, but I still saw the console log messages indicating Lustre timeouts and reconnects. I did not observe obvious indications of problems on zrelic2.  For the time being, I've reverted that so all the zrelic nodes are back to running 2.10.

            So it seems there is at least one other issue. 

            I'm still thinking about what to try or look for next.  Ideas would be welcome.

            thanks


            ofaaland Olaf Faaland added a comment -

            Yes, the ref counts were from

            lctl get_param peers 

            ssmirnov Serguei Smirnov added a comment -

            Here's a summary of today's online session with Olaf:

            zrelic5 was upgraded to 2.12. 

            lnetctl ping was reliable from zinc to zrelic5, to one of the orelic routers, to slug (a node behind orelic), and to surface84 (another node from the IB routing cluster behind orelic): we were able to ping multiple times without failing. Debug logs confirmed that zrelic5 was selected for routing (among the other zrelics).

            lnet selftest failed between zinc and surface84. It looked like a load-induced failure: the 30-second test appeared to be fine in the beginning, then the reported bandwidth started to go down. After the node was rebooted, the issue couldn't be reproduced. Selftest between orelic and zinc, slug and zinc, and slug and zrelic5 didn't fail anymore. Debug logs confirmed that zrelic5 was selected for routing.
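            For reference, a minimal lnet_selftest sketch of the kind of read test described above; the NIDs and sizes below are placeholders, not the ones actually used in the session:

                modprobe lnet_selftest
                export LST_SESSION=$$
                lst new_session read_test
                lst add_group clients 10.0.0.1@o2ib       # e.g. zinc (placeholder NID)
                lst add_group servers 10.0.1.1@o2ib       # e.g. surface84 (placeholder NID)
                lst add_batch bulk_read
                lst add_test --batch bulk_read --from clients --to servers brw read size=1M
                lst run bulk_read
                lst stat clients servers                  # watch reported bandwidth for ~30 seconds
                lst end_session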

            There were occasional bulk transfer errors reported by different nodes in the system (not involved in selftest runs). It looked like they started appearing in the logs after zrelic5 got upgraded. Some errors could still be seen after zrelic5 got rebooted.

            Olaf, could you please provide some more detail on the ref count dump you did on zrelic5? It showed high counts compared to other nodes, but I don't remember the exact command you used to dump the counts.

             

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            I simplified the experiment a bit. I simply rebooted one of the "relic" router nodes, zrelic5, into Lustre 2.12.6. I began to see the same symptoms in the console logs of the clients and servers. As before, zrelic5 thinks everything is fine.

            1000 lctl pings from zrelic5 to orelic5 (across the ethernet) were successful, and 1000 lctl pings from zrelic5 to zinc1 (across the SAN zrelic5 is connected to) were successful.
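            A sketch of one way to run such a ping loop; the NID here is a placeholder, not the actual one used for orelic5:

                for i in $(seq 1 1000); do
                    lctl ping 10.1.1.5@tcp || echo "ping $i failed"
                done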

            CPU usage was very low on zrelic5. I'm attaching lu-14026.2021-04-28.tgz, a tarball with the information you asked for, although as I mentioned, pings were fine this time.

            The first debug log was dumped while the issue was occurring. I did not have +net set. Note the lnet_attempt_msg_resend retry count messages. The second debug log was dumped after I'd attempted to stop lnet.

            thanks


            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            As far as I can tell from the logs, failed pings initiated from orelic3 are not even sent, probably because there are no resources at the time. If it is a general issue, you should be able to see the same problem if you lnetctl-ping any other node from orelic3. If it is somehow tcp-specific, then you should be able to see this issue by lnetctl-pinging zrelic.

            Orelic3 export dumps indicate drops on both interfaces during the test. I wonder what makes the router node so "busy" that it can't be bothered to send pings. If you manage to reproduce the issue with lnetctl-pinging from orelic to anything, could you please provide the output of 

            lnetctl stats show
            perfquery
            ip -s link show

            before and after the test? Could you please also run 

            lnetctl global show 

            How does orelic3 cpu usage look in top?
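            For reference, a sketch of capturing the counters listed above before and after a ping test on orelic3; the output paths are illustrative:

                {
                    lnetctl stats show
                    lnetctl global show
                    perfquery
                    ip -s link show
                } > /tmp/orelic3-stats.before 2>&1

                # ... run the lnetctl ping test from orelic3 ...

                {
                    lnetctl stats show
                    lnetctl global show
                    perfquery
                    ip -s link show
                } > /tmp/orelic3-stats.after 2>&1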

            Thanks,

            Serguei.


            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 9
