LU-14026: symptoms of message loss or corruption after upgrading routers to lustre 2.12.5

Details

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Critical
    • Environment: Lustre 2.10.8 and 2.12.5; mixed OFED, MOFED, Omnipath, and tcp
    • Severity: 3

    Description

      We have two router clusters, which we call RELICs, which connect the infiniband SAN in one building with the infiniband SAN in another building, with ethernet between the routers.  All the servers and clients in both buildings, and the router nodes within the clusters which connect to the SAN, are already at lustre 2.12.5.  The routers in the RELIC clusters are at lustre 2.10.8.  In this configuration, the system is stable.

      When we power cycle the RELIC routers and boot them from an image with lustre 2.12.5, the router nodes themselves think everything is fine.  There are no errors or warnings on the console, nor indications of failure in the debug log with +net.  However, we begin to see symptoms on server nodes which seem to indicate corrupt, dropped, or delayed messages:

      LNetError: PPPP:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, X seconds
      LNetError: PPPP:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, X seconds
      LNetError: PPPP:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with ZZZ@o2ib600 (0): c: X, oc: Y, rc: Z
      LustreError: PPPP:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffffXXXX
      LustreError: PPPP:0:(ldlm_lib.c:3279:target_bulk_io()) @@@ Reconnect on bulk READ...
      LustreError: PPPP:0:(ldlm_lib.c:3285:target_bulk_io()) @@@ network error on bulk READ
      LustreError: PPPP:0:(ldlm_lib.c:3294:target_bulk_io()) @@@ truncated bulk READ 0(1048576) XXX
      Lustre: PPPP:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: ...
      Lustre: ls1-OST000e: Bulk IO read error with XXX (at ZZZ@o2ib36), client will retry: rc -110
      

      along with side effects such as reconnect attempts.

      Attachments

        1. 2020-oct-14-copper1.tgz
          12.62 MB
        2. 2020-oct-14-orelic.tgz
          36.73 MB
        3. 2020-oct-14-zrelic.tgz
          22.32 MB
        4. 2022-sep-07.lu14026.tgz
          287 kB
        5. lu-14026.2021-04-28.tgz
          3.80 MB


          Activity

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            I have not yet done the experiment of removing the related routes prior to upgrading the router and adding them back after the upgrade. I'll try to do it soon.

            thanks


            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            Did you get the chance to experiment with removing related routes prior to upgrading the router and adding them back after the upgrade?

            On the subject of setting o2iblnd parameters (peer_credits/peer_credits_hiw/concurrent_sends) per o2ib LNet network: indeed, it is currently not possible to do this via module parameters or lnetctl. However, in theory it should be possible to rely on the per-connection credit negotiation process to achieve the desired effect.

            Consider the following topology:

            C1 <-o2ib0-> R <-o2ib1-> C2 

            For example, if initially (peer_credits/peer_credits_hiw/concurrent_sends) is (8/4/8) on all nodes and you want to gradually upgrade to (32/16/64), you can do it in the following order (a configuration sketch follows the list):

            1. Change config on router R to use (32/16/64). Connections initiated by C1 and C2 will negotiate down to (8/7/8)
            2. Change config on node C1 to use (32/16/64). Connections between C1 and R will use (32/16/64). C2 to R will still be at (8/7/8)
            3. Change config on node C2 to use (32/16/64). All connections will use (32/16/64)
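
            As an illustration of step 1 (steps 2 and 3 would use the same lines on C1 and C2), here is a minimal sketch of how the new values could be applied, assuming the conventional ko2iblnd module options file and a full LNet restart; the file path and restart sequence are generic, not specific to any particular system:

            # /etc/modprobe.d/ko2iblnd.conf on router R (step 1)
            options ko2iblnd peer_credits=32 peer_credits_hiw=16 concurrent_sends=64

            # reload LNet (or reboot the node) so the new module parameters take effect
            lustre_rmmod
            modprobe lnet
            lnetctl lnet configure --all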

            Thanks,

            Serguei


            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            Logs are in the attached file 2022-sep-07.lu14026.tgz.

            I think that's everything you said you wanted, but if not let me know.

            thanks,

            Olaf

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            As I mentioned in LU-15234, 2.12.9 + change 48190 resolved the climbing peer reference counts, so we've added that patch to our current 2.12 branch, https://github.com/LLNL/lustre/commits/2.12.9-llnl. The tag is 2.12.9_3.llnl.

            All the orelic nodes have been running 2.12.9_3.llnl for 6 days now and the system is stable.  At this point our clusters are all running lustre-2.12.9_2.llnl, 2.12.9_3.llnl (orelic), or lustre-2.14.0_17.llnl, except for zrelic, which is running lustre-2.10.8_11.chaos.

            When I updated one zrelic node from 2.10 to 2.12.9_3.llnl (zrelic2), the peer refcounts did not rise with time, but I still saw the console log messages indicating Lustre timeouts and reconnects. I did not observe obvious indications of problems on zrelic2.  For the time being, I've reverted that so all the zrelic nodes are back to running 2.10.

            So it seems there is at least one other issue. 

            I'm still thinking about what to try or look for next.  Ideas would be welcome.

            thanks


            ofaaland Olaf Faaland added a comment -

            Yes, the ref counts were from

            lctl get_param peers 
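
            (If it helps to see how the counts evolve, a simple timestamped capture loop like the one below is one way to collect them over time; the interval and output file name are arbitrary choices:)

            # record peer refcounts once a minute so growth over time is visible
            while true; do
                date
                lctl get_param peers
                sleep 60
            done >> /var/tmp/peer-refcounts.log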

            ssmirnov Serguei Smirnov added a comment -

            Here's a summary of today's online session with Olaf:

            zrelic5 was upgraded to 2.12. 

            lnetctl ping was reliable from zinc to zrelic5, to one of the orelic routers, to slug (a node behind orelic), and to surface84 (another node from the IB routing cluster behind orelic): we were able to ping multiple times without failing. Debug logs confirmed that zrelic5 was selected for routing (among the other zrelics).

            lnet selftest failed between zinc and surface84. It looked like a load-induced failure: the 30-second test appeared to be fine at the beginning, then the reported bandwidth started to go down. After the node was rebooted, the issue couldn't be reproduced. Selftests between orelic and zinc, slug and zinc, and slug and zrelic5 didn't fail anymore. Debug logs confirmed that zrelic5 was selected for routing.
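
            (For reference, a selftest run of the kind described above would look roughly like the following; this is a generic lnet_selftest sketch, and the group names, NIDs, and net names are placeholders rather than the real addresses of zinc and surface84:)

            # lnet_selftest module must be loaded on all involved nodes first
            export LST_SESSION=$$
            lst new_session read_test
            lst add_group clients 10.0.0.1@o2ib0
            lst add_group servers 10.1.0.1@o2ib1
            lst add_batch bulk_read
            lst add_test --batch bulk_read --from clients --to servers brw read size=1M
            lst run bulk_read
            lst stat servers & sleep 30; kill $!    # report bandwidth for 30 seconds
            lst stop bulk_read
            lst end_session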

            There were occasional bulk transfer errors reported by different nodes in the system (not involved in selftest runs). It looked like they started appearing in the logs after zrelic5 got upgraded. Some errors could still be seen after zrelic5 got rebooted.

            Olaf, could you please provide some more detail on the ref count dump you did on zrelic5? It showed high counts compared to other nodes, but I don't remember the exact command you used to dump the counts.

             

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            I simplified the experiment a bit. I simply rebooted one of the "relic" router nodes, zrelic5, into Lustre 2.12.6. I began to see the same symptoms in the console logs of the clients and servers. As before, zrelic5 thinks everything is fine.

            1000 lctl pings from zrelic5 to orelic5 (across the ethernet) were successful, and 1000 lctl pings from zrelic5 to zinc1 (across the SAN zrelic5 is connected to) were successful.
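
            (A loop along these lines is one way to run such a repeated-ping test; the NID shown is a placeholder, and timestamping the failures makes them easier to line up with the debug logs:)

            # repeat lctl ping 1000 times, logging any failures with a timestamp
            for i in $(seq 1 1000); do
                lctl ping 10.0.0.1@o2ib0 || echo "ping $i failed at $(date)"
            done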

            CPU usage was very low on zrelic5. I'm attaching lu-14026.2021-04-28.tgz, a tarball with the information you asked for, although as I mentioned pings were fine this time.

            The first debug log was dumped while the issue was occurring. I did not have +net set. Note the lnet_attempt_msg_resend retry count messages. The second debug log was dumped after I'd attempted to stop lnet.

            thanks


            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            As far as I can tell from the logs, failed pings initiated from orelic3 are not even sent, probably because there are no resources available at the time. If it is a general issue, you should be able to see the same problem if you lnetctl-ping any other node from orelic3. If it is somehow tcp-specific, then you should be able to see this issue by lnetctl-pinging zrelic.

            Orelic3 export dumps indicate drops on both interfaces during the test. I wonder what makes the router node so "busy" that it can't be bothered to send pings. If you manage to reproduce the issue with lnetctl-pinging from orelic to anything, could you please provide the output of 

            lnetctl stats show
            perfquery
            ip -s link show

            before and after the test? Could you please also run 

            lnetctl global show 

            How does orelic3 cpu usage look in top?
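
            (A small wrapper along these lines could capture all of the above in one shot before and after the test; the output path, label, and snapshot lengths are just suggestions:)

            # snapshot the counters requested above; run once before and once after the test
            LABEL=before                                  # or "after"
            OUT=/var/tmp/lu14026-orelic3-${LABEL}.txt
            {
                date
                lnetctl stats show
                lnetctl global show
                perfquery                                 # per-port IB counters
                ip -s link show
                top -b -n 1 | head -n 20                  # CPU usage snapshot
            } > "$OUT" 2>&1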

            Thanks,

            Serguei.

            ofaaland Olaf Faaland added a comment - - edited

            Hi Serguei,

            We've made some changes since our original experiment and now see somewhat different symptoms.   I set up both R2 and R3 to run lustre 2.12.5, with just a single router in each of those clusters.

            I saw intermittent failures with "lctl ping" from orelic to copper, with this topology (orelic == R2, zrelic == R3, and copper == C):

            orelic3 <- tcp0 -> zrelic3 <- o2ib600 -> copper1 

            The logs are in the attached files 2020-oct-14-orelic.tgz, 2020-oct-14-zrelic.tgz, and 2020-oct-14-copper1.tgz.

            The logs include dmesg, debug logs (with +net), config files, and the output of "lctl ping" with timestamps in case it helps correlate ping failures with debug logs.
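
            (For completeness, +net debug logs of this kind are typically captured with a sequence like the following; the output file name is arbitrary:)

            # enable LNet networking debug, reproduce the failures, then dump the buffer
            lctl set_param debug=+net
            lctl clear                                   # start from an empty debug buffer
            # ... run the timestamped lctl ping test here ...
            lctl dk /tmp/lnet-debug.$(hostname).log      # dump the kernel debug buffer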

            ofaaland Olaf Faaland added a comment - - edited

            Hi Serguei,

            Sounds good. The problematic topology is:

            A <-o2ibA-> R1 <-o2ibB-> R2 <-tcp0-> R3 <-o2ibC-> C
            

            where o2ibA may be OFED, MOFED, or OmniPath (we have some of each). There are several compute clusters (which include client A and router R1), so |R1| is as low as 1 for some clusters and as high as 12 for others.
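
            (For context, routes for the A-to-C path in this topology would typically be declared along the following lines with lnetctl; the gateway NIDs are placeholders, and only one direction is shown:)

            # on client A (directly on o2ibA): reach o2ibC through router R1
            lnetctl route add --net o2ibC --gateway <R1-nid-on-o2ibA> --hop 4
            # on router R1 (o2ibA/o2ibB): next hop towards o2ibC is R2
            lnetctl route add --net o2ibC --gateway <R2-nid-on-o2ibB> --hop 3
            # on router R2 (o2ibB/tcp0): next hop towards o2ibC is R3
            lnetctl route add --net o2ibC --gateway <R3-nid-on-tcp0> --hop 2
            # the reverse direction (towards o2ibA) mirrors this, and each router
            # additionally needs forwarding enabled:
            lnetctl set routing 1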

            To your question, "can the problem be reproduced simply with repeated lnetctl ping (between A <-> B), or only under load?":

            When (|R2| > 1 and |R3| > 1) and we are seeing the symptoms, we see intermittent "lnetctl ping" failures.

            I'm not sure if we've tried both (|R2| == 1 and |R3| > 1) and (|R2| > 1 and |R3| == 1). I'll try that and report back.

            edit: I originally typed |R1| but meant |R2| in the "I'm not sure" sentence.

            ssmirnov Serguei Smirnov added a comment - - edited

            Hi Olaf,

            I'll retrace a bit to make sure I understand the problem correctly. My understanding is that you have the following topology:

            A <-o2ibA-> R1 <-tcp0-> R2 <-o2ibB-> B

            and you experience issues when there are multiple nodes in R1 and R2.

            1) When |R1| > 1 and |R2| > 1, can the problem be reproduced simply with repeated "lnetctl ping" (between A <-> B)? Or only under load?

            2) Have you tried with |R1| = 1 and |R2| > 1 and reproduced the problem?

            Thanks,

            Serguei.

             


            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 9
