Lustre / LU-14026

symptoms of message loss or corruption after upgrading routers to lustre 2.12.5

Details

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Critical
    • None
    • None
    • Environment: lustre 2.10.8 and 2.12.5; mixed OFED, MOFED, Omnipath, and tcp
    • Severity: 3

    Description

      We have two router clusters, which we call RELICs, that connect the infiniband SAN in one building to the infiniband SAN in another building, with ethernet between the routers.  All the servers and clients in both buildings, and the router nodes within the clusters which connect to the SAN, are already at lustre 2.12.5.  The routers in the RELIC clusters are at lustre 2.10.8.  In this configuration, the system is stable.
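      For reference, an LNet router that bridges an o2ib network and a tcp network is typically configured along these lines; the net numbers and interface names below are illustrative only, not our actual settings:

      # /etc/modprobe.d/lustre.conf on a router node (illustrative)
      # one LNet network facing the local IB SAN, one facing the inter-building ethernet,
      # with forwarding enabled so the node routes between the two
      options lnet networks="o2ib600(ib0),tcp0(eth0)" forwarding="enabled"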

      When we power cycle the RELIC routers and boot them from an image with lustre 2.12.5, the router nodes themselves think everything is fine.  There are no errors or warnings on the console, nor indications of failure in the debug log with +net.  However, we begin to see symptoms on server nodes which seem to indicate corrupt, dropped, or delayed messages:

      LNetError: PPPP:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, X seconds
      LNetError: PPPP:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, X seconds
      LNetError: PPPP:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with ZZZ@o2ib600 (0): c: X, oc: Y, rc: Z
      LustreError: PPPP:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffffXXXX
      LustreError: PPPP:0:(ldlm_lib.c:3279:target_bulk_io()) @@@ Reconnect on bulk READ...
      LustreError: PPPP:0:(ldlm_lib.c:3285:target_bulk_io()) @@@ network error on bulk READ
      LustreError: PPPP:0:(ldlm_lib.c:3294:target_bulk_io()) @@@ truncated bulk READ 0(1048576) XXX
      Lustre: PPPP:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: ...
      Lustre: ls1-OST000e: Bulk IO read error with XXX (at ZZZ@o2ib36), client will retry: rc -110
      

      along with side effects such as reconnect attempts.
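      For reference, a minimal way to capture the kind of +net debug log mentioned above (assuming plain lctl rather than a site-specific wrapper):

      # enable LNet message tracing in the kernel debug mask
      lctl set_param debug=+net
      # ... reproduce the symptoms ...
      # dump the kernel debug buffer to a file for analysis
      lctl dk > /tmp/lnet-debug.$(hostname).log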

      Attachments

        1. 2020-oct-14-copper1.tgz
          12.62 MB
        2. 2020-oct-14-orelic.tgz
          36.73 MB
        3. 2020-oct-14-zrelic.tgz
          22.32 MB
        4. 2022-sep-07.lu14026.tgz
          287 kB
        5. lu-14026.2021-04-28.tgz
          3.80 MB

        Issue Links

          Activity

            [LU-14026] symptoms of message loss or corruption after upgrading routers to lustre 2.12.5

            ofaaland Olaf Faaland added a comment -

            Our routers are now all on 2.12 or (in some cases) 2.15. I believe these symptoms were likely due to a combination of issues, and we fixed enough of them to allow the 2.12 upgrade. Closing.

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            > Did you get the chance to experiment with removing related routes prior to upgrading
            > the router and adding them back after the upgrade?

            This turned out to be harder than I anticipated. We have enough systems and routers that there are often one or more routers whose routes have been altered to work around some problem being fixed or troubleshot, leading to some difficult corner cases.

            At this point we've switched entirely off of Lustre 2.10, and now exclusively use Lustre 2.12 and Lustre 2.15 on the relics. We've seen some of the same symptoms, intermittently, which Gian is investigating. Given that the versions involved are different, and the underlying cause(s) may well be different, I'll close this ticket, and we'll open a new one when there are details.

            thanks

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            I have not yet done the experiment of removing related routes prior to upgrading the router and adding them back after the upgrade.  I'll try to do it soon.

            thanks


            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            Did you get the chance to experiment with removing related routes prior to upgrading the router and adding them back after the upgrade?
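            For concreteness, a minimal sketch of that experiment with lnetctl (the net and gateway NID below are placeholders, not your actual configuration):

            # on each node that routes through the router being upgraded:
            lnetctl route show                                        # note the current routes
            lnetctl route del --net o2ib600 --gateway 10.0.0.1@tcp    # drop the route before the upgrade
            # ... upgrade and reboot the router, wait for it to come back ...
            lnetctl route add --net o2ib600 --gateway 10.0.0.1@tcp    # restore the route afterwards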

            On the subject of setting o2iblnd parameters (peer_credits/peer_credits_hiw/concurrent_sends) per o2ib LNet network: indeed, currently it is not possible to do this via modparams or lnetctl. However, in theory it should be possible to rely on the per-connection credits negotiation process in order to achieve the desired effect.

            Consider the following topology:

            C1 <-o2ib0-> R <-o2ib1-> C2 

            For example, if initially (peer_credits/peer_credits_hiw/concurrent_sends) is (8/4/8) on all nodes, and you want to gradually upgrade to (32/16/64), you can do it in the following order:

            1. Change config on router R to use (32/16/64). Connections initiated by C1 and C2 will negotiate down to (8/7/8)
            2. Change config on node C1 to use (32/16/64). Connections between C1 and R will use (32/16/64). C2 to R will still be at (8/7/8)
            3. Change config on node C2 to use (32/16/64). All connections will use (32/16/64)
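            For reference, a sketch of how step 1 could be applied on router R via module options (illustrative values matching the example above; the new settings take effect when the ko2iblnd module is reloaded or the node is rebooted):

            # /etc/modprobe.d/ko2iblnd.conf on router R (illustrative)
            options ko2iblnd peer_credits=32 peer_credits_hiw=16 concurrent_sends=64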

            Thanks,

            Serguei


            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            Logs in attached file 2022-sep-07.lu14026.tgz

            I think that's everything you said you wanted, but if not let me know.

            thanks,

            Olaf

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            As I mentioned in LU-15234, 2.12.9 + change 48190 resolved the climbing peer reference counts, so we've added that patch to our current 2.12 branch, https://github.com/LLNL/lustre/commits/2.12.9-llnl. The tag is 2.12.9_3.llnl.

            All the orelic nodes have been running 2.12.9_3.llnl for 6 days now and the system is stable.  At this point our clusters are all running lustre-2.12.9_2.llnl, 2.12.9_3.llnl (orelic), or lustre-2.14.0_17.llnl, except for zrelic, which is running lustre-2.10.8_11.chaos.

            When I updated one zrelic node from 2.10 to 2.12.9_3.llnl (zrelic2), the peer refcounts did not rise with time, but I still saw the console log messages indicating Lustre timeouts and reconnects. I did not observe obvious indications of problems on zrelic2.  For the time being, I've reverted that so all the zrelic nodes are back to running 2.10.

            So it seems there is at least one other issue. 

            I'm still thinking about what to try or look for next.  Ideas would be welcome.

            thanks


            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 9

              Dates

                Created:
                Updated:
                Resolved: