LU-14026

symptoms of message loss or corruption after upgrading routers to lustre 2.12.5

Details

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: None
    • Environment: Lustre 2.10.8 and 2.12.5;
      mixed OFED, MOFED, Omni-Path, and TCP
    • Severity: 3

    Description

      We have two router clusters, which we call RELICs; they connect the InfiniBand SAN in one building to the InfiniBand SAN in another building, with Ethernet between the routers.  All the servers and clients in both buildings, and the router nodes within the clusters which connect to the SAN, are already at Lustre 2.12.5.  The routers in the RELIC clusters are at Lustre 2.10.8.  In this configuration, the system is stable.
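      For context, a topology like this is normally described to LNet through routing configuration on the routers and on the end nodes. The sketch below is only illustrative: the interface names, the gateway NID, and the assignment of the o2ib600/o2ib36/tcp network names (taken from the log excerpts further down) to particular fabrics are assumptions, not the actual RELIC configuration.

      {noformat}
      # Illustrative sketch only -- hypothetical interfaces and NIDs, not the site's real config.
      # On a RELIC router node: attach LNet to both fabrics and enable forwarding.
      lnetctl lnet configure
      lnetctl net add --net o2ib600 --if ib0   # SAN-facing InfiniBand interface (assumed name)
      lnetctl net add --net tcp --if eth0      # Ethernet link toward the other RELIC cluster (assumed name)
      lnetctl set routing 1                    # make this node an LNet router

      # On a server/client in the other building (assumed to be on o2ib36):
      # reach o2ib600 via the NID of its local RELIC gateway (hypothetical address).
      lnetctl route add --net o2ib600 --gateway 172.16.0.1@o2ib36
      lnetctl route show                       # confirm the route is listed with state "up"
      {noformat}

      The remote RELIC cluster would mirror this with its own o2ib and tcp interfaces; the real site configuration is not included in this ticket.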

      When we power-cycle the RELIC routers and boot them from an image with Lustre 2.12.5, the router nodes themselves think everything is fine: there are no errors or warnings on the console, and no indications of failure in the debug log with +net enabled.  However, we begin to see symptoms on the server nodes which seem to indicate corrupt, dropped, or delayed messages:

      LNetError: PPPP:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, X seconds
      LNetError: PPPP:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, X seconds
      LNetError: PPPP:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with ZZZ@o2ib600 (0): c: X, oc: Y, rc: Z
      LustreError: PPPP:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffffXXXX
      LustreError: PPPP:0:(ldlm_lib.c:3279:target_bulk_io()) @@@ Reconnect on bulk READ...
      LustreError: PPPP:0:(ldlm_lib.c:3285:target_bulk_io()) @@@ network error on bulk READ
      LustreError: PPPP:0:(ldlm_lib.c:3294:target_bulk_io()) @@@ truncated bulk READ 0(1048576) XXX
      Lustre: PPPP:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: ...
      Lustre: ls1-OST000e: Bulk IO read error with XXX (at ZZZ@o2ib36), client will retry: rc -110
      

      along with side effects such as reconnect attempts.
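      For reference, the +net debug data mentioned above is normally gathered with the standard Lustre debug controls; the commands below are a generic example of that procedure with an arbitrary output path, not a record of what was actually run on these nodes.

      {noformat}
      # Generic example of capturing LNet debug on a router node (arbitrary file name).
      lctl set_param debug=+net      # add net tracing to the kernel debug mask
      lctl clear                     # empty the debug buffer before reproducing
      # ... reproduce traffic through the routers / wait for the symptom window ...
      lctl dk /tmp/lnet-debug.log    # dump the kernel debug buffer to a file
      {noformat}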

      Attachments

        1. 2020-oct-14-copper1.tgz
          12.62 MB
          Olaf Faaland
        2. 2020-oct-14-orelic.tgz
          36.73 MB
          Olaf Faaland
        3. 2020-oct-14-zrelic.tgz
          22.32 MB
          Olaf Faaland
        4. 2022-sep-07.lu14026.tgz
          287 kB
          Olaf Faaland
        5. lu-14026.2021-04-28.tgz
          3.80 MB
          Olaf Faaland

        Issue Links

          This issue is related to JFC-29
          This issue is related to LU-15453

          Activity

            [LU-14026] symptoms of message loss or corruption after upgrading routers to lustre 2.12.5
            ofaaland Olaf Faaland made changes -
            Resolution New: Incomplete [ 4 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            ofaaland Olaf Faaland made changes -
            Labels Original: llnl topllnl New: llnl
            ofaaland Olaf Faaland made changes -
            Attachment New: 2022-sep-07.lu14026.tgz [ 45580 ]
            ofaaland Olaf Faaland made changes -
            Summary Original: symptoms of message corruption after upgrading routers to lustre 2.12.5 New: symptoms of message loss or corruption after upgrading routers to lustre 2.12.5
            pjones Peter Jones made changes -
            Link New: This issue is related to JFC-29 [ JFC-29 ]
            pjones Peter Jones made changes -
            Link Original: This issue is related to JFC-21 [ JFC-21 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-15453 [ LU-15453 ]
            ofaaland Olaf Faaland made changes -
            Attachment New: lu-14026.2021-04-28.tgz [ 38452 ]
            ofaaland Olaf Faaland made changes -
            Description updated: added the note about ethernet between the routers to the first sentence
            ofaaland Olaf Faaland made changes -
            Attachment New: 2020-oct-14-copper1.tgz [ 36338 ]
            Attachment New: 2020-oct-14-zrelic.tgz [ 36339 ]
            Attachment New: 2020-oct-14-orelic.tgz [ 36340 ]

            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 9

              Dates

                Created:
                Updated:
                Resolved: