Details
- Type: Bug
- Resolution: Incomplete
- Priority: Critical
- Affects Version/s: None
- Fix Version/s: None
- Environment: lustre 2.10.8 and 2.12.5; mixed OFED, MOFED, Omnipath, and tcp
- Severity: 3
- Rank: 9223372036854775807
Description
We have two router clusters, which we call RELICs, that connect the infiniband SAN in one building to the infiniband SAN in another building, with ethernet between the routers. All the servers and clients in both buildings, and the router nodes within the clusters which connect to the SAN, are already at lustre 2.12.5. The routers in the RELIC clusters are at lustre 2.10.8. In this configuration, the system is stable.
When we power cycle the RELIC routers and boot them from an image with lustre 2.12.5, the router nodes themselves think everything is fine. There are no errors or warnings on the console, nor indications of failure in the debug log with +net. However, we begin to see symptoms on server nodes which seem to indicate corrupt, dropped, or delayed messages:
{noformat}
LNetError: PPPP:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, X seconds
LNetError: PPPP:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, X seconds
LNetError: PPPP:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with ZZZ@o2ib600 (0): c: X, oc: Y, rc: Z
LustreError: PPPP:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffffXXXX
LustreError: PPPP:0:(ldlm_lib.c:3279:target_bulk_io()) @@@ Reconnect on bulk READ...
LustreError: PPPP:0:(ldlm_lib.c:3285:target_bulk_io()) @@@ network error on bulk READ
LustreError: PPPP:0:(ldlm_lib.c:3294:target_bulk_io()) @@@ truncated bulk READ 0(1048576) XXX
Lustre: PPPP:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: ...
Lustre: ls1-OST000e: Bulk IO read error with XXX (at ZZZ@o2ib36), client will retry: rc -110
{noformat}
along with side effects such as reconnect attempts.
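To get a feel for how widespread these symptoms are, it can help to tally how often each of the signatures above appears on the servers, and against which peer NIDs, to see whether the problems cluster around particular routers. The sketch below is purely illustrative and not something from this ticket: the default log path, the regular expressions, and the per-NID grouping are all assumptions derived from the messages quoted above.
{noformat}
#!/usr/bin/env python3
"""Tally the Lustre/LNet error signatures quoted in this ticket.

Illustrative sketch only: the log path and the regular expressions are
assumptions based on the console messages quoted in the description.
"""
import re
import sys
from collections import Counter

# One pattern per error signature from the description; (?P<nid>...) captures
# the peer NID where the message includes one.
PATTERNS = [
    re.compile(r"kiblnd_check_txs_locked\(\) Timed out tx"),
    re.compile(r"kiblnd_check_conns\(\) Timed out RDMA with (?P<nid>\S+@o2ib\d+)"),
    re.compile(r"server_bulk_callback\(\) event type \d+, status -\d+"),
    re.compile(r"target_bulk_io\(\) @@@"),
    re.compile(r"ptlrpc_expire_one_request\(\) @@@ Request sent has failed"),
    re.compile(r"Bulk IO read error with \S+ \(at (?P<nid>\S+@o2ib\d+)\)"),
]

def tally(path):
    """Count matches per signature and per peer NID in one log file."""
    by_pattern = Counter()
    by_nid = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            for pat in PATTERNS:
                match = pat.search(line)
                if match:
                    by_pattern[pat.pattern] += 1
                    nid = match.groupdict().get("nid")
                    if nid:
                        by_nid[nid] += 1
                    break
    return by_pattern, by_nid

if __name__ == "__main__":
    # Assumed default path; pass the console/syslog file to scan as argv[1].
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
    by_pattern, by_nid = tally(path)
    for pattern, count in by_pattern.most_common():
        print(f"{count:6d}  {pattern}")
    print("---- per peer NID ----")
    for nid, count in by_nid.most_common():
        print(f"{count:6d}  {nid}")
{noformat}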
Attachments
Issue Links
- is related to: LU-15453 MDT shutdown hangs on mutex_lock, possibly cld_lock (Open)
Activity
Resolution | New: Incomplete [ 4 ] |
Status | Original: Open [ 1 ] | New: Resolved [ 5 ] |
Labels | Original: llnl topllnl | New: llnl |
Attachment | New: 2022-sep-07.lu14026.tgz [ 45580 ] |
Summary | Original: symptoms of message corruption after upgrading routers to lustre 2.12.5 | New: symptoms of message loss or corruption after upgrading routers to lustre 2.12.5 |
Link | New: This issue is related to JFC-29 [ JFC-29 ] |
Link | Original: This issue is related to JFC-21 [ JFC-21 ] |
Attachment | New: lu-14026.2021-04-28.tgz [ 38452 ] |
Description | Original: the Description above, minus the phrase "with ethernet between the routers" in the first sentence | New: the current Description shown above; only that phrase was added |
Attachment | New: 2020-oct-14-copper1.tgz [ 36338 ] |
Attachment | New: 2020-oct-14-zrelic.tgz [ 36339 ] |
Attachment | New: 2020-oct-14-orelic.tgz [ 36340 ] |