Details
Type: Bug
Resolution: Cannot Reproduce
Priority: Critical
Description
Hello,
Over the past few days we have been suffering from a lot of LNET and LNET-routing related issues across our filesystems, and this is currently affecting large numbers of users, who are seeing frequent hangs and timeouts when attempting to use our storage across an LNET router.
The LNET environment is as follows:
Clients and Servers
===============
o2ib1 (10.47.0.0/16) : ~1000 Intel OPA clients
o2ib2 (10.44.0.0/16) : ~1200 HDR IB clients (all servers are now on HDR IB also)
o2ib0 (10.144.0.0/16) : ~10 FDR IB clients
tcp2 (10.43.0.0/16) : ~100 Ethernet clients
Routers
======
8x LNET routers, in 2 groups of 4:
lnet1...lnet4: Routes between o2ib1, o2ib2 and tcp2
lnet5...lnet8: Routes all LNET networks, o2ib0, o2ib1, o2ib2 and tcp2
There is only a single level of LNET routers and no multi-hop routing; this is achieved simply by having all of the necessary interfaces present on the routers.
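For context, the routing itself is declared in the straightforward way. Below is a minimal sketch of what a tcp2 client's configuration might look like using the lnetctl/DLC interface; the interface name and gateway NID are illustrative placeholders only, and the real per-node configuration is in the attachments:

# Minimal sketch for a tcp2 client; interface and gateway NIDs are illustrative
lnetctl lnet configure
lnetctl net add --net tcp2 --if eth0
# one route entry per remote network, repeated for each router serving that network
lnetctl route add --net o2ib1 --gateway 10.43.240.167@tcp2
lnetctl route add --net o2ib2 --gateway 10.43.240.167@tcp2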
Software Environment
=================
We recently finished updating our entire environment to Lustre 2.12.7, with a single additional patch on top from https://jira.whamcloud.com/browse/LU-13356, as discussed on a separate issue of ours: https://jira.whamcloud.com/browse/LU-14802
All o2ib2 nodes are using Mellanox MOFED 5.4-1.0.3.0, with a mixture of EDR and HDR IB cards. All storage is on HDR IB.
All o2ib1 nodes are using OPA version 10.11.0.1.
All o2ib0 nodes are using MOFED 4.9-2.2.4.0.
I will document all of the LNET tunables and module parameters used at the bottom of the description, along with system logs from the routers and the MDS servers in particular.
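For reference, the tunables were gathered with roughly the following commands (a sketch; the exact dumps are in the attached files):

# Dump LNET globals, per-NI tunables and the routing table (2.12 lnetctl interface)
lnetctl global show
lnetctl net show -v
lnetctl route show -v
# LND module parameters actually loaded on the node
grep -r . /sys/module/ko2iblnd/parameters /sys/module/ksocklnd/parameters 2>/dev/null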
What we observe during these periods is clients becoming unresponsive to simple operations on the filesystem: 'lfs df', 'ls', etc. will hang for many seconds or even minutes and then eventually complete.
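For anyone wanting to reproduce the measurement, something like the sketch below is enough to see the multi-second stalls from an affected client (the mount point here is a placeholder for one of our filesystems):

# Log how long a trivial 'lfs df' takes every 30 seconds on an affected client
while true; do
    echo -n "$(date -Is) "
    ( time -p lfs df /rds-d3 >/dev/null ) 2>&1 | awk '/^real/ {print $2 "s"}'
    sleep 30
done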
Looking around in Jira, I noticed a ticket from Stefan with broadly similar-sounding symptoms to what we see: https://jira.whamcloud.com/browse/LU-14652
In the system logs of servers and routers I see lots of 'Timed out RDMA' messages, particularly between the servers and routers (router IPs are all in the range: 10.44.240.16[1-8]@o2ib2):
Sep 03 15:38:39 csd3-mds1 kernel: LustreError: 3021:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff91725b6e6800
Sep 03 15:42:55 csd3-mds1 kernel: Lustre: rds-d3-MDT0000: haven't heard from client 2a643e36-e76e-f73b-a766-559148ad8494 (at 10.43.240.199@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9171c959c000, cur 1630680175 expire 163068002
Sep 03 15:44:49 csd3-mds1 kernel: LustreError: 9254:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 151s: evicting client at 10.43.240.198@tcp2 ns: mdt-rds-d3-MDT0000_UUID lock: ffff915555524d80/0xb4757f1f0b0182be lrc: 3/0,
Sep 03 15:44:49 csd3-mds1 kernel: LustreError: 6246:0:(client.c:1210:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff917249503a80 x1705934796555264/t0(0) o104->rds-d3-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff
Sep 03 15:44:49 csd3-mds1 kernel: LustreError: 6246:0:(client.c:1210:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
Sep 03 15:45:13 csd3-mds1 kernel: LustreError: 137-5: rds-d5-MDT0000_UUID: not available for connect from 10.43.240.198@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 03 15:45:13 csd3-mds1 kernel: LustreError: Skipped 32 previous similar messages
Sep 03 15:47:43 csd3-mds1 kernel: LNetError: 3021:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
Sep 03 15:47:43 csd3-mds1 kernel: LNetError: 3021:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 65 previous similar messages
Sep 03 15:47:43 csd3-mds1 kernel: LNetError: 3021:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.167@o2ib2 (0): c: 21, oc: 11, rc: 31
Sep 03 15:47:43 csd3-mds1 kernel: LNetError: 3021:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 65 previous similar messages
Sep 03 15:47:43 csd3-mds1 kernel: LustreError: 3021:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff916f74d60800
Sep 03 15:47:48 csd3-mds1 kernel: LustreError: 3021:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff917278f2a400
Sep 03 15:32:24 lnet7 kernel: LNetError: 13129:0:(lib-move.c:3903:lnet_parse_get()) 10.43.240.167@tcp2: Unable to send REPLY for GET from 12345-10.43.101.11@tcp2: -113
Sep 03 15:32:24 lnet7 kernel: LNetError: 13129:0:(lib-move.c:3903:lnet_parse_get()) Skipped 1 previous similar message
Sep 03 15:33:21 lnet7 kernel: LNetError: 13123:0:(socklnd.c:1665:ksocknal_destroy_conn()) Completing partial receive from 12345-10.43.240.198@tcp2[1], ip 10.43.240.198:1023, with error, wanted: 520, left: 520, last alive is 52 secs ago
Sep 03 15:33:21 lnet7 kernel: LNetError: 13123:0:(socklnd.c:1665:ksocknal_destroy_conn()) Skipped 80 previous similar messages
Sep 03 15:34:15 lnet7 kernel: LNet: There was an unexpected network error while writing to 10.43.101.11: -32.
Sep 03 15:35:05 lnet7 kernel: LNetError: 13129:0:(lib-move.c:3903:lnet_parse_get()) 10.43.240.167@tcp2: Unable to send REPLY for GET from 12345-10.43.102.226@tcp2: -113
Sep 03 15:35:05 lnet7 kernel: LNetError: 13129:0:(lib-move.c:3903:lnet_parse_get()) Skipped 1 previous similar message
Sep 03 15:41:15 lnet7 kernel: LNetError: 13126:0:(lib-move.c:3903:lnet_parse_get()) 10.43.240.167@tcp2: Unable to send REPLY for GET from 12345-10.43.102.60@tcp2: -113
Sep 03 15:41:15 lnet7 kernel: LNetError: 13126:0:(lib-move.c:3903:lnet_parse_get()) Skipped 1 previous similar message
Sep 03 15:45:11 lnet7 kernel: LNetError: 13123:0:(socklnd.c:1665:ksocknal_destroy_conn()) Completing partial receive from 12345-10.43.240.199@tcp2[1], ip 10.43.240.199:1023, with error, wanted: 520, left: 520, last alive is 339 secs ago
Sep 03 15:45:11 lnet7 kernel: LNetError: 13123:0:(socklnd.c:1665:ksocknal_destroy_conn()) Skipped 24 previous similar messages
Sep 03 15:45:56 lnet7 kernel: LNet: 13111:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.44.240.96@o2ib2
Sep 03 15:47:20 lnet7 kernel: LNet: There was an unexpected network error while writing to 10.43.102.60: -32.
Sep 03 15:49:29 lnet7 kernel: LNet: There was an unexpected network error while writing to 10.43.101.11: -32.
Sep 03 15:50:03 lnet7 kernel: LNet: There was an unexpected network error while writing to 10.43.102.226: -32.
Overall the network does not look very healthy, with many clients continually being timed out.
Another symptom I noticed, similar to what Stefan reported: looking at the 'peers' counters on an MDS node, many of the LNET routers show large values in the 'queue' column at times, and the 'min' value is usually very large and negative.
nid                   refs state  last   max   rtr   min    tx      min  queue
10.44.240.161@o2ib2      6 up       -1    32    32    32    32  -157885      0
10.44.240.166@o2ib2      7 up       -1    32    32    32    32   -11445      0
10.44.240.163@o2ib2      6 up       -1    32    32    32    32  -157911      0
10.44.240.168@o2ib2      7 up       -1    32    32    32    32   -11480      0
10.44.240.165@o2ib2     21 up       -1    32    32    32    18   -11478  20808
10.44.240.162@o2ib2      6 up       -1    32    32    32    32  -157755      0
10.44.240.167@o2ib2     21 up       -1    32    32    32    18   -11403  20576
10.44.240.164@o2ib2      6 up       -1    32    32    32    32  -157786      0
Generally I have been restarting any LNET router that develops persistently very large values in the 'queue' column on the servers that are not going down. This has so far brought the system back to a temporary period of stability, but it does not fix the underlying problem and the issues return after a while.
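At the moment spotting this condition is just a manual scrape of the peers table on the servers, roughly along these lines (a sketch; on our 2.12 nodes the file is under debugfs, older setups have it at /proc/sys/lnet/peers, and the threshold is arbitrary):

# Print peers whose 'queue' column (last field) is above a threshold
awk -v thr=10000 '$NF+0 > thr {print $1, "queue=" $NF, "min=" $(NF-1)}' /sys/kernel/debug/lnet/peers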
Another observation is watching 'routerstat' on the LNET routers and noticing very large numbers in the 'M' column for messages being processed. We have occasionally seen this spike up to extremely large values before the whole LNET router runs out of memory and panics, similar to what is described in this ticket: https://jira.whamcloud.com/browse/CAM-82
However, this is less frequent now that we have removed nearly all of our FDR servers/clients from the environment.
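A rough way to watch for this on a router is the sketch below (an assumption on my part that the stats file routerstat samples lives under debugfs on 2.12 kernels; older setups have /proc/sys/lnet/stats):

# On an LNET router: sample the LNET stats line and free memory once a minute
while true; do
    echo "$(date -Is) stats: $(cat /sys/kernel/debug/lnet/stats) MemFree: $(awk '/MemFree/ {print $2 " kB"}' /proc/meminfo)"
    sleep 60
done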
On the fabric side we don't see any obvious issues from the fabric manager: all nodes are reachable and respond to LNET ping, except during the periods when we experience the above symptoms.
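The LNET ping check referred to above is essentially a sweep like the following sketch, shown here for the router o2ib2 NIDs:

# Ping each router's o2ib2 NID with a 5 second timeout
for i in $(seq 161 168); do
    nid="10.44.240.$i@o2ib2"
    printf '%-22s ' "$nid"
    lctl ping "$nid" 5 >/dev/null 2>&1 && echo OK || echo FAIL
done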
I will attach all the LNET configurations, LND parameters, observed counters and log messages to the ticket.
I was hoping that an expert eye could do a sanity check of the configuration listed and offer some insight into why we might suddenly be experiencing these timeouts, and whether there are any configuration changes we could make to eliminate the problem.
Kind regards,
Matt