[LU-14652] LNet router stuck generating RDMA tx timeout Created: 29/Apr/21 Updated: 01/Feb/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Stephane Thiell | Assignee: | Serguei Smirnov |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.9 (3.10.0-1160.24.1.el7.x86_64) on routers, Lustre 2.12.6 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hello, we are still seeing weird behavior with LNet over IB with 2.12+. We have tried to upgrade clients and routers to 2.13 and then 2.14 without success. We went back to 2.12.6 LTS, but we are still seeing occasional kiblnd errors and timeouts. The IB fabrics are healthy, with occasional light congestion but no discards. I'm starting to suspect a deeper problem with LNet/ko2iblnd, where credits are sometimes exhausted? We didn't have this problem with 2.10. To me, the problem seems similar to the one reported in LU-14026 by LLNL.

We have the following setup: Fir (serves /scratch) o2ib7 — 8 x LNet routers (IB) — Sherlock v3 (o2ib3)

Last night, one of the 8 routers (sh03-fir06) started to have problems. I've taken traces so that we can investigate.

Router NIDs:

[root@sh03-fir06 ~]# lctl list_nids
10.51.0.116@o2ib3
10.0.10.237@o2ib7

LNet config on the router (we have discovery enabled and are using a few Multi-Rail nodes on o2ib3):

[root@sh03-fir06 ~]# lnetctl global show
global:
numa_range: 0
max_intf: 200
discovery: 1
drop_asym_route: 0
retry_count: 0
transaction_timeout: 50
health_sensitivity: 0
recovery_interval: 1
Fir Lustre servers on o2ib7 started to exhibit the following errors:

fir-io7-s2: Apr 28 23:25:40 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9488213ef800
fir-io7-s2: Apr 28 23:25:40 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9488213ef800
fir-io7-s2: Apr 28 23:25:40 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9480cfc4c800
fir-io7-s2: Apr 28 23:25:59 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff94643457fc00
fir-io7-s2: Apr 28 23:25:59 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff94643457fc00
fir-io7-s2: Apr 28 23:26:05 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff949e65fc2400
fir-io7-s2: Apr 28 23:26:05 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff949e65fc2400
fir-io7-s2: Apr 28 23:26:11 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff945be5538400
fir-io7-s2: Apr 28 23:26:11 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff94886b2a9000
fir-io7-s2: Apr 28 23:26:17 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff94428fd56000
fir-io7-s2: Apr 28 23:26:24 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff947e46cbdc00
fir-io7-s2: Apr 28 23:26:24 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff947e46cbdc00
fir-io7-s2: Apr 28 23:26:43 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9462486d2400
fir-io7-s2: Apr 28 23:26:51 fir-io7-s2 kernel: LustreError: 68967:0:(ldlm_lib.c:3279:target_bulk_io()) @@@ Reconnect on bulk WRITE req@ffff949f1a1a3850 x1696974434037056/t0(0) o4->12f8a639-7e97-4157-8d89-6e1e00a728eb@10.51.13.20@o2ib3:363/0 lens 488/448 e 0 to 0 dl 1619677703 ref 1 fl Interpret:/0/0 rc 0/0
fir-io6-s1: Apr 28 23:24:11 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff918fc3b44400
fir-io6-s1: Apr 28 23:24:11 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff918fc3b44400
fir-io6-s1: Apr 28 23:24:18 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff91913b76a800
fir-io6-s1: Apr 28 23:24:30 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff912be4f95000
fir-io6-s1: Apr 28 23:24:30 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff912be4f95000
fir-io6-s1: Apr 28 23:24:36 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff911f0108a400
fir-io6-s1: Apr 28 23:24:55 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff914885d3bc00
fir-io6-s1: Apr 28 23:24:55 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff914885d3bc00
fir-io6-s1: Apr 28 23:25:08 fir-io6-s1 kernel: LNet: 24015:0:(o2iblnd_cb.c:2081:kiblnd_close_conn_locked()) Closing conn to 10.0.10.237@o2ib7: error -110(waiting)

One interesting thing about that router, vs. the 7 others, is that it had a lot of refs (> 3000) in /sys/kernel/debug/lnet/nis and tx stuck at -367. The high refs count is similar to an issue we noticed with 2.14 routers reported in LU-14584, and we thought that maybe this was a CPT issue. Here, it happened with Lustre 2.12.6 routers.

nid status alive refs peer rtr max tx min
0@lo down 0 2 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
10.51.0.116@o2ib3 up 0 1 8 0 64 64 47
10.51.0.116@o2ib3 up 0 0 8 0 64 64 47
10.51.0.116@o2ib3 up 0 0 8 0 64 64 46
10.51.0.116@o2ib3 up 0 0 8 0 64 64 46
10.51.0.116@o2ib3 up 0 0 8 0 64 64 41
10.51.0.116@o2ib3 up 0 0 8 0 64 64 48
10.51.0.116@o2ib3 up 0 3493 8 0 64 -367 -367
10.51.0.116@o2ib3 up 0 0 8 0 64 64 46
10.0.10.237@o2ib7 up 0 3318 8 0 64 64 48
10.0.10.237@o2ib7 up 0 3062 8 0 64 64 50
10.0.10.237@o2ib7 up 0 6202 8 0 64 64 47
10.0.10.237@o2ib7 up 0 3032 8 0 64 64 49
10.0.10.237@o2ib7 up 0 6082 8 0 64 64 48
10.0.10.237@o2ib7 up 0 5467 8 0 64 64 48
10.0.10.237@o2ib7 up 0 3115 8 0 64 64 50
10.0.10.237@o2ib7 up 0 3206 8 0 64 64 48

I took traces on this router at the time of the problem. I'm attaching a zip file: sh03-fir06-20210428.zip
Rebooting the router fixed the problem. |
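For reference, a quick way to flag stuck NIs in /sys/kernel/debug/lnet/nis like the one above is to filter on the refs and min-tx columns; a minimal sketch (the threshold is arbitrary and the column positions assume the layout shown above):

# print NI rows whose refs count is unusually high or whose minimum tx credits went negative
awk 'NR > 1 && ($4 > 1000 || $NF < 0)' /sys/kernel/debug/lnet/nis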
| Comments |
| Comment by Peter Jones [ 30/Apr/21 ] |
|
Serguei, could you please advise? Thanks, Peter |
| Comment by Serguei Smirnov [ 30/Apr/21 ] |
|
Hi Stephane, Could you please provide the output of lnetctl net show -v 4 from the router, one of the servers and one of the clients? It looks like the number of credits configured for the router may be low. Also, what is the output of lnetctl global show on clients and servers? In the 20 sec captured in the logs there are a few transactions expiring. Perhaps transaction_timeout can be increased, but first let's check the situation with the credits. Thanks, Serguei. |
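One way to collect both outputs in a single pass is with clush, as used elsewhere in this ticket (the node names below are placeholders):

# gather LNet configuration from a router, a server and a client in one shot
clush -w sh03-fir06,fir-io7-s2,sh03-client01 -b 'lnetctl net show -v 4; lnetctl global show'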
| Comment by Stephane Thiell [ 30/Apr/21 ] |
|
Hi Serguei, I'm attaching the output of lnetctl net show -v 4:
Output of lnetctl global show:
As you can see, we are already making sure that transaction_timeout is 50 on all clients and servers, even with 2.13 clients where it was 10 by default. But maybe you will see some other problems here? Let me know if you need more data from us. Thanks! |
| Comment by Serguei Smirnov [ 01/May/21 ] |
|
Stephane, On the servers:

retry_count: 2
transaction_timeout: 50

This means that the resulting lnd_timeout is 50/3 seconds, while on the other nodes it is 50. If you want to keep retry_count at 2, please increase transaction_timeout to 150 on the servers; that will make sure every node is using the same lnd_timeout. Thanks, Serguei. |
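In other words, the LND timeout works out to transaction_timeout / (retry_count + 1), so the servers can be brought in line with a couple of lnetctl settings; a sketch (runtime change only, not persistent across reboots unless also added to the LNet configuration file):

# servers: keep retry_count at 2 but make lnd_timeout = 150 / (2 + 1) = 50, matching the other nodes
lnetctl set transaction_timeout 150
lnetctl set retry_count 2
lnetctl global show    # verify the new values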
| Comment by Stephane Thiell [ 01/May/21 ] |
|
Thanks for catching that, Serguei! Ok, I'll try either that or disable lnet health and set retry_count to 0 like for the other nodes. |
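For reference, the alternative mentioned above (matching the servers to the other nodes by disabling LNet health and retries) would look roughly like this:

# servers: no retries, health disabled, same 50s transaction timeout as clients/routers
lnetctl set retry_count 0
lnetctl set health_sensitivity 0
lnetctl set transaction_timeout 50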
| Comment by Stephane Thiell [ 01/May/21 ] |
|
I fixed that issue:
---------------
fir-io[1-8]-s[1-2],fir-md1-s[1-4] (20)
---------------
global:
numa_range: 0
max_intf: 200
discovery: 1
drop_asym_route: 0
retry_count: 0
transaction_timeout: 50
health_sensitivity: 0
recovery_interval: 1
But shortly after that change, another router started to exhibit a similar problem (sh02-fir04). This time it's a router between the same servers (o2ib7, HDR) and o2ib2 (Sherlock v2, another fabric, EDR-based), but it's the same idea.

[root@sh02-fir04 ~]# lctl list_nids
10.50.0.114@o2ib2
10.0.10.227@o2ib7
[root@sh02-fir04 ~]# cat /sys/kernel/debug/lnet/nis
nid status alive refs peer rtr max tx min
0@lo down 0 2 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
10.50.0.114@o2ib2 up 0 2 8 0 128 127 64
10.50.0.114@o2ib2 up 0 1 8 0 128 127 57
10.0.10.227@o2ib7 up 0 29 8 0 128 127 64
10.0.10.227@o2ib7 up 0 379 8 0 128 128 56

and the refs count is growing. On a server:

Apr 30 23:02:34 fir-io8-s2 kernel: LustreError: 76363:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9f68619b2c00
Apr 30 23:02:34 fir-io8-s2 kernel: LNet: 76363:0:(lib-move.c:976:lnet_post_send_locked()) Aborting message for 12345-10.0.10.227@o2ib7: LNetM[DE]Unlink() already called on the MD/ME.
Apr 30 23:02:34 fir-io8-s2 kernel: LNet: 76363:0:(lib-move.c:976:lnet_post_send_locked()) Skipped 2 previous similar messages
Apr 30 23:02:34 fir-io8-s2 kernel: LustreError: 76363:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9f8d975ef400
Apr 30 23:02:34 fir-io8-s2 kernel: LustreError: 76363:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9f8d975ef400

From a server (10.0.10.115@o2ib7):

[root@fir-io8-s1 ~]# lnetctl peer show --nid 10.0.10.227@o2ib7 -v 4
peer:
- primary nid: 10.50.0.114@o2ib2
Multi-Rail: True
peer state: 137
peer ni:
- nid: 10.0.10.227@o2ib7
state: up
max_ni_tx_credits: 8
available_tx_credits: 7
min_tx_credits: -64957
tx_q_num_of_buf: 480
available_rtr_credits: 8
min_rtr_credits: 8
refcount: 5
statistics:
send_count: 4282001375
recv_count: 119463736
drop_count: 0
sent_stats:
put: 3953956139
get: 328045236
reply: 0
ack: 0
hello: 0
received_stats:
put: 3102103707
get: 493
reply: 336149354
ack: 976177478
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
dropped: 6
timeout: 250
error: 0
network timeout: 0
- nid: 10.50.0.114@o2ib2
state: NA
max_ni_tx_credits: 0
available_tx_credits: 0
min_tx_credits: 0
tx_q_num_of_buf: 0
available_rtr_credits: 0
min_rtr_credits: 0
refcount: 2
statistics:
send_count: 0
recv_count: 0
drop_count: 0
sent_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
received_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
dropped: 0
timeout: 0
error: 0
network timeout: 0
Peer show of the server from the router (10.0.10.227@o2ib7/10.50.0.114@o2ib2):

[root@sh02-fir04 ~]# lnetctl peer show --nid 10.0.10.115@o2ib7 -v 4
peer:
- primary nid: 10.0.10.115@o2ib7
Multi-Rail: True
peer state: 137
peer ni:
- nid: 10.0.10.115@o2ib7
state: up
max_ni_tx_credits: 8
available_tx_credits: 8
min_tx_credits: -89
tx_q_num_of_buf: 0
available_rtr_credits: 6
min_rtr_credits: -8
refcount: 3
statistics:
send_count: 291018436
recv_count: 286578888
drop_count: 607
sent_stats:
put: 198315455
get: 14434
reply: 21538685
ack: 71149862
hello: 0
received_stats:
put: 265177056
get: 21401720
reply: 55
ack: 57
hello: 0
dropped_stats:
put: 0
get: 577
reply: 26
ack: 4
hello: 0
health stats:
health value: 1000
dropped: 577
timeout: 0
error: 0
network timeout: 0
I wonder if there is a way to see what is causing the router to be stuck and accumulating refs. I took another trace with +net on this router while it was stuck. Attaching as sh02-fir04-dknet.log.gz |
| Comment by Serguei Smirnov [ 03/May/21 ] |
|
Stephane, From the router net log you provided it looks like the router is not reporting any errors. Was the issue still happening when you were capturing it? I wonder how persistent this issue is, or maybe it just occurs in short bursts. Would it be possible to get net logs from the client and the server, too? That would give some insight into how they qualify failed connections and whether failed connections involve the same router. The "lnetctl net show" output provided for clients and servers indicates that a single IB interface is configured for LNet; are there other IB interfaces on these machines that are not configured for LNet?
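For the net logs, the capture pattern used later in this ticket should work on both the client and the server, roughly:

# enable LNet network debugging, clear the buffer, reproduce the problem, then dump the log
lctl set_param debug=+net
lctl clear
# ... reproduce / wait for the bulk callback errors ...
lctl dk /tmp/$(hostname)-dknet.log
lctl set_param debug=-net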
|
| Comment by Stephane Thiell [ 10/May/21 ] |
|
Hi Serguei,

> "lnetctl net show" output provided for clients and servers indicates that a single ib interface is configured for LNet, are there other ib interfaces on these machines that are not configured for LNet?

Not that we know of. Some clients have dual-port cards but only one port is up. We do have a few MR clients on o2ib3 though. It is not easy to gather logs on this system (Fir) anymore, as the issue didn't reappear. But this morning, we had a similar issue between another storage system, Oak (2.12.6, o2ib5), and Sherlock (o2ib[1-3]). Timeouts and network errors:

May 10 00:07:48 oak-io2-s1 kernel: LustreError: 54968:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff8be0d58dcc00
May 10 00:07:48 oak-io2-s1 kernel: LustreError: 54967:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff8bd84d303c00
May 10 00:07:48 oak-io2-s1 kernel: LustreError: 54966:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff8bd84d303c00
May 10 00:07:48 oak-io2-s1 kernel: LustreError: 54969:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff8be0d58dcc00
May 10 00:07:48 oak-io2-s1 kernel: LustreError: 54968:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff8bf5a32dec00
May 10 00:08:53 oak-io2-s1 kernel: LustreError: 218398:0:(ldlm_lib.c:3344:target_bulk_io()) @@@ network error on bulk READ req@ffff8bdff554c850 x1698172291403584/t0(0) o3->a703c65e-c48f-97d7-efaf-c377b3ded349@10.51.15.3@o2ib3:410/0 lens 488/440 e 0 to 0 dl 1620630560 ref 1 fl Interpret:/0/0 rc 0/0
May 10 00:08:53 oak-io2-s1 kernel: Lustre: oak-OST0052: Bulk IO read error with a703c65e-c48f-97d7-efaf-c377b3ded349 (at 10.51.15.3@o2ib3), client will retry: rc -110

I'm attaching this OSS's (oak-io2-s1, 10.0.2.105@o2ib5) kernel logs as oak-io2-s1.kern.log

Very high refs counts on the o2ib5 <> o2ib[1-3] routers:

[root@sh02-hn01 sthiell.root]# clush -w@rtr_oak -b "cat /sys/kernel/debug/lnet/nis"
---------------
sh01-oak01
---------------
nid status alive refs peer rtr max tx min
0@lo down 0 2 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
10.49.0.131@o2ib1 up 0 1 8 0 128 128 97
10.49.0.131@o2ib1 up 0 0 8 0 128 128 91
10.0.2.212@o2ib5 up 0 1 8 0 128 128 92
10.0.2.212@o2ib5 up 0 1 8 0 128 128 94
---------------
sh01-oak02
---------------
nid status alive refs peer rtr max tx min
0@lo down 0 2 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
10.49.0.132@o2ib1 up 0 1 8 0 128 128 96
10.49.0.132@o2ib1 up 0 0 8 0 128 128 94
10.0.2.213@o2ib5 up 0 4 8 0 128 128 92
10.0.2.213@o2ib5 up 0 1 8 0 128 128 95
---------------
sh02-oak01
---------------
nid status alive refs peer rtr max tx min
0@lo down 0 2 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
10.50.0.131@o2ib2 up 0 1032 8 0 128 125 62
10.50.0.131@o2ib2 up 0 959 8 0 128 126 54
10.0.2.214@o2ib5 up 0 6 8 0 128 127 74
10.0.2.214@o2ib5 up 0 1143 8 0 128 120 61
---------------
sh02-oak02
---------------
nid status alive refs peer rtr max tx min
0@lo down 0 2 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
10.50.0.132@o2ib2 up 0 927 8 0 128 128 59
10.50.0.132@o2ib2 up 0 838 8 0 128 128 39
10.0.2.215@o2ib5 up 0 296 8 0 128 120 75
10.0.2.215@o2ib5 up 0 687 8 0 128 120 57
---------------
sh03-oak01
---------------
nid status alive refs peer rtr max tx min
0@lo down 0 2 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
10.51.0.131@o2ib3 up 0 60 8 0 64 64 26
10.51.0.131@o2ib3 up 0 61 8 0 64 64 28
10.51.0.131@o2ib3 up 0 87 8 0 64 64 9
10.51.0.131@o2ib3 up 0 180 8 0 64 64 27
10.51.0.131@o2ib3 up 0 126 8 0 64 64 26
10.51.0.131@o2ib3 up 0 150 8 0 64 64 12
10.51.0.131@o2ib3 up 0 63 8 0 64 64 32
10.51.0.131@o2ib3 up 0 93 8 0 64 64 19
10.0.2.216@o2ib5 up 0 1 8 0 64 64 56
10.0.2.216@o2ib5 up 0 0 8 0 64 64 58
10.0.2.216@o2ib5 up 0 0 8 0 64 64 40
10.0.2.216@o2ib5 up 0 160 8 0 64 56 48
10.0.2.216@o2ib5 up 0 0 8 0 64 64 56
10.0.2.216@o2ib5 up 0 0 8 0 64 64 48
10.0.2.216@o2ib5 up 0 0 8 0 64 64 45
10.0.2.216@o2ib5 up 0 290 8 0 64 56 40
---------------
sh03-oak02
---------------
nid status alive refs peer rtr max tx min
0@lo down 0 2 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
0@lo down 0 0 0 0 0 0 0
10.51.0.132@o2ib3 up 0 89 8 0 64 64 26
10.51.0.132@o2ib3 up 0 76 8 0 64 64 28
10.51.0.132@o2ib3 up 0 76 8 0 64 64 22
10.51.0.132@o2ib3 up 0 143 8 0 64 64 21
10.51.0.132@o2ib3 up 0 113 8 0 64 64 29
10.51.0.132@o2ib3 up 0 98 8 0 64 64 14
10.51.0.132@o2ib3 up 0 89 8 0 64 64 22
10.51.0.132@o2ib3 up 0 78 8 0 64 64 25
10.0.2.217@o2ib5 up 0 1 8 0 64 64 56
10.0.2.217@o2ib5 up 0 0 8 0 64 64 56
10.0.2.217@o2ib5 up 0 294 8 0 64 56 40
10.0.2.217@o2ib5 up 0 1 8 0 64 64 48
10.0.2.217@o2ib5 up 0 0 8 0 64 64 56
10.0.2.217@o2ib5 up 0 180 8 0 64 56 48
10.0.2.217@o2ib5 up 0 1 8 0 64 64 40
10.0.2.217@o2ib5 up 0 0 8 0 64 64 40

Peers with queuing from the routers:

[root@sh02-hn01 sthiell.root]# clush -w@rtr_oak -b "cat /sys/kernel/debug/lnet/peers | awk '/^nid|o2ib/ && \$NF!=0'"
---------------
sh01-oak[01-02],sh03-oak01 (3)
---------------
nid refs state last max rtr min tx min queue
---------------
sh02-oak01
---------------
nid refs state last max rtr min tx min queue
10.0.2.106@o2ib5 316 up 176 8 7 -8 -306 -2561 2551117
---------------
sh02-oak02
---------------
nid refs state last max rtr min tx min queue
10.0.2.110@o2ib5 206 up 94 8 1 -8 -190 -3121 1520563
10.0.2.101@o2ib5 242 up 84 8 7 -8 -232 -2145 70352
---------------
sh03-oak02
---------------
nid refs state last max rtr min tx min queue
10.0.2.105@o2ib5 1163 up 66 8 8 -16 -1154 -1987 7682915

I gathered some short +net debug logs from Oak's OSS oak-io2-s1 (10.0.2.105@o2ib5), where we can see a drop:

00000400:00000100:37.0:1620667146.491444:0:54966:0:(lib-move.c:3930:lnet_parse_reply()) 10.0.2.105@o2ib5: Dropping REPLY from 12345-10.51.1.14@o2ib3 for invalid MD 0x1678074439ff2d4c.0x476212d0d

Attaching as oak-io2-s1.dknet9.gz. Also attaching +net debug logs from a client on Sherlock (10.50.14.15@o2ib2, on o2ib2) for which I saw some queueing from the routers; other than that, I picked it randomly and I'm not sure it will be relevant: sh02-14n15.dknet1.gz |
| Comment by Stephane Thiell [ 10/May/21 ] |
|
Also attaching +net debug logs from client sh02-14n15 10.51.1.14@o2ib3, the one shown above on the OSS with Dropping REPLY. But I think it might have been too late when I gathered the logs. sh03-01n14.dknet.gz |
| Comment by Serguei Smirnov [ 13/May/21 ] |
|
Stephane, I still can't see what's going wrong exactly, but for 2.12.6 I'd consider disabling discovery on the routers to avoid extra complexity. For example, it would make sure that the messages sent by a node get to the specific nid of the remote node as decided by the sender, as opposed to the router making this decision when it has discovery enabled. This is more of a general recommendation that may be unrelated to the problem you're seeing, though. If you choose to disable discovery on the routers, you can do so by going one at a time:
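A rough sketch of what that looks like on each router (runtime change only; persisting it would also require updating the router's LNet configuration file):

# on one router at a time: disable LNet discovery and confirm the setting
lnetctl set discovery 0
lnetctl global show | grep discovery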
|
| Comment by Stephane Thiell [ 21/Oct/21 ] |
|
Hi Serguei, We still have this problem with 2.12.7 routers and clients. The servers (Fir) are still running 2.12.5 though. Some routers started to get a high refs count on their NIs and on specific peers, which generates random RDMA timeouts. Rebooting the routers usually doesn't change anything; they become unusable. In some very rare cases (like after requeuing a lot of jobs), we're able to put a router back into production (we were able to do that today), but otherwise the refs count starts to increase immediately.
Clients are on o2ib2, servers on o2ib7. We're fairly confident the IB fabrics are OK. The problem occurs between routers and servers (o2ib7). Example of a router (at 10.0.10.226@o2ib7) <-> server (OSS, at 10.0.10.101@o2ib7) pair that is stuck:

[root@sh02-fir03 ~]# lnetctl peer show --nid 10.0.10.101@o2ib7 -v4
peer:
- primary nid: 10.0.10.101@o2ib7
Multi-Rail: False
peer ni:
- nid: 10.0.10.101@o2ib7
state: up
max_ni_tx_credits: 8
available_tx_credits: 8
min_tx_credits: -3
tx_q_num_of_buf: 0
available_rtr_credits: -44 <<<
min_rtr_credits: -44
refcount: 53
statistics:
send_count: 43532
recv_count: 2843
drop_count: 226
[root@sh02-fir03 ~]# lctl ping 10.0.10.101@o2ib7
failed to ping 10.0.10.101@o2ib7: Input/output error
[root@sh02-fir03 ~]# lctl ping 10.0.10.101@o2ib7
failed to ping 10.0.10.101@o2ib7: Input/output error
[root@sh02-fir03 ~]# lctl ping 10.0.10.101@o2ib7
failed to ping 10.0.10.101@o2ib7: Input/output error
[root@sh02-fir03 ~]# lctl ping 10.0.10.101@o2ib7
failed to ping 10.0.10.101@o2ib7: Input/output error
[root@sh02-fir03 ~]# lctl set_param debug=+ALL; lctl clear; lctl ping 10.0.10.101@o2ib7; lctl dk /tmp/debug.log; lctl set_param debug=-ALL
debug=+ALL
failed to ping 10.0.10.101@o2ib7: Input/output error
Debug log: 193037 lines, 193037 kept, 0 dropped, 0 bad.
debug=-ALL
Debug logs from the router (sh02-fir03 at 10.0.10.226@o2ib7) are attached as sh02-fir03-debug-lctl_ping_sync.log.gz. So the router sh02-fir03 (10.0.10.226@o2ib7) is running out of rtr credits, and from the OSS point of view (fir-io1-s1 at 10.0.10.101@o2ib7), it's out of tx credits when this happens, as shown below:

[root@fir-io1-s1 ~]# lnetctl peer show --nid 10.0.10.226@o2ib7 -v 4
peer:
- primary nid: 10.0.10.226@o2ib7
Multi-Rail: False
peer state: 0
peer ni:
- nid: 10.0.10.226@o2ib7
state: up
max_ni_tx_credits: 8
available_tx_credits: -2 <<<
min_tx_credits: -154
tx_q_num_of_buf: 7230544
available_rtr_credits: 8
min_rtr_credits: 8
refcount: 14
statistics:
send_count: 427582
recv_count: 2109316
drop_count: 0
sent_stats:
put: 389065
get: 38517
reply: 0
ack: 0
hello: 0
received_stats:
put: 1514551
get: 6
reply: 122123
ack: 472636
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
dropped: 12
timeout: 298
error: 0
network timeout: 0
We also noticed that the NI refs are increasing on the routers when the following messages show up on some OSSs (not all...):

Oct 20 13:34:40 fir-io1-s1 kernel: LustreError: 58590:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ae220efb800
Oct 20 13:34:40 fir-io1-s1 kernel: LustreError: 58590:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9a810b5e6000

From the OSS, the LNet pinger shows this:

00000400:00000200:20.0:1634761462.846900:0:58577:0:(router.c:1099:lnet_ping_router_locked()) rtr 10.0.10.224@o2ib7 60: deadline 0 ping_notsent 0 alive 1 alive_count 1 lpni_ping_timestamp 1748161
00000400:00000200:20.0:1634761462.846903:0:58577:0:(router.c:1099:lnet_ping_router_locked()) rtr 10.0.10.225@o2ib7 60: deadline 1748227 ping_notsent 1 alive 1 alive_count 351 lpni_ping_timestamp 1748167
00000400:00000200:20.0:1634761462.846905:0:58577:0:(router.c:1099:lnet_ping_router_locked()) rtr 10.0.10.226@o2ib7 60: deadline 1748200 ping_notsent 1 alive 1 alive_count 384 lpni_ping_timestamp 1748140
00000400:00000200:20.0:1634761462.846908:0:58577:0:(router.c:1099:lnet_ping_router_locked()) rtr 10.0.10.227@o2ib7 60: deadline 0 ping_notsent 0 alive 1 alive_count 85 lpni_ping_timestamp 1748141

Our problems are with routers 10.0.10.225@o2ib7 and 10.0.10.226@o2ib7 at the moment. Rebooting them doesn't fix the problem. Our OSSs have this configured:

options lnet live_router_check_interval=60
options lnet dead_router_check_interval=300
options lnet router_ping_timeout=60
options lnet avoid_asym_router_failure=1
options lnet check_routers_before_use=1

Clients have the default settings (so the difference is check_routers_before_use=0). Also attaching the peer list of this problematic router (sh02-fir03 at 10.0.10.226@o2ib7) as sh02-fir03.20211020.peers.txt. Do you know if there is a way to identify in the logs what is using these high NI/peer refs counts? |
| Comment by Serguei Smirnov [ 22/Oct/21 ] |
|
Hi Stephane, My understanding is that the ref counts go high because the messages on the router queue don't get cleared, same reason as for the negative "available credits". The messages can get stuck on the queue if for some reason the OSS is not responding. After some timeout (50 seconds?) this causes transactions to start expiring. I'll take a closer look at the OSS logs for clues why that may be. We may need to retrieve "ldlm" and "ptlrpc" logs in addition to "net". In the meantime, is there any information about CPU usage on the OSS? Are there any signs of lock-up? Perhaps we can get the output of "echo l > /proc/sysrq-trigger" on the server and the router to check for lock-ups? Thanks, Serguei. |
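The sysrq backtraces end up in the kernel ring buffer, so capturing them on the server and the router is roughly:

# dump backtraces of all active CPUs, then save the kernel log (may require sysrq to be enabled via /proc/sys/kernel/sysrq)
echo l > /proc/sysrq-trigger
dmesg -T > /tmp/$(hostname)-sysrq-l.log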
| Comment by Stephane Thiell [ 26/Oct/21 ] |
|
Hi Serguei, Thanks for this. I checked the OSS for lock-ups but found nothing.

Kernel logs of OSS fir-io1-s2 (10.0.10.102@o2ib7) when we started to put the two "bad" routers (10.0.10.225@o2ib7 and 10.0.10.226@o2ib7) back online and the problems started again: fir-io1-s2_kernel_sysrq-l_20211026.log.gz

Lustre logs on this OSS with dlmtrace and net: fir-io1-s2_dk_net_dlmtrace_20211026.log.gz

Every time we put these two routers back online, the problem shows up with the same two OSSs (10.0.10.101@o2ib7 and 10.0.10.102@o2ib7). These OSSs are fine with the other routers...

# clush -u5 -Lw sh02-fir[01-04] -b cat /sys/kernel/debug/lnet/peers | awk '$3 > 10 {print }'
sh02-fir[01-04]: nid refs state last max rtr min tx min queue
sh02-fir02: 10.0.10.102@o2ib7 125 up 179 8 -116 -116 8 -15 0
sh02-fir03: 10.0.10.102@o2ib7 115 up 33 8 -106 -106 8 -12 0
sh02-fir03: 10.0.10.101@o2ib7 75 up 32 8 -66 -66 8 -1 0
|
| Comment by Stephane Thiell [ 26/Oct/21 ] |
|
Adding sh02-fir03-kern_sysrq-l-t_20211026.log.gz |
| Comment by Etienne Aujames [ 27/Oct/21 ] |
|
Hello, |
| Comment by Stephane Thiell [ 27/Oct/21 ] |
|
Hi Etienne, Please interfere! It's interesting, and could perhaps match our problem. I also noticed occasional messages like this one on the OSS (here, 10.0.10.225@o2ib7 is one of the bad routers):

00000400:00000100:24.0:1635289738.440616:0:56886:0:(lib-move.c:976:lnet_post_send_locked()) Aborting message for 12345-10.0.10.225@o2ib7: LNetM[DE]Unlink() already called on the MD/ME.

I'm still wondering why this issue can survive an OSS and/or router reboot. Have you tried a patch for b2_12? Thanks! |
| Comment by Stephane Thiell [ 29/Oct/21 ] |
|
Hi Etienne and Serguei, I added the patch from

In this instance, on the router 10.0.10.225@o2ib7, the rtr credits for OSS 10.0.10.102@o2ib7 were quickly dropping way below 0 immediately after putting the router online:

[root@sh02-fir02 ~]# lnetctl peer show --nid 10.0.10.102@o2ib7 -v
peer:
- primary nid: 10.0.10.102@o2ib7
Multi-Rail: False
peer ni:
- nid: 10.0.10.102@o2ib7
state: up
max_ni_tx_credits: 8
available_tx_credits: 8
min_tx_credits: -1
tx_q_num_of_buf: 0
available_rtr_credits: -176
min_rtr_credits: -176
refcount: 185
statistics:
send_count: 18606
recv_count: 2924
drop_count: 0
|
| Comment by Serguei Smirnov [ 29/Oct/21 ] |
|
Hi Stephane, Just to clarify, was the patch from Thanks, Serguei. |
| Comment by Stephane Thiell [ 29/Oct/21 ] |
|
Hi Serguei, Yes, we actually started by applying the patch on this OSS (at 10.0.10.102@o2ib7) and tested that without success; then we also applied the patch on the router, but hit the same issue, unfortunately. |
| Comment by Serguei Smirnov [ 09/Nov/21 ] |
|
Amir and I went over the logs and tried to track down the issue. I don't have a conclusive answer, but here's roughly what appears to be happening, based on the logs provided for the lnetctl ping above:
1) There's a client that stops being responsive. LND-level "no credits" messages in the router debug log point to that.
2) If the client is unresponsive, the messages going to it from the server (via the router) are not getting finalized. This causes the router's credit count for the server to go negative and get stuck there.
3) The message queue on the server backs up. There are LND-level "no credits" messages from ko2iblnd in the server debug log. Normally when the high-watermark level is hit, the node sends a special "NOOP" request bypassing the queue, requesting that the other side (the router) release the credits.
4) If there's no reaction to the "noop", LND credits are not released and all of them are used up. The server is not able to respond to the lnetctl ping or send any message via the router using this connection.
There are possible causes:
There are some additional questions/ideas: At the time the router debug log was taken, there are actually messages for multiple clients with the "no credits" issue. How is the routing configured on the clients? Are all clients able to use all routers, or just a subset? Have you tried increasing the number of credits (go up to peer_credits = 32, peer_credits_hiw = 16, concurrent_sends = 64)? Thanks, Serguei.
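If the credit increase is attempted, those values map to ko2iblnd module parameters; a sketch of the modprobe configuration (file name illustrative; the change needs to be applied consistently across peers and requires reloading the LNet/LND modules to take effect):

# /etc/modprobe.d/ko2iblnd.conf (illustrative location)
options ko2iblnd peer_credits=32 peer_credits_hiw=16 concurrent_sends=64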
|
| Comment by Stephane Thiell [ 09/Nov/21 ] |
|
Hi Serguei, Thanks much for spending time to look at this in more detail with Amir. We really appreciate it. I'm attaching an image to give you an overview of the LNet architecture of /scratch on Sherlock. Note: in orange are the servers/routers with the repeating credit issues. There is nothing too fancy here, I believe. We have clients on 3 generations of IB fabric (o2ib1, o2ib2 and o2ib3), each connected to Fir (the servers) via their own LNet routers (so o2ib1<>o2ib7, o2ib2<>o2ib7 and o2ib3<>o2ib7). On a single cluster fabric, all compute nodes have the same routing configuration, so yes, all clients are using the same routers per fabric. This is an example from a compute node on o2ib2:

[root@sh02-10n19 ~]# lnetctl route show
route:
    - net: o2ib5
      gateway: 10.50.0.132@o2ib2
    - net: o2ib5
      gateway: 10.50.0.131@o2ib2
    - net: o2ib7
      gateway: 10.50.0.112@o2ib2
    - net: o2ib7
      gateway: 10.50.0.111@o2ib2
    - net: o2ib7
      gateway: 10.50.0.113@o2ib2
    - net: o2ib7
      gateway: 10.50.0.114@o2ib2

(you can ignore o2ib5 here, those are routers to another storage system)
To answer your second question, no, we haven't tried to increase the number of credits. We have been using the default settings. Changing that would mean a full cluster downtime and it's ... complicated. |
| Comment by Olaf Faaland [ 15/Nov/21 ] |
|
Hi Serguei, You wrote:
Assuming the router or client receives the "NOOP" request, does it release those credits by dropping the messages that were sent but not yet acknowledged (i.e. freeing the buffers holding those messages, releasing the credits, and emitting the LNetError message)? Thanks |
| Comment by Chris Horn [ 16/Nov/21 ] |
I don't understand why the LND doesn't eventually just time out these transactions when it can't acquire the necessary credits. That would then return the tx credits when those messages are finalized.
| Comment by Serguei Smirnov [ 16/Nov/21 ] |
|
Chris, I do believe that if the noop msg doesn't come, so that there are no more credits available to send, eventually there's going to be a timeout for the transaction. The problem may be that during this time the server was not able to talk via the given router to any other node, causing all kinds of backups. Olaf, Looks like I got it a bit backwards in my summary. In fact, if the server is sending, it is the receiving side (the router), once it figures out it has accumulated "high-water-mark" or more credits on the receiving end, that is expected to send the "noop" to the server, which, when received by the server, releases the credits to be used again for sending. Thanks, Serguei |
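For reference, the LND credit settings involved in this exchange (peer_credits and the high-water mark) can be checked on a running node through the module parameters, assuming they are exported read-only by the loaded ko2iblnd module:

# current ko2iblnd credit tuning on this node
cat /sys/module/ko2iblnd/parameters/peer_credits
cat /sys/module/ko2iblnd/parameters/peer_credits_hiw
cat /sys/module/ko2iblnd/parameters/concurrent_sends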
| Comment by Stephane Thiell [ 30/Nov/21 ] |
|
It looks like the race that Chris Horn found in |
| Comment by Stephane Thiell [ 27/Jan/22 ] |
|
Hello! It looks like the patch I mentioned above has been abandoned as it doesn't resolve this issue. I believe it is now our most impactful problem with 2.12, as it seems to start randomly. For example, this morning, we noticed lots of errors of this type on an OSS:

Jan 27 11:33:36 fir-io1-s2 kernel: LustreError: 31477:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffffa1313e562400

(that's 10.0.10.102@o2ib7)

And checking the ref counts on the routers, we saw a very high refcnt on the router (sh02-fir02) for this OSS (10.0.10.102@o2ib7):

[root@sh02-hn01 sthiell.root]# clush -u5 -Lw @rtr_fir -b cat /sys/kernel/debug/lnet/peers | awk '$3 > 20 {print }'
sh03-fir[01-08],sh02-fir[01-04]: nid refs state last max rtr min tx min queue
sh02-fir02: 10.0.10.102@o2ib7 8829 up 110 8 -8820 -8820 8 -87 0
Rebooting the router didn't help; its refcnt for 10.0.10.102@o2ib7 is increasing again at this time. Sometimes rebooting the "bad" router does help, though unfortunately not always. |
| Comment by Serguei Smirnov [ 31/Jan/22 ] |
|
Hi Stephane, There were a couple more patches added for Thanks, Serguei. |
| Comment by Stephane Thiell [ 01/Feb/22 ] |
|
Hi Serguei, OK – Thank you! |