Details
-
Bug
-
Resolution: Unresolved
-
Critical
-
None
-
Lustre 2.12.6
-
None
-
CentOS 7.9 (3.10.0-1160.24.1.el7.x86_64) on routers, Lustre 2.12.6
-
3
-
9223372036854775807
Description
Hello, we are still seeing weird behavior with LNet over IB with 2.12+. We have tried to upgrade clients and routers to 2.13 and then 2.14 without success. We went back to 2.12.6 LTS, but we are still seeing occasional kiblnd errors and timeout. The IB fabrics are healthy, sometimes a little bit of congestion but no discard. I'm starting to suspect a deeper problem with LNet/ko2iblnd, where sometimes credits are exhausted? We didn't have with 2.10. To me, the problem seems similar to the one reported in LU-14026 by LLNL.
We do have the following setup:
Fir (serves /scratch) o2ib7 — 8 x lnet routers (IB) — Sherlock v3 (o2ib3)
Last night, one of the 8 routers (sh03-fir06) started to have problems. I've taken traces so that we can investigate.
Router NIDs:
[root@sh03-fir06 ~]# lctl list_nids 10.51.0.116@o2ib3 10.0.10.237@o2ib7
LNet config on the router (we have discovery enabled and using a few Multi-Rail nodes on o2ib3):
[root@sh03-fir06 ~]# lnetctl global show
global:
numa_range: 0
max_intf: 200
discovery: 1
drop_asym_route: 0
retry_count: 0
transaction_timeout: 50
health_sensitivity: 0
recovery_interval: 1
Fir Lustre servers on o2ib7 started to exhibit the following errors:
fir-io7-s2: Apr 28 23:25:40 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9488213ef800 fir-io7-s2: Apr 28 23:25:40 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9488213ef800 fir-io7-s2: Apr 28 23:25:40 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9480cfc4c800 fir-io7-s2: Apr 28 23:25:59 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff94643457fc00 fir-io7-s2: Apr 28 23:25:59 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff94643457fc00 fir-io7-s2: Apr 28 23:26:05 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff949e65fc2400 fir-io7-s2: Apr 28 23:26:05 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff949e65fc2400 fir-io7-s2: Apr 28 23:26:11 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff945be5538400 fir-io7-s2: Apr 28 23:26:11 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff94886b2a9000 fir-io7-s2: Apr 28 23:26:17 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff94428fd56000 fir-io7-s2: Apr 28 23:26:24 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff947e46cbdc00 fir-io7-s2: Apr 28 23:26:24 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff947e46cbdc00 fir-io7-s2: Apr 28 23:26:43 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9462486d2400 fir-io7-s2: Apr 28 23:26:51 fir-io7-s2 kernel: LustreError: 68967:0:(ldlm_lib.c:3279:target_bulk_io()) @@@ Reconnect on bulk WRITE req@ffff949f1a1a3850 x1696974434037056/t0(0) o4->12f8a639-7e97-4157-8d89-6e1e00a728eb@10.51.13.20@o2ib3:363/0 lens 488/448 e 0 to 0 dl 1619677703 ref 1 fl Interpret:/0/0 rc 0/0 fir-io6-s1: Apr 28 23:24:11 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff918fc3b44400 fir-io6-s1: Apr 28 23:24:11 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff918fc3b44400 fir-io6-s1: Apr 28 23:24:18 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff91913b76a800 fir-io6-s1: Apr 28 23:24:30 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff912be4f95000 fir-io6-s1: Apr 28 23:24:30 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff912be4f95000 fir-io6-s1: Apr 28 23:24:36 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff911f0108a400 fir-io6-s1: Apr 28 23:24:55 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff914885d3bc00 fir-io6-s1: Apr 28 23:24:55 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff914885d3bc00 fir-io6-s1: Apr 28 23:25:08 fir-io6-s1 kernel: LNet: 24015:0:(o2iblnd_cb.c:2081:kiblnd_close_conn_locked()) Closing conn to 10.0.10.237@o2ib7: error -110(waiting)
One interesting thing about that router, vs. the 7 others, is that it had a lot of refs (> 3000) in {{/sys/kernel/debug/lnet/nis }} and tx stuck at -367. The high refs count is similar to an issue we noticed with 2.14 routers reported in LU-14584, and we thought that maybe this was a CPT issue. Here, it happened with Lustre 2.12.6 routers.
nid status alive refs peer rtr max tx min 0@lo down 0 2 0 0 0 0 0 0@lo down 0 0 0 0 0 0 0 0@lo down 0 0 0 0 0 0 0 0@lo down 0 0 0 0 0 0 0 0@lo down 0 0 0 0 0 0 0 0@lo down 0 0 0 0 0 0 0 0@lo down 0 0 0 0 0 0 0 0@lo down 0 0 0 0 0 0 0 10.51.0.116@o2ib3 up 0 1 8 0 64 64 47 10.51.0.116@o2ib3 up 0 0 8 0 64 64 47 10.51.0.116@o2ib3 up 0 0 8 0 64 64 46 10.51.0.116@o2ib3 up 0 0 8 0 64 64 46 10.51.0.116@o2ib3 up 0 0 8 0 64 64 41 10.51.0.116@o2ib3 up 0 0 8 0 64 64 48 10.51.0.116@o2ib3 up 0 3493 8 0 64 -367 -367 10.51.0.116@o2ib3 up 0 0 8 0 64 64 46 10.0.10.237@o2ib7 up 0 3318 8 0 64 64 48 10.0.10.237@o2ib7 up 0 3062 8 0 64 64 50 10.0.10.237@o2ib7 up 0 6202 8 0 64 64 47 10.0.10.237@o2ib7 up 0 3032 8 0 64 64 49 10.0.10.237@o2ib7 up 0 6082 8 0 64 64 48 10.0.10.237@o2ib7 up 0 5467 8 0 64 64 48 10.0.10.237@o2ib7 up 0 3115 8 0 64 64 50 10.0.10.237@o2ib7 up 0 3206 8 0 64 64 48
I took traces on this router at the time of the problem. I'm attaching a zip files sh03-fir06-20210428.zip
with the output of:
- lnetctl stats show
- lnetctl peer show
- cat /sys/kernel/debug/lnet/nis
- cat /sys/kernel/debug/lnet/peers
- kernel logs
- short dk logs with +net enabled, just in case that would show something interesting
Rebooting the router fixed the problem.