Details
-
Bug
-
Resolution: Fixed
-
Minor
-
None
-
None
-
Lustre server 2.15.3 RoCE
Lustre MGS 2.15.3 Infiniband
Lustre client 2.15.3 RoCE
Lustre router 2.12.9 Infiniband/RoCE
-
3
-
9223372036854775807
Description
The issue can be reproduced as follows:
- Mount lustre on a RoCE network
- Add a route with the gateway down
- Generate lnet traffic (find /mnt/lustre)
- umount client
- lustre_rmmod
lustre_rmmod hangs for about 1 minute in "lnetctl lnet unconfigure":
PID: 2995 TASK: <task> CPU: 4 COMMAND: "lnetctl"
 #0 __schedule
 #1 schedule
 #2 schedule_timeout
 #3 kiblnd_shutdown
 #4 lnet_shutdown_lndni
 #5 lnet_shutdown_lndnet
 #6 lnet_shutdown_lndnets
 #7 LNetNIFini
 #8 lnet_ioctl
 #9 notifier_call_chain
#10 blocking_notifier_call_chain
#11 libcfs_psdev_ioctl
#12 do_vfs_ioctl
#13 ksys_ioctl
#14 __x64_sys_ioctl
#15 do_syscall_64
dk log from client:
00000800:00000200:47.0:1706285707.687699:0:197221:0:(o2iblnd.c:3046:kiblnd_shutdown()) x.y.z.75@o2ib50: waiting for 2 peers to disconnect
00000800:00000100:1.0F:1706285708.135711:0:192402:0:(o2iblnd_cb.c:3265:kiblnd_cm_callback()) x.y.z.90@o2ib50: UNREACHABLE -110
00000800:00000200:1.0:1706285708.135713:0:192402:0:(o2iblnd_cb.c:2345:kiblnd_connreq_done()) x.y.z.90@o2ib50: active(1), version(12), status(-100)
00000800:00000010:1.0:1706285708.135714:0:192402:0:(o2iblnd_cb.c:2353:kiblnd_connreq_done()) kfreed 'conn->ibc_connvars': 136 at 000000009aa0d65a (tot 19395077).
00000400:00000200:1.0:1706285708.135717:0:192402:0:(router.c:1739:lnet_notify()) x.y.z.75@o2ib50 notifying x.y.z.90@o2ib50: down
00000800:00000200:1.0:1706285708.135920:0:192402:0:(o2iblnd_cb.c:2253:kiblnd_finalise_conn()) abort connection with x.y.z.90@o2ib50
00000800:00000200:1.0:1706285708.135922:0:192402:0:(o2iblnd_cb.c:3267:kiblnd_cm_callback()) conn[00000000f9491194] (19)--
00000800:00000100:1.0:1706285708.135938:0:192402:0:(o2iblnd_cb.c:3265:kiblnd_cm_callback()) x.y.z.99@o2ib50: UNREACHABLE -110
00000800:00000200:1.0:1706285708.135939:0:192402:0:(o2iblnd_cb.c:2345:kiblnd_connreq_done()) x.y.z.99@o2ib50: active(1), version(12), status(-100)
00000800:00000010:1.0:1706285708.135940:0:192402:0:(o2iblnd_cb.c:2353:kiblnd_connreq_done()) kfreed 'conn->ibc_connvars': 136 at 00000000868f6d6f (tot 19394941).
00000400:00000200:1.0:1706285708.135942:0:192402:0:(router.c:1739:lnet_notify()) xxxx@o2ib50 notifying x.y.z.99@o2ib50: down
00000800:00000200:29.2F:1706285708.135964:0:0:0:(o2iblnd_cb.c:3780:kiblnd_cq_completion()) conn[00000000f9491194] (18)++
00000800:00000200:33.0F:1706285708.135973:0:195209:0:(o2iblnd_cb.c:3894:kiblnd_scheduler()) conn[00000000f9491194] (19)++
The unconfigure task appears to be waiting for a timeout on the two LNet gateways that are down, x.y.z.99@o2ib50 and x.y.z.90@o2ib50 (UNREACHABLE -110).
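To check which peers a shutdown is stuck on, the UNREACHABLE events can be pulled out of a dk log automatically. The following is a minimal sketch, not part of Lustre: the regular expressions and the function name `unreachable_peers` are assumptions based only on the line layout visible in the log above, and the sample lines are copied from that log.

```python
import re

# One dk log line as it appears above: colon-separated header fields,
# then "(file:line:function())" and a free-form message.
# The group names are illustrative labels, not official field names.
DK_LINE = re.compile(
    r"^(?P<subsys>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):(?P<cpu>[^:]+):"
    r"(?P<ts>\d+\.\d+):\d+:(?P<pid>\d+):\d+:"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+(?P<msg>.*)$"
)

# Message format printed by kiblnd_cm_callback() in the log above.
UNREACHABLE = re.compile(r"^(?P<nid>\S+): UNREACHABLE (?P<rc>-?\d+)$")


def unreachable_peers(log_text):
    """Return {nid: return_code} for each UNREACHABLE event in the log."""
    peers = {}
    for raw in log_text.splitlines():
        m = DK_LINE.match(raw.strip())
        if not m or m.group("func") != "kiblnd_cm_callback":
            continue
        hit = UNREACHABLE.match(m.group("msg"))
        if hit:
            peers[hit.group("nid")] = int(hit.group("rc"))
    return peers


# Three lines taken verbatim from the client log in this ticket.
SAMPLE = """\
00000800:00000100:1.0F:1706285708.135711:0:192402:0:(o2iblnd_cb.c:3265:kiblnd_cm_callback()) x.y.z.90@o2ib50: UNREACHABLE -110
00000800:00000200:1.0:1706285708.135922:0:192402:0:(o2iblnd_cb.c:3267:kiblnd_cm_callback()) conn[00000000f9491194] (19)--
00000800:00000100:1.0:1706285708.135938:0:192402:0:(o2iblnd_cb.c:3265:kiblnd_cm_callback()) x.y.z.99@o2ib50: UNREACHABLE -110
"""

if __name__ == "__main__":
    for nid, rc in unreachable_peers(SAMPLE).items():
        print(nid, rc)
```

On the sample lines this reports both gateways with return code -110, which matches the two peers that kiblnd_shutdown() is waiting for.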
The workaround is to remove the LNet routes before running the unconfigure.
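The ordering of the workaround can be sketched as a small helper that only builds and prints the command sequence, without running anything. This is an illustration under assumptions: the helper name `teardown_commands` is hypothetical, the remote network `o2ib51` is a placeholder (the ticket does not state which network the route targets), and the `lnetctl route del --net ... --gateway ...` form should be checked against the installed lnetctl version.

```python
import shlex


def teardown_commands(routes, mountpoint="/mnt/lustre"):
    """Build the teardown sequence with routes removed before module unload.

    routes is a list of (remote_net, gateway_nid) tuples.
    """
    cmds = [["umount", mountpoint]]
    for remote_net, gateway in routes:
        # Drop each route first so the unconfigure step does not have to
        # wait on gateways that are down.
        cmds.append(
            ["lnetctl", "route", "del", "--net", remote_net, "--gateway", gateway]
        )
    cmds.append(["lustre_rmmod"])
    return cmds


if __name__ == "__main__":
    # "o2ib51" is a hypothetical remote net; the gateway NIDs come from the log.
    routes = [
        ("o2ib51", "x.y.z.90@o2ib50"),
        ("o2ib51", "x.y.z.99@o2ib50"),
    ]
    for cmd in teardown_commands(routes):
        print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run on any machine; the output can be reviewed and then run by hand on the client.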
Issue Links
- is related to
  - LU-18755 landing a LU-17480 don't fixes an LBUG in CM event handler. (Open)
  - LU-18275 o2iblnd: unable to handle kernel NULL pointer dereference in kiblnd_cm_callback when receiving RDMA_CM_EVENT_UNREACHABLE (Open)
  - LU-18364 rdma_cm: unable to handle kernel NULL pointer dereference in process_one_work when disconnect (Open)
  - LU-18260 o2iblnd: graceful handling of unexpected RDMA_CM_EVENT_REJECTED (Resolved)
  - LU-17632 o2iblnd: graceful handling of unexpected CM_EVENT_CONNECT_ERROR (Resolved)
- is related to
  - LU-16283 o2iblnd.c:3049:kiblnd_shutdown() <NID>: waiting for <N> peers to disconnect (Open)