[LU-17480] lustre_rmmod hangs if an LNet route is down

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0
    • Environment:
      Lustre server 2.15.3 RoCE
      Lustre MGS 2.15.3 Infiniband
      Lustre client 2.15.3 RoCE
      Lustre router 2.12.9 Infiniband/RoCE
    • Severity: 3

    Description

      Here is the reproducer:

      • Mount lustre on a RoCE network
      • Add a route with the gateway down
      • Generate lnet traffic (find /mnt/lustre)
      • umount client
      • lustre_rmmod
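      As a sketch, these steps could look like the following commands (the file system name, mount point, networks and gateway NID are placeholders, not values from this ticket):

        mount -t lustre x.y.z.1@o2ib50:/lustre /mnt/lustre          # mount over the RoCE network
        lnetctl route add --net o2ib60 --gateway x.y.z.254@o2ib50   # gateway node is down
        find /mnt/lustre > /dev/null                                # generate LNet traffic
        umount /mnt/lustre
        lustre_rmmod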

      lustre_rmmod hangs for around 1 minute in "lnetctl net unconfigure":

      PID: 2995     TASK: <task>  CPU: 4    COMMAND: "lnetctl"
      #0 __schedule 
      #1 schedule 
      #2 schedule_timeout 
      #3 kiblnd_shutdown 
      #4 lnet_shutdown_lndni 
      #5 lnet_shutdown_lndnet 
      #6 lnet_shutdown_lndnets 
      #7 LNetNIFini 
      #8 lnet_ioctl 
      #9 notifier_call_chain 
      #10 blocking_notifier_call_chain 
      #11 libcfs_psdev_ioctl 
      #12 do_vfs_ioctl 
      #13 ksys_ioctl 
      #14 __x64_sys_ioctl 
      #15 do_syscall_64 
      

      dk log from client:

      00000800:00000200:47.0:1706285707.687699:0:197221:0:(o2iblnd.c:3046:kiblnd_shutdown()) x.y.z.75@o2ib50: waiting for 2 peers to disconnect
      00000800:00000100:1.0F:1706285708.135711:0:192402:0:(o2iblnd_cb.c:3265:kiblnd_cm_callback()) x.y.z.90@o2ib50: UNREACHABLE -110
      00000800:00000200:1.0:1706285708.135713:0:192402:0:(o2iblnd_cb.c:2345:kiblnd_connreq_done()) x.y.z.90@o2ib50: active(1), version(12), status(-100)
      00000800:00000010:1.0:1706285708.135714:0:192402:0:(o2iblnd_cb.c:2353:kiblnd_connreq_done()) kfreed 'conn->ibc_connvars': 136 at 000000009aa0d65a (tot 19395077).
      00000400:00000200:1.0:1706285708.135717:0:192402:0:(router.c:1739:lnet_notify()) x.y.z.75@o2ib50 notifying x.y.z.90@o2ib50: down
      00000800:00000200:1.0:1706285708.135920:0:192402:0:(o2iblnd_cb.c:2253:kiblnd_finalise_conn()) abort connection with x.y.z.90@o2ib50
      00000800:00000200:1.0:1706285708.135922:0:192402:0:(o2iblnd_cb.c:3267:kiblnd_cm_callback()) conn[00000000f9491194] (19)--
      00000800:00000100:1.0:1706285708.135938:0:192402:0:(o2iblnd_cb.c:3265:kiblnd_cm_callback()) x.y.z.99@o2ib50: UNREACHABLE -110
      00000800:00000200:1.0:1706285708.135939:0:192402:0:(o2iblnd_cb.c:2345:kiblnd_connreq_done()) x.y.z.99@o2ib50: active(1), version(12), status(-100)
      00000800:00000010:1.0:1706285708.135940:0:192402:0:(o2iblnd_cb.c:2353:kiblnd_connreq_done()) kfreed 'conn->ibc_connvars': 136 at 00000000868f6d6f (tot 19394941).
      00000400:00000200:1.0:1706285708.135942:0:192402:0:(router.c:1739:lnet_notify()) xxxx@o2ib50 notifying x.y.z.99@o2ib50: down
      00000800:00000200:29.2F:1706285708.135964:0:0:0:(o2iblnd_cb.c:3780:kiblnd_cq_completion()) conn[00000000f9491194] (18)++
      00000800:00000200:33.0F:1706285708.135973:0:195209:0:(o2iblnd_cb.c:3894:kiblnd_scheduler()) conn[00000000f9491194] (19)++
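      (For reference, a client debug log like the one above can typically be collected with the standard lctl debug facilities, e.g.:)

        lctl set_param debug=+net   # enable LNet/network debug messages
        lctl dk > /tmp/dk.log       # dump the kernel debug buffer ("dk" = debug_kernel)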
      

      The unconfigure task seems to wait on a connection timeout for the down LNet gateway "x.y.z.99@o2ib50" and for x.y.z.90@o2ib50 (UNREACHABLE -110).

      The workaround is to remove the LNet routes before unconfiguring LNet.
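      For example (a sketch; the remote network and gateway NID are placeholders, not values from this ticket):

        lnetctl route del --net o2ib60 --gateway x.y.z.254@o2ib50   # remove the route whose gateway is down
        lustre_rmmod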

          Activity


            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56523
            Subject: LU-17480 o2iblnd: add a timeout for rdma_connect
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: a13571b53c152df7168aa5894509ec232a851670

            pjones Peter Jones added a comment -

            Merged for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53986/
            Subject: LU-17480 o2iblnd: add a timeout for rdma_connect
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0b8c18d8c86357c557e959779e219ca7fd24d5d8


            eaujames Etienne Aujames added a comment -

            I sent a kernel patch to tune "CM retries" and "CM timeout" per connection: "[PATCH rdma-next] IB/cma: Define options to set CM timeouts and retries"
            (https://lore.kernel.org/linux-rdma/ZgxeQbxfKXHkUlQG@eaujamesDDN/T/#u)


            eaujames Etienne Aujames added a comment -

            Hi,

            The CEA sniffed the RDMA RoCE traffic to determine the origin of the connection timeout (tcpdump -i mlx5_0 -w roce.pcap).

            Here is what we observed for an unreachable node:

            • 16 CM ConnectRequests are sent
            • The send period is 18 seconds
            • For each ConnectRequest, an ICMP request is emitted from the gateway 3 seconds later

            I dug into the MOFED code and those timeouts are explained by those 3 fields/constants:

            • max_cm_retries / CMA_MAX_CM_RETRIES
            • (local|remote)_cm_response_timeout / CMA_CM_RESPONSE_TIMEOUT
            • packet_life_time / CMA_IBOE_PACKET_LIFETIME

            The retry_count and rnr_retry_count do not have any impact for a non-connected QP (tested in the field).

            To compute the connection timeout, I have used this:

            static inline int cm_convert_to_ms(int iba_time)                       
            {                                                                      
                    /* approximate conversion to ms from 4.096us x 2^iba_time */   
                    return 1 << max(iba_time - 8, 0);                              
            }                                                                      
            ....
            int ib_send_cm_req(struct ib_cm_id *cm_id,            
                               struct ib_cm_req_param *param)     
            ...
                    cm_id_priv->timeout_ms = cm_convert_to_ms(                              
                                                param->primary_path->packet_life_time) * 2 +
                                             cm_convert_to_ms(                              
                                                param->remote_cm_response_timeout);         
            
            

            Here are the different CM parameters and the computed timeouts by MOFED version:

            OFED version      CMA_MAX_CM_RETRIES  CMA_IBOE_PACKET_LIFETIME  CMA_CM_RESPONSE_TIMEOUT  ConnectRequest timeout (s)  rdma_connect timeout (s)
            V4.9              15                  16                        22                       16.896                      270.336
            V5.4              15                  18                        22                       18.432                      294.912
            V5.8              15                  18                        22                       18.432                      294.912
            V24.01            15                  16                        22                       16.896                      270.336
            kernel v6.8-rc3   15                  16                        20                       4.608                       73.728
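            As a sanity check of the table (my own arithmetic, assuming max_cm_retries + 1 = 16 connection attempts, which matches the 16 ConnectRequests observed above), the MOFED 5.4 row can be recomputed from the quoted cm_convert_to_ms() formula:

                # per-ConnectRequest timeout (ms):
                #   2 * cm_convert_to_ms(packet_life_time) + cm_convert_to_ms(remote_cm_response_timeout)
                echo $(( 2 * (1 << (18 - 8)) + (1 << (22 - 8)) ))          # 18432 ms = 18.432 s
                # overall rdma_connect timeout (ms): 16 attempts
                echo $(( 16 * (2 * (1 << (18 - 8)) + (1 << (22 - 8))) ))   # 294912 ms = 294.912 s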

            The computed timeouts for the CEA MOFED (5.4) match what we observed in the field. Those timeouts are statically defined in the MOFED driver.
            So to make the current LND implementation work with RoCE, the network has to be flat (no IPv4 routes): that way the ARP request will return an error or time out, and the ConnectRequest will not be sent. But if the remote node is in the ARP cache, the issue still exists.
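            (As an aside, whether a remote is already present in the ARP/neighbour cache can be checked, or the cache cleared, with standard iproute2 commands; the interface name below is a placeholder:)

                ip neigh show dev eth0    # list cached neighbour (ARP) entries
                ip neigh flush dev eth0   # clear them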

            Note that for Infiniband there is no issue because of OpenSM: in rdma_resolve_route() a PathRecord request is sent to the SM node, and if an empty record is returned the connection is aborted.

            This looks like a RoCE design flaw or bug. The CEA will contact Mellanox/NVIDIA support to get more information.

            In the meantime, we need to manage the connection request timeout in the LND to keep the Lustre pingers working on RoCE networks.


            The "UNREACHABLE" event is sent after a rdma_connect, the CM seems to take more than 4min to returns the event. rdma_resolve_addr/rdma_resolve_route does not return errors because the CEA use several VLANs (for client/routers/servers) and route packets between them (no ARP ping on the final node).

            For an Infiniband fabric, we do not see this issue. If the node is not up, rdma_resolve_route() will generate an "ADDR_ERROR" event. If the node is up but kiblnd is not started, this will generate a "REJECTED" event.

            For now, I am not sure whether the problem is on the fabric side or not. I found some people reporting this kind of behavior:

            • Bug 214523 rdma_connect() "timeout"
            The patch above tries to mitigate this issue by tracking the connect requests and checking a timeout on the LND side to destroy the hanging connections.


            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53986
            Subject: LU-17480 o2iblnd: add a timeout for rdma_connect
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7499c8a3af228c7672acd4f9eb39ac60c77c07b1


            People

              Assignee: Etienne Aujames
              Reporter: Etienne Aujames
              Votes: 0
              Watchers: 11