[LU-17480] lustre_rmmod hangs if a lnet route is down Created: 29/Jan/24  Updated: 09/Feb/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Etienne Aujames
Assignee: Etienne Aujames
Resolution: Unresolved
Votes: 0
Labels: None
Environment:

Lustre server 2.15.3 RoCE
Lustre MGS 2.15.3 Infiniband
Lustre client 2.15.3 RoCE
Lustre router 2.12.9 Infiniband/RoCE


Issue Links:
Related
is related to LU-16283 o2iblnd.c:3049:kiblnd_shutdown() <NID... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Steps to reproduce:

  • Mount lustre on a RoCE network
  • Add a route with the gateway down
  • Generate lnet traffic (find /mnt/lustre)
  • umount client
  • lustre_rmmod
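
The steps above could be scripted roughly as follows. This is only a sketch: the MGS NID, filesystem name, remote net and gateway NID are placeholder values that do not come from this report, and the `run` helper and `DRY_RUN` switch are hypothetical conveniences so the sequence can be reviewed before anything is executed.

```shell
#!/bin/bash
# Sketch of the reproducer. All values below are placeholders (assumptions),
# not taken from the affected cluster.
MGS_NID="${MGS_NID:-192.0.2.1@o2ib50}"     # hypothetical MGS NID
FSNAME="${FSNAME:-lustre}"                 # hypothetical filesystem name
MNT="${MNT:-/mnt/lustre}"                  # mount point used in the report
REMOTE_NET="${REMOTE_NET:-o2ib51}"         # hypothetical remote LNet network
DOWN_GW="${DOWN_GW:-192.0.2.99@o2ib50}"    # hypothetical gateway that is down
DRY_RUN="${DRY_RUN:-1}"                    # 1 = only print the commands

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce() {
    run mount -t lustre "${MGS_NID}:/${FSNAME}" "$MNT"               # mount the client
    run lnetctl route add --net "$REMOTE_NET" --gateway "$DOWN_GW"   # route via a down gateway
    run find "$MNT"                                                  # generate LNet traffic
    run umount "$MNT"                                                # unmount the client
    run lustre_rmmod                                                 # expected to hang ~1 min
}
```

Calling `reproduce` prints the five commands in order; `DRY_RUN=0 reproduce` (as root, on a test client) would execute them.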

lustre_rmmod hangs for about 1 minute in "lnetctl net unconfigure":

PID: 2995     TASK: <task>  CPU: 4    COMMAND: "lnetctl"
#0 __schedule 
#1 schedule 
#2 schedule_timeout 
#3 kiblnd_shutdown 
#4 lnet_shutdown_lndni 
#5 lnet_shutdown_lndnet 
#6 lnet_shutdown_lndnets 
#7 LNetNIFini 
#8 lnet_ioctl 
#9 notifier_call_chain 
#10 blocking_notifier_call_chain 
#11 libcfs_psdev_ioctl 
#12 do_vfs_ioctl 
#13 ksys_ioctl 
#14 __x64_sys_ioctl 
#15 do_syscall_64 

dk log from client:

00000800:00000200:47.0:1706285707.687699:0:197221:0:(o2iblnd.c:3046:kiblnd_shutdown()) x.y.z.75@o2ib50: waiting for 2 peers to disconnect
00000800:00000100:1.0F:1706285708.135711:0:192402:0:(o2iblnd_cb.c:3265:kiblnd_cm_callback()) x.y.z.90@o2ib50: UNREACHABLE -110
00000800:00000200:1.0:1706285708.135713:0:192402:0:(o2iblnd_cb.c:2345:kiblnd_connreq_done()) x.y.z.90@o2ib50: active(1), version(12), status(-100)
00000800:00000010:1.0:1706285708.135714:0:192402:0:(o2iblnd_cb.c:2353:kiblnd_connreq_done()) kfreed 'conn->ibc_connvars': 136 at 000000009aa0d65a (tot 19395077).
00000400:00000200:1.0:1706285708.135717:0:192402:0:(router.c:1739:lnet_notify()) x.y.z.75@o2ib50 notifying x.y.z.90@o2ib50: down
00000800:00000200:1.0:1706285708.135920:0:192402:0:(o2iblnd_cb.c:2253:kiblnd_finalise_conn()) abort connection with x.y.z.90@o2ib50
00000800:00000200:1.0:1706285708.135922:0:192402:0:(o2iblnd_cb.c:3267:kiblnd_cm_callback()) conn[00000000f9491194] (19)--
00000800:00000100:1.0:1706285708.135938:0:192402:0:(o2iblnd_cb.c:3265:kiblnd_cm_callback()) x.y.z.99@o2ib50: UNREACHABLE -110
00000800:00000200:1.0:1706285708.135939:0:192402:0:(o2iblnd_cb.c:2345:kiblnd_connreq_done()) x.y.z.99@o2ib50: active(1), version(12), status(-100)
00000800:00000010:1.0:1706285708.135940:0:192402:0:(o2iblnd_cb.c:2353:kiblnd_connreq_done()) kfreed 'conn->ibc_connvars': 136 at 00000000868f6d6f (tot 19394941).
00000400:00000200:1.0:1706285708.135942:0:192402:0:(router.c:1739:lnet_notify()) xxxx@o2ib50 notifying x.y.z.99@o2ib50: down
00000800:00000200:29.2F:1706285708.135964:0:0:0:(o2iblnd_cb.c:3780:kiblnd_cq_completion()) conn[00000000f9491194] (18)++
00000800:00000200:33.0F:1706285708.135973:0:195209:0:(o2iblnd_cb.c:3894:kiblnd_scheduler()) conn[00000000f9491194] (19)++

The unconfigure task appears to wait for the connection attempts to the down LNet gateways x.y.z.99@o2ib50 and x.y.z.90@o2ib50 to time out (UNREACHABLE -110, i.e. -ETIMEDOUT) before kiblnd_shutdown() can finish.

The workaround is to remove the LNet routes before the unconfigure.
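
A minimal sketch of that workaround, to be run after the client is unmounted. It assumes that `lnetctl route show` prints YAML entries of the form `- net: <net>` followed by `gateway: <nid>`; the helper names `parse_routes` and `remove_routes_then_unload` are hypothetical and not part of Lustre.

```shell
#!/bin/bash
# Read `lnetctl route show` output on stdin and print one "net gateway"
# pair per configured route. The YAML layout is an assumption.
parse_routes() {
    awk '$1 == "-" && $2 == "net:"     { net = $3 }
         $1 == "gateway:" && net != "" { print net, $2; net = "" }'
}

# Delete every configured route, then unload the modules.
remove_routes_then_unload() {
    lnetctl route show | parse_routes | while read -r net gw; do
        # </dev/null keeps the command from consuming the loop's input
        lnetctl route del --net "$net" --gateway "$gw" < /dev/null
    done
    lustre_rmmod
}
```

With the routes gone, the shutdown path should no longer have pending connections to the unreachable gateways to wait on, which is the behaviour the workaround relies on.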



 Comments   
Comment by Gerrit Updater [ 09/Feb/24 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53986
Subject: LU-17480 o2iblnd: add a timeout for rdma_connect
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7499c8a3af228c7672acd4f9eb39ac60c77c07b1

Generated at Sat Feb 10 03:35:47 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.