LU-17480

lustre_rmmod hangs if a lnet route is down

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • Lustre server 2.15.3 RoCE
      Lustre MGS 2.15.3 Infiniband
      Lustre client 2.15.3 RoCE
      Lustre router 2.12.9 Infiniband/RoCE

    Description

      Steps to reproduce:

      • Mount lustre on a RoCE network
      • Add a route with the gateway down
      • Generate lnet traffic (find /mnt/lustre)
      • umount client
      • lustre_rmmod
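
      The steps above can be sketched as a small script. This is only an illustration under stated assumptions: the MGS NID, file system name, remote network, and gateway NID are hypothetical placeholders (only the /mnt/lustre mount point comes from this report), and a working route to the MGS network is assumed to be configured already. The script defaults to a dry run that prints the commands instead of executing them.

```shell
#!/bin/sh
# Reproducer sketch. Every default below is a hypothetical placeholder
# except the mount point, which is the one used in this report.
MGS_NID="${MGS_NID:-10.0.0.1@o2ib1}"     # hypothetical MGS NID
FSNAME="${FSNAME:-lustre}"               # hypothetical file system name
MNT="${MNT:-/mnt/lustre}"                # mount point from the report
REMOTE_NET="${REMOTE_NET:-o2ib1}"        # hypothetical remote LNet network
DOWN_GW="${DOWN_GW:-10.0.0.99@o2ib50}"   # hypothetical NID of a gateway that is down
DRY_RUN="${DRY_RUN:-1}"                  # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce() {
    run mount -t lustre "$MGS_NID:/$FSNAME" "$MNT"
    # Add a route whose gateway is down.
    run lnetctl route add --net "$REMOTE_NET" --gateway "$DOWN_GW"
    # Generate LNet traffic by walking the file system.
    run find "$MNT"
    run umount "$MNT"
    # Expected to stall in the LNet unconfigure step.
    run lustre_rmmod
}
```

      Calling reproduce prints the five commands; setting DRY_RUN=0 before sourcing the script and running it as root on a test client executes them (redirecting the output to /dev/null keeps the find listing out of the terminal).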

      lustre_rmmod hangs for about 1 minute in "lnetctl net unconfigure". Stack of the blocked task:

      PID: 2995     TASK: <task>  CPU: 4    COMMAND: "lnetctl"
      #0 __schedule 
      #1 schedule 
      #2 schedule_timeout 
      #3 kiblnd_shutdown 
      #4 lnet_shutdown_lndni 
      #5 lnet_shutdown_lndnet 
      #6 lnet_shutdown_lndnets 
      #7 LNetNIFini 
      #8 lnet_ioctl 
      #9 notifier_call_chain 
      #10 blocking_notifier_call_chain 
      #11 libcfs_psdev_ioctl 
      #12 do_vfs_ioctl 
      #13 ksys_ioctl 
      #14 __x64_sys_ioctl 
      #15 do_syscall_64 
      

      dk log from client:

      00000800:00000200:47.0:1706285707.687699:0:197221:0:(o2iblnd.c:3046:kiblnd_shutdown()) x.y.z.75@o2ib50: waiting for 2 peers to disconnect
      00000800:00000100:1.0F:1706285708.135711:0:192402:0:(o2iblnd_cb.c:3265:kiblnd_cm_callback()) x.y.z.90@o2ib50: UNREACHABLE -110
      00000800:00000200:1.0:1706285708.135713:0:192402:0:(o2iblnd_cb.c:2345:kiblnd_connreq_done()) x.y.z.90@o2ib50: active(1), version(12), status(-100)
      00000800:00000010:1.0:1706285708.135714:0:192402:0:(o2iblnd_cb.c:2353:kiblnd_connreq_done()) kfreed 'conn->ibc_connvars': 136 at 000000009aa0d65a (tot 19395077).
      00000400:00000200:1.0:1706285708.135717:0:192402:0:(router.c:1739:lnet_notify()) x.y.z.75@o2ib50 notifying x.y.z.90@o2ib50: down
      00000800:00000200:1.0:1706285708.135920:0:192402:0:(o2iblnd_cb.c:2253:kiblnd_finalise_conn()) abort connection with x.y.z.90@o2ib50
      00000800:00000200:1.0:1706285708.135922:0:192402:0:(o2iblnd_cb.c:3267:kiblnd_cm_callback()) conn[00000000f9491194] (19)--
      00000800:00000100:1.0:1706285708.135938:0:192402:0:(o2iblnd_cb.c:3265:kiblnd_cm_callback()) x.y.z.99@o2ib50: UNREACHABLE -110
      00000800:00000200:1.0:1706285708.135939:0:192402:0:(o2iblnd_cb.c:2345:kiblnd_connreq_done()) x.y.z.99@o2ib50: active(1), version(12), status(-100)
      00000800:00000010:1.0:1706285708.135940:0:192402:0:(o2iblnd_cb.c:2353:kiblnd_connreq_done()) kfreed 'conn->ibc_connvars': 136 at 00000000868f6d6f (tot 19394941).
      00000400:00000200:1.0:1706285708.135942:0:192402:0:(router.c:1739:lnet_notify()) xxxx@o2ib50 notifying x.y.z.99@o2ib50: down
      00000800:00000200:29.2F:1706285708.135964:0:0:0:(o2iblnd_cb.c:3780:kiblnd_cq_completion()) conn[00000000f9491194] (18)++
      00000800:00000200:33.0F:1706285708.135973:0:195209:0:(o2iblnd_cb.c:3894:kiblnd_scheduler()) conn[00000000f9491194] (19)++
      

      The unconfigure task appears to be waiting for a timeout on the down LNet gateway: x.y.z.99@o2ib50 and x.y.z.90@o2ib50 both report UNREACHABLE -110 (ETIMEDOUT).
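
      To list the peers that a debug log reports as unreachable, a filter along these lines may help. It is a sketch that assumes the log lines have the same layout as the kiblnd_cm_callback() lines quoted above; the function name unreachable_nids is made up for this example.

```shell
#!/bin/sh
# Read a Lustre debug log on stdin and print each NID that
# kiblnd_cm_callback() reported as UNREACHABLE, once per NID.
# Assumes the line layout shown in the dk log excerpt of this report.
unreachable_nids() {
    sed -n 's/.*kiblnd_cm_callback()) \(.*\): UNREACHABLE.*/\1/p' | sort -u
}
```

      For example, the output of "lctl dk" can be piped into unreachable_nids on the client while lustre_rmmod is stalled.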

      The workaround is to remove the LNet routes before running the unconfigure.
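
      A minimal sketch of that workaround follows. It assumes that "lnetctl route show" prints YAML entries with a "net:" key followed by a "gateway:" key for each route; the helper names parse_routes and remove_all_routes are made up for this example and are not part of Lustre.

```shell
#!/bin/sh
# Read "lnetctl route show" YAML on stdin and print one "net gateway"
# pair per route. Assumes "net:" precedes "gateway:" in each entry.
parse_routes() {
    awk '
        $1 == "-" && $2 == "net:"      { net = $3 }
        $1 == "net:"                   { net = $2 }
        $1 == "gateway:" && net != ""  { print net, $2 }
    '
}

# Delete every configured route, then unload the Lustre modules.
remove_all_routes() {
    lnetctl route show | parse_routes | while read -r net gw; do
        lnetctl route del --net "$net" --gateway "$gw"
    done
}

unload_lustre() {
    remove_all_routes
    lustre_rmmod
}
```

      After the client is unmounted, calling unload_lustre as root removes the routes first so that lustre_rmmod no longer has to wait on the unreachable gateway.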

            People

              eaujames Etienne Aujames
              eaujames Etienne Aujames