[LU-17480] lustre_rmmod hangs if an LNet route is down

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0
    • Environment:
      Lustre server 2.15.3 RoCE
      Lustre MGS 2.15.3 Infiniband
      Lustre client 2.15.3 RoCE
      Lustre router 2.12.9 Infiniband/RoCE
    • Severity: 3

    Description

      Here is the reproducer:

      • Mount lustre on a RoCE network
      • Add a route with the gateway down
      • Generate lnet traffic (find /mnt/lustre)
      • umount client
      • lustre_rmmod
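      As a sketch, these steps could look like the following commands (the file system name, mount point, networks and gateway NID are placeholders, not values from this ticket):

        mount -t lustre x.y.z.1@o2ib50:/lustre /mnt/lustre          # mount over the RoCE network
        lnetctl route add --net o2ib60 --gateway x.y.z.254@o2ib50   # gateway node is down
        find /mnt/lustre > /dev/null                                # generate LNet traffic
        umount /mnt/lustre
        lustre_rmmod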

      lustre_rmmod hangs for around 1 minute in "lnetctl net unconfigure":

      PID: 2995     TASK: <task>  CPU: 4    COMMAND: "lnetctl"
      #0 __schedule 
      #1 schedule 
      #2 schedule_timeout 
      #3 kiblnd_shutdown 
      #4 lnet_shutdown_lndni 
      #5 lnet_shutdown_lndnet 
      #6 lnet_shutdown_lndnets 
      #7 LNetNIFini 
      #8 lnet_ioctl 
      #9 notifier_call_chain 
      #10 blocking_notifier_call_chain 
      #11 libcfs_psdev_ioctl 
      #12 do_vfs_ioctl 
      #13 ksys_ioctl 
      #14 __x64_sys_ioctl 
      #15 do_syscall_64 
      

      dk log from client:

      00000800:00000200:47.0:1706285707.687699:0:197221:0:(o2iblnd.c:3046:kiblnd_shutdown()) x.y.z.75@o2ib50: waiting for 2 peers to disconnect
      00000800:00000100:1.0F:1706285708.135711:0:192402:0:(o2iblnd_cb.c:3265:kiblnd_cm_callback()) x.y.z.90@o2ib50: UNREACHABLE -110
      00000800:00000200:1.0:1706285708.135713:0:192402:0:(o2iblnd_cb.c:2345:kiblnd_connreq_done()) x.y.z.90@o2ib50: active(1), version(12), status(-100)
      00000800:00000010:1.0:1706285708.135714:0:192402:0:(o2iblnd_cb.c:2353:kiblnd_connreq_done()) kfreed 'conn->ibc_connvars': 136 at 000000009aa0d65a (tot 19395077).
      00000400:00000200:1.0:1706285708.135717:0:192402:0:(router.c:1739:lnet_notify()) x.y.z.75@o2ib50 notifying x.y.z.90@o2ib50: down
      00000800:00000200:1.0:1706285708.135920:0:192402:0:(o2iblnd_cb.c:2253:kiblnd_finalise_conn()) abort connection with x.y.z.90@o2ib50
      00000800:00000200:1.0:1706285708.135922:0:192402:0:(o2iblnd_cb.c:3267:kiblnd_cm_callback()) conn[00000000f9491194] (19)--
      00000800:00000100:1.0:1706285708.135938:0:192402:0:(o2iblnd_cb.c:3265:kiblnd_cm_callback()) x.y.z.99@o2ib50: UNREACHABLE -110
      00000800:00000200:1.0:1706285708.135939:0:192402:0:(o2iblnd_cb.c:2345:kiblnd_connreq_done()) x.y.z.99@o2ib50: active(1), version(12), status(-100)
      00000800:00000010:1.0:1706285708.135940:0:192402:0:(o2iblnd_cb.c:2353:kiblnd_connreq_done()) kfreed 'conn->ibc_connvars': 136 at 00000000868f6d6f (tot 19394941).
      00000400:00000200:1.0:1706285708.135942:0:192402:0:(router.c:1739:lnet_notify()) xxxx@o2ib50 notifying x.y.z.99@o2ib50: down
      00000800:00000200:29.2F:1706285708.135964:0:0:0:(o2iblnd_cb.c:3780:kiblnd_cq_completion()) conn[00000000f9491194] (18)++
      00000800:00000200:33.0F:1706285708.135973:0:195209:0:(o2iblnd_cb.c:3894:kiblnd_scheduler()) conn[00000000f9491194] (19)++
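      (For reference, a client debug log like the one above can typically be collected with the standard lctl debug facilities, e.g.:)

        lctl set_param debug=+net   # enable LNet/network debug messages
        lctl dk > /tmp/dk.log       # dump the kernel debug buffer ("dk" = debug_kernel)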
      

      The unconfigure task seems to wait on a connection timeout for the down LNet gateway "x.y.z.99@o2ib50" and for x.y.z.90@o2ib50 (UNREACHABLE -110).

      The workaround is to remove the LNet routes before unconfiguring LNet.
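      For example (a sketch; the remote network and gateway NID are placeholders, not values from this ticket):

        lnetctl route del --net o2ib60 --gateway x.y.z.254@o2ib50   # remove the route whose gateway is down
        lustre_rmmod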

          Activity


            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56523
            Subject: LU-17480 o2iblnd: add a timeout for rdma_connect
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: a13571b53c152df7168aa5894509ec232a851670

            pjones Peter Jones added a comment -

            Merged for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53986/
            Subject: LU-17480 o2iblnd: add a timeout for rdma_connect
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0b8c18d8c86357c557e959779e219ca7fd24d5d8


            eaujames Etienne Aujames added a comment -

            I sent a kernel patch to tune "CM retries" and "CM timeout" per connection: "[PATCH rdma-next] IB/cma: Define options to set CM timeouts and retries"
            (https://lore.kernel.org/linux-rdma/ZgxeQbxfKXHkUlQG@eaujamesDDN/T/#u)


            eaujames Etienne Aujames added a comment -

            Hi,

            The CEA sniffed the RDMA RoCE traffic to determine the origin of the connection timeout (tcpdump -i mlx5_0 -w roce.pcap).

            Here is what we observed for an unreachable node:

            • 16 CM ConnectRequests are sent
            • The send period is 18 seconds
            • For each ConnectRequest, an ICMP request is emitted from the gateway 3 seconds later

            I dug into the MOFED code and those timeouts are explained by those 3 fields/constants:

            • max_cm_retries / CMA_MAX_CM_RETRIES
            • (local|remote)_cm_response_timeout / CMA_CM_RESPONSE_TIMEOUT
            • packet_life_time / CMA_IBOE_PACKET_LIFETIME

            The retry_count and rnr_retry_count do not have any impact for a non-connected QP (tested in the field).

            To compute the connection timeout, I have used this:

            static inline int cm_convert_to_ms(int iba_time)                       
            {                                                                      
                    /* approximate conversion to ms from 4.096us x 2^iba_time */   
                    return 1 << max(iba_time - 8, 0);                              
            }                                                                      
            ....
            int ib_send_cm_req(struct ib_cm_id *cm_id,            
                               struct ib_cm_req_param *param)     
            ...
                    cm_id_priv->timeout_ms = cm_convert_to_ms(                              
                                                param->primary_path->packet_life_time) * 2 +
                                             cm_convert_to_ms(                              
                                                param->remote_cm_response_timeout);         
            
            

            Here are the different CM parameters and the computed timeouts by MOFED version:

            OFED version      CMA_MAX_CM_RETRIES  CMA_IBOE_PACKET_LIFETIME  CMA_CM_RESPONSE_TIMEOUT  ConnectRequest timeout (s)  rdma_connect timeout (s)
            V4.9              15                  16                        22                       16.896                      270.336
            V5.4              15                  18                        22                       18.432                      294.912
            V5.8              15                  18                        22                       18.432                      294.912
            V24.01            15                  16                        22                       16.896                      270.336
            kernel v6.8-rc3   15                  16                        20                       4.608                       73.728
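            As a sanity check of the table (my own arithmetic, assuming max_cm_retries + 1 = 16 connection attempts, which matches the 16 ConnectRequests observed above), the MOFED 5.4 row can be recomputed from the quoted cm_convert_to_ms() formula:

                # per-ConnectRequest timeout (ms):
                #   2 * cm_convert_to_ms(packet_life_time) + cm_convert_to_ms(remote_cm_response_timeout)
                echo $(( 2 * (1 << (18 - 8)) + (1 << (22 - 8)) ))          # 18432 ms = 18.432 s
                # overall rdma_connect timeout (ms): 16 attempts
                echo $(( 16 * (2 * (1 << (18 - 8)) + (1 << (22 - 8))) ))   # 294912 ms = 294.912 s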

            The computed timeouts for the CEA MOFED (5.4) match what we observed in the field. Those timeouts are statically defined in the MOFED driver.
            So to make the current LND implementation work with RoCE, the network has to be flat (no IPv4 routes): that way the ARP request will return an error or time out, and the ConnectRequest will not be sent. But if the remote node is in the ARP cache, the issue still exists.
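            (As an aside, whether a remote is already present in the ARP/neighbour cache can be checked, or the cache cleared, with standard iproute2 commands; the interface name below is a placeholder:)

                ip neigh show dev eth0    # list cached neighbour (ARP) entries
                ip neigh flush dev eth0   # clear them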

            Note that for Infiniband there is no issue because of OpenSM: in rdma_resolve_route() a PathRecord request is sent to the SM node, and if an empty record is returned the connection is aborted.

            This looks like a RoCE design flaw or bug. The CEA will contact Mellanox/NVIDIA support to get more information.

            In the meantime, we need to manage the connection request timeout in the LND to keep the Lustre pingers working on RoCE networks.


            The "UNREACHABLE" event is sent after a rdma_connect, the CM seems to take more than 4min to returns the event. rdma_resolve_addr/rdma_resolve_route does not return errors because the CEA use several VLANs (for client/routers/servers) and route packets between them (no ARP ping on the final node).

            For an Infiniband fabric, we do not see this issue. If the node is not up, rdma_resolve_route() will generate an "ADDR_ERROR" event. If the node is up but kiblnd is not started, this will generate a "REJECTED" event.

            For now, I am not sure whether the problem is on the fabric side or not. I found some people reporting this kind of behavior:

            • Bug 214523 rdma_connect() "timeout"
            The patch above tries to mitigate this issue by tracking the connect requests and checking a timeout on the LND side to destroy the hanging connections.


            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53986
            Subject: LU-17480 o2iblnd: add a timeout for rdma_connect
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7499c8a3af228c7672acd4f9eb39ac60c77c07b1


            People

              Assignee: Etienne Aujames
              Reporter: Etienne Aujames
              Votes: 0
              Watchers: 11