  Lustre / LU-14987

LNET routers becoming stuck, RDMA timeout, large numbers of messages processing


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical

    Description

      Hello,
      For the past few days we have been suffering a lot of LNET and LNET-routing-related issues across our filesystems. This is currently affecting a large number of users, who are seeing frequent hangs and timeouts when attempting to use our storage across an LNET router.

      The LNET environment is as follows:

                                                                     
      Clients and Servers                                                      
      ===============                                                          
      o2ib1 (10.47.0.0/16)   : ~1000 Intel OPA clients                         
      o2ib2 (10.44.0.0/16)   : ~1200 HDR IB clients                            
                             : All servers are now on HDR IB also              
                                                                               
      o2ib0 (10.144.0.0/16)  : ~10 FDR IB clients                              
      tcp2   (10.43.0.0/16)  : ~100 Ethernet clients                           
                                                                               
      Routers                                                                  
      ======                                                                   
      8x LNET routers, in 2 groups of 4:                                       
                                                                               
      lnet1...lnet4:  Routes between o2ib1, o2ib2 and tcp2                     
      lnet5...lnet8:  Routes all LNET networks, o2ib0, o2ib1, o2ib2 and tcp2   
                                                                               
      There is only a single level of LNET routers; there is no multi-hop routing. This is achieved by simply having all necessary interfaces present on the routers.
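
      For reference, a single-hop setup like this is normally expressed with one 'lnetctl route add' per remote network on the clients and servers. The following is purely an illustrative sketch with example gateway NIDs taken from the router address range mentioned later, not our actual configuration (which is attached to the ticket):

      # On an o2ib2 client or server: reach the other networks via the routers
      lnetctl route add --net o2ib1 --gateway 10.44.240.161@o2ib2
      lnetctl route add --net tcp2  --gateway 10.44.240.161@o2ib2
      lnetctl route add --net o2ib0 --gateway 10.44.240.165@o2ib2   # o2ib0 is only reachable via the lnet5-lnet8 group
      # On the routers themselves, forwarding is simply enabled:
      lnetctl set routing 1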
                                                                               
      Software Environment                                                     
      =================                                                        
      We recently completed updating our entire environment to Lustre 2.12.7, with a single additional patch on top from https://jira.whamcloud.com/browse/LU-13356, as discussed in a separate issue of ours: https://jira.whamcloud.com/browse/LU-14802
                                                                               
      All o2ib2 nodes are using Mellanox MOFED 5.4-1.0.3.0, with a mixture of EDR and HDR IB cards. All storage is on HDR IB.
                                                                               
      All o2ib1 nodes are using OPA version 10.11.0.1.                         
                                                                               
      All o2ib0 nodes are using MOFED 4.9-2.2.4.0.                                     
      

      I will document all of the LNET tunables and module parameters used at the bottom of the description, along with system logs, particularly from the routers and MDS servers.

      What we observe during these periods is clients becoming unresponsive to simple operations on the filesystem: 'lfs df', 'ls', etc. will hang for many seconds or even minutes before eventually completing.

      Looking around in Jira, I noticed a ticket from Stefan with broadly similar-sounding symptoms to what we see: https://jira.whamcloud.com/browse/LU-14652

      In the system logs of servers and routers I see lots of 'Timed out RDMA' messages, particularly between the servers and routers (router IPs are all in the range: 10.44.240.16[1-8]@o2ib2):

      Snippet from MDS server logs
                                 
      Sep 03 15:38:39 csd3-mds1 kernel: LustreError: 3021:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff91725b6e6800 
      Sep 03 15:42:55 csd3-mds1 kernel: Lustre: rds-d3-MDT0000: haven't heard from client 2a643e36-e76e-f73b-a766-559148ad8494 (at 10.43.240.199@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9171c959c000, cur 1630680175 expire 163068002
      Sep 03 15:44:49 csd3-mds1 kernel: LustreError: 9254:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 151s: evicting client at 10.43.240.198@tcp2  ns: mdt-rds-d3-MDT0000_UUID lock: ffff915555524d80/0xb4757f1f0b0182be lrc: 3/0,
      Sep 03 15:44:49 csd3-mds1 kernel: LustreError: 6246:0:(client.c:1210:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff917249503a80 x1705934796555264/t0(0) o104->rds-d3-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff
      Sep 03 15:44:49 csd3-mds1 kernel: LustreError: 6246:0:(client.c:1210:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
      Sep 03 15:45:13 csd3-mds1 kernel: LustreError: 137-5: rds-d5-MDT0000_UUID: not available for connect from 10.43.240.198@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
      Sep 03 15:45:13 csd3-mds1 kernel: LustreError: Skipped 32 previous similar messages 
      Sep 03 15:47:43 csd3-mds1 kernel: LNetError: 3021:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
      Sep 03 15:47:43 csd3-mds1 kernel: LNetError: 3021:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 65 previous similar messages
      Sep 03 15:47:43 csd3-mds1 kernel: LNetError: 3021:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.167@o2ib2 (0): c: 21, oc: 11, rc: 31
      Sep 03 15:47:43 csd3-mds1 kernel: LNetError: 3021:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 65 previous similar messages
      Sep 03 15:47:43 csd3-mds1 kernel: LustreError: 3021:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff916f74d60800
      Sep 03 15:47:48 csd3-mds1 kernel: LustreError: 3021:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff917278f2a400
      
      Snippet from LNET router log
                                 
      Sep 03 15:32:24 lnet7 kernel: LNetError: 13129:0:(lib-move.c:3903:lnet_parse_get()) 10.43.240.167@tcp2: Unable to send REPLY for GET from 12345-10.43.101.11@tcp2: -113
      Sep 03 15:32:24 lnet7 kernel: LNetError: 13129:0:(lib-move.c:3903:lnet_parse_get()) Skipped 1 previous similar message
      Sep 03 15:33:21 lnet7 kernel: LNetError: 13123:0:(socklnd.c:1665:ksocknal_destroy_conn()) Completing partial receive from 12345-10.43.240.198@tcp2[1], ip 10.43.240.198:1023, with error, wanted: 520, left: 520, last alive is 52 secs ago
      Sep 03 15:33:21 lnet7 kernel: LNetError: 13123:0:(socklnd.c:1665:ksocknal_destroy_conn()) Skipped 80 previous similar messages
      Sep 03 15:34:15 lnet7 kernel: LNet: There was an unexpected network error while writing to 10.43.101.11: -32.
      Sep 03 15:35:05 lnet7 kernel: LNetError: 13129:0:(lib-move.c:3903:lnet_parse_get()) 10.43.240.167@tcp2: Unable to send REPLY for GET from 12345-10.43.102.226@tcp2: -113
      Sep 03 15:35:05 lnet7 kernel: LNetError: 13129:0:(lib-move.c:3903:lnet_parse_get()) Skipped 1 previous similar message 
      Sep 03 15:41:15 lnet7 kernel: LNetError: 13126:0:(lib-move.c:3903:lnet_parse_get()) 10.43.240.167@tcp2: Unable to send REPLY for GET from 12345-10.43.102.60@tcp2: -113
      Sep 03 15:41:15 lnet7 kernel: LNetError: 13126:0:(lib-move.c:3903:lnet_parse_get()) Skipped 1 previous similar message
      Sep 03 15:45:11 lnet7 kernel: LNetError: 13123:0:(socklnd.c:1665:ksocknal_destroy_conn()) Completing partial receive from 12345-10.43.240.199@tcp2[1], ip 10.43.240.199:1023, with error, wanted: 520, left: 520, last alive is 339 secs ago
      Sep 03 15:45:11 lnet7 kernel: LNetError: 13123:0:(socklnd.c:1665:ksocknal_destroy_conn()) Skipped 24 previous similar messages
      Sep 03 15:45:56 lnet7 kernel: LNet: 13111:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.44.240.96@o2ib2
      Sep 03 15:47:20 lnet7 kernel: LNet: There was an unexpected network error while writing to 10.43.102.60: -32.
      Sep 03 15:49:29 lnet7 kernel: LNet: There was an unexpected network error while writing to 10.43.101.11: -32.
      Sep 03 15:50:03 lnet7 kernel: LNet: There was an unexpected network error while writing to 10.43.102.226: -32.
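
      For reference while reading these, the status values are negated Linux errno codes: -113 is EHOSTUNREACH ("No route to host"), -32 is EPIPE ("Broken pipe"), and -103 (seen in the MDS snippet above) is ECONNABORTED ("Software caused connection abort"). A quick way to decode them, assuming python3 is available:

      # Decode the errno values appearing in the log snippets above
      for e in 113 32 103; do python3 -c "import os; print($e, os.strerror($e))"; done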
      

      Overall the network does not look very healthy, with many clients being timed out continually.

      Another symptom I noticed, similar to what Stefan reported, is that when looking at the 'peers' counters from an MDS node, many of the LNET routers show large values in the 'queue' column at times, and the 'min' value is usually very large and negative.

      /sys/kernel/debug/lnet/peers table of LNET routers as seen on an MDS server
      10.44.240.161@o2ib2         6    up    -1    32    32    32    32 -157885 0 
      10.44.240.166@o2ib2         7    up    -1    32    32    32    32 -11445 0
      10.44.240.163@o2ib2         6    up    -1    32    32    32    32 -157911 0
      10.44.240.168@o2ib2         7    up    -1    32    32    32    32 -11480 0
      10.44.240.165@o2ib2        21    up    -1    32    32    32    18 -11478 20808
      10.44.240.162@o2ib2         6    up    -1    32    32    32    32 -157755 0
      10.44.240.167@o2ib2        21    up    -1    32    32    32    18 -11403 20576
      10.44.240.164@o2ib2         6    up    -1    32    32    32    32 -157786 0
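
      The 'min' and 'queue' figures referred to above are the last two columns of that file. For completeness, something along these lines is enough to keep an eye on it (a sketch; the first line of the file prints the column labels):

      # Show the column header once, then watch only the router peers
      head -n 1 /sys/kernel/debug/lnet/peers
      watch -n 5 "grep -E '10\.44\.240\.16[1-8]@o2ib2' /sys/kernel/debug/lnet/peers"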
      

      Generally I have been restarting LNET routers that get into a state where the 'queue' column on the servers stays persistently very large and is not going down. This has so far brought the system back to temporary periods of stability, but it is not fixing the underlying problem and the issues return after a while.
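
      For reference, a router restart at the LNet level looks roughly like the following (an illustrative sketch, assuming the NI and route configuration is restored from module parameters on reload; if /etc/lnet.conf is used instead, it would be reapplied with 'lnetctl import'):

      lnetctl set routing 0        # stop advertising this node as a router
      sleep 120                    # give peers time to fail over to the other gateways
      lnetctl lnet unconfigure     # tear down LNet
      lustre_rmmod                 # unload the LND and LNet modules
      modprobe lnet
      lnetctl lnet configure --all # reload interfaces from the module parameters
      lnetctl set routing 1        # resume routing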

      Another observation comes from watching 'routerstat' on the LNET routers and noticing very large numbers in the 'M' column, for messages being processed. We have occasionally seen this spike up to extremely large values before the whole LNET router runs out of memory and panics, similar to what is described in this ticket: https://jira.whamcloud.com/browse/CAM-82

      However, this is less frequent now that we have removed nearly all of our FDR servers/clients from the environment.
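
      The same counters can be sampled directly on the routers with either routerstat or lnetctl, e.g. (a sketch; as I understand it, the 'M' column in routerstat corresponds to the msgs_alloc counter in the raw statistics):

      routerstat 1                     # print router message/credit counters every second
      watch -n 1 "lnetctl stats show"  # or watch the raw LNet statistics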

      On the fabric side, we don't see any obvious issues from the fabric manager: all nodes are reachable and respond to LNET ping, except during the periods when we experience the above symptoms.
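
      The reachability check is just an LNET-level ping of each router from an affected client or server, e.g.:

      # Ping every router NID (the 10.44.240.16[1-8]@o2ib2 range quoted earlier)
      for i in $(seq 1 8); do
          lctl ping 10.44.240.16${i}@o2ib2
      done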

      I will attach all the LNET configurations, LND parameters, observed counters and log messages to the ticket.

      I was hoping that an expert eye could sanity-check the configuration listed and offer any insight into why we might suddenly be experiencing these timeouts, and whether there are any configuration changes we could make to eliminate the problem.

      Kind regards,
      Matt

      Attachments

        1. client.mount.debug.log
          185 kB
        2. lnet-routers.png
          55 kB
        3. lustre-log.1630947202.26758.gz
          5.68 MB
        4. mds.log
          244 kB
        5. rds-d2-client.2021-09-15.llog
          30 kB
        6. rds-d2-MDT0000.2021-09-15.llog
          31 kB
        7. rds-d2-OST0000.2021-09-15.llog
          0.6 kB
        8. router.log
          611 kB
        9. screenshot-1.png
          64 kB


          People

            Assignee: Serguei Smirnov (ssmirnov)
            Reporter: Matt Rásó-Barnett (mrb) (Inactive)
            Votes: 0
            Watchers: 8
