[LU-14652] LNet router stuck generating RDMA tx timeout Created: 29/Apr/21  Updated: 01/Feb/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.6
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Stephane Thiell Assignee: Serguei Smirnov
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 7.9 (3.10.0-1160.24.1.el7.x86_64) on routers, Lustre 2.12.6


Attachments: File fir-io1-s1_lctl_ping_sync.log.gz     File fir-io1-s2_dk_net_dlmtrace_20211026.log.gz     File fir-io1-s2_kernel_sysrq-l_20211026.log.gz     PNG File image-2021-11-09-15-37-32-277.png     Text File net_show_client_2.12.txt     Text File net_show_client_2.13.txt     Text File net_show_routers.txt     Text File net_show_servers.txt     File oak-io2-s1.dknet10.gz     File oak-io2-s1.dknet9.gz     Text File oak-io2-s1.kern.log     File sh02-14n15.dknet1.gz     File sh02-14n15.dknet2.gz     File sh02-14n15.dknet3.gz     File sh02-fir03-debug-lctl_ping_sync.log.gz     File sh02-fir03-kern_sysrq-l-t_20211026.log.gz     Text File sh02-fir03.20211020.peers.txt     File sh02-fir04-dknet.log.gz     File sh03-01n14.dknet.gz     Zip Archive sh03-fir06-20210428.zip    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hello, we are still seeing weird behavior with LNet over IB with 2.12+. We have tried upgrading clients and routers to 2.13 and then 2.14 without success. We went back to 2.12.6 LTS, but we are still seeing occasional kiblnd errors and timeouts. The IB fabrics are healthy, sometimes a little bit of congestion but no discards. I'm starting to suspect a deeper problem with LNet/ko2iblnd, where credits are sometimes exhausted? We didn't have this with 2.10. To me, the problem seems similar to the one reported in LU-14026 by LLNL.

We do have the following setup:

Fir (serves /scratch) o2ib7 — 8 x lnet routers (IB) — Sherlock v3 (o2ib3)

Last night, one of the 8 routers (sh03-fir06) started to have problems. I've taken traces so that we can investigate.

Router NIDs:

[root@sh03-fir06 ~]# lctl list_nids
10.51.0.116@o2ib3
10.0.10.237@o2ib7

LNet config on the router (we have discovery enabled and use a few Multi-Rail nodes on o2ib3):

[root@sh03-fir06 ~]# lnetctl global show
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 50
    health_sensitivity: 0
    recovery_interval: 1

Fir Lustre servers on o2ib7 started to exhibit the following errors:

fir-io7-s2: Apr 28 23:25:40 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9488213ef800
fir-io7-s2: Apr 28 23:25:40 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9488213ef800
fir-io7-s2: Apr 28 23:25:40 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9480cfc4c800
fir-io7-s2: Apr 28 23:25:59 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff94643457fc00
fir-io7-s2: Apr 28 23:25:59 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff94643457fc00
fir-io7-s2: Apr 28 23:26:05 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff949e65fc2400
fir-io7-s2: Apr 28 23:26:05 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff949e65fc2400
fir-io7-s2: Apr 28 23:26:11 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff945be5538400
fir-io7-s2: Apr 28 23:26:11 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff94886b2a9000
fir-io7-s2: Apr 28 23:26:17 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff94428fd56000
fir-io7-s2: Apr 28 23:26:24 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff947e46cbdc00
fir-io7-s2: Apr 28 23:26:24 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff947e46cbdc00
fir-io7-s2: Apr 28 23:26:43 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9462486d2400
fir-io7-s2: Apr 28 23:26:51 fir-io7-s2 kernel: LustreError: 68967:0:(ldlm_lib.c:3279:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff949f1a1a3850 x1696974434037056/t0(0) o4->12f8a639-7e97-4157-8d89-6e1e00a728eb@10.51.13.20@o2ib3:363/0 lens 488/448 e 0 to 0 dl 1619677703 ref 1 fl Interpret:/0/0 rc 0/0
fir-io6-s1: Apr 28 23:24:11 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff918fc3b44400
fir-io6-s1: Apr 28 23:24:11 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff918fc3b44400
fir-io6-s1: Apr 28 23:24:18 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff91913b76a800
fir-io6-s1: Apr 28 23:24:30 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff912be4f95000
fir-io6-s1: Apr 28 23:24:30 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff912be4f95000
fir-io6-s1: Apr 28 23:24:36 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff911f0108a400
fir-io6-s1: Apr 28 23:24:55 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff914885d3bc00
fir-io6-s1: Apr 28 23:24:55 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff914885d3bc00
fir-io6-s1: Apr 28 23:25:08 fir-io6-s1 kernel: LNet: 24015:0:(o2iblnd_cb.c:2081:kiblnd_close_conn_locked()) Closing conn to 10.0.10.237@o2ib7: error -110(waiting)

One interesting thing about that router, vs. the 7 others, is that it had a lot of refs (> 3000) in /sys/kernel/debug/lnet/nis and tx stuck at -367. The high refs count is similar to an issue we noticed with 2.14 routers reported in LU-14584, and we thought that maybe this was a CPT issue. Here, it happened with Lustre 2.12.6 routers.

nid                      status alive refs peer  rtr   max    tx   min
0@lo                       down     0    2    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
10.51.0.116@o2ib3            up     0    1    8    0    64    64    47
10.51.0.116@o2ib3            up     0    0    8    0    64    64    47
10.51.0.116@o2ib3            up     0    0    8    0    64    64    46
10.51.0.116@o2ib3            up     0    0    8    0    64    64    46
10.51.0.116@o2ib3            up     0    0    8    0    64    64    41
10.51.0.116@o2ib3            up     0    0    8    0    64    64    48
10.51.0.116@o2ib3            up     0 3493    8    0    64  -367  -367
10.51.0.116@o2ib3            up     0    0    8    0    64    64    46
10.0.10.237@o2ib7            up     0 3318    8    0    64    64    48
10.0.10.237@o2ib7            up     0 3062    8    0    64    64    50
10.0.10.237@o2ib7            up     0 6202    8    0    64    64    47
10.0.10.237@o2ib7            up     0 3032    8    0    64    64    49
10.0.10.237@o2ib7            up     0 6082    8    0    64    64    48
10.0.10.237@o2ib7            up     0 5467    8    0    64    64    48
10.0.10.237@o2ib7            up     0 3115    8    0    64    64    50
10.0.10.237@o2ib7            up     0 3206    8    0    64    64    48

I took traces on this router at the time of the problem. I'm attaching a zip file sh03-fir06-20210428.zip with the output of:

  • lnetctl stats show
  • lnetctl peer show
  • cat /sys/kernel/debug/lnet/nis
  • cat /sys/kernel/debug/lnet/peers
  • kernel logs
  • short dk logs with +net enabled, just in case that would show something interesting (a capture sketch follows)
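
For reference, a sketch of how such a short +net snapshot can be captured with lctl (the output path and duration are just placeholders):

lctl set_param debug=+net       # enable LNet/LND network debugging
lctl clear                      # clear the current debug buffer
sleep 20                        # let a problem window go by
lctl dk /tmp/dknet.log          # dump the kernel debug buffer to a file
lctl set_param debug=-net       # turn the flag back off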

Rebooting the router fixed the problem.



 Comments   
Comment by Peter Jones [ 30/Apr/21 ]

Serguei

Could you please advise?

Thanks

Peter

Comment by Serguei Smirnov [ 30/Apr/21 ]

Hi Stephane,

Could you please provide the output of 

lnetctl net show -v 4

from the router, one of the servers and one of the clients? It looks like the number of credits configured for the router may be low.

Also, what is the output of 

lnetctl global show 

on clients and servers?  In the 20 seconds captured in the logs there are a few transactions expiring. Perhaps transaction_timeout can be increased, but first let's check the situation with the credits.

Thanks,

Serguei.

Comment by Stephane Thiell [ 30/Apr/21 ]

Hi Serguei,

I'm attaching the output of lnetctl net show -v 4:

Output of lnetctl global show:

  • on servers:
    # clush -w@mds,@oss -b 'lnetctl global show'
    ---------------
    fir-io[1-8]-s[1-2],fir-md1-s[1-4] (20)
    ---------------
    global:
        numa_range: 0
        max_intf: 200
        discovery: 1
        drop_asym_route: 0
        retry_count: 2
        transaction_timeout: 50
        health_sensitivity: 100
        recovery_interval: 1
    
  • on routers:
    clush -w sh03-fir[01-08] -b "lnetctl global show"
    ---------------
    sh03-fir[01-08] (8)
    ---------------
    global:
        numa_range: 0
        max_intf: 200
        discovery: 1
        drop_asym_route: 0
        retry_count: 0
        transaction_timeout: 50
        health_sensitivity: 0
        recovery_interval: 1
    
  • on 2.13 clients (1,294 total):
    global:                                                                         
        numa_range: 0                                                               
        max_intf: 200                                                               
        discovery: 1                                                                
        drop_asym_route: 0                                                          
        retry_count: 0                                                              
        transaction_timeout: 50                                                     
        health_sensitivity: 0                                                       
        recovery_interval: 5                                                        
        router_sensitivity: 100 
    
  • on 2.12.6 clients (370 total):
    global:                                                                         
        numa_range: 0                                                               
        max_intf: 200                                                               
        discovery: 1                                                                
        drop_asym_route: 0                                                          
        retry_count: 0                                                              
        transaction_timeout: 50                                                     
        health_sensitivity: 0                                                       
        recovery_interval: 1 
    
  • on 2.14 clients (only 6):
    global:                                                                         
        numa_range: 0                                                               
        max_intf: 200                                                               
        discovery: 1                                                                
        drop_asym_route: 0                                                          
        retry_count: 0                                                              
        transaction_timeout: 50                                                     
        health_sensitivity: 0                                                       
        recovery_interval: 1                                                        
        router_sensitivity: 100                                                     
        lnd_timeout: 49                                                             
        response_tracking: 3                                                        
        recovery_limit: 0 
    

As you can see, we are already making sure that transaction_timeout is 50 on all clients and servers, even with 2.13 clients where it was 10 by default. But maybe you will see some other problems here? Let me know if you need more data from us. Thanks!

Comment by Serguei Smirnov [ 01/May/21 ]

Stephane,

On servers:

 retry_count: 2
 transaction_timeout: 50

This means that the resulting lnd_timeout is 50/3 ≈ 17 seconds, while on other nodes it is 50 seconds. If you want to keep retry_count at 2, please increase transaction_timeout to 150 on the servers; that will make sure every node is using the same lnd_timeout.
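
For example (a sketch; the same value should also be persisted in /etc/lnet.conf or the module options so it survives a reboot):

# on each Fir server (retry_count stays at 2):
lnetctl set transaction_timeout 150
# resulting lnd_timeout = transaction_timeout / (retry_count + 1) = 150 / 3 = 50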

Thanks,

Serguei.

Comment by Stephane Thiell [ 01/May/21 ]

Thanks for catching that, Serguei! Ok, I'll try either that or disable lnet health and set retry_count to 0 like for the other nodes.

Comment by Stephane Thiell [ 01/May/21 ]

I fixed that issue:

---------------
fir-io[1-8]-s[1-2],fir-md1-s[1-4] (20)
---------------
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 50
    health_sensitivity: 0
    recovery_interval: 1

but shortly after that change, another router started to exhibit a similar problem (sh02-fir04). This time it's a router between the same servers (o2ib7, HDR) and o2ib2 (Sherlock v2, another fabric, EDR based), but it's the same idea.

[root@sh02-fir04 ~]# lctl list_nids
10.50.0.114@o2ib2
10.0.10.227@o2ib7

[root@sh02-fir04 ~]# cat /sys/kernel/debug/lnet/nis 
nid                      status alive refs peer  rtr   max    tx   min
0@lo                       down     0    2    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
10.50.0.114@o2ib2            up     0    2    8    0   128   127    64
10.50.0.114@o2ib2            up     0    1    8    0   128   127    57
10.0.10.227@o2ib7            up     0   29    8    0   128   127    64
10.0.10.227@o2ib7            up     0  379    8    0   128   128    56

and the refs count is growing.

On a server:

Apr 30 23:02:34 fir-io8-s2 kernel: LustreError: 76363:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9f68619b2c00
Apr 30 23:02:34 fir-io8-s2 kernel: LNet: 76363:0:(lib-move.c:976:lnet_post_send_locked()) Aborting message for 12345-10.0.10.227@o2ib7: LNetM[DE]Unlink() already called on the MD/ME.
Apr 30 23:02:34 fir-io8-s2 kernel: LNet: 76363:0:(lib-move.c:976:lnet_post_send_locked()) Skipped 2 previous similar messages
Apr 30 23:02:34 fir-io8-s2 kernel: LustreError: 76363:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9f8d975ef400
Apr 30 23:02:34 fir-io8-s2 kernel: LustreError: 76363:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9f8d975ef400

From a server (10.0.10.115@o2ib7):

[root@fir-io8-s1 ~]# lnetctl peer show --nid 10.0.10.227@o2ib7 -v 4
peer:
    - primary nid: 10.50.0.114@o2ib2
      Multi-Rail: True
      peer state: 137
      peer ni:
        - nid: 10.0.10.227@o2ib7
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: 7
          min_tx_credits: -64957
          tx_q_num_of_buf: 480
          available_rtr_credits: 8
          min_rtr_credits: 8
          refcount: 5
          statistics:
              send_count: 4282001375
              recv_count: 119463736
              drop_count: 0
          sent_stats:
              put: 3953956139
              get: 328045236
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 3102103707
              get: 493
              reply: 336149354
              ack: 976177478
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              dropped: 6
              timeout: 250
              error: 0
              network timeout: 0
        - nid: 10.50.0.114@o2ib2
          state: NA
          max_ni_tx_credits: 0
          available_tx_credits: 0
          min_tx_credits: 0
          tx_q_num_of_buf: 0
          available_rtr_credits: 0
          min_rtr_credits: 0
          refcount: 2
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          sent_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              dropped: 0
              timeout: 0
              error: 0
              network timeout: 0

Peer show of the server from the router (10.0.10.227@o2ib7/10.50.0.114@o2ib2):

[root@sh02-fir04 ~]# lnetctl peer show --nid 10.0.10.115@o2ib7 -v 4
peer:
    - primary nid: 10.0.10.115@o2ib7
      Multi-Rail: True
      peer state: 137
      peer ni:
        - nid: 10.0.10.115@o2ib7
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: -89
          tx_q_num_of_buf: 0
          available_rtr_credits: 6
          min_rtr_credits: -8
          refcount: 3
          statistics:
              send_count: 291018436
              recv_count: 286578888
              drop_count: 607
          sent_stats:
              put: 198315455
              get: 14434
              reply: 21538685
              ack: 71149862
              hello: 0
          received_stats:
              put: 265177056
              get: 21401720
              reply: 55
              ack: 57
              hello: 0
          dropped_stats:
              put: 0
              get: 577
              reply: 26
              ack: 4
              hello: 0
          health stats:
              health value: 1000
              dropped: 577
              timeout: 0
              error: 0
              network timeout: 0

I wonder if there is a way to see what is causing the router to be stuck and accumulating refs. I took another trace with +net on this router while it was stuck. Attaching as sh02-fir04-dknet.log.gz

Comment by Serguei Smirnov [ 03/May/21 ]

Stephane,

From the router net log you provided it looks like the router is not reporting any errors. Was the issue still happening when you were capturing it? I wonder how persistent this issue is, or maybe it just occurs in short bursts. Would it be possible to get net logs from the client and the server, too? That would give some insight into how they qualify failed connections and whether failed connections involve the same router. 

"lnetctl net show" output provided for clients and servers indicates that a single ib interface is configured for LNet, are there other ib interfaces on these machines that are not configured for LNet?

 

Comment by Stephane Thiell [ 10/May/21 ]

Hi Serguei,

> "lnetctl net show" output provided for clients and servers indicates that a single ib interface is configured for LNet, are there other ib interfaces on these machines that are not configured for LNet?

Not that we know of. Some clients have dual-port cards but only one port is up. We do have a few MR clients on o2ib3 though.

It's not easy to gather logs on this system (Fir) anymore, as the issue didn't reappear. But this morning, we had a similar issue between another storage system, Oak (2.12.6, on o2ib5), and Sherlock (o2ib[1-3]).

Timeouts and network errors:

May 10 00:07:48 oak-io2-s1 kernel: LustreError: 54968:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff8be0d58dcc00
May 10 00:07:48 oak-io2-s1 kernel: LustreError: 54967:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff8bd84d303c00
May 10 00:07:48 oak-io2-s1 kernel: LustreError: 54966:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff8bd84d303c00
May 10 00:07:48 oak-io2-s1 kernel: LustreError: 54969:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff8be0d58dcc00
May 10 00:07:48 oak-io2-s1 kernel: LustreError: 54968:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff8bf5a32dec00
May 10 00:08:53 oak-io2-s1 kernel: LustreError: 218398:0:(ldlm_lib.c:3344:target_bulk_io()) @@@ network error on bulk READ  req@ffff8bdff554c850 x1698172291403584/t0(0) o3->a703c65e-c48f-97d7-efaf-c377b3ded349@10.51.15.3@o2ib3:410/0 lens 488/440 e 0 to 0 dl 1620630560 ref 1 fl Interpret:/0/0 rc 0/0
May 10 00:08:53 oak-io2-s1 kernel: Lustre: oak-OST0052: Bulk IO read error with a703c65e-c48f-97d7-efaf-c377b3ded349 (at 10.51.15.3@o2ib3), client will retry: rc -110

I'm attaching this OSS (oak-io2-s1 10.0.2.105@o2ib5) kernel logs as oak-io2-s1.kern.log

Very high refs counts on the o2ib5 <-> o2ib[1-3] routers:

[root@sh02-hn01 sthiell.root]# clush -w@rtr_oak -b "cat /sys/kernel/debug/lnet/nis"
---------------
sh01-oak01
---------------
nid                      status alive refs peer  rtr   max    tx   min
0@lo                       down     0    2    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
10.49.0.131@o2ib1            up     0    1    8    0   128   128    97
10.49.0.131@o2ib1            up     0    0    8    0   128   128    91
10.0.2.212@o2ib5             up     0    1    8    0   128   128    92
10.0.2.212@o2ib5             up     0    1    8    0   128   128    94
---------------
sh01-oak02
---------------
nid                      status alive refs peer  rtr   max    tx   min
0@lo                       down     0    2    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
10.49.0.132@o2ib1            up     0    1    8    0   128   128    96
10.49.0.132@o2ib1            up     0    0    8    0   128   128    94
10.0.2.213@o2ib5             up     0    4    8    0   128   128    92
10.0.2.213@o2ib5             up     0    1    8    0   128   128    95
---------------
sh02-oak01
---------------
nid                      status alive refs peer  rtr   max    tx   min
0@lo                       down     0    2    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
10.50.0.131@o2ib2            up     0 1032    8    0   128   125    62
10.50.0.131@o2ib2            up     0  959    8    0   128   126    54
10.0.2.214@o2ib5             up     0    6    8    0   128   127    74
10.0.2.214@o2ib5             up     0 1143    8    0   128   120    61
---------------
sh02-oak02
---------------
nid                      status alive refs peer  rtr   max    tx   min
0@lo                       down     0    2    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
10.50.0.132@o2ib2            up     0  927    8    0   128   128    59
10.50.0.132@o2ib2            up     0  838    8    0   128   128    39
10.0.2.215@o2ib5             up     0  296    8    0   128   120    75
10.0.2.215@o2ib5             up     0  687    8    0   128   120    57
---------------
sh03-oak01
---------------
nid                      status alive refs peer  rtr   max    tx   min
0@lo                       down     0    2    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
10.51.0.131@o2ib3            up     0   60    8    0    64    64    26
10.51.0.131@o2ib3            up     0   61    8    0    64    64    28
10.51.0.131@o2ib3            up     0   87    8    0    64    64     9
10.51.0.131@o2ib3            up     0  180    8    0    64    64    27
10.51.0.131@o2ib3            up     0  126    8    0    64    64    26
10.51.0.131@o2ib3            up     0  150    8    0    64    64    12
10.51.0.131@o2ib3            up     0   63    8    0    64    64    32
10.51.0.131@o2ib3            up     0   93    8    0    64    64    19
10.0.2.216@o2ib5             up     0    1    8    0    64    64    56
10.0.2.216@o2ib5             up     0    0    8    0    64    64    58
10.0.2.216@o2ib5             up     0    0    8    0    64    64    40
10.0.2.216@o2ib5             up     0  160    8    0    64    56    48
10.0.2.216@o2ib5             up     0    0    8    0    64    64    56
10.0.2.216@o2ib5             up     0    0    8    0    64    64    48
10.0.2.216@o2ib5             up     0    0    8    0    64    64    45
10.0.2.216@o2ib5             up     0  290    8    0    64    56    40
---------------
sh03-oak02
---------------
nid                      status alive refs peer  rtr   max    tx   min
0@lo                       down     0    2    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
10.51.0.132@o2ib3            up     0   89    8    0    64    64    26
10.51.0.132@o2ib3            up     0   76    8    0    64    64    28
10.51.0.132@o2ib3            up     0   76    8    0    64    64    22
10.51.0.132@o2ib3            up     0  143    8    0    64    64    21
10.51.0.132@o2ib3            up     0  113    8    0    64    64    29
10.51.0.132@o2ib3            up     0   98    8    0    64    64    14
10.51.0.132@o2ib3            up     0   89    8    0    64    64    22
10.51.0.132@o2ib3            up     0   78    8    0    64    64    25
10.0.2.217@o2ib5             up     0    1    8    0    64    64    56
10.0.2.217@o2ib5             up     0    0    8    0    64    64    56
10.0.2.217@o2ib5             up     0  294    8    0    64    56    40
10.0.2.217@o2ib5             up     0    1    8    0    64    64    48
10.0.2.217@o2ib5             up     0    0    8    0    64    64    56
10.0.2.217@o2ib5             up     0  180    8    0    64    56    48
10.0.2.217@o2ib5             up     0    1    8    0    64    64    40
10.0.2.217@o2ib5             up     0    0    8    0    64    64    40

Peers with queuing from the routers:

[root@sh02-hn01 sthiell.root]# clush -w@rtr_oak -b "cat /sys/kernel/debug/lnet/peers  | awk '/^nid|o2ib/ && \$NF!=0'" 
---------------
sh01-oak[01-02],sh03-oak01 (3)
---------------
nid                      refs state  last   max   rtr   min    tx   min queue
---------------
sh02-oak01
---------------
nid                      refs state  last   max   rtr   min    tx   min queue
10.0.2.106@o2ib5          316    up   176     8     7    -8  -306 -2561 2551117
---------------
sh02-oak02
---------------
nid                      refs state  last   max   rtr   min    tx   min queue
10.0.2.110@o2ib5          206    up    94     8     1    -8  -190 -3121 1520563
10.0.2.101@o2ib5          242    up    84     8     7    -8  -232 -2145 70352
---------------
sh03-oak02
---------------
nid                      refs state  last   max   rtr   min    tx   min queue
10.0.2.105@o2ib5         1163    up    66     8     8   -16 -1154 -1987 7682915

I gathered some short +net debug logs from Oak's OSS oak-io2-s1 10.0.2.105@o2ib5, where we can see a drop:

00000400:00000100:37.0:1620667146.491444:0:54966:0:(lib-move.c:3930:lnet_parse_reply()) 10.0.2.105@o2ib5: Dropping REPLY from 12345-10.51.1.14@o2ib3 for invalid MD 0x1678074439ff2d4c.0x476212d0d

Attaching as oak-io2-s1.dknet9.gz and oak-io2-s1.dknet10.gz (the latter shows a Dropping REPLY)

Also attaching +net debug logs from a client on Sherlock (10.50.14.15@o2ib2, on o2ib2) where I saw some queueing from the routers; other than that, I picked it randomly and I'm not sure it will be relevant (sh02-14n15.dknet1.gz, sh02-14n15.dknet2.gz, sh02-14n15.dknet3.gz)

Comment by Stephane Thiell [ 10/May/21 ]

Also attaching +net debug logs from client sh02-14n15 10.51.1.14@o2ib3, the one shown above on the OSS with Dropping REPLY. But I think it might have been too late when I gathered the logs. sh03-01n14.dknet.gz

Comment by Serguei Smirnov [ 13/May/21 ]

Stephane, 

I still can't see what's going wrong exactly, but for 2.12.6 I'd consider disabling discovery on the routers to avoid extra complexity. For example, it would make sure that the messages sent by a node get to the specific nid of the remote node as decided by the sender, as opposed to the router making this decision when it has discovery enabled. It is more of a general recommendation that may be unrelated to the problem you're seeing though.

If you choose to disable discovery on the routers, you can do so by going one at a time (a command sketch follows the list):

  • Remove the corresponding route from nodes that are using it (lnetctl route del)
  • Disable discovery on the router (if done via the conf file, reload modules)
  • Delete the peer representing the router from the nodes that are using it (lnetctl peer del)
  • Add the routes back (lnetctl route add)
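
A sketch of that sequence for one router, using the o2ib2 router NID from this ticket as the example gateway (the route/peer commands run on every node that uses the router; adjust the net and NIDs for each side of the route):

# on the nodes using the router (e.g. o2ib2 clients routing to o2ib7):
lnetctl route del --net o2ib7 --gateway 10.50.0.114@o2ib2
# on the router itself (and in /etc/lnet.conf or the modprobe config if discovery is set there):
lnetctl set discovery 0
# back on the nodes: drop the stale peer entry for the router, then re-add the route:
lnetctl peer del --prim_nid 10.50.0.114@o2ib2
lnetctl route add --net o2ib7 --gateway 10.50.0.114@o2ib2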

 

Comment by Stephane Thiell [ 21/Oct/21 ]

Hi Serguei,

We still have this problem with 2.12.7 routers and clients. The servers (Fir) are still running 2.12.5 though. Some routers started to get a high refs count on their NIs and on specific peers, which generates random RDMA timeouts. Rebooting the routers usually doesn't change anything; they remain unusable. In some very rare cases (like after requeuing a lot of jobs) we're able to put a router back into production (we were able to do that today), but otherwise the refs count starts to increase immediately.

 

Clients are on o2ib2, servers on o2ib7. We're pretty confident the IB fabrics are OK. The problem occurs between routers and servers (o2ib7). Here is an example of a router (at 10.0.10.226@o2ib7) <-> server (OSS, at 10.0.10.101@o2ib7) pair that is stuck:

[root@sh02-fir03 ~]# lnetctl peer show --nid 10.0.10.101@o2ib7 -v4
peer:
    - primary nid: 10.0.10.101@o2ib7
      Multi-Rail: False
      peer ni:
        - nid: 10.0.10.101@o2ib7
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: -3
          tx_q_num_of_buf: 0
          available_rtr_credits: -44          <<<
          min_rtr_credits: -44
          refcount: 53
          statistics:
              send_count: 43532
              recv_count: 2843
              drop_count: 226
[root@sh02-fir03 ~]# lctl ping 10.0.10.101@o2ib7
failed to ping 10.0.10.101@o2ib7: Input/output error
[root@sh02-fir03 ~]# lctl ping 10.0.10.101@o2ib7
failed to ping 10.0.10.101@o2ib7: Input/output error
[root@sh02-fir03 ~]# lctl ping 10.0.10.101@o2ib7
failed to ping 10.0.10.101@o2ib7: Input/output error
[root@sh02-fir03 ~]# lctl ping 10.0.10.101@o2ib7
failed to ping 10.0.10.101@o2ib7: Input/output error
[root@sh02-fir03 ~]# lctl set_param debug=+ALL; lctl clear; lctl ping 10.0.10.101@o2ib7;  lctl dk /tmp/debug.log; lctl set_param debug=-ALL
debug=+ALL
failed to ping 10.0.10.101@o2ib7: Input/output error
Debug log: 193037 lines, 193037 kept, 0 dropped, 0 bad.
debug=-ALL

Debug logs from the router (sh02-fir03 at 10.0.10.226@o2ib7 ) attached as sh02-fir03-debug-lctl_ping_sync.log.gz
Debug logs from the OSS (fir-io1-s1 at 10.0.10.101@o2ib7) at the same time attached as fir-io1-s1_lctl_ping_sync.log.gz
 

So the router sh02-fir03 (10.0.10.226@o2ib7) is running out of rtr credits, and from the OSS point of view (fir-io1-s1 at 10.0.10.101@o2ib7), it's out of tx credits when this happens, as shown below:

[root@fir-io1-s1 ~]# lnetctl peer show --nid 10.0.10.226@o2ib7 -v 4
peer:
    - primary nid: 10.0.10.226@o2ib7
      Multi-Rail: False
      peer state: 0
      peer ni:
        - nid: 10.0.10.226@o2ib7
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: -2              <<<
          min_tx_credits: -154
          tx_q_num_of_buf: 7230544
          available_rtr_credits: 8
          min_rtr_credits: 8
          refcount: 14
          statistics:
              send_count: 427582
              recv_count: 2109316
              drop_count: 0
          sent_stats:
              put: 389065
              get: 38517
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 1514551
              get: 6
              reply: 122123
              ack: 472636
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              dropped: 12
              timeout: 298
              error: 0
              network timeout: 0

We also noticed that the NI refs are increasing on the routers when the following messages show up on some OSSs (not all):

Oct 20 13:34:40 fir-io1-s1 kernel: LustreError: 58590:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ae220efb800
Oct 20 13:34:40 fir-io1-s1 kernel: LustreError: 58590:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9a810b5e6000

From the OSS, the lnet pinger shows this:

00000400:00000200:20.0:1634761462.846900:0:58577:0:(router.c:1099:lnet_ping_router_locked()) rtr 10.0.10.224@o2ib7 60: deadline 0 ping_notsent 0 alive 1 alive_count 1 lpni_ping_timestamp 1748161
00000400:00000200:20.0:1634761462.846903:0:58577:0:(router.c:1099:lnet_ping_router_locked()) rtr 10.0.10.225@o2ib7 60: deadline 1748227 ping_notsent 1 alive 1 alive_count 351 lpni_ping_timestamp 1748167
00000400:00000200:20.0:1634761462.846905:0:58577:0:(router.c:1099:lnet_ping_router_locked()) rtr 10.0.10.226@o2ib7 60: deadline 1748200 ping_notsent 1 alive 1 alive_count 384 lpni_ping_timestamp 1748140
00000400:00000200:20.0:1634761462.846908:0:58577:0:(router.c:1099:lnet_ping_router_locked()) rtr 10.0.10.227@o2ib7 60: deadline 0 ping_notsent 0 alive 1 alive_count 85 lpni_ping_timestamp 1748141

Our problems are with routers 10.0.10.225@o2ib7 and 10.0.10.226@o2ib7 at the moment. Rebooting them doesn't fix the problem.

Our OSS have this configured:

options lnet live_router_check_interval=60
options lnet dead_router_check_interval=300
options lnet router_ping_timeout=60
options lnet avoid_asym_router_failure=1
options lnet check_routers_before_use=1

Clients have the default settings (so the difference is check_routers_before_use=0).

Also attaching the peer list of this problematic router (sh02-fir03 at 10.0.10.226@o2ib7) as sh02-fir03.20211020.peers.txt , where we can see that OSS 10.0.10.101@o2ib7 and 10.0.10.102@o2ib7 are out of rtr credits.

Do you know if there is a way to identify in the logs what is holding these high NI/peer refs counts?

Comment by Serguei Smirnov [ 22/Oct/21 ]

Hi Stephane,

My understanding is that the ref counts go high because the messages on the router queue don't get cleared, same reason as for the negative "available credits". The messages can get stuck on the queue if for some reason the OSS is not responding. After some timeout (50 seconds?) this causes transactions to start expiring. I'll take a closer look at the OSS logs for clues why that may be. We may need to retrieve "ldlm" and "ptlrpc" logs in addition to "net". In the meantime, is there any information about CPU usage on the OSS? Are there any signs of lock-up? Perhaps we can get the output of "echo l > /proc/sysrq-trigger" on the server and the router to check for lock-ups?

Thanks,

Serguei.

Comment by Stephane Thiell [ 26/Oct/21 ]

Hi Serguei,

Thanks for this. I checked the OSS for lock-ups but found nothing.

Kernel logs of OSS fir-io1-s2 (10.0.10.102@o2ib7) when we started to put two "bad" routers (10.0.10.225@o2ib7 and 10.0.10.226@o2ib7) back online and problems started again: fir-io1-s2_kernel_sysrq-l_20211026.log.gz (including output of echo l > /proc/sysrq-trigger several times)

Lustre logs on this OSS with dlmtrace and net: fir-io1-s2_dk_net_dlmtrace_20211026.log.gz

Every time we put these two routers back online, the problem shows up with the same two OSSs (10.0.10.101@o2ib7 and 10.0.10.102@o2ib7). These OSSs are fine with the other routers...

# clush -u5 -Lw sh02-fir[01-04]  -b cat /sys/kernel/debug/lnet/peers | awk '$3 > 10 {print }'
sh02-fir[01-04]: nid                      refs state  last   max   rtr   min    tx   min queue
sh02-fir02: 10.0.10.102@o2ib7         125    up   179     8  -116  -116     8   -15 0
sh02-fir03: 10.0.10.102@o2ib7         115    up    33     8  -106  -106     8   -12 0
sh02-fir03: 10.0.10.101@o2ib7          75    up    32     8   -66   -66     8    -1 0

Comment by Stephane Thiell [ 26/Oct/21 ]

Adding sh02-fir03-kern_sysrq-l-t_20211026.log.gz with kernel logs of a router (10.0.10.226@o2ib7), including SysRq l and t dumps, taken just after the router has been powered on and when the problem happens.

Comment by Etienne Aujames [ 27/Oct/21 ]

Hello,
Sorry to interfere here. We observe the same type of issue under network load at the CEA.
Could it be related to LU-15068 ("Race between commit callback and reply_out_callback::LNET_EVENT_SEND")?

Comment by Stephane Thiell [ 27/Oct/21 ]

Hi Etienne,

Please interfere!  

It's interesting, and could perhaps match our problem. I also noticed occasional messages like this one on the OSS (here, 10.0.10.225@o2ib7 is one of the bad routers):

00000400:00000100:24.0:1635289738.440616:0:56886:0:(lib-move.c:976:lnet_post_send_locked()) Aborting message for 12345-10.0.10.225@o2ib7: LNetM[DE]Unlink() already called on the MD/ME.

I'm still wondering why this issue could survive OSS and/or router reboot.

Have you tried a patch for b2_12?

Thanks!

Comment by Stephane Thiell [ 29/Oct/21 ]

Hi Etienne and Serguei,

I added the patch from LU-15068 on top of 2.12.7 and we tested it on one OSS and one problematic LNet router; unfortunately, that patch doesn't fix this problem. The patch seems to work fine otherwise (no regression noticed).

In this instance, on the router 10.0.10.225@o2ib7, the rtr credits for OSS 10.0.10.102@o2ib7 were quickly dropping way below 0 immediately after putting the router online:

[root@sh02-fir02 ~]# lnetctl peer show --nid 10.0.10.102@o2ib7 -v
peer:
    - primary nid: 10.0.10.102@o2ib7
      Multi-Rail: False
      peer ni:
        - nid: 10.0.10.102@o2ib7
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: -1
          tx_q_num_of_buf: 0
          available_rtr_credits: -176
          min_rtr_credits: -176
          refcount: 185
          statistics:
              send_count: 18606
              recv_count: 2924
              drop_count: 0

Comment by Serguei Smirnov [ 29/Oct/21 ]

Hi Stephane,

Just to clarify, was the patch from LU-15068 applied on the 10.0.10.102@o2ib7 in the example above?

Thanks,

Serguei.

Comment by Stephane Thiell [ 29/Oct/21 ]

Hi Serguei,

Yes, we actually started by applying the patch on this OSS (at 10.0.10.102@o2ib7), tested that without success, then we also applied the patch on the router, but hit the same issue, unfortunately.

Comment by Serguei Smirnov [ 09/Nov/21 ]

Amir and I went over the logs and tried to track down the issue. I don't have a conclusive answer, but here's roughly what appears to be happening, based on the logs provided for the lnetctl ping above:

1) There's a client that stops being responsive. Lnd-level "no credits" messages in the router debug log point to that. 

2) If the client is unresponsive, the messages going to it from the server (via the router) are not getting finalized. This causes the router's credit count for the server to go negative and get stuck there.

3) The message queue on the server backs up. There are lnd-level "no credits" messages from ko2iblnd in the server debug log. Normally, when the high-watermark level is hit, the node sends a special "NOOP" request bypassing the queue, requesting that the other side (the router) release the credits.

4) If there's no reaction to the "noop", lnd credits are not released and all of them are used up. The server is not able to respond to the lnetctl ping or send any message via the router using this connection.

There are several possible causes:

  • There's a bug with sending the "noop" (server or router)
  • There's a bug with receiving the "noop" (router or client)
  • Nodes are running out of credits legitimately: there's too much to do and they can't keep up.

There are some additional questions/ideas:

At the time the router debug log was taken, there are actually messages for multiple clients with the "no credits" issue. How is the routing configured on the clients? Are all clients able to use all routers, or just a subset?

Have you tried increasing the number of credits (e.g. go up to peer_credits=32, peer_credits_hiw=16, concurrent_sends=64)?
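
If you do try it: these are ko2iblnd module parameters, so they would go into the modprobe configuration on clients, routers and servers and need an LNet reload to take effect (a sketch with the values suggested above):

options ko2iblnd peer_credits=32 peer_credits_hiw=16 concurrent_sends=64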

Thanks,

Serguei.

Comment by Stephane Thiell [ 09/Nov/21 ]

Hi Serguei,

Thanks so much for spending time to look at this in more detail with Amir. We really appreciate it.

I'm attaching an image to give you an overview of the LNet architecture of /scratch on Sherlock:

Note: In orange, those are the servers/routers with the repeating credit issues.

There is nothing too fancy here, I believe. We have clients on 3 generations of IB fabric (o2ib1, o2ib2 and o2ib3), each connected to Fir (the servers) via their own LNet routers (so o2ib1<>o2ib7, o2ib2<>o2ib7 and o2ib3<>o2ib7). On a single cluster fabric, all compute nodes have the same routing configuration, so yes, all clients are using the same routers per fabric. This is an example from a compute node in o2ib2:

[root@sh02-10n19 ~]# lnetctl route show
route:
    - net: o2ib5
      gateway: 10.50.0.132@o2ib2
    - net: o2ib5
      gateway: 10.50.0.131@o2ib2
    - net: o2ib7
      gateway: 10.50.0.112@o2ib2
    - net: o2ib7
      gateway: 10.50.0.111@o2ib2
    - net: o2ib7
      gateway: 10.50.0.113@o2ib2
    - net: o2ib7
      gateway: 10.50.0.114@o2ib2 

(you can ignore o2ib5 here, which are routers to another storage system)

 

To answer your second question: no, we haven't tried to increase the number of credits. We have been using the default settings. Changing that would mean a full cluster downtime and it's... complicated.

Comment by Olaf Faaland [ 15/Nov/21 ]

Hi Serguei,

You wrote:

The message queue on the server backs up. There are lnd-level "no credits" from ko2iblnd in the server debug log. Normally when the high-watermark level is hit, the node sends a special "NOOP" request bypassing the queue, requesting to release the credits from the other side (the router)

Assuming the router or client receives the "NOOP" request, does it release those credits by dropping the messages that were sent but not yet acknowledged (ie freeing the buffers holding those messages, releasing the credits, emitting the LNetError message)?

thanks

Comment by Chris Horn [ 16/Nov/21 ]

4) If there's no reaction to the "noop", lnd credits are not released and all of them are used up. The server is not able to respond to the lnetctl ping or send any message via the router using this connection.

I don't understand why the LND doesn't eventually just time out these transactions when it can't acquire the necessary credits. That would then return the tx credits when those messages are finalized.

Comment by Serguei Smirnov [ 16/Nov/21 ]

Chris,

I do believe that if the noop message doesn't come and there are no more credits available to send, eventually there's going to be a timeout for the transaction. The problem may be that during this time the server was not able to talk via the given router to any other node, causing all kinds of backups.

Olaf,

Looks like I got it a bit backwards in my summary. In fact, if the server is sending, it is the receiving side (the router) that is expected to send the "noop" to the server once it notices it has accumulated "high-water-mark" or more credits on the receiving end; when the server receives the "noop", the credits are released and can be used again for sending.

Thanks,

Serguei 

Comment by Stephane Thiell [ 30/Nov/21 ]

It looks like the race that Chris Horn found in LU-15234 could be the cause of our occasional stuck routers and lnet high ref count problems. We look forward to testing the patch when y'all think it's ready on top of 2.12.7.

Comment by Stephane Thiell [ 27/Jan/22 ]

Hello! It looks like the patch I mentioned above has been abandoned as it doesn't resolve this issue. I believe this is now our most impactful problem with 2.12, as it seems to start randomly. For example, this morning, we noticed lots of errors of this type on an OSS:

Jan 27 11:33:36 fir-io1-s2 kernel: LustreError: 31477:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffffa1313e562400

(that's 10.0.10.102@o2ib7)

And checking the ref count on the routers, we saw a very high refcnt on the router (sh02-fir02) for this OSS (10.0.10.102@o2ib7):
 

[root@sh02-hn01 sthiell.root]# clush -u5 -Lw @rtr_fir  -b cat /sys/kernel/debug/lnet/peers | awk '$3 > 20 {print }'
sh03-fir[01-08],sh02-fir[01-04]: nid                      refs state  last   max   rtr   min    tx   min queue
sh02-fir02: 10.0.10.102@o2ib7        8829    up   110     8 -8820 -8820     8   -87 0

Rebooting the router didn't help; its refcnt for 10.0.10.102@o2ib7 is increasing again at this time. Sometimes rebooting the "bad" router does help, but unfortunately not always.
Any update on this issue? Thanks!
 

Comment by Serguei Smirnov [ 31/Jan/22 ]

Hi Stephane,

There were a couple more patches added for LU-15234: one fixes a potential cause of the reference leak, another adds debug information to narrow down where the leak may be. The results should be available later this week.

Thanks,

Serguei.

Comment by Stephane Thiell [ 01/Feb/22 ]

Hi Serguei,

OK – Thank you!
