[LU-14589] LNet routers don't reset peers after they reboot Created: 06/Apr/21  Updated: 29/Sep/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Stephane Thiell Assignee: Amir Shehata (Inactive)
Resolution: Unresolved Votes: 0
Labels: None

Attachments: File sh02-fir02.dk-lnet.log.gz     Text File sh02-fir02.netshow.txt     Text File sh02-fir02.peers.txt     Text File sh02-fir02.peershow.all.txt     Text File sh02-oak02_peers.txt    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We are now using 2.14 LNet routers, with 2.12.6 servers and a mix of 2.13 and 2.14 clients, and it looks like the routers are less resilient to client node failures.

We had two peers (clients) running Lustre 2.13 that were in a bad state (RDMA timeouts, but no apparent IB fabric problem), and our routers (2.14) had a lot of refs for them. These are peers 10.50.12.14@o2ib2 and 10.50.12.15@o2ib2 below. We rebooted them, but even after that we couldn't mount the filesystem, likely because the routers still held old references to them?

This is after reboot of 10.50.12.14@o2ib2 and 10.50.12.15@o2ib2:

[root@sh02-oak02 ~]# cat /sys/kernel/debug/lnet/peers | awk '{ if ($2 > 2) print $0 }'
nid                      refs state  last   max   rtr   min    tx   min queue
10.50.7.38@o2ib2            3    up    -1     8     6    -8     8   -39 0
10.50.15.11@o2ib2          17    up    -1     8    -8    -8     8   -43 0
10.50.5.55@o2ib2           14    up    -1     8    -5    -8     8   -16 0
10.50.14.13@o2ib2          17    up    -1     8    -8    -8     8   -19 0
10.50.5.68@o2ib2           17    up    -1     8    -8    -8     8   -27 0
10.50.5.47@o2ib2           17    up    -1     8    -8    -8     8   -45 0
10.50.5.60@o2ib2           17    up    -1     8    -8    -8     8   -61 0
10.50.16.6@o2ib2            4    up    -1     8     5    -8     8   -53 0
10.50.4.41@o2ib2            7    up    -1     8     2    -8     8   -15 0
10.50.14.10@o2ib2           8    up    -1     8     1    -8     8   -21 0
10.50.12.14@o2ib2        29128    up    -1     8 -29119 -29119     8   -82 0    <<<<
10.50.1.60@o2ib2            8    up    -1     8     1    -8     8   -20 0
10.50.13.4@o2ib2            3    up    -1     8     6    -8     8   -20 0
10.50.14.15@o2ib2           7    up    -1     8     2    -8     8   -28 0
10.50.1.18@o2ib2            3    up    -1     8     6    -8     8   -13 0
10.50.15.5@o2ib2           17    up    -1     8    -8    -8     8   -20 0
10.50.7.3@o2ib2             3    up    -1     8     6    -8     8   -32 0
10.50.13.14@o2ib2           4    up    -1     8     5    -8     8   -37 0
10.50.5.67@o2ib2            4    up    -1     8     5    -8     8   -28 0
10.50.10.41@o2ib2           3    up    -1     8     6    -8     8   -18 0
10.50.0.64@o2ib2            5    up    -1     8     4   -24     8   -75 0
10.50.5.38@o2ib2            3    up    -1     8     6    -8     8   -33 0
10.50.13.11@o2ib2          15    up    -1     8    -6    -8     8   -28 0
10.50.5.9@o2ib2             4    up    -1     8     5    -8     8   -20 0
10.50.12.13@o2ib2           4    up    -1     8     5    -8     8   -39 0
10.50.5.43@o2ib2            4    up    -1     8     5    -8     8   -24 0
10.50.1.59@o2ib2            6    up    -1     8     3    -8     8   -20 0
10.50.12.5@o2ib2            3    up    -1     8     6   -40     8   -52 0
10.50.16.2@o2ib2           16    up    -1     8    -7    -8     8   -25 0
10.50.4.8@o2ib2            12    up    -1     8    -3   -40     8   -37 0
10.50.15.9@o2ib2            8    up    -1     8     1    -8     8   -27 0
10.50.5.53@o2ib2           17    up    -1     8    -8    -8     8   -26 0
10.50.5.32@o2ib2            3    up    -1     8     6    -8     8   -28 0
10.50.13.13@o2ib2           6    up    -1     8     3    -8     8   -41 0
10.0.2.114@o2ib5          203    up    -1     8     8    -8  -194 -1877 15190572
10.50.12.15@o2ib2        36367    up    -1     8 -36358 -36358     8   -76 0      <<<
10.50.10.6@o2ib2            3    up    -1     8     6    -8     8   -23 0
10.50.15.1@o2ib2            3    up    -1     8     6    -8     8   -19 0
10.50.8.44@o2ib2            3    up    -1     8     6    -8     8   -10 0

The routers were shown as up from both clients, but the filesystem still couldn't be mounted.
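(On a client, something like lnetctl route show -v should list each configured gateway along with its up/down state:)

lnetctl route show -v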

We manually deleted both peers from the routers:

[root@sh02-oak02 ~]# lnetctl peer del --prim_nid 10.50.12.15@o2ib2
[root@sh02-oak02 ~]# lnetctl peer del --prim_nid 10.50.12.14@o2ib2

The situation was much better after that in terms of peers:

[root@sh02-oak02 ~]# cat /sys/kernel/debug/lnet/peers | awk '{ if ($2 > 2) print $0 }'
nid                      refs state  last   max   rtr   min    tx   min queue
[root@sh02-oak02 ~]# 

And we were able to mount the filesystem on these two peers without problem.

But I noticed that the refs had not been cleared in the nis file:

[root@sh02-oak02 ~]# cat /sys/kernel/debug/lnet/nis 
nid                      status alive refs peer  rtr   max    tx   min
0@lo                         up     0    2    0    0     0     0     0
10.50.0.132@o2ib2            up     0 65815    8    0   256   256   195
10.0.2.215@o2ib5             up     0  225    8    0   256   248   179

 

Our config:

Routers:

[root@sh02-oak02 ~]# lnetctl global show
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 50
    health_sensitivity: 0
    recovery_interval: 1
    router_sensitivity: 100
    lnd_timeout: 49
    response_tracking: 3
    recovery_limit: 0

Clients:

[root@sh02-12n14 ~]# lnetctl global show -v 3
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 50
    health_sensitivity: 0
    recovery_interval: 1
    router_sensitivity: 100
    lnd_timeout: 49
    response_tracking: 3
    recovery_limit: 0
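
For reference, a minimal sketch of how these globals could be applied with lnetctl set at LNet startup, using the values shown above (assuming all of them are exposed through lnetctl set on 2.14, as they appear in lnetctl global show):

lnetctl set transaction_timeout 50
lnetctl set retry_count 0
lnetctl set health_sensitivity 0
lnetctl set router_sensitivity 100
# response_tracking and recovery_limit are assumed to be settable the same way on 2.14
lnetctl set response_tracking 3
lnetctl set recovery_limit 0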


 Comments   
Comment by Amir Shehata (Inactive) [ 07/Apr/21 ]

My current suspicion is that the problem is in the rtr credits value.
That value is decremented for every message routed and incremented when the message completes and the credit is returned. The fact that we have such a big negative value means there are messages queued on that peer, waiting for credits to become available before they can be sent. Given that the peer is in a bad state, it's likely taking a long time for these credits to come back.
It's also likely that even though the routed messages were not getting through, the client was still able to send keep-alive messages to the router, thereby keeping its state up.
When the client rebooted, how long did you wait before dumping /sys/kernel/debug/lnet/peers on the routers?
The state of the peer is changed to down if there is a connection failure with it and changed to up whenever we receive a message from it. So if you dumped the output after the client came up, it's likely the client had already discovered the routers it has routes through, which brought its state to up on the router.
This is likely not a new issue, maybe just a new situation?
I'm thinking one course of action is for the router to flush the messages queued for a particular peer when it detects that the peer has gone down. The problem is that the way a router detects a peer has gone down is a connection failure with it; however, a connection failure doesn't mean the peer has been rebooted, it could be a temporary glitch in the network, so dropping all the messages might cause other symptoms on the client, since RPCs will fail left and right.
We need a way for the router to know that the client has been rebooted; when it detects a reboot, it can flush all the queued messages and return all the credits.
One mechanism we had in the code: on discovery we send a sequence number in the ping, and that sequence number is continuously increasing. If the router keeps track of it per peer and sees that a peer's sequence number has been reset, that would indicate the client has been reset, and we can safely drop all the messages queued for that client and return all the credits.
thoughts?
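
In the meantime, a rough way to spot peers that are piling up queued routed messages on a router is the rtr column of the peers file; a sketch, assuming the column layout shown in the output above:

# list peers whose rtr credits have gone negative, i.e. with routed messages queued
awk 'NR > 1 && $6 < 0 { printf "%-22s rtr deficit %d (refs %d)\n", $1, -$6, $2 }' /sys/kernel/debug/lnet/peers

Each unit of that deficit roughly corresponds to one routed message still waiting for a credit on that peer.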

Comment by Stephane Thiell [ 07/Apr/21 ]

Thanks Amir for the explanation!

Something was definitely going on with these two clients. I'm not sure what. We saw LNet RDMA timeout errors but the fabric was fine. They were running Lustre 2.13 and Mellanox OFED 5.0 when this happened. We're slowly upgrading nodes to the Lustre 2.14 client instead, in the hope that some of the LNet patches between 2.13 and 2.14 will help.

I waited more than a few minutes before dumping out /sys/kernel/debug/lnet/peers on the routers.

I think it is a new situation, not necessarily a new issue. As I said, we have had some weird LNet errors lately with 2.13 clients on a specific IB fabric (o2ib2). We also recently increased at_min on this filesystem to 75, which helped significantly (even though this is ptlrpc rather than LNet directly, it's interesting to note that it has reduced request timeout errors).

 

> It's also likely that even though the routed messages were not getting through, the client is able to still send the keep alive message to the router; therefore keeping the state as up.

That sounds a bit weird to me. Perhaps if the routed messages are not getting through, the routers could do something to abort after some time? Or switch the state to something else, like down, which would really mean "congested"?

Otherwise, I like your idea of the sequence number!

Today things are going well for these two nodes in particular, but I noticed that the high refs counts in /sys/kernel/debug/lnet/nis have still not been cleared:

[root@sh02-oak02 ~]# cat /sys/kernel/debug/lnet/nis 
nid                      status alive refs peer  rtr   max    tx   min
0@lo                         up     0    2    0    0     0     0     0
10.50.0.132@o2ib2            up     0 65647    8    0   256   254   195
10.0.2.215@o2ib5             up     0  127    8    0   256   248   179

This doesn't match the refs in /sys/kernel/debug/lnet/peers; the output of that file is attached as sh02-oak02_peers.txt.
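
A quick way to compare is to sum the refs column of the peers file and set it against the refs reported in nis; a rough sketch:

# total per-peer refs, skipping the header line
awk 'NR > 1 { sum += $2 } END { print sum }' /sys/kernel/debug/lnet/peers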

Also, please find detailed peer info (from today) for the two client NIDs that were causing problems yesterday and were rebooted:

[root@sh02-oak02 ~]# lnetctl peer show --nid 10.50.12.14@o2ib2 -v 4
peer:
    - primary nid: 10.50.12.14@o2ib2
      Multi-Rail: True
      peer state: 273
      peer ni:
        - nid: 10.50.12.14@o2ib2
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: -33
          tx_q_num_of_buf: 0
          available_rtr_credits: 8
          min_rtr_credits: -5
          refcount: 1
          statistics:
              send_count: 429820
              recv_count: 424710
              drop_count: 0
          sent_stats:
              put: 427490
              get: 2313
              reply: 7
              ack: 10
              hello: 0
          received_stats:
              put: 422389
              get: 2319
              reply: 0
              ack: 2
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              dropped: 0
              timeout: 0
              error: 0
              network timeout: 0
[root@sh02-oak02 ~]# lnetctl peer show --nid 10.50.12.15@o2ib2 -v 4
peer:
    - primary nid: 10.50.12.15@o2ib2
      Multi-Rail: True
      peer state: 273
      peer ni:
        - nid: 10.50.12.15@o2ib2
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: -35
          tx_q_num_of_buf: 0
          available_rtr_credits: 8
          min_rtr_credits: -8
          refcount: 1
          statistics:
              send_count: 431486
              recv_count: 428411
              drop_count: 0
          sent_stats:
              put: 429155
              get: 2315
              reply: 11
              ack: 5
              hello: 0
          received_stats:
              put: 426088
              get: 2321
              reply: 0
              ack: 2
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              dropped: 0
              timeout: 0
              error: 0
              network timeout: 0
Comment by Stephane Thiell [ 09/Apr/21 ]

We've hit this issue with another set of routers that we recently upgraded to Lustre 2.14 (for another storage system, scratch, connected to our cluster Sherlock). To be clear, this is not exactly the case of a server rebooting, but the same issue of accumulated refs leading to a broken router.

Example:

Clients: *@o2ib2
OSS: 10.0.10.101@o2ib7 (fir-io1-s1) running Lustre 2.12.5
Router sh02-fir02 between o2ib7 and o2ib2 running Lustre 2.14 with NIDs:

  • 10.50.0.112@o2ib2
  • 10.0.10.225@o2ib7

Last night we started to see "RDMA timed out" errors involving this router from all OSSs, so we took a closer look.

From router sh02-fir02, peer 10.0.10.101@o2ib7 is seen with a large rtr credit deficit:

[root@sh02-fir02 ~]# lnetctl peer show --nid 10.0.10.101@o2ib7  -v 4 
peer:
    - primary nid: 10.0.10.101@o2ib7
      Multi-Rail: True
      peer state: 273
      peer ni:
        - nid: 10.0.10.101@o2ib7
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: -84
          tx_q_num_of_buf: 0
          available_rtr_credits: -68955             <<<
          min_rtr_credits: -68955
          refcount: 68964
          statistics:
              send_count: 260217968
              recv_count: 264915739
              drop_count: 0
          sent_stats:
              put: 213920018
              get: 9654
              reply: 11769852
              ack: 34518444
              hello: 0
          received_stats:
              put: 253300757
              get: 11614959
              reply: 12
              ack: 11
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              dropped: 721
              timeout: 0
              error: 0
              network timeout: 0

The available_rtr_credits deficit keeps growing for this peer:

[root@sh02-fir02 ~]# lnetctl peer show --nid 10.0.10.101@o2ib7  -v 4 | grep available_rtr_credits
          available_rtr_credits: -69577
[root@sh02-fir02 ~]# lnetctl peer show --nid 10.0.10.101@o2ib7  -v 4 | grep available_rtr_credits
          available_rtr_credits: -69583
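
The drift is easier to watch with something like:

# refresh the rtr credit counters and refcount every 10 seconds
watch -n 10 "lnetctl peer show --nid 10.0.10.101@o2ib7 -v 4 | grep -E 'rtr_credits|refcount'"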

On all servers, we see errors with this router:

Apr 09 12:00:44 fir-io1-s1 kernel: LustreError: 129066:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff8bf5f10bf800
Apr 09 12:00:44 fir-io1-s1 kernel: LustreError: 129066:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff8bef240de800
Apr 09 12:00:50 fir-io1-s1 kernel: LustreError: 129066:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff8bef240dd400
Apr 09 12:00:50 fir-io1-s1 kernel: LustreError: 129066:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff8be89d402c00
Apr 09 12:00:50 fir-io1-s1 kernel: LustreError: 129066:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff8c081dba8400
Apr 09 12:00:25 fir-io4-s1 kernel: LustreError: 119829:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff897ad4ce8000
Apr 09 12:00:31 fir-io4-s1 kernel: LustreError: 119829:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff899ef77ebc00
Apr 09 12:00:31 fir-io4-s1 kernel: LustreError: 119829:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff89ce2685b400
Apr 09 12:00:38 fir-io4-s1 kernel: LustreError: 119829:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff89cde6f8d400
Apr 09 12:00:38 fir-io4-s1 kernel: LustreError: 119829:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff89cde6f8d400
Apr 09 12:00:38 fir-io4-s1 kernel: LustreError: 119829:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff89cde6f8d400
Apr 09 12:00:50 fir-io4-s1 kernel: LNetError: 119829:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
Apr 09 12:00:50 fir-io4-s1 kernel: LNetError: 119829:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Skipped 95 previous similar messages
Apr 09 12:00:50 fir-io4-s1 kernel: LNetError: 119829:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 10.0.10.225@o2ib7 (0): c: 0, oc: 3, rc: 1
Apr 09 12:00:50 fir-io4-s1 kernel: LNetError: 119829:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Skipped 95 previous similar messages 

Attaching:

  • output of /sys/kernel/debug/lnet/peers from the router sh02-fir02 as sh02-fir02.peers.txt
  • output of lnetctl net show -v 4 from the router as sh02-fir02.netshow.txt
  • output of lnetctl peer show -v 4 from the router as sh02-fir02.peershow.all.txt
  • 1 sec of lustre debug log with +net from the router as sh02-fir02.dk-lnet.log.gz
    I tried to capture the debug log that contains the following error:
    Apr 09 11:46:38 sh02-fir02.int kernel: LNet: Closing conn to 10.0.10.101@o2ib7: error 0(waiting)
    

We've seen this with another router too; the problem is the same. We're going to downgrade these routers to Lustre 2.12.6 for now.

Comment by Stephane Thiell [ 29/Sep/21 ]

I'm still seeing these stuck routers, especially after a router reboot (even if it was fine before): refs pile up and the router quickly runs out of rtr credits. Servers, routers and clients are all running 2.12.7 in this case.

[root@sh03-oak02 ~]# lnetctl peer show --nid 10.0.2.51@o2ib5  -v 4
peer:
    - primary nid: 10.0.2.51@o2ib5
      Multi-Rail: False
      peer state: 16
      peer ni:
        - nid: 10.0.2.51@o2ib5
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: -133
          tx_q_num_of_buf: 0
          available_rtr_credits: -1088            <<<
          min_rtr_credits: -1088
          refcount: 1097
          statistics:
              send_count: 938444
              recv_count: 893
              drop_count: 0
          sent_stats:
              put: 668717
              get: 6
              reply: 0
              ack: 269721
              hello: 0
          received_stats:
              put: 887
              get: 6
              reply: 0
              ack: 0
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              dropped: 4
              timeout: 0
              error: 0
              network timeout: 0
$ cat /sys/kernel/debug/lnet/peers | awk '$2 > 20 {print }' 
nid                      refs state  last   max   rtr   min    tx   min queue
10.0.2.51@o2ib5          1097    up   160     8 -1088 -1088     8  -133 0

Rebooting the stuck router doesn't necessarily help.
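
So far the workaround has been the manual peer deletion shown earlier; a rough sketch that applies it to every peer with piled-up refs, assuming each NID listed in the peers file is also that peer's primary NID and reusing the > 20 refs threshold from the awk above:

for nid in $(awk 'NR > 1 && $2 > 20 { print $1 }' /sys/kernel/debug/lnet/peers); do
    lnetctl peer del --prim_nid "$nid"
done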
