[LU-15295] LNet tx/rx timeout with routers Created: 01/Dec/21  Updated: 24/Feb/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.7
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Stephane Thiell Assignee: Serguei Smirnov
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 7.9


Attachments: Text File clients_lnetctl_net_show.txt     Text File clients_stats_show.txt     Text File oak_oss_lnetctl_net_show.txt     Text File oak_oss_lnetctl_route.txt     Text File oak_oss_peer_show_131.txt     Text File oak_oss_peer_show_132.txt     Text File oak_oss_stats_show.txt     Text File routers_lnetctl_net_show.txt     Text File routers_stats_show.txt     File sh03-oak02.dk.log.gz    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I wanted to report another problem related to LNet, one that I think is different from the high refs count issue on stuck routers reported in LU-14652 (not seen here) and hopefully fixed by the patch Chris Horn provided in LU-15234. In this case, as seen today, we get many LNet timeout errors, leading to client disconnects/reconnects, slowness and tickets from users. We have transaction_timeout set to 50 (no retry), at_min=75 and timeout=100. The issue seems to be related to a single router between two IB HDR fabrics: o2ib5 (Oak, Lustre servers) and o2ib3 (Sherlock 3, compute nodes / clients). This is the topology:

*@o2ib5 servers <> 10.0.2.[216-217]@o2ib5 / 10.51.0.[132-131]@o2ib3 routers <> *@o2ib3 clients
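
For reference, the timeout settings mentioned above can be verified on any node with the standard tools (a minimal sketch; the parameter names are the ones used throughout this ticket):

# lnetctl global show | grep -E 'transaction_timeout|retry_count'
# lctl get_param at_min timeout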

From the clients, the problem is typically noticed with errors like these:

Nov 30 22:25:26 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Nov 30 22:25:26 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Nov 30 22:25:26 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.51.0.132@o2ib3 (0): c: 0, oc: 1, rc: 8
Nov 30 22:25:26 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
Nov 30 22:28:54 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Nov 30 22:28:54 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Nov 30 22:28:54 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.51.0.132@o2ib3 (1): c: 0, oc: 2, rc: 8
Nov 30 22:28:54 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
Nov 30 22:35:24 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Nov 30 22:35:24 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.51.0.132@o2ib3 (6): c: 0, oc: 1, rc: 8
Nov 30 22:35:24 sh03-04n19.int kernel: Lustre: 2552:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1638340518/real 1638340524]  req@ffffa0b58e61ec00 x1711246854620928/t0(0) o400->oak-
Nov 30 22:35:24 sh03-04n19.int kernel: Lustre: oak-OST0059-osc-ffffa0b9ecf67800: Connection to oak-OST0059 (at 10.0.2.106@o2ib5) was lost; in progress operations using this service will wait for recovery to complete
Nov 30 22:35:24 sh03-04n19.int kernel: Lustre: Skipped 50 previous similar messages
Nov 30 22:35:24 sh03-04n19.int kernel: Lustre: oak-OST0059-osc-ffffa0b9ecf67800: Connection restored to 10.0.2.106@o2ib5 (at 10.0.2.106@o2ib5)

The problem can be seen on the servers with errors like the following (status -103 is -ECONNABORTED), for example from OSS 10.0.2.118@o2ib5:

Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 17387:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff997bbb3a9000
Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 17387:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff99a6d5b5c400
Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 17387:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff99a6d5b5c400
Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 17387:0:(events.c:455:server_bulk_callback()) event type 3, status -103, desc ffff99c2245edc00
Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 34635:0:(ldlm_lib.c:3362:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff99d6f1953050 x1711214487574208/t0(0) o4->f876dd06-7d3a-b247-9185-2267ed691876@10.51.2.43@o2ib3:532/0 lens 488/448 e 0 to 0 dl 1638340717 ref 1 fl Interpret:/0/0 rc 0/0

The suspect router itself (10.51.0.132@o2ib3) shows many Rx failures towards o2ib3 (the clients):

00000800:00000100:15.0:1638339585.687821:0:28664:0:(o2iblnd_cb.c:2881:kiblnd_check_reconnect()) 10.51.2.43@o2ib3: reconnect (conn race), 12, 12, msg_size: 4096, queue_depth: 8/8, max_frags: 256/256
00000800:00000100:17.0:1638339585.687858:0:4020:0:(o2iblnd_cb.c:507:kiblnd_rx_complete()) Rx from 10.51.2.43@o2ib3 failed: 5

as well as RDMA timeouts against the servers:

00000800:00020000:31.0:1638339631.216651:0:4019:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue_nocred, 0 seconds
00000800:00020000:31.0:1638339631.216653:0:4019:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.0.2.114@o2ib5 (6): c: 0, oc: 0, rc: 7
00000800:00000100:31.0:1638339631.216655:0:4019:0:(o2iblnd_cb.c:2101:kiblnd_close_conn_locked()) Closing conn to 10.0.2.114@o2ib5: error -110(sending_nocred)(waiting)

I'm attaching the +net dk (debug kernel) log of the router containing the messages above as sh03-oak02.dk.log.gz.
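
For anyone wanting to reproduce the capture, a +net dk log is gathered roughly like this (a sketch of the usual lctl debug procedure; the output path is illustrative):

# lctl set_param debug=+net            # add network debugging to the debug mask
... wait for the timeouts to occur ...
# lctl dk > /tmp/sh03-oak02.dk.log     # dump (and clear) the kernel debug buffer
# gzip /tmp/sh03-oak02.dk.log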

If I check the peers on o2ib3 (the clients), the queued messages are mainly with router 10.51.0.132@o2ib3 (I checked multiple times and the pattern is clear; see the column notes after the table):

# clush -u 5 -w @island:3\&@compute "lctl get_param peers | awk '/up/ && \$10 > 0 {print}'" | sort -k3nr
sh03-05n57: 10.51.0.132@o2ib3         574    up    -1     8     8     8  -562 -4853 173368
sh03-06n37: 10.51.0.132@o2ib3         572    up    -1     8     8     8  -560 -5437 168720
sh03-04n38: 10.51.0.132@o2ib3         569    up    -1     8     8     8  -557 -13788 167536
sh03-02n56: 10.51.0.132@o2ib3         479    up    -1     8     8     8  -461 -9506 6428640
sh03-04n38: 10.51.0.131@o2ib3         476    up    -1     8     8     8  -464 -13727 140896
sh03-06n66: 10.51.0.132@o2ib3         434    up    -1     8     8     8  -422 -9398 129056
sh03-02n48: 10.51.0.132@o2ib3         384    up    -1     8     8     8  -372 -5318 113072
sh03-06n66: 10.51.0.131@o2ib3         384    up    -1     8     8     8  -372 -9363 113664
sh03-12n11: 10.51.0.132@o2ib3         380    up    -1     8     8     8  -368 -5749 116984
sh03-15n10: 10.51.0.132@o2ib3         318    up    -1     8     8     8  -306 -4950 92944
sh03-02n42: 10.51.0.132@o2ib3         302    up    -1     8     8     8  -281 -9604 11618384
sh03-05n14: 10.51.0.132@o2ib3         294    up    -1     8     8     8  -282 -1379 88440
sh03-05n62: 10.51.0.132@o2ib3         273    up    -1     8     8     8  -261 -5518 79888
sh03-02n01: 10.51.0.132@o2ib3         266    up    -1     8     8     8  -254 -3913 77552
sh03-12n20: 10.51.0.132@o2ib3         264    up    -1     8     8     8  -252 -9347 77488
sh03-02n41: 10.51.0.132@o2ib3         230    up    -1     8     8     8  -214 -1963 6356352
sh03-02n53: 10.51.0.132@o2ib3         219    up    -1     8     8     8  -207 -6578 63936
sh03-16n22: 10.51.0.132@o2ib3         199    up    -1     8     8     8  -187 -9558 57720
sh03-02n57: 10.51.0.132@o2ib3         195    up    -1     8     8     8  -182 -5138 3200560
sh03-06n21: 10.51.0.132@o2ib3         135    up    -1     8     8     8  -123 -4791 38776
sh03-02n56: 10.51.0.131@o2ib3         127    up    -1     8     8     8  -109 -9505 6329512
sh03-02n54: 10.51.0.132@o2ib3         112    up    -1     8     8     8   -91 -14602 4173560
sh03-02n49: 10.51.0.132@o2ib3          95    up    -1     8     8     8   -78 -5757 10510032
sh03-02n66: 10.51.0.132@o2ib3          92    up    -1     8     8     8   -51 -1255 49290688
sh03-05n29: 10.51.0.132@o2ib3          91    up    -1     8     8     8   -79 -6567 25752
sh03-01n07: 10.51.0.132@o2ib3          90    up    -1     8     8     8   -78 -2130 25456
sh03-04n48: 10.51.0.132@o2ib3          90    up    -1     8     8     8   -78 -4891 25456
sh03-06n55: 10.51.0.132@o2ib3          90    up    -1     8     8     8   -78 -2319 25456
sh03-06n32: 10.51.0.132@o2ib3          88    up    -1     8     8     8   -76 -6881 24864
sh03-05n01: 10.51.0.132@o2ib3          85    up    -1     8     8     8   -73 -9823 23976
sh03-02n72: 10.51.0.132@o2ib3          84    up    -1     8     8     8   -72 -6336 23680
sh03-06n36: 10.51.0.132@o2ib3          84    up    -1     8     8     8   -71 -9006 58137
sh03-04n24: 10.51.0.132@o2ib3          82    up    -1     8     8     8   -70 -4829 23088
sh03-02n50: 10.51.0.132@o2ib3          81    up    -1     8     8     8   -37 -12602 45092400
sh03-01n16: 10.51.0.132@o2ib3          79    up    -1     8     8     8   -67 -12919 22200
sh03-02n59: 10.51.0.132@o2ib3          79    up    -1     8     8     8   -67 -11661 22200
sh03-02n55: 10.51.0.132@o2ib3          78    up    -1     8     8     8   -63 -1779 2429624
sh03-02n13: 10.51.0.132@o2ib3          76    up    -1     8     8     8   -64 -7355 21312
sh03-06n33: 10.51.0.132@o2ib3          74    up    -1     8     8     8   -62 -17726 20952
sh03-05n22: 10.51.0.132@o2ib3          73    up    -1     8     8     8   -61 -10495 20424
sh03-01n18: 10.51.0.132@o2ib3          72    up    -1     8     8     8   -60 -5547 20128
sh03-04n42: 10.51.0.132@o2ib3          72    up    -1     8     8     8   -60 -5387 20128
sh03-02n44: 10.51.0.132@o2ib3          71    up    -1     8     8     8   -59 -1764 21064
sh03-06n03: 10.51.0.132@o2ib3          71    up    -1     8     8     8   -59 -5982 20096
sh03-06n26: 10.51.0.132@o2ib3          70    up    -1     8     8     8   -58 -6808 19536
sh03-04n44: 10.51.0.132@o2ib3          63    up    -1     8     8     8   -51 -6094 17464
sh03-12n24: 10.51.0.132@o2ib3          61    up    -1     8     8     8   -49 -4673 16872
sh03-02n47: 10.51.0.132@o2ib3          55    up    -1     8     8     8   -27 -4385 26218872
sh03-02n39: 10.51.0.132@o2ib3          52    up    -1     8     8     8   -40 -4968 14208
sh03-02n54: 10.51.0.131@o2ib3          51    up    -1     8     8     8   -39 -14431 4158208
sh03-06n35: 10.51.0.132@o2ib3          48    up    -1     8     8     8   -36 -4772 13024
sh03-04n22: 10.51.0.132@o2ib3          45    up    -1     8     8     8   -29 -10100 7350704
sh03-04n19: 10.51.0.132@o2ib3          41    up    -1     8     8     8   -29 -5526 10952
sh03-05n02: 10.51.0.132@o2ib3          41    up    -1     8     8     8   -29 -2591 11056
sh03-06n37: 10.51.0.131@o2ib3          38    up    -1     8     8     8   -26 -5438 10064
sh03-05n57: 10.51.0.131@o2ib3          36    up    -1     8     8     8   -24 -4852 9736
sh03-06n14: 10.51.0.132@o2ib3          34    up    -1     8     8     8   -22 -5228 9672
sh03-04n36: 10.51.0.132@o2ib3          32    up    -1     8     8     8   -20 -6917 8288
sh03-02n48: 10.51.0.131@o2ib3          31    up    -1     8     8     8   -19 -5319 7992
sh03-02n42: 10.51.0.131@o2ib3          28    up    -1     8     8     8    -8 -9221 5858432
sh03-05n04: 10.51.0.132@o2ib3          25    up    -1     8     8     8   -13 -5320 6216
sh03-05n37: 10.51.0.132@o2ib3          25    up    -1     8     8     8   -13 -4483 6576
sh03-02n66: 10.51.0.131@o2ib3          24    up    -1     8     8     8     7 -1886 1048648
sh03-06n23: 10.51.0.132@o2ib3          23    up    -1     8     8     8   -11 -17351 5624
sh03-02n50: 10.51.0.131@o2ib3          21    up    -1     8     8     8     5 -12627 3145944
sh03-04n69: 10.51.0.132@o2ib3          20    up    -1     8     8     8    -8 -5736 4736
sh03-05n25: 10.51.0.132@o2ib3          20    up    -1     8     8     8    -8 -7456 4736
sh03-16n22: 10.51.0.131@o2ib3          19    up    -1     8     8     8    -7 -10269 4560
sh03-06n30: 10.51.0.132@o2ib3          12    up    -1     8     8     8     0 -4972 2368
sh03-06n34: 10.51.0.132@o2ib3          12    up    -1     8     8     8     0 -6922 2368
sh03-02n46: 10.51.0.132@o2ib3           9    up    -1     8     8     8     4 -7325 4194592
sh03-06n64: 10.51.0.132@o2ib3           7    up    -1     8     8     8     5 -2482 888
sh03-14n24: 10.51.0.132@o2ib3           5    up    -1     8     8     8     7 -13642 4320
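
A note on reading this table: if I remember the 2.12 proc format correctly, the lctl get_param peers columns are nid, refs, state, last, max, rtr, min, tx, min and queue, so the awk filter above ($10 > 0) keeps only peers with bytes sitting in the send queue, and since clush prepends the hostname, sort -k3nr orders the result by peer refcount (field 3). A variant sorted by queued bytes instead would be:

# clush -u 5 -w @island:3\&@compute "lctl get_param peers | awk '/up/ && \$10 > 0 {print}'" | sort -k11nr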

From a client again, this is the peer state of the router:

[root@sh03-04n19 ~]# lnetctl peer show --nid 10.51.0.132@o2ib3 -v 4
peer:
    - primary nid: 10.51.0.132@o2ib3
      Multi-Rail: True
      peer state: 75
      peer ni:
        - nid: 10.51.0.132@o2ib3
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: -5526
          tx_q_num_of_buf: 0
          available_rtr_credits: 8
          min_rtr_credits: 8
          refcount: 4
          statistics:
              send_count: 19788794
              recv_count: 21223791
              drop_count: 0
          sent_stats:
              put: 19537434
              get: 251360
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 20948617
              get: 170582
              reply: 104585
              ack: 7
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              dropped: 14
              timeout: 17
              error: 0
              network timeout: 0
        - nid: 10.0.2.217@o2ib5
          state: NA
          max_ni_tx_credits: 0
          available_tx_credits: 0
          min_tx_credits: 0
          tx_q_num_of_buf: 0
          available_rtr_credits: 0
          min_rtr_credits: 0
          refcount: 2
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          sent_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              dropped: 0
              timeout: 0
              error: 0
              network timeout: 0

Our LNet config is homogeneous, with:

[root@sh03-04n19 ~]# lnetctl global show
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 50
    health_sensitivity: 0
    recovery_interval: 1
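
(If I understand the health tunables correctly, retry_count: 0 means LNet message retries are disabled and health_sensitivity: 0 means peer/NI health values are never decremented on failure, which matches the "no retry" note in the description.)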

And we are pretty much using default tunables:

[root@sh03-04n19 ~]# lnetctl net show -v
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          dev cpt: 0
          tcp bonding: 0
          CPT: "[0,1]"
    - net type: o2ib3
      local NI(s):
        - nid: 10.51.4.19@o2ib3
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 267775369
              recv_count: 266668619
              drop_count: 59
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
              peercredits_hiw: 4
              map_on_demand: 0
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          lnd tunables:
          dev cpt: 1
          tcp bonding: 0
          CPT: "[0,1]"

Note that discovery is disabled on the routers.
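
For reference, discovery can be toggled at runtime with lnetctl, so on the routers this amounts to:

# lnetctl set discovery 0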

The storage system at o2ib5 is connected to 3 other fabrics through routers without issues at the moment (1 x tcp and 2 other IB fabrics with even more clients), so it's definitely a problem with the o2ib3 routers or clients, most likely 10.51.0.132@o2ib3. Let me know if you have any ideas on how to troubleshoot these problems. We're now running 2.12.7 everywhere. Thanks!



 Comments   
Comment by Peter Jones [ 01/Dec/21 ]

Serguei

What do you suggest here?

Peter

Comment by Serguei Smirnov [ 01/Dec/21 ]

Hi,

If I understand correctly, there are two routers between o2ib3 and o2ib5, but the issue happens mostly with "132"?

If the router enters this "bad" state, can you still "lnetctl ping" it from clients and servers?

Is the router able to recover on its own, or does it just continue to cause problems until rebooted?

It would also be useful to take a look at 

lnetctl peer show --nid 10.51.0.132@o2ib3 -v 4
lnetctl peer show --nid 10.51.0.131@o2ib3 -v 4
lnetctl route show -v 4

from the servers and several clients in order to check if there's any difference in how they see the routers

and 

lnetctl global show
lnetctl stats show
lnetctl net show -v 4

from the two routers, the servers and at least one "typical" client (you already provided the client "global" and "net" dumps above).

If you are using lnet or ko2iblnd module parameter settings that are different from default, please attach the corresponding conf files.
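
(On RHEL-family systems these typically live under /etc/modprobe.d/, e.g. a hypothetical /etc/modprobe.d/lustre.conf containing lines such as:

options lnet networks="o2ib3(ib0)"
options ko2iblnd peer_credits=8 concurrent_sends=8

where the option names are real lnet/ko2iblnd module parameters but the values shown are purely illustrative.)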

Thanks,

Serguei.

Comment by Stephane Thiell [ 01/Dec/21 ]

Thanks Serguei,

Yes, I see more RDMA timeout errors and queued messages towards "132". I'm still able to lctl ping it, unlike when we hit the never-decreasing high refs count problem described in LU-14652.

Today, the problem reported in this ticket is still here between o2ib5 and o2ib3. We haven't rebooted the two routers yet. We would like to investigate more and understand what could be wrong.

While gathering the output of the commands you asked for, I noticed one difference. We have 12 OSSes, and on one of them (oak-io1-s1) the two routers are seen as non-MR peers, while on the 11 other OSSes the two routers are seen as MR (Multi-Rail) peers.

Output of:

lnetctl peer show --nid 10.51.0.132@o2ib3 -v 4

available in oak_oss_peer_show_132.txt

Output of:

lnetctl peer show --nid 10.51.0.131@o2ib3 -v 4

available in oak_oss_peer_show_131.txt

From oak-io1-s1, if I ping the routers on their o2ib5 interface, I see non-MR peers, like this:

[root@oak-io1-s1 ~]# lnetctl peer show --nid 10.0.2.217@o2ib5 -v 4
peer:
    - primary nid: 10.0.2.217@o2ib5
      Multi-Rail: False
      peer state: 0
      peer ni:
        - nid: 10.0.2.217@o2ib5
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: -286
          tx_q_num_of_buf: 0
          available_rtr_credits: 8
          min_rtr_credits: 8
          refcount: 4
          statistics:
              send_count: 120315781
              recv_count: 110271823
              drop_count: 0
          sent_stats:
              put: 120240966
              get: 74815
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 101581707
              get: 93
              reply: 74563
              ack: 8615460
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              dropped: 20
              timeout: 0
              error: 0
              network timeout: 0

oak-io1-s1 doesn't seem to be generating any errors though, so I'm not sure this is related, but it's clearly a difference between the servers.

Output of:

lnetctl route show -v 4

available in oak_oss_lnetctl_route.txt

but it's not easily readable, so I like to use the old-school lctl route_list:

[root@oak-hn01 ~]# clush -w oak-io[1-6]-s[1-2] -b 'lctl route_list | sort' |& tee -a /tmp/lctlroute
---------------
oak-io[1-6]-s[1-2] (12)
---------------
net              o2ib1 hops 4294967295 gw                 10.0.2.212@o2ib5 up pri 0
net              o2ib1 hops 4294967295 gw                 10.0.2.213@o2ib5 up pri 0
net              o2ib2 hops 4294967295 gw                 10.0.2.214@o2ib5 up pri 0
net              o2ib2 hops 4294967295 gw                 10.0.2.215@o2ib5 up pri 0
net              o2ib3 hops 4294967295 gw                 10.0.2.216@o2ib5 up pri 0
net              o2ib3 hops 4294967295 gw                 10.0.2.217@o2ib5 up pri 0
net               tcp1 hops 4294967295 gw                 10.0.2.206@o2ib5 up pri 0
net               tcp1 hops 4294967295 gw                 10.0.2.207@o2ib5 up pri 0
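
(The hops value of 4294967295 here is just the default/unset hop count of -1 printed as an unsigned integer, not a misconfiguration.)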

Output of

lnetctl global show

From the servers:

---------------
oak-io[1-6]-s[1-2] (12)
---------------
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 50
    health_sensitivity: 0
    recovery_interval: 1

From the routers:

---------------
sh03-oak[01-02] (2)
---------------
global:
    numa_range: 0
    max_intf: 200
    discovery: 0
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 50
    health_sensitivity: 0
    recovery_interval: 1

From the clients:

global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 50
    health_sensitivity: 0
    recovery_interval: 1

Output of

lnetctl stats show

available in oak_oss_stats_show.txt, routers_stats_show.txt and clients_stats_show.txt.

Output of

lnetctl net show -v 4

available in oak_oss_lnetctl_net_show.txt, routers_lnetctl_net_show.txt and clients_lnetctl_net_show.txt.

Comment by Serguei Smirnov [ 02/Dec/21 ]

Stephane,

This appears to be something that we can try to address by increasing timeouts to allow things to complete. We can set the LNet transaction_timeout closer to the Lustre at_min by running the following on all nodes:

lnetctl set transaction_timeout 70

There's a chance that this will help if the delays caused by high load and traffic bursts are just a few seconds longer than the current timeout.
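
To apply this fleet-wide, something like the following should work (reusing the clush node patterns from earlier in this ticket; the client group is illustrative), followed by persisting the value in each node's boot-time LNet configuration:

# clush -bw oak-io[1-6]-s[1-2],sh03-oak[01-02] 'lnetctl set transaction_timeout 70'
# clush -bw @compute 'lnetctl set transaction_timeout 70'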

It would be best if this could be reproduced on some sort of smaller development cluster.

Thanks,

Serguei.

Comment by Stephane Thiell [ 03/Dec/21 ]

Hi Serguei,

Thanks! It's done, we have increased transaction_timeout from 50 to 70 on servers, routers and clients. We'll see how it goes and I will report back.

Comment by James A Simmons [ 24/Feb/22 ]

Did increasing the transaction_timeout help? ORNL is seeing similar problems, and we increased our transaction_timeout and it didn't help.

Comment by Stephane Thiell [ 24/Feb/22 ]

It might have helped a bit, but 70 is not necessarily the best value, I guess. We have it set to 49 at the moment. The key seems to be to set transaction_timeout to something other than 50; otherwise, a very low default timeout value of 5 seconds is used, which is easier to hit on a large system. Chris Horn explained this in his comment from 12/5/21 in LU-15234. Still, I have to say I'm still seeing occasional server_bulk_callback errors like those reported above, but I believe less often.
