[LU-10831] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.19.1.53@o2ib100: 11 seconds Created: 20/Mar/18  Updated: 19/May/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.3
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Giuseppe Di Natale (Inactive) Assignee: Sonia Sharma (Inactive)
Resolution: Unresolved Votes: 0
Labels: llnl

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Our development filesystem lquake is currently running Lustre 2.10 for testing. We've begun noticing the following chatter in the console logs of jet (the host of lquake):

2018-03-16 09:10:18 [62861.230420] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.19.1.53@o2ib100: 9 seconds
2018-03-16 09:10:18 [62861.242715] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 9 previous similar messages
2018-03-16 09:20:24 [63467.220203] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.19.1.53@o2ib100: 3 seconds
2018-03-16 09:20:24 [63467.232510] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 9 previous similar messages
2018-03-16 09:30:30 [64073.210011] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.19.1.53@o2ib100: 0 seconds
2018-03-16 09:30:30 [64073.222312] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 11 previous similar messages
2018-03-16 09:40:48 [64691.199577] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.19.1.53@o2ib100: 9 seconds
2018-03-16 09:40:48 [64691.211874] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 10 previous similar messages
2018-03-16 09:50:54 [65297.189318] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.19.1.53@o2ib100: 2 seconds
2018-03-16 09:50:54 [65297.201613] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 11 previous similar messages

These messages began showing up when lquake was mounted by Lustre 2.10 clients and those clients had the same chatter in their console logs.

Currently, lquake is being mounted by Lustre 2.8 clients and jet is still seeing the chatter in its console logs.

Is this a symptom of a larger problem, and how does it impact performance of the filesystem? Clients appear to be able to use the filesystem without any problems.



 Comments   
Comment by Peter Jones [ 21/Mar/18 ]

Sonia

Can you please advise?

Thanks

Peter

Comment by Sonia Sharma (Inactive) [ 22/Mar/18 ]

Hi Giuseppe,

Can you turn on neterror logging so we can see the LNet errors, if there are any?

lctl set_param debug=+net
lctl set_param debug=+neterror

This timeout in kiblnd_check_conns(), though, might be caused by a hardware issue. Can you check the health of IB on the node 172.19.1.53@o2ib100?

Comment by Olaf Faaland [ 22/Mar/18 ]

Joe,

The version affected says 2.10.0.  Is that correct?  I thought we were testing a later 2.10 tag.

Please provide specific version information for Sonia so she can look at the relevant code.

Thanks

Comment by Giuseppe Di Natale (Inactive) [ 22/Mar/18 ]

Oops, I selected the wrong version. Our version is based on 2.10.3 (lustre-2.10.3_3.chaos-1.ch6.x86_64).

Comment by Olaf Faaland [ 22/Mar/18 ]

Code is at https://github.com/LLNL/lustre/commits/2.10.3-llnl, essentially the 2.10.3 release plus a few unrelated patches.

Comment by Giuseppe Di Natale (Inactive) [ 22/Mar/18 ]

So I looked into this some more. One of our router nodes was down according to all of jet's nodes (I checked /proc/sys/lnet/routers). That particular router had a deadline value assigned on all of jet's nodes. I brought the router up and the console log messages stopped. I glanced at the code, and when a deadline is passed, it appears that connections are closed with ETIMEDOUT. But if those connections end up closed, why do we keep hitting the deadline and seeing the message?

I also decided to take down another router that jet connects to; according to /proc/sys/lnet/routers that connection is down, but there is no deadline. How are these deadlines assigned?

Comment by James A Simmons [ 22/Mar/18 ]

Is this an eth0 --> IB type of setup, or all IB?

 

Comment by Olaf Faaland [ 23/Mar/18 ]

No TCP. opal188 is a router and has an mlx5 interface and an omnipath interface.

jet1 (MDS, complainer) mlx5 <===> 172.19.1.53@o2ib100 mlx5 – opal188 (router) – 192.168.128.188@o2ib18 omnipath <===> compute nodes

 

Comment by Sonia Sharma (Inactive) [ 26/Mar/18 ]

Hi Giuseppe,

This deadline is set when a tx is queued to be sent. It is set in kiblnd_queue_tx_locked() using the kiblnd tunable kib_timeout:

tx->tx_deadline = jiffies +
                  msecs_to_jiffies(*kiblnd_tunables.kib_timeout *
                                   MSEC_PER_SEC);

This deadline remains active until confirmation is received that the tx has completed.
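As a rough worked example (my arithmetic here, assuming the o2iblnd timeout tunable is still at its default of 50 seconds): a tx queued at time T gets tx_deadline = T + 50 seconds (converted to jiffies), so the earliest a "Timed out tx" warning can appear for that tx is roughly 50 seconds after it was queued, and the "N seconds" printed in the message is how far past the deadline the scan found it.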

The message we see in the logs comes from the code below in kiblnd_check_conns():

list_for_each_entry_safe(tx, tx_tmp, &peer_ni->ibp_tx_queue, tx_list) {
        if (cfs_time_aftereq(jiffies, tx->tx_deadline)) {
                 CWARN("Timed out tx for %s: %lu seconds\n",
                       libcfs_nid2str(peer_ni->ibp_nid),
                       cfs_duration_sec(jiffies - tx->tx_deadline));
                 list_move(&tx->tx_list, &timedout_txs);
        }
}

Here the txs on peer_ni->ibp_tx_queue are the txs waiting for a connection. They are scanned to find the ones whose deadline has passed; that is when we see the "Timed out tx" messages in the logs.
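To make that scan concrete outside the kernel, here is a minimal userspace C sketch of the same pattern (the sim_* names and the fixed-size array are mine, not the o2iblnd code): each queued tx carries an absolute deadline, and a periodic check compares it against the current time and reports anything that has expired.

#include <stdio.h>
#include <time.h>

struct sim_tx {
        time_t deadline;        /* absolute time by which the tx must complete */
        int    timed_out;
};

/* Scan queued txs and flag the ones whose deadline has passed. */
static void sim_check_txs(struct sim_tx *txs, int ntx, time_t now)
{
        for (int i = 0; i < ntx; i++) {
                if (now >= txs[i].deadline && !txs[i].timed_out) {
                        txs[i].timed_out = 1;
                        printf("Timed out tx: %ld seconds\n",
                               (long)(now - txs[i].deadline));
                }
        }
}

int main(void)
{
        time_t now = time(NULL);
        struct sim_tx txs[2] = {
                { .deadline = now - 11, .timed_out = 0 },  /* already expired */
                { .deadline = now + 50, .timed_out = 0 },  /* still in its window */
        };

        sim_check_txs(txs, 2, now);
        return 0;
}

The real code additionally moves the expired entries onto a local timedout_txs list so they can be completed outside the scan.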

 

Comment by Giuseppe Di Natale (Inactive) [ 27/Mar/18 ]

Hi Sonia,

Thanks for the reply and explanation. In the situation where a route has a deadline set and is considered down, we are seeing these messages flood our logs and they don't seem to stop. Is that the expected behavior? It makes me think that connections that should have been timed out aren't in fact being timed out.

Looking at kiblnd_check_conns(), it seems that it eventually attempts to close connections that timed out. Does that mean that these messages should stop once that connection closes?

And where do the deadlines on the routes get assigned considering deadline is present in /proc/sys/lnet/routers?

Comment by Olaf Faaland [ 24/Apr/18 ]

Sonia,

It seems like the queued txs should all have timed out eventually, and new txs should all have been queued on the other connection to the router that was still up, so eventually all of these timeout messages would stop.

Is that correct?  How can we determine how long that time period should be?  Minutes?  Hours?

thanks

 

Comment by Sonia Sharma (Inactive) [ 25/Apr/18 ]

Hi Olaf/Giuseppe

First my apologies for the delay in replying to this.

To answer Giuseppe's question -
what we see as the deadline on the routers in /proc/sys/lnet/routers is the value set from the LNet module parameter router_ping_timeout, whose default value is 50 seconds. This parameter specifies a timeout for the router checker when it checks live or dead routers. If the router checker does not get a reply message from the router within router_ping_timeout seconds, it considers the router to be down.

In lnet/lnet/router.c, we have this:

if (rtr->lpni_ping_deadline == 0) {
        rtr->lpni_ping_deadline =
                cfs_time_shift(router_ping_timeout);
}
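For illustration only, here is a small userspace sketch of that ping-deadline pattern (the sim_* names are mine, not the lnet router code, and the 50-second constant just mirrors the default mentioned above): a deadline is armed when a ping goes out, cleared when a reply arrives, and if the checker finds the deadline expired it marks the router down.

#include <stdio.h>
#include <time.h>

#define SIM_ROUTER_PING_TIMEOUT 50   /* seconds, mirroring the default of 50 */

struct sim_router {
        time_t ping_deadline;        /* 0 means no ping outstanding */
        int    alive;
};

/* Called when the checker sends a ping and none is already outstanding. */
static void sim_send_ping(struct sim_router *rtr, time_t now)
{
        if (rtr->ping_deadline == 0)
                rtr->ping_deadline = now + SIM_ROUTER_PING_TIMEOUT;
}

/* Called when a ping reply arrives. */
static void sim_ping_reply(struct sim_router *rtr)
{
        rtr->ping_deadline = 0;
        rtr->alive = 1;
}

/* Periodic check: no reply within the window => router is marked down. */
static void sim_check_router(struct sim_router *rtr, time_t now)
{
        if (rtr->ping_deadline != 0 && now >= rtr->ping_deadline) {
                rtr->alive = 0;
                rtr->ping_deadline = 0;
                printf("router marked down: ping unanswered for %d+ seconds\n",
                       SIM_ROUTER_PING_TIMEOUT);
        }
}

int main(void)
{
        struct sim_router rtr = { .alive = 1 };
        time_t now = time(NULL);

        sim_send_ping(&rtr, now);
        /* simulate the checker running after the window with no reply */
        sim_check_router(&rtr, now + SIM_ROUTER_PING_TIMEOUT + 1);
        return 0;
}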

Now the question is why and when the router is being marked down. Can you please collect neterror logs as I asked in my earlier comment?

Also, I am thinking this router being marked down might be related to LU-10707. But you mentioned that the other router is not having this issue, right?

 

For the queued txs being timed out in kiblnd_check_conns(), the timed-out txs are queued on the local list timedout_txs, which is later checked here:

if (!list_empty(&timedout_txs))
        kiblnd_txlist_done(&timedout_txs, -ETIMEDOUT);

Here the timed-out txs are eventually unmapped and marked done in kiblnd_tx_done(), and lnet_finalize() is ultimately called, signaling completion.
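As a small illustration of that drain step (again a userspace sketch with made-up sim_* names, not the o2iblnd code): the collected txs are simply walked and each one is completed with the error status, -ETIMEDOUT in this case.

#include <errno.h>
#include <stdio.h>

struct sim_tx {
        int status;
        int done;
};

/* Complete every tx on the collected list with the given error. */
static void sim_txlist_done(struct sim_tx **txs, int ntx, int error)
{
        for (int i = 0; i < ntx; i++) {
                txs[i]->status = error;
                txs[i]->done   = 1;   /* stands in for kiblnd_tx_done()/lnet_finalize() */
        }
}

int main(void)
{
        struct sim_tx a = { 0 }, b = { 0 };
        struct sim_tx *timedout[] = { &a, &b };

        sim_txlist_done(timedout, 2, -ETIMEDOUT);
        printf("tx a completed with status %d\n", a.status);   /* -110 on Linux */
        return 0;
}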

The deadline for these txs to time out is set in kiblnd_queue_tx_locked() using the module param kib_timeout. So it is only after this time period that a queued tx waiting for a connection will time out:

tx->tx_deadline = jiffies +
                  msecs_to_jiffies(*kiblnd_tunables.kib_timeout *
                                   MSEC_PER_SEC);

As for the particular connection timing out and being closed, the timeout is checked below and the connection is closed later:

timedout = kiblnd_conn_timed_out_locked(conn);
if (!sendnoop && !timedout) {
        spin_unlock(&conn->ibc_lock);
        continue;
}

if (timedout) {
        CERROR("Timed out RDMA with %s (%lu): "
               "c: %u, oc: %u, rc: %u\n",
               libcfs_nid2str(peer_ni->ibp_nid),
               cfs_duration_sec(cfs_time_current() -
                                peer_ni->ibp_last_alive),
               conn->ibc_credits,
               conn->ibc_outstanding_credits,
               conn->ibc_reserved_credits);
        list_add(&conn->ibc_connd_list, &closes);
}
 

 
Are  “Timed out RDMA” messages also seen in the logs?

Again, the neterror logs can give more information regarding the network related errors that are resulting in the issue.
