[LU-10831] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.19.1.53@o2ib100: 11 seconds Created: 20/Mar/18 Updated: 19/May/18 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Giuseppe Di Natale (Inactive) | Assignee: | Sonia Sharma (Inactive) |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | llnl | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Our development filesystem lquake is currently running Lustre 2.10 for testing. We've begun noticing the following chatter in the console logs of jet (the host of lquake): 2018-03-16 09:10:18 [62861.230420] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.19.1.53@o2ib100: 9 seconds 2018-03-16 09:10:18 [62861.242715] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 9 previous similar messages 2018-03-16 09:20:24 [63467.220203] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.19.1.53@o2ib100: 3 seconds 2018-03-16 09:20:24 [63467.232510] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 9 previous similar messages 2018-03-16 09:30:30 [64073.210011] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.19.1.53@o2ib100: 0 seconds 2018-03-16 09:30:30 [64073.222312] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 11 previous similar messages 2018-03-16 09:40:48 [64691.199577] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.19.1.53@o2ib100: 9 seconds 2018-03-16 09:40:48 [64691.211874] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 10 previous similar messages 2018-03-16 09:50:54 [65297.189318] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.19.1.53@o2ib100: 2 seconds 2018-03-16 09:50:54 [65297.201613] LNet: 15206:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 11 previous similar messages These messages began showing up when lquake was mounted by Lustre 2.10 clients and those clients had the same chatter in their console logs. Currently, lquake is being mounted by Lustre 2.8 clients and jet is still seeing the chatter in its console logs. Is this a symptom of a larger problem and how does this impact performance of the filesystem? Clients appear to be able to use the filesystem without any problems. |
| Comments |
| Comment by Peter Jones [ 21/Mar/18 ] |
|
Sonia Can you please advise Thanks Peter |
| Comment by Sonia Sharma (Inactive) [ 22/Mar/18 ] |
|
Hi Giuseppe, Can you turn on neterror logging so we see the LNet errors if there any? lctl set_param debug=+net lctl set_param debug=+neterror This timeout in kiblnd_check_conns() though might be because of some hardware issue. Can you check the health of IB on the node 172.19.1.53@o2ib100? |
| Comment by Olaf Faaland [ 22/Mar/18 ] |
|
Joe, The version affected says 2.10.0. Is that correct? I thought we were testing a later 2.10 tag. Please provide specific version information for Sonia so she can look at the relevant code. Thanks |
| Comment by Giuseppe Di Natale (Inactive) [ 22/Mar/18 ] |
|
Opps, I selected the wrong version. Our version is based on 2.10.3 (lustre-2.10.3_3.chaos-1.ch6.x86_64). |
| Comment by Olaf Faaland [ 22/Mar/18 ] |
|
Code is at https://github.com/LLNL/lustre/commits/2.10.3-llnl essentially 2.10.3 release plus a few unrelated patches. |
| Comment by Giuseppe Di Natale (Inactive) [ 22/Mar/18 ] |
|
So I looked into this some more. One of our router nodes was down according to all of jet's nodes (I checked /proc/sys/lnet/routers). The particular router had a value for deadline assigned on all jet nodes. I got the router up and the console log messages stopped. I glanced at the code and when a deadline is passed, it appears that connections are closed with ETIMEDOUT. But if those connections end up closed, why do we keep hitting the deadline and seeing the message? I also decided to take down another router that jet connects with and according to /proc/sys/lnet/routers the connection is down, but there is no deadline. How are these deadlines assigned? |
| Comment by James A Simmons [ 22/Mar/18 ] |
|
Is this a eth0 --> IB type of setup or all IB?
|
| Comment by Olaf Faaland [ 23/Mar/18 ] |
|
No tcp. opal188 is a router, and has an mlx5 interface and an omnipath interface. jet1 (MDS, complainer) mlx5 <===> 172.19.1.53@o2ib100 mlx5 – opal188 (router) – 192.168.128.188@o2ib18 omnipath <===> compute nodes
|
| Comment by Sonia Sharma (Inactive) [ 26/Mar/18 ] |
|
Hi Giuseppe, This deadline is set when a tx is queued on a queue to be sent. It is set using the kiblnd tunable - kib_timeout in kiblnd_queue_tx_locked(). Here tx->tx_deadline = jiffies +
msecs_to_jiffies(*kiblnd_tunables.kib_timeout *
MSEC_PER_SEC);
This deadline remains active until a confirmation is received for the tx to be completed. The message we see in the logs comes from the below code in kiblnd_check_conns() list_for_each_entry_safe(tx, tx_tmp, &peer_ni->ibp_tx_queue, tx_list) {
if (cfs_time_aftereq(jiffies, tx->tx_deadline)) {
CWARN("Timed out tx for %s: %lu seconds\n",
libcfs_nid2str(peer_ni->ibp_nid),
cfs_duration_sec(jiffies - tx->tx_deadline));
list_move(&tx->tx_list, &timedout_txs);
}
}
Here the txs on peer_ni->ibp_tx_queue are the txs waiting for a connection and they are scanned to check for the ones whose deadline has passed, That;s when we see the "timed out" messages in the logs
|
| Comment by Giuseppe Di Natale (Inactive) [ 27/Mar/18 ] |
|
Hi Sonia, Thanks for the reply and explanation. In the situation where a route has a deadline set and is considered down, we are seeing these messages flood our logs and they don't seem to stop. Is that the expected behavior? It makes me think that connections that should've been timed out, aren't in fact being timed out. Looking at kiblnd_check_conns(), it seems that it eventually attempts to close connections that timed out. Does that mean that these messages should stop once that connection closes? And where do the deadlines on the routes get assigned considering deadline is present in /proc/sys/lnet/routers? |
| Comment by Olaf Faaland [ 24/Apr/18 ] |
|
Sonia, It seems like the queued tx's should have all timed out eventually, and new tx's should all have been queued for the other connection to the router that was still up, and so - eventually - all of these timeout messages would stop. Is that correct? How can we determine how long that time period should be? Minutes? Hours? thanks
|
| Comment by Sonia Sharma (Inactive) [ 25/Apr/18 ] |
|
Hi Olaf/Giuseppe First my apologies for the delay in replying to this. To answer Giuseppe's question - In lnet/lnet/router.c, we have this 1079 if (rtr->lpni_ping_deadline == 0) {
1080 rtr->lpni_ping_deadline =
1081 cfs_time_shift(router_ping_timeout);
1082 }
Now the question is why and when the router is being marked as down. Can you please collect net error logs like I asked in my earlier comment? Also, I am thinking this router being marked down might be related to
For the queued TXs being timed out in kiblnd_check_conns(), it looks like the timed out TXs are queued on this list - timedout_txs. And this is later checked for here - 3236 if (!list_empty(&timedout_txs))
3237 kiblnd_txlist_done(&timedout_txs, -ETIMEDOUT);
Here the timedout TXs are eventually unmapped and marked done in kiblnd_tx_done() and lnet_finalize() is called ultimately signaling completion. The deadline for these TXs to timeout is set in kiblnd_queue_tx_locked() using the module param - kib_timeout. So it will be after this time-period that a queued TXs waiting for a connection will timeout tx->tx_deadline = jiffies +
msecs_to_jiffies(*kiblnd_tunables.kib_timeout *
MSEC_PER_SEC);
For the particular connection timing out and being closed, this is checked below and then later is closed- 3208 timedout = kiblnd_conn_timed_out_locked(conn); 3209 if (!sendnoop && !timedout) { 3210 spin_unlock(&conn->ibc_lock); 3211 continue; 3212 } 3213 3214 if (timedout) { 3215 CERROR("Timed out RDMA with %s (%lu): " 3216 "c: %u, oc: %u, rc: %u\n", 3217 libcfs_nid2str(peer_ni->ibp_nid), 3218 cfs_duration_sec(cfs_time_current() - 3219 peer_ni->ibp_last_alive), Again, the neterror logs can give more information regarding the network related errors that are resulting in the issue. |