[LU-13367] lnet_handle_local_failure messages every 10 min ? Created: 17/Mar/20 Updated: 15/Oct/20 Resolved: 15/Oct/20 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Michael Ethier (Inactive) | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
LNet routers built on Lenovo hardware with Lustre 2.13.0 installed. The IB card is a 2-port Lenovo ConnectX-5; one port is connected to an FDR fabric and one port to an HDR fabric. |
||
| Attachments: |
|
| Epic/Theme: | lustre-2.13 |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hello,

We are seeing lnet_handle_local_failure messages on our LNet routers roughly every 10 minutes, for example:

Mar 17 13:17:18 cannonlnet07 kernel: LNetError: 84267:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages

[root@cannonlnet07 ~]# nslookup 10.31.160.253
[root@cannonlnet07 ~]# nslookup 10.31.179.178
[root@cannonlnet07 ~]# more /etc/modprobe.d/lustre.conf
[root@cannonlnet07 ~]# ifconfig ib1 |
| Comments |
| Comment by Michael Ethier (Inactive) [ 17/Mar/20 ] |
|
These systems are also using the Mellanox OFED stack; installed on all 8 of them is the following:

[root@cannonlnet07 ~]# ofed_info
cc_mgr: dapl: fabric-collector: gpio-mlxbf: hcoll: i2c-mlx: ibacm: ibsim: ibutils: ibutils2: infiniband-diags: iser: isert: kernel-mft: knem: libibumad: libibverbs: librdmacm: mlnx-en: mlnx-ethtool: mlnx-nvme: mlnx-ofa_kernel: mlnx-rdma-rxe: mlx-bootctl: mlx-l3cache: mlx-pmc: mlx-trio: mlxbf-livefish: mpi-selector: mpitests: mstflint: multiperf: mxm: nvme-snap: ofed-docs: openmpi: opensm: openvswitch: pka-mlxbf: qperf: rdma-core: sharp: sockperf: srp: srptools: tmfifo: ucx:

Installed Packages: kmod-srp |
| Comment by Amir Shehata (Inactive) [ 17/Mar/20 ] |
|
These messages have been reduced in severity. Here is the patch:
LU-13071 lnet: reduce log severity for health events
It would be interesting to know what happens at the 10 minute interval, though. Can you share the output of "lnetctl stats show"? Are there any other errors around the same time? Would it be possible to enable net logging (lctl set_param debug=+net) and capture the logs around that time?
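For reference, a minimal sequence for that first round of data collection would look something like the following (a sketch using the same commands quoted above; the debug mask change is temporary and is reverted at the end):

lnetctl stats show          # LNet-wide message and error counters
lctl set_param debug=+net   # add "net" to the debug mask so LNet/o2iblnd events are traced
# ... wait for the lnet_handle_local_failure messages to recur ...
lctl set_param debug=-net   # remove "net" from the debug mask again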
|
| Comment by Michael Ethier (Inactive) [ 17/Mar/20 ] |
|
[root@cannonlnet07 ~]# lnetctl stats show

Last 200 lines of lnet messages: I don't see anything of big interest, do you?

Sure, I will enable debug and send the logs. |
| Comment by Michael Ethier (Inactive) [ 17/Mar/20 ] |
|
Re: "Are there any other errors around the same time? Would it be possible to enable net logging (lctl set_param debug=+net) and capture the logs around that time?"

What are the additional commands to capture the "logs"? |
| Comment by Amir Shehata (Inactive) [ 17/Mar/20 ] |
|
Before we capture the logs, can we try the recommendation below and monitor the errors?

I see a few tx timeouts and a couple of PUT_NACKs. These could result in the failure of some RDMAs, which triggers the health code. I see that you have:

transaction_timeout: 10
retry_count: 3

The defaults have since been changed to 50s and 2 respectively. We found that on larger clusters a 10s timeout is too short, causing RDMA timeouts. Can you try setting it to 50s? The patch which changed the defaults is:

LU-13145 lnet: use conservative health timeouts

You can set it manually:

lnetctl set transaction_timeout 50
lnetctl set retry_count 2
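A quick way to verify the change took effect, and optionally to make it persistent (saving to /etc/lnet.conf assumes the node loads its LNet configuration from that file at boot, which may differ on your site):

lnetctl global show              # transaction_timeout and retry_count appear under the global section
lnetctl export > /etc/lnet.conf  # optional: dump the running configuration so it survives a reboot
|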
| Comment by Michael Ethier (Inactive) [ 17/Mar/20 ] |
|
I set those 2 parameters and I still see the recovery messages:

[root@cannonlnet07 ~]# lnetctl global show |
| Comment by Amir Shehata (Inactive) [ 17/Mar/20 ] |
|
I'm wondering if there is a threshold for the transaction_timeout where this goes away. Can you try setting that to 100? If you still see the problem, I would:
lctl set_param debug=+net
lctl debug_daemon start lustre.dk [megabytes] # make it as big as possible, e.g. 1G (if you have the space)
# wait until the problem happens
lctl debug_daemon stop
lctl set_param debug=-net
lctl debug_file lustre.dk lustre.log
# attach or upload the lustre.log file depending on how big the file is
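If the converted log is too big to upload in one piece, one option (just a sketch, with grep patterns taken from the messages discussed so far) is to trim it down to the LNet-related lines first:

grep -E 'o2iblnd|lnet_handle_local_failure' lustre.log > lustre-net.log   # keep only the lines of interest
# then compress and attach lustre-net.log instead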
Here is the relevant Lustre manual section for the debug_daemon commands: 37.2.3.1. lctl debug_daemon Commands |
| Comment by Michael Ethier (Inactive) [ 18/Mar/20 ] |
|
Hi, I still have the messages with the 100 setting, so I captured the lustre.log when the message occurred. It's attached. The messages occurred at Mar 17 20:28:03, see below.

Mar 17 20:27:26 cannonlnet07 kernel: Lustre: debug daemon will attempt to start writing to /root/lustre.dk (512000kB max) |
| Comment by Michael Ethier (Inactive) [ 19/Mar/20 ] |
|
Hello, have you had a chance to look at the log file to see if you can identify the cause of the ongoing messages? |
| Comment by Amir Shehata (Inactive) [ 19/Mar/20 ] |
|
So it looks like there are a couple of nodes which are causing all the problems:

00000800:00000100:2.0:1584491283.682847:0:110122:0:(o2iblnd_cb.c:2289:kiblnd_peer_connect_failed()) Deleting messages for 10.31.176.98@o2ib4: connection failed
00000800:00000100:2.0:1584491283.691935:0:110122:0:(o2iblnd_cb.c:2289:kiblnd_peer_connect_failed()) Deleting messages for 10.31.167.172@o2ib: connection failed

Whenever we try to connect to these peers, the connection fails. The code assumes the reason for the failure is local, so it puts the local NI 10.31.179.178@o2ib4 into recovery, and that's when you see the message. The reason connections to these NIDs are failing is:

(o2iblnd_cb.c:3174:kiblnd_cm_callback()) 10.31.176.98@o2ib4: ADDR ERROR -110

We're trying to resolve them and we're timing out. Are these nodes "real"? Are they leftover configuration?
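One quick check from the router would be to ping the two NIDs from the log above directly over LNet, to see whether the resolution failure reproduces outside of the health machinery:

lnetctl ping 10.31.176.98@o2ib4              # expected to fail or time out if the ADDR ERROR is persistent
lnetctl ping 10.31.167.172@o2ib
lnetctl peer show --nid 10.31.176.98@o2ib4   # check whether the peer is still configured on the router
|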
| Comment by Michael Ethier (Inactive) [ 19/Mar/20 ] |
|
Hi,

[root@holy2c18110 ~]# ifconfig ib0
[root@holy2c18110 ~]# ibstat |
| Comment by Amir Shehata (Inactive) [ 19/Mar/20 ] |
|
That message has been reduced in severity, as I indicated in a previous comment, so once you upgrade you won't see it anymore. If hosts coming up and down is expected, then there shouldn't be a problem beyond the noisiness of this message. The address resolution error is only seen when net logging is turned on. |
| Comment by Michael Ethier (Inactive) [ 20/Mar/20 ] |
|
Hi, OK thanks. Do you recommend we upgrade to a newer 2.13.x version, or drop down to 2.12.4 (or 2.12.5 when it comes out)? We don't really care about the health and multi-rail functions. We mostly have clients at 2.10.7, the LNet routers at 2.13.0, and Lustre storage at 2.12.3 and 2.12.4, plus some really old Lustre filesystems running 2.5.34 that we are going to decommission. |
| Comment by Peter Jones [ 20/Mar/20 ] |
|
My recommendation would be to use a 2.12.x release. If there is a bug fix missing from the 2.12.x branch we can include that in 2.12.5. ashehata, do you agree? |
| Comment by Amir Shehata (Inactive) [ 20/Mar/20 ] |
|
Sure. If there is no need for the features in 2.13, then the latest 2.12.x would suffice. |
| Comment by Michael Ethier (Inactive) [ 24/Mar/20 ] |
|
Thanks for the help. I will let others on my team know about the 2.12 vs 2.13 recommendations. |
| Comment by John Hammond [ 15/Oct/20 ] |
|
Both changes referenced above (LU-13071 and LU-13145) have landed, so closing this as fixed.