[LU-15234] LNet high peer reference counts inconsistent with queue Created: 16/Nov/21 Updated: 17/Dec/22 Resolved: 25/Oct/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Olaf Faaland | Assignee: | Serguei Smirnov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
lustre-2.12.7_2.llnl-2.ch6.x86_64 |
||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I believe that peer reference counts may not be decremented in some LNet error path, or that the size of the queue is not accurately reported by "lctl get_param peers". The reference counts reported as "refs" by "lctl get_param peers" are increasing linearly with time. This is in contrast with "queue", which periodically spikes but then drops to 0 again. Below are 4 routers on ruby which have refs > 46,000 for a route to 172.19.2.24@o2ib100 even though the reported queue is 0. This is just a little over 6 days since the ruby routers were rebooted during an update.
[root@ruby1009:~]# pdsh -v -g router lctl get_param peers 2>/dev/null | awk '$3 > 20 {print}' | sed 's/^.*://' | sort -V -u
172.19.2.24@o2ib100 46957 up 5 8 -46945 -46945 8 -13 0
172.19.2.24@o2ib100 47380 up 1 8 -47368 -47368 8 -23 0
172.19.2.24@o2ib100 48449 up 15 8 -48437 -48437 8 -17 0
172.19.2.24@o2ib100 49999 up 3 8 -49987 -49987 8 -7 0
The ruby routers have an intermittent LNet communication problem (the fabric itself seems fine according to several tests, so the underlying issue is still under investigation).
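A minimal way to watch the growth rate on a single router is to sample the refs column periodically; this is only a sketch, and the 1000 threshold and 10-minute interval here are arbitrary choices, not site settings:
# print any peer whose refs count exceeds the threshold, with a timestamp
while sleep 600; do
    date
    lctl get_param peers 2>/dev/null | awk '$2+0 > 1000 {print}'
done
|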
| Comments |
| Comment by Olaf Faaland [ 16/Nov/21 ] |
|
My local issue is TOSS5305 |
| Comment by Peter Jones [ 16/Nov/21 ] |
|
Serguei Could you please advise? Thanks Peter |
| Comment by Serguei Smirnov [ 16/Nov/21 ] |
|
Hi Olaf, From the graph it looks like the refcount is growing slowly and constantly. If it grows fast enough such that it is likely to increment within a reasonably short window (short enough so the debug log is not overwritten), could you please capture the net debug log for the window:
lctl set_param debug=+net
lctl dk clear
---- wait for the increment ----
lctl dk > log.txt
lctl set_param debug=-net
If the increment window is too long for capturing a debug log to be practical, please provide the syslog instead. Before and after the debug window, please capture:
lnetctl stats show
lnetctl peer show -v 4 --nid <peer nid that leaks refcount>
I agree that there may be a problem with the refcount not being decremented on some error path. Hopefully the debug data can help narrow down which path it is. Thanks, Serguei.
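Put together, the whole capture can be scripted along these lines (a sketch only; the NID is the 172.19.2.24@o2ib100 example from the description and the output file names are arbitrary):
lnetctl stats show > stats.before.txt
lnetctl peer show -v 4 --nid 172.19.2.24@o2ib100 > peer.before.txt
lctl set_param debug=+net
lctl dk clear
# ---- wait for the refcount to increment ----
lctl dk > log.txt
lctl set_param debug=-net
lnetctl stats show > stats.after.txt
lnetctl peer show -v 4 --nid 172.19.2.24@o2ib100 > peer.after.txt
|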
| Comment by Olaf Faaland [ 16/Nov/21 ] |
|
Hi Serguei, I've attached debug logs and the peer and stats before-and-after output. |
| Comment by Olaf Faaland [ 18/Nov/21 ] |
|
Hi Serguei, |
| Comment by Serguei Smirnov [ 18/Nov/21 ] |
|
Hi Olaf, I haven't had a chance yet to properly process what you have provided. I should be able to give you an update tomorrow. Thanks, Serguei. |
| Comment by Serguei Smirnov [ 20/Nov/21 ] |
|
refcount appears to be going up at the same rate as rtr_credits are going down (64 between the two "peer show" snapshots). Peer status changed to "down" as we likely didn't receive a router check ping. Nothing is received from 172.19.2.24@o2ib100 at the lnet level, but it appears that the peer is receiving at least some messages and is returning credits at the lnd level. Could you please list
dead_router_check_interval
live_router_check_interval
router_ping_timeout
peer_timeout
from both nodes? Is 172.19.2.24@o2ib100 another router? Thanks, Serguei. |
| Comment by Olaf Faaland [ 20/Nov/21 ] |
|
Hi Serguei, Yes, 172.19.2.24 is orelic4, one of the IB-to-TCP "RELIC" routers that is still at Lustre 2.10.
ruby1016:
/sys/module/lnet/parameters/dead_router_check_interval:60
/sys/module/lnet/parameters/live_router_check_interval:60
/sys/module/lnet/parameters/router_ping_timeout:50
/sys/module/ko2iblnd/parameters/peer_timeout:180
orelic4:
/sys/module/lnet/parameters/dead_router_check_interval:60
/sys/module/lnet/parameters/live_router_check_interval:60
/sys/module/lnet/parameters/router_ping_timeout:50
/sys/module/ko2iblnd/parameters/peer_timeout:180
thanks, |
| Comment by Serguei Smirnov [ 22/Nov/21 ] |
|
It could be related to another known issue; the 2.10 version probably doesn't have the fix for it. Could you please provide, from both sides:
lctl --version
lnetctl net show -v 4
to check what the credit-related settings are? Thanks, Serguei. |
| Comment by Olaf Faaland [ 22/Nov/21 ] |
|
Hi Serguei, Attached as lctl.version.orelic4.1637616867.txt, along with module params for orelic4, since the 2.10 lnetctl doesn't report as much with "net show". Thanks |
| Comment by Chris Horn [ 24/Nov/21 ] |
|
FYI, I think this may be caused by a bug in discovery. I noticed this same symptom on a router while I was doing some internal testing. The router had high reference counts for three peers (all Lustre servers):
nid00053:~ # awk '{if ($2 > 1){print $0}}' /sys/kernel/debug/lnet/peers
nid refs state last max rtr min tx min queue
10.13.100.57@o2ib11 2488 up -1 16 -2359 -2359 16 -18 0
10.13.100.53@o2ib11 1917 up -1 16 -1788 -1788 16 -5 0
10.13.100.55@o2ib11 2582 up -1 16 -2453 -2453 16 -41 0
nid00053:~ #
Here I'm just locating the lnet_peer.lp_rtrq for each of these peers:
crash_x86_64> p &the_lnet.ln_peer_tables[0].pt_peer_list
$7 = (struct list_head *) 0xffff8803ea46af90
crash_x86_64> p &the_lnet.ln_peer_tables[1].pt_peer_list
$8 = (struct list_head *) 0xffff8803e9298450
crash_x86_64> p &the_lnet.ln_peer_tables[2].pt_peer_list
$9 = (struct list_head *) 0xffff8803e9298750
crash_x86_64> list -H 0xffff8803ea46af90 -s lnet_peer.lp_primary_nid | egrep -B 1 -e 1407422296843321 -e 1407422296843317 -e 1407422296843319
ffff8803e9297400
lp_primary_nid = 1407422296843321 <<< 10.13.100.57@o2ib11
crash_x86_64> list -H 0xffff8803e9298450 -s lnet_peer.lp_primary_nid | egrep -B 1 -e 1407422296843321 -e 1407422296843317 -e 1407422296843319
crash_x86_64> list -H 0xffff8803e9298750 -s lnet_peer.lp_primary_nid | egrep -B 1 -e 1407422296843321 -e 1407422296843317 -e 1407422296843319
ffff8803ea3e2e00
lp_primary_nid = 1407422296843319 <<< 10.13.100.55@o2ib11
ffff8803864ab600
lp_primary_nid = 1407422296843317 <<< 10.13.100.53@o2ib11
crash_x86_64> struct -o lnet_peer ffff8803e9297400 | grep lp_rtrq
[ffff8803e9297470] struct list_head lp_rtrq;
crash_x86_64> struct -o lnet_peer ffff8803ea3e2e00 | grep lp_rtrq
[ffff8803ea3e2e70] struct list_head lp_rtrq;
crash_x86_64> struct -o lnet_peer ffff8803864ab600 | grep lp_rtrq
[ffff8803864ab670] struct list_head lp_rtrq;
crash_x86_64> list -H ffff8803e9297470 | wc -l
2389
crash_x86_64> list -H ffff8803ea3e2e70 | wc -l
2481
crash_x86_64> list -H ffff8803864ab670 | wc -l
1815
crash_x86_64>
While trying to track down where the bottleneck was, I noticed that there are two @gni peers that seem to be stuck in discovery:
crash_x86_64> struct -o lnet the_lnet | grep ln_dc
[ffffffffa0307078] lnet_handler_t ln_dc_handler;
[ffffffffa0307080] struct list_head ln_dc_request;
[ffffffffa0307090] struct list_head ln_dc_working;
[ffffffffa03070a0] struct list_head ln_dc_expired;
[ffffffffa03070b0] wait_queue_head_t ln_dc_waitq;
[ffffffffa03070c8] int ln_dc_state;
crash_x86_64> list -H ffffffffa0307090 -o 224
ffff8803e9297c00
ffff8803e91cc200
crash_x86_64> lnet_peer.lp_primary_nid ffff8803e9297c00
lp_primary_nid = 3659174697238582
crash_x86_64> lnet_peer.lp_primary_nid ffff8803e91cc200
lp_primary_nid = 3659174697238578
crash_x86_64> epython nid2str.py 3659174697238582
54@gni
crash_x86_64> epython nid2str.py 3659174697238578
50@gni
crash_x86_64>
These peers were last processed by the discovery thread hours ago:
crash_x86_64> lnet_peer.lp_last_queued ffff8803e9297c00
lp_last_queued = 1637750512
crash_x86_64> lnet_peer.lp_last_queued ffff8803e91cc200
lp_last_queued = 1637750211
crash_x86_64>
pollux-p4:~ # date -d @1637750512
Wed Nov 24 04:41:52 CST 2021
pollux-p4:~ # date -d @1637750211
Wed Nov 24 04:36:51 CST 2021
pollux-p4:~ #
The router was dumped a little under 10 hours later: DATE: Wed Nov 24 14:28:12 2021
The stuck peers have a state that is inconsistent with being on the ln_dc_working queue:
crash_x86_64> lnet_peer.lp_state ffff8803e9297c00
lp_state = 338
crash_x86_64> lnet_peer.lp_state ffff8803e91cc200
lp_state = 338
crash_x86_64>
*hornc@cflosbld09 fs4 $ lpst2str.sh 338
LNET_PEER_NO_DISCOVERY LNET_PEER_DISCOVERED LNET_PEER_DISCOVERING LNET_PEER_NIDS_UPTODATE
*hornc@cflosbld09 fs4 $
These @gni peers have numerous messages on the lnet_peer.lp_dc_pendq:
crash_x86_64> struct -o lnet_peer ffff8803e9297c00 | grep lp_dc_pendq
[ffff8803e9297c20] struct list_head lp_dc_pendq;
crash_x86_64> struct -o lnet_peer ffff8803e91cc200 | grep lp_dc_pendq
[ffff8803e91cc220] struct list_head lp_dc_pendq;
crash_x86_64> list -H ffff8803e9297c20 | wc -l
214
crash_x86_64> list -H ffff8803e91cc220 | wc -l
170
crash_x86_64>
It is likely the case that those messages are what's consuming and not letting go of the lpni_rtrcredits for the three @o2ib11 peers that show the high reference counts. I haven't yet figured out why the @gni peers are stuck in discovery. |
| Comment by Chris Horn [ 24/Nov/21 ] |
|
Olaf, does your 2.12.7 router have the fix from |
| Comment by Chris Horn [ 27/Nov/21 ] |
|
I discovered a race between the discovery thread and other threads that are queueing a peer for discovery. When the discovery thread finishes processing a peer it calls lnet_peer_discovered(), which clears the LNET_PEER_DISCOVERING bit from the peer state:
lp->lp_state |= LNET_PEER_DISCOVERED;
lp->lp_state &= ~(LNET_PEER_DISCOVERING |
LNET_PEER_REDISCOVER);
At this point, the peer is on the lnet.ln_dc_working queue. When lnet_peer_discovered() returns, the lnet_peer.lp_lock spinlock is dropped, and the discovery thread acquires the lnet_net_lock/EX. This is where the race window exists:
spin_unlock(&lp->lp_lock);
<<< Race window >>>
lnet_net_lock(LNET_LOCK_EX);
If another thread queues this peer for discovery during this window, then the LNET_PEER_DISCOVERING bit is added back to the peer state, but since the peer is already on the lnet.ln_dc_working queue, it does not get added to the lnet.ln_dc_request queue.
static int lnet_peer_queue_for_discovery(struct lnet_peer *lp)
...
spin_lock(&lp->lp_lock);
if (!(lp->lp_state & LNET_PEER_DISCOVERING))
lp->lp_state |= LNET_PEER_DISCOVERING;
spin_unlock(&lp->lp_lock);
if (list_empty(&lp->lp_dc_list)) { <<< Peer is on ln_dc_working
lnet_peer_addref_locked(lp);
list_add_tail(&lp->lp_dc_list, &the_lnet.ln_dc_request);
...
When the discovery thread acquires the lnet_net_lock/EX, it sees that the LNET_PEER_DISCOVERING bit has not been cleared, so it does not call lnet_peer_discovery_complete(), which is responsible for sending messages on the peer's discovery pending queue.
spin_unlock(&lp->lp_lock);
<<< Race window >>>
lnet_net_lock(LNET_LOCK_EX);
...
if (!(lp->lp_state & LNET_PEER_DISCOVERING))
lnet_peer_discovery_complete(lp);
At this point, the peer is stuck on the lnet.ln_dc_working queue, and messages may continue to accumulate on the peer's lnet_peer.lp_dc_pendq. |
| Comment by Olaf Faaland [ 29/Nov/21 ] |
|
Hi Chris,
No, it does not. |
| Comment by Gerrit Updater [ 29/Nov/21 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45670 |
| Comment by Chris Horn [ 29/Nov/21 ] |
|
Thanks, Olaf.
From what I can tell, the discovery defect that I found has been there a long time, but maybe I am missing something that has caused this issue to start manifesting. Can you provide any additional detail about when you started seeing this issue? Was everything working fine until this update that you referenced in the description? What did the update entail? |
| Comment by Olaf Faaland [ 29/Nov/21 ] |
|
Hi Chris, I first documented the high refcounts 2021-09-17. I recall seeing it before then, but I'm not sure how long before. Our systems were updated to our 2.12.7_2.llnl tag (the tag we're still on) about 2021-08-10. There was a new issue subsequent to that August update - we started seeing some router nodes reporting "Timed out RDMA" with some other routers for no reason we could find, like this:
[root@orelic1:~]# grep 'LNetError.*Timed out RDMA' /var/log//conman/console.orelic4 | grep ^2021-08 | nidgrep | sort -V | uniq -c
1 19.1.104@o2ib10
1 172.19.1.101@o2ib100
896 172.19.1.103@o2ib100
674 172.19.1.104@o2ib100
1 172.19.2.7@o2ib100
1 172.19.2.8@o2ib100
1 172.19.2.40@o2ib100
1 172.19.2.43@o2ib100
1 172.19.2.44@o2ib100
1 172.19.2.46@o2ib100
where 172.19.1.[101-104] all have the same hardware, same LNet versions, and same role (i.e. they were on the path to the same endpoints). The LNet-related patches that were new to the _2.llnl tag, which was what we updated to in August, were:
|
| Comment by Gerrit Updater [ 30/Nov/21 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45681 |
| Comment by Chris Horn [ 30/Nov/21 ] |
|
Olaf, I pushed a backport of this patch to b2_12, just in case you want to try it and see if it resolves your issue. |
| Comment by Olaf Faaland [ 30/Nov/21 ] |
|
Thank you, Chris. I will try it. |
| Comment by Olaf Faaland [ 30/Nov/21 ] |
|
Serguei, before I do try Chris' patch, can you or Amir review it (at least the patch against master)? Thank you. |
| Comment by Olaf Faaland [ 09/Dec/21 ] |
|
Serguei and Chris, What is different is that we now see "queue" (reported by lctl get_param peers) climb along with "refs". So I think we have made progress, but there is some other issue. |
| Comment by Chris Horn [ 09/Dec/21 ] |
|
Olaf, would you be able to collect a crash dump, vmlinux and lustre kos from a router experiencing the high refcount issue? preferably from a router running the patch? |
| Comment by Olaf Faaland [ 09/Dec/21 ] |
|
Hi Chris, I need to check whether I can make a crash dump available. |
| Comment by Olaf Faaland [ 09/Dec/21 ] |
|
Chris, I can make a crash dump available to you. Can you let me know a way I can send it to you privately? Serguei, if it would help you to have it also, please let me know a way to send it. thanks, |
| Comment by Chris Horn [ 10/Dec/21 ] |
Yes, I need to check with our new IT org on the best way to do it. I'll let you know. |
| Comment by Chris Horn [ 10/Dec/21 ] |
|
Olaf, you can upload to this ftp:
sftp -o Port=2222 lu15234@ftp.ext.hpe.com
pass: e$VS3mw_
> put <filename> |
| Comment by Gian-Carlo Defazio [ 10/Dec/21 ] |
|
Olaf is not officially working today. I'll send you the tarball he made yesterday, which includes the kernel dump. Sent. Hopefully it showed up on your side. |
| Comment by Chris Horn [ 13/Dec/21 ] |
|
defazio I got it. Thank you |
| Comment by Chris Horn [ 13/Dec/21 ] |
|
defazio ofaaland can you provide the lustre-debuginfo package that matches kmod-lustre-2.12.8_3.llnl-1.ch6.x86_64.rpm ? |
| Comment by Gian-Carlo Defazio [ 13/Dec/21 ] |
|
I've sent lustre-debuginfo-2.12.8_3.llnl-1.ch6.x86_64.rpm |
| Comment by Chris Horn [ 13/Dec/21 ] |
|
defazio ofaaland can you provide the output of 'lnetctl global show' from one peer on each cluster? What I mean is, I believe you have something like:
(Cluster A) <-> (Router Cluster B) <-> (Router Cluster C) <-> (Cluster D)
and I'm looking for 'lnetctl global show' output from one peer in each of Cluster A, B, C, and D. Also, could you let me know if you have any tuning changes in place on each cluster? By this I mean, are you explicitly setting any lnet/ko2iblnd/ksocklnd kernel module parameters, or doing any tuning by executing 'lnetctl set' commands, etc.? Thanks.
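Something along these lines would gather it in one pass; the host list is a placeholder for one representative node per cluster, and the pdsh idiom is borrowed from the commands earlier in this ticket:
# one representative node from each cluster (example hostnames)
pdsh -w clientA,ruby1009,orelic4,serverD 'lnetctl global show'
pdsh -w clientA,ruby1009,orelic4,serverD 'grep . /sys/module/lnet/parameters/* /sys/module/ko2iblnd/parameters/* /sys/module/ksocklnd/parameters/* 2>/dev/null'
|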
| Comment by Chris Horn [ 13/Dec/21 ] |
|
The fix I authored is only applicable to LNet peers that undergo discovery. In 2.12 LTS, router peers do not undergo discovery, so that explains why the fix didn't help with your issue. In the dump, we can see this peer with a high refcount (relative to other peers; there are a handful that have refcounts between 10-20) and negative lpni_rtrcredits:
crash_x86_64> epython lnet.py -p --nid 172.16.70.63@tcp -d
lnet_peer: ffff9c9bc20cb980
lp_primary_nid: 172.16.70.63@tcp
lp_state:
lp_dc_pendq: ffff9c9bc20cb9a0(0)
lp_dc_list: ffff9c9bc20cba18(0)
lp_peer_nets: ffff9c9bc20cb990
- lnet_peer_net: tcp(ffff9cbba3ff4cc0)
- lpn_peer_nis: ffff9cbba3ff4cd0
- lnet_peer_ni: ffff9c9bc3b18600
- lpni_nid: 172.16.70.63@tcp
- lpni_refcount: !!!!!!245!!!!!!
- lpni_healthv: 1000
- lpni_txcredits: 3
- lpni_mintxcredits: 0
- lpni_rtrcredits: -226
- lpni_minrtrcredits: -226
- lpni_rtrq: ffff9c9bc3b18650(226)
- lpni_last_alive: 847
- lpni_txq: ffff9c9bc3b18640(0)
In the debug log extracted from the dump, we can see timeout errors for this peer:
00000800:00000100:21.0:1639088392.413370:0:16075:0:(socklnd_cb.c:2390:ksocknal_find_timed_out_conn()) Timeout receiving from 12345-172.16.70.63@tcp (172.16.70.63:988), state 4 wanted 0 left 0
*hornc@cflosbld09 dec9-lu-15234-llnl $ grep 172.16.70.63@tcp dk.log.fmt | grep -c ksocknal_find_timed_out_conn
116
*hornc@cflosbld09 dec9-lu-15234-llnl $
Your LND timeout is currently set to the b2_12 default, which is 5 seconds:
crash_x86_64> lnet_lnd_timeout
lnet_lnd_timeout = $497 = 5
crash_x86_64>
I think we need to increase this timeout value. However, there are some quirks with how this value is set in b2_12. I'm guessing that you have lnet_health_sensitivity=0. That setting results in lnet_transaction_timeout being set to 50, and lnet_retry_count being set to 0, but it doesn't update the lnet_lnd_timeout correctly (see the patches for the related ticket). The easiest solution is to just explicitly set the lnet_transaction_timeout to some value not equal to 50. This needs to be done after the lnet_health_sensitivity is set to 0:
options lnet lnet_health_sensitivity=0
options lnet lnet_transaction_timeout=49
or at runtime (doesn't persist across reboots):
lnetctl set transaction_timeout 49
You might experiment with this value to see what's the smallest value that resolves the issue, but I wouldn't go any lower than 10 seconds, and I would avoid anything > 50. If you're still seeing timeouts/network errors with it set to 49 then you may have other issues with your network that warrant investigation (bad cables, etc.). The above assumes you are tuning all of your clusters the same way. If that isn't the case, then if you provide the information I requested in my previous comment I can provide specific tuning guidance for each cluster. |
| Comment by Chris Horn [ 13/Dec/21 ] |
|
ssmirnov whamcloud might want to consider backporting |
| Comment by Gian-Carlo Defazio [ 14/Dec/21 ] |
|
Uploading file params_20211213.tar.gz |
| Comment by Chris Horn [ 15/Dec/21 ] |
|
Thanks defazio, but I'm a little confused by the information that you've provided. The data in the tarball suggests that you have the following:
orelic2 - Lustre 2.10 - local networks tcp0, o2ib100 w/routes to various other o2ib networks
Is that right? The crash dump you provided has "NODENAME: orelic4", but this node was running 2.12. Can you clarify? Also, we can see in the crash dump that the peer received a connection request from a node with ko2iblnd peer_credits=32, but I do not see that parameter specified anywhere in the tarball:
crash_x86_64> dmesg | grep kiblnd_passive_connect
[ 145.267098] LNetError: 437:0:(o2iblnd_cb.c:2554:kiblnd_passive_connect()) Can't accept conn from 172.19.1.54@o2ib100, queue depth too large: 32 (<=8 wanted)
[ 185.922065] LNetError: 437:0:(o2iblnd_cb.c:2554:kiblnd_passive_connect()) Can't accept conn from 172.19.1.55@o2ib100, queue depth too large: 32 (<=8 wanted)
[ 185.938289] LNetError: 437:0:(o2iblnd_cb.c:2554:kiblnd_passive_connect()) Skipped 3 previous similar messages
crash_x86_64>
hornc@C02V50B9HTDG params_20211213 % grep -a 'ko2iblnd peer_credits' *
hornc@C02V50B9HTDG params_20211213 % grep -a peer_credits *
ko2iblnd.parameters.orelic2.1639444262:/sys/module/ko2iblnd/parameters/peer_credits:8
ko2iblnd.parameters.orelic2.1639444262:/sys/module/ko2iblnd/parameters/peer_credits_hiw:0
ko2iblnd.parameters.orelic2.1639444262:/sys/module/ksocklnd/parameters/peer_credits:8
ko2iblnd.parameters.orelic2.1639444262:/sys/module/ksocklnd/parameters/peer_credits:8
ko2iblnd.parameters.orelic2.1639444262: peer_credits: 0
ko2iblnd.parameters.orelic2.1639444262: peer_credits: 8
ko2iblnd.parameters.orelic2.1639444262: peer_credits: 8
ko2iblnd.parameters.orelic2.1639444262: peer_credits: 0
ko2iblnd.parameters.orelic2.1639444262: peer_credits: 8
ko2iblnd.parameters.orelic2.1639444262: peer_credits: 8
ko2iblnd.parameters.orelic2.1639444262: peer_credits: 0
ko2iblnd.parameters.orelic2.1639444262: peer_credits: 8
ko2iblnd.parameters.orelic2.1639444262: peer_credits: 8
ko2iblnd.parameters.zrelic2.1639443533:/sys/module/ko2iblnd/parameters/peer_credits:8
ko2iblnd.parameters.zrelic2.1639443533:/sys/module/ko2iblnd/parameters/peer_credits_hiw:0
ksocklnd.parameters.orelic2.1639444379:/sys/module/ksocklnd/parameters/peer_credits:8
ksocklnd.parameters.zrelic2.1639443594:/sys/module/ksocklnd/parameters/peer_credits:8
lnetctl-net-show.orelic2.1639444148: peer_credits: 0
lnetctl-net-show.orelic2.1639444148: peer_credits: 8
lnetctl-net-show.orelic2.1639444148: peer_credits: 8
lnetctl-net-show.zrelic2.1639443299: peer_credits: 0
lnetctl-net-show.zrelic2.1639443299: peer_credits: 8
lnetctl-net-show.zrelic2.1639443299: peer_credits: 8
lnetctl-net-show.zrelic2.1639443299~: peer_credits: 0
lnetctl-net-show.zrelic2.1639443299~: peer_credits: 8
lnetctl-net-show.zrelic2.1639443299~: peer_credits: 8
hornc@C02V50B9HTDG params_20211213 %
What am I missing? |
| Comment by Chris Horn [ 15/Dec/21 ] |
|
In any case, getting back to the timeout issue. Lustre 2.10 has default LND timeouts of 50 seconds for both ksocklnd and ko2iblnd. You can see that in the parameters file for orelic2:
ko2iblnd.parameters.orelic2.1639444262:/sys/module/ko2iblnd/parameters/timeout:50
ko2iblnd.parameters.orelic2.1639444262:/sys/module/ksocklnd/parameters/sock_timeout:50
So we probably want to get the 2.12 nodes to match that. I figured out that we can get exactly 50 with this set of parameters (note, the order is important):
options lnet lnet_retry_count=0 # Sets lnet_lnd_timeout = lnet_transaction_timeout (lnet_transaction_timeout should have default value of 50)
options lnet lnet_health_sensitivity=0
I would suggest putting this configuration in place on all your Lustre 2.12 nodes. If you are still seeing network timeout issues then I would suggest doing some investigation into the network to see if it is healthy.
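As a concrete sketch, on a 2.12 node these lines could go into a modprobe config file; the /etc/modprobe.d/lnet.conf path is an assumption (use whichever file your site already loads lnet options from), and the two lines must stay in this order:
# /etc/modprobe.d/lnet.conf (assumed path); order matters
options lnet lnet_retry_count=0
options lnet lnet_health_sensitivity=0
|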
| Comment by Olaf Faaland [ 16/Dec/21 ] |
|
Hi Chris, Regarding the parameters, I'm attaching them for an orelic node running Lustre 2.12, which is when we see the symptoms described in this issue. The orelic/zrelic nodes currently run Lustre 2.10 because of the issues we've seen, which is why you got the parameters for that configuration. orelic and zrelic are configured the same, except for the routes: orelic4-lustre212-20211216.tgz
After gathering these parameters, we put the configuration you suggest in place on orelic4 while it was running Lustre 2.12 and verified that lnet_lnd_timeout was set to 50. The node ran first with Lustre 2.12 and our stock settings (as in the attached tarball) and refs built up to about 590. We then set lnet_health_sensitivity=100, set lnet_retry_count=2, then set lnet_retry_count=0, then set lnet_health_sensitivity=0. After this we observed refs continued to climb, but much more slowly - the rate was probably 1/4 or less of the rate of climb before changing lnet_lnd_timeout. We'll make that change more widely and see how it goes. |
| Comment by Chris Horn [ 20/Dec/21 ] |
|
Hi Olaf, How's it going with the param changes? As I noted earlier, if you continue to see network/timeout errors after increasing the LND timeout, then you may have some other issue going on with your network. Also, can you clarify where the ko2iblnd peer_credits=32 is coming from that I asked about earlier?
Also, we can see in crash dump that the peer received a connection request from node with ko2iblnd peer_credits=32, but I do not see that parameter specified anywhere in the tarball:
crash_x86_64> dmesg | grep kiblnd_passive_connect
[ 145.267098] LNetError: 437:0:(o2iblnd_cb.c:2554:kiblnd_passive_connect()) Can't accept conn from 172.19.1.54@o2ib100, queue depth too large: 32 (<=8 wanted)
[ 185.922065] LNetError: 437:0:(o2iblnd_cb.c:2554:kiblnd_passive_connect()) Can't accept conn from 172.19.1.55@o2ib100, queue depth too large: 32 (<=8 wanted)
[ 185.938289] LNetError: 437:0:(o2iblnd_cb.c:2554:kiblnd_passive_connect()) Skipped 3 previous similar messages
crash_x86_64>
I only ask because ideally you would have peer_credits the same on all o2iblnd peers, although it is not fatal to have them different as long as the peers are able to negotiate to a lower value. |
| Comment by Olaf Faaland [ 22/Dec/21 ] |
|
Hi Chris, I ran out of time before I went on vacation, so I won't know for a couple weeks. I'll post here as soon as I've made the change. Yes, those two peers are running Lustre 2.14, and may have ended up with different settings accidentally. I'll have to check. Thanks! |
| Comment by Gerrit Updater [ 23/Dec/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45670/ |
| Comment by Peter Jones [ 23/Dec/21 ] |
|
Landed for 2.15 |
| Comment by Chris Horn [ 23/Dec/21 ] |
|
pjones I'm going to re-open this ticket until Olaf can verify that the tuning recommendations have alleviated his issue. It was probably a mistake for me to push that code change against this ticket as it turned out not to be the root cause of Olaf's problem. I'm sorry about that. |
| Comment by Chris Horn [ 23/Dec/21 ] |
|
Alternatively, ofaaland if you're okay with it we could open a new ticket to continue investigation into your issue. |
| Comment by Olaf Faaland [ 05/Jan/22 ] |
|
Chris, I'm fine with either continuing in this ticket or opening a new one. I'm updating tunings over the next couple of days. Thanks! |
| Comment by Olaf Faaland [ 06/Jan/22 ] |
|
Hi Chris and Serguei,
> How's it going with the param changes?
I changed the timeout as prescribed above, on all the systems (clients, routers, servers). I then rebooted orelic4 into an image with lustre 2.12. The changed timeout did not change the symptoms. I still see the climbing "refs" on the orelic4 node when I boot it into Lustre 2.12, sadly.
> As I noted earlier, if you continue to see network/timeout errors after increasing the LND timeout, then you may have some other issue going on with your network.
I don't think problems with the network (i.e. switches, cables, NICs, drivers) can explain this, because we don't see these issues when orelic4 (and other nodes in the orelic cluster) are running Lustre 2.10 - only when they are running 2.12. Do you have other ideas? I've gathered information from the node (dmesg, lctl dk, module params, etc.) and also gathered a crash dump. |
| Comment by Serguei Smirnov [ 13/Jan/22 ] |
|
Olaf, before I provide an instrumented build for debugging, I was wondering whether in the meantime you could try the test while making sure that the o2iblnd parameters are consistent between orelic4 and the nodes it talks to directly, specifically peer_credits_hiw=4 and concurrent_sends=8. Thanks, Serguei. |
| Comment by Olaf Faaland [ 14/Jan/22 ] |
Yes, I'll check whether those parameters are consistent. |
| Comment by Olaf Faaland [ 18/Jan/22 ] |
|
Hi Serguei, orelic4 and all of the nodes it talks to directly over o2ib have peer_credits_hiw=0 and concurrent_sends=0 (we don't set those values).
|
| Comment by Serguei Smirnov [ 28/Jan/22 ] |
|
Hi Olaf, I prepared a patch that can be applied on top of 2.12.7-llnl. This patch provides more detailed info on lpni refcounts. There are individual counts for each instance where the lpni refcount is incremented and decremented in the code. After applying the patch, once the peer with the excessive refcount is identified, you can use the
lnetctl peer show -v 5 --nid <nid_of_peer_with_high_refcount>
command to display the detailed counts, which will be dumped at the end. This should help narrow down the issue a bit. Thanks, Serguei. |
| Comment by Olaf Faaland [ 28/Jan/22 ] |
|
Hi Serguei, Thank you, that looks good. We're at 2.12.8 these days, but the patch applies cleanly. Is there any reason not to push it to gerrit with "fortestonly"? |
| Comment by Serguei Smirnov [ 28/Jan/22 ] |
|
Hi Olaf, I didn't think I could push it to LLNL repo. Do you mean I should push it to b2_12 of Lustre? |
| Comment by Olaf Faaland [ 28/Jan/22 ] |
|
> Do you mean I should push it to b2_12 of Lustre? Yes, that's what I meant, with "Test-Parameters: fortestonly" and an appropriately limited set of tests. Would that be inappropriate? |
| Comment by Gerrit Updater [ 28/Jan/22 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46364 |
| Comment by Chris Horn [ 28/Jan/22 ] |
|
I've been reviewing the related code off and on and I have found one reference leak, though I doubt it is responsible for your issue because it would only be hit on an ENOMEM error (which is probably rare), and this code path deals with resizing the ping buffer, which should not happen very often. This ping code is suspicious though, because it is something that has changed from 2.10 -> 2.12.
static void
lnet_ping_router_locked(struct lnet_peer_ni *rtr)
{
struct lnet_rc_data *rcd = NULL;
time64_t now = ktime_get_seconds();
time64_t secs;
struct lnet_ni *ni;
lnet_peer_ni_addref_locked(rtr); <<<<< Addref
if (rtr->lpni_ping_deadline != 0 && /* ping timed out? */
now > rtr->lpni_ping_deadline)
lnet_notify_locked(rtr, 1, 0, now);
/* Run any outstanding notifications */
ni = lnet_get_next_ni_locked(rtr->lpni_net, NULL);
lnet_ni_notify_locked(ni, rtr);
if (!lnet_isrouter(rtr) ||
the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING) {
/* router table changed or router checker is shutting down */
lnet_peer_ni_decref_locked(rtr);
return;
}
rcd = rtr->lpni_rcd;
/*
* The response to the router checker ping could've timed out and
* the mdh might've been invalidated, so we need to update it
* again.
*/
if (!rcd || rcd->rcd_nnis > rcd->rcd_pingbuffer->pb_nnis ||
LNetMDHandleIsInvalid(rcd->rcd_mdh))
rcd = lnet_update_rc_data_locked(rtr);
if (rcd == NULL)
return; <<<<< Reference leak
|
| Comment by Gerrit Updater [ 28/Jan/22 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46367 |
| Comment by Olaf Faaland [ 29/Jan/22 ] |
|
Thanks Serguei. I'll be able to run with this patch Tuesday morning, and I'll get results right away. |
| Comment by Olaf Faaland [ 29/Jan/22 ] |
|
Thanks Chris. I'm adding your patch to my stack so the Tuesday morning test will be with both patches. |
| Comment by Olaf Faaland [ 01/Feb/22 ] |
|
Hi Serguei and Chris, I've uploaded orelic4.debug_refcount_01.tar.gz which has dmesg, lustre debug log (with our default flags, errors only), and lnetctl peer show for a peer with a high refcount. Thanks |
| Comment by Chris Horn [ 01/Feb/22 ] |
|
faaland1 I'll let Serguei dig into the data from the debug patch, but I noticed in the lustre debug log you provided that a few nodes appear to be down/unreachable:
hornc@C02V50B9HTDG pass1 % grep 'ADDR ERR' dk.1.txt | awk '{print $2}' | sort -u
172.19.1.59@o2ib100:
172.19.1.91@o2ib100:
172.19.1.92@o2ib100:
172.19.2.26@o2ib100:
172.19.2.27@o2ib100:
hornc@C02V50B9HTDG pass1 %
Is this expected? i.e. were these peers actually down during the time captured by the log? |
| Comment by Olaf Faaland [ 01/Feb/22 ] |
|
Hi Chris,
172.19.1.59@o2ib100: This host was not down. I noticed afterwards that it failed to reconnect with orelic4 after orelic4 was rebooted; I haven't had a chance to look into it yet, so I don't know why.
172.19.1.91@o2ib100: retired/expected |
| Comment by Serguei Smirnov [ 02/Feb/22 ] |
|
Hi Olaf, The debug data listing itemized lpni refcounts looks incorrect: the sum of "decrement" counts exceeds the sum of "increment" counts by a lot, while I was expecting to see the difference no less than the reported total peer ni ref count. I wonder if the debug patch got applied correctly. Was the patch applied to LLNL repo or was lustre b2_12 used? I'd like to check if the resulting code is missing some of the "addref" cases. Could you please provide the diff of the change and specify which commit was used as a base? Thanks, Serguei. |
| Comment by Olaf Faaland [ 02/Feb/22 ] |
|
Hi Serguei, I applied the patch to lustre's branch based on 2.12.8. Here's what I built: There was a merge conflict because I had both Chris' refcount leak patch and your patch on the same branch. Maybe I goofed up the conflict resolution. thanks, |
| Comment by Chris Horn [ 02/Feb/22 ] |
|
If the system was under load then the counts might not match up, just from activity on the router while the stats were being dumped. |
| Comment by Serguei Smirnov [ 02/Feb/22 ] |
|
I'm going to update the debug patch to make use of atomic increments. I'll also rebase it on top of Chris's fix. |
| Comment by Serguei Smirnov [ 08/Feb/22 ] |
|
Hi, The debug patch has been updated. Please try the same test again and use
lnetctl peer show -v 5 --nid <nid_of_peer_with_high_refcount>
to dump the debug data. Thanks, Serguei. |
| Comment by Olaf Faaland [ 15/Feb/22 ] |
|
Serguei, I've attached 4 files - peer.show.172.16.70.*_at_tcp.orelic4.1644951836 |
| Comment by Serguei Smirnov [ 17/Feb/22 ] |
|
Olaf, Now the debug counts add up and make more sense. However, so far I haven't been able to find any obvious leaks, so the most likely explanation is still messages just not getting finalized in LNet, and it is still unclear exactly why. I found a patch that may be related. Thanks, Serguei. |
| Comment by Olaf Faaland [ 17/Feb/22 ] |
|
Hi Serguei, OK, thanks. If you want, you could base your patches on 2.14. -Olaf |
| Comment by Gerrit Updater [ 28/Feb/22 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46650 |
| Comment by Serguei Smirnov [ 02/Mar/22 ] |
|
Hi Olaf, The new patch is based on b2_14. It adds the ability to examine the lnd peer. If there's a peer with climbing refcounts,
lnetctl debug peer --prim_nid=<peer nid>
will dump the peer info to the log ("console", should be enabled by default), to be retrieved with "lctl dk". Note that the "prim_nid" parameter doesn't really require the primary nid of the peer, but rather the specific lpni. I'll update the patch to make it more clear a bit later. The purpose of this is to check the idea that something on the lnd level is preventing messages from being finalized. Thanks, Serguei.
|
| Comment by Olaf Faaland [ 25/Mar/22 ] |
|
Hi Serguei, Sorry for the long delay. With 2.14 + this patch running on orelic4, I did not see the climbing peer refcounts, even though I still saw the symptoms described in LU-14026 (reconnects between clients and targets, timed out messages on clients and servers, "Lost connection to MGS" on clients). So perhaps my suggestion to use 2.14 wasn't a good one. I'm wondering if there are multiple issues and 2.14 doesn't have the one that causes climbing refcounts. Ideas? thanks, |
| Comment by Gerrit Updater [ 26/May/22 ] |
|
"Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/47460 |
| Comment by Olaf Faaland [ 27/May/22 ] |
|
Hi Serguei, Can you review Gian-Carlo's backport of your lnd peer debug info patch? We're seeing this climbing refcount issue more widely. thanks, |
| Comment by Serguei Smirnov [ 27/May/22 ] |
|
Hi Olaf, Would it be possible to check which Mellanox FW versions are used? There was a recent investigation at one of the DDN sites which isolated xxx.30.xxxx FW version as problematic: there's a bug in this version which can cause "stuck qp" in IB layer. I'd like to make sure we're not affected by the same problem here. Thanks, Serguei. |
| Comment by Olaf Faaland [ 01/Jun/22 ] |
|
Hi Serguei, Routers with this symptom recently have FW 16.29.2002. We don't have any routers running xxx.30.xxxx. thanks |
| Comment by Olaf Faaland [ 09/Jun/22 ] |
|
Hi Serguei, We reproduced the issue on orelic2, with https://review.whamcloud.com/47460, under Lustre 2.12.8. There were 4 peers with high refcounts, with NIDs 172.16.70.6[2-5]@tcp. I captured the debug information multiple times for some of those peers, but I may not be able to identify which peer a set of debug output is for. I'll post that mapping if I find it. The debug information, as well as the output of "lnetctl peer show --details", is attached: lnetctl.peer.show.orelic2.1654723542.txt Thanks, |
| Comment by Olaf Faaland [ 13/Jun/22 ] |
|
Hi Serguei, |
| Comment by Serguei Smirnov [ 13/Jun/22 ] |
|
Hi Olaf, So far, from looking at the logs you provided, I haven't seen any outputs with abnormal stats for any of the peers you dumped, which may mean that the problem is not reflected at the lnd level. If you do reproduce again, you could try using "lnetctl peer show -v 4" (vs. just "lnetctl peer show"). To reduce the amount of output this produces, you can use the "--nid" option to dump a specific peer only. In the meantime I'm looking at how the instrumentation can be extended to yield more useful info. Thanks, Serguei.
|
| Comment by Serguei Smirnov [ 16/Jun/22 ] |
|
Hi Olaf, After discussing with ashehata, I wonder if we could revisit testing with the "detailed peer refcount summary" patch https://review.whamcloud.com/46364 I'd like to clarify the following:
1) How do the "detailed" counts change over time (for a peer which has a refcount steadily increasing)? This means taking more than one snapshot of lnetctl output: e.g. at refcount = 100 and refcount = 500.
2) The increasing peer refcount appears to be associated with a negative number of router credits, i.e. we're slow routing messages from this peer. What happens if the corresponding route is removed from the peer?
Not sure if it is easy enough to arrange, but for "2" it should be possible to remove the route dynamically using lnetctl, as sketched below. After the route is removed, we should stop receiving traffic from this peer. We would finish forwarding whatever messages we had queued up and rtr_credits should return to the normal value. In order to avoid issues with "symmetry", it would be best to remove the route from all peers. Then we can check what happened to the peer refcount: dump the "detailed" counts again and try to delete the peer using lnetctl (won't work if there's actually a leak). Maybe dump a core, too. Thanks, Serguei.
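For item 2, the dynamic removal could look like the following sketch on each node that uses the affected router as a gateway (the net and gateway values are placeholders):
# run on the nodes that route through the affected router
lnetctl route show   # note which remote nets go via the router
lnetctl route del --net <remote net> --gateway <router nid>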
|
| Comment by Olaf Faaland [ 21/Jun/22 ] |
|
Hi Serguei, I was able to gather detailed counts over time, remove the affected node from all routes so no messages should have been coming in to be routed, attempt to stop lnet, and obtain a crash dump. The node that ran 2.12 with the debug patch was "orelic2". The detailed counts and debug logs are attached: 2022-jun-21.tgz To provide context: The timestamps on the files in the tarball will tell you when counts, debug logs, etc. were gathered. Before removing routes, the refcounts continued to climb. I've also included debug logs for the period. I changed the debug mask to -1 after removing routes but before issuing "lnetctl lnet unconfigure". I can send you the crash dump. Thanks, |
| Comment by Olaf Faaland [ 11/Jul/22 ] |
|
Hi Serguei, do you have any update on this? |
| Comment by Serguei Smirnov [ 11/Jul/22 ] |
|
Hi Olaf, I examined the traces you provided. It still looks like some messages are just not getting finalized. One idea I have is that they might have gotten stuck in the resend queue somehow. Could you please give me access to the crash dump if you still have it, along with the debuginfo rpms? Thanks, Serguei. |
| Comment by Olaf Faaland [ 11/Jul/22 ] |
|
Hi Serguei, |
| Comment by Serguei Smirnov [ 12/Jul/22 ] |
|
Hi Olaf, I found these files:
-rw-r--r-- 1 sdsmirnov staff 469346936 12 Jul 11:36 kernel-debuginfo-3.10.0-1160.66.1.1chaos.ch6.x86_64.rpm
-rw-r--r-- 1 sdsmirnov staff 65354176 12 Jul 11:37 kernel-debuginfo-common-x86_64-3.10.0-1160.66.1.1chaos.ch6.x86_64.rpm
-rw-r--r-- 1 sdsmirnov staff 19370216 12 Jul 11:37 lustre-debuginfo-2.12.8_9.llnl.olaf1.toss5305-1.ch6_1.x86_64.rpm
-rw-r--r-- 1 sdsmirnov staff 1270395238 12 Jul 11:34 vmcore
-rw-r--r-- 1 sdsmirnov staff 148855 12 Jul 11:34 vmcore-dmesg.txt
and copied them over to my machine. I'll take a look and keep you updated. Thanks, Serguei. |
| Comment by Olaf Faaland [ 25/Jul/22 ] |
|
Hi Serguei, |
| Comment by Serguei Smirnov [ 26/Jul/22 ] |
|
Hi Olaf, I had too many distractions and haven't finished looking at the core yet. Basically I believe what I see in the core so far does confirm the idea that messages are not getting finalized, but I still haven't understood why. In LNet layer the number of queued messages on the problem peer looks consistent with the high refcount, but I still need to dig more at the LND level and examine message queues there. Thanks, Serguei. |
| Comment by Gerrit Updater [ 08/Aug/22 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48163 |
| Comment by Serguei Smirnov [ 08/Aug/22 ] |
|
Hi Olaf, While examining the core I found that messages causing the delay are waiting to be sent: they are listed on lp_dc_pendq of the destination peer. At the same time, the destination peer is not queued to be discovered, so it appears that there's no good reason for the messages to be delayed. I pushed a test patch in order to rule out a race condition which somehow enables a thread to queue a message for a peer which is not (or no longer) going to be discovered. The new patch is going to attempt to recognize this situation on discovery completion, print an error and handle any messages which are still pending. This should help locate the race condition if it is actually occurring. If this is the only cause, with this patch we should see the error message "Peer X msg list not empty on disc comp" and no more refcount increase. Otherwise, I'll have to look for other possible causes. Thanks, Serguei. |
| Comment by Chris Horn [ 09/Aug/22 ] |
|
Sounds like https://jira.whamcloud.com/browse/LU-12739 ? |
| Comment by Serguei Smirnov [ 09/Aug/22 ] |
|
Chris, Yes indeed, it looks very much like LU-12739. I'll port these changes. Thanks, Serguei. |
| Comment by Serguei Smirnov [ 11/Aug/22 ] |
|
Hi Olaf, I ported Chris's fix for LU-12739. Please give this patch a try. It aims to eliminate a race condition with effects potentially similar to what is seen in the coredump you provided. Thanks, Serguei. |
| Comment by Olaf Faaland [ 24/Aug/22 ] |
|
Hi Serguei I tested 2.12.9 + change 48190 today and results so far are promising. I'll test it further and post here in the next couple of days. |
| Comment by Olaf Faaland [ 25/Aug/22 ] |
|
Hi Serguei, 2.12.9 + change 48190 held up well overnight which is far beyond how long we've needed to wait for symptoms in the past. If you can get someone to perform a second review on the patch in gerrit that would be great. I'll deploy more widely and update here early next week. thanks, |
| Comment by Olaf Faaland [ 31/Aug/22 ] |
|
Hi Serguei, 2.12.9 + change 48190 appears to have resolved this issue on orelic, which has been a reliable reproducer. Olaf |
| Comment by Olaf Faaland [ 12/Sep/22 ] |
|
As far as I'm concerned, this will be resolved when the patch lands to b2_12. Do you agree? If so, what is the plan for that? thanks |
| Comment by Peter Jones [ 12/Sep/22 ] |
|
Yes I think that we can mark this ticket as a duplicate of |
| Comment by Gerrit Updater [ 15/Sep/22 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48566 |
| Comment by Peter Jones [ 20/Sep/22 ] |
|
The |
| Comment by Olaf Faaland [ 27/Sep/22 ] |
|
> The No opinion from me. Thanks for getting this fixed. |
| Comment by Peter Jones [ 01/Oct/22 ] |
|
I think it is really a call for ssmirnov . Do you still think that there is value in landing https://review.whamcloud.com/#/c/fs/lustre-release/+/48566/ or do you intend to abandon it in light of the review comments? |
| Comment by Serguei Smirnov [ 03/Oct/22 ] |
|
No I would prefer to address the comments and land this patch. Even though it is not fixing anything for this ticket (it is a debugging enhancement), it happens to have been created as a result of investigating this issue. |
| Comment by Gerrit Updater [ 25/Oct/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48566/ |
| Comment by Peter Jones [ 25/Oct/22 ] |
|
Landed for 2.16 |