[LU-15234] LNet high peer reference counts inconsistent with queue Created: 16/Nov/21  Updated: 17/Dec/22  Resolved: 25/Oct/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Olaf Faaland Assignee: Serguei Smirnov
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

lustre-2.12.7_2.llnl-2.ch6.x86_64
3.10.0-1160.45.1.1chaos.ch6.x86_64


Attachments: File 2022-jun-21.tgz     File debug_refcount_01.patch     Text File dk.orelic2.1654723678.txt     Text File dk.orelic2.1654723686.txt     Text File dk.orelic2.1654724730.txt     Text File dk.orelic2.1654724740.txt     Text File dk.orelic2.1654724745.txt     Text File dk.orelic2.1654724751.txt     File dk.ruby1016.1637103254.txt.bz2     Text File ko2iblnd.parameters.orelic4.1637617473.txt     Text File ksocklnd.parameters.orelic4.1637617487.txt     Text File lctl.version.orelic4.1637616867.txt     Text File lctl.version.ruby1016.1637616519.txt     Text File lnet.parameters.orelic4.1637617458.txt     Text File lnetctl.net-show.orelic4.1637616889.txt     Text File lnetctl.net-show.ruby1016.1637616206.txt     Text File lnetctl.peer.show.orelic2.1654723542.txt     Text File lnetctl.peer.show.orelic2.1654724780.txt     File orelic4-lustre212-20211216.tgz     File orelic4.debug_refcount_01.tar.gz     File params_20211213.tar.gz     PNG File peer status orelic4 with discovery race patch v3.png     File peer.show.172.16.70.62_at_tcp.orelic4.1644951836     File peer.show.172.16.70.63_at_tcp.orelic4.1644951836     File peer.show.172.16.70.64_at_tcp.orelic4.1644951836     File peer.show.172.16.70.65_at_tcp.orelic4.1644951836     Text File peer.show.ruby1016.1637103254.txt     Text File peer.show.ruby1016.1637103865.txt     Text File stats.show.ruby1016.1637103254.txt     Text File stats.show.ruby1016.1637103865.txt     PNG File toss-5305 queue 2021-11-15.png     PNG File toss-5305 refs 2021-11-15.png    
Issue Links:
Duplicate
duplicates LU-12739 Race with discovery thread completion... Resolved
Related
is related to LU-15453 MDT shutdown hangs on mutex_lock, po... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I believe that peer reference counts may not be decremented in some LNet error path, or that the size of the queue is not accurately reported by "lctl get_param peers".

The reference counts reported as "refs" by "lctl get_param peers" are increasing linearly with time. This is in contrast with "queue", which periodically spikes but then drops back to 0.  The output below shows 4 routers on ruby which have refs > 46,000 for the route to 172.19.2.24@o2ib100 even though the reported queue is 0.  This is just a little over 6 days since the ruby routers were rebooted during an update.

[root@ruby1009:~]# pdsh -v -g router lctl get_param peers 2>/dev/null | awk '$3 > 20 {print}' | sed 's/^.*://' | sort -V -u
 172.19.2.24@o2ib100      46957    up     5     8 -46945 -46945     8   -13 0
 172.19.2.24@o2ib100      47380    up     1     8 -47368 -47368     8   -23 0
 172.19.2.24@o2ib100      48449    up    15     8 -48437 -48437     8   -17 0
 172.19.2.24@o2ib100      49999    up     3     8 -49987 -49987     8    -7 0

The ruby routers have an intermittent LNet communication problem (the fabric itself seems fine according to several tests, so the underlying issue is still under investigation).



 Comments   
Comment by Olaf Faaland [ 16/Nov/21 ]

My local issue is TOSS5305

Comment by Peter Jones [ 16/Nov/21 ]

Serguei

Could you please advise?

Thanks

Peter

Comment by Serguei Smirnov [ 16/Nov/21 ]

Hi Olaf,

From the graph it looks like the refcount is growing slowly but steadily. If it grows fast enough that an increment is likely within a reasonably short window (short enough that the debug log is not overwritten), could you please capture the net debug log for that window:

lctl set_param debug=+net
lctl dk clear
---- wait for the increment ----
lctl dk > log.txt
lctl set_param debug=-net

If the increment window is too long for a debug log capture to be practical, please provide the syslog instead.

Before and after the debug window, please capture:

lnetctl stats show
lnetctl peer show -v 4 --nid <peer nid that leaks refcount> 

I agree that there may be a problem with not decrementing the refcount on some error path. Hopefully the debug data can help narrow down which path it is.

Thanks,

Serguei.

Comment by Olaf Faaland [ 16/Nov/21 ]

Hi Serguei, I've attached debug logs and the peer and stats before-and-after output.
thanks
Olaf

Comment by Olaf Faaland [ 18/Nov/21 ]

Hi Serguei,
Do you have any news, or need any additional information?
thanks,
Olaf

Comment by Serguei Smirnov [ 18/Nov/21 ]

Hi Olaf,

I haven't had a chance yet to properly process what you have provided. I should be able to give you an update tomorrow.

Thanks,

Serguei.

Comment by Serguei Smirnov [ 20/Nov/21 ]

The refcount appears to be going up at the same rate as rtr_credits are going down (64 between the two "peer show" snapshots). Peer status changed to "down", as we likely didn't receive a router check ping. Nothing is received from 172.19.2.24@o2ib100 at the lnet level, but it appears that the peer is receiving at least some messages and is returning credits at the lnd level.

Could you please list 

dead_router_check_interval
live_router_check_interval
router_ping_timeout
peer_timeout

from both nodes?

Is 172.19.2.24@o2ib100 another router?

Thanks,

Serguei.

Comment by Olaf Faaland [ 20/Nov/21 ]

Hi Serguei,

Yes, 172.19.2.24 is orelic4, one of the IB-to-TCP "RELIC" routers that is still at Lustre 2.10.

ruby1016:

/sys/module/lnet/parameters/dead_router_check_interval:60
/sys/module/lnet/parameters/live_router_check_interval:60
/sys/module/lnet/parameters/router_ping_timeout:50
/sys/module/ko2iblnd/parameters/peer_timeout:180

orelic4:

/sys/module/lnet/parameters/dead_router_check_interval:60
/sys/module/lnet/parameters/live_router_check_interval:60
/sys/module/lnet/parameters/router_ping_timeout:50
/sys/module/ko2iblnd/parameters/peer_timeout:180

thanks,
Olaf

Comment by Serguei Smirnov [ 22/Nov/21 ]

It could be related to LU-12569

The 2.10 version probably doesn't have the fix for it. 

Could you please provide, from both sides 

lctl --version
lnetctl net show -v 4 

to check what the credit-related settings are?

Thanks,

Serguei.

Comment by Olaf Faaland [ 22/Nov/21 ]

Hi Serguei, attached as

lctl.version.orelic4.1637616867.txt
lctl.version.ruby1016.1637616519.txt
lnetctl.net-show.orelic4.1637616889.txt
lnetctl.net-show.ruby1016.1637616206.txt

along with module params for orelic4, since 2.10 lnetctl doesn't report as much with "net show":
ko2iblnd.parameters.orelic4.1637617473.txt
ksocklnd.parameters.orelic4.1637617487.txt
lnet.parameters.orelic4.1637617458.txt

thanks
Olaf

Comment by Chris Horn [ 24/Nov/21 ]

FYI, I think this may be caused by a bug in discovery. I noticed this same symptom on a router while I was doing some internal testing.

The router had high reference counts for three peers (all Lustre servers):

nid00053:~ # awk '{if ($2 > 1){print $0}}' /sys/kernel/debug/lnet/peers
nid                      refs state  last   max   rtr   min    tx   min queue
10.13.100.57@o2ib11      2488    up    -1    16 -2359 -2359    16   -18 0
10.13.100.53@o2ib11      1917    up    -1    16 -1788 -1788    16    -5 0
10.13.100.55@o2ib11      2582    up    -1    16 -2453 -2453    16   -41 0
nid00053:~ #

Here I'm just locating the lnet_peer.lp_rtrq for each of these peers:

crash_x86_64> p &the_lnet.ln_peer_tables[0].pt_peer_list
$7 = (struct list_head *) 0xffff8803ea46af90
crash_x86_64> p &the_lnet.ln_peer_tables[1].pt_peer_list
$8 = (struct list_head *) 0xffff8803e9298450
crash_x86_64> p &the_lnet.ln_peer_tables[2].pt_peer_list
$9 = (struct list_head *) 0xffff8803e9298750
crash_x86_64> list -H 0xffff8803ea46af90 -s lnet_peer.lp_primary_nid | egrep -B 1 -e 1407422296843321 -e 1407422296843317 -e 1407422296843319
ffff8803e9297400
  lp_primary_nid = 1407422296843321 <<< 10.13.100.57@o2ib11
crash_x86_64> list -H 0xffff8803e9298450 -s lnet_peer.lp_primary_nid | egrep -B 1 -e 1407422296843321 -e 1407422296843317 -e 1407422296843319
crash_x86_64> list -H 0xffff8803e9298750 -s lnet_peer.lp_primary_nid | egrep -B 1 -e 1407422296843321 -e 1407422296843317 -e 1407422296843319
ffff8803ea3e2e00
  lp_primary_nid = 1407422296843319 <<< 10.13.100.55@o2ib11
ffff8803864ab600
  lp_primary_nid = 1407422296843317 <<< 10.13.100.53@o2ib11
crash_x86_64> struct -o lnet_peer ffff8803e9297400 | grep lp_rtrq
  [ffff8803e9297470] struct list_head lp_rtrq;
crash_x86_64> struct -o lnet_peer ffff8803ea3e2e00 | grep lp_rtrq
  [ffff8803ea3e2e70] struct list_head lp_rtrq;
crash_x86_64> struct -o lnet_peer ffff8803864ab600 | grep lp_rtrq
  [ffff8803864ab670] struct list_head lp_rtrq;
crash_x86_64> list -H ffff8803e9297470 | wc -l
2389
crash_x86_64> list -H ffff8803ea3e2e70 | wc -l
2481
crash_x86_64> list -H ffff8803864ab670 | wc -l
1815
crash_x86_64>

While trying to track down where the bottleneck was, I noticed that there are two @gni peers that seem to be stuck in discovery:

crash_x86_64> struct -o lnet the_lnet | grep ln_dc
  [ffffffffa0307078] lnet_handler_t ln_dc_handler;
  [ffffffffa0307080] struct list_head ln_dc_request;
  [ffffffffa0307090] struct list_head ln_dc_working;
  [ffffffffa03070a0] struct list_head ln_dc_expired;
  [ffffffffa03070b0] wait_queue_head_t ln_dc_waitq;
  [ffffffffa03070c8] int ln_dc_state;
crash_x86_64> list -H ffffffffa0307090 -o 224
ffff8803e9297c00
ffff8803e91cc200
crash_x86_64> lnet_peer.lp_primary_nid ffff8803e9297c00
  lp_primary_nid = 3659174697238582
crash_x86_64> lnet_peer.lp_primary_nid ffff8803e91cc200
  lp_primary_nid = 3659174697238578
crash_x86_64> epython nid2str.py 3659174697238582
54@gni
crash_x86_64> epython nid2str.py 3659174697238578
50@gni
crash_x86_64>

These peers were last processed by the discovery thread hours ago:

crash_x86_64> lnet_peer.lp_last_queued ffff8803e9297c00
  lp_last_queued = 1637750512
crash_x86_64> lnet_peer.lp_last_queued ffff8803e91cc200
  lp_last_queued = 1637750211
crash_x86_64>

pollux-p4:~ # date -d @1637750512
Wed Nov 24 04:41:52 CST 2021
pollux-p4:~ # date -d @1637750211
Wed Nov 24 04:36:51 CST 2021
pollux-p4:~ #

The router was dumped a little under 10 hours later:

        DATE: Wed Nov 24 14:28:12 2021

The stuck peers have a state that is inconsistent with being on the ln_dc_working queue:

crash_x86_64> lnet_peer.lp_state ffff8803e9297c00
  lp_state = 338
crash_x86_64> lnet_peer.lp_state ffff8803e91cc200
  lp_state = 338
crash_x86_64>

*hornc@cflosbld09 fs4 $ lpst2str.sh 338
LNET_PEER_NO_DISCOVERY
LNET_PEER_DISCOVERED
LNET_PEER_DISCOVERING
LNET_PEER_NIDS_UPTODATE
*hornc@cflosbld09 fs4 $

These @gni peers have numerous messages on the lnet_peer.lp_dc_pendq:

crash_x86_64> struct -o lnet_peer ffff8803e9297c00 | grep lp_dc_pendq
  [ffff8803e9297c20] struct list_head lp_dc_pendq;
crash_x86_64> struct -o lnet_peer ffff8803e91cc200 | grep lp_dc_pendq
  [ffff8803e91cc220] struct list_head lp_dc_pendq;
crash_x86_64> list -H ffff8803e9297c20 | wc -l
214
crash_x86_64> list -H ffff8803e91cc220 | wc -l
170
crash_x86_64>

Those messages are likely what is consuming, and not releasing, the lpni_rtrcredits for the three @o2ib11 peers that show the high reference counts.

I haven't yet figured out why the @gni peers are stuck in discovery.

Comment by Chris Horn [ 24/Nov/21 ]

Olaf, does your 2.12.7 router have the fix from LU-13883 backported to it?

Comment by Chris Horn [ 27/Nov/21 ]

I discovered a race between the discovery thread and other threads that are queueing a peer for discovery.

When the discovery thread finishes processing a peer it calls lnet_peer_discovered() which clears the LNET_PEER_DISCOVERING bit from the peer state:

        lp->lp_state |= LNET_PEER_DISCOVERED;
        lp->lp_state &= ~(LNET_PEER_DISCOVERING |
                          LNET_PEER_REDISCOVER);

At this point, the peer is on the lnet.ln_dc_working queue. When lnet_peer_discovered() returns, the lnet_peer.lp_lock spinlock is dropped, and the discovery thread acquires the lnet_net_lock/EX. This is where the race window exists:

                        spin_unlock(&lp->lp_lock);
<<< Race window >>>
                        lnet_net_lock(LNET_LOCK_EX);

If another thread queues this peer for discovery during this window, then the LNET_PEER_DISCOVERING bit is added back to the peer state, but since the peer is already on the lnet.ln_dc_working queue, it does not get added to the lnet.ln_dc_request queue.

static int lnet_peer_queue_for_discovery(struct lnet_peer *lp)
...
        spin_lock(&lp->lp_lock);
        if (!(lp->lp_state & LNET_PEER_DISCOVERING))
                lp->lp_state |= LNET_PEER_DISCOVERING;
        spin_unlock(&lp->lp_lock);
        if (list_empty(&lp->lp_dc_list)) {  <<< Peer is on ln_dc_working
                lnet_peer_addref_locked(lp);
                list_add_tail(&lp->lp_dc_list, &the_lnet.ln_dc_request);
...

When the discovery thread acquires the lnet_net_lock/EX, it sees that the LNET_PEER_DISCOVERING bit has not been cleared, so it does not call lnet_peer_discovery_complete() which is responsible for sending messages on the peer's discovery pending queue.

                        spin_unlock(&lp->lp_lock);
<<< Race window >>>
                        lnet_net_lock(LNET_LOCK_EX);
...
                        if (!(lp->lp_state & LNET_PEER_DISCOVERING))
                                lnet_peer_discovery_complete(lp);

At this point, the peer is stuck on the lnet.ln_dc_working queue, and messages may continue to accumulate on the peer's lnet_peer.lp_dc_pendq.
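
A minimal sketch of one way this window could be closed (illustrative only; not necessarily how the patch referenced below handles it) would be for the discovery thread, after re-taking lnet_net_lock/EX, to notice that the peer was re-queued during the window and move it back onto ln_dc_request rather than leaving it stranded on ln_dc_working:

                        /* Sketch only: after re-acquiring lnet_net_lock(LNET_LOCK_EX),
                         * check whether another thread set LNET_PEER_DISCOVERING again
                         * during the race window. If so, move the peer from
                         * ln_dc_working back onto ln_dc_request so the new request is
                         * serviced and lp_dc_pendq is eventually drained. */
                        lnet_net_lock(LNET_LOCK_EX);
                        if (!(lp->lp_state & LNET_PEER_DISCOVERING)) {
                                lnet_peer_discovery_complete(lp);
                        } else if (!list_empty(&lp->lp_dc_list)) {
                                list_move_tail(&lp->lp_dc_list,
                                               &the_lnet.ln_dc_request);
                                wake_up(&the_lnet.ln_dc_waitq);
                        }
                        /* ... unlock follows as in the existing code ... */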

Comment by Olaf Faaland [ 29/Nov/21 ]

Hi Chris,

Olaf, does your 2.12.7 router have the fix from LU-13883 backported to it?

No, it does not.

Comment by Gerrit Updater [ 29/Nov/21 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45670
Subject: LU-15234 lnet: Race on discovery queue
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 43d9fe70f33defb31e7402d35474b4ef39560657

Comment by Chris Horn [ 29/Nov/21 ]

Thanks, Olaf.

This is just a little over 6 days since the ruby routers were rebooted during an update.

From what I can tell, the discovery defect that I found has been there a long time, but maybe I am missing something that has caused this issue to start manifesting. Can you provide any additional detail about when you started seeing this issue? Was everything working fine until this update that you referenced in the description? What did the update entail?

Comment by Olaf Faaland [ 29/Nov/21 ]

Hi Chris,

I first documented the high refcounts on 2021-09-17.  I recall seeing the issue before then, but I'm not sure how long before.  Our systems were updated to our 2.12.7_2.llnl tag (the tag we're still on) about 2021-08-10.

There was a new issue subsequent to that August update - we started seeing some router nodes reporting "Timed out RDMA" with some other routers for no reason we could find, like this:

[root@orelic1:~]# grep 'LNetError.*Timed out RDMA' /var/log//conman/console.orelic4 | grep ^2021-08 | nidgrep | sort -V | uniq -c
      1 19.1.104@o2ib10
      1 172.19.1.101@o2ib100
    896 172.19.1.103@o2ib100
    674 172.19.1.104@o2ib100
      1 172.19.2.7@o2ib100
      1 172.19.2.8@o2ib100
      1 172.19.2.40@o2ib100
      1 172.19.2.43@o2ib100
      1 172.19.2.44@o2ib100
      1 172.19.2.46@o2ib100 

where 172.19.1.[101-104] all have the same hardware, same LNet versions, and same role (i.e. they were on the path to the same endpoints).

The LNet related patches that were new to the _2.llnl tag, which was what we updated to in August, were:

  • LU-14627 lnet: Allow delayed sends
  • LU-14627 lnet: Ensure ref taken when queueing for discovery
  • LU-13972 o2iblnd: Don't retry indefinitely
  • LU-14488 o2ib: Use rdma_connect_locked if it is defined
  • LU-14588 o2ib: make config script aware of the ofed symbols
  • LU-14733 o2iblnd: Avoid double posting invalidate
  • LU-14733 o2iblnd: Move racy NULL assignment

Comment by Gerrit Updater [ 30/Nov/21 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45681
Subject: LU-15234 lnet: Race on discovery queue
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: f7e0853c7da89724cb89a8b3bad972c661b55794

Comment by Chris Horn [ 30/Nov/21 ]

Olaf, I pushed a backport of this patch to b2_12, just in case you want to try it and see if it resolves your issue.

Comment by Olaf Faaland [ 30/Nov/21 ]

Thank you, Chris.  I will try it.

Comment by Olaf Faaland [ 30/Nov/21 ]

Serguei, before I do try Chris' patch, can you or Amir review it (at least the patch against master)?  Thank you.

Comment by Olaf Faaland [ 09/Dec/21 ]

Serguei and Chris,
We applied the patch from https://review.whamcloud.com/#/c/45681/ (version 3) to our 2.12.8 branch, and still see the climbing refcount and dropping min_rtr. See the attached graph.

What is different is that we now see queue (reported by lctl get_param peers) climb along with refs. So I think we have made progress, but there is some other issue.

Comment by Chris Horn [ 09/Dec/21 ]

Olaf, would you be able to collect a crash dump, vmlinux, and Lustre kos from a router experiencing the high refcount issue? Preferably from a router running the patch.

Comment by Olaf Faaland [ 09/Dec/21 ]

Hi Chris, I need to check whether I can make a crash dump available.

Comment by Olaf Faaland [ 09/Dec/21 ]

Chris, I can make a crash dump available to you. Can you let me know a way I can send it to you privately?

Serguei, if it would help you to have it also, please let me know a way to send it.

thanks,
Olaf

Comment by Chris Horn [ 10/Dec/21 ]

Chris, I can make a crash dump available to you. Can you let me know a way I can send it to you privately?

Yes, I need to check with our new IT org on the best way to do it. I'll let you know.

Comment by Chris Horn [ 10/Dec/21 ]

Olaf, you can upload to this ftp:

sftp -o Port=2222 lu15234@ftp.ext.hpe.com
pass: e$VS3mw_ 
> put <filename>

Comment by Gian-Carlo Defazio [ 10/Dec/21 ]

Olaf is not officially working today. I'll send you the tarball he made yesterday, which includes the kernel dump.

Sent. Hopefully it showed up on your side.

Comment by Chris Horn [ 13/Dec/21 ]

defazio I got it. Thank you

Comment by Chris Horn [ 13/Dec/21 ]

defazio ofaaland can you provide the lustre-debuginfo package that matches kmod-lustre-2.12.8_3.llnl-1.ch6.x86_64.rpm?

Comment by Gian-Carlo Defazio [ 13/Dec/21 ]

I've sent lustre-debuginfo-2.12.8_3.llnl-1.ch6.x86_64.rpm

Comment by Chris Horn [ 13/Dec/21 ]

defazio ofaaland can you provide the output of 'lnetctl global show' from one peer on each cluster? What I mean is, I believe you have something like:

(Cluster A) <-> (Router Cluster B) <-> (Router Cluster C) <-> (Cluster D)

and I'm looking for 'lnetctl global show' output from one peer in each of Cluster A, B, C, and D. Also, could you let me know if you have any tuning changes in place on each cluster? By this I mean whether you are explicitly setting any lnet/ko2iblnd/ksocklnd kernel module parameters, or doing any tuning by executing 'lnetctl set' commands, etc. Thanks.

Comment by Chris Horn [ 13/Dec/21 ]

The fix I authored is only applicable to LNet peers that undergo discovery. In 2.12 LTS, router peers do not undergo discovery, so that explains why the fix didn't help with your issue.

In the dump, we can see this peer with a high refcount (relative to other peers; there are a handful that have refcounts between 10 and 20) and negative lpni_rtrcredits:

crash_x86_64> epython lnet.py -p --nid 172.16.70.63@tcp -d
lnet_peer: ffff9c9bc20cb980
  lp_primary_nid: 172.16.70.63@tcp
  lp_state:
  lp_dc_pendq: ffff9c9bc20cb9a0(0)
  lp_dc_list: ffff9c9bc20cba18(0)
  lp_peer_nets: ffff9c9bc20cb990
    - lnet_peer_net: tcp(ffff9cbba3ff4cc0)
      - lpn_peer_nis: ffff9cbba3ff4cd0
        - lnet_peer_ni: ffff9c9bc3b18600
          - lpni_nid: 172.16.70.63@tcp
          - lpni_refcount: !!!!!!245!!!!!!
          - lpni_healthv: 1000
          - lpni_txcredits: 3
          - lpni_mintxcredits: 0
          - lpni_rtrcredits: -226
          - lpni_minrtrcredits: -226
          - lpni_rtrq: ffff9c9bc3b18650(226)
          - lpni_last_alive: 847
          - lpni_txq: ffff9c9bc3b18640(0)

In the debug log extracted from the dump, we can see timeout errors for this peer:

00000800:00000100:21.0:1639088392.413370:0:16075:0:(socklnd_cb.c:2390:ksocknal_find_timed_out_conn()) Timeout receiving from 12345-172.16.70.63@tcp (172.16.70.63:988), state 4 wanted 0 left 0
*hornc@cflosbld09 dec9-lu-15234-llnl $ grep 172.16.70.63@tcp dk.log.fmt | grep -c ksocknal_find_timed_out_conn
116
*hornc@cflosbld09 dec9-lu-15234-llnl $

Your LND timeout is currently set to the b2_12 default, which is 5 seconds:

crash_x86_64> lnet_lnd_timeout
lnet_lnd_timeout = $497 = 5
crash_x86_64>

I think we need to increase this timeout value. However, there are some quirks with how this value is set in b2_12. I'm guessing that you have lnet_health_sensitivity=0. That setting results in lnet_transaction_timeout being set to 50, and lnet_retry_count being set to 0, but it doesn't update the lnet_lnd_timeout correctly (see the patches for LU-13510).
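
For illustration, a minimal sketch of the relationship assumed here (not verbatim b2_12 code; the exact guard conditions are what the LU-13510 patches deal with):

/* Assumed relation, for illustration only: the LND timeout is recomputed
 * from the transaction timeout and the retry count, but only on some of
 * the configuration paths, which is the quirk described above. */
static void lnet_set_lnd_timeout_sketch(void)
{
        /* with lnet_retry_count=0 this reduces to
         * lnet_lnd_timeout = lnet_transaction_timeout, which is why
         * explicitly setting the transaction timeout fixes the LND timeout */
        lnet_lnd_timeout = lnet_transaction_timeout / (lnet_retry_count + 1);
}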

The easiest solution is to just explicitly set the lnet_transaction_timeout to some value not equal to 50. This needs to be done after the lnet_health_sensitivity is set to 0.

options lnet lnet_health_sensitivity=0
options lnet lnet_transaction_timeout=49

or at runtime (doesn't persist across reboots)

lnetctl set transaction_timeout 49

You might experiment with this value to find the smallest value that resolves the issue, but I wouldn't go any lower than 10 seconds, and I would avoid anything > 50. If you're still seeing timeouts/network errors with it set to 49, then you may have other issues with your network that warrant investigation (bad cables, etc.).

The above assumes you are tuning all of your clusters the same way. If that isn't the case, then provide the information I requested in my previous comment and I can give specific tuning guidance for each cluster.

Comment by Chris Horn [ 13/Dec/21 ]

ssmirnov Whamcloud might want to consider backporting the LU-13510 patches to b2_12, or authoring a fix specific to b2_12, to address the issue with setting lnet_lnd_timeout correctly.

Comment by Gian-Carlo Defazio [ 14/Dec/21 ]

Uploading file params_20211213.tar.gz

Comment by Chris Horn [ 15/Dec/21 ]

Thanks defazio, but I'm a little confused by the information that you've provided. The data in the tarball suggests that you have the following:

orelic2 - Lustre 2.10 - local networks tcp0, o2ib100 w/routes to various other o2ib networks
zrelic2 - Lustre 2.10 - local networks tcp0, o2ib600 w/routes to various other o2ib networks
ruby1009 - Lustre 2.12 - local networks o2ib39, o2ib100 w/routes to o2ib600
zinc2 - Lustre 2.12 - local network o2ib600 w/routes to tcp0 and various other o2ib networks

Is that right?

The crash dump you provided has "NODENAME: orelic4", but this node was running 2.12. Can you clarify?

Also, we can see in crash dump that the peer received a connection request from node with ko2iblnd peer_credits=32, but I do not see that parameter specified anywhere in the tarball:

crash_x86_64> dmesg | grep kiblnd_passive_connect
[  145.267098] LNetError: 437:0:(o2iblnd_cb.c:2554:kiblnd_passive_connect()) Can't accept conn from 172.19.1.54@o2ib100, queue depth too large:  32 (<=8 wanted)
[  185.922065] LNetError: 437:0:(o2iblnd_cb.c:2554:kiblnd_passive_connect()) Can't accept conn from 172.19.1.55@o2ib100, queue depth too large:  32 (<=8 wanted)
[  185.938289] LNetError: 437:0:(o2iblnd_cb.c:2554:kiblnd_passive_connect()) Skipped 3 previous similar messages
crash_x86_64>
hornc@C02V50B9HTDG params_20211213 % grep -a 'ko2iblnd peer_credits' *
hornc@C02V50B9HTDG params_20211213 % grep -a peer_credits *
ko2iblnd.parameters.orelic2.1639444262:/sys/module/ko2iblnd/parameters/peer_credits:8
ko2iblnd.parameters.orelic2.1639444262:/sys/module/ko2iblnd/parameters/peer_credits_hiw:0
ko2iblnd.parameters.orelic2.1639444262:/sys/module/ksocklnd/parameters/peer_credits:8
ko2iblnd.parameters.orelic2.1639444262:/sys/module/ksocklnd/parameters/peer_credits:8
ko2iblnd.parameters.orelic2.1639444262:              peer_credits: 0
ko2iblnd.parameters.orelic2.1639444262:              peer_credits: 8
ko2iblnd.parameters.orelic2.1639444262:              peer_credits: 8
ko2iblnd.parameters.orelic2.1639444262:              peer_credits: 0
ko2iblnd.parameters.orelic2.1639444262:              peer_credits: 8
ko2iblnd.parameters.orelic2.1639444262:              peer_credits: 8
ko2iblnd.parameters.orelic2.1639444262:              peer_credits: 0
ko2iblnd.parameters.orelic2.1639444262:              peer_credits: 8
ko2iblnd.parameters.orelic2.1639444262:              peer_credits: 8
ko2iblnd.parameters.zrelic2.1639443533:/sys/module/ko2iblnd/parameters/peer_credits:8
ko2iblnd.parameters.zrelic2.1639443533:/sys/module/ko2iblnd/parameters/peer_credits_hiw:0
ksocklnd.parameters.orelic2.1639444379:/sys/module/ksocklnd/parameters/peer_credits:8
ksocklnd.parameters.zrelic2.1639443594:/sys/module/ksocklnd/parameters/peer_credits:8
lnetctl-net-show.orelic2.1639444148:              peer_credits: 0
lnetctl-net-show.orelic2.1639444148:              peer_credits: 8
lnetctl-net-show.orelic2.1639444148:              peer_credits: 8
lnetctl-net-show.zrelic2.1639443299:              peer_credits: 0
lnetctl-net-show.zrelic2.1639443299:              peer_credits: 8
lnetctl-net-show.zrelic2.1639443299:              peer_credits: 8
lnetctl-net-show.zrelic2.1639443299~:              peer_credits: 0
lnetctl-net-show.zrelic2.1639443299~:              peer_credits: 8
lnetctl-net-show.zrelic2.1639443299~:              peer_credits: 8
hornc@C02V50B9HTDG params_20211213 %

What am I missing?

Comment by Chris Horn [ 15/Dec/21 ]

In any case, getting back to the timeout issue.

Lustre 2.10 has default LND timeouts of 50 seconds for both ksocklnd and ko2iblnd. You can see that in the parameters file for orelic2:

ko2iblnd.parameters.orelic2.1639444262:/sys/module/ko2iblnd/parameters/timeout:50
ko2iblnd.parameters.orelic2.1639444262:/sys/module/ksocklnd/parameters/sock_timeout:50

So we probably want to get the 2.12 nodes to match that. I figured out that we can get exactly 50 with this set of parameters (note, the order is important):

options lnet lnet_retry_count=0 # Sets lnet_lnd_timeout = lnet_transaction_timeout (lnet_transaction_timeout should have default value of 50)
options lnet lnet_health_sensitivity=0

I would suggest putting this configuration in place on all your Lustre 2.12 nodes. If you are still seeing network timeout issues after that, I'd recommend investigating the network to see if it is healthy.

Comment by Olaf Faaland [ 16/Dec/21 ]

Hi Chris,

Regarding the parameters, I'm attaching them for an orelic node running Lustre 2.12, which is when we see the symptoms described in this issue.  The orelic/zrelic nodes currently run lustre 2.10 because of the issues we've seen, which is why you got the parameters for that configuration.  orelic and zrelic are configured the same, except for the routes.

orelic4-lustre212-20211216.tgz

After gathering these parameters, we put the configuration you suggested in place on orelic4 while it was running Lustre 2.12 and verified that lnet_lnd_timeout was set to 50.

The node first ran with Lustre 2.12 and our stock settings (as in the attached tarball), and refs built up to about 590.  We then set lnet_health_sensitivity=100, set lnet_retry_count=2, then set lnet_retry_count=0, then set lnet_health_sensitivity=0.

After this we observed that refs continued to climb, but much more slowly - the rate was probably 1/4 or less of the rate of climb before changing lnet_lnd_timeout.

We'll make that change more widely and see how it goes.

Comment by Chris Horn [ 20/Dec/21 ]

Hi Olaf,

How's it going with the param changes? As I noted earlier, if you continue to see network/timeout errors after increasing the LND timeout, then you may have some other issue going on with your network.

Also, can you clarify where the ko2iblnd peer_credits=32 is coming from that I asked about earlier?

Also, we can see in crash dump that the peer received a connection request from node with ko2iblnd peer_credits=32, but I do not see that parameter specified anywhere in the tarball:

crash_x86_64> dmesg | grep kiblnd_passive_connect
[  145.267098] LNetError: 437:0:(o2iblnd_cb.c:2554:kiblnd_passive_connect()) Can't accept conn from 172.19.1.54@o2ib100, queue depth too large:  32 (<=8 wanted)
[  185.922065] LNetError: 437:0:(o2iblnd_cb.c:2554:kiblnd_passive_connect()) Can't accept conn from 172.19.1.55@o2ib100, queue depth too large:  32 (<=8 wanted)
[  185.938289] LNetError: 437:0:(o2iblnd_cb.c:2554:kiblnd_passive_connect()) Skipped 3 previous similar messages
crash_x86_64>

I only ask because ideally you would have peer_credits the same on all o2iblnd peers, although it is not fatal to have them different as long as the peers are able to negotiate to a lower value.

Comment by Olaf Faaland [ 22/Dec/21 ]

Hi Chris,

I ran out of time before I went on vacation, so I won't know for a couple weeks.  I'll post here as soon as I've made the change.

Yes, those two peers are running Lustre 2.14, and may have ended up with different settings accidentally.  I'll have to check.

Thanks!

Comment by Gerrit Updater [ 23/Dec/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45670/
Subject: LU-15234 lnet: Race on discovery queue
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 852a4b264a984979dcef1fbd4685cab1350010ca

Comment by Peter Jones [ 23/Dec/21 ]

Landed for 2.15

Comment by Chris Horn [ 23/Dec/21 ]

pjones I'm going to re-open this ticket until Olaf can verify that the tuning recommendations have alleviated his issue. It was probably a mistake for me to push that code change against this ticket as it turned out not to be the root cause of Olaf's problem. I'm sorry about that.

Comment by Chris Horn [ 23/Dec/21 ]

Alternatively, ofaaland if you're okay with it we could open a new ticket to continue investigation into your issue.

Comment by Olaf Faaland [ 05/Jan/22 ]

Chris, I'm fine with either continuing in this ticket or opening a new one.  I'm updating tunings over the next couple of days.  Thanks!

Comment by Olaf Faaland [ 06/Jan/22 ]

Hi Chris and Serguei,

> How's it going with the param changes?

I changed the timeout as prescribed above, on all the systems (clients, routers, servers).  I then rebooted orelic4 into an image with Lustre 2.12.  The changed timeout did not change the symptoms.  I still see the climbing "refs" on the orelic4 node when I boot it into Lustre 2.12, sadly.

> As I noted earlier, if you continue to see network/timeout errors after increasing the LND timeout, then you may have some other issue going on with your network.

I don't think problems with the network (i.e. switches, cables, NICs, drivers) can explain this, because we don't see these issues when orelic4 (and the other nodes in the orelic cluster) are running Lustre 2.10 - only when they are running 2.12.  Do you have other ideas?

I've gathered information from the node (dmesg, lctl dk, module params, etc.) and also gathered a crash dump.

Comment by Serguei Smirnov [ 13/Jan/22 ]

Olaf,

Before I provide an instrumented build for debugging, I was wondering whether in the meantime you could try the test while making sure that the o2iblnd parameters are consistent between orelic4 and the nodes it is talking to directly, specifically peer_credits_hiw=4 and concurrent_sends=8.

Thanks,

Serguei.

Comment by Olaf Faaland [ 14/Jan/22 ]

Before I provide an instrumented build for debugging, I was wondering whether in the meantime you could try the test while making sure that the o2iblnd parameters are consistent between orelic4 and the nodes it is talking to directly, specifically peer_credits_hiw=4 and concurrent_sends=8.

Yes, I'll check whether those parameters are consistent.

Comment by Olaf Faaland [ 18/Jan/22 ]

Hi Serguei, orelic4 and all of the nodes it talks to directly over o2ib have peer_credits_hiw=0 and concurrent_sends=0 (we don't set those values).

 

Comment by Serguei Smirnov [ 28/Jan/22 ]

Hi Olaf,

I prepared a patch that can be applied on top of 2.12.7-llnl

This patch provides more detailed info on lpni refcounts: there is an individual count for each place in the code where the lpni refcount is incremented or decremented.

After applying the patch, once the peer with the excessive refcount is identified, you can use the

lnetctl peer show -v 5 --nid <nid_of_peer_with_high_refcount>

command to display the detailed counts, which are dumped at the end of the output.

This should help narrow down the issue a bit.
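
As a rough illustration of the kind of accounting this describes (the names and call sites below are hypothetical, not taken from the actual patch), each addref/decref site gets its own counter so a later dump can show which site is out of balance:

enum lpni_ref_site {
        LPNI_REF_SEND,          /* hypothetical call sites */
        LPNI_REF_RECV,
        LPNI_REF_ROUTER,
        LPNI_REF_NSITES,
};

/* the real patch presumably keeps these per peer NI; globals keep the
 * sketch short */
static atomic_t lpni_ref_add[LPNI_REF_NSITES];
static atomic_t lpni_ref_dec[LPNI_REF_NSITES];

static inline void
lpni_addref_debug(struct lnet_peer_ni *lpni, enum lpni_ref_site site)
{
        atomic_inc(&lpni_ref_add[site]);
        lnet_peer_ni_addref_locked(lpni);
}

static inline void
lpni_decref_debug(struct lnet_peer_ni *lpni, enum lpni_ref_site site)
{
        atomic_inc(&lpni_ref_dec[site]);
        lnet_peer_ni_decref_locked(lpni);
}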

debug_refcount_01.patch

Thanks,

Serguei.

Comment by Olaf Faaland [ 28/Jan/22 ]

Hi Serguei,

Thank you, that looks good.  We're at 2.12.8 these days, but the patch applies cleanly.  Is there any reason not to push it to gerrit with "fortestonly"?

Comment by Serguei Smirnov [ 28/Jan/22 ]

Hi Olaf,

I didn't think I could push it to LLNL repo. Do you mean I should push it to b2_12 of Lustre?

Comment by Olaf Faaland [ 28/Jan/22 ]

> Do you mean I should push it to b2_12 of Lustre?

Yes, that's what I meant, with "Test-Parameters: fortestonly" and an appropriately limited set of tests.  Would that be inappropriate? 

Comment by Gerrit Updater [ 28/Jan/22 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46364
Subject: LU-15234 lnet: add debug info for lpni refcounts
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: d4abf0db289afad72c7d4ac468aec4e2c7c2f935

Comment by Chris Horn [ 28/Jan/22 ]

I've been reviewing the related code off and on and I have found one reference leak, though I doubt it is responsible for your issue: it would only be hit on an ENOMEM error (which is probably rare), and this code path deals with resizing the ping buffer, which should not happen very often. This ping code is suspicious though, because it is something that has changed from 2.10 -> 2.12.

static void
lnet_ping_router_locked(struct lnet_peer_ni *rtr)
{
        struct lnet_rc_data *rcd = NULL;
        time64_t now = ktime_get_seconds();
        time64_t secs;
        struct lnet_ni *ni;

        lnet_peer_ni_addref_locked(rtr); <<<<< Addref

        if (rtr->lpni_ping_deadline != 0 && /* ping timed out? */
            now >  rtr->lpni_ping_deadline)
                lnet_notify_locked(rtr, 1, 0, now);

        /* Run any outstanding notifications */
        ni = lnet_get_next_ni_locked(rtr->lpni_net, NULL);
        lnet_ni_notify_locked(ni, rtr);

        if (!lnet_isrouter(rtr) ||
            the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING) {
                /* router table changed or router checker is shutting down */
                lnet_peer_ni_decref_locked(rtr);
                return;
        }

        rcd = rtr->lpni_rcd;

        /*
         * The response to the router checker ping could've timed out and
         * the mdh might've been invalidated, so we need to update it
         * again.
         */
        if (!rcd || rcd->rcd_nnis > rcd->rcd_pingbuffer->pb_nnis ||
            LNetMDHandleIsInvalid(rcd->rcd_mdh))
                rcd = lnet_update_rc_data_locked(rtr);
        if (rcd == NULL)
                return; <<<<< Reference leak
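
A minimal sketch of the obvious shape of a fix (illustrative; the actual change is in the review below) is to drop the reference taken at the top of the function before bailing out, mirroring the earlier early-return path:

        if (rcd == NULL) {
                /* sketch: release the reference taken by the addref above
                 * before returning on the update failure */
                lnet_peer_ni_decref_locked(rtr);
                return;
        }
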
Comment by Gerrit Updater [ 28/Jan/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46367
Subject: LU-15234 lnet: ref leak in lnet_ping_router_locked
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: c28647196038c4c4bdae113c02e6898821fcaa8f

Comment by Olaf Faaland [ 29/Jan/22 ]

Thanks Serguei.  I'll be able to run with this patch Tuesday morning, and I'll get results right away.

Comment by Olaf Faaland [ 29/Jan/22 ]

Thanks Chris.  I'm adding your patch to my stack so the Tuesday morning test will be with both patches.

Comment by Olaf Faaland [ 01/Feb/22 ]

Hi Serguei and Chris, I've uploaded orelic4.debug_refcount_01.tar.gz which has dmesg, lustre debug log (with our default flags, errors only), and lnetctl peer show for a peer with a high refcount.  Thanks

Comment by Chris Horn [ 01/Feb/22 ]

faaland1 I'll let Serguei dig into the data from the debug patch, but I noticed in the lustre debug log you provided that a few nodes appear to be down/unreachable:

hornc@C02V50B9HTDG pass1 % grep 'ADDR ERR' dk.1.txt | awk '{print $2}' | sort -u
172.19.1.59@o2ib100:
172.19.1.91@o2ib100:
172.19.1.92@o2ib100:
172.19.2.26@o2ib100:
172.19.2.27@o2ib100:
hornc@C02V50B9HTDG pass1 %

Is this expected? i.e. were these peers actually down during the time captured by the log?

Comment by Olaf Faaland [ 01/Feb/22 ]

Hi Chris,

172.19.1.59@o2ib100: This host was not down. I noticed afterwards that it failed to reconnect with orelic4 after orelic4 was rebooted; I haven't had a chance to look into it yet, so I don't know why.

172.19.1.91@o2ib100: retired/expected
172.19.1.92@o2ib100: retired/expected
172.19.2.26@o2ib100: down/expected
172.19.2.27@o2ib100: down/expected

Comment by Serguei Smirnov [ 02/Feb/22 ]

Hi Olaf,

The debug data listing the itemized lpni refcounts looks incorrect: the sum of the "decrement" counts exceeds the sum of the "increment" counts by a lot, whereas I was expecting the increments to exceed the decrements by at least the reported total peer NI refcount. I wonder if the debug patch got applied correctly.

Was the patch applied to the LLNL repo, or was Lustre b2_12 used? I'd like to check whether the resulting code is missing some of the "addref" cases. Could you please provide the diff of the change and specify which commit was used as a base?

Thanks,

Serguei.

Comment by Olaf Faaland [ 02/Feb/22 ]

Hi Serguei,

I applied the patch to our branch based on Lustre 2.12.8. Here's what I built:
branch https://github.com/LLNL/lustre/tree/debug-refcount-01 tag 2.12.8_6.llnl.olaf3

There was a merge conflict because I had both Chris' refcount leak patch and your patch on the same branch. Maybe I goofed up the conflict resolution.

thanks,
Olaf

Comment by Chris Horn [ 02/Feb/22 ]

If the system was under load, then the counts might not match up simply due to activity on the router while the stats were being dumped.

Comment by Serguei Smirnov [ 02/Feb/22 ]

I'm going to update the debug patch to make use of atomic increments. I'll also rebase it on top of Chris's fix.

Comment by Serguei Smirnov [ 08/Feb/22 ]

Hi,

The debug patch has been updated. Please try the same test again and use

lnetctl peer show -v 5 --nid <nid_of_peer_with_high_refcount> 

to dump the debug data.

Thanks,

Serguei. 

Comment by Olaf Faaland [ 15/Feb/22 ]

Serguei,

I've attached 4 files - peer.show.172.16.70.*_at_tcp.orelic4.1644951836
Peers .63 and .65 had climbing refcounts.
Peers .62 and .64 did not have climbing refcounts.
orelic4 was running your latest refcount debug patch, which uses an atomic_t array.

Comment by Serguei Smirnov [ 17/Feb/22 ]

Olaf,

Now the debug counts add up and make more sense. However, so far I haven't been able to find any obvious leaks, so the most likely explanation is still that messages are just not getting finalized in LNet, and it is still unclear exactly why. I found a patch that may be related (LU-10428), but because I'm not certain about it, I'll probably also add some more debugging. I'll let you know when the patches are ready.

Thanks,

Serguei.

Comment by Olaf Faaland [ 17/Feb/22 ]

Hi Serguei,

OK, thanks.  If you want, you could base your patches on 2.14.

-Olaf

Comment by Gerrit Updater [ 28/Feb/22 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46650
Subject: LU-15234 lnet: add mechanism for dumping lnd peer debug info
Project: fs/lustre-release
Branch: b2_14
Current Patch Set: 1
Commit: fa91a09fe507d17bc04568266589d7298c4e4025

Comment by Serguei Smirnov [ 02/Mar/22 ]

Hi Olaf, 

The new patch is based on b2_14. It adds the ability to examine the lnd peer.

If there's a peer with climbing refcounts

lnetctl debug peer --prim_nid=<peer nid> 

will dump the peer info to the log (the "console" debug flag, which should be enabled by default), to be retrieved with "lctl dk".

Note that the "prim_nid" parameter doesn't really require the primary NID of the peer, but rather the NID of the specific lpni. I'll update the patch later to make this clearer.

The purpose of this is to check the idea that something at the lnd level is preventing messages from being finalized.

Thanks,

Serguei.

 

Comment by Olaf Faaland [ 25/Mar/22 ]

Hi Serguei,

Sorry for the long delay.

With 2.14 + this patch running on orelic4, I did not see the climbing peer refcounts, even though I still saw the symptoms described in LU-14026 (reconnects between clients and targets, timed out messages on clients and servers, "Lost connection to MGS" on clients).

So perhaps my suggestion to use 2.14 wasn't a good one. I'm wondering if there are multiple issues and 2.14 doesn't have the one that causes climbing refcounts. Ideas?

thanks,
Olaf

Comment by Gerrit Updater [ 26/May/22 ]

"Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/47460
Subject: LU-15234 lnet: add mechanism for dumping lnd peer debug info
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 5dec282aff43c739e8fd422df9e0de8fd93e35ba

Comment by Olaf Faaland [ 27/May/22 ]

Hi Serguei,

Can you review Gian-Carlo's backport of your lnd peer debug info patch?  We're seeing this climbing refcount issue more widely.

thanks,
Olaf

Comment by Serguei Smirnov [ 27/May/22 ]

Hi Olaf,

Would it be possible to check which Mellanox FW versions are used? There was a recent investigation at one of the DDN sites which isolated the xxx.30.xxxx FW version as problematic: there's a bug in that version which can cause a "stuck qp" in the IB layer. I'd like to make sure we're not affected by the same problem here.

Thanks,

Serguei.

Comment by Olaf Faaland [ 01/Jun/22 ]

Hi Serguei,

Routers with this symptom recently have FW 16.29.2002.  We don't have any routers running xxx.30.xxxx.

thanks

Comment by Olaf Faaland [ 09/Jun/22 ]

Hi Serguei,

We reproduced the issue on orelic2, with https://review.whamcloud.com/47460, under Lustre 2.12.8. 

There were 4 peers with high refcounts, with NIDs 172.16.70.6[2-5]@tcp.  I captured the debug information multiple times for some of those peers, but I may not be able to identify which peer a set of debug output is for.  I'll post that mapping if I find it. The debug information, as well as the output of "lnetctl peer show --details", is attached.

lnetctl.peer.show.orelic2.1654723542.txt
lnetctl.peer.show.orelic2.1654724780.txt
dk.orelic2.1654723678.txt
dk.orelic2.1654723686.txt
dk.orelic2.1654724730.txt
dk.orelic2.1654724740.txt
dk.orelic2.1654724745.txt
dk.orelic2.1654724751.txt

Thanks,
Olaf

Comment by Olaf Faaland [ 13/Jun/22 ]

Hi Serguei,
I don't have a record of which peer NID was given as the argument, for the above debug sessions. Do you need me to reproduce this and keep track of that?
thanks,
Olaf

Comment by Serguei Smirnov [ 13/Jun/22 ]

Hi Olaf,

So far, from looking at the logs you provided, I haven't seen any output with abnormal stats for any of the peers you dumped, which may mean that the problem is not reflected at the lnd level.

If you do reproduce again, you could try using "lnetctl peer show -v 4" (vs. just "lnetctl peer show"). To reduce the amount of output this produces, you can use the "--nid" option to dump only the specific peer.

In the meantime I'm looking at how the instrumentation can be extended to yield more useful info.

Thanks,

Serguei.

 

Comment by Serguei Smirnov [ 16/Jun/22 ]

Hi Olaf,

After discussing with ashehata, I wonder if we could revisit testing with the "detailed peer refcount summary" patch https://review.whamcloud.com/46364

I'd like to clarify the following:

1) How do the "detailed" counts change over time (for a peer which has refcount steadily increasing)? This means taking more than one snapshot of lnetctl output: e.g. at refcount 100 and refcount = 500.

2) The increasing peer refcount appears to be associated with negative number of router credits, i.e. we're slow routing messages from this peer. What happens if the corresponding route is removed from the peer? 

Not sure if it is easy enough to arrange, but for "2" it should be possible to remove the route dynamically using lnetctl. After the route is removed, we should stop receiving traffic from this peer. We would finish forwarding whatever messages we had queued up and rtr_credits should return to normal value. In order to avoid issues with "symmetry", it would be best to remove the route from all peers. Then we can check what happened to the peer refcount: dump the "detailed" counts again and try to delete the peer using lnetctl (won't work if there's actually a leak). Maybe dump a core, too.

Thanks,

Serguei.

 

 

Comment by Olaf Faaland [ 21/Jun/22 ]

Hi Serguei,

I was able to gather detailed counts over time, remove the affected node from all routes so no messages should have been coming in to be routed, attempt to stop lnet, and obtain a crash dump. The node that ran 2.12 with the debug patch was "orelic2".

The detailed counts and debug logs are attached:  2022-jun-21.tgz

To provide context:
2022-06-21 14:36:56 LNet started with debug patch
2022-06-21 14:55:00 Removed routes on other clusters where gateway == orelic2. (time approximate)
2022-06-21 15:21:34 issued "lnetctl lnet unconfigure"
2022-06-21 15:26:21 crashed orelic2 to gather the dump

The timestamps on the files in the tarball will tell you when counts, debug logs, etc. were gathered.

Before removing routes, the refcounts continued to climb.
After removing routes, the refcounts plateaued at 82
The "lnetctl lnet unconfigure" command hung

I've also included debug logs for the period. I changed the debug mask to -1 after removing routes but before issuing "lnetctl lnet unconfigure".

I can send you the crash dump.

Thanks,
Olaf

Comment by Olaf Faaland [ 11/Jul/22 ]

Hi Serguei, do you have any update on this?
Thanks,
Olaf

Comment by Serguei Smirnov [ 11/Jul/22 ]

Hi Olaf,

I examined the traces you provided. It still looks like some messages are just not getting finalized. One idea I have is that they might have gotten stuck in a resend queue somehow.

Could you please give me access to the crash dump if you still have it, along with the debuginfo RPMs?

Thanks,

Serguei.

Comment by Olaf Faaland [ 11/Jul/22 ]

Hi Serguei,
I've uploaded the dump and debuginfos via ftp. Please confirm you received them.
thanks,
Olaf

Comment by Serguei Smirnov [ 12/Jul/22 ]

Hi Olaf,

I found these files 

-rw-r--r--  1 sdsmirnov  staff   469346936 12 Jul 11:36 kernel-debuginfo-3.10.0-1160.66.1.1chaos.ch6.x86_64.rpm
-rw-r--r--  1 sdsmirnov  staff    65354176 12 Jul 11:37 kernel-debuginfo-common-x86_64-3.10.0-1160.66.1.1chaos.ch6.x86_64.rpm
-rw-r--r--  1 sdsmirnov  staff    19370216 12 Jul 11:37 lustre-debuginfo-2.12.8_9.llnl.olaf1.toss5305-1.ch6_1.x86_64.rpm
-rw-r--r--  1 sdsmirnov  staff  1270395238 12 Jul 11:34 vmcore
-rw-r--r--  1 sdsmirnov  staff      148855 12 Jul 11:34 vmcore-dmesg.txt

and copied them over to my machine. I'll take a look and keep you updated.

Thanks,

Serguei.

Comment by Olaf Faaland [ 25/Jul/22 ]

Hi Serguei,
Do you have any updates?
thanks,
Olaf

Comment by Serguei Smirnov [ 26/Jul/22 ]

Hi Olaf,

I had too many distractions and haven't finished looking at the core yet.

Basically, I believe that what I see in the core so far does confirm the idea that messages are not getting finalized, but I still haven't understood why. At the LNet layer, the number of queued messages on the problem peer looks consistent with the high refcount, but I still need to dig more at the LND level and examine the message queues there.

Thanks,

Serguei.

Comment by Gerrit Updater [ 08/Aug/22 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48163
Subject: LU-15234 lnet: test for race when completing discovery
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 0eb36b2ace98b0c57595098a3a6d9f5de8e6045c

Comment by Serguei Smirnov [ 08/Aug/22 ]

Hi Olaf,

While examining the core I found that messages causing the delay are waiting to be sent: they are listed on lp_dc_pendq of the destination peer.

At the same time, the destination peer is not queued to be discovered, so it appears that there's no good reason for the messages to be delayed.

I pushed a test patch in order to rule out a race condition which somehow enables a thread to queue a message for a peer which is not (or no longer) going to be discovered. The new patch attempts to recognize this situation on discovery completion, print an error, and handle any messages which are still pending. This should help locate the race condition if it is actually occurring. If this is the only cause, then with this patch we should see the error message "Peer X msg list not empty on disc comp" and no further refcount increase.
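
As a rough sketch of the kind of check described above (illustrative only, not the actual test patch; the resend helper usage and message fields are assumed):

static void
lnet_peer_drain_pendq_sketch(struct lnet_peer *lp)
{
        LIST_HEAD(pending);
        struct lnet_msg *msg;
        struct lnet_msg *tmp;

        spin_lock(&lp->lp_lock);
        if (!list_empty(&lp->lp_dc_pendq)) {
                CERROR("Peer %s msg list not empty on disc comp\n",
                       libcfs_nid2str(lp->lp_primary_nid));
                list_splice_init(&lp->lp_dc_pendq, &pending);
        }
        spin_unlock(&lp->lp_lock);

        /* send anything that was stranded so the routed messages can be
         * finalized and the router credits/refcounts returned */
        list_for_each_entry_safe(msg, tmp, &pending, msg_list) {
                list_del_init(&msg->msg_list);
                lnet_send(msg->msg_src_nid_param, msg, msg->msg_rtr_nid_param);
        }
}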

Otherwise, I'll have to look for other possible causes.

Thanks,

Serguei.

Comment by Chris Horn [ 09/Aug/22 ]

Sounds like https://jira.whamcloud.com/browse/LU-12739 ?

Comment by Serguei Smirnov [ 09/Aug/22 ]

Chris,

Yes indeed, it looks very much like LU-12739

I'll port these changes.

Thanks,

Serguei.

Comment by Serguei Smirnov [ 11/Aug/22 ]

Hi Olaf,

I ported Chris's fix for LU-12739 to b2_12: https://review.whamcloud.com/#/c/48190/

Please give this patch a try. It aims to eliminate a race condition with effects potentially similar to what is seen in the coredump you provided.

Thanks,

Serguei.

Comment by Olaf Faaland [ 24/Aug/22 ]

Hi Serguei

I tested 2.12.9 + change 48190 today and results so far are promising. I'll test it further and post here in the next couple of days.

Comment by Olaf Faaland [ 25/Aug/22 ]

Hi Serguei,

2.12.9 + change 48190 held up well overnight, which is far beyond how long we've needed to wait for symptoms in the past. If you can get someone to perform a second review on the patch in gerrit, that would be great.

I'll deploy more widely and update here early next week.

thanks,
Olaf

Comment by Olaf Faaland [ 31/Aug/22 ]

Hi Serguei,

2.12.9 + change 48190 appears to have resolved this issue on orelic, which has been a reliable reproducer.

Olaf

Comment by Olaf Faaland [ 12/Sep/22 ]

As far as I'm concerned, this will be resolved when the patch lands to b2_12.  Do you agree?  If so, what is the plan for that?

thanks

Comment by Peter Jones [ 12/Sep/22 ]

Yes, I think that we can mark this ticket as a duplicate of LU-12739 once 48190 has been merged to b2_12. It should be included in the next b2_12-next batch we test.

Comment by Gerrit Updater [ 15/Sep/22 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48566
Subject: LU-15234 lnet: add mechanism for dumping lnd peer debug info
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: dc704df0be48fc9f933e6f2c6fede3c5991a951a

Comment by Peter Jones [ 20/Sep/22 ]

The LU-12739 fix has landed to b2_12 but perhaps this ticket should remain open to track the landing of https://review.whamcloud.com/#/c/fs/lustre-release/+/48566/?

Comment by Olaf Faaland [ 27/Sep/22 ]

>  The LU-12739 fix has landed to b2_12 but perhaps this ticket should remain open to track the landing of https://review.whamcloud.com/#/c/fs/lustre-release/+/48566/ ?

No opinion from me.

Thanks for getting this fixed.

Comment by Peter Jones [ 01/Oct/22 ]

I think it is really a call for ssmirnov. Do you still think that there is value in landing https://review.whamcloud.com/#/c/fs/lustre-release/+/48566/ or do you intend to abandon it in light of the review comments?

Comment by Serguei Smirnov [ 03/Oct/22 ]

No, I would prefer to address the comments and land this patch. Even though it is not fixing anything for this ticket (it is a debugging enhancement), it happens to have been created as a result of investigating this issue.

Comment by Gerrit Updater [ 25/Oct/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48566/
Subject: LU-15234 lnet: add mechanism for dumping lnd peer debug info
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 950e59ced18d49e9fdd31c1e9de43b89a0bc1c1d

Comment by Peter Jones [ 25/Oct/22 ]

Landed for 2.16
