[LU-17440] after move from 2.14 to 2.15: LNetError: 31941:0:(peer.c:2194:lnet_destroy_peer_ni_locked()) ASSERTION( list_empty(&lpni->lpni_peer_nis) ) Created: 17/Jan/24  Updated: 07/Feb/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.4
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Major
Reporter: Gian-Carlo Defazio Assignee: Serguei Smirnov
Resolution: Unresolved Votes: 0
Labels: llnl, topllnl
Environment:

TOSS 4.7-2.1
lustre 2.15.4_1.llnl-1
on lustre server asp


Attachments: File asp-orelic-console.tar.gz     File mutt-garter-2.14-no-bug.tar.gz     File mutt-garter-2.15-revert-LU-17062-no-bug.tar.gz     File mutt-garter-debug-2fcd4d27f.tar.gz     File mutt-garter-debug-2routers.tar.gz     File mutt-garter-debug-garter3_no_routes.tar.gz     File mutt-garter-reproducer.tar.gz    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We combined an OS update from TOSS 4.6-6 to TOSS 4.7-2.1 with a move from Lustre 2.14 to 2.15 (2.14.0_22.llnl-1 to lustre-2.15.4_1.llnl-1).

A few hours later we began to see this error, and eventually saw it on all 12 asp server nodes.

2024-01-16 23:13:19 [39638.886090] Lustre: aspls3-OST0004: Client 8f15405e-d4cc-cf3c-7534-051e7352cf50 (at 192.168.128.24@o2ib35) reconnecting
2024-01-16 23:13:19 [39638.896879] Lustre: Skipped 98 previous similar messages
2024-01-16 23:13:34 [39654.404520] LNetError: 165557:0:(peer.c:2194:lnet_destroy_peer_ni_locked()) ASSERTION( list_empty(&lpni->lpni_peer_nis) ) failed:
2024-01-16 23:13:34 [39654.416271] LNetError: 165557:0:(peer.c:2194:lnet_destroy_peer_ni_locked()) LBUG
2024-01-16 23:13:34 [39654.423671] Pid: 165557, comm: kiblnd_sd_00_01 4.18.0-513.9.1.1toss.t4.x86_64 #1 SMP Wed Nov 29 11:04:55 PST 2023
2024-01-16 23:13:34 [39654.433921] Call Trace TBD:
2024-01-16 23:13:34 [39654.436731] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
2024-01-16 23:13:34 [39654.441888] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
2024-01-16 23:13:34 [39654.446688] [<0>] lnet_destroy_peer_ni_locked+0x44d/0x4e0 [lnet]
2024-01-16 23:13:34 [39654.452722] [<0>] lnet_handle_find_routed_path+0x86c/0xee0 [lnet]
2024-01-16 23:13:34 [39654.458845] [<0>] lnet_select_pathway+0xb95/0x16c0 [lnet]
2024-01-16 23:13:34 [39654.464265] [<0>] lnet_send+0x6d/0x1e0 [lnet]
2024-01-16 23:13:34 [39654.468646] [<0>] lnet_parse_local+0x3ed/0xdd0 [lnet]
2024-01-16 23:13:34 [39654.473721] [<0>] lnet_parse+0xd7d/0x1490 [lnet]
2024-01-16 23:13:34 [39654.478366] [<0>] kiblnd_handle_rx+0x30e/0x900 [ko2iblnd]
2024-01-16 23:13:34 [39654.483782] [<0>] kiblnd_scheduler+0x104b/0x10d0 [ko2iblnd]
2024-01-16 23:13:34 [39654.489363] [<0>] kthread+0x14c/0x170
2024-01-16 23:13:34 [39654.493030] [<0>] ret_from_fork+0x1f/0x40
2024-01-16 23:13:34 [39654.497050] Kernel panic - not syncing: LBUG
2024-01-16 23:13:34 [39654.501320] CPU: 47 PID: 165557 Comm: kiblnd_sd_00_01 Kdump: loaded Tainted: P           OE  X --------- -  - 4.18.0-513.9.1.1toss.t\
4.x86_64 #1
2024-01-16 23:13:34 [39654.514172] Hardware name: Supermicro SSG-229P-DN2R24264-LL013/X11DSN-TS, BIOS 3.4 11/04/2020
2024-01-16 23:13:34 [39654.522683] Call Trace:
2024-01-16 23:13:34 [39654.525137]  dump_stack+0x41/0x60
2024-01-16 23:13:34 [39654.528457]  panic+0xe7/0x2ac
2024-01-16 23:13:34 [39654.531429]  ? ret_from_fork+0x1f/0x40
2024-01-16 23:13:34 [39654.535182]  lbug_with_loc.cold.8+0x18/0x18 [libcfs]
2024-01-16 23:13:34 [39654.540156]  lnet_destroy_peer_ni_locked+0x44d/0x4e0 [lnet]
2024-01-16 23:13:35 [39654.545747]  lnet_handle_find_routed_path+0x86c/0xee0 [lnet]
2024-01-16 23:13:35 [39654.551423]  ? lnet_peer_ni_find_locked+0x14/0x30 [lnet]
2024-01-16 23:13:35 [39654.556753]  lnet_select_pathway+0xb95/0x16c0 [lnet]
2024-01-16 23:13:35 [39654.561735]  ? kiblnd_check_sends_locked+0x1a5/0x4a0 [ko2iblnd]
2024-01-16 23:13:35 [39654.567656]  lnet_send+0x6d/0x1e0 [lnet]
2024-01-16 23:13:35 [39654.571600]  lnet_parse_local+0x3ed/0xdd0 [lnet]
2024-01-16 23:13:35 [39654.576238]  lnet_parse+0xd7d/0x1490 [lnet]
2024-01-16 23:13:35 [39654.580438]  ? try_to_wake_up+0x1c2/0x4f0
2024-01-16 23:13:35 [39654.584454]  kiblnd_handle_rx+0x30e/0x900 [ko2iblnd]
2024-01-16 23:13:35 [39654.589427]  ? __wake_up_common+0x7a/0x190
2024-01-16 23:13:35 [39654.593526]  kiblnd_scheduler+0x104b/0x10d0 [ko2iblnd]
2024-01-16 23:13:35 [39654.598665]  ? finish_wait+0x90/0x90
2024-01-16 23:13:35 [39654.602245]  ? kiblnd_cq_event+0x80/0x80 [ko2iblnd]
2024-01-16 23:13:35 [39654.607125]  kthread+0x14c/0x170
2024-01-16 23:13:35 [39654.610357]  ? set_kthread_struct+0x50/0x50
2024-01-16 23:13:35 [39654.614544]  ret_from_fork+0x1f/0x40
2024-01-16 23:13:36 [    0.000000] Linux version 4.18.0-513.9.1.1toss.t4.x86_64 (mockbuild@builder2-x86.buildfarm.internal) (gcc version 8.5.0 20210514 (Re\
d Hat 8.5.0-20) (GCC)) #1 SMP Wed Nov 29 11:04:55 PST 2023


 Comments   
Comment by Gian-Carlo Defazio [ 17/Jan/24 ]

For my notes, the local issue is TOSS-6214.

Comment by Gian-Carlo Defazio [ 18/Jan/24 ]

The asp cluster has 12 server nodes, asp[1-12]. asp[1-4] are MDT[0000-0003] and asp[5-12] are OST[0000-0007].

The nids are 172.19.1.[141-152]@o2ib100, so the last octet of the IPv4 address is (<node_number> + 140).

 

I will also include some logs for orelic. orelic[2-5] were rebooted from a 2.12 image to a 2.15 image around 2024-01-16 (15:00 - 17:00). orelic is an IB-to-TCP Lustre router cluster.

The nids for orelic[2-5] are 172.19.2.[22-25]@o2ib100 and 172.16.66.[22-25]@tcp

 

The console logs are in asp-orelic-console.tar.gz

Comment by Serguei Smirnov [ 18/Jan/24 ]

I can't explain yet how exactly this happened. Did you update all nodes to lustre-2.15.4_1.llnl-1 or just the servers? Please share a core file if you have it.

I'm a bit suspicious of LU-17062 "lnet: Update lnet_peer_*_decref_locked usage" change. If you have room to experiment, it may be worth reverting this change to see if it helps. I'll work on reproducing locally, too.

Thanks,

Serguei.


Comment by Gian-Carlo Defazio [ 18/Jan/24 ]

There's still a variety of Lustre versions in use: 2.12, 2.14, and 2.15.

The only production server cluster we've put 2.15 on is asp. The TOSS 4 (RHEL 8) server clusters are running 2.14 (2.14.0_24.llnl-1). asp was running 2.14.0_22.llnl-1 before the update. The TOSS 3 (RHEL 7) server clusters run 2.12.

Most client clusters are running 2.12; some experimental and test clusters run 2.15.

The relic clusters (zrelic and orelic) run 2.12, but orelic has been 2.15 for a while for testing. It got rebooted back to 2.12 on the morning of 2024-01-16 and then put back to 2.15 that afternoon.

 

So to be clear, the only 2.15 updates were asp (a server cluster) and orelic (a router cluster).

orelic had been on 2.15 previously; asp had only ever been on 2.14.

 

One of the first things I checked was to make sure we didn't leave out https://jira.whamcloud.com/browse/LU-17062 in our 2.15_4-llnl branch. We'll look into testing with that patch removed.

 

Unfortunately we didn't get any core files, despite all the LBUGs.

Comment by Serguei Smirnov [ 19/Jan/24 ]

I did a review of which fixes from Lustre master the LLNL branch may be missing, but haven't found anything that stands out as definitely related.

I'd suggest, though, that you consider porting LU-16709 "lnet: fix locking multiple NIDs of the MR peer". It fixes LU-14668 "lnet: Lock primary NID logic", which the LLNL branch does have.

Thanks,

Serguei.

Comment by Gian-Carlo Defazio [ 26/Jan/24 ]

I was able to reproduce the bug.

The original bug seems to have been between asp and the compute cluster ruby. ruby has 8 routers and is connected to asp:

<ruby_compute_nodes> <—> o2ib39 <—> <ruby_routers> <—> o2ib100 <—> asp

However, a route is missing on asp. The routes to ruby should be 172.19.2.[39-46], but the routes are 172.19.2.[39-45], so 172.19.2.46 is missing. I noticed that this nid appears right before a lot of the LBUGs in the console logs.

 

I recreated this setup using test clusters mutt (compute) and garter (server).

The garter nids for garter[1-8] are 172.19.1.[133-140]@o2ib100

<mutt_compute_nodes> <—> o2ib44 <—> <mutt_routers> <—> o2ib100 <—> garter

The mutt routers are mutt[1-4] with nids 172.19.1.[105-108]@o2ib100 and 192.168.128.[1-4]@o2ib44. I removed the route through mutt4 (172.19.1.108@o2ib100) on all garter nodes, so now the routes are 172.19.1.[105-107]@o2ib100.

I then pinged from a mutt compute node, mutt7 (192.168.128.7@o2ib44), to garter3 (172.19.1.135@o2ib100) using lnetctl ping. I had to ping twice, I assume to round-robin my way to using mutt4, for which garter has no route.

This caused the error on garter3. I had panic on LBUG turned off.

I'll upload the debug logs for mutt7, the mutt routers mutt[1-4], and garter3 as mutt-garter-reproducer.tar.gz.

I haven't looked through these debug logs yet.

I'll see if the patch recommendations you made make a difference.

Also, we are currently in the configuration with a route to ruby missing on asp, but asp is running lustre-2.14.0_24.llnl-1.t4.x86_64 and we don't get this bug.

Comment by Gian-Carlo Defazio [ 26/Jan/24 ]

I was unable to recreate the bug using our 2.14 branch (lustre-2.14.0_24.llnl-1.t4.x86_64), or using our 2.15 branch (lustre-2.15.4_1.llnl-1) with the only change being LU-17062 reverted. I've uploaded the debug files for these tests.

I noticed that I used garter4 for the server in these examples, whereas for the bug reproducer I used garter3.

Comment by Gian-Carlo Defazio [ 02/Feb/24 ]

This is what I think is going wrong; do you agree?

As noted before, without LU-17062, the problem goes away.

For the code lines below, I'm referring to where we added LU-17062 in our 2.15.4_1-llnl branch https://github.com/LLNL/lustre/commit/2e2719341590edef547e16b4051684781cf1b795

The debug data is in mutt-garter-debug-2fcd4d27f.tar.gz, in the file dk.TOSS-6214.garter3.lustre-2.15.4_1.llnl_3_g2fcd4d2-1.t4.x86_64.1706816872, lines 737 to 771.
The debug-patched version of Lustre that was running on garter3 is at https://github.com/LLNL/lustre/tree/debug_print-lnet_handle_find_routed_path

The setup is:
server garter3: 172.19.1.135@o2ib100
routers mutt[1-4]: 172.19.1.[105-108]@o2ib100 and 192.168.128.[1-4]@o2ib44
client mutt7: 192.168.128.7@o2ib44

garter3 <—> o2ib100 <—> mutt[1-4] <—> o2ib44 <—> mutt7

garter3 has routes to o2ib44 via mutt[1-3], while mutt7 has routes to o2ib100 via mutt[1-4], so there is an asymmetry regarding mutt4.
mutt7 pings garter3 via mutt4.

Looking at function lnet_handle_find_routed_path():

At line 2182, there is a call to lnet_peer_ni_find_locked().
The call stack is lnet_handle_find_routed_path() -> lnet_peer_ni_find_locked() -> lnet_get_peer_ni_locked() -> lnet_peer_ni_addref_locked().
The return value is gwni. If gwni is not NULL, it is the gateway NI. In the debug logs for garter3, I see the nid for mutt4 when gwni is returned.

The refcount for mutt4 is incremented due to lnet_peer_ni_addref_locked() being in the call stack above.

Before LU-17062, if gwni is not NULL, the refcount for mutt4 is decremented almost immediately, at line 2184.
After LU-17062, the decrement happens at the end of the function.

At line 2291, lnet_find_route_locked() is called. In the debug logs I can see the value of gwni->lpni_nid change after this call.
It changes from the nid of mutt4 to the nid of another mutt router that garter3 does have a route through.

Before LU-17062, there were no other calls to lnet_peer_ni_decref_locked().
After LU-17062, there is a call to lnet_peer_ni_decref_locked() if gwni was not NULL when it was set at line 2182 and hasn't subsequently become NULL. It shouldn't have become NULL if there is some NI that garter3 can use to get to o2ib44, which there is in this case.

After LU-17062, at line 2365, lnet_peer_ni_decref_locked() is called on a router peer NI with a refcount of 1 (assuming gwni changed values due to lnet_find_route_locked()). The refcount goes to 0 (or would go to 0), so lnet_destroy_peer_ni_locked() is called. lnet_destroy_peer_ni_locked() has asserts to make sure something like this doesn't happen, so the asserts fail. A toy sketch of this sequence is included below.
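
To show the refcount mismatch in isolation, here is a small, self-contained userspace sketch. The types and helpers (toy_peer_ni, toy_addref, toy_decref) are hypothetical stand-ins rather than LNet code, and the line-number comments refer to the 2.15.4_1-llnl positions discussed above.

/* Toy model of the refcount imbalance described above.  All names are
 * hypothetical stand-ins for the real LNet structures and helpers. */
#include <assert.h>
#include <stdio.h>

struct toy_peer_ni {
        const char *name;
        int refcount;           /* starts at 1: the "table" reference */
};

static void toy_addref(struct toy_peer_ni *p)
{
        p->refcount++;
}

static void toy_decref(struct toy_peer_ni *p)
{
        if (--p->refcount == 0) {
                /* models lnet_destroy_peer_ni_locked() and its
                 * ASSERTION(list_empty(&lpni->lpni_peer_nis)) */
                printf("destroying %s -- the assertion would fire here\n", p->name);
                assert(0 && "destroyed a peer NI that is still in use");
        }
}

int main(void)
{
        struct toy_peer_ni mutt4 = { "mutt4", 1 };
        struct toy_peer_ni mutt1 = { "mutt1", 1 };

        /* ~line 2182: the lookup returns mutt4 with a reference taken
         * (lnet_peer_ni_find_locked -> lnet_peer_ni_addref_locked) */
        struct toy_peer_ni *gwni = &mutt4;
        toy_addref(gwni);                       /* mutt4 refcount: 2 */

        /* pre-LU-17062 the reference was dropped right here (line 2184);
         * post-LU-17062 it is held until the end of the function */

        /* ~line 2291: route selection replaces gwni because garter3 has
         * no route through mutt4; no addref is taken on the new pointer,
         * and the reference on mutt4 is never dropped */
        gwni = &mutt1;                          /* mutt1 refcount: still 1 */

        /* ~line 2365: the deferred decref lands on the wrong peer NI */
        toy_decref(gwni);                       /* mutt1: 1 -> 0, "destroy" */

        return 0;
}

Running this trips the assert, mirroring the LBUG: the deferred decref lands on a peer NI whose reference this code path never took, while the reference that was actually taken (on mutt4) is leaked. This only models the failure; it does not model the fix in patch 53896.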

Comment by Serguei Smirnov [ 02/Feb/24 ]

Thanks defazio for tracking this down! I think your analysis is correct: indeed it looks like essentially we decref on what we didn't addref here. 

(I wonder if this can be reproduced with just two routers, for the sake of adding a testcase.)

Are you planning to push a patch for this or would you like me to?

 

Comment by Gian-Carlo Defazio [ 02/Feb/24 ]

I'll push a patch. I'm not entirely sure what the best way to go about this is, but I guess I'll find out in the code reviews.

I'll run a test with 2 routers to see if I can reproduce it.

Another weird thing I noticed was that it didn't reproduce on a different test cluster. That cluster, slag, isn't the same hardware as asp and garter. So I'm gonna go back and confirm that I didn't just have things set up wrong. However, the bug that I found doesn't look hardware specific.

Comment by Gian-Carlo Defazio [ 02/Feb/24 ]

The bug does reproduce with 2 routers. I tested a configuration where mutt7 has routes through mutt[1,2] and garter3 has a route through mutt1 only. The debug files are in mutt-garter-debug-2routers.tar.gz

 

The bug does not reproduce in the case that garter3 has no routes to o2ib44. The debug files are in mutt-garter-debug-garter3_no_routes.tar.gz

Comment by Gerrit Updater [ 03/Feb/24 ]

"Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53896
Subject: LU-17440 lnet: prevent errorneous decref for asym route
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8466a612e1bf5ceda0e3bf32a5eada2277cbbc75

Comment by Gian-Carlo Defazio [ 06/Feb/24 ]

ssmirnov
I've gone with the patch that Chris Horn recommended. As for the test case, you mentioned using just two routers. Is there a VM configuration that already has 2 routers set up? Or do I need to set things up from within a test script like sanity-lnet.sh?

Comment by Serguei Smirnov [ 07/Feb/24 ]

I was hoping it would be possible to go with sanity-lnet.sh. There are some routing tests there already, but I don't think there's a test which uses 2 routers.

hornc can correct me if I'm wrong, but I think it should be possible to basically follow the existing pattern with a new test if we make sure there are enough test nodes allocated.
