[LU-17440] after move from 2.14 to 2.15: LNetError: 31941:0:(peer.c:2194:lnet_destroy_peer_ni_locked()) ASSERTION( list_empty(&lpni->lpni_peer_nis) ) Created: 17/Jan/24 Updated: 07/Feb/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.4 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Gian-Carlo Defazio | Assignee: | Serguei Smirnov |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | llnl, topllnl |
| Environment: | TOSS 4.7-2.1 |
| Severity: | 3 |
| Description |
|
We combined an OS update from TOSS 4.6-6 to TOSS 4.7-2.1 with a move from Lustre 2.14 to 2.15 (2.14.0_22.llnl-1 to lustre-2.15.4_1.llnl-1). A few hours later we began to see this error, and eventually saw it on all 12 asp server nodes.

2024-01-16 23:13:19 [39638.886090] Lustre: aspls3-OST0004: Client 8f15405e-d4cc-cf3c-7534-051e7352cf50 (at 192.168.128.24@o2ib35) reconnecting
2024-01-16 23:13:19 [39638.896879] Lustre: Skipped 98 previous similar messages
2024-01-16 23:13:34 [39654.404520] LNetError: 165557:0:(peer.c:2194:lnet_destroy_peer_ni_locked()) ASSERTION( list_empty(&lpni->lpni_peer_nis) ) failed:
2024-01-16 23:13:34 [39654.416271] LNetError: 165557:0:(peer.c:2194:lnet_destroy_peer_ni_locked()) LBUG
2024-01-16 23:13:34 [39654.423671] Pid: 165557, comm: kiblnd_sd_00_01 4.18.0-513.9.1.1toss.t4.x86_64 #1 SMP Wed Nov 29 11:04:55 PST 2023
2024-01-16 23:13:34 [39654.433921] Call Trace TBD:
2024-01-16 23:13:34 [39654.436731] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
2024-01-16 23:13:34 [39654.441888] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
2024-01-16 23:13:34 [39654.446688] [<0>] lnet_destroy_peer_ni_locked+0x44d/0x4e0 [lnet]
2024-01-16 23:13:34 [39654.452722] [<0>] lnet_handle_find_routed_path+0x86c/0xee0 [lnet]
2024-01-16 23:13:34 [39654.458845] [<0>] lnet_select_pathway+0xb95/0x16c0 [lnet]
2024-01-16 23:13:34 [39654.464265] [<0>] lnet_send+0x6d/0x1e0 [lnet]
2024-01-16 23:13:34 [39654.468646] [<0>] lnet_parse_local+0x3ed/0xdd0 [lnet]
2024-01-16 23:13:34 [39654.473721] [<0>] lnet_parse+0xd7d/0x1490 [lnet]
2024-01-16 23:13:34 [39654.478366] [<0>] kiblnd_handle_rx+0x30e/0x900 [ko2iblnd]
2024-01-16 23:13:34 [39654.483782] [<0>] kiblnd_scheduler+0x104b/0x10d0 [ko2iblnd]
2024-01-16 23:13:34 [39654.489363] [<0>] kthread+0x14c/0x170
2024-01-16 23:13:34 [39654.493030] [<0>] ret_from_fork+0x1f/0x40
2024-01-16 23:13:34 [39654.497050] Kernel panic - not syncing: LBUG
2024-01-16 23:13:34 [39654.501320] CPU: 47 PID: 165557 Comm: kiblnd_sd_00_01 Kdump: loaded Tainted: P OE X --------- - - 4.18.0-513.9.1.1toss.t4.x86_64 #1
2024-01-16 23:13:34 [39654.514172] Hardware name: Supermicro SSG-229P-DN2R24264-LL013/X11DSN-TS, BIOS 3.4 11/04/2020
2024-01-16 23:13:34 [39654.522683] Call Trace:
2024-01-16 23:13:34 [39654.525137] dump_stack+0x41/0x60
2024-01-16 23:13:34 [39654.528457] panic+0xe7/0x2ac
2024-01-16 23:13:34 [39654.531429] ? ret_from_fork+0x1f/0x40
2024-01-16 23:13:34 [39654.535182] lbug_with_loc.cold.8+0x18/0x18 [libcfs]
2024-01-16 23:13:34 [39654.540156] lnet_destroy_peer_ni_locked+0x44d/0x4e0 [lnet]
2024-01-16 23:13:35 [39654.545747] lnet_handle_find_routed_path+0x86c/0xee0 [lnet]
2024-01-16 23:13:35 [39654.551423] ? lnet_peer_ni_find_locked+0x14/0x30 [lnet]
2024-01-16 23:13:35 [39654.556753] lnet_select_pathway+0xb95/0x16c0 [lnet]
2024-01-16 23:13:35 [39654.561735] ? kiblnd_check_sends_locked+0x1a5/0x4a0 [ko2iblnd]
2024-01-16 23:13:35 [39654.567656] lnet_send+0x6d/0x1e0 [lnet]
2024-01-16 23:13:35 [39654.571600] lnet_parse_local+0x3ed/0xdd0 [lnet]
2024-01-16 23:13:35 [39654.576238] lnet_parse+0xd7d/0x1490 [lnet]
2024-01-16 23:13:35 [39654.580438] ? try_to_wake_up+0x1c2/0x4f0
2024-01-16 23:13:35 [39654.584454] kiblnd_handle_rx+0x30e/0x900 [ko2iblnd]
2024-01-16 23:13:35 [39654.589427] ? __wake_up_common+0x7a/0x190
2024-01-16 23:13:35 [39654.593526] kiblnd_scheduler+0x104b/0x10d0 [ko2iblnd]
2024-01-16 23:13:35 [39654.598665] ? finish_wait+0x90/0x90
2024-01-16 23:13:35 [39654.602245] ? kiblnd_cq_event+0x80/0x80 [ko2iblnd]
2024-01-16 23:13:35 [39654.607125] kthread+0x14c/0x170
2024-01-16 23:13:35 [39654.610357] ? set_kthread_struct+0x50/0x50
2024-01-16 23:13:35 [39654.614544] ret_from_fork+0x1f/0x40
2024-01-16 23:13:36 [ 0.000000] Linux version 4.18.0-513.9.1.1toss.t4.x86_64 (mockbuild@builder2-x86.buildfarm.internal) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-20) (GCC)) #1 SMP Wed Nov 29 11:04:55 PST 2023 |
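For context on the assertion itself: lnet_destroy_peer_ni_locked() requires that a peer NI already be unlinked from its peer's lpni_peer_nis list by the time its last reference is dropped. The toy program below is an illustration of that invariant, not Lustre code; the struct and helper names are made up for this sketch.

{code:c}
/* Toy model (not Lustre source) of the invariant behind the LBUG above:
 * a peer NI must be off its peer's lpni_peer_nis list before its
 * refcount reaches zero; lnet_destroy_peer_ni_locked() asserts exactly
 * that. All names here are illustrative. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct peer_ni {
	int  refcount;
	bool linked;	/* stands in for !list_empty(&lpni->lpni_peer_nis) */
};

static void destroy_peer_ni(struct peer_ni *lpni)
{
	/* analogue of ASSERTION( list_empty(&lpni->lpni_peer_nis) ) */
	assert(!lpni->linked);
}

static void decref(struct peer_ni *lpni)
{
	if (--lpni->refcount == 0)
		destroy_peer_ni(lpni);
}

int main(void)
{
	struct peer_ni ni = { .refcount = 1, .linked = true };

	/* one decref too many while the peer NI is still linked: the
	 * refcount hits zero, destroy runs, and the assert fires -- the
	 * userspace analogue of the LBUG in this report */
	decref(&ni);
	printf("not reached if the assert fires\n");
	return 0;
}
{code}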
| Comments |
| Comment by Gian-Carlo Defazio [ 17/Jan/24 ] |
|
For my notes, the local issue is TOSS-6214. |
| Comment by Gian-Carlo Defazio [ 18/Jan/24 ] |
|
The asp cluster has 12 server nodes, asp[1-12]: asp[1-4] are MDT[0000-0003] and asp[5-12] are OST[0000-0007]. The NIDs are 172.19.1.[141-152]@o2ib100, so the last octet of the IPv4 address is (<node_number> + 140).
I will also include some logs for orelic. orelic[2-5] were rebooted from a 2.12 image to a 2.15 image between about 15:00 and 17:00 on 2024-01-16. orelic is an IB-to-TCP Lustre router cluster. The NIDs for orelic[2-5] are 172.19.2.[22-25]@o2ib100 and 172.16.66.[22-25]@tcp.
The console logs are in asp-orelic-console.tar.gz |
| Comment by Serguei Smirnov [ 18/Jan/24 ] |
|
I can't explain yet how exactly this happened. Did you update all nodes to lustre-2.15.4_1.llnl-1, or just the servers? Please share a core file if you have it. I'm a bit suspicious of […]
Thanks, Serguei.
|
| Comment by Gian-Carlo Defazio [ 18/Jan/24 ] |
|
There's still a variety of Lustre versions in use: 2.12, 2.14, and 2.15. The only production server cluster we've put 2.15 on is asp. The TOSS 4 (RHEL 8) server clusters are running 2.14 (2.14.0_24.llnl-1); asp was running 2.14.0_22.llnl-1 before the update. The TOSS 3 (RHEL 7) server clusters run 2.12. Most client clusters are running 2.12; some experimental and test clusters run 2.15. The relic clusters (zrelic and orelic) run 2.12, but orelic has been on 2.15 for a while for testing. It got rebooted back to 2.12 on the morning of 2024-01-16 and then put back to 2.15 that afternoon.
So to be clear, the only 2.15 updates were asp (a server cluster) and orelic (a router cluster). orelic had been 2.15 previously; asp had only ever been 2.14.
One of the first things I checked was to make sure we didn't leave out https://jira.whamcloud.com/browse/LU-17062 in our 2.15_4-llnl branch. We'll look into testing with that patch removed.
Unfortunately we didn't get any core files, despite all the LBUGs. |
| Comment by Serguei Smirnov [ 19/Jan/24 ] |
|
I did a review of which fixes from Lustre master may be missing from the LLNL branch, but haven't found anything that would stand out as definitely related. I'd suggest, though, that you consider porting […]
Thanks, Serguei. |
| Comment by Gian-Carlo Defazio [ 26/Jan/24 ] |
|
I was able to reproduce the bug. The original bug seems to have been between asp and the compute cluster ruby. ruby has 8 routers and is connected to asp:

<ruby_compute_nodes> <—> o2ib39 <—> <ruby_routers> <—> o2ib100 <—> asp

However, a route is missing on asp: the routes to ruby should be 172.19.2.[39-46], but the routes are 172.19.2.[39-45], so 172.19.2.46 is missing. I noticed that this NID appears right before a lot of the LBUGs in the console logs.
I recreated this setup using the test clusters mutt (compute) and garter (server). The garter NIDs for garter[1-8] are 172.19.1.[133-140]@o2ib100.

<mutt_compute_nodes> <—> o2ib44 <—> <mutt_routers> <—> o2ib100 <—> garter

The mutt routers are mutt[1-4], with NIDs 172.19.1.[105-108]@o2ib100 and 192.168.128.[1-4]@o2ib44. I removed the route through mutt4 (172.19.1.108@o2ib100) on all garter nodes, so now the routes are 172.19.1.[105-107]@o2ib100. I then pinged from a mutt compute node, mutt7 (192.168.128.7@o2ib44), to garter3 (172.19.1.135@o2ib100) using lnetctl ping. I had to ping twice, I assume to round-robin my way onto mutt4, for which garter has no route (a toy model of this round-robin effect follows this comment). This caused the error on garter3. I had panic on LBUG turned off. I'll upload the debug logs for mutt7, the mutt routers mutt[1-4], and garter3 as mutt-garter-reproducer.tar.gz. I haven't looked through these debug logs yet. I'll see if the patch recommendations you made make a difference. Also, we are currently in the configuration with a route to ruby missing on asp, but asp is running lustre-2.14.0_24.llnl-1.t4.x86_64 and we don't get this bug. |
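The "ping twice" observation can be modeled with the toy program below. It is not lnet code: it assumes a simple round-robin over the client's routers, which is a simplification of the real selection logic, and all names are illustrative. The point is that once the rotation lands on mutt4, the message reaches garter3 via the one gateway it has no route entry for, which is the case that drives the reply path into the crash.

{code:c}
/* Toy model (not lnet code) of the reproducer: mutt7 rotates across its
 * four routers, so some pings reach garter3 via mutt4. garter3 has no
 * route entry for mutt4, which is exactly the case that sends the reply
 * path down the buggy lookup analyzed below. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *routers[] = { "mutt1", "mutt2", "mutt3", "mutt4" };
	const char *server_routes[] = { "mutt1", "mutt2", "mutt3" };	/* mutt4 removed */

	for (int ping = 0; ping < 4; ping++) {
		const char *via = routers[ping % 4];	/* assumed round-robin */
		int known = 0;

		for (size_t i = 0; i < sizeof(server_routes) / sizeof(server_routes[0]); i++)
			if (strcmp(server_routes[i], via) == 0)
				known = 1;

		printf("ping %d arrives via %s: %s\n", ping + 1, via,
		       known ? "reply routed normally"
			     : "arrival router has no configured route -> buggy path");
	}
	return 0;
}
{code}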
| Comment by Gian-Carlo Defazio [ 26/Jan/24 ] |
|
I was unable to recreate the bug using our 2.14 branch (lustre-2.14.0_24.llnl-1.t4.x86_64), or using our 2.15 branch (lustre-2.15.4_1.llnl-1) with the only changes being […].

I noticed that I used garter4 for the server in these examples, whereas for the bug reproducer I used garter3. |
| Comment by Gian-Carlo Defazio [ 02/Feb/24 ] |
|
This is what I think is going wrong, do you agree? As noted before, without […]

For the code lines below, I'm referring to where we added […]. The debug data is in mutt-garter-debug-2fcd4d27f.tar.gz, in the dk.TOSS-6214.garter3.lustre-2.15.4_1.llnl_3_g2fcd4d2-1.t4.x86_64.1706816872 file, lines 737 to 771.

The setup is:

garter3 <—> o2ib100 <—> mutt[1-4] <—> o2ib44 <—> mutt7

garter3 has routes to o2ib44 via mutt[1-3]; mutt7 has routes to o2ib100 via mutt[1-4], so there is an asymmetry regarding mutt4.

Looking at the function lnet_handle_find_routed_path(): at line 2182 there is a call to lnet_peer_ni_find_locked(). The refcount for mutt4 is incremented, due to lnet_peer_ni_addref_locked() being in the call stack above. Then at line 2291, lnet_find_route_locked() is called. In the debug logs I can see the value of gwni->lpni_nid change after this call: before it, gwni refers to mutt4; after it, gwni refers to a different gateway. |
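A minimal toy reconstruction of the suspected lifecycle bug, under the assumptions in the analysis above. This is not the lnet source; find_locked()/decref_locked() and struct peer_ni are illustrative stand-ins for lnet_peer_ni_find_locked()/lnet_peer_ni_decref_locked() and the real peer NI structure.

{code:c}
/* Toy reconstruction (not Lustre source) of the suspected bug in
 * lnet_handle_find_routed_path(): a ref is taken on one peer NI by the
 * find call, the same variable is then repointed at a different peer NI
 * by route selection, and the exit-path decref lands on the wrong
 * object. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct peer_ni {
	const char *nid;
	int         refcount;
	bool        linked;	/* !list_empty(&lpni->lpni_peer_nis) */
};

static struct peer_ni *find_locked(struct peer_ni *lpni)
{
	lpni->refcount++;	/* lnet_peer_ni_addref_locked() in call path */
	return lpni;
}

static void decref_locked(struct peer_ni *lpni)
{
	if (--lpni->refcount == 0)
		assert(!lpni->linked);	/* lnet_destroy_peer_ni_locked() */
}

int main(void)
{
	struct peer_ni mutt4 = { "172.19.1.108@o2ib100", 1, true };
	struct peer_ni mutt1 = { "172.19.1.105@o2ib100", 1, true };

	/* ~line 2182: ref taken on the arrival router, mutt4 */
	struct peer_ni *gwni = find_locked(&mutt4);

	/* ~line 2291: garter3 has no route via mutt4, so route selection
	 * repoints gwni at another gateway without taking a new ref */
	gwni = &mutt1;

	printf("mutt4 refcount leaked at %d\n", mutt4.refcount);

	/* exit path: decref on what we never addref'd; mutt1's count
	 * hits zero while it is still linked, tripping the assert */
	decref_locked(gwni);
	return 0;
}
{code}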
| Comment by Serguei Smirnov [ 02/Feb/24 ] |
|
Thanks defazio for tracking this down! I think your analysis is correct: indeed, it looks like essentially we decref on what we didn't addref here. (I wonder if this can be reproduced with just two routers, for the sake of adding a test case.) Are you planning to push a patch for this, or would you like me to?
|
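One plausible shape for a fix, reusing the toy types from the sketch in the previous comment. This is only an illustration of the pairing principle; the actual change in the patch pushed later in this ticket may well differ.

{code:c}
/* Replacing the middle of main() in the earlier toy. One plausible fix
 * shape, not necessarily what the actual patch does. */
struct peer_ni *gwni = find_locked(&mutt4);	/* ref on mutt4 */

/* route selection must switch gateways: drop the old ref first,
 * then take a ref on the new gateway */
decref_locked(gwni);
gwni = find_locked(&mutt1);

/* the exit-path decref now pairs with the addref just taken */
decref_locked(gwni);
{code}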
| Comment by Gian-Carlo Defazio [ 02/Feb/24 ] |
|
I'll push a patch. I'm not entirely sure what the best way to go about this is, but I guess I'll find out in the code reviews. I'll run a test with 2 routers to see if I can reproduce it. Another weird thing I noticed was that it didn't reproduce on a different test cluster. That cluster, slag, isn't the same hardware as asp and garter, so I'm going to go back and confirm that I didn't just have things set up wrong. However, the bug that I found doesn't look hardware-specific. |
| Comment by Gian-Carlo Defazio [ 02/Feb/24 ] |
|
The bug does reproduce with 2 routers. I tested where mutt7 has routes through mutt[1,2] and garter3 has a route through mutt1. The debug files are in mutt-garter-debug-2routers.tar.gz
The bug does not reproduce in the case that garter3 has no routes to o2ib44. The debug files are in mutt-garter-debug-garter3_no_routes.tar.gz |
| Comment by Gerrit Updater [ 03/Feb/24 ] |
|
"Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53896 |
| Comment by Gian-Carlo Defazio [ 06/Feb/24 ] |
|
ssmirnov […] |
| Comment by Serguei Smirnov [ 07/Feb/24 ] |
|
I was hoping it would be possible to go with sanity-lnet.sh. There are some routing tests there already, but I don't think there's a test which uses 2 routers. hornc can correct me if I'm wrong, but I think it should be possible to basically follow the existing pattern with a new test, if we make sure there are enough test nodes allocated. |