[LU-14810] sanity-lnet test_212: lnet_assert_handler_unused() ASSERTION(md->md_handler != handler) failed Created: 03/Jul/21  Updated: 02/Feb/24

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Cyril Bordage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14627 Lost ref on lnet_peer in discovery le... Resolved
is related to LU-13005 Remove LNet event queuing subsystem (... Resolved
is related to LU-17313 sanity-lnet test_205/test_220: LNetEr... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/84ed5e32-bda8-4880-a254-dbd96cb6a478

test_212 failed with the following error:

trevis-79vm3 crashed during sanity-lnet test_212

LNetError: 944562:0:(lib-md.c:288:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed: 
Pid: 944562, comm: lnet_discovery 4.18.0-240.22.1.el8_3.aarch64 #1 SMP Thu Apr 8 19:01:45 UTC 2021
Call Trace:
 libcfs_call_trace+0xb8/0x118 [libcfs]
 lbug_with_loc+0x60/0xa0 [libcfs]
 lnet_assert_handler_unused+0xb8/0xe0 [lnet]
 lnet_peer_discovery+0x16c4/0x1cb0 [lnet]
 kthread+0x130/0x138
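
For reference, the assertion fires in lnet_assert_handler_unused() in lib-md.c, which is called when an event handler is being retired. A paraphrased sketch of its logic (not the verbatim source): walk every active MD in each per-CPT resource container and assert that none still points at the handler:

    void lnet_assert_handler_unused(lnet_handler_t handler)
    {
            struct lnet_res_container *container;
            int cpt;

            /* Any MD still referencing the handler at this point was
             * not unlinked in time; that is the LASSERT this ticket
             * keeps tripping. */
            cfs_percpt_for_each(container, cpt, the_lnet.ln_md_containers) {
                    struct lnet_libmd *md;

                    lnet_res_lock(cpt);
                    list_for_each_entry(md, &container->rec_active, md_list)
                            LASSERT(md->md_handler != handler);
                    lnet_res_unlock(cpt);
            }
    }

So the crash means some discovery MD was still live, with md_handler set to the discovery event handler, when the discovery thread asserted the handler unused.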

This test was recently landed via patch https://review.whamcloud.com/43418 "LU-14627 lnet: Ensure ref taken when queueing for discovery"

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-lnet test_212 - trevis-79vm3 crashed during sanity-lnet test_212



 Comments   
Comment by Andreas Dilger [ 03/Jul/21 ]

Chris, could you please take a look at this? I just hit this failure in a functionally no-op patch, and since this test was just added by your patch, it may be related.

Comment by Chris Horn [ 12/Jul/21 ]

adilger, is there a vmcore available?

Comment by Andreas Dilger [ 13/Jul/21 ]

Chris, we used to collect crash dumps, but I don't see them linked from the Maloo test page for the crash. There is a console log attached for the failed server.

It should also be noted that this failure is with review-ldiskfs-arm (i.e. an ARM client), so that may be of interest. It runs in little-endian mode, but with a 64KB PAGE_SIZE.

Comment by Li Xi [ 04/Jul/22 ]

Might have hit another one on master, but no log was saved:

https://testing.whamcloud.com/test_sets/83dcfa5f-0848-43aa-8278-a7eb263bdd11

Comment by Andreas Dilger [ 27/Feb/23 ]

+1 on master: https://testing.whamcloud.com/test_sets/b44670dd-ae14-4cdc-afdf-083de5b4ebc5

Comment by Andreas Dilger [ 29/Mar/23 ]

+1 on master + Ubuntu 22.04:
https://testing.whamcloud.com/test_sets/1a78e4ac-57d1-4905-97df-bf186dc0c619

Comment by Chris Horn [ 03/Apr/23 ]

The issue is a race between callers of lnet_ping_target_setup() and lnet_ping_target_fini(). I'm not sure how to resolve it yet.
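
To illustrate the kind of window this describes, here is a hypothetical interleaving (a reconstruction, not a captured trace; the per-step internals are assumptions):

    /*
     *   thread A: lnet_ping_target_fini()   thread B: lnet_ping_target_setup()
     *   ---------------------------------   ----------------------------------
     *   unlink the current ping target MD
     *                                       lnet_md_build(): fresh MD created,
     *                                       md->md_handler = ping handler
     *   lnet_assert_handler_unused(handler)
     *     -> walks the active MDs, finds
     *        B's fresh MD with md_handler
     *        == handler -> LASSERT fires
     */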

Comment by Alex Zhuravlev [ 20/Nov/23 ]

https://testing.whamcloud.com/test_sets/bf0e171e-4549-4d8e-b437-1ac07124c4a0 on master

Comment by Etienne Aujames [ 27/Nov/23 ]

Same stack on "sanity-lnet test_220":
https://testing.whamcloud.com/test_sets/245f2ca0-247a-4f4b-989a-807f4642547e

[17408.323111] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-lnet test 220: Add routes w\/default options - check aliveness ========================================================== 22:18:00 \(1700864280\)
[17408.530260] Lustre: DEBUG MARKER: == sanity-lnet test 220: Add routes w/default options - check aliveness ========================================================== 22:18:00 (1700864280)
[17408.806233] Lustre: DEBUG MARKER: /usr/sbin/lustre_rmmod
[17409.412889] Key type lgssc unregistered
[17409.479271] LNetError: 9972:0:(lib-md.c:281:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed: 
[17409.481545] LNetError: 9972:0:(lib-md.c:281:lnet_assert_handler_unused()) LBUG
[17409.482950] Pid: 9972, comm: lnet_discovery 4.18.0-477.27.1.el8_lustre.x86_64 #1 SMP Wed Nov 8 23:01:53 UTC 2023
[17409.484866] Call Trace TBD:
[17409.485593] [<0>] libcfs_call_trace+0x63/0x90 [libcfs]
[17409.486676] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
[17409.487633] [<0>] lnet_assert_handler_unused+0xa0/0xd0 [lnet]
[17409.488896] [<0>] lnet_peer_discovery+0x14fb/0x1730 [lnet]
[17409.489990] [<0>] kthread+0x134/0x150
[17409.490785] [<0>] ret_from_fork+0x35/0x40
[17409.491622] Kernel panic - not syncing: LBUG
[17409.492461] CPU: 1 PID: 9972 Comm: lnet_discovery Kdump: loaded Tainted: G        W  OE    --------- -  - 4.18.0-477.27.1.el8_lustre.x86_64 #1
[17409.494792] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[17409.495890] Call Trace:
[17409.496425]  dump_stack+0x41/0x60
[17409.497114]  panic+0xe7/0x2ac
[17409.497751]  ? ret_from_fork+0x35/0x40
[17409.498494]  ? lnet_discovery_event_reply+0xb00/0xb00 [lnet]
[17409.499605]  lbug_with_loc.cold.8+0x18/0x18 [libcfs]
[17409.500571]  lnet_assert_handler_unused+0xa0/0xd0 [lnet]
[17409.501620]  lnet_peer_discovery+0x14fb/0x1730 [lnet]
[17409.502616]  ? finish_wait+0x80/0x80
[17409.503343]  ? lnet_peer_merge_data+0x1110/0x1110 [lnet]
[17409.504385]  kthread+0x134/0x150
[17409.505042]  ? set_kthread_struct+0x50/0x50
[17409.505862]  ret_from_fork+0x35/0x40

Comment by Andreas Dilger [ 30/Nov/23 ]

Hi Cyril, are you able to look into this issue? It seems that we are still hitting this quite a bit in testing.

Comment by Chris Horn [ 06/Dec/23 ]

The issue is a race between callers of lnet_ping_target_setup() and lnet_ping_target_fini(). I'm not sure how to resolve it yet.

I think there may be two similar bugs. I did not keep the notes that led me to this conclusion ^, but I recently hit the assert on the discovery shutdown path while running sanity-lnet test 500.

The discovery thread sends a PUSH to some peer:

00000400:00000200:0.0:1701811372.819250:0:30012:0:(peer.c:3958:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:0.0:1701811372.819251:0:30012:0:(peer.c:4070:lnet_peer_discovery()) peer 2601:8c1:c180:2000::cbdd@tcp(ffff975bb6c4aa00) state 0x151
00000400:00000010:0.0:1701811372.819252:0:30012:0:(lib-md.c:152:lnet_md_build()) slab-alloced 'lmd': 136 at ffff975bec10ed48.
...
00000400:00000200:0.0:1701811372.819277:0:30012:0:(net_fault.c:582:delay_rule_match()) Delay Rule 255.255.255.255@tcp->255.255.255.255@tcp: next delay: 1
00000400:00000200:0.0:1701811372.819279:0:30012:0:(peer.c:3892:lnet_peer_send_push()) peer 2601:8c1:c180:2000::cbdd@tcp
00000400:00000200:0.0:1701811372.819280:0:30012:0:(peer.c:4092:lnet_peer_discovery()) peer 2601:8c1:c180:2000::cbdd@tcp(ffff975bb6c4aa00) state 0x551 rc 0
...
00000400:00000200:0.0:1701811372.892780:0:30459:0:(net_fault.c:692:delayed_msg_process()) TRACE: msg ffff975bee9ba210 2601:8c1:c180:2000::cbdd@tcp -> 10.1.2.3@tcp : PUT

The PUSH send fails:

00000400:02020000:0.0:1701811372.892824:0:30074:0:(acceptor.c:130:lnet_connect_console_error()) 11e-e: Unexpected error -2 connecting to 10.1.2.3@tcp at host 10.1.2.3:988
00000400:00000200:0.0:1701811372.902763:0:30074:0:(router.c:1693:lnet_notify()) 2601:8c1:c180:2000::cbdd@tcp notifying 10.1.2.3@tcp: down
00000800:00000100:0.0:1701811372.902767:0:30074:0:(socklnd_cb.c:430:ksocknal_txlist_done()) Deleting packet type 1 len 72 0@<0:0>->10.1.2.3@tcp
00000800:00000010:0.0:1701811372.902768:0:30074:0:(socklnd_cb.c:110:ksocknal_free_tx()) kfreed 'tx': 248 at ffff975bb6816b00.
00000400:00000200:0.0:1701811372.902769:0:30074:0:(lib-msg.c:1046:lnet_is_health_check()) health check = 1, status = -2, hstatus = 5
00000400:00000200:0.0:1701811372.902772:0:30074:0:(lib-msg.c:819:lnet_health_check()) health check: 2601:8c1:c180:2000::cbdd@tcp->10.1.2.3@tcp: PUT: LOCAL_ERROR

The peer is placed back on the discovery request queue by the event handler, and then the discovery thread is stopped:

00000400:00000200:0.0:1701811372.902777:0:30074:0:(peer.c:3061:lnet_discovery_event_handler()) Received event: 5
00000400:00000200:0.0:1701811372.902778:0:30074:0:(peer.c:3013:lnet_discovery_event_send()) Push Send to 2601:8c1:c180:2000::cbdd@tcp: 1
00000400:00000200:0.0:1701811372.902779:0:30074:0:(peer.c:2385:lnet_peer_queue_for_discovery()) Queue peer 2601:8c1:c180:2000::cbdd@tcp: -114

00000400:00000200:0.0:1701811372.902814:0:30012:0:(peer.c:3958:lnet_peer_discovery_wait_for_work()) woken: -108
00000400:00000200:0.0:1701811372.902814:0:30012:0:(peer.c:4117:lnet_peer_discovery()) stopping
00000400:00000200:0.0:1701811372.902815:0:30012:0:(peer.c:2401:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 2601:8c1:c180:2000::cbdd@tcp
00000400:00040000:0.0:1701811372.902817:0:30012:0:(lib-md.c:281:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed:
00000400:00040000:0.0:1701811372.912358:0:30012:0:(lib-md.c:281:lnet_assert_handler_unused()) LBUG

Normally, the discovery thread would call lnet_peer_push_failed() to unlink the MD (it is not auto-unlinked because the PUT expects an ACK, so md_threshold is still > 0). But because the discovery thread is shutting down, this step is missed.
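
A fix could therefore explicitly unlink any in-flight discovery ping/push MDs on the shutdown path, so their references to the discovery handler are dropped before lnet_assert_handler_unused() runs. A minimal sketch of that shape (the helper name and omitted locking are assumptions; lp_ping_mdh/lp_push_mdh are the peer's existing MD handle fields):

    /* Sketch: cancel a peer's in-flight discovery ping/push on
     * shutdown.  Hypothetical helper; a real patch would be more
     * careful about snapshotting the handles under the peer lock. */
    static void lnet_peer_cancel_discovery(struct lnet_peer *lp)
    {
            /* LNetMDUnlink() just returns -ENOENT for a stale or
             * invalid handle, so calling it unconditionally is safe.
             * For the PUSH case this forces the unlink that
             * lnet_peer_push_failed() would normally perform. */
            LNetMDUnlink(lp->lp_ping_mdh);
            LNetMDUnlink(lp->lp_push_mdh);
    }

Calling something like this for each peer dequeued on the "stopping" path in lnet_peer_discovery() would close the window where the thread exits while an MD still points at lnet_discovery_event_handler(); the patch below takes this general approach.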

Comment by Gerrit Updater [ 06/Dec/23 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53356
Subject: LU-14810 lnet: Cancel discovery ping/push on shutdown
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 51edce5d56125ca03763415fa52ec94f621468bc

Comment by Chris Horn [ 06/Dec/23 ]

I pushed a patch to fix the discovery issue.

I think this is a recent example of the other bug I referenced: https://testing.whamcloud.com/test_sets/a5b72565-cf36-421a-be50-00d8b275d740

Comment by Gerrit Updater [ 20/Dec/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53356/
Subject: LU-14810 lnet: Cancel discovery ping/push on shutdown
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c3b9597742d5118a96f56129e7dd30d84468d2c8

Comment by Peter Jones [ 20/Dec/23 ]

Landed for 2.16

Comment by Chris Horn [ 20/Dec/23 ]

The patch that landed only fixes one of the two bugs that result in tripping the assert.

Reopening.

Comment by Andreas Dilger [ 05/Jan/24 ]

+1 on master: https://testing.whamcloud.com/sub_tests/121f699a-8b7c-40dc-be4e-99f534b874b4
