[LU-14810] sanity-lnet test_212: lnet_assert_handler_unused() ASSERTION(md->md_handler != handler) failed Created: 03/Jul/21 Updated: 02/Feb/24 |
|
| Status: | Reopened |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Cyril Bordage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com> This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/84ed5e32-bda8-4880-a254-dbd96cb6a478 test_212 failed with the following error: trevis-79vm3 crashed during sanity-lnet test_212 LNetError: 944562:0:(lib-md.c:288:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed: Pid: 944562, comm: lnet_discovery 4.18.0-240.22.1.el8_3.aarch64 #1 SMP Thu Apr 8 19:01:45 UTC 2021 Call Trace: libcfs_call_trace+0xb8/0x118 [libcfs] lbug_with_loc+0x60/0xa0 [libcfs] lnet_assert_handler_unused+0xb8/0xe0 [lnet] lnet_peer_discovery+0x16c4/0x1cb0 [lnet] kthread+0x130/0x138 This test was recently landed via patch https://review.whamcloud.com/43418 " VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by Andreas Dilger [ 03/Jul/21 ] |
|
Chris, could you please take a look at this. I just hit this failure in a functionally no-op patch, and since this test was just added by your patch it is possibly related to that. |
| Comment by Chris Horn [ 12/Jul/21 ] |
|
adilger is there a vmcore available? |
| Comment by Andreas Dilger [ 13/Jul/21 ] |
|
Chris, we used to collect crash dumps, but I don't see them linked from the Maloo test page for the crash. There is a console log attached for the failed server. It should also be noted that this failure is with review-ldiskfs-arm (i.e. ARM client), so that may be of interest. It is running in little-endian mode, but will have 64KB PAGE_SIZE. |
| Comment by Li Xi [ 04/Jul/22 ] |
|
Might got another one on master, but no log saved: https://testing.whamcloud.com/test_sets/83dcfa5f-0848-43aa-8278-a7eb263bdd11 |
| Comment by Andreas Dilger [ 27/Feb/23 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/b44670dd-ae14-4cdc-afdf-083de5b4ebc5 |
| Comment by Andreas Dilger [ 29/Mar/23 ] |
|
+1 on master + Ubuntu 22.04: |
| Comment by Chris Horn [ 03/Apr/23 ] |
|
The issue is a race between callers of lnet_ping_target_setup() and lnet_ping_target_fini(). I'm not sure how to resolve it yet. |
| Comment by Alex Zhuravlev [ 20/Nov/23 ] |
|
https://testing.whamcloud.com/test_sets/bf0e171e-4549-4d8e-b437-1ac07124c4a0 on master |
| Comment by Etienne Aujames [ 27/Nov/23 ] |
|
Same stack on "sanity-lnet test_220": [17408.323111] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-lnet test 220: Add routes w\/default options - check aliveness ========================================================== 22:18:00 \(1700864280\) [17408.530260] Lustre: DEBUG MARKER: == sanity-lnet test 220: Add routes w/default options - check aliveness ========================================================== 22:18:00 (1700864280) [17408.806233] Lustre: DEBUG MARKER: /usr/sbin/lustre_rmmod [17409.412889] Key type lgssc unregistered [17409.479271] LNetError: 9972:0:(lib-md.c:281:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed: [17409.481545] LNetError: 9972:0:(lib-md.c:281:lnet_assert_handler_unused()) LBUG [17409.482950] Pid: 9972, comm: lnet_discovery 4.18.0-477.27.1.el8_lustre.x86_64 #1 SMP Wed Nov 8 23:01:53 UTC 2023 [17409.484866] Call Trace TBD: [17409.485593] [<0>] libcfs_call_trace+0x63/0x90 [libcfs] [17409.486676] [<0>] lbug_with_loc+0x3f/0x70 [libcfs] [17409.487633] [<0>] lnet_assert_handler_unused+0xa0/0xd0 [lnet] [17409.488896] [<0>] lnet_peer_discovery+0x14fb/0x1730 [lnet] [17409.489990] [<0>] kthread+0x134/0x150 [17409.490785] [<0>] ret_from_fork+0x35/0x40 [17409.491622] Kernel panic - not syncing: LBUG [17409.492461] CPU: 1 PID: 9972 Comm: lnet_discovery Kdump: loaded Tainted: G W OE --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1 [17409.494792] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [17409.495890] Call Trace: [17409.496425] dump_stack+0x41/0x60 [17409.497114] panic+0xe7/0x2ac [17409.497751] ? ret_from_fork+0x35/0x40 [17409.498494] ? lnet_discovery_event_reply+0xb00/0xb00 [lnet] [17409.499605] lbug_with_loc.cold.8+0x18/0x18 [libcfs] [17409.500571] lnet_assert_handler_unused+0xa0/0xd0 [lnet] [17409.501620] lnet_peer_discovery+0x14fb/0x1730 [lnet] [17409.502616] ? finish_wait+0x80/0x80 [17409.503343] ? lnet_peer_merge_data+0x1110/0x1110 [lnet] [17409.504385] kthread+0x134/0x150 [17409.505042] ? set_kthread_struct+0x50/0x50 [17409.505862] ret_from_fork+0x35/0x40 |
| Comment by Andreas Dilger [ 30/Nov/23 ] |
|
Hi Cyril, are you able to look into this issue? It seems that we are still hitting this quite a bit in testing. |
| Comment by Chris Horn [ 06/Dec/23 ] |
I think there may be two similar bugs. I did not keep my notes that lead me to this conclusion ^, but I recently hit the assert on the discovery shutdown path while running sanity-lnet test 500. Discovery thread sends PUSH to some peer: 00000400:00000200:0.0:1701811372.819250:0:30012:0:(peer.c:3958:lnet_peer_discovery_wait_for_work()) woken: 0 00000400:00000200:0.0:1701811372.819251:0:30012:0:(peer.c:4070:lnet_peer_discovery()) peer 2601:8c1:c180:2000::cbdd@tcp(ffff975bb6c4aa00) state 0x151 00000400:00000010:0.0:1701811372.819252:0:30012:0:(lib-md.c:152:lnet_md_build()) slab-alloced 'lmd': 136 at ffff975bec10ed48. ... 00000400:00000200:0.0:1701811372.819277:0:30012:0:(net_fault.c:582:delay_rule_match()) Delay Rule 255.255.255.255@tcp->255.255.255.255@tcp: next delay: 1 00000400:00000200:0.0:1701811372.819279:0:30012:0:(peer.c:3892:lnet_peer_send_push()) peer 2601:8c1:c180:2000::cbdd@tcp 00000400:00000200:0.0:1701811372.819280:0:30012:0:(peer.c:4092:lnet_peer_discovery()) peer 2601:8c1:c180:2000::cbdd@tcp(ffff975bb6c4aa00) state 0x551 rc 0 ... 00000400:00000200:0.0:1701811372.892780:0:30459:0:(net_fault.c:692:delayed_msg_process()) TRACE: msg ffff975bee9ba210 2601:8c1:c180:2000::cbdd@tcp -> 10.1.2.3@tcp : PUT The PUSH send fails: 00000400:02020000:0.0:1701811372.892824:0:30074:0:(acceptor.c:130:lnet_connect_console_error()) 11e-e: Unexpected error -2 connecting to 10.1.2.3@tcp at host 10.1.2.3:988 00000400:00000200:0.0:1701811372.902763:0:30074:0:(router.c:1693:lnet_notify()) 2601:8c1:c180:2000::cbdd@tcp notifying 10.1.2.3@tcp: down 00000800:00000100:0.0:1701811372.902767:0:30074:0:(socklnd_cb.c:430:ksocknal_txlist_done()) Deleting packet type 1 len 72 0@<0:0>->10.1.2.3@tcp 00000800:00000010:0.0:1701811372.902768:0:30074:0:(socklnd_cb.c:110:ksocknal_free_tx()) kfreed 'tx': 248 at ffff975bb6816b00. 00000400:00000200:0.0:1701811372.902769:0:30074:0:(lib-msg.c:1046:lnet_is_health_check()) health check = 1, status = -2, hstatus = 5 00000400:00000200:0.0:1701811372.902772:0:30074:0:(lib-msg.c:819:lnet_health_check()) health check: 2601:8c1:c180:2000::cbdd@tcp->10.1.2.3@tcp: PUT: LOCAL_ERROR The peer is placed back on discovery request queue by the event handler, and then discovery thread is stopped: 00000400:00000200:0.0:1701811372.902777:0:30074:0:(peer.c:3061:lnet_discovery_event_handler()) Received event: 5 00000400:00000200:0.0:1701811372.902778:0:30074:0:(peer.c:3013:lnet_discovery_event_send()) Push Send to 2601:8c1:c180:2000::cbdd@tcp: 1 00000400:00000200:0.0:1701811372.902779:0:30074:0:(peer.c:2385:lnet_peer_queue_for_discovery()) Queue peer 2601:8c1:c180:2000::cbdd@tcp: -114 00000400:00000200:0.0:1701811372.902814:0:30012:0:(peer.c:3958:lnet_peer_discovery_wait_for_work()) woken: -108 00000400:00000200:0.0:1701811372.902814:0:30012:0:(peer.c:4117:lnet_peer_discovery()) stopping 00000400:00000200:0.0:1701811372.902815:0:30012:0:(peer.c:2401:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 2601:8c1:c180:2000::cbdd@tcp 00000400:00040000:0.0:1701811372.902817:0:30012:0:(lib-md.c:281:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed: 00000400:00040000:0.0:1701811372.912358:0:30012:0:(lib-md.c:281:lnet_assert_handler_unused()) LBUG Normally, the discovery thread would call lnet_peer_push_failed() to unlink the MD (it is not auto-unlinked because the PUT expects an ACK, so md_threshold is still > 0). But because discovery thread is shutdown this is missed. |
| Comment by Gerrit Updater [ 06/Dec/23 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53356 |
| Comment by Chris Horn [ 06/Dec/23 ] |
|
I pushed a patch to fix the discovery issue. I think this is a recent example of the other bug I referenced - https://testing.whamcloud.com/test_sets/a5b72565-cf36-421a-be50-00d8b275d740 |
| Comment by Gerrit Updater [ 20/Dec/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53356/ |
| Comment by Peter Jones [ 20/Dec/23 ] |
|
Landed for 2.16 |
| Comment by Chris Horn [ 20/Dec/23 ] |
|
The patch that landed only fixes one of the two bugs that result in tripping the assert. Reopening. |
| Comment by Andreas Dilger [ 05/Jan/24 ] |
|
+1 on master: https://testing.whamcloud.com/sub_tests/121f699a-8b7c-40dc-be4e-99f534b874b4 |