[LU-17313] sanity-lnet test_205/test_220: LNetError: (lib-md.c:281:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) Created: 24/Nov/23  Updated: 06/Feb/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14810 sanity-lnet test_212: lnet_assert_han... Reopened
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Arshad <arshad.hussain@aeoncomputing.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/1432600f-ba6e-45a0-a998-c2b871760b9b

test_205 failed with the following error:

trevis-33vm2 crashed during sanity-lnet test_205

Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/100460 - 4.18.0-477.27.1.el8_8.x86_64
servers: https://build.whamcloud.com/job/lustre-reviews/100460 - 4.18.0-477.27.1.el8_lustre.x86_64

Client Output:

 

=================================== 16:17:04 \(1700756224\)
[21692.122539] Lustre: DEBUG MARKER: == sanity-lnet test 204: Check no health or resends for single-rail local failures ========================================================== 16:17:04 (1700756224)
[21692.403638] LNet: 1467339:0:(lib-ptl.c:956:lnet_clear_lazy_portal()) Active lazy portal 0 on exit
[21692.405821] LNet: Removed LNI 10.1.2.3@tcp
[21692.413980] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
[21692.413980] lctl dl | grep ' ST ' || true
[21692.511140] Key type .llcrypt unregistered
[21692.512020] Key type ._llcrypt unregistered
[21692.800163] Key type ._llcrypt registered
[21692.801024] Key type .llcrypt registered
[21692.832028] libcfs: HW NUMA nodes: 1, HW CPU cores: 2, npartitions: 2
[21692.836702] alg: No test for adler32 (adler32-zlib)
[21693.611267] Lustre: DEBUG MARKER: /usr/sbin/lnetctl lnet configure --all
[21693.617623] LNet: Added LNI 10.240.39.5@tcp [8/256/0/180]
[21693.618922] LNet: Accept all, port 7988
[21694.553419] Lustre: DEBUG MARKER: /usr/sbin/lnetctl discover 10.240.38.123@tcp
[21694.846378] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp
[21694.850290] LNet: There was an unexpected network error while writing to 10.240.38.123: rc = -22
[21694.896561] LNet: 1467564:0:(api-ni.c:358:recovery_interval_set()) 'lnet_recovery_interval' has been deprecated
[21694.911244] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all
[21694.917894] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all
[21694.971823] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp
[21695.052546] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all
[21695.059301] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all
[21695.111848] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp
[21695.199325] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all
[21695.206117] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all
[21695.258286] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp
[21695.382743] LNet: There was an unexpected network error while writing to 10.240.38.123: rc = -22
[21695.384355] LNet: Skipped 2 previous similar messages
[21695.428743] LNet: 1467750:0:(api-ni.c:358:recovery_interval_set()) 'lnet_recovery_interval' has been deprecated
[21695.430652] LNet: 1467750:0:(api-ni.c:358:recovery_interval_set()) Skipped 2 previous similar messages
[21695.445043] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all
[21695.451861] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all
[21695.506046] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp
[21695.607926] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all
[21695.614630] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all
[21695.667799] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp
[21695.808409] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all
[21695.815303] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all
[21695.828815] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
[21695.828815] lctl dl | grep ' ST ' || true
[21695.860860] LNet: 1467904:0:(lib-ptl.c:956:lnet_clear_lazy_portal()) Active lazy portal 0 on exit
[21695.924073] LNetError: 1467470:0:(socklnd.c:1261:ksocknal_create_conn()) Not creating conn 12345-10.240.38.123@tcp type 2: peer_ni/conn_cb removed
[21695.926432] LNetError: 11e-e: Unexpected error -116 connecting to 10.240.38.123@tcp at host 10.240.38.123:7988
[21696.874054] LNet: Removed LNI 10.240.39.5@tcp
[21696.938956] Key type .llcrypt unregistered
[21696.939795] Key type ._llcrypt unregistered
[21705.904334] Key type ._llcrypt registered
[21705.905266] Key type .llcrypt registered
[21706.693491] Key type .llcrypt unregistered
[21706.694372] Key type ._llcrypt unregistered
[21706.974304] Key type ._llcrypt registered
[21706.978361] Key type .llcrypt registered
[21707.011233] libcfs: HW NUMA nodes: 1, HW CPU cores: 2, npartitions: 2
[21707.015533] alg: No test for adler32 (adler32-zlib)
[21707.790659] Lustre: DEBUG MARKER: /usr/sbin/lnetctl lnet configure --all
[21707.797247] LNet: Added LNI 10.240.39.5@tcp [8/256/0/180]
[21707.798822] LNet: Accept all, port 7988
[21708.368775] Lustre: DEBUG MARKER: /usr/sbin/lnetctl discover 10.240.38.123@tcp
[21715.241659] Lustre: DEBUG MARKER: /usr/sbin/lnetctl lnet configure
[21715.248115] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net add --net tcp1 --if eth0
[21715.252231] LNet: Added LNI 10.240.39.5@tcp1 [8/256/0/180]
[21715.441585] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp
[21715.445581] LNet: There was an unexpected network error while writing to 10.240.38.123: rc = -22
[21715.447305] LNet: 1 local NIs in recovery (showing 1): 10.240.39.5@tcp
[21715.637913] LNet: 1469390:0:(api-ni.c:358:recovery_interval_set()) 'lnet_recovery_interval' has been deprecated
[21715.652679] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all
[21715.659414] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all
[21715.729247] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp
[21715.949323] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all
[21715.956120] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all
[21716.026111] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp
[21716.029989] LNet: There was an unexpected network error while writing to 10.240.38.123: rc = -22
[21716.031631] LNet: Skipped 5 previous similar messages
[21716.231601] LNet: 1469526:0:(api-ni.c:358:recovery_interval_set()) 'lnet_recovery_interval' has been deprecated
[21716.233446] LNet: 1469526:0:(api-ni.c:358:recovery_interval_set()) Skipped 1 previous similar message
[21716.248327] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all
[21716.255340] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all
[21716.326242] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp
[21716.628909] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all
[21716.636214] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all
[21716.707624] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp
[21717.005150] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all
[21717.011880] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all
[21717.079281] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp
[21717.083320] LNet: There was an unexpected network error while writing to 10.240.38.123: rc = -22
[21717.084987] LNet: Skipped 9 previous similar messages
[21717.142283] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all
[21717.149327] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all
[21720.087147] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
[21720.087147] lctl dl | grep ' ST ' || true
[21720.119727] LNetError: 1469074:0:(lib-md.c:281:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed:
[21720.121839] LNetError: 1469074:0:(lib-md.c:281:lnet_assert_handler_unused()) LBUG
[21720.123252] Pid: 1469074, comm: lnet_discovery 4.18.0-477.27.1.el8_8.x86_64 #1 SMP Thu Aug 31 10:29:22 EDT 2023
[21720.125123] Call Trace TBD:
[21720.125763] [<0>] libcfs_call_trace+0x63/0x90 [libcfs]
[21720.126772] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
[21720.127754] [<0>] lnet_assert_handler_unused+0xa0/0xd0 [lnet]
[21720.128903] [<0>] lnet_peer_discovery+0x14fb/0x1730 [lnet]
[21720.129985] [<0>] kthread+0x134/0x150
[21720.130761] [<0>] ret_from_fork+0x35/0x40
[21720.131608] Kernel panic - not syncing: LBUG
[21720.132444] CPU: 0 PID: 1469074 Comm: lnet_discovery Kdump: loaded Tainted: G        W  OE    --------- -  - 4.18.0-477.27.1.el8_8.x86_64 #1
[21720.134714] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[21720.134720] LNetError: 1469071:0:(lib-move.c:4933:lnet_parse()) 10.240.38.123@tcp, src 10.240.38.123@tcp: Dropping ACK (error -108 looking up sender)
[21720.135832] Call Trace:
[21720.135865]  dump_stack+0x41/0x60
[21720.139358]  panic+0xe7/0x2ac
[21720.139973]  ? ret_from_fork+0x35/0x40
[21720.140690]  ? lnet_discovery_event_reply+0xb00/0xb00 [lnet]
[21720.141763]  lbug_with_loc.cold.8+0x18/0x18 [libcfs]
[21720.142713]  lnet_assert_handler_unused+0xa0/0xd0 [lnet]
[21720.143717]  lnet_peer_discovery+0x14fb/0x1730 [lnet]
[21720.144692]  ? finish_wait+0x80/0x80
[21720.145393]  ? lnet_peer_merge_data+0x1110/0x1110 [lnet]
[21720.146400]  kthread+0x134/0x150
[21720.147027]  ? set_kthread_struct+0x50/0x50
[21720.147820]  ret_from_fork+0x35/0x40

 

sanity-lnet test 204/205 run output:

 

== sanity-lnet test 204: Check no health or resends for single-rail local failures ========================================================== 16:17:04 (1700756224)
Cleaning up LNet
CMD: trevis-33vm2.trevis.whamcloud.com lsmod | grep lnet > /dev/null &&
lctl dl | grep ' ST ' || true
Writer error: failed to resolve Netlink family id
modules unloaded.
Loading modules from /usr/lib64/lustre
detected 2 online CPUs by sysfs
Force libcfs to create 2 CPU partitions
../libcfs/libcfs/libcfs options: 'cpu_npartitions=2'
../lnet/lnet/lnet options: 'accept=all'
/usr/sbin/lnetctl lnet configure --all
CMD: trevis-26vm4 /usr/sbin/lctl list_nids
/usr/sbin/lnetctl discover 10.240.38.123@tcp
discover:
- primary nid: 10.240.38.123@tcp
  Multi-Rail: true
  peer_ni:
  - nid: 10.240.38.123@tcp
net:
-     net type: lo
      local NI(s):
      -     nid: 0@lo
            status: up
-     net type: tcp
      local NI(s):
      -     nid: 10.240.39.5@tcp
            status: up
            interfaces:
                  0: eth0
-     primary nid: 10.240.38.123@tcp
      -     nid: 10.240.38.123@tcp
            health stats:
                  health value: 1000
debug=+net
Simulate local_interrupt
Added drop rule 10.240.39.5@tcp->10.240.38.123@tcp (1/1)
Added drop rule 10.240.39.5@tcp->10.240.39.5@tcp (1/1)
/usr/sbin/lnetctl ping 10.240.38.123@tcp
manage:
    - ping:
          errno: -1
          descr: failed to ping 10.240.38.123@tcp: Input/output error
                 
Pre resends: 0
Post resends: 0
Resends delta: 0
Pre local health: 1000
Post local health: 1000
Pre remote health: 1000
Post remote health: 1000
/usr/sbin/lnetctl peer set --health 1000 --all
/usr/sbin/lnetctl net set --health 1000 --all
Removed 2 drop rules
Check that no resends took place
Check that local NI health is unchanged
Simulate local_dropped
Added drop rule 10.240.39.5@tcp->10.240.38.123@tcp (1/1)
Added drop rule 10.240.39.5@tcp->10.240.39.5@tcp (1/1)
/usr/sbin/lnetctl ping 10.240.38.123@tcp
manage:
    - ping:
          errno: -1
          descr: failed to ping 10.240.38.123@tcp: Input/output error
                 
Pre resends: 0
Post resends: 0
Resends delta: 0
Pre local health: 2000
Post local health: 2000
Pre remote health: 1000
Post remote health: 1000
/usr/sbin/lnetctl peer set --health 1000 --all
/usr/sbin/lnetctl net set --health 1000 --all
Removed 2 drop rules
Check that no resends took place
Check that local NI health is unchanged
Simulate local_aborted
Added drop rule 10.240.39.5@tcp->10.240.38.123@tcp (1/1)
Added drop rule 10.240.39.5@tcp->10.240.39.5@tcp (1/1)
/usr/sbin/lnetctl ping 10.240.38.123@tcp
manage:
    - ping:
          errno: -1
          descr: failed to ping 10.240.38.123@tcp: Input/output error
                 
Pre resends: 0
Post resends: 0
Resends delta: 0
Pre local health: 2000
Post local health: 2000
Pre remote health: 1000
Post remote health: 1000
/usr/sbin/lnetctl peer set --health 1000 --all
/usr/sbin/lnetctl net set --health 1000 --all
Removed 2 drop rules
Check that no resends took place
Check that local NI health is unchanged
Simulate local_no_route
Added drop rule 10.240.39.5@tcp->10.240.38.123@tcp (1/1)
Added drop rule 10.240.39.5@tcp->10.240.39.5@tcp (1/1)
/usr/sbin/lnetctl ping 10.240.38.123@tcp
manage:
    - ping:
          errno: -1
          descr: failed to ping 10.240.38.123@tcp: Input/output error
                 
Pre resends: 0
Post resends: 0
Resends delta: 0
Pre local health: 2000
Post local health: 2000
Pre remote health: 1000
Post remote health: 1000
/usr/sbin/lnetctl peer set --health 1000 --all
/usr/sbin/lnetctl net set --health 1000 --all
Removed 2 drop rules
Check that no resends took place
Check that local NI health is unchanged
Simulate local_timeout
Added drop rule 10.240.39.5@tcp->10.240.38.123@tcp (1/1)
Added drop rule 10.240.39.5@tcp->10.240.39.5@tcp (1/1)
/usr/sbin/lnetctl ping 10.240.38.123@tcp
manage:
    - ping:
          errno: -1
          descr: failed to ping 10.240.38.123@tcp: Input/output error
                 
Pre resends: 0
Post resends: 0
Resends delta: 0
Pre local health: 2000
Post local health: 2000
Pre remote health: 1000
Post remote health: 1000
/usr/sbin/lnetctl peer set --health 1000 --all
/usr/sbin/lnetctl net set --health 1000 --all
Removed 2 drop rules
Check that no resends took place
Check that local NI health is unchanged
Simulate local_error
Added drop rule 10.240.39.5@tcp->10.240.38.123@tcp (1/1)
Added drop rule 10.240.39.5@tcp->10.240.39.5@tcp (1/1)
/usr/sbin/lnetctl ping 10.240.38.123@tcp
manage:
    - ping:
          errno: -1
          descr: failed to ping 10.240.38.123@tcp: Input/output error
                 
Pre resends: 0
Post resends: 0
Resends delta: 0
Pre local health: 2000
Post local health: 2000
Pre remote health: 1000
Post remote health: 1000
/usr/sbin/lnetctl peer set --health 1000 --all
/usr/sbin/lnetctl net set --health 1000 --all
Removed 2 drop rules
Check that no resends took place
Check that local NI health is unchanged
CMD: trevis-33vm2.trevis.whamcloud.com lsmod | grep lnet > /dev/null &&
lctl dl | grep ' ST ' || true
Writer error: failed to resolve Netlink family id
modules unloaded.
pdsh@trevis-33vm2: trevis-33vm2: ssh exited with exit code 2
pdsh@trevis-33vm2: trevis-33vm2: ssh exited with exit code 2
PASS 204 (10s)
CMD: trevis-26vm4 sysctl -wq kernel/kptr_restrict=1
CMD: trevis-33vm2.trevis.whamcloud.com sysctl -wq kernel/kptr_restrict=1
CMD: trevis-33vm3 sysctl -wq kernel/kptr_restrict=1
CMD: trevis-67vm7 sysctl -wq kernel/kptr_restrict=1
CMD: trevis-79vm7 sysctl -wq kernel/kptr_restrict=1
CMD: trevis-26vm4 sysctl --values kernel/kptr_restrict
CMD: trevis-26vm4 sysctl -wq kernel/kptr_restrict=1
CMD: trevis-33vm2.trevis.whamcloud.com sysctl --values kernel/kptr_restrict
CMD: trevis-33vm2.trevis.whamcloud.com sysctl -wq kernel/kptr_restrict=1
CMD: trevis-33vm3 sysctl --values kernel/kptr_restrict
CMD: trevis-33vm3 sysctl -wq kernel/kptr_restrict=1
CMD: trevis-67vm7 sysctl --values kernel/kptr_restrict
CMD: trevis-67vm7 sysctl -wq kernel/kptr_restrict=1
CMD: trevis-79vm7 sysctl --values kernel/kptr_restrict
CMD: trevis-79vm7 sysctl -wq kernel/kptr_restrict=1
== sanity-lnet test 205: Check health and resends for multi-rail local failures ========================================================== 16:17:18 (1700756238)
Cleaning up LNet
CMD: trevis-33vm2.trevis.whamcloud.com lsmod | grep lnet > /dev/null &&
lctl dl | grep ' ST ' || true

 

Nothing alarming or useful was found in the MDS/OST output.

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-lnet test_205 - trevis-33vm2 crashed during sanity-lnet test_205



 Comments   
Comment by Andreas Dilger [ 05/Jan/24 ]

+1 on master: https://testing.whamcloud.com/sub_tests/121f699a-8b7c-40dc-be4e-99f534b874b4

Comment by Arshad Hussain [ 16/Jan/24 ]

+1 on (master) https://testing.whamcloud.com/test_sets/3f6810cc-7c8e-4ee4-9397-e09bad232d54

Client:
Lustre Version: 2.15.60.13
Kernel version: 5.14.0-284.30.1.el9_2.x86_64

Server:
Lustre Version: 2.15.60.13
Kernel version: 4.18.0-477.27.1.el8_lustre.x86_64

This time it was seen in sanity-lnet test_220 (panic on the client):

[21725.920801] Lustre: DEBUG MARKER: == sanity-lnet test 220: Add routes w/default options - check aliveness ========================================================== 16:47:56 (1705337276)
[21726.373169] Lustre: DEBUG MARKER: /usr/sbin/lustre_rmmod
[21726.999708] Key type lgssc unregistered
[21727.051580] LNetError: 6094:0:(lib-md.c:281:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed: 
[21727.053408] LNetError: 6094:0:(lib-md.c:281:lnet_assert_handler_unused()) LBUG
[21727.054422] Pid: 6094, comm: lnet_discovery 5.14.0-284.30.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Aug 25 09:13:12 EDT 2023
[21727.055793] Call Trace TBD:
[21727.056218] Kernel panic - not syncing: LBUG
[21727.056813] CPU: 1 PID: 6094 Comm: lnet_discovery Kdump: loaded Tainted: G           OE    --------  ---  5.14.0-284.30.1.el9_2.x86_64 #1
[21727.058258] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[21727.059002] Call Trace:
[21727.059418]  <TASK>
[21727.059756]  dump_stack_lvl+0x34/0x48
[21727.060365]  panic+0xf4/0x2c6
[21727.060810]  ? lnet_discovery_event_reply+0xbc0/0xbc0 [lnet]
[21727.062285]  lbug_with_loc.cold+0x18/0x18 [libcfs]
[21727.063335]  lnet_assert_handler_unused+0x9c/0xd0 [lnet]
[21727.064169]  lnet_peer_discovery+0x997/0xaf0 [lnet]
[21727.064850]  ? cpuacct_percpu_seq_show+0x10/0x10
[21727.065492]  ? lnet_peer_data_present+0x580/0x580 [lnet]
[21727.066187]  kthread+0xd9/0x100
[21727.066651]  ? kthread_complete_and_exit+0x20/0x20
[21727.067276]  ret_from_fork+0x22/0x30
[21727.067792]  </TASK>
Comment by Andreas Dilger [ 01/Feb/24 ]

+9 crashes of sanity-lnet test_220 this week:
https://testing.whamcloud.com/search?status%5B%5D=CRASH&test_set_script_id=a2b1c4b2-b449-11e9-b88c-52540065bddc&sub_test_script_id=03c13a24-99ad-4246-9749-8279e6e898a7&start_date=2024-01-26&end_date=2024-02-01&source=sub_tests#redirect

Comment by Arshad Hussain [ 06/Feb/24 ]

+1 on master (for aarch64): https://testing.whamcloud.com/test_sessions/ab07dba8-8b72-4468-9c0b-65182aafc48a

On MDS

[25868.167715] Lustre: DEBUG MARKER: == sanity-lnet test 220: Add routes w/default options - check aliveness ========================================================== 21:21:22 (1707168082)
[25868.709079] Lustre: DEBUG MARKER: /usr/sbin/lustre_rmmod
[25869.289481] Key type lgssc unregistered
[25869.351068] LNetError: 11254:0:(lib-md.c:281:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed:
[25869.352915] LNetError: 11254:0:(lib-md.c:281:lnet_assert_handler_unused()) LBUG
[25869.354122] CPU: 0 PID: 11254 Comm: lnet_discovery Kdump: loaded Tainted: G        W  OE    --------- -  - 4.18.0-477.27.1.el8_lustre.x86_64 #1
[25869.356169] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[25869.357115] Call Trace:
[25869.357602]  ? lnet_discovery_event_reply+0xb00/0xb00 [lnet]
[25869.358667]  dump_stack+0x41/0x60
[25869.359294]  lbug_with_loc.cold.8+0x5/0x43 [libcfs]
[25869.360176]  lnet_assert_handler_unused+0xa0/0xd0 [lnet]
[25869.361101]  lnet_peer_discovery+0x13e6/0x1620 [lnet]
[25869.361978]  ? finish_wait+0x80/0x80
[25869.362617]  ? lnet_peer_merge_data+0x1110/0x1110 [lnet]
[25869.363522]  kthread+0x134/0x150
[25869.364110]  ? set_kthread_struct+0x50/0x50
[25869.364827]  ret_from_fork+0x35/0x40
[25869.365473] Kernel panic - not syncing: LBUG 
[25869.366204] CPU: 0 PID: 11254 Comm: lnet_discovery Kdump: loaded Tainted: G        W  OE    --------- -  - 4.18.0-477.27.1.el8_lustre.x86_64 #1
[25869.368229] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[25869.369173] Call Trace:
[25869.369619]  dump_stack+0x41/0x60
[25869.370194]  panic+0xe7/0x2ac
[25869.370741]  ? __switch_to_asm+0x51/0x80
[25869.371412]  ? __switch_to_asm+0x51/0x80
[25869.372087]  ? lnet_discovery_event_reply+0xb00/0xb00 [lnet]
[25869.373044]  lbug_with_loc.cold.8+0x1a/0x43 [libcfs]
[25869.373882]  lnet_assert_handler_unused+0xa0/0xd0 [lnet]
[25869.374787]  lnet_peer_discovery+0x13e6/0x1620 [lnet]
[25869.375658]  ? finish_wait+0x80/0x80
[25869.376281]  ? lnet_peer_merge_data+0x1110/0x1110 [lnet]
[25869.377185]  kthread+0x134/0x150
[25869.377750]  ? set_kthread_struct+0x50/0x50
[25869.378457]  ret_from_fork+0x35/0x40
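
Triage note: every console log in this ticket shows the same signature, an `lnet_assert_handler_unused()` assertion raised from the `lnet_discovery` thread shortly after a module cleanup/unload command. When checking further crash dumps for the same signature, the most recent test banner before the assertion can be recovered from the `DEBUG MARKER: == <suite> test <N>:` lines. The following is a minimal sketch in Python, not part of the Lustre test framework; the function name `find_lbug_tests` and both regular expressions are assumptions modelled only on the log lines quoted in this ticket.

```python
import re

# Test banner as it appears in the console logs quoted in this ticket
# (assumed format: "DEBUG MARKER: == <suite> test <id>: ...").
MARKER_RE = re.compile(r"DEBUG MARKER: == (\S+) test (\w+):")

# The assertion signature reported in this ticket.
LBUG_RE = re.compile(
    r"lnet_assert_handler_unused\(\)\) ASSERTION\( md->md_handler != handler \)"
)


def find_lbug_tests(console_text):
    """Return one (suite, test) tuple per assertion hit.

    Each hit is attributed to the last test banner seen before it,
    or None if no banner preceded the assertion.
    """
    hits = []
    current = None
    for raw in console_text.splitlines():
        # Drop literal "^M" carriage-return residue if the log contains it.
        line = raw.replace("^M", "").rstrip()
        marker = MARKER_RE.search(line)
        if marker:
            current = (marker.group(1), marker.group(2))
        elif LBUG_RE.search(line):
            hits.append(current)
    return hits
```

The result is only a hint. In the description's client log the last banner on the console is test 204, while Maloo attributes the crash to test_205, because the assertion fired during the "Cleaning up LNet" step at the start of test 205.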
Generated at Sat Feb 10 03:34:24 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.