Details
-
Bug
-
Resolution: Duplicate
-
Major
-
None
-
Lustre 2.16.0
-
None
-
3
-
9223372036854775807
Description
This issue was created by maloo for Arshad <arshad.hussain@aeoncomputing.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/1432600f-ba6e-45a0-a998-c2b871760b9b
test_205 failed with the following error:
trevis-33vm2 crashed during sanity-lnet test_205
Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/100460 - 4.18.0-477.27.1.el8_8.x86_64
servers: https://build.whamcloud.com/job/lustre-reviews/100460 - 4.18.0-477.27.1.el8_lustre.x86_64
Client Output:
=================================== 16:17:04 \(1700756224\)^M [21692.122539] Lustre: DEBUG MARKER: == sanity-lnet test 204: Check no health or resends for single-rail local failures ========================================================== 16:17:04 (1700756224)^M [21692.403638] LNet: 1467339:0:(lib-ptl.c:956:lnet_clear_lazy_portal()) Active lazy portal 0 on exit^M [21692.405821] LNet: Removed LNI 10.1.2.3@tcp^M [21692.413980] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&^M [21692.413980] lctl dl | grep ' ST ' || true^M [21692.511140] Key type .llcrypt unregistered^M [21692.512020] Key type ._llcrypt unregistered^M [21692.800163] Key type ._llcrypt registered^M [21692.801024] Key type .llcrypt registered^M [21692.832028] libcfs: HW NUMA nodes: 1, HW CPU cores: 2, npartitions: 2^M [21692.836702] alg: No test for adler32 (adler32-zlib)^M [21693.611267] Lustre: DEBUG MARKER: /usr/sbin/lnetctl lnet configure --all^M [21693.617623] LNet: Added LNI 10.240.39.5@tcp [8/256/0/180]^M [21693.618922] LNet: Accept all, port 7988^M [21694.553419] Lustre: DEBUG MARKER: /usr/sbin/lnetctl discover 10.240.38.123@tcp^M [21694.846378] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M [21694.850290] LNet: There was an unexpected network error while writing to 10.240.38.123: rc = -22^M [21694.896561] LNet: 1467564:0:(api-ni.c:358:recovery_interval_set()) 'lnet_recovery_interval' has been deprecated^M [21694.911244] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M [21694.917894] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M [21694.971823] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M [21695.052546] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M [21695.059301] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M [21695.111848] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M [21695.199325] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M [21695.206117] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M [21695.258286] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M [21695.382743] LNet: There was an unexpected network error while writing to 10.240.38.123: rc = -22^M [21695.384355] LNet: Skipped 2 previous similar messages^M [21695.428743] LNet: 1467750:0:(api-ni.c:358:recovery_interval_set()) 'lnet_recovery_interval' has been deprecated^M [21695.430652] LNet: 1467750:0:(api-ni.c:358:recovery_interval_set()) Skipped 2 previous similar messages^M [21695.445043] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M [21695.451861] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M [21695.506046] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M [21695.607926] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M [21695.614630] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M [21695.667799] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M [21695.808409] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M [21695.815303] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M [21695.828815] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&^M [21695.828815] lctl dl | grep ' ST ' || true^M [21695.860860] LNet: 1467904:0:(lib-ptl.c:956:lnet_clear_lazy_portal()) Active lazy portal 0 on exit^M [21695.924073] LNetError: 1467470:0:(socklnd.c:1261:ksocknal_create_conn()) Not creating conn 12345-10.240.38.123@tcp type 2: peer_ni/conn_cb removed^M [21695.926432] LNetError: 11e-e: Unexpected error -116 connecting to 10.240.38.123@tcp at host 10.240.38.123:7988^M [21696.874054] LNet: Removed LNI 10.240.39.5@tcp^M [21696.938956] Key type .llcrypt unregistered^M [21696.939795] Key type ._llcrypt unregistered^M [21705.904334] Key type ._llcrypt registered^M [21705.905266] Key type .llcrypt registered^M [21706.693491] Key type .llcrypt unregistered^M [21706.694372] Key type ._llcrypt unregistered^M [21706.974304] Key type ._llcrypt registered^M [21706.978361] Key type .llcrypt registered^M [21707.011233] libcfs: HW NUMA nodes: 1, HW CPU cores: 2, npartitions: 2^M [21707.015533] alg: No test for adler32 (adler32-zlib)^M [21707.790659] Lustre: DEBUG MARKER: /usr/sbin/lnetctl lnet configure --all^M [21707.797247] LNet: Added LNI 10.240.39.5@tcp [8/256/0/180]^M [21707.798822] LNet: Accept all, port 7988^M [21708.368775] Lustre: DEBUG MARKER: /usr/sbin/lnetctl discover 10.240.38.123@tcp^M [21715.241659] Lustre: DEBUG MARKER: /usr/sbin/lnetctl lnet configure^M [21715.248115] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net add --net tcp1 --if eth0^M [21715.252231] LNet: Added LNI 10.240.39.5@tcp1 [8/256/0/180]^M [21715.441585] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M [21715.445581] LNet: There was an unexpected network error while writing to 10.240.38.123: rc = -22^M [21715.447305] LNet: 1 local NIs in recovery (showing 1): 10.240.39.5@tcp^M [21715.637913] LNet: 1469390:0:(api-ni.c:358:recovery_interval_set()) 'lnet_recovery_interval' has been deprecated^M [21715.652679] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M [21715.659414] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M [21715.729247] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M [21715.949323] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M [21715.956120] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M [21716.026111] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M [21716.029989] LNet: There was an unexpected network error while writing to 10.240.38.123: rc = -22^M [21716.031631] LNet: Skipped 5 previous similar messages^M [21716.231601] LNet: 1469526:0:(api-ni.c:358:recovery_interval_set()) 'lnet_recovery_interval' has been deprecated^M [21716.233446] LNet: 1469526:0:(api-ni.c:358:recovery_interval_set()) Skipped 1 previous similar message^M [21716.248327] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M [21716.255340] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M [21716.326242] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M [21716.628909] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M [21716.636214] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M [21716.707624] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M [21717.005150] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M 21717.011880] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M [21717.079281] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M [21717.083320] LNet: There was an unexpected network error while writing to 10.240.38.123: rc = -22^M [21717.084987] LNet: Skipped 9 previous similar messages^M [21717.142283] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M [21717.149327] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M [21720.087147] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&^M [21720.087147] lctl dl | grep ' ST ' || true^M [21720.119727] LNetError: 1469074:0:(lib-md.c:281:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed: ^M [21720.121839] LNetError: 1469074:0:(lib-md.c:281:lnet_assert_handler_unused()) LBUG^M [21720.123252] Pid: 1469074, comm: lnet_discovery 4.18.0-477.27.1.el8_8.x86_64 #1 SMP Thu Aug 31 10:29:22 EDT 2023^M [21720.125123] Call Trace TBD:^M [21720.125763] [<0>] libcfs_call_trace+0x63/0x90 [libcfs]^M [21720.126772] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]^M [21720.127754] [<0>] lnet_assert_handler_unused+0xa0/0xd0 [lnet]^M [21720.128903] [<0>] lnet_peer_discovery+0x14fb/0x1730 [lnet]^M [21720.129985] [<0>] kthread+0x134/0x150^M [21720.130761] [<0>] ret_from_fork+0x35/0x40^M [21720.131608] Kernel panic - not syncing: LBUG^M [21720.132444] CPU: 0 PID: 1469074 Comm: lnet_discovery Kdump: loaded Tainted: G W OE --------- - - 4.18.0-477.27.1.el8_8.x86_64 #1^M [21720.134714] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011^M [21720.134720] LNetError: 1469071:0:(lib-move.c:4933:lnet_parse()) 10.240.38.123@tcp, src 10.240.38.123@tcp: Dropping ACK (error -108 looking up sender)^M [21720.135832] Call Trace:^M [21720.135865] dump_stack+0x41/0x60^M [21720.139358] panic+0xe7/0x2ac^M [21720.139973] ? ret_from_fork+0x35/0x40^M [21720.140690] ? lnet_discovery_event_reply+0xb00/0xb00 [lnet]^M [21720.141763] lbug_with_loc.cold.8+0x18/0x18 [libcfs]^M [21720.142713] lnet_assert_handler_unused+0xa0/0xd0 [lnet]^M [21720.143717] lnet_peer_discovery+0x14fb/0x1730 [lnet]^M [21720.144692] ? finish_wait+0x80/0x80^M [21720.145393] ? lnet_peer_merge_data+0x1110/0x1110 [lnet]^M [21720.146400] kthread+0x134/0x150^M [21720.147027] ? set_kthread_struct+0x50/0x50^M [21720.147820] ret_from_fork+0x35/0x40^M [21720.087147] lctl dl | grep ' ST ' || true^M [21720.119727] LNetError: 1469074:0:(lib-md.c:281:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed: ^M [21720.121839] LNetError: 1469074:0:(lib-md.c:281:lnet_assert_handler_unused()) LBUG^M [21720.123252] Pid: 1469074, comm: lnet_discovery 4.18.0-477.27.1.el8_8.x86_64 #1 SMP Thu Aug 31 10:29:22 EDT 2023^M [21720.125123] Call Trace TBD:^M [21720.125763] [<0>] libcfs_call_trace+0x63/0x90 [libcfs]^M [21720.126772] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]^M [21720.127754] [<0>] lnet_assert_handler_unused+0xa0/0xd0 [lnet]^M [21720.128903] [<0>] lnet_peer_discovery+0x14fb/0x1730 [lnet]^M [21720.129985] [<0>] kthread+0x134/0x150^M [21720.130761] [<0>] ret_from_fork+0x35/0x40^M [21720.131608] Kernel panic - not syncing: LBUG^M [21720.132444] CPU: 0 PID: 1469074 Comm: lnet_discovery Kdump: loaded Tainted: G W OE --------- - - 4.18.0-477.27.1.el8_8.x86_64 #1^M [21720.134714] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011^M [21720.134720] LNetError: 1469071:0:(lib-move.c:4933:lnet_parse()) 10.240.38.123@tcp, src 10.240.38.123@tcp: Dropping ACK (error -108 looking up sender)^M [21720.135832] Call Trace:^M [21720.135865] dump_stack+0x41/0x60^M [21720.139358] panic+0xe7/0x2ac^M [21720.139973] ? ret_from_fork+0x35/0x40^M [21720.140690] ? lnet_discovery_event_reply+0xb00/0xb00 [lnet]^M [21720.141763] lbug_with_loc.cold.8+0x18/0x18 [libcfs]^M [21720.142713] lnet_assert_handler_unused+0xa0/0xd0 [lnet]^M [21720.143717] lnet_peer_discovery+0x14fb/0x1730 [lnet]^M [21720.144692] ? finish_wait+0x80/0x80^M [21720.145393] ? lnet_peer_merge_data+0x1110/0x1110 [lnet]^M
Sanity-lnet 204/205 run output
== sanity-lnet test 204: Check no health or resends for single-rail local failures ========================================================== 16:17:04 (1700756224) Cleaning up LNet CMD: trevis-33vm2.trevis.whamcloud.com lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' || true Writer error: failed to resolve Netlink family id modules unloaded. Loading modules from /usr/lib64/lustre detected 2 online CPUs by sysfs Force libcfs to create 2 CPU partitions ../libcfs/libcfs/libcfs options: 'cpu_npartitions=2' ../lnet/lnet/lnet options: 'accept=all' /usr/sbin/lnetctl lnet configure --all CMD: trevis-26vm4 /usr/sbin/lctl list_nids /usr/sbin/lnetctl discover 10.240.38.123@tcp discover: - primary nid: 10.240.38.123@tcp Multi-Rail: true peer_ni: - nid: 10.240.38.123@tcp net: - net type: lo local NI(s): - nid: 0@lo status: up - net type: tcp local NI(s): - nid: 10.240.39.5@tcp status: up interfaces: 0: eth0 - primary nid: 10.240.38.123@tcp - nid: 10.240.38.123@tcp health stats: health value: 1000 debug=+net Simulate local_interrupt Added drop rule 10.240.39.5@tcp->10.240.38.123@tcp (1/1) Added drop rule 10.240.39.5@tcp->10.240.39.5@tcp (1/1) /usr/sbin/lnetctl ping 10.240.38.123@tcp manage: - ping: errno: -1 descr: failed to ping 10.240.38.123@tcp: Input/output error Pre resends: 0 Post resends: 0 Resends delta: 0 Pre local health: 1000 Post local health: 1000 Pre remote health: 1000 Post remote health: 1000 /usr/sbin/lnetctl peer set --health 1000 --all /usr/sbin/lnetctl net set --health 1000 --all Removed 2 drop rules Check that no resends took place Check that local NI health is unchanged Simulate local_dropped Added drop rule 10.240.39.5@tcp->10.240.38.123@tcp (1/1) Added drop rule 10.240.39.5@tcp->10.240.39.5@tcp (1/1) /usr/sbin/lnetctl ping 10.240.38.123@tcp manage: - ping: errno: -1 descr: failed to ping 10.240.38.123@tcp: Input/output error Pre resends: 0 Post resends: 0 Resends delta: 0 Pre local health: 2000 Post local health: 2000 Pre remote health: 1000 Post remote health: 1000 /usr/sbin/lnetctl peer set --health 1000 --all /usr/sbin/lnetctl net set --health 1000 --all Removed 2 drop rules Check that no resends took place Check that local NI health is unchanged Simulate local_aborted Added drop rule 10.240.39.5@tcp->10.240.38.123@tcp (1/1) Added drop rule 10.240.39.5@tcp->10.240.39.5@tcp (1/1) /usr/sbin/lnetctl ping 10.240.38.123@tcp manage: - ping: errno: -1 descr: failed to ping 10.240.38.123@tcp: Input/output error Pre resends: 0 Post resends: 0 Resends delta: 0 Pre local health: 2000 Post local health: 2000 Pre remote health: 1000 Post remote health: 1000 /usr/sbin/lnetctl peer set --health 1000 --all /usr/sbin/lnetctl net set --health 1000 --all Removed 2 drop rules Check that no resends took place Check that local NI health is unchanged Simulate local_no_route Added drop rule 10.240.39.5@tcp->10.240.38.123@tcp (1/1) Added drop rule 10.240.39.5@tcp->10.240.39.5@tcp (1/1) /usr/sbin/lnetctl ping 10.240.38.123@tcp manage: - ping: errno: -1 descr: failed to ping 10.240.38.123@tcp: Input/output error Pre resends: 0 Post resends: 0 Resends delta: 0 Pre local health: 2000 Post local health: 2000 Pre remote health: 1000 Post remote health: 1000 /usr/sbin/lnetctl peer set --health 1000 --all /usr/sbin/lnetctl net set --health 1000 --all Removed 2 drop rules Check that no resends took place Check that local NI health is unchanged Simulate local_timeout Added drop rule 10.240.39.5@tcp->10.240.38.123@tcp (1/1) Added drop rule 10.240.39.5@tcp->10.240.39.5@tcp (1/1) /usr/sbin/lnetctl ping 10.240.38.123@tcp manage: - ping: errno: -1 descr: failed to ping 10.240.38.123@tcp: Input/output error Pre resends: 0 Post resends: 0 Resends delta: 0 Pre local health: 2000 Post local health: 2000 Pre remote health: 1000 Post remote health: 1000 /usr/sbin/lnetctl peer set --health 1000 --all /usr/sbin/lnetctl net set --health 1000 --all Removed 2 drop rules Check that no resends took place Check that local NI health is unchanged Simulate local_error Added drop rule 10.240.39.5@tcp->10.240.38.123@tcp (1/1) Added drop rule 10.240.39.5@tcp->10.240.39.5@tcp (1/1) /usr/sbin/lnetctl ping 10.240.38.123@tcp manage: - ping: errno: -1 descr: failed to ping 10.240.38.123@tcp: Input/output error Pre resends: 0 Post resends: 0 Resends delta: 0 Pre local health: 2000 Post local health: 2000 Pre remote health: 1000 Post remote health: 1000 /usr/sbin/lnetctl peer set --health 1000 --all /usr/sbin/lnetctl net set --health 1000 --all Removed 2 drop rules Check that no resends took place Check that local NI health is unchanged CMD: trevis-33vm2.trevis.whamcloud.com lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' || true Writer error: failed to resolve Netlink family id modules unloaded. pdsh@trevis-33vm2: trevis-33vm2: ssh exited with exit code 2 pdsh@trevis-33vm2: trevis-33vm2: ssh exited with exit code 2 PASS 204 (10s) CMD: trevis-26vm4 sysctl -wq kernel/kptr_restrict=1 CMD: trevis-33vm2.trevis.whamcloud.com sysctl -wq kernel/kptr_restrict=1 CMD: trevis-33vm3 sysctl -wq kernel/kptr_restrict=1 CMD: trevis-67vm7 sysctl -wq kernel/kptr_restrict=1 CMD: trevis-79vm7 sysctl -wq kernel/kptr_restrict=1 CMD: trevis-26vm4 sysctl --values kernel/kptr_restrict CMD: trevis-26vm4 sysctl -wq kernel/kptr_restrict=1 CMD: trevis-33vm2.trevis.whamcloud.com sysctl --values kernel/kptr_restrict CMD: trevis-33vm2.trevis.whamcloud.com sysctl -wq kernel/kptr_restrict=1 CMD: trevis-33vm3 sysctl --values kernel/kptr_restrict CMD: trevis-33vm3 sysctl -wq kernel/kptr_restrict=1 CMD: trevis-67vm7 sysctl --values kernel/kptr_restrict CMD: trevis-67vm7 sysctl -wq kernel/kptr_restrict=1 CMD: trevis-79vm7 sysctl --values kernel/kptr_restrict CMD: trevis-79vm7 sysctl -wq kernel/kptr_restrict=1 == sanity-lnet test 205: Check health and resends for multi-rail local failures ========================================================== 16:17:18 (1700756238) Cleaning up LNet CMD: trevis-33vm2.trevis.whamcloud.com lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' || true
Nothing alarming/useful under MDS/OST output
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-lnet test_205 - trevis-33vm2 crashed during sanity-lnet test_205