Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-17313

sanity-lnet test_205/test_220: LNetError: (lib-md.c:281:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler )

    XMLWordPrintable

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.16.0
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Arshad <arshad.hussain@aeoncomputing.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/1432600f-ba6e-45a0-a998-c2b871760b9b

      test_205 failed with the following error:

      trevis-33vm2 crashed during sanity-lnet test_205
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/100460 - 4.18.0-477.27.1.el8_8.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/100460 - 4.18.0-477.27.1.el8_lustre.x86_64

      Client Output:

       

      =================================== 16:17:04 \(1700756224\)^M
      [21692.122539] Lustre: DEBUG MARKER: == sanity-lnet test 204: Check no health or resends for single-rail local failures ========================================================== 16:17:04 (1700756224)^M
      [21692.403638] LNet: 1467339:0:(lib-ptl.c:956:lnet_clear_lazy_portal()) Active lazy portal 0 on exit^M
      [21692.405821] LNet: Removed LNI 10.1.2.3@tcp^M
      [21692.413980] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&^M
      [21692.413980] lctl dl | grep ' ST ' || true^M
      [21692.511140] Key type .llcrypt unregistered^M
      [21692.512020] Key type ._llcrypt unregistered^M
      [21692.800163] Key type ._llcrypt registered^M
      [21692.801024] Key type .llcrypt registered^M
      [21692.832028] libcfs: HW NUMA nodes: 1, HW CPU cores: 2, npartitions: 2^M
      [21692.836702] alg: No test for adler32 (adler32-zlib)^M
      [21693.611267] Lustre: DEBUG MARKER: /usr/sbin/lnetctl lnet configure --all^M
      [21693.617623] LNet: Added LNI 10.240.39.5@tcp [8/256/0/180]^M
      [21693.618922] LNet: Accept all, port 7988^M
      [21694.553419] Lustre: DEBUG MARKER: /usr/sbin/lnetctl discover 10.240.38.123@tcp^M
      [21694.846378] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M
      [21694.850290] LNet: There was an unexpected network error while writing to 10.240.38.123: rc = -22^M
      [21694.896561] LNet: 1467564:0:(api-ni.c:358:recovery_interval_set()) 'lnet_recovery_interval' has been deprecated^M
      [21694.911244] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M
      [21694.917894] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M
      [21694.971823] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M
      [21695.052546] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M
      [21695.059301] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M
      [21695.111848] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M
      [21695.199325] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M
      [21695.206117] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M
      [21695.258286] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M
      [21695.382743] LNet: There was an unexpected network error while writing to 10.240.38.123: rc = -22^M
      [21695.384355] LNet: Skipped 2 previous similar messages^M
      [21695.428743] LNet: 1467750:0:(api-ni.c:358:recovery_interval_set()) 'lnet_recovery_interval' has been deprecated^M
      [21695.430652] LNet: 1467750:0:(api-ni.c:358:recovery_interval_set()) Skipped 2 previous similar messages^M
      [21695.445043] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M
      [21695.451861] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M
      [21695.506046] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M
      [21695.607926] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M
      [21695.614630] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M
      [21695.667799] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M
      [21695.808409] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M
      [21695.815303] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M
      [21695.828815] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&^M
      [21695.828815] lctl dl | grep ' ST ' || true^M
      [21695.860860] LNet: 1467904:0:(lib-ptl.c:956:lnet_clear_lazy_portal()) Active lazy portal 0 on exit^M
      [21695.924073] LNetError: 1467470:0:(socklnd.c:1261:ksocknal_create_conn()) Not creating conn 12345-10.240.38.123@tcp type 2: peer_ni/conn_cb removed^M
      [21695.926432] LNetError: 11e-e: Unexpected error -116 connecting to 10.240.38.123@tcp at host 10.240.38.123:7988^M
      [21696.874054] LNet: Removed LNI 10.240.39.5@tcp^M
      [21696.938956] Key type .llcrypt unregistered^M
      [21696.939795] Key type ._llcrypt unregistered^M
      [21705.904334] Key type ._llcrypt registered^M
      [21705.905266] Key type .llcrypt registered^M
      [21706.693491] Key type .llcrypt unregistered^M
      [21706.694372] Key type ._llcrypt unregistered^M
      [21706.974304] Key type ._llcrypt registered^M
      [21706.978361] Key type .llcrypt registered^M
      [21707.011233] libcfs: HW NUMA nodes: 1, HW CPU cores: 2, npartitions: 2^M
      [21707.015533] alg: No test for adler32 (adler32-zlib)^M
      [21707.790659] Lustre: DEBUG MARKER: /usr/sbin/lnetctl lnet configure --all^M
      [21707.797247] LNet: Added LNI 10.240.39.5@tcp [8/256/0/180]^M
      [21707.798822] LNet: Accept all, port 7988^M
      [21708.368775] Lustre: DEBUG MARKER: /usr/sbin/lnetctl discover 10.240.38.123@tcp^M
      [21715.241659] Lustre: DEBUG MARKER: /usr/sbin/lnetctl lnet configure^M
      [21715.248115] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net add --net tcp1 --if eth0^M
      [21715.252231] LNet: Added LNI 10.240.39.5@tcp1 [8/256/0/180]^M
      [21715.441585] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M
      [21715.445581] LNet: There was an unexpected network error while writing to 10.240.38.123: rc = -22^M 
      [21715.447305] LNet: 1 local NIs in recovery (showing 1): 10.240.39.5@tcp^M
      [21715.637913] LNet: 1469390:0:(api-ni.c:358:recovery_interval_set()) 'lnet_recovery_interval' has been deprecated^M
      [21715.652679] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M
      [21715.659414] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M
      [21715.729247] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M
      [21715.949323] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M
      [21715.956120] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M
      [21716.026111] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M
      [21716.029989] LNet: There was an unexpected network error while writing to 10.240.38.123: rc = -22^M 
      [21716.031631] LNet: Skipped 5 previous similar messages^M
      [21716.231601] LNet: 1469526:0:(api-ni.c:358:recovery_interval_set()) 'lnet_recovery_interval' has been deprecated^M
      [21716.233446] LNet: 1469526:0:(api-ni.c:358:recovery_interval_set()) Skipped 1 previous similar message^M
      [21716.248327] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M
      [21716.255340] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M
      [21716.326242] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M
      [21716.628909] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M
      [21716.636214] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M
      [21716.707624] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M
      [21717.005150] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M
      21717.011880] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M
      [21717.079281] Lustre: DEBUG MARKER: /usr/sbin/lnetctl ping 10.240.38.123@tcp^M
      [21717.083320] LNet: There was an unexpected network error while writing to 10.240.38.123: rc = -22^M 
      [21717.084987] LNet: Skipped 9 previous similar messages^M
      [21717.142283] Lustre: DEBUG MARKER: /usr/sbin/lnetctl peer set --health 1000 --all^M
      [21717.149327] Lustre: DEBUG MARKER: /usr/sbin/lnetctl net set --health 1000 --all^M
      [21720.087147] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&^M
      [21720.087147] lctl dl | grep ' ST ' || true^M
      [21720.119727] LNetError: 1469074:0:(lib-md.c:281:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed: ^M
      [21720.121839] LNetError: 1469074:0:(lib-md.c:281:lnet_assert_handler_unused()) LBUG^M
      [21720.123252] Pid: 1469074, comm: lnet_discovery 4.18.0-477.27.1.el8_8.x86_64 #1 SMP Thu Aug 31 10:29:22 EDT 2023^M
      [21720.125123] Call Trace TBD:^M
      [21720.125763] [<0>] libcfs_call_trace+0x63/0x90 [libcfs]^M
      [21720.126772] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]^M
      [21720.127754] [<0>] lnet_assert_handler_unused+0xa0/0xd0 [lnet]^M
      [21720.128903] [<0>] lnet_peer_discovery+0x14fb/0x1730 [lnet]^M
      [21720.129985] [<0>] kthread+0x134/0x150^M
      [21720.130761] [<0>] ret_from_fork+0x35/0x40^M
      [21720.131608] Kernel panic - not syncing: LBUG^M
      [21720.132444] CPU: 0 PID: 1469074 Comm: lnet_discovery Kdump: loaded Tainted: G        W  OE    --------- -  - 4.18.0-477.27.1.el8_8.x86_64 #1^M
      [21720.134714] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011^M
      [21720.134720] LNetError: 1469071:0:(lib-move.c:4933:lnet_parse()) 10.240.38.123@tcp, src 10.240.38.123@tcp: Dropping ACK (error -108 looking up sender)^M
      [21720.135832] Call Trace:^M
      [21720.135865]  dump_stack+0x41/0x60^M
      [21720.139358]  panic+0xe7/0x2ac^M
      [21720.139973]  ? ret_from_fork+0x35/0x40^M
      [21720.140690]  ? lnet_discovery_event_reply+0xb00/0xb00 [lnet]^M
      [21720.141763]  lbug_with_loc.cold.8+0x18/0x18 [libcfs]^M
      [21720.142713]  lnet_assert_handler_unused+0xa0/0xd0 [lnet]^M
      [21720.143717]  lnet_peer_discovery+0x14fb/0x1730 [lnet]^M
      [21720.144692]  ? finish_wait+0x80/0x80^M
      [21720.145393]  ? lnet_peer_merge_data+0x1110/0x1110 [lnet]^M
      [21720.146400]  kthread+0x134/0x150^M
      [21720.147027]  ? set_kthread_struct+0x50/0x50^M
      [21720.147820]  ret_from_fork+0x35/0x40^M
      [21720.087147] lctl dl | grep ' ST ' || true^M
      [21720.119727] LNetError: 1469074:0:(lib-md.c:281:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed: ^M
      [21720.121839] LNetError: 1469074:0:(lib-md.c:281:lnet_assert_handler_unused()) LBUG^M
      [21720.123252] Pid: 1469074, comm: lnet_discovery 4.18.0-477.27.1.el8_8.x86_64 #1 SMP Thu Aug 31 10:29:22 EDT 2023^M
      [21720.125123] Call Trace TBD:^M
      [21720.125763] [<0>] libcfs_call_trace+0x63/0x90 [libcfs]^M
      [21720.126772] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]^M
      [21720.127754] [<0>] lnet_assert_handler_unused+0xa0/0xd0 [lnet]^M
      [21720.128903] [<0>] lnet_peer_discovery+0x14fb/0x1730 [lnet]^M
      [21720.129985] [<0>] kthread+0x134/0x150^M
      [21720.130761] [<0>] ret_from_fork+0x35/0x40^M
      [21720.131608] Kernel panic - not syncing: LBUG^M
      [21720.132444] CPU: 0 PID: 1469074 Comm: lnet_discovery Kdump: loaded Tainted: G        W  OE    --------- -  - 4.18.0-477.27.1.el8_8.x86_64 #1^M
      [21720.134714] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011^M
      [21720.134720] LNetError: 1469071:0:(lib-move.c:4933:lnet_parse()) 10.240.38.123@tcp, src 10.240.38.123@tcp: Dropping ACK (error -108 looking up sender)^M
      [21720.135832] Call Trace:^M
      [21720.135865]  dump_stack+0x41/0x60^M
      [21720.139358]  panic+0xe7/0x2ac^M
      [21720.139973]  ? ret_from_fork+0x35/0x40^M
      [21720.140690]  ? lnet_discovery_event_reply+0xb00/0xb00 [lnet]^M
      [21720.141763]  lbug_with_loc.cold.8+0x18/0x18 [libcfs]^M
      [21720.142713]  lnet_assert_handler_unused+0xa0/0xd0 [lnet]^M
      [21720.143717]  lnet_peer_discovery+0x14fb/0x1730 [lnet]^M
      [21720.144692]  ? finish_wait+0x80/0x80^M
      [21720.145393]  ? lnet_peer_merge_data+0x1110/0x1110 [lnet]^M
      

       

      Sanity-lnet 204/205 run output

       

      == sanity-lnet test 204: Check no health or resends for single-rail local failures ========================================================== 16:17:04 (1700756224)
      Cleaning up LNet
      CMD: trevis-33vm2.trevis.whamcloud.com lsmod | grep lnet > /dev/null &&
      lctl dl | grep ' ST ' || true
      Writer error: failed to resolve Netlink family id
      modules unloaded.
      Loading modules from /usr/lib64/lustre
      detected 2 online CPUs by sysfs
      Force libcfs to create 2 CPU partitions
      ../libcfs/libcfs/libcfs options: 'cpu_npartitions=2'
      ../lnet/lnet/lnet options: 'accept=all'
      /usr/sbin/lnetctl lnet configure --all
      CMD: trevis-26vm4 /usr/sbin/lctl list_nids
      /usr/sbin/lnetctl discover 10.240.38.123@tcp
      discover:
      - primary nid: 10.240.38.123@tcp
        Multi-Rail: true
        peer_ni:
        - nid: 10.240.38.123@tcp
      net:
      -     net type: lo
            local NI(s):
            -     nid: 0@lo
                  status: up
      -     net type: tcp
            local NI(s):
            -     nid: 10.240.39.5@tcp
                  status: up
                  interfaces:
                        0: eth0
      -     primary nid: 10.240.38.123@tcp
            -     nid: 10.240.38.123@tcp
                  health stats:
                        health value: 1000
      debug=+net
      Simulate local_interrupt
      Added drop rule 10.240.39.5@tcp->10.240.38.123@tcp (1/1)
      Added drop rule 10.240.39.5@tcp->10.240.39.5@tcp (1/1)
      /usr/sbin/lnetctl ping 10.240.38.123@tcp
      manage:
          - ping:
                errno: -1
                descr: failed to ping 10.240.38.123@tcp: Input/output error
                       
      Pre resends: 0
      Post resends: 0
      Resends delta: 0
      Pre local health: 1000
      Post local health: 1000
      Pre remote health: 1000
      Post remote health: 1000
      /usr/sbin/lnetctl peer set --health 1000 --all
      /usr/sbin/lnetctl net set --health 1000 --all
      Removed 2 drop rules
      Check that no resends took place
      Check that local NI health is unchanged
      Simulate local_dropped
      Added drop rule 10.240.39.5@tcp->10.240.38.123@tcp (1/1)
      Added drop rule 10.240.39.5@tcp->10.240.39.5@tcp (1/1)
      /usr/sbin/lnetctl ping 10.240.38.123@tcp
      manage:
          - ping:
                errno: -1
                descr: failed to ping 10.240.38.123@tcp: Input/output error
                       
      Pre resends: 0
      Post resends: 0
      Resends delta: 0
      Pre local health: 2000
      Post local health: 2000
      Pre remote health: 1000
      Post remote health: 1000
      /usr/sbin/lnetctl peer set --health 1000 --all
      /usr/sbin/lnetctl net set --health 1000 --all
      Removed 2 drop rules
      Check that no resends took place
      Check that local NI health is unchanged
      Simulate local_aborted
      Added drop rule 10.240.39.5@tcp->10.240.38.123@tcp (1/1)
      Added drop rule 10.240.39.5@tcp->10.240.39.5@tcp (1/1)
      /usr/sbin/lnetctl ping 10.240.38.123@tcp
      manage:
          - ping:
                errno: -1
                descr: failed to ping 10.240.38.123@tcp: Input/output error
                       
      Pre resends: 0
      Post resends: 0
      Resends delta: 0
      Pre local health: 2000
      Post local health: 2000
      Pre remote health: 1000
      Post remote health: 1000
      /usr/sbin/lnetctl peer set --health 1000 --all
      /usr/sbin/lnetctl net set --health 1000 --all
      Removed 2 drop rules
      Check that no resends took place
      Check that local NI health is unchanged
      Simulate local_no_route
      Added drop rule 10.240.39.5@tcp->10.240.38.123@tcp (1/1)
      Added drop rule 10.240.39.5@tcp->10.240.39.5@tcp (1/1)
      /usr/sbin/lnetctl ping 10.240.38.123@tcp
      manage:
          - ping:
                errno: -1
                descr: failed to ping 10.240.38.123@tcp: Input/output error
                       
      Pre resends: 0
      Post resends: 0
      Resends delta: 0
      Pre local health: 2000
      Post local health: 2000
      Pre remote health: 1000
      Post remote health: 1000
      /usr/sbin/lnetctl peer set --health 1000 --all
      /usr/sbin/lnetctl net set --health 1000 --all
      Removed 2 drop rules
      Check that no resends took place
      Check that local NI health is unchanged
      Simulate local_timeout
      Added drop rule 10.240.39.5@tcp->10.240.38.123@tcp (1/1)
      Added drop rule 10.240.39.5@tcp->10.240.39.5@tcp (1/1)
      /usr/sbin/lnetctl ping 10.240.38.123@tcp
      manage:
          - ping:
                errno: -1
                descr: failed to ping 10.240.38.123@tcp: Input/output error
                       
      Pre resends: 0
      Post resends: 0
      Resends delta: 0
      Pre local health: 2000
      Post local health: 2000
      Pre remote health: 1000
      Post remote health: 1000
      /usr/sbin/lnetctl peer set --health 1000 --all
      /usr/sbin/lnetctl net set --health 1000 --all
      Removed 2 drop rules
      Check that no resends took place
      Check that local NI health is unchanged
      Simulate local_error
      Added drop rule 10.240.39.5@tcp->10.240.38.123@tcp (1/1)
      Added drop rule 10.240.39.5@tcp->10.240.39.5@tcp (1/1)
      /usr/sbin/lnetctl ping 10.240.38.123@tcp
      manage:
          - ping:
                errno: -1
                descr: failed to ping 10.240.38.123@tcp: Input/output error
                       
      Pre resends: 0
      Post resends: 0
      Resends delta: 0
      Pre local health: 2000
      Post local health: 2000
      Pre remote health: 1000
      Post remote health: 1000
      /usr/sbin/lnetctl peer set --health 1000 --all
      /usr/sbin/lnetctl net set --health 1000 --all
      Removed 2 drop rules
      Check that no resends took place
      Check that local NI health is unchanged
      CMD: trevis-33vm2.trevis.whamcloud.com lsmod | grep lnet > /dev/null &&
      lctl dl | grep ' ST ' || true
      Writer error: failed to resolve Netlink family id
      modules unloaded.
      pdsh@trevis-33vm2: trevis-33vm2: ssh exited with exit code 2
      pdsh@trevis-33vm2: trevis-33vm2: ssh exited with exit code 2
      PASS 204 (10s)
      CMD: trevis-26vm4 sysctl -wq kernel/kptr_restrict=1
      CMD: trevis-33vm2.trevis.whamcloud.com sysctl -wq kernel/kptr_restrict=1
      CMD: trevis-33vm3 sysctl -wq kernel/kptr_restrict=1
      CMD: trevis-67vm7 sysctl -wq kernel/kptr_restrict=1
      CMD: trevis-79vm7 sysctl -wq kernel/kptr_restrict=1
      CMD: trevis-26vm4 sysctl --values kernel/kptr_restrict
      CMD: trevis-26vm4 sysctl -wq kernel/kptr_restrict=1
      CMD: trevis-33vm2.trevis.whamcloud.com sysctl --values kernel/kptr_restrict
      CMD: trevis-33vm2.trevis.whamcloud.com sysctl -wq kernel/kptr_restrict=1
      CMD: trevis-33vm3 sysctl --values kernel/kptr_restrict
      CMD: trevis-33vm3 sysctl -wq kernel/kptr_restrict=1
      CMD: trevis-67vm7 sysctl --values kernel/kptr_restrict
      CMD: trevis-67vm7 sysctl -wq kernel/kptr_restrict=1
      CMD: trevis-79vm7 sysctl --values kernel/kptr_restrict
      CMD: trevis-79vm7 sysctl -wq kernel/kptr_restrict=1
      == sanity-lnet test 205: Check health and resends for multi-rail local failures ========================================================== 16:17:18 (1700756238)
      Cleaning up LNet
      CMD: trevis-33vm2.trevis.whamcloud.com lsmod | grep lnet > /dev/null &&
      lctl dl | grep ' ST ' || true
      

       

      Nothing alarming/useful under MDS/OST output

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-lnet test_205 - trevis-33vm2 crashed during sanity-lnet test_205

      Attachments

        Issue Links

          Activity

            People

              cbordage Cyril Bordage
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: