Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14810

sanity-lnet test_212: lnet_assert_handler_unused() ASSERTION(md->md_handler != handler) failed

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • Lustre 2.17.0
    • Lustre 2.16.0
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/84ed5e32-bda8-4880-a254-dbd96cb6a478

      test_212 failed with the following error:

      trevis-79vm3 crashed during sanity-lnet test_212
      
      LNetError: 944562:0:(lib-md.c:288:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed: 
      Pid: 944562, comm: lnet_discovery 4.18.0-240.22.1.el8_3.aarch64 #1 SMP Thu Apr 8 19:01:45 UTC 2021
      Call Trace:
       libcfs_call_trace+0xb8/0x118 [libcfs]
       lbug_with_loc+0x60/0xa0 [libcfs]
       lnet_assert_handler_unused+0xb8/0xe0 [lnet]
       lnet_peer_discovery+0x16c4/0x1cb0 [lnet]
       kthread+0x130/0x138
      

      This test was recently landed via patch https://review.whamcloud.com/43418 "LU-14627 lnet: Ensure ref taken when queueing for discovery"

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-lnet test_212 - trevis-79vm3 crashed during sanity-lnet test_212

      Attachments

        Issue Links

          Activity

            [LU-14810] sanity-lnet test_212: lnet_assert_handler_unused() ASSERTION(md->md_handler != handler) failed
            hornc Chris Horn added a comment - +1 on master https://testing.whamcloud.com/test_sets/ec596cc8-2bd5-41a8-a098-821566f270e4
            ys Yang Sheng added a comment -

            +1. https://testing.whamcloud.com/test_sets/5b314583-911e-4f74-abba-3590e5e66bd0

            [18631.568718] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait for 1080568
            [18631.791373] Lustre: DEBUG MARKER: Wait for 1080568
            [18633.346663] LNet: There was an unexpected network error while writing to 10.240.25.244: rc = -22
            [18633.596010] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Finished wait on 1080568
            [18633.871079] Lustre: DEBUG MARKER: Finished wait on 1080568
            [18633.891459] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
            lctl dl | grep ' ST ' || true
            [18633.922010] LNetError: 1081165:0:(lib-md.c:259:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed: 
            [18633.922014] LNetError: 1081165:0:(lib-md.c:259:lnet_assert_handler_unused()) LBUG
            [18633.922027] CPU: 0 PID: 1081165 Comm: lnetctl Kdump: loaded Tainted: G        W  OE     -------  ---  5.14.0-362.24.1.el9_3.x86_64 #1
            [18633.922030] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [18633.922034] Call Trace:
            [18633.922053]  <TASK>
            [18633.922057]  dump_stack_lvl+0x34/0x48
            [18633.922111]  lbug_with_loc.cold+0x5/0x58 [libcfs]
            [18633.922131]  lnet_assert_handler_unused+0x9c/0xd0 [lnet]
            [18633.922195]  ? __pfx_lnet_discovery_event_handler+0x10/0x10 [lnet]
            [18633.922231]  LNetNIFini+0x9f/0x150 [lnet]
            [18633.922261]  lnet_unconfigure+0x66/0x80 [lnet]
            [18633.922297]  genl_family_rcv_msg_doit.isra.0+0xcb/0x120
            [18633.922330]  genl_family_rcv_msg+0x14c/0x220
            [18633.922333]  ? __pfx_lnet_net_conf_cmd+0x10/0x10 [lnet]
            [18633.922357]  genl_rcv_msg+0x47/0xa0
            [18633.922361]  ? __pfx_genl_rcv_msg+0x10/0x10
            [18633.922363]  netlink_rcv_skb+0x57/0x100
            [18633.922367]  genl_rcv+0x24/0x40
            [18633.922370]  netlink_unicast+0x23e/0x360
            [18633.922372]  netlink_sendmsg+0x238/0x480
            [18633.922374]  ? __check_object_size.part.0+0x35/0xd0
            [18633.922401]  sock_sendmsg+0x62/0x70
            [18633.922422]  ____sys_sendmsg+0x230/0x270
            [18633.922424]  ? copy_msghdr_from_user+0x6d/0xa0
            [18633.922427]  ___sys_sendmsg+0x88/0xd0
            [18633.922433]  ? ___sys_recvmsg+0x88/0xd0
            [18633.922436]  __sys_sendmsg+0x59/0xa0
            [18633.922438]  do_syscall_64+0x5c/0x90
            [18633.922463]  ? syscall_exit_work+0x103/0x130
            [18633.922481]  ? syscall_exit_to_user_mode+0x22/0x40
            [18633.922484]  ? do_syscall_64+0x69/0x90
            [18633.922485]  ? do_syscall_64+0x69/0x90
            [18633.922487]  ? exc_page_fault+0x62/0x150
            [18633.922489]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
            [18633.922505] RIP: 0033:0x7fc79b94f787
            [18633.922558] Code: 0a 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
            [18633.922560] RSP: 002b:00007ffe57304698 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
            [18633.922566] RAX: ffffffffffffffda RBX: 0000556d5c0e3390 RCX: 00007fc79b94f787
            [18633.922567] RDX: 0000000000000000 RSI: 00007ffe573046d0 RDI: 0000000000000003
            [18633.922569] RBP: 0000556d5c0e32a0 R08: 0000000000000003 R09: 0000000000000000
            [18633.922569] R10: 0000000000000010 R11: 0000000000000246 R12: 0000556d5c111ce0
            [18633.922571] R13: 00007ffe573046d0 R14: 0000000000000004 R15: 0000556d5c0f4915
            [18633.922573]  </TASK>
            
            
            ys Yang Sheng added a comment - +1. https://testing.whamcloud.com/test_sets/5b314583-911e-4f74-abba-3590e5e66bd0 [18631.568718] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait for 1080568 [18631.791373] Lustre: DEBUG MARKER: Wait for 1080568 [18633.346663] LNet: There was an unexpected network error while writing to 10.240.25.244: rc = -22 [18633.596010] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Finished wait on 1080568 [18633.871079] Lustre: DEBUG MARKER: Finished wait on 1080568 [18633.891459] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' || true [18633.922010] LNetError: 1081165:0:(lib-md.c:259:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed: [18633.922014] LNetError: 1081165:0:(lib-md.c:259:lnet_assert_handler_unused()) LBUG [18633.922027] CPU: 0 PID: 1081165 Comm: lnetctl Kdump: loaded Tainted: G W OE ------- --- 5.14.0-362.24.1.el9_3.x86_64 #1 [18633.922030] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [18633.922034] Call Trace: [18633.922053] <TASK> [18633.922057] dump_stack_lvl+0x34/0x48 [18633.922111] lbug_with_loc.cold+0x5/0x58 [libcfs] [18633.922131] lnet_assert_handler_unused+0x9c/0xd0 [lnet] [18633.922195] ? __pfx_lnet_discovery_event_handler+0x10/0x10 [lnet] [18633.922231] LNetNIFini+0x9f/0x150 [lnet] [18633.922261] lnet_unconfigure+0x66/0x80 [lnet] [18633.922297] genl_family_rcv_msg_doit.isra.0+0xcb/0x120 [18633.922330] genl_family_rcv_msg+0x14c/0x220 [18633.922333] ? __pfx_lnet_net_conf_cmd+0x10/0x10 [lnet] [18633.922357] genl_rcv_msg+0x47/0xa0 [18633.922361] ? __pfx_genl_rcv_msg+0x10/0x10 [18633.922363] netlink_rcv_skb+0x57/0x100 [18633.922367] genl_rcv+0x24/0x40 [18633.922370] netlink_unicast+0x23e/0x360 [18633.922372] netlink_sendmsg+0x238/0x480 [18633.922374] ? __check_object_size.part.0+0x35/0xd0 [18633.922401] sock_sendmsg+0x62/0x70 [18633.922422] ____sys_sendmsg+0x230/0x270 [18633.922424] ? copy_msghdr_from_user+0x6d/0xa0 [18633.922427] ___sys_sendmsg+0x88/0xd0 [18633.922433] ? ___sys_recvmsg+0x88/0xd0 [18633.922436] __sys_sendmsg+0x59/0xa0 [18633.922438] do_syscall_64+0x5c/0x90 [18633.922463] ? syscall_exit_work+0x103/0x130 [18633.922481] ? syscall_exit_to_user_mode+0x22/0x40 [18633.922484] ? do_syscall_64+0x69/0x90 [18633.922485] ? do_syscall_64+0x69/0x90 [18633.922487] ? exc_page_fault+0x62/0x150 [18633.922489] entry_SYSCALL_64_after_hwframe+0x72/0xdc [18633.922505] RIP: 0033:0x7fc79b94f787 [18633.922558] Code: 0a 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10 [18633.922560] RSP: 002b:00007ffe57304698 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [18633.922566] RAX: ffffffffffffffda RBX: 0000556d5c0e3390 RCX: 00007fc79b94f787 [18633.922567] RDX: 0000000000000000 RSI: 00007ffe573046d0 RDI: 0000000000000003 [18633.922569] RBP: 0000556d5c0e32a0 R08: 0000000000000003 R09: 0000000000000000 [18633.922569] R10: 0000000000000010 R11: 0000000000000246 R12: 0000556d5c111ce0 [18633.922571] R13: 00007ffe573046d0 R14: 0000000000000004 R15: 0000556d5c0f4915 [18633.922573] </TASK>
            adilger Andreas Dilger added a comment - - edited This crashed again on the tip of master (patch based on v2_15_65-5-g018c4e8f25): https://testing.whamcloud.com/test_sets/964597eb-199c-47fc-9609-b23c1f47f949 and 30 other crashes of sanity-lnet test_212 since this patch landed: https://testing.whamcloud.com/search?horizon=2332800&status%5B%5D=CRASH&test_set_script_id=a2b1c4b2-b449-11e9-b88c-52540065bddc&sub_test_script_id=058541e7-0f00-4f82-a6e5-8316ea99a160&source=sub_tests#redirect
            pjones Peter Jones added a comment -

            Merged for 2.16

            pjones Peter Jones added a comment - Merged for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55559/
            Subject: LU-14810 lnet: Do not issue multiple PUSHes
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 72726a311814bc0c0eefb22a769c9ebf7912839e

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55559/ Subject: LU-14810 lnet: Do not issue multiple PUSHes Project: fs/lustre-release Branch: master Current Patch Set: Commit: 72726a311814bc0c0eefb22a769c9ebf7912839e

            "Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55559
            Subject: LU-14810 lnet: Do not issue multiple PUSHes
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d33780a2496bcf10c179fff9bb25e7084b2baf0b

            gerrit Gerrit Updater added a comment - "Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55559 Subject: LU-14810 lnet: Do not issue multiple PUSHes Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: d33780a2496bcf10c179fff9bb25e7084b2baf0b
            hornc Chris Horn added a comment -

            This easily reproduces running sanity-lnet/304 using https://review.whamcloud.com/c/fs/lustre-release/+/55422 with FORCE_LARGE_NID=true and ipv6 configuration in place. I modified the test case to unload modules immediately after the lnetctl discovery command returns.

            I still don't have a complete picture, but here's some additional detail:

            Initial discovery PING to 2601:8c1:c180:2000::cbdc@tcp

            00000400:00000200:20.0:1719427955.999607:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2601:8c1:c180:2000::cbdc@tcp(000000002304c278) state 0x40
            00000400:00000010:20.0:1719427955.999610:0:13521:0:(api-ni.c:1785:lnet_ping_buffer_alloc()) alloc '(pbuf)': 281 at 000000004b255cff.
            00000400:00000200:20.0:1719427955.999654:0:13521:0:(lib-move.c:1958:lnet_handle_send()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:<?>) -> 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) <?> : GET try# 0
            00000400:00000200:20.0:1719427955.999675:0:13521:0:(peer.c:3805:lnet_peer_send_ping()) peer 2601:8c1:c180:2000::cbdc@tcp
            00000400:00000200:20.0:1719427955.999677:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2601:8c1:c180:2000::cbdc@tcp(000000002304c278) state 0x240 rc 0
            00000400:00000200:18.0:1719427955.999846:0:13597:0:(lib-move.c:4742:lnet_parse()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp) <- 2601:8c1:c180:2000::cbdc@tcp : GET - for me
            00000400:00000200:18.0:1719427955.999881:0:13597:0:(lib-move.c:1958:lnet_handle_send()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) -> 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) 2601:8c1:c180:2000::cbdc@tcp : REPLY try# 0
            00000400:00000200:18.0:1719427955.999926:0:13597:0:(lib-msg.c:797:lnet_health_check()) health check: 2601:8c1:c180:2000::cbdc@tcp->2601:8c1:c180:2000::cbdc@tcp: REPLY: OK
            

            Peers are merged:

            00000400:00000200:20.0:1719427955.999942:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2601:8c1:c180:2000::cbdc@tcp(000000002304c278) state 0xc1
            00000400:00000200:20.0:1719427955.999950:0:13521:0:(peer.c:470:lnet_peer_cancel_discovery()) peer 2001:db8:0:f101::1@tcp(000000005e8466da)
            00000400:00000200:20.0:1719427955.999958:0:13521:0:(peer.c:279:lnet_destroy_peer_locked()) 000000005e8466da nid 2001:db8:0:f101::1@tcp
            00000400:00000010:20.0:1719427955.999958:0:13521:0:(peer.c:309:lnet_destroy_peer_locked()) kfreed 'lp': 336 at 000000005e8466da.
            00000400:00000010:20.0:1719427955.999968:0:13521:0:(api-ni.c:1799:lnet_ping_buffer_free()) kfreed 'pbuf': 281 at 000000004b255cff.
            00000400:00000200:20.0:1719427955.999969:0:13521:0:(peer.c:3426:lnet_peer_merge_data()) peer 2001:db8:0:f101::1@tcp (000000002304c278): 0
            00000400:00000200:20.0:1719427955.999970:0:13521:0:(peer.c:3731:lnet_peer_data_present()) peer 2001:db8:0:f101::1@tcp(000000002304c278): 0. state = 0x100141
            00000400:00000200:20.0:1719427955.999971:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100141 rc 1
            

            Discovery PUSH (1) sent to 2601:8c1:c180:2000::cbdc@tcp:

            00000400:00000200:20.0:1719427955.999972:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100141
            00000400:00000200:20.0:1719427955.999982:0:13521:0:(lib-move.c:1958:lnet_handle_send()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) -> 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) <?> : PUT try# 0
            00000400:00000200:20.0:1719427955.999991:0:13521:0:(peer.c:3939:lnet_peer_send_push()) peer 2001:db8:0:f101::1@tcp(000000002304c278)
            00000400:00000200:20.0:1719427955.999991:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100541 rc 0
            00000400:00000200:18.0:1719427956.000020:0:13597:0:(lib-move.c:4742:lnet_parse()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp) <- 2601:8c1:c180:2000::cbdc@tcp : PUT - for me
            00000400:00000200:18.0:1719427956.000036:0:13597:0:(lib-msg.c:797:lnet_health_check()) health check: 2601:8c1:c180:2000::cbdc@tcp->2601:8c1:c180:2000::cbdc@tcp: PUT: OK
            00000400:00000200:18.0:1719427956.000040:0:13597:0:(api-ni.c:2242:lnet_push_target_event_handler()) type 2 status 0 unlinked 1
            00000400:00000200:18.0:1719427956.000043:0:13597:0:(peer.c:2581:lnet_peer_push_event()) peer 2001:db8:0:f101::1@tcp(000000002304c278) is MR
            00000400:00000010:18.0:1719427956.000045:0:13597:0:(api-ni.c:1785:lnet_ping_buffer_alloc()) alloc '(pbuf)': 89 at 000000007a703a44.
            00000400:00000200:18.0:1719427956.000047:0:13597:0:(peer.c:2655:lnet_peer_push_event()) Received Push 2001:db8:0:f101::1@tcp 3
            00000400:00000200:18.0:1719427956.000049:0:13597:0:(peer.c:2416:lnet_peer_queue_for_discovery()) Queue peer 2001:db8:0:f101::1@tcp: -114
            

            Discovery thread (PID 13521) starts processing the incoming PUSH, but ACK is sent async (via PID 13597):

            00000400:00000200:20.0:1719427956.000058:0:13521:0:(api-ni.c:2232:lnet_push_target_post()) posted push target 00000000c5b5a8b6
            
            00000400:00000200:18.0:1719427956.000063:0:13597:0:(lib-move.c:1958:lnet_handle_send()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) -> 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) 2601:8c1:c180:2000::cbdc@tcp : ACK try# 0
            
            00000400:00000200:20.0:1719427956.000066:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x1004c1
            00000400:00000200:20.0:1719427956.000068:0:13521:0:(peer.c:1908:lnet_peer_add_nid()) peer 2001:db8:0:f101::1@tcp NID 2601:8c1:c180:2000::cbdc@tcp flags 0x11: 0
            00000400:00000200:20.0:1719427956.000073:0:13521:0:(peer.c:3426:lnet_peer_merge_data()) peer 2001:db8:0:f101::1@tcp (000000002304c278): 0
            00000400:00000200:20.0:1719427956.000074:0:13521:0:(peer.c:3731:lnet_peer_data_present()) peer 2001:db8:0:f101::1@tcp(000000002304c278): 0. state = 0x100541
            00000400:00000200:20.0:1719427956.000075:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100541 rc 1
            

            Discovery PUSH (2) sending to 2001:db8:0:f101::1@tcp:

            00000400:00000200:20.0:1719427956.000076:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100541
            00000400:00000010:20.0:1719427956.000076:0:13521:0:(lib-md.c:130:lnet_md_build()) slab-alloced 'lmd': 136 at 0000000048c8e5a7.
            00000400:00000200:20.0:1719427956.000078:0:13521:0:(lib-move.c:5252:LNetPut()) LNetPut -> 12345-2001:db8:0:f101::1@tcp
            

            ACK for PUSH (1) is finalized while PUSH (2) is in process of sending:

            00000400:00000200:18.0:1719427956.000089:0:13597:0:(lib-msg.c:797:lnet_health_check()) health check: 2601:8c1:c180:2000::cbdc@tcp->2601:8c1:c180:2000::cbdc@tcp: ACK: OK
            

            PUSH (2) continues sending:

            00000400:00000200:20.0:1719427956.000094:0:13521:0:(lib-move.c:1958:lnet_handle_send()) TRACE: 2001:db8:0:f101::1@tcp(2001:db8:0:f101::1@tcp:<?>) -> 2001:db8:0:f101::1@tcp(2001:db8:0:f101::1@tcp:2001:db8:0:f101::1@tcp) <?> : PUT try# 0
            ...
            00000400:00000200:20.0:1719427956.000104:0:13521:0:(peer.c:3939:lnet_peer_send_push()) peer 2001:db8:0:f101::1@tcp(000000002304c278)
            00000400:00000200:20.0:1719427956.000105:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100541 rc 0
            

            When discovery thread next wakes up, the PUSH_SENT flag is cleared. I think this is because of the async ACK processing.

            00000400:00000200:20.0:1719427956.000113:0:13521:0:(peer.c:4005:lnet_peer_discovery_wait_for_work()) woken: 0
            00000400:00000200:20.0:1719427956.000115:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100141
            00000400:00000200:20.0:1719427956.000116:0:13521:0:(peer.c:3861:lnet_peer_discovered()) peer 2001:db8:0:f101::1@tcp
            00000400:00000200:20.0:1719427956.000116:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100111 rc 0
            00000400:00000200:20.0:1719427956.000117:0:13521:0:(peer.c:2432:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 2001:db8:0:f101::1@tcp
            
            00000400:00000200:20.0:1719427956.012685:0:13521:0:(peer.c:4005:lnet_peer_discovery_wait_for_work()) woken: -108
            00000400:00000200:20.0:1719427956.012686:0:13521:0:(peer.c:4164:lnet_peer_discovery()) stopping
            00000400:00000200:20.0:1719427956.012688:0:13521:0:(peer.c:4210:lnet_peer_discovery()) stopped
            LBUG...
            
            hornc Chris Horn added a comment - This easily reproduces running sanity-lnet/304 using https://review.whamcloud.com/c/fs/lustre-release/+/55422 with FORCE_LARGE_NID=true and ipv6 configuration in place. I modified the test case to unload modules immediately after the lnetctl discovery command returns. I still don't have a complete picture, but here's some additional detail: Initial discovery PING to 2601:8c1:c180:2000::cbdc@tcp 00000400:00000200:20.0:1719427955.999607:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2601:8c1:c180:2000::cbdc@tcp(000000002304c278) state 0x40 00000400:00000010:20.0:1719427955.999610:0:13521:0:(api-ni.c:1785:lnet_ping_buffer_alloc()) alloc '(pbuf)': 281 at 000000004b255cff. 00000400:00000200:20.0:1719427955.999654:0:13521:0:(lib-move.c:1958:lnet_handle_send()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:<?>) -> 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) <?> : GET try# 0 00000400:00000200:20.0:1719427955.999675:0:13521:0:(peer.c:3805:lnet_peer_send_ping()) peer 2601:8c1:c180:2000::cbdc@tcp 00000400:00000200:20.0:1719427955.999677:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2601:8c1:c180:2000::cbdc@tcp(000000002304c278) state 0x240 rc 0 00000400:00000200:18.0:1719427955.999846:0:13597:0:(lib-move.c:4742:lnet_parse()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp) <- 2601:8c1:c180:2000::cbdc@tcp : GET - for me 00000400:00000200:18.0:1719427955.999881:0:13597:0:(lib-move.c:1958:lnet_handle_send()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) -> 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) 2601:8c1:c180:2000::cbdc@tcp : REPLY try# 0 00000400:00000200:18.0:1719427955.999926:0:13597:0:(lib-msg.c:797:lnet_health_check()) health check: 2601:8c1:c180:2000::cbdc@tcp->2601:8c1:c180:2000::cbdc@tcp: REPLY: OK Peers are merged: 00000400:00000200:20.0:1719427955.999942:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2601:8c1:c180:2000::cbdc@tcp(000000002304c278) state 0xc1 00000400:00000200:20.0:1719427955.999950:0:13521:0:(peer.c:470:lnet_peer_cancel_discovery()) peer 2001:db8:0:f101::1@tcp(000000005e8466da) 00000400:00000200:20.0:1719427955.999958:0:13521:0:(peer.c:279:lnet_destroy_peer_locked()) 000000005e8466da nid 2001:db8:0:f101::1@tcp 00000400:00000010:20.0:1719427955.999958:0:13521:0:(peer.c:309:lnet_destroy_peer_locked()) kfreed 'lp': 336 at 000000005e8466da. 00000400:00000010:20.0:1719427955.999968:0:13521:0:(api-ni.c:1799:lnet_ping_buffer_free()) kfreed 'pbuf': 281 at 000000004b255cff. 00000400:00000200:20.0:1719427955.999969:0:13521:0:(peer.c:3426:lnet_peer_merge_data()) peer 2001:db8:0:f101::1@tcp (000000002304c278): 0 00000400:00000200:20.0:1719427955.999970:0:13521:0:(peer.c:3731:lnet_peer_data_present()) peer 2001:db8:0:f101::1@tcp(000000002304c278): 0. state = 0x100141 00000400:00000200:20.0:1719427955.999971:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100141 rc 1 Discovery PUSH (1) sent to 2601:8c1:c180:2000::cbdc@tcp: 00000400:00000200:20.0:1719427955.999972:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100141 00000400:00000200:20.0:1719427955.999982:0:13521:0:(lib-move.c:1958:lnet_handle_send()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) -> 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) <?> : PUT try# 0 00000400:00000200:20.0:1719427955.999991:0:13521:0:(peer.c:3939:lnet_peer_send_push()) peer 2001:db8:0:f101::1@tcp(000000002304c278) 00000400:00000200:20.0:1719427955.999991:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100541 rc 0 00000400:00000200:18.0:1719427956.000020:0:13597:0:(lib-move.c:4742:lnet_parse()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp) <- 2601:8c1:c180:2000::cbdc@tcp : PUT - for me 00000400:00000200:18.0:1719427956.000036:0:13597:0:(lib-msg.c:797:lnet_health_check()) health check: 2601:8c1:c180:2000::cbdc@tcp->2601:8c1:c180:2000::cbdc@tcp: PUT: OK 00000400:00000200:18.0:1719427956.000040:0:13597:0:(api-ni.c:2242:lnet_push_target_event_handler()) type 2 status 0 unlinked 1 00000400:00000200:18.0:1719427956.000043:0:13597:0:(peer.c:2581:lnet_peer_push_event()) peer 2001:db8:0:f101::1@tcp(000000002304c278) is MR 00000400:00000010:18.0:1719427956.000045:0:13597:0:(api-ni.c:1785:lnet_ping_buffer_alloc()) alloc '(pbuf)': 89 at 000000007a703a44. 00000400:00000200:18.0:1719427956.000047:0:13597:0:(peer.c:2655:lnet_peer_push_event()) Received Push 2001:db8:0:f101::1@tcp 3 00000400:00000200:18.0:1719427956.000049:0:13597:0:(peer.c:2416:lnet_peer_queue_for_discovery()) Queue peer 2001:db8:0:f101::1@tcp: -114 Discovery thread (PID 13521) starts processing the incoming PUSH, but ACK is sent async (via PID 13597): 00000400:00000200:20.0:1719427956.000058:0:13521:0:(api-ni.c:2232:lnet_push_target_post()) posted push target 00000000c5b5a8b6 00000400:00000200:18.0:1719427956.000063:0:13597:0:(lib-move.c:1958:lnet_handle_send()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) -> 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) 2601:8c1:c180:2000::cbdc@tcp : ACK try# 0 00000400:00000200:20.0:1719427956.000066:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x1004c1 00000400:00000200:20.0:1719427956.000068:0:13521:0:(peer.c:1908:lnet_peer_add_nid()) peer 2001:db8:0:f101::1@tcp NID 2601:8c1:c180:2000::cbdc@tcp flags 0x11: 0 00000400:00000200:20.0:1719427956.000073:0:13521:0:(peer.c:3426:lnet_peer_merge_data()) peer 2001:db8:0:f101::1@tcp (000000002304c278): 0 00000400:00000200:20.0:1719427956.000074:0:13521:0:(peer.c:3731:lnet_peer_data_present()) peer 2001:db8:0:f101::1@tcp(000000002304c278): 0. state = 0x100541 00000400:00000200:20.0:1719427956.000075:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100541 rc 1 Discovery PUSH (2) sending to 2001:db8:0:f101::1@tcp: 00000400:00000200:20.0:1719427956.000076:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100541 00000400:00000010:20.0:1719427956.000076:0:13521:0:(lib-md.c:130:lnet_md_build()) slab-alloced 'lmd': 136 at 0000000048c8e5a7. 00000400:00000200:20.0:1719427956.000078:0:13521:0:(lib-move.c:5252:LNetPut()) LNetPut -> 12345-2001:db8:0:f101::1@tcp ACK for PUSH (1) is finalized while PUSH (2) is in process of sending: 00000400:00000200:18.0:1719427956.000089:0:13597:0:(lib-msg.c:797:lnet_health_check()) health check: 2601:8c1:c180:2000::cbdc@tcp->2601:8c1:c180:2000::cbdc@tcp: ACK: OK PUSH (2) continues sending: 00000400:00000200:20.0:1719427956.000094:0:13521:0:(lib-move.c:1958:lnet_handle_send()) TRACE: 2001:db8:0:f101::1@tcp(2001:db8:0:f101::1@tcp:<?>) -> 2001:db8:0:f101::1@tcp(2001:db8:0:f101::1@tcp:2001:db8:0:f101::1@tcp) <?> : PUT try# 0 ... 00000400:00000200:20.0:1719427956.000104:0:13521:0:(peer.c:3939:lnet_peer_send_push()) peer 2001:db8:0:f101::1@tcp(000000002304c278) 00000400:00000200:20.0:1719427956.000105:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100541 rc 0 When discovery thread next wakes up, the PUSH_SENT flag is cleared. I think this is because of the async ACK processing. 00000400:00000200:20.0:1719427956.000113:0:13521:0:(peer.c:4005:lnet_peer_discovery_wait_for_work()) woken: 0 00000400:00000200:20.0:1719427956.000115:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100141 00000400:00000200:20.0:1719427956.000116:0:13521:0:(peer.c:3861:lnet_peer_discovered()) peer 2001:db8:0:f101::1@tcp 00000400:00000200:20.0:1719427956.000116:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100111 rc 0 00000400:00000200:20.0:1719427956.000117:0:13521:0:(peer.c:2432:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 2001:db8:0:f101::1@tcp 00000400:00000200:20.0:1719427956.012685:0:13521:0:(peer.c:4005:lnet_peer_discovery_wait_for_work()) woken: -108 00000400:00000200:20.0:1719427956.012686:0:13521:0:(peer.c:4164:lnet_peer_discovery()) stopping 00000400:00000200:20.0:1719427956.012688:0:13521:0:(peer.c:4210:lnet_peer_discovery()) stopped LBUG...
            hornc Chris Horn added a comment -

            Added dk.log.fmt which is the full debug log extracted from the crash dump

            hornc Chris Horn added a comment - Added dk.log.fmt which is the full debug log extracted from the crash dump
            hornc Chris Horn added a comment - - edited

            I tripped this assert running sanity-lnet test 304 and 310 (it trips in 310 when unloading modules after test 304 runs).

            While processing the peer in discovery we send a push which must be ACK'd:

            00000400:00000200:18.0:1719348279.400460:0:915:0:(peer.c:4115:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100541
            00000400:00000010:18.0:1719348279.400461:0:915:0:(lib-md.c:130:lnet_md_build()) slab-alloced 'lmd': 136 at 00000000afac742d.
            00000400:00000200:18.0:1719348279.400503:0:915:0:(peer.c:3937:lnet_peer_send_push()) peer 2001:db8:0:f101::1@tcp
            00000400:00000200:18.0:1719348279.400504:0:915:0:(peer.c:4137:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100141 rc 0
            

            (Note: unclear why LNET_PEER_PUSH_SENT is not set in above peer state)

            Discovery "completes" while this push is outstanding:

            00000400:00000200:18.0:1719348279.400505:0:915:0:(peer.c:4115:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100141
            00000400:00000200:18.0:1719348279.400507:0:915:0:(peer.c:4137:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100111 rc 0
            00000400:00000200:18.0:1719348279.400508:0:915:0:(peer.c:2430:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 2001:db8:0:f101::1@tcp
            00000400:00000200:13.0:1719348279.400514:0:1237:0:(peer.c:2785:lnet_discover_peer_locked()) peer 2001:db8:0:f101::1@tcp NID 2601:8c1:c180:2000::cbdc@tcp: 0. discovery complete
            

            Thus the test continues on and now LNet is shutdown, and discovery thread stopped while push is outstanding:

            00000400:00000200:18.0:1719348281.265136:0:915:0:(peer.c:4003:lnet_peer_discovery_wait_for_work()) woken: -108
            00000400:00000200:18.0:1719348281.265138:0:915:0:(peer.c:4162:lnet_peer_discovery()) stopping
            00000400:00000200:18.0:1719348281.265140:0:915:0:(peer.c:4208:lnet_peer_discovery()) stopped
            00000400:00000200:15.0:1719348281.265143:0:1926:0:(peer.c:4265:lnet_peer_discovery_stop()) discovery stopped
            ...
            00000400:00000200:12.0:1719348281.265164:0:1926:0:(lib-md.c:42:lnet_md_unlink()) Queueing unlink of md 00000000afac742d
            ...
            00000400:00040000:12.0:1719348281.265181:0:1926:0:(lib-md.c:259:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed:
            00000400:00040000:12.0:1719348281.276144:0:1926:0:(lib-md.c:259:lnet_assert_handler_unused()) LBUG
            

            I suspect something similar is happening in test 212 as the test cases are similar. It's not clear to me what the best solution is here, but I wanted to share this in case others have an idea.

            hornc Chris Horn added a comment - - edited I tripped this assert running sanity-lnet test 304 and 310 (it trips in 310 when unloading modules after test 304 runs). While processing the peer in discovery we send a push which must be ACK'd: 00000400:00000200:18.0:1719348279.400460:0:915:0:(peer.c:4115:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100541 00000400:00000010:18.0:1719348279.400461:0:915:0:(lib-md.c:130:lnet_md_build()) slab-alloced 'lmd': 136 at 00000000afac742d. 00000400:00000200:18.0:1719348279.400503:0:915:0:(peer.c:3937:lnet_peer_send_push()) peer 2001:db8:0:f101::1@tcp 00000400:00000200:18.0:1719348279.400504:0:915:0:(peer.c:4137:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100141 rc 0 (Note: unclear why LNET_PEER_PUSH_SENT is not set in above peer state) Discovery "completes" while this push is outstanding: 00000400:00000200:18.0:1719348279.400505:0:915:0:(peer.c:4115:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100141 00000400:00000200:18.0:1719348279.400507:0:915:0:(peer.c:4137:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100111 rc 0 00000400:00000200:18.0:1719348279.400508:0:915:0:(peer.c:2430:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 2001:db8:0:f101::1@tcp 00000400:00000200:13.0:1719348279.400514:0:1237:0:(peer.c:2785:lnet_discover_peer_locked()) peer 2001:db8:0:f101::1@tcp NID 2601:8c1:c180:2000::cbdc@tcp: 0. discovery complete Thus the test continues on and now LNet is shutdown, and discovery thread stopped while push is outstanding: 00000400:00000200:18.0:1719348281.265136:0:915:0:(peer.c:4003:lnet_peer_discovery_wait_for_work()) woken: -108 00000400:00000200:18.0:1719348281.265138:0:915:0:(peer.c:4162:lnet_peer_discovery()) stopping 00000400:00000200:18.0:1719348281.265140:0:915:0:(peer.c:4208:lnet_peer_discovery()) stopped 00000400:00000200:15.0:1719348281.265143:0:1926:0:(peer.c:4265:lnet_peer_discovery_stop()) discovery stopped ... 00000400:00000200:12.0:1719348281.265164:0:1926:0:(lib-md.c:42:lnet_md_unlink()) Queueing unlink of md 00000000afac742d ... 00000400:00040000:12.0:1719348281.265181:0:1926:0:(lib-md.c:259:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed: 00000400:00040000:12.0:1719348281.276144:0:1926:0:(lib-md.c:259:lnet_assert_handler_unused()) LBUG I suspect something similar is happening in test 212 as the test cases are similar. It's not clear to me what the best solution is here, but I wanted to share this in case others have an idea.
            adilger Andreas Dilger added a comment - This is still being hit on master, despite patch 54884 landing as commit v2_15_63-8-g3ba393a5cb: https://testing.whamcloud.com/test_sets/979b507b-225a-407d-bc2c-e02925ae9d39 https://testing.whamcloud.com/test_sets/3805363b-bb13-4133-a2cf-41b6bcff7154 https://testing.whamcloud.com/test_sets/7d5b948d-8245-476f-abd1-9fbe8d1cb752 https://testing.whamcloud.com/test_sets/bb60d168-511c-4861-ac7d-5f14389d836e

            People

              cbordage Cyril Bordage
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated: