Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14810

sanity-lnet test_212: lnet_assert_handler_unused() ASSERTION(md->md_handler != handler) failed

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • Lustre 2.17.0
    • Lustre 2.16.0
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/84ed5e32-bda8-4880-a254-dbd96cb6a478

      test_212 failed with the following error:

      trevis-79vm3 crashed during sanity-lnet test_212
      
      LNetError: 944562:0:(lib-md.c:288:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed: 
      Pid: 944562, comm: lnet_discovery 4.18.0-240.22.1.el8_3.aarch64 #1 SMP Thu Apr 8 19:01:45 UTC 2021
      Call Trace:
       libcfs_call_trace+0xb8/0x118 [libcfs]
       lbug_with_loc+0x60/0xa0 [libcfs]
       lnet_assert_handler_unused+0xb8/0xe0 [lnet]
       lnet_peer_discovery+0x16c4/0x1cb0 [lnet]
       kthread+0x130/0x138
      

      This test was recently landed via patch https://review.whamcloud.com/43418 "LU-14627 lnet: Ensure ref taken when queueing for discovery"

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-lnet test_212 - trevis-79vm3 crashed during sanity-lnet test_212

      Attachments

        Issue Links

          Activity

            [LU-14810] sanity-lnet test_212: lnet_assert_handler_unused() ASSERTION(md->md_handler != handler) failed
            pjones Peter Jones added a comment -

            Merged for 2.16

            pjones Peter Jones added a comment - Merged for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55559/
            Subject: LU-14810 lnet: Do not issue multiple PUSHes
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 72726a311814bc0c0eefb22a769c9ebf7912839e

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55559/ Subject: LU-14810 lnet: Do not issue multiple PUSHes Project: fs/lustre-release Branch: master Current Patch Set: Commit: 72726a311814bc0c0eefb22a769c9ebf7912839e

            "Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55559
            Subject: LU-14810 lnet: Do not issue multiple PUSHes
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d33780a2496bcf10c179fff9bb25e7084b2baf0b

            gerrit Gerrit Updater added a comment - "Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55559 Subject: LU-14810 lnet: Do not issue multiple PUSHes Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: d33780a2496bcf10c179fff9bb25e7084b2baf0b
            hornc Chris Horn added a comment -

            This easily reproduces running sanity-lnet/304 using https://review.whamcloud.com/c/fs/lustre-release/+/55422 with FORCE_LARGE_NID=true and ipv6 configuration in place. I modified the test case to unload modules immediately after the lnetctl discovery command returns.

            I still don't have a complete picture, but here's some additional detail:

            Initial discovery PING to 2601:8c1:c180:2000::cbdc@tcp

            00000400:00000200:20.0:1719427955.999607:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2601:8c1:c180:2000::cbdc@tcp(000000002304c278) state 0x40
            00000400:00000010:20.0:1719427955.999610:0:13521:0:(api-ni.c:1785:lnet_ping_buffer_alloc()) alloc '(pbuf)': 281 at 000000004b255cff.
            00000400:00000200:20.0:1719427955.999654:0:13521:0:(lib-move.c:1958:lnet_handle_send()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:<?>) -> 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) <?> : GET try# 0
            00000400:00000200:20.0:1719427955.999675:0:13521:0:(peer.c:3805:lnet_peer_send_ping()) peer 2601:8c1:c180:2000::cbdc@tcp
            00000400:00000200:20.0:1719427955.999677:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2601:8c1:c180:2000::cbdc@tcp(000000002304c278) state 0x240 rc 0
            00000400:00000200:18.0:1719427955.999846:0:13597:0:(lib-move.c:4742:lnet_parse()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp) <- 2601:8c1:c180:2000::cbdc@tcp : GET - for me
            00000400:00000200:18.0:1719427955.999881:0:13597:0:(lib-move.c:1958:lnet_handle_send()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) -> 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) 2601:8c1:c180:2000::cbdc@tcp : REPLY try# 0
            00000400:00000200:18.0:1719427955.999926:0:13597:0:(lib-msg.c:797:lnet_health_check()) health check: 2601:8c1:c180:2000::cbdc@tcp->2601:8c1:c180:2000::cbdc@tcp: REPLY: OK
            

            Peers are merged:

            00000400:00000200:20.0:1719427955.999942:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2601:8c1:c180:2000::cbdc@tcp(000000002304c278) state 0xc1
            00000400:00000200:20.0:1719427955.999950:0:13521:0:(peer.c:470:lnet_peer_cancel_discovery()) peer 2001:db8:0:f101::1@tcp(000000005e8466da)
            00000400:00000200:20.0:1719427955.999958:0:13521:0:(peer.c:279:lnet_destroy_peer_locked()) 000000005e8466da nid 2001:db8:0:f101::1@tcp
            00000400:00000010:20.0:1719427955.999958:0:13521:0:(peer.c:309:lnet_destroy_peer_locked()) kfreed 'lp': 336 at 000000005e8466da.
            00000400:00000010:20.0:1719427955.999968:0:13521:0:(api-ni.c:1799:lnet_ping_buffer_free()) kfreed 'pbuf': 281 at 000000004b255cff.
            00000400:00000200:20.0:1719427955.999969:0:13521:0:(peer.c:3426:lnet_peer_merge_data()) peer 2001:db8:0:f101::1@tcp (000000002304c278): 0
            00000400:00000200:20.0:1719427955.999970:0:13521:0:(peer.c:3731:lnet_peer_data_present()) peer 2001:db8:0:f101::1@tcp(000000002304c278): 0. state = 0x100141
            00000400:00000200:20.0:1719427955.999971:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100141 rc 1
            

            Discovery PUSH (1) sent to 2601:8c1:c180:2000::cbdc@tcp:

            00000400:00000200:20.0:1719427955.999972:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100141
            00000400:00000200:20.0:1719427955.999982:0:13521:0:(lib-move.c:1958:lnet_handle_send()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) -> 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) <?> : PUT try# 0
            00000400:00000200:20.0:1719427955.999991:0:13521:0:(peer.c:3939:lnet_peer_send_push()) peer 2001:db8:0:f101::1@tcp(000000002304c278)
            00000400:00000200:20.0:1719427955.999991:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100541 rc 0
            00000400:00000200:18.0:1719427956.000020:0:13597:0:(lib-move.c:4742:lnet_parse()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp) <- 2601:8c1:c180:2000::cbdc@tcp : PUT - for me
            00000400:00000200:18.0:1719427956.000036:0:13597:0:(lib-msg.c:797:lnet_health_check()) health check: 2601:8c1:c180:2000::cbdc@tcp->2601:8c1:c180:2000::cbdc@tcp: PUT: OK
            00000400:00000200:18.0:1719427956.000040:0:13597:0:(api-ni.c:2242:lnet_push_target_event_handler()) type 2 status 0 unlinked 1
            00000400:00000200:18.0:1719427956.000043:0:13597:0:(peer.c:2581:lnet_peer_push_event()) peer 2001:db8:0:f101::1@tcp(000000002304c278) is MR
            00000400:00000010:18.0:1719427956.000045:0:13597:0:(api-ni.c:1785:lnet_ping_buffer_alloc()) alloc '(pbuf)': 89 at 000000007a703a44.
            00000400:00000200:18.0:1719427956.000047:0:13597:0:(peer.c:2655:lnet_peer_push_event()) Received Push 2001:db8:0:f101::1@tcp 3
            00000400:00000200:18.0:1719427956.000049:0:13597:0:(peer.c:2416:lnet_peer_queue_for_discovery()) Queue peer 2001:db8:0:f101::1@tcp: -114
            

            Discovery thread (PID 13521) starts processing the incoming PUSH, but ACK is sent async (via PID 13597):

            00000400:00000200:20.0:1719427956.000058:0:13521:0:(api-ni.c:2232:lnet_push_target_post()) posted push target 00000000c5b5a8b6
            
            00000400:00000200:18.0:1719427956.000063:0:13597:0:(lib-move.c:1958:lnet_handle_send()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) -> 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) 2601:8c1:c180:2000::cbdc@tcp : ACK try# 0
            
            00000400:00000200:20.0:1719427956.000066:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x1004c1
            00000400:00000200:20.0:1719427956.000068:0:13521:0:(peer.c:1908:lnet_peer_add_nid()) peer 2001:db8:0:f101::1@tcp NID 2601:8c1:c180:2000::cbdc@tcp flags 0x11: 0
            00000400:00000200:20.0:1719427956.000073:0:13521:0:(peer.c:3426:lnet_peer_merge_data()) peer 2001:db8:0:f101::1@tcp (000000002304c278): 0
            00000400:00000200:20.0:1719427956.000074:0:13521:0:(peer.c:3731:lnet_peer_data_present()) peer 2001:db8:0:f101::1@tcp(000000002304c278): 0. state = 0x100541
            00000400:00000200:20.0:1719427956.000075:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100541 rc 1
            

            Discovery PUSH (2) sending to 2001:db8:0:f101::1@tcp:

            00000400:00000200:20.0:1719427956.000076:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100541
            00000400:00000010:20.0:1719427956.000076:0:13521:0:(lib-md.c:130:lnet_md_build()) slab-alloced 'lmd': 136 at 0000000048c8e5a7.
            00000400:00000200:20.0:1719427956.000078:0:13521:0:(lib-move.c:5252:LNetPut()) LNetPut -> 12345-2001:db8:0:f101::1@tcp
            

            ACK for PUSH (1) is finalized while PUSH (2) is in process of sending:

            00000400:00000200:18.0:1719427956.000089:0:13597:0:(lib-msg.c:797:lnet_health_check()) health check: 2601:8c1:c180:2000::cbdc@tcp->2601:8c1:c180:2000::cbdc@tcp: ACK: OK
            

            PUSH (2) continues sending:

            00000400:00000200:20.0:1719427956.000094:0:13521:0:(lib-move.c:1958:lnet_handle_send()) TRACE: 2001:db8:0:f101::1@tcp(2001:db8:0:f101::1@tcp:<?>) -> 2001:db8:0:f101::1@tcp(2001:db8:0:f101::1@tcp:2001:db8:0:f101::1@tcp) <?> : PUT try# 0
            ...
            00000400:00000200:20.0:1719427956.000104:0:13521:0:(peer.c:3939:lnet_peer_send_push()) peer 2001:db8:0:f101::1@tcp(000000002304c278)
            00000400:00000200:20.0:1719427956.000105:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100541 rc 0
            

            When discovery thread next wakes up, the PUSH_SENT flag is cleared. I think this is because of the async ACK processing.

            00000400:00000200:20.0:1719427956.000113:0:13521:0:(peer.c:4005:lnet_peer_discovery_wait_for_work()) woken: 0
            00000400:00000200:20.0:1719427956.000115:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100141
            00000400:00000200:20.0:1719427956.000116:0:13521:0:(peer.c:3861:lnet_peer_discovered()) peer 2001:db8:0:f101::1@tcp
            00000400:00000200:20.0:1719427956.000116:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100111 rc 0
            00000400:00000200:20.0:1719427956.000117:0:13521:0:(peer.c:2432:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 2001:db8:0:f101::1@tcp
            
            00000400:00000200:20.0:1719427956.012685:0:13521:0:(peer.c:4005:lnet_peer_discovery_wait_for_work()) woken: -108
            00000400:00000200:20.0:1719427956.012686:0:13521:0:(peer.c:4164:lnet_peer_discovery()) stopping
            00000400:00000200:20.0:1719427956.012688:0:13521:0:(peer.c:4210:lnet_peer_discovery()) stopped
            LBUG...
            
            hornc Chris Horn added a comment - This easily reproduces running sanity-lnet/304 using https://review.whamcloud.com/c/fs/lustre-release/+/55422 with FORCE_LARGE_NID=true and ipv6 configuration in place. I modified the test case to unload modules immediately after the lnetctl discovery command returns. I still don't have a complete picture, but here's some additional detail: Initial discovery PING to 2601:8c1:c180:2000::cbdc@tcp 00000400:00000200:20.0:1719427955.999607:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2601:8c1:c180:2000::cbdc@tcp(000000002304c278) state 0x40 00000400:00000010:20.0:1719427955.999610:0:13521:0:(api-ni.c:1785:lnet_ping_buffer_alloc()) alloc '(pbuf)': 281 at 000000004b255cff. 00000400:00000200:20.0:1719427955.999654:0:13521:0:(lib-move.c:1958:lnet_handle_send()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:<?>) -> 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) <?> : GET try# 0 00000400:00000200:20.0:1719427955.999675:0:13521:0:(peer.c:3805:lnet_peer_send_ping()) peer 2601:8c1:c180:2000::cbdc@tcp 00000400:00000200:20.0:1719427955.999677:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2601:8c1:c180:2000::cbdc@tcp(000000002304c278) state 0x240 rc 0 00000400:00000200:18.0:1719427955.999846:0:13597:0:(lib-move.c:4742:lnet_parse()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp) <- 2601:8c1:c180:2000::cbdc@tcp : GET - for me 00000400:00000200:18.0:1719427955.999881:0:13597:0:(lib-move.c:1958:lnet_handle_send()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) -> 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) 2601:8c1:c180:2000::cbdc@tcp : REPLY try# 0 00000400:00000200:18.0:1719427955.999926:0:13597:0:(lib-msg.c:797:lnet_health_check()) health check: 2601:8c1:c180:2000::cbdc@tcp->2601:8c1:c180:2000::cbdc@tcp: REPLY: OK Peers are merged: 00000400:00000200:20.0:1719427955.999942:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2601:8c1:c180:2000::cbdc@tcp(000000002304c278) state 0xc1 00000400:00000200:20.0:1719427955.999950:0:13521:0:(peer.c:470:lnet_peer_cancel_discovery()) peer 2001:db8:0:f101::1@tcp(000000005e8466da) 00000400:00000200:20.0:1719427955.999958:0:13521:0:(peer.c:279:lnet_destroy_peer_locked()) 000000005e8466da nid 2001:db8:0:f101::1@tcp 00000400:00000010:20.0:1719427955.999958:0:13521:0:(peer.c:309:lnet_destroy_peer_locked()) kfreed 'lp': 336 at 000000005e8466da. 00000400:00000010:20.0:1719427955.999968:0:13521:0:(api-ni.c:1799:lnet_ping_buffer_free()) kfreed 'pbuf': 281 at 000000004b255cff. 00000400:00000200:20.0:1719427955.999969:0:13521:0:(peer.c:3426:lnet_peer_merge_data()) peer 2001:db8:0:f101::1@tcp (000000002304c278): 0 00000400:00000200:20.0:1719427955.999970:0:13521:0:(peer.c:3731:lnet_peer_data_present()) peer 2001:db8:0:f101::1@tcp(000000002304c278): 0. state = 0x100141 00000400:00000200:20.0:1719427955.999971:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100141 rc 1 Discovery PUSH (1) sent to 2601:8c1:c180:2000::cbdc@tcp: 00000400:00000200:20.0:1719427955.999972:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100141 00000400:00000200:20.0:1719427955.999982:0:13521:0:(lib-move.c:1958:lnet_handle_send()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) -> 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) <?> : PUT try# 0 00000400:00000200:20.0:1719427955.999991:0:13521:0:(peer.c:3939:lnet_peer_send_push()) peer 2001:db8:0:f101::1@tcp(000000002304c278) 00000400:00000200:20.0:1719427955.999991:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100541 rc 0 00000400:00000200:18.0:1719427956.000020:0:13597:0:(lib-move.c:4742:lnet_parse()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp) <- 2601:8c1:c180:2000::cbdc@tcp : PUT - for me 00000400:00000200:18.0:1719427956.000036:0:13597:0:(lib-msg.c:797:lnet_health_check()) health check: 2601:8c1:c180:2000::cbdc@tcp->2601:8c1:c180:2000::cbdc@tcp: PUT: OK 00000400:00000200:18.0:1719427956.000040:0:13597:0:(api-ni.c:2242:lnet_push_target_event_handler()) type 2 status 0 unlinked 1 00000400:00000200:18.0:1719427956.000043:0:13597:0:(peer.c:2581:lnet_peer_push_event()) peer 2001:db8:0:f101::1@tcp(000000002304c278) is MR 00000400:00000010:18.0:1719427956.000045:0:13597:0:(api-ni.c:1785:lnet_ping_buffer_alloc()) alloc '(pbuf)': 89 at 000000007a703a44. 00000400:00000200:18.0:1719427956.000047:0:13597:0:(peer.c:2655:lnet_peer_push_event()) Received Push 2001:db8:0:f101::1@tcp 3 00000400:00000200:18.0:1719427956.000049:0:13597:0:(peer.c:2416:lnet_peer_queue_for_discovery()) Queue peer 2001:db8:0:f101::1@tcp: -114 Discovery thread (PID 13521) starts processing the incoming PUSH, but ACK is sent async (via PID 13597): 00000400:00000200:20.0:1719427956.000058:0:13521:0:(api-ni.c:2232:lnet_push_target_post()) posted push target 00000000c5b5a8b6 00000400:00000200:18.0:1719427956.000063:0:13597:0:(lib-move.c:1958:lnet_handle_send()) TRACE: 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) -> 2601:8c1:c180:2000::cbdc@tcp(2601:8c1:c180:2000::cbdc@tcp:2601:8c1:c180:2000::cbdc@tcp) 2601:8c1:c180:2000::cbdc@tcp : ACK try# 0 00000400:00000200:20.0:1719427956.000066:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x1004c1 00000400:00000200:20.0:1719427956.000068:0:13521:0:(peer.c:1908:lnet_peer_add_nid()) peer 2001:db8:0:f101::1@tcp NID 2601:8c1:c180:2000::cbdc@tcp flags 0x11: 0 00000400:00000200:20.0:1719427956.000073:0:13521:0:(peer.c:3426:lnet_peer_merge_data()) peer 2001:db8:0:f101::1@tcp (000000002304c278): 0 00000400:00000200:20.0:1719427956.000074:0:13521:0:(peer.c:3731:lnet_peer_data_present()) peer 2001:db8:0:f101::1@tcp(000000002304c278): 0. state = 0x100541 00000400:00000200:20.0:1719427956.000075:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100541 rc 1 Discovery PUSH (2) sending to 2001:db8:0:f101::1@tcp: 00000400:00000200:20.0:1719427956.000076:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100541 00000400:00000010:20.0:1719427956.000076:0:13521:0:(lib-md.c:130:lnet_md_build()) slab-alloced 'lmd': 136 at 0000000048c8e5a7. 00000400:00000200:20.0:1719427956.000078:0:13521:0:(lib-move.c:5252:LNetPut()) LNetPut -> 12345-2001:db8:0:f101::1@tcp ACK for PUSH (1) is finalized while PUSH (2) is in process of sending: 00000400:00000200:18.0:1719427956.000089:0:13597:0:(lib-msg.c:797:lnet_health_check()) health check: 2601:8c1:c180:2000::cbdc@tcp->2601:8c1:c180:2000::cbdc@tcp: ACK: OK PUSH (2) continues sending: 00000400:00000200:20.0:1719427956.000094:0:13521:0:(lib-move.c:1958:lnet_handle_send()) TRACE: 2001:db8:0:f101::1@tcp(2001:db8:0:f101::1@tcp:<?>) -> 2001:db8:0:f101::1@tcp(2001:db8:0:f101::1@tcp:2001:db8:0:f101::1@tcp) <?> : PUT try# 0 ... 00000400:00000200:20.0:1719427956.000104:0:13521:0:(peer.c:3939:lnet_peer_send_push()) peer 2001:db8:0:f101::1@tcp(000000002304c278) 00000400:00000200:20.0:1719427956.000105:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100541 rc 0 When discovery thread next wakes up, the PUSH_SENT flag is cleared. I think this is because of the async ACK processing. 00000400:00000200:20.0:1719427956.000113:0:13521:0:(peer.c:4005:lnet_peer_discovery_wait_for_work()) woken: 0 00000400:00000200:20.0:1719427956.000115:0:13521:0:(peer.c:4117:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100141 00000400:00000200:20.0:1719427956.000116:0:13521:0:(peer.c:3861:lnet_peer_discovered()) peer 2001:db8:0:f101::1@tcp 00000400:00000200:20.0:1719427956.000116:0:13521:0:(peer.c:4139:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(000000002304c278) state 0x100111 rc 0 00000400:00000200:20.0:1719427956.000117:0:13521:0:(peer.c:2432:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 2001:db8:0:f101::1@tcp 00000400:00000200:20.0:1719427956.012685:0:13521:0:(peer.c:4005:lnet_peer_discovery_wait_for_work()) woken: -108 00000400:00000200:20.0:1719427956.012686:0:13521:0:(peer.c:4164:lnet_peer_discovery()) stopping 00000400:00000200:20.0:1719427956.012688:0:13521:0:(peer.c:4210:lnet_peer_discovery()) stopped LBUG...
            hornc Chris Horn added a comment -

            Added dk.log.fmt which is the full debug log extracted from the crash dump

            hornc Chris Horn added a comment - Added dk.log.fmt which is the full debug log extracted from the crash dump
            hornc Chris Horn added a comment - - edited

            I tripped this assert running sanity-lnet test 304 and 310 (it trips in 310 when unloading modules after test 304 runs).

            While processing the peer in discovery we send a push which must be ACK'd:

            00000400:00000200:18.0:1719348279.400460:0:915:0:(peer.c:4115:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100541
            00000400:00000010:18.0:1719348279.400461:0:915:0:(lib-md.c:130:lnet_md_build()) slab-alloced 'lmd': 136 at 00000000afac742d.
            00000400:00000200:18.0:1719348279.400503:0:915:0:(peer.c:3937:lnet_peer_send_push()) peer 2001:db8:0:f101::1@tcp
            00000400:00000200:18.0:1719348279.400504:0:915:0:(peer.c:4137:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100141 rc 0
            

            (Note: unclear why LNET_PEER_PUSH_SENT is not set in above peer state)

            Discovery "completes" while this push is outstanding:

            00000400:00000200:18.0:1719348279.400505:0:915:0:(peer.c:4115:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100141
            00000400:00000200:18.0:1719348279.400507:0:915:0:(peer.c:4137:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100111 rc 0
            00000400:00000200:18.0:1719348279.400508:0:915:0:(peer.c:2430:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 2001:db8:0:f101::1@tcp
            00000400:00000200:13.0:1719348279.400514:0:1237:0:(peer.c:2785:lnet_discover_peer_locked()) peer 2001:db8:0:f101::1@tcp NID 2601:8c1:c180:2000::cbdc@tcp: 0. discovery complete
            

            Thus the test continues on and now LNet is shutdown, and discovery thread stopped while push is outstanding:

            00000400:00000200:18.0:1719348281.265136:0:915:0:(peer.c:4003:lnet_peer_discovery_wait_for_work()) woken: -108
            00000400:00000200:18.0:1719348281.265138:0:915:0:(peer.c:4162:lnet_peer_discovery()) stopping
            00000400:00000200:18.0:1719348281.265140:0:915:0:(peer.c:4208:lnet_peer_discovery()) stopped
            00000400:00000200:15.0:1719348281.265143:0:1926:0:(peer.c:4265:lnet_peer_discovery_stop()) discovery stopped
            ...
            00000400:00000200:12.0:1719348281.265164:0:1926:0:(lib-md.c:42:lnet_md_unlink()) Queueing unlink of md 00000000afac742d
            ...
            00000400:00040000:12.0:1719348281.265181:0:1926:0:(lib-md.c:259:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed:
            00000400:00040000:12.0:1719348281.276144:0:1926:0:(lib-md.c:259:lnet_assert_handler_unused()) LBUG
            

            I suspect something similar is happening in test 212 as the test cases are similar. It's not clear to me what the best solution is here, but I wanted to share this in case others have an idea.

            hornc Chris Horn added a comment - - edited I tripped this assert running sanity-lnet test 304 and 310 (it trips in 310 when unloading modules after test 304 runs). While processing the peer in discovery we send a push which must be ACK'd: 00000400:00000200:18.0:1719348279.400460:0:915:0:(peer.c:4115:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100541 00000400:00000010:18.0:1719348279.400461:0:915:0:(lib-md.c:130:lnet_md_build()) slab-alloced 'lmd': 136 at 00000000afac742d. 00000400:00000200:18.0:1719348279.400503:0:915:0:(peer.c:3937:lnet_peer_send_push()) peer 2001:db8:0:f101::1@tcp 00000400:00000200:18.0:1719348279.400504:0:915:0:(peer.c:4137:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100141 rc 0 (Note: unclear why LNET_PEER_PUSH_SENT is not set in above peer state) Discovery "completes" while this push is outstanding: 00000400:00000200:18.0:1719348279.400505:0:915:0:(peer.c:4115:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100141 00000400:00000200:18.0:1719348279.400507:0:915:0:(peer.c:4137:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100111 rc 0 00000400:00000200:18.0:1719348279.400508:0:915:0:(peer.c:2430:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 2001:db8:0:f101::1@tcp 00000400:00000200:13.0:1719348279.400514:0:1237:0:(peer.c:2785:lnet_discover_peer_locked()) peer 2001:db8:0:f101::1@tcp NID 2601:8c1:c180:2000::cbdc@tcp: 0. discovery complete Thus the test continues on and now LNet is shutdown, and discovery thread stopped while push is outstanding: 00000400:00000200:18.0:1719348281.265136:0:915:0:(peer.c:4003:lnet_peer_discovery_wait_for_work()) woken: -108 00000400:00000200:18.0:1719348281.265138:0:915:0:(peer.c:4162:lnet_peer_discovery()) stopping 00000400:00000200:18.0:1719348281.265140:0:915:0:(peer.c:4208:lnet_peer_discovery()) stopped 00000400:00000200:15.0:1719348281.265143:0:1926:0:(peer.c:4265:lnet_peer_discovery_stop()) discovery stopped ... 00000400:00000200:12.0:1719348281.265164:0:1926:0:(lib-md.c:42:lnet_md_unlink()) Queueing unlink of md 00000000afac742d ... 00000400:00040000:12.0:1719348281.265181:0:1926:0:(lib-md.c:259:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed: 00000400:00040000:12.0:1719348281.276144:0:1926:0:(lib-md.c:259:lnet_assert_handler_unused()) LBUG I suspect something similar is happening in test 212 as the test cases are similar. It's not clear to me what the best solution is here, but I wanted to share this in case others have an idea.
            adilger Andreas Dilger added a comment - This is still being hit on master, despite patch 54884 landing as commit v2_15_63-8-g3ba393a5cb: https://testing.whamcloud.com/test_sets/979b507b-225a-407d-bc2c-e02925ae9d39 https://testing.whamcloud.com/test_sets/3805363b-bb13-4133-a2cf-41b6bcff7154 https://testing.whamcloud.com/test_sets/7d5b948d-8245-476f-abd1-9fbe8d1cb752 https://testing.whamcloud.com/test_sets/bb60d168-511c-4861-ac7d-5f14389d836e
            pjones Peter Jones added a comment -

            Merged for 2.16

            pjones Peter Jones added a comment - Merged for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/54884/
            Subject: LU-14810 lnet: ongoing push when discovery is stopped
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 3ba393a5cb21ff0f8bd8a09c341ee01e936321c7

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/54884/ Subject: LU-14810 lnet: ongoing push when discovery is stopped Project: fs/lustre-release Branch: master Current Patch Set: Commit: 3ba393a5cb21ff0f8bd8a09c341ee01e936321c7

            "Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54884
            Subject: LU-14810 lnet: ongoing push when discovery is stopped
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7c25fd3bc2c4e0cb241688c36acfa4d82cc52078

            gerrit Gerrit Updater added a comment - "Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54884 Subject: LU-14810 lnet: ongoing push when discovery is stopped Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 7c25fd3bc2c4e0cb241688c36acfa4d82cc52078

            Any progress on this? It still seems to be crashing on a regular basis.

            adilger Andreas Dilger added a comment - Any progress on this? It still seems to be crashing on a regular basis.

            People

              cbordage Cyril Bordage
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated: