Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14810

sanity-lnet test_212: lnet_assert_handler_unused() ASSERTION(md->md_handler != handler) failed

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.17.0
    • Lustre 2.16.0
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/84ed5e32-bda8-4880-a254-dbd96cb6a478

      test_212 failed with the following error:

      trevis-79vm3 crashed during sanity-lnet test_212
      
      LNetError: 944562:0:(lib-md.c:288:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed: 
      Pid: 944562, comm: lnet_discovery 4.18.0-240.22.1.el8_3.aarch64 #1 SMP Thu Apr 8 19:01:45 UTC 2021
      Call Trace:
       libcfs_call_trace+0xb8/0x118 [libcfs]
       lbug_with_loc+0x60/0xa0 [libcfs]
       lnet_assert_handler_unused+0xb8/0xe0 [lnet]
       lnet_peer_discovery+0x16c4/0x1cb0 [lnet]
       kthread+0x130/0x138
      

      This test was recently landed via patch https://review.whamcloud.com/43418 "LU-14627 lnet: Ensure ref taken when queueing for discovery"

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-lnet test_212 - trevis-79vm3 crashed during sanity-lnet test_212

      Attachments

        Issue Links

          Activity

            [LU-14810] sanity-lnet test_212: lnet_assert_handler_unused() ASSERTION(md->md_handler != handler) failed
            hornc Chris Horn added a comment -

            Added dk.log.fmt which is the full debug log extracted from the crash dump

            hornc Chris Horn added a comment - Added dk.log.fmt which is the full debug log extracted from the crash dump
            hornc Chris Horn added a comment - - edited

            I tripped this assert running sanity-lnet test 304 and 310 (it trips in 310 when unloading modules after test 304 runs).

            While processing the peer in discovery we send a push which must be ACK'd:

            00000400:00000200:18.0:1719348279.400460:0:915:0:(peer.c:4115:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100541
            00000400:00000010:18.0:1719348279.400461:0:915:0:(lib-md.c:130:lnet_md_build()) slab-alloced 'lmd': 136 at 00000000afac742d.
            00000400:00000200:18.0:1719348279.400503:0:915:0:(peer.c:3937:lnet_peer_send_push()) peer 2001:db8:0:f101::1@tcp
            00000400:00000200:18.0:1719348279.400504:0:915:0:(peer.c:4137:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100141 rc 0
            

            (Note: unclear why LNET_PEER_PUSH_SENT is not set in above peer state)

            Discovery "completes" while this push is outstanding:

            00000400:00000200:18.0:1719348279.400505:0:915:0:(peer.c:4115:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100141
            00000400:00000200:18.0:1719348279.400507:0:915:0:(peer.c:4137:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100111 rc 0
            00000400:00000200:18.0:1719348279.400508:0:915:0:(peer.c:2430:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 2001:db8:0:f101::1@tcp
            00000400:00000200:13.0:1719348279.400514:0:1237:0:(peer.c:2785:lnet_discover_peer_locked()) peer 2001:db8:0:f101::1@tcp NID 2601:8c1:c180:2000::cbdc@tcp: 0. discovery complete
            

            Thus the test continues on and now LNet is shutdown, and discovery thread stopped while push is outstanding:

            00000400:00000200:18.0:1719348281.265136:0:915:0:(peer.c:4003:lnet_peer_discovery_wait_for_work()) woken: -108
            00000400:00000200:18.0:1719348281.265138:0:915:0:(peer.c:4162:lnet_peer_discovery()) stopping
            00000400:00000200:18.0:1719348281.265140:0:915:0:(peer.c:4208:lnet_peer_discovery()) stopped
            00000400:00000200:15.0:1719348281.265143:0:1926:0:(peer.c:4265:lnet_peer_discovery_stop()) discovery stopped
            ...
            00000400:00000200:12.0:1719348281.265164:0:1926:0:(lib-md.c:42:lnet_md_unlink()) Queueing unlink of md 00000000afac742d
            ...
            00000400:00040000:12.0:1719348281.265181:0:1926:0:(lib-md.c:259:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed:
            00000400:00040000:12.0:1719348281.276144:0:1926:0:(lib-md.c:259:lnet_assert_handler_unused()) LBUG
            

            I suspect something similar is happening in test 212 as the test cases are similar. It's not clear to me what the best solution is here, but I wanted to share this in case others have an idea.

            hornc Chris Horn added a comment - - edited I tripped this assert running sanity-lnet test 304 and 310 (it trips in 310 when unloading modules after test 304 runs). While processing the peer in discovery we send a push which must be ACK'd: 00000400:00000200:18.0:1719348279.400460:0:915:0:(peer.c:4115:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100541 00000400:00000010:18.0:1719348279.400461:0:915:0:(lib-md.c:130:lnet_md_build()) slab-alloced 'lmd': 136 at 00000000afac742d. 00000400:00000200:18.0:1719348279.400503:0:915:0:(peer.c:3937:lnet_peer_send_push()) peer 2001:db8:0:f101::1@tcp 00000400:00000200:18.0:1719348279.400504:0:915:0:(peer.c:4137:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100141 rc 0 (Note: unclear why LNET_PEER_PUSH_SENT is not set in above peer state) Discovery "completes" while this push is outstanding: 00000400:00000200:18.0:1719348279.400505:0:915:0:(peer.c:4115:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100141 00000400:00000200:18.0:1719348279.400507:0:915:0:(peer.c:4137:lnet_peer_discovery()) peer 2001:db8:0:f101::1@tcp(0000000044443381) state 0x100111 rc 0 00000400:00000200:18.0:1719348279.400508:0:915:0:(peer.c:2430:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 2001:db8:0:f101::1@tcp 00000400:00000200:13.0:1719348279.400514:0:1237:0:(peer.c:2785:lnet_discover_peer_locked()) peer 2001:db8:0:f101::1@tcp NID 2601:8c1:c180:2000::cbdc@tcp: 0. discovery complete Thus the test continues on and now LNet is shutdown, and discovery thread stopped while push is outstanding: 00000400:00000200:18.0:1719348281.265136:0:915:0:(peer.c:4003:lnet_peer_discovery_wait_for_work()) woken: -108 00000400:00000200:18.0:1719348281.265138:0:915:0:(peer.c:4162:lnet_peer_discovery()) stopping 00000400:00000200:18.0:1719348281.265140:0:915:0:(peer.c:4208:lnet_peer_discovery()) stopped 00000400:00000200:15.0:1719348281.265143:0:1926:0:(peer.c:4265:lnet_peer_discovery_stop()) discovery stopped ... 00000400:00000200:12.0:1719348281.265164:0:1926:0:(lib-md.c:42:lnet_md_unlink()) Queueing unlink of md 00000000afac742d ... 00000400:00040000:12.0:1719348281.265181:0:1926:0:(lib-md.c:259:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed: 00000400:00040000:12.0:1719348281.276144:0:1926:0:(lib-md.c:259:lnet_assert_handler_unused()) LBUG I suspect something similar is happening in test 212 as the test cases are similar. It's not clear to me what the best solution is here, but I wanted to share this in case others have an idea.
            adilger Andreas Dilger added a comment - This is still being hit on master, despite patch 54884 landing as commit v2_15_63-8-g3ba393a5cb: https://testing.whamcloud.com/test_sets/979b507b-225a-407d-bc2c-e02925ae9d39 https://testing.whamcloud.com/test_sets/3805363b-bb13-4133-a2cf-41b6bcff7154 https://testing.whamcloud.com/test_sets/7d5b948d-8245-476f-abd1-9fbe8d1cb752 https://testing.whamcloud.com/test_sets/bb60d168-511c-4861-ac7d-5f14389d836e
            pjones Peter Jones added a comment -

            Merged for 2.16

            pjones Peter Jones added a comment - Merged for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/54884/
            Subject: LU-14810 lnet: ongoing push when discovery is stopped
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 3ba393a5cb21ff0f8bd8a09c341ee01e936321c7

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/54884/ Subject: LU-14810 lnet: ongoing push when discovery is stopped Project: fs/lustre-release Branch: master Current Patch Set: Commit: 3ba393a5cb21ff0f8bd8a09c341ee01e936321c7

            "Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54884
            Subject: LU-14810 lnet: ongoing push when discovery is stopped
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7c25fd3bc2c4e0cb241688c36acfa4d82cc52078

            gerrit Gerrit Updater added a comment - "Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54884 Subject: LU-14810 lnet: ongoing push when discovery is stopped Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 7c25fd3bc2c4e0cb241688c36acfa4d82cc52078

            Any progress on this? It still seems to be crashing on a regular basis.

            adilger Andreas Dilger added a comment - Any progress on this? It still seems to be crashing on a regular basis.
            adilger Andreas Dilger added a comment - - edited 6x crashes in the past week: https://testing.whamcloud.com/search?status%5B%5D=FAIL&status%5B%5D=TIMEOUT&status%5B%5D=CRASH&test_set_script_id=a2b1c4b2-b449-11e9-b88c-52540065bddc&sub_test_script_id=058541e7-0f00-4f82-a6e5-8316ea99a160&start_date=2024-04-12&end_date=2024-04-13&source=sub_tests#redirect
            adilger Andreas Dilger added a comment - +1 on latest master in sanity-let test_205: https://testing.whamcloud.com/test_sets/3cf7e1b8-0fa3-43e0-a2af-06a1c4a538eb
            adilger Andreas Dilger added a comment - Still seeing several crashes per week on master with 2.15.60.58, when patch 53356 was landed as 2.15.59.81: https://testing.whamcloud.com/search?horizon=518400&status%5B%5D=TIMEOUT&status%5B%5D=CRASH&test_set_script_id=a2b1c4b2-b449-11e9-b88c-52540065bddc&sub_test_script_id=058541e7-0f00-4f82-a6e5-8316ea99a160&source=sub_tests#redirect
            adilger Andreas Dilger added a comment - +1 on master: https://testing.whamcloud.com/sub_tests/121f699a-8b7c-40dc-be4e-99f534b874b4

            People

              cbordage Cyril Bordage
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: