interop: sanity-lnet test_226: ASSERTION( list_empty(&lpni->lpni_peer_nis) ) failed

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0
    • Lustre 2.15.6

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/cee7992f-b0ae-414f-b3b0-0a419a46c1d9

      test_226 failed with the following error:

      onyx-45vm6 crashed during sanity-lnet test_226
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master/4581 - 4.18.0-553.16.1.el8_10.x86_64
      servers: https://build.whamcloud.com/job/lustre-b2_15/94 - 4.18.0-553.5.1.el8_lustre.x86_64

      Console log on MDS:

      Lustre: DEBUG MARKER: output="$(/usr/sbin/lnetctl route show --net tcp --gateway 10.240.23.245@tcp1 2>/dev/null)";              if [[ -n "${output}" ]]; then                                           echo "Delete route to tcp via 10.240.23.245@tcp1";                              /usr/sbin/lnetctl route del --net tcp --gateway 10.240.23.245@tcp1;                     e               
      LNetError: 1348624:0:(peer.c:2227:lnet_destroy_peer_ni_locked()) ASSERTION( list_empty(&lpni->lpni_peer_nis) ) failed:
      LNetError: 1348624:0:(peer.c:2227:lnet_destroy_peer_ni_locked()) LBUG
      Pid: 1348624, comm: socknal_sd00_01 4.18.0-553.5.1.el8_lustre.x86_64 #1 SMP Fri Jun 28 18:44:24 UTC 2024
      Call Trace TBD:
      [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
      [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
      [<0>] lnet_destroy_peer_ni_locked+0x446/0x4e0 [lnet]
      [<0>] lnet_handle_find_routed_path+0x86c/0xee0 [lnet]
      [<0>] lnet_select_pathway+0xb95/0x16c0 [lnet]
      [<0>] lnet_send+0x6d/0x1e0 [lnet]
      [<0>] lnet_parse_local+0x3ef/0xde0 [lnet]
      [<0>] lnet_parse+0xd78/0x1480 [lnet]
      [<0>] ksocknal_process_receive+0x4dc/0xdb0 [ksocklnd]
      [<0>] ksocknal_scheduler+0x188/0x17c0 [ksocklnd]
      [<0>] kthread+0x134/0x150
      [<0>] ret_from_fork+0x35/0x40
      Kernel panic - not syncing: LBUG
      CPU: 1 PID: 1348624 Comm: socknal_sd00_01 Kdump: loaded Tainted: G           OE     -------- -  - 4.18.0-553.5.1.el8_lustre.x86_64 #1
      Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      Call Trace:
       dump_stack+0x41/0x60
       panic+0xe7/0x2ac
       ? ret_from_fork+0x35/0x40
       lbug_with_loc.cold.8+0x18/0x18 [libcfs]
       lnet_destroy_peer_ni_locked+0x446/0x4e0 [lnet]
       lnet_handle_find_routed_path+0x86c/0xee0 [lnet]
       lnet_select_pathway+0xb95/0x16c0 [lnet]
       ? lnet_try_match_md+0x337/0x630 [lnet]
       lnet_send+0x6d/0x1e0 [lnet]
       lnet_parse_local+0x3ef/0xde0 [lnet]
       lnet_parse+0xd78/0x1480 [lnet]
       ksocknal_process_receive+0x4dc/0xdb0 [ksocklnd]
       ksocknal_scheduler+0x188/0x17c0 [ksocklnd]
       ? finish_wait+0x80/0x80
       ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd]
       kthread+0x134/0x150
       ? set_kthread_struct+0x50/0x50
       ret_from_fork+0x35/0x40
      
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-lnet test_226 - onyx-45vm6 crashed during sanity-lnet test_226

          Activity

            [LU-18320] interop: sanity-lnet test_226: ASSERTION( list_empty(&lpni->lpni_peer_nis) ) failed
            pjones Peter Jones added a comment -

            Seems that this was merged for 2.16 after all

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56839/
            Subject: LU-18320 tests: add skip option to sanity-lnet test_226
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 33355c666f0070487b4a8f10d6b85b8e92081122

            gerrit Gerrit Updater added a comment -

            "Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56839
            Subject: LU-18320 tests: add skip option to sanity-lnet test_226
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5ed628ed028d8834bbacf268cd78c9bef219a147

            adilger Andreas Dilger added a comment -

            ssmirnov, I ask because the timelines for those two changes (patch landing on b2_15 and patch landing on master) are independent. The version check patch on master would also skip testing with other versions older than 2.15.6, which would not have the fix patch in any case.

            Note that there should similarly be a patch on b2_15 that skips this test when run with any version older than 2.15.5.1 or so, since it will not have the fix patch either.

            pjones Peter Jones added a comment -

            Sounds sensible and that patch is already lined up for inclusion in 2.15.6

            ssmirnov Serguei Smirnov added a comment -

            adilger, I think I can do what you propose, but why don't we also merge the "LU-17440 lnet: prevent errorneous decref for asym route" fix to b2_15, since it fixes the bug introduced in LU-17062, which was already merged to b2_15? https://review.whamcloud.com/#/c/fs/lustre-release/+/54906/

            adilger Andreas Dilger added a comment -

            ssmirnov, can you (or Frank) please push a patch to master to skip this test in interop unless the peer node has version >= v2_15_62-31-g2b210f3905 (which is the LU-17440 fix)? I'm not sure how sanity-lnet code runs on remote nodes (e.g. would the test check $MDS1_VERSION or $OST1_VERSION or something else?), or I would have pushed a patch myself.

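The version gate requested in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the patch that was merged: the helper names `ver_code` and `should_skip` are hypothetical, the 2.15.6 threshold is taken from the discussion rather than from the patch, and how the peer version is obtained on a remote node is left open, as the comment itself notes.

```shell
#!/bin/bash
# Hypothetical helper (not the test-framework implementation): pack up to four
# numeric version components into one integer so that versions can be
# compared arithmetically. Accepts forms such as "2.15.6", "2.15.5.1" or a
# git-describe style name like "v2_15_62-31-g2b210f3905".
ver_code() {
	local v="${1#v}"        # drop a leading "v" from tag-style names
	v="${v%%-g*}"           # drop a trailing "-g<hash>" suffix
	local IFS='._-'         # split on the separators used in both forms
	set -- $v
	echo $(( (${1:-0} << 24) | (${2:-0} << 16) | (${3:-0} << 8) | ${4:-0} ))
}

# Hypothetical helper: succeed (return 0) when the test should be skipped
# because the peer is older than the minimum version assumed to have the fix.
should_skip() {
	local peer="$1" min="$2"
	[ "$(ver_code "$peer")" -lt "$(ver_code "$min")" ]
}

# Example values: a 2.15.5 peer checked against an assumed 2.15.6 threshold.
peer_version="2.15.5"
min_version="2.15.6"
if should_skip "$peer_version" "$min_version"; then
	echo "SKIP: test_226 needs peer version >= $min_version (got $peer_version)"
else
	echo "RUN: test_226"
fi
```

One caveat with a single numeric threshold: 2.15.6 compares as older than v2_15_62-31, so a gate keyed only to the master commit would also skip a 2.15.6 peer even if the fix has been backported there. A real check would probably need to treat the maintenance branch and master separately.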
            yujian Jian Yu added a comment - Lustre 2.16.0 RC5 client with 2.15.5 server: https://testing.whamcloud.com/test_sets/abf52df8-0800-4646-be8b-959148975297

            ssmirnov Serguei Smirnov added a comment -

            It looks like the b2_15 branch has the "LU-17062 lnet: Update lnet_peer_*_decref_locked usage" commit 60cfceb8c59364f786b31ac36c2c245b9a1e495a from master but not the "LU-17440 lnet: prevent errorneous decref for asym route" commit 2b210f39059be998b80b0acc13c12451960b6, which fixes the refcount issue causing the assertion.

            People

              ssmirnov Serguei Smirnov
              maloo Maloo