Lustre / LU-17440

after move from 2.14 to 2.15: LNetError: 31941:0:(peer.c:2194:lnet_destroy_peer_ni_locked()) ASSERTION( list_empty(&lpni->lpni_peer_nis) )

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.16.0, Lustre 2.15.6
    • Affects Version/s: Lustre 2.15.4
    • Environment: TOSS 4.7-2.1
      lustre 2.15.4_1.llnl-1
      on Lustre server asp
    • Severity: 3

    Description

      We combined an OS update from TOSS 4.6-6 to TOSS 4.7-2.1 with a move from Lustre 2.14 to 2.15 (2.14.0_22.llnl-1 to lustre-2.15.4_1.llnl-1).

      A few hours later we began to see the following error, and eventually saw it on all 12 asp server nodes:

      2024-01-16 23:13:19 [39638.886090] Lustre: aspls3-OST0004: Client 8f15405e-d4cc-cf3c-7534-051e7352cf50 (at 192.168.128.24@o2ib35) reconnecting
      2024-01-16 23:13:19 [39638.896879] Lustre: Skipped 98 previous similar messages
      2024-01-16 23:13:34 [39654.404520] LNetError: 165557:0:(peer.c:2194:lnet_destroy_peer_ni_locked()) ASSERTION( list_empty(&lpni->lpni_peer_nis) ) failed:
      2024-01-16 23:13:34 [39654.416271] LNetError: 165557:0:(peer.c:2194:lnet_destroy_peer_ni_locked()) LBUG
      2024-01-16 23:13:34 [39654.423671] Pid: 165557, comm: kiblnd_sd_00_01 4.18.0-513.9.1.1toss.t4.x86_64 #1 SMP Wed Nov 29 11:04:55 PST 2023
      2024-01-16 23:13:34 [39654.433921] Call Trace TBD:
      2024-01-16 23:13:34 [39654.436731] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
      2024-01-16 23:13:34 [39654.441888] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
      2024-01-16 23:13:34 [39654.446688] [<0>] lnet_destroy_peer_ni_locked+0x44d/0x4e0 [lnet]
      2024-01-16 23:13:34 [39654.452722] [<0>] lnet_handle_find_routed_path+0x86c/0xee0 [lnet]
      2024-01-16 23:13:34 [39654.458845] [<0>] lnet_select_pathway+0xb95/0x16c0 [lnet]
      2024-01-16 23:13:34 [39654.464265] [<0>] lnet_send+0x6d/0x1e0 [lnet]
      2024-01-16 23:13:34 [39654.468646] [<0>] lnet_parse_local+0x3ed/0xdd0 [lnet]
      2024-01-16 23:13:34 [39654.473721] [<0>] lnet_parse+0xd7d/0x1490 [lnet]
      2024-01-16 23:13:34 [39654.478366] [<0>] kiblnd_handle_rx+0x30e/0x900 [ko2iblnd]
      2024-01-16 23:13:34 [39654.483782] [<0>] kiblnd_scheduler+0x104b/0x10d0 [ko2iblnd]
      2024-01-16 23:13:34 [39654.489363] [<0>] kthread+0x14c/0x170
      2024-01-16 23:13:34 [39654.493030] [<0>] ret_from_fork+0x1f/0x40
      2024-01-16 23:13:34 [39654.497050] Kernel panic - not syncing: LBUG
      2024-01-16 23:13:34 [39654.501320] CPU: 47 PID: 165557 Comm: kiblnd_sd_00_01 Kdump: loaded Tainted: P           OE  X --------- -  - 4.18.0-513.9.1.1toss.t4.x86_64 #1
      2024-01-16 23:13:34 [39654.514172] Hardware name: Supermicro SSG-229P-DN2R24264-LL013/X11DSN-TS, BIOS 3.4 11/04/2020
      2024-01-16 23:13:34 [39654.522683] Call Trace:
      2024-01-16 23:13:34 [39654.525137]  dump_stack+0x41/0x60
      2024-01-16 23:13:34 [39654.528457]  panic+0xe7/0x2ac
      2024-01-16 23:13:34 [39654.531429]  ? ret_from_fork+0x1f/0x40
      2024-01-16 23:13:34 [39654.535182]  lbug_with_loc.cold.8+0x18/0x18 [libcfs]
      2024-01-16 23:13:34 [39654.540156]  lnet_destroy_peer_ni_locked+0x44d/0x4e0 [lnet]
      2024-01-16 23:13:35 [39654.545747]  lnet_handle_find_routed_path+0x86c/0xee0 [lnet]
      2024-01-16 23:13:35 [39654.551423]  ? lnet_peer_ni_find_locked+0x14/0x30 [lnet]
      2024-01-16 23:13:35 [39654.556753]  lnet_select_pathway+0xb95/0x16c0 [lnet]
      2024-01-16 23:13:35 [39654.561735]  ? kiblnd_check_sends_locked+0x1a5/0x4a0 [ko2iblnd]
      2024-01-16 23:13:35 [39654.567656]  lnet_send+0x6d/0x1e0 [lnet]
      2024-01-16 23:13:35 [39654.571600]  lnet_parse_local+0x3ed/0xdd0 [lnet]
      2024-01-16 23:13:35 [39654.576238]  lnet_parse+0xd7d/0x1490 [lnet]
      2024-01-16 23:13:35 [39654.580438]  ? try_to_wake_up+0x1c2/0x4f0
      2024-01-16 23:13:35 [39654.584454]  kiblnd_handle_rx+0x30e/0x900 [ko2iblnd]
      2024-01-16 23:13:35 [39654.589427]  ? __wake_up_common+0x7a/0x190
      2024-01-16 23:13:35 [39654.593526]  kiblnd_scheduler+0x104b/0x10d0 [ko2iblnd]
      2024-01-16 23:13:35 [39654.598665]  ? finish_wait+0x90/0x90
      2024-01-16 23:13:35 [39654.602245]  ? kiblnd_cq_event+0x80/0x80 [ko2iblnd]
      2024-01-16 23:13:35 [39654.607125]  kthread+0x14c/0x170
      2024-01-16 23:13:35 [39654.610357]  ? set_kthread_struct+0x50/0x50
      2024-01-16 23:13:35 [39654.614544]  ret_from_fork+0x1f/0x40
      2024-01-16 23:13:36 [    0.000000] Linux version 4.18.0-513.9.1.1toss.t4.x86_64 (mockbuild@builder2-x86.buildfarm.internal) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-20) (GCC)) #1 SMP Wed Nov 29 11:04:55 PST 2023
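
When checking which server nodes hit the same LBUG, it can help to pull the failed-assertion lines out of each console log mechanically. The following is a minimal sketch, assuming console lines follow the timestamped format shown above; the names `ASSERT_RE`, `find_assertions`, and `count_by_site` are hypothetical helpers, not part of Lustre or any LLNL tooling.

```python
import re
from collections import Counter

# Assumed line shape, taken from the console excerpt above:
#   <date> <time> [<uptime>] LNetError: <pid>:<n>:(<file>:<line>:<func>()) ASSERTION( <expr> ) failed:
ASSERT_RE = re.compile(
    r"^\s*(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\[\s*(?P<uptime>\d+\.\d+)\] "
    r"LNetError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"ASSERTION\( (?P<expr>.+?) \) failed"
)


def find_assertions(lines):
    """Return one dict of captured fields per failed-assertion line."""
    hits = []
    for text in lines:
        match = ASSERT_RE.match(text)
        if match:
            hits.append(match.groupdict())
    return hits


def count_by_site(hits):
    """Count hits per (source file, function, asserted expression)."""
    return Counter((h["file"], h["func"], h["expr"]) for h in hits)


# Three lines copied from the console excerpt in this report.
SAMPLE = [
    "2024-01-16 23:13:19 [39638.896879] Lustre: Skipped 98 previous similar messages",
    "2024-01-16 23:13:34 [39654.404520] LNetError: 165557:0:(peer.c:2194:"
    "lnet_destroy_peer_ni_locked()) ASSERTION( list_empty(&lpni->lpni_peer_nis) ) failed:",
    "2024-01-16 23:13:34 [39654.416271] LNetError: 165557:0:(peer.c:2194:"
    "lnet_destroy_peer_ni_locked()) LBUG",
]

if __name__ == "__main__":
    for site, count in count_by_site(find_assertions(SAMPLE)).items():
        print(count, *site)
```

To survey several nodes, the same functions could be run over each node's console file in turn (for example with `open(path, errors="replace")`). The file layout inside the attached tarballs is not described here, so any paths would be an assumption.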

      Attachments

        1. asp-orelic-console.tar.gz (8.77 MB, Gian-Carlo Defazio)
        2. mutt-garter-2.14-no-bug.tar.gz (726 kB, Gian-Carlo Defazio)
        3. mutt-garter-2.15-revert-LU-17062-no-bug.tar.gz (775 kB, Gian-Carlo Defazio)
        4. mutt-garter-debug-2fcd4d27f.tar.gz (816 kB, Gian-Carlo Defazio)
        5. mutt-garter-debug-2routers.tar.gz (963 kB, Gian-Carlo Defazio)
        6. mutt-garter-debug-garter3_no_routes.tar.gz (451 kB, Gian-Carlo Defazio)
        7. mutt-garter-reproducer.tar.gz (1.23 MB, Gian-Carlo Defazio)

            People

              Assignee: Serguei Smirnov (ssmirnov)
              Reporter: Gian-Carlo Defazio (defazio)