Lustre / LU-14635

LBUG LNetError: 50573:0:(peer.c:282:lnet_destroy_peer_locked()) ASSERTION( list_empty(&lp->lp_peer_nets) ) failed

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.6
    • Labels: None
    • Environment: CentOS 7.9, Lustre 2.12.6 servers, clients: 2.12.6, 2.13 and 2.14
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      We have hit this OSS LBUG three times recently with Lustre 2.12.6:

      [330956.154500] LNetError: 50573:0:(peer.c:282:lnet_destroy_peer_locked()) ASSERTION( list_empty(&lp->lp_peer_nets) ) failed: 
      [330956.166945] LNetError: 50573:0:(peer.c:282:lnet_destroy_peer_locked()) LBUG
      [330956.174812] Pid: 50573, comm: lnet_discovery 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
      [330956.186861] Call Trace:
      [330956.189703]  [<ffffffffc0a2f7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [330956.197114]  [<ffffffffc0a2f87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [330956.204125]  [<ffffffffc0d5afca>] lnet_destroy_peer_locked+0x24a/0x350 [lnet]
      [330956.212254]  [<ffffffffc0d5b605>] lnet_peer_discovery_complete+0x2a5/0x350 [lnet]
      [330956.220715]  [<ffffffffc0d60340>] lnet_peer_discovery+0x6c0/0x1140 [lnet]
      [330956.228410]  [<ffffffff9b0c5c21>] kthread+0xd1/0xe0
      [330956.233965]  [<ffffffff9b794ddd>] ret_from_fork_nospec_begin+0x7/0x21
      [330956.241271]  [<ffffffffffffffff>] 0xffffffffffffffff
      [330956.246936] Kernel panic - not syncing: LBUG
      [330956.251797] CPU: 22 PID: 50573 Comm: lnet_discovery Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1
      [330956.266548] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.6.0 10/26/2017
      [330956.274996] Call Trace:
      [330956.277828]  [<ffffffff9b781400>] dump_stack+0x19/0x1b
      [330956.283659]  [<ffffffff9b77a958>] panic+0xe8/0x21f
      [330956.289118]  [<ffffffffc0a2f8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
      [330956.296131]  [<ffffffffc0d5afca>] lnet_destroy_peer_locked+0x24a/0x350 [lnet]
      [330956.304198]  [<ffffffffc0d5b605>] lnet_peer_discovery_complete+0x2a5/0x350 [lnet]
      [330956.312653]  [<ffffffffc0d60340>] lnet_peer_discovery+0x6c0/0x1140 [lnet]
      [330956.320328]  [<ffffffff9b0c6d10>] ? wake_up_atomic_t+0x30/0x30
      [330956.326940]  [<ffffffffc0d5fc80>] ? lnet_peer_merge_data+0xe00/0xe00 [lnet]
      [330956.334805]  [<ffffffff9b0c5c21>] kthread+0xd1/0xe0
      [330956.340344]  [<ffffffff9b0c5b50>] ? insert_kthread_work+0x40/0x40
      [330956.347242]  [<ffffffff9b794ddd>] ret_from_fork_nospec_begin+0x7/0x21
      [330956.354526]  [<ffffffff9b0c5b50>] ? insert_kthread_work+0x40/0x40
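
To compare the three occurrences, it can help to reduce each dmesg capture to the failed assertion plus the frames of the first call trace. The following is only a sketch: the function name `summarize_lbug` and both regular expressions are assumptions derived from the log layout shown above, not part of any Lustre tooling, and other kernels or Lustre versions may format these lines differently.

```python
import re

# Assumed layout: "LNetError: PID:N:(file.c:LINE:func()) ASSERTION( expr ) failed"
ASSERT_RE = re.compile(
    r"LNetError: (?P<pid>\d+):\d+:\((?P<file>[\w.]+):(?P<line>\d+):"
    r"(?P<func>\w+)\(\)\) ASSERTION\( (?P<expr>.+?) \) failed"
)
# Assumed layout: "[<address>] symbol+0xoff/0xsize"; "? " marks an unreliable frame.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] (?P<unreliable>\? )?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_lbug(dmesg_text):
    """Return the first failed assertion and the frames of the trace after it.

    Returns None when no assertion line is found.
    """
    summary = None
    in_trace = False
    for line in dmesg_text.splitlines():
        if summary is None:
            match = ASSERT_RE.search(line)
            if match:
                summary = match.groupdict()
                summary["frames"] = []
            continue
        if not in_trace:
            # Skip the LBUG and Pid lines until the first trace starts.
            in_trace = "Call Trace:" in line
            continue
        frame = FRAME_RE.search(line)
        if frame is None:
            break  # first line that is not a symbol frame ends the trace
        if not frame.group("unreliable"):
            summary["frames"].append(frame.group("sym"))
    return summary
```

Running it over the contents of each vmcore-dmesg file and comparing the returned `func`, `line`, and `frames` values would show whether all three crashes followed the same path.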
      

      It looks like a duplicate of LU-13652, which is supposedly fixed in 2.12.6. However, since we are running Lustre 2.12.6 on all servers on this system (Oak), I'm opening a new issue tagged 2.12.6.

      Attaching "foreach bt" from the crash dump (available upon request) as oak-io2-s1_foreachbt.txt
      Attaching vmcore-dmesg.txt as oak-io2-s1_vmcore-dmesg_2021-04-21-17-33-41.txt

       
      LNet discovery is supposed to be enabled everywhere:

      [root@oak-io2-s1 127.0.0.1-2021-04-21-17:33:41]# lnetctl global show
      global:
          numa_range: 0
          max_intf: 200
          discovery: 1
          drop_asym_route: 0
          retry_count: 0
          transaction_timeout: 50
          health_sensitivity: 0
          recovery_interval: 1
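
To confirm the discovery setting on many nodes rather than by eye, the output above can be checked with a small script. This is a minimal sketch: `parse_lnet_global` and `discovery_enabled` are hypothetical helper names, and the parser assumes the flat `key: value` layout under `global:` shown above; it is not a general YAML parser.

```python
def parse_lnet_global(text):
    """Parse the flat 'global:' block printed by `lnetctl global show`.

    Lines before 'global:' (for example a shell prompt) are ignored.
    Integer values are converted to int; anything else stays a string.
    """
    settings = {}
    in_global = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "global:":
            in_global = True
            continue
        if not in_global or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        settings[key.strip()] = int(value) if value.lstrip("-").isdigit() else value
    return settings


def discovery_enabled(text):
    """True when the captured output reports 'discovery: 1'."""
    return parse_lnet_global(text).get("discovery") == 1
```

The captured text could come from running `lnetctl global show` on each server and client; any node for which `discovery_enabled` returns False would be worth a closer look.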
      

       

            People

              ssmirnov Serguei Smirnov
              sthiell Stephane Thiell