[LU-9933] Hitting ASSERTION in lnet_peer_add_nid()

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.11.0
    • Affects Version/s: Lustre 2.11.0
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      More than half the time when I attempt to bring up a file system, I hit this assertion:

      2017-08-30T14:33:56.720487-04:00 ninja33.ccs.ornl.gov kernel: LNetError: 1755:0:(peer.c:1248:lnet_peer_add_nid()) ASSERTION( nid != ((lnet_nid_t) -1) ) failed:
      2017-08-30T14:33:56.720539-04:00 ninja33.ccs.ornl.gov kernel: LNetError: 1755:0:(peer.c:1248:lnet_peer_add_nid()) LBUG
      2017-08-30T14:33:56.720559-04:00 ninja33.ccs.ornl.gov kernel: Pid: 1755, comm: lnet_discovery
      2017-08-30T14:33:56.726095-04:00 ninja33.ccs.ornl.gov kernel: #012Call Trace:
      2017-08-30T14:33:56.740849-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffffc08797de>] libcfs_call_trace+0x4e/0x60 [libcfs]
      2017-08-30T14:33:56.740894-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffffc087986c>] lbug_with_loc+0x4c/0xb0 [libcfs]
      2017-08-30T14:33:56.756438-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffffc097b1f4>] lnet_peer_add_nid+0x384/0x390 [lnet]
      2017-08-30T14:33:56.756476-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffffc097b226>] lnet_peer_set_primary_nid+0x26/0xe0 [lnet]
      2017-08-30T14:33:56.773089-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffffc097cc46>] lnet_peer_discovery+0xbf6/0xf80 [lnet]
      2017-08-30T14:33:56.773129-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffff810b1910>] ? autoremove_wake_function+0x0/0x40
      2017-08-30T14:33:56.789130-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffffc097c050>] ? lnet_peer_discovery+0x0/0xf80 [lnet]
      2017-08-30T14:33:56.789167-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
      2017-08-30T14:33:56.795447-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      2017-08-30T14:33:56.808636-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
      2017-08-30T14:33:56.808675-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      2017-08-30T14:33:56.814948-04:00 ninja33.ccs.ornl.gov kernel:
      2017-08-30T14:33:56.817775-04:00 ninja33.ccs.ornl.gov kernel: Kernel panic - not syncing: LBUG

          Activity

            ashehata Amir Shehata (Inactive) added a comment -

            No. This impacts dynamic discovery only.

            pjones Peter Jones added a comment -

            Does this affect b2_10?

            pjones Peter Jones added a comment -

            Landed for 2.11.


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28811/
            Subject: LU-9933 lnet: Handle ping buffer with only loopback NID
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 002e25b9277905b97aa827dc3fb72db2f25b32f2

            ashehata Amir Shehata (Inactive) added a comment -

            This patch has +2. Should we land it, since it's a blocker?

            gerrit Gerrit Updater added a comment -

            Olaf Weber (olaf.weber@hpe.com) uploaded a new patch: https://review.whamcloud.com/28811
            Subject: LU-9933 lnet: Handle ping buffer with only loopback NID
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f96a5611b5ed1d4414636970df3c20e02b997deb

            olaf Olaf Weber (Inactive) added a comment -

            To get here, the ping buffer must have contained only a single NID, which should always be the loopback NID. It looks like lnet_peer_data_present() should have the following check changed

                    if (pbuf->pb_info.pi_nnis > 1)
                            nid = pbuf->pb_info.pi_ni[1].ns_nid;

            to

                    if (pbuf->pb_info.pi_nnis <= 1)
                            goto out;
                    nid = pbuf->pb_info.pi_ni[1].ns_nid;
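
            The guard proposed in the comment above can be illustrated with a small stand-alone sketch. This is not the LNet source: the `sketch_` structures and functions are hypothetical, simplified stand-ins, and the premise that the NID is still the "any" value `((lnet_nid_t) -1)` when the buffer holds only the loopback entry is inferred from the assertion text in the log. The idea is only to show that returning early when `pi_nnis <= 1` keeps the asserting callee from ever seeing that value.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins; the real LNet definitions live in the Lustre tree. */
typedef uint64_t lnet_nid_t;
#define LNET_NID_ANY ((lnet_nid_t) -1)

#define SKETCH_MAX_NIS 4	/* hypothetical fixed size for this sketch */

struct sketch_ni_status {
	lnet_nid_t ns_nid;
};

struct sketch_ping_info {
	uint32_t pi_nnis;	/* number of entries; entry 0 is the loopback NID */
	struct sketch_ni_status pi_ni[SKETCH_MAX_NIS];
};

/* Stand-in for the callee that carries the assertion seen in the trace. */
void sketch_set_primary_nid(lnet_nid_t *primary, lnet_nid_t nid)
{
	assert(nid != LNET_NID_ANY);
	*primary = nid;
}

/*
 * Guarded variant: a buffer with only the loopback entry is skipped
 * instead of passing an unset NID to the asserting callee.
 * Returns 0 if a primary NID was set, -1 if the buffer was skipped.
 */
int sketch_data_present(const struct sketch_ping_info *pi, lnet_nid_t *primary)
{
	if (pi->pi_nnis <= 1)
		return -1;	/* corresponds to the "goto out" in the comment */

	sketch_set_primary_nid(primary, pi->pi_ni[1].ns_nid);
	return 0;
}
```

            With a one-entry buffer, `sketch_data_present()` returns -1 and leaves the caller's primary NID untouched; with two or more entries it takes the NID from index 1, as in the proposed change.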

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 6
