[LU-9933] Hitting ASSERTION in lnet_peer_add_nid() Created: 30/Aug/17  Updated: 02/Oct/17  Resolved: 30/Sep/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Blocker
Reporter: James A Simmons Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-9990 MDS fails to mount due to (client.c:9... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

More than 1/2 the time when I attempt to bring up a file system I hit this assertion:

2017-08-30T14:33:56.720487-04:00 ninja33.ccs.ornl.gov kernel: LNetError: 1755:0:(peer.c:1248:lnet_peer_add_nid()) ASSERTION( nid != ((lnet_nid_t) -1) ) failed:
2017-08-30T14:33:56.720539-04:00 ninja33.ccs.ornl.gov kernel: LNetError: 1755:0:(peer.c:1248:lnet_peer_add_nid()) LBUG
2017-08-30T14:33:56.720559-04:00 ninja33.ccs.ornl.gov kernel: Pid: 1755, comm: lnet_discovery
2017-08-30T14:33:56.726095-04:00 ninja33.ccs.ornl.gov kernel: #012Call Trace:
2017-08-30T14:33:56.740849-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffffc08797de>] libcfs_call_trace+0x4e/0x60 [libcfs]
2017-08-30T14:33:56.740894-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffffc087986c>] lbug_with_loc+0x4c/0xb0 [libcfs]
2017-08-30T14:33:56.756438-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffffc097b1f4>] lnet_peer_add_nid+0x384/0x390 [lnet]
2017-08-30T14:33:56.756476-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffffc097b226>] lnet_peer_set_primary_nid+0x26/0xe0 [lnet]
2017-08-30T14:33:56.773089-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffffc097cc46>] lnet_peer_discovery+0xbf6/0xf80 [lnet]
2017-08-30T14:33:56.773129-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffff810b1910>] ? autoremove_wake_function+0x0/0x40
2017-08-30T14:33:56.789130-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffffc097c050>] ? lnet_peer_discovery+0x0/0xf80 [lnet]
2017-08-30T14:33:56.789167-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
2017-08-30T14:33:56.795447-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
2017-08-30T14:33:56.808636-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
2017-08-30T14:33:56.808675-04:00 ninja33.ccs.ornl.gov kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
2017-08-30T14:33:56.814948-04:00 ninja33.ccs.ornl.gov kernel:
2017-08-30T14:33:56.817775-04:00 ninja33.ccs.ornl.gov kernel: Kernel panic - not syncing: LBUG



 Comments   
Comment by Olaf Weber [ 31/Aug/17 ]

To get here, the ping buffer must have contained only a single NID, which should always be the loopback NID. It looks like lnet_peer_data_present() should have the following check changed

        if (pbuf->pb_info.pi_nnis > 1)
                nid = pbuf->pb_info.pi_ni[1].ns_nid;

to

        if (pbuf->pb_info.pi_nnis <= 1)
                goto out;
        nid = pbuf->pb_info.pi_ni[1].ns_nid;
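
A minimal standalone sketch of the guard proposed above, for readers without the LNet source at hand. The `mock_*` types and `mock_pick_primary_nid()` are hypothetical stand-ins rather than the real LNet structures or the landed patch. The sketch also assumes that the NID variable starts out as `(lnet_nid_t) -1`, which matches the value named in the failed assertion but is not confirmed by this ticket.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for lnet_nid_t and the ping buffer layout. */
typedef uint64_t mock_nid_t;
#define MOCK_NID_ANY ((mock_nid_t) -1)  /* the value the assertion rejects */
#define MOCK_MAX_NIS 4

struct mock_ni_status {
        mock_nid_t ns_nid;
};

struct mock_ping_info {
        uint32_t pi_nnis;                        /* number of entries in pi_ni */
        struct mock_ni_status pi_ni[MOCK_MAX_NIS];
};

/*
 * Entry 0 is taken to be the loopback NID, so the first usable NID is
 * entry 1. Return -1 without touching *nid when the buffer holds only
 * the loopback entry (or nothing), mirroring the "goto out" in the
 * proposed change; otherwise store entry 1 and return 0.
 */
static int mock_pick_primary_nid(const struct mock_ping_info *pi,
                                 mock_nid_t *nid)
{
        assert(pi != NULL && nid != NULL);

        if (pi->pi_nnis <= 1)
                return -1;

        *nid = pi->pi_ni[1].ns_nid;
        return 0;
}
```

With the original form of the check, a buffer with `pi_nnis == 1` would leave the NID at its initial value and hand it on to the caller; with the early exit, the caller never sees that value.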

Comment by Gerrit Updater [ 31/Aug/17 ]

Olaf Weber (olaf.weber@hpe.com) uploaded a new patch: https://review.whamcloud.com/28811
Subject: LU-9933 lnet: Handle ping buffer with only loopback NID
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f96a5611b5ed1d4414636970df3c20e02b997deb

Comment by Amir Shehata (Inactive) [ 22/Sep/17 ]

This patch has a +2. Should we land it, since it's a blocker?

Comment by Gerrit Updater [ 30/Sep/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28811/
Subject: LU-9933 lnet: Handle ping buffer with only loopback NID
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 002e25b9277905b97aa827dc3fb72db2f25b32f2

Comment by Peter Jones [ 30/Sep/17 ]

Landed for 2.11.

Comment by Peter Jones [ 30/Sep/17 ]

Does this affect b2_10?

Comment by Amir Shehata (Inactive) [ 02/Oct/17 ]

No. This impacts dynamic discovery only.
