[LU-17664] Regression in 2.15.4 backport of LU-14668 lnet: add 'lock_prim_nid" lnet module parameter - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.15.5
Affects Version/s: None
Labels:
None

Severity:
3

Description

A regression was introduced in the 2.15.4 commit:

commit 6cfc8e55a2e77c9c91b81a8842e2cbd886025298
Author: Serguei Smirnov <ssmirnov@whamcloud.com>
Date:   Tue Feb 28 15:02:20 2023 -0800

    LU-14668 lnet: add 'lock_prim_nid" lnet module parameter

which is a backport of

commit fc7a0d6013b46ebc17cdfdccc04a5d1d92c6af24
Author: Serguei Smirnov <ssmirnov@whamcloud.com>
Date:   Tue Feb 28 15:02:20 2023 -0800

    LU-14668 lnet: add 'lock_prim_nid" lnet module parameter

This backport was not done correctly. Instead of returning the primary NID assigned to the peer object, it will always return whatever NID was passed to LNetPrimaryNID() as an argument.
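The difference between the intended and the regressed behaviour can be modelled in a few lines. This is a minimal sketch in Python, not the LNet C code: the `Peer` class, the lookup table, and both function names are hypothetical and only illustrate the description above. The NIDs are the ones from the test case in this ticket.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """Hypothetical stand-in for an LNet peer object."""
    primary_nid: str
    nids: list = field(default_factory=list)


def build_peer_table(peers):
    """Index every NID of every peer so a lookup by any NID finds the peer."""
    table = {}
    for peer in peers:
        for nid in peer.nids:
            table[nid] = peer
    return table


def primary_nid_intended(table, nid):
    """Intended behaviour: return the primary NID stored on the peer object."""
    peer = table.get(nid)
    return peer.primary_nid if peer is not None else nid


def primary_nid_regressed(table, nid):
    """Behaviour described in this ticket: the argument comes back unchanged."""
    return nid


# 2203@kfi4 is the real primary NID of merced6, 2267@kfi4 its secondary.
table = build_peer_table([Peer("2203@kfi4", ["2203@kfi4", "2267@kfi4"])])
print(primary_nid_intended(table, "2267@kfi4"))   # 2203@kfi4
print(primary_nid_regressed(table, "2267@kfi4"))  # 2267@kfi4
```

In this model the two functions agree only when the caller already passes the primary NID, which is why the problem shows up for requests arriving from a secondary NID.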

Test case:

The filesystem is started (50 MDTs, 240 OSTs) and idle.
A subset of OSTs is stopped and then remounted (umount targets, unload/reload net, remount targets).
A subset of these remounted OSTs ends up in a connection loop with a corrupted peer entry for one or more of the MDSes.

In this case, MDS merced6 has:

merced6:~ # lctl list_nids
2203@kfi4
2267@kfi4
merced6:~ #

i.e.
Primary NID - 2203@kfi4
Secondary NID - 2267@kfi4

OSS merced201 ends up with a peer entry in which 2267@kfi4 is the primary NID:

merced201:~ # lnetctl peer show --nid 2267@kfi4
peer:
    - primary nid: 2267@kfi4
      Multi-Rail: True
      peer ni:
        - nid: 2203@kfi4
          state: NA
        - nid: 2267@kfi4
          state: NA
merced201:~ #

i.e.
Primary NID - 2267@kfi4
Secondary NID - 2203@kfi4
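The comparison made by hand above could be scripted. The following is a rough sketch that works on captured command output as plain text. It assumes, as this ticket does, that the first NID in the server's NID listing is its primary NID. The helper names are hypothetical, and the regular expressions only cover the output shapes shown here.

```python
import re

# Output captured from the ticket (command prompts omitted).
SERVER_NIDS = """\
2203@kfi4
2267@kfi4
"""

CLIENT_PEER = """\
peer:
    - primary nid: 2267@kfi4
      Multi-Rail: True
      peer ni:
        - nid: 2203@kfi4
          state: NA
        - nid: 2267@kfi4
          state: NA
"""


def local_primary_nid(list_nids_output):
    """Return the first line that looks like a NID (assumed to be the primary)."""
    for line in list_nids_output.splitlines():
        line = line.strip()
        if re.fullmatch(r"\S+@\S+", line):
            return line
    return None


def peer_primary_nid(peer_show_output):
    """Return the value of the first 'primary nid:' field, if any."""
    match = re.search(r"primary nid:\s*(\S+)", peer_show_output)
    return match.group(1) if match else None


def primaries_agree(list_nids_output, peer_show_output):
    """True when the remote peer entry names the server's own primary NID."""
    local = local_primary_nid(list_nids_output)
    return local is not None and local == peer_primary_nid(peer_show_output)


print(primaries_agree(SERVER_NIDS, CLIENT_PEER))  # False
```

A `False` result corresponds to the corrupted peer entry seen on merced201.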

merced201 sends MDS_CONNECT to merced6 (MDT0005), but the reply buffer is set up using the actual primary NID, 2203@kfi4:

00000100:00100000:30.0:1711131512.571627:0:274601:0:(client.c:1733:ptlrpc_send_new_req()) Sending RPC req@0000000085d9184b pname:cluuid:pid:xid:nid:opc:job ptlrpcd_rcv:lustre4-MDT0005-lwp-OST0064_UUID:274601:1794251330431424:2203@kfi4:38:

When the reply arrives from 2267@kfi4, it is dropped because of the NID mismatch:

00000400:00000100:20.0:1711131512.571851:0:906:0:(lib-move.c:4249:lnet_parse_put()) Dropping PUT from 12345-2267@kfi4 portal 10 match 1794251330431424 offset 0 length 416: 4
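Why the reply is dropped can be illustrated with a simplified matching model. This is a sketch only: real LNet matching also involves the portal, PID, and ignore bits, and `MatchEntry` and `accept_put` are hypothetical names rather than LNet APIs. The match bits are taken from the two log lines above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchEntry:
    """Hypothetical reply buffer posted for one expected sender."""
    source_nid: str
    match_bits: int


def accept_put(entries, source_nid, match_bits):
    """Accept a PUT only if an entry matches both the match bits and the sender."""
    return any(
        entry.match_bits == match_bits and entry.source_nid == source_nid
        for entry in entries
    )


# Reply buffer posted for the real primary NID of merced6.
posted = [MatchEntry("2203@kfi4", 1794251330431424)]

# The reply is sent from the secondary NID, so nothing matches.
print(accept_put(posted, "2267@kfi4", 1794251330431424))  # False
# A reply from the NID the buffer was posted for would be accepted.
print(accept_put(posted, "2203@kfi4", 1794251330431424))  # True
```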

How does it get into this state?

While the OST is being started, an incoming OST_CONNECT request from merced6's secondary NID (2267@kfi4) creates a peer with 2267@kfi4 as its locked primary NID.

target_handle_connect()->ptlrpc_connection_get()->LNetPrimaryNID(2267@kfi4):

00000100:00000040:6.0:1711131464.201240:0:274892:0:(lustre_net.h:2402:ptlrpc_rqphase_move()) @@@ move request phase from New to Interpret  req@00000000e34ec254 x1794248965703168/t0(0) o8-><?>@<unknown>:0/0 lens 520/0 e 0 to 0 dl 1711131564 ref 1 fl New:/0/ffffffff rc 0/-1 job:''
00000100:00100000:6.0:1711131464.201243:0:274892:0:(service.c:2309:ptlrpc_server_handle_request()) Handling RPC req@00000000e34ec254 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost01_095:0+-99:28449:x1794248965703168:12345-2267@kfi4:8:
00000100:00000200:6.0:1711131464.201244:0:274892:0:(service.c:2314:ptlrpc_server_handle_request()) got req 1794248965703168
00010000:02020000:6.0:1711131464.201287:0:274892:0:(ldlm_lib.c:1124:target_handle_connect()) 137-5: lustre4-OST0064_UUID: not available for connect from 2267@kfi4 (no target). If you are running an HA pair check that the target is mounted on the other server.
00000400:00000200:6.0:1711131464.223959:0:274892:0:(api-ni.c:1551:lnet_nid4_cpt_hash()) Match nid 2267@kfi4 to cpt 1
00000400:00000200:6.0:1711131464.223961:0:274892:0:(peer.c:2441:lnet_peer_queue_for_discovery()) Queue peer 2267@kfi4: 0
00000400:00000200:6.0:1711131464.223962:0:274892:0:(peer.c:2765:lnet_discover_peer_locked()) Discovery attempt # 1
00000400:00000200:6.0:1711131464.223962:0:274892:0:(peer.c:2805:lnet_discover_peer_locked()) non-blocking discovery
00000400:00000200:6.0:1711131464.223963:0:274892:0:(peer.c:2814:lnet_discover_peer_locked()) peer 2267@kfi4 NID 2267@kfi4: 0. pending discovery
00000400:00000200:6.0:1711131464.223964:0:274892:0:(api-ni.c:1551:lnet_nid4_cpt_hash()) Match nid 2267@kfi4 to cpt 1
00000400:00000200:6.0:1711131464.223965:0:274892:0:(peer.c:1553:LNetPrimaryNID()) NID 2267@kfi4 primary NID 2267@kfi4 rc 0
^ A peer with 2267@kfi4 as its locked primary NID is created

When the config log is processed, subsequent calls to LNetPrimaryNID() do not always return 2267@kfi4 as expected:

[hornc@s-lmo-kalina restart-201-216]$ grep LNetPrimaryNID merced201-dk.log | grep -e 2203@kfi -e 2267@kfi
00000400:00000200:6.0:1711131464.223965:0:274892:0:(peer.c:1553:LNetPrimaryNID()) NID 2267@kfi4 primary NID 2267@kfi4 rc 0
00000400:00000200:27.0:1711131512.570599:0:277716:0:(peer.c:1553:LNetPrimaryNID()) NID 2267@kfi4 primary NID 2203@kfi4 rc 0
00000400:00000200:27.0:1711131512.570670:0:277716:0:(peer.c:1553:LNetPrimaryNID()) NID 2267@kfi4 primary NID 2203@kfi4 rc 0
00000400:00000200:27.0:1711131512.571671:0:277716:0:(peer.c:1553:LNetPrimaryNID()) NID 2267@kfi4 primary NID 2203@kfi4 rc 0
00000400:00000200:22.0:1711131560.393558:0:275275:0:(peer.c:1553:LNetPrimaryNID()) NID 2267@kfi4 primary NID 2267@kfi4 rc 0
[hornc@s-lmo-kalina restart-201-216]$
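The same check as the grep above can be automated for larger debug logs. The sketch below groups `LNetPrimaryNID()` trace lines by the queried NID and reports any NID that was resolved to more than one primary. The regular expression is written against the line format shown in this ticket and is an assumption for any other log format. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches the tail of a trace line such as:
#   (peer.c:1553:LNetPrimaryNID()) NID 2267@kfi4 primary NID 2203@kfi4 rc 0
PRIMARY_RE = re.compile(
    r"LNetPrimaryNID\(\)\)\s+NID (\S+) primary NID (\S+) rc (-?\d+)"
)


def primaries_by_nid(lines):
    """Map each queried NID to the distinct primaries returned, in log order."""
    seen = defaultdict(list)
    for line in lines:
        match = PRIMARY_RE.search(line)
        if match:
            nid, primary, _rc = match.groups()
            if primary not in seen[nid]:
                seen[nid].append(primary)
    return dict(seen)


def inconsistent_nids(lines):
    """Keep only the NIDs that were resolved to more than one primary."""
    return {
        nid: primaries
        for nid, primaries in primaries_by_nid(lines).items()
        if len(primaries) > 1
    }


# Three of the lines from merced201-dk.log quoted above.
sample = [
    "00000400:00000200:6.0:1711131464.223965:0:274892:0:(peer.c:1553:LNetPrimaryNID()) NID 2267@kfi4 primary NID 2267@kfi4 rc 0",
    "00000400:00000200:27.0:1711131512.570599:0:277716:0:(peer.c:1553:LNetPrimaryNID()) NID 2267@kfi4 primary NID 2203@kfi4 rc 0",
    "00000400:00000200:22.0:1711131560.393558:0:275275:0:(peer.c:1553:LNetPrimaryNID()) NID 2267@kfi4 primary NID 2267@kfi4 rc 0",
]
print(inconsistent_nids(sample))
# {'2267@kfi4': ['2267@kfi4', '2203@kfi4']}
```

To run it against a full log, pass an open file object instead of the sample list, since the functions only iterate over lines.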

Issue Links

is related to

LU-14668 LNet: do discovery in the background

Resolved

Activity

Peter Jones added a comment - 17/Apr/24 1:05 PM

Merged for 2.15.5

Gerrit Updater added a comment - 17/Apr/24 5:24 AM

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/54537/
Subject: LU-17664 lnet: LNetPrimaryNID returns wrong NID
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 4937c3ccca8a3a3f8d5fdd2e5007d65773fcadc5

Gerrit Updater added a comment - 22/Mar/24 9:07 PM

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54537
Subject: LU-17664 lnet: LNetPrimaryNID returns wrong NID
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: b80f1398596ab2173819d45475b19d37ebb5559c

People

Assignee: Chris Horn

Reporter: Chris Horn

Votes: 0

Watchers: 6

Dates

Created: 22/Mar/24 8:57 PM

Updated: 24/May/24 12:08 PM

Resolved: 17/Apr/24 1:05 PM