Details
- Bug
- Resolution: Fixed
- Major
Description
A regression was introduced in the 2.15.4 commit:
commit 6cfc8e55a2e77c9c91b81a8842e2cbd886025298
Author: Serguei Smirnov <ssmirnov@whamcloud.com>
Date: Tue Feb 28 15:02:20 2023 -0800
    LU-14668 lnet: add 'lock_prim_nid" lnet module parameter
This is a backport of:
commit fc7a0d6013b46ebc17cdfdccc04a5d1d92c6af24
Author: Serguei Smirnov <ssmirnov@whamcloud.com>
Date: Tue Feb 28 15:02:20 2023 -0800
    LU-14668 lnet: add 'lock_prim_nid" lnet module parameter
This backport was not done correctly. Instead of returning the primary NID assigned to the peer object, it will always return whatever NID was passed to LNetPrimaryNID() as an argument.
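The difference between the intended and the regressed behavior can be illustrated with a toy model. This is not the LNet implementation: `PeerTable`, `primary_nid_intended()` and `primary_nid_regressed()` are hypothetical names made up for this sketch, and the real LNetPrimaryNID() also deals with discovery and locking, which are ignored here.

```python
# Toy model only; none of these names exist in Lustre/LNet.
class PeerTable:
    """Maps every known NID of a peer to that peer's primary NID."""

    def __init__(self):
        self._primary_by_nid = {}

    def add_peer(self, primary_nid, *other_nids):
        # Register the primary NID and all secondary NIDs of one peer.
        for nid in (primary_nid, *other_nids):
            self._primary_by_nid[nid] = primary_nid

    def primary_nid_intended(self, nid):
        # Intended behavior: return the primary NID assigned to the peer
        # object; fall back to the argument only if the peer is unknown.
        return self._primary_by_nid.get(nid, nid)

    def primary_nid_regressed(self, nid):
        # Behavior described in this ticket: the lookup result is not used
        # and the NID passed in by the caller is returned unchanged.
        self._primary_by_nid.get(nid)
        return nid


table = PeerTable()
table.add_peer("2203@kfi4", "2267@kfi4")  # merced6: primary, secondary

intended = table.primary_nid_intended("2267@kfi4")
regressed = table.primary_nid_regressed("2267@kfi4")
```

With merced6's NIDs, the intended lookup maps the secondary NID 2267@kfi4 to 2203@kfi4, while the regressed variant hands 2267@kfi4 straight back to the caller.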
Test case:
- The filesystem is started (50 MDTs, 240 OSTs) and idle.
- A subset of OSTs is stopped and then re-mounted (umount targets, unload/reload net, re-mount targets).
- A subset of those OSTs ends up in a connection loop with a corrupted peer entry for one or more of the MDSes.
In this case, MDS merced6 has:
merced6:~ # lctl list_nids
2203@kfi4
2267@kfi4
merced6:~ #
i.e.
Primary NID - 2203@kfi4
Secondary NID - 2267@kfi4
OSS merced201 ends up with a peer entry in which 2267@kfi4 is the primary NID:
merced201:~ # lnetctl peer show --nid 2267@kfi4
peer:
    - primary nid: 2267@kfi4
      Multi-Rail: True
      peer ni:
        - nid: 2203@kfi4
          state: NA
        - nid: 2267@kfi4
          state: NA
merced201:~ #
i.e.
Primary NID - 2267@kfi4
Secondary NID - 2203@kfi4
merced201 sends MDS_CONNECT to merced6 (MDT0005), but the reply buffer is set up using the actual primary NID 2203@kfi4:
00000100:00100000:30.0:1711131512.571627:0:274601:0:(client.c:1733:ptlrpc_send_new_req()) Sending RPC req@0000000085d9184b pname:cluuid:pid:xid:nid:opc:job ptlrpcd_rcv:lustre4-MDT0005-lwp-OST0064_UUID:274601:1794251330431424:2203@kfi4:38:
When the reply arrives it is dropped due to NID mismatch:
00000400:00000100:20.0:1711131512.571851:0:906:0:(lib-move.c:4249:lnet_parse_put()) Dropping PUT from 12345-2267@kfi4 portal 10 match 1794251330431424 offset 0 length 416: 4
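The mismatch can be sketched as follows. This is a deliberately simplified model, assuming a reply is accepted only when its source NID, portal and match bits all agree with the posted reply buffer; `ReplyBuffer` and `deliver_put()` are hypothetical names, not LNet APIs. The NIDs, portal and match bits are taken from the two log lines above.

```python
# Simplified illustration; ReplyBuffer and deliver_put are made-up names.
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplyBuffer:
    expected_nid: str  # NID the requester expects the reply to come from
    portal: int
    match_bits: int


def deliver_put(posted, src_nid, portal, match_bits):
    """Return the matching buffer, or None if the PUT would be dropped."""
    for buf in posted:
        same_slot = (buf.portal, buf.match_bits) == (portal, match_bits)
        if same_slot and buf.expected_nid == src_nid:
            return buf
    return None


# Reply buffer posted for the MDS_CONNECT request, keyed on 2203@kfi4.
posted = [ReplyBuffer("2203@kfi4", 10, 1794251330431424)]

# The reply is attributed to 2267@kfi4, so nothing matches.
dropped = deliver_put(posted, "2267@kfi4", 10, 1794251330431424)

# Had it been attributed to 2203@kfi4, the buffer would have matched.
accepted = deliver_put(posted, "2203@kfi4", 10, 1794251330431424)
```

In this model `dropped` is `None`, which corresponds to the "Dropping PUT" message, while `accepted` is the posted buffer.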
How does it get into this state?
While the OST is being started, an incoming OST_CONNECT request from merced6's secondary NID (2267@kfi4) creates a peer with 2267@kfi4 as its locked primary NID.
target_handle_connect()->ptlrpc_connection_get()->LNetPrimaryNID(2267@kfi4):
00000100:00000040:6.0:1711131464.201240:0:274892:0:(lustre_net.h:2402:ptlrpc_rqphase_move()) @@@ move request phase from New to Interpret req@00000000e34ec254 x1794248965703168/t0(0) o8-><?>@<unknown>:0/0 lens 520/0 e 0 to 0 dl 1711131564 ref 1 fl New:/0/ffffffff rc 0/-1 job:''
00000100:00100000:6.0:1711131464.201243:0:274892:0:(service.c:2309:ptlrpc_server_handle_request()) Handling RPC req@00000000e34ec254 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost01_095:0+-99:28449:x1794248965703168:12345-2267@kfi4:8:
00000100:00000200:6.0:1711131464.201244:0:274892:0:(service.c:2314:ptlrpc_server_handle_request()) got req 1794248965703168
00010000:02020000:6.0:1711131464.201287:0:274892:0:(ldlm_lib.c:1124:target_handle_connect()) 137-5: lustre4-OST0064_UUID: not available for connect from 2267@kfi4 (no target). If you are running an HA pair check that the target is mounted on the other server.
00000400:00000200:6.0:1711131464.223959:0:274892:0:(api-ni.c:1551:lnet_nid4_cpt_hash()) Match nid 2267@kfi4 to cpt 1
00000400:00000200:6.0:1711131464.223961:0:274892:0:(peer.c:2441:lnet_peer_queue_for_discovery()) Queue peer 2267@kfi4: 0
00000400:00000200:6.0:1711131464.223962:0:274892:0:(peer.c:2765:lnet_discover_peer_locked()) Discovery attempt # 1
00000400:00000200:6.0:1711131464.223962:0:274892:0:(peer.c:2805:lnet_discover_peer_locked()) non-blocking discovery
00000400:00000200:6.0:1711131464.223963:0:274892:0:(peer.c:2814:lnet_discover_peer_locked()) peer 2267@kfi4 NID 2267@kfi4: 0. pending discovery
00000400:00000200:6.0:1711131464.223964:0:274892:0:(api-ni.c:1551:lnet_nid4_cpt_hash()) Match nid 2267@kfi4 to cpt 1
00000400:00000200:6.0:1711131464.223965:0:274892:0:(peer.c:1553:LNetPrimaryNID()) NID 2267@kfi4 primary NID 2267@kfi4 rc 0
^ Peer with 2267@kfi4 as locked primary NID is created
When the config log is processed, subsequent calls to LNetPrimaryNID() do not always return 2267@kfi4 as expected:
[hornc@s-lmo-kalina restart-201-216]$ grep LNetPrimaryNID merced201-dk.log | grep -e 2203@kfi -e 2267@kfi
00000400:00000200:6.0:1711131464.223965:0:274892:0:(peer.c:1553:LNetPrimaryNID()) NID 2267@kfi4 primary NID 2267@kfi4 rc 0
00000400:00000200:27.0:1711131512.570599:0:277716:0:(peer.c:1553:LNetPrimaryNID()) NID 2267@kfi4 primary NID 2203@kfi4 rc 0
00000400:00000200:27.0:1711131512.570670:0:277716:0:(peer.c:1553:LNetPrimaryNID()) NID 2267@kfi4 primary NID 2203@kfi4 rc 0
00000400:00000200:27.0:1711131512.571671:0:277716:0:(peer.c:1553:LNetPrimaryNID()) NID 2267@kfi4 primary NID 2203@kfi4 rc 0
00000400:00000200:22.0:1711131560.393558:0:275275:0:(peer.c:1553:LNetPrimaryNID()) NID 2267@kfi4 primary NID 2267@kfi4 rc 0
[hornc@s-lmo-kalina restart-201-216]$
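For larger debug logs, the same check can be scripted instead of grepped by eye. The sketch below assumes the LNetPrimaryNID() debug lines have exactly the format shown in this ticket; the function names `primary_nid_history()` and `inconsistent_nids()` are hypothetical helpers, not part of any Lustre tool. It groups the calls by queried NID and reports NIDs for which more than one primary NID was returned.

```python
import re
from collections import defaultdict

# Matches "<timestamp>:...:(peer.c:<line>:LNetPrimaryNID()) NID a primary NID b rc n"
PRIMARY_NID_RE = re.compile(
    r"(\d+\.\d+):\d+:\d+:\d+:\(peer\.c:\d+:LNetPrimaryNID\(\)\) "
    r"NID (\S+) primary NID (\S+) rc (-?\d+)"
)


def primary_nid_history(log_text):
    """Map each queried NID to the (timestamp, returned primary) pairs seen."""
    history = defaultdict(list)
    for ts, nid, primary, _rc in PRIMARY_NID_RE.findall(log_text):
        history[nid].append((ts, primary))  # timestamp kept as a string
    return history


def inconsistent_nids(history):
    """Return NIDs for which LNetPrimaryNID() gave more than one answer."""
    result = {}
    for nid, calls in history.items():
        primaries = sorted({primary for _ts, primary in calls})
        if len(primaries) > 1:
            result[nid] = primaries
    return result


# Three of the lines from the grep output in this ticket.
SAMPLE = """\
00000400:00000200:6.0:1711131464.223965:0:274892:0:(peer.c:1553:LNetPrimaryNID()) NID 2267@kfi4 primary NID 2267@kfi4 rc 0
00000400:00000200:27.0:1711131512.570599:0:277716:0:(peer.c:1553:LNetPrimaryNID()) NID 2267@kfi4 primary NID 2203@kfi4 rc 0
00000400:00000200:22.0:1711131560.393558:0:275275:0:(peer.c:1553:LNetPrimaryNID()) NID 2267@kfi4 primary NID 2267@kfi4 rc 0
"""

history = primary_nid_history(SAMPLE)
flagged = inconsistent_nids(history)
```

On the sample lines, `flagged` contains a single entry for 2267@kfi4 listing both 2203@kfi4 and 2267@kfi4 as returned primary NIDs, which is the inconsistency described in this ticket.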
Issue Links
- is related to LU-14668 LNet: do discovery in the background (Resolved)