[LU-17142] MGC long time connection Created: 25/Sep/23  Updated: 10/Jan/24  Resolved: 18/Nov/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Upstream
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Major
Reporter: Alexander Boyko Assignee: Alexander Boyko
Resolution: Fixed Votes: 0
Labels: patch
Environment:

2 MDTs in failover pair


Issue Links:
Related
is related to LU-17412 lustre snapshot: write barrier stuck ... Open
is related to LU-17147 sanity-lsnapshot test_1b: Fail to cre... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Initial state: n03 hosted MDT0 (combined with the MGS) and MDT1. Then MDT1 failed over and MDT0 failed back; MDT1 was started first.
After the MGS started, the MGC on the same node did not connect to it. As a result, the config lock could not be obtained and MDT0 failed to start.
Here are the connection attempts to the MGS:

00000100:02000400:1.0:1684425810.091309:0:67373:0:(import.c:1234:ptlrpc_connect_interpret()) Evicted from MGS (at 90@kfi) after server handle changed from 0xf8988cf2594b4a99 to 0xad30caba1bb5141f
00000100:00080000:17.0:1684425957.393672:0:988963:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 0@lo last attempt 0
00000100:00080000:17.0:1684425957.393676:0:988963:0:(import.c:615:import_select_connection()) MGC90@kfi: import 00000000e1884310 using connection MGC90@kfi_0/0@lo
00000100:00080000:15.0:1684426028.139221:0:67373:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of MGS on MGC90@kfi_0 failed (-110)
00000100:00080000:12.0:1684426028.139231:0:968287:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 0@lo last attempt 1984805
00000100:00080000:12.0:1684426028.139233:0:968287:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 57@kfi last attempt 0
00000100:00080000:12.0:1684426028.139245:0:968287:0:(import.c:606:import_select_connection()) MGC90@kfi: Connection changing to MGS (at 57@kfi)
00000100:00080000:12.0:1684426028.139246:0:968287:0:(import.c:615:import_select_connection()) MGC90@kfi: import 00000000e1884310 using connection MGC90@kfi_1/57@kfi
00000100:00080000:0.0:1684426099.819204:0:67373:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of MGS on 57@kfi failed (-110)
00000100:00080000:2.0:1684426099.819218:0:991028:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 0@lo last attempt 1984805
00000100:00080000:2.0:1684426099.819220:0:991028:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 57@kfi last attempt 1984876
00000100:00080000:2.0:1684426099.819221:0:991028:0:(import.c:581:import_select_connection()) MGC90@kfi: tried all connections, increasing latency to 66s
00000100:00080000:2.0:1684426099.819238:0:991028:0:(import.c:606:import_select_connection()) MGC90@kfi: Connection changing to MGS (at 0@lo)
00000100:00080000:2.0:1684426099.819240:0:991028:0:(import.c:615:import_select_connection()) MGC90@kfi: import 00000000e1884310 using connection MGC90@kfi_0/0@lo
20000000:00000040:11.0:1684426115.484210:0:995298:0:(mgs_handler.c:1397:mgs_init0()) MGS MGS started
00000100:00080000:4.0:1684426170.475202:0:67373:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of MGS on 90@kfi failed (-110)
00000100:00080000:5.0:1684426170.475211:0:971094:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 0@lo last attempt 1984948
00000100:00080000:5.0:1684426170.475213:0:971094:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 57@kfi last attempt 1984876
00010000:00010000:11.0:1684426186.859281:0:995298:0:(ldlm_request.c:1045:ldlm_cli_enqueue()) ### client-side enqueue START, flags 0x1000000000000 ns: MGC90@kfi lock: 0000000054f13717/0xad30caba1bb518e1 lrc: 3/1,0 mode: --/CR res: [0x32316f6d6c6a6b:0x0:0x0].0x0 rrc: 2 type: PLN flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 995298 timeout: 0 lvb_type: 0
00010000:00000040:11.0:1684426186.859283:0:995298:0:(ldlm_resource.c:1648:ldlm_resource_putref()) putref res: 000000000b5a37f1 count: 1
00010000:00010000:11.0:1684426186.859285:0:995298:0:(ldlm_request.c:1132:ldlm_cli_enqueue()) ### sending request ns: MGC90@kfi lock: 0000000054f13717/0xad30caba1bb518e1 lrc: 3/1,0 mode: --/CR res: [0x32316f6d6c6a6b:0x0:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x0 expref: -99 pid: 995298 timeout: 0 lvb_type: 0
00010000:00000040:11.0:1684426186.859287:0:995298:0:(ldlm_resource.c:1648:ldlm_resource_putref()) putref res: 000000000b5a37f1 count: 1
00000100:00000040:11.0:1684426186.859291:0:995298:0:(lustre_net.h:2404:ptlrpc_rqphase_move()) @@@ move request phase from New to Rpc  req@00000000c8cf8fee x1764170257385152/t0(0) o101->MGC90@kfi@57@kfi:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl New:QU/0/ffffffff rc 0/-1 job:'mount.lustre.0'
00000100:00080000:11.0:1684426186.859294:0:995298:0:(client.c:1665:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != CONNECTING)  req@00000000c8cf8fee x1764170257385152/t0(0) o101->MGC90@kfi@57@kfi:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:WQU/0/ffffffff rc 0/-1 job:'mount.lustre.0'

00000100:00080000:11.0:1684426193.003178:0:995298:0:(client.c:1260:ptlrpc_import_delay_req()) @@@ send limit expired  req@00000000c8cf8fee x1764170257385152/t0(0) o101->MGC90@kfi@57@kfi:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:WQU/0/ffffffff rc 0/-1 job:'mount.lustre.0'
10000000:01000000:11.0:1684426193.003262:0:995298:0:(mgc_request.c:2136:mgc_process_log()) MGC90@kfi: configuration from log 'kjlmo12-MDT0000' failed (-5).
00000020:02020000:11.0:1684426193.003265:0:995298:0:(obd_mount.c:109:lustre_process_log()) 15c-8: MGC90@kfi: Confguration from log kjlmo12-MDT0000 failed from MGS -5. Communication error between node & MGS, a bad configuration, or other errors. See syslog for more info
00000020:00020000:11.0:1684426193.022368:0:995298:0:(obd_mount_server.c:1425:server_start_targets()) failed to start server kjlmo12-MDT0000: -5

MGC reconnection through the pinger waits for all requests to time out before making a new attempt. In some situations this leads to a failure to connect to the MGS on the local node and, as a result, a failure to start MDT0000.
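
The timing behind this can be read directly from the log. The following is a minimal sketch in Python (not part of the patch) that computes the relevant intervals from timestamps copied out of the log above. The dictionary keys and variable names are labels invented for this illustration. The sketch only does arithmetic on the logged times and does not model the pinger itself.

```python
from decimal import Decimal

# Timestamps (seconds.microseconds) copied verbatim from the debug log above.
# The keys are descriptive labels chosen for this sketch only.
TS = {
    # import_select_connection(): using connection MGC90@kfi_0/0@lo
    "local_attempt_start": Decimal("1684426099.819240"),
    # mgs_init0(): MGS MGS started
    "mgs_started": Decimal("1684426115.484210"),
    # ptlrpc_connect_interpret(): recovery of MGS on 90@kfi failed (-110)
    "local_attempt_fail": Decimal("1684426170.475202"),
    # ldlm_cli_enqueue(): client-side enqueue START
    "enqueue_start": Decimal("1684426186.859281"),
    # ptlrpc_import_delay_req(): send limit expired
    "send_limit_expired": Decimal("1684426193.003178"),
}


def elapsed(start: str, end: str) -> Decimal:
    """Return the number of seconds between two labelled log events."""
    return TS[end] - TS[start]


# How long the connect attempt to 0@lo stayed in flight before timing out.
attempt_len = elapsed("local_attempt_start", "local_attempt_fail")
# How much of that time passed while the local MGS was already running.
blocked_after_mgs_up = elapsed("mgs_started", "local_attempt_fail")
# How long the config lock enqueue waited before its send limit expired.
enqueue_wait = elapsed("enqueue_start", "send_limit_expired")
# Total time from MGS start to the failed enqueue.
mgs_up_to_failure = elapsed("mgs_started", "send_limit_expired")

print(f"connect attempt to 0@lo in flight:   {attempt_len} s")
print(f"of which MGS was already up:         {blocked_after_mgs_up} s")
print(f"config lock enqueue waited:          {enqueue_wait} s")
print(f"MGS start to enqueue failure:        {mgs_up_to_failure} s")
```

With these values, the attempt to 0@lo was in flight for about 70.7 s. The MGS started about 15.7 s into that attempt, so the MGC spent roughly 55 s waiting on the timeout while the MGS was already running locally. The config lock enqueue then waited about 6.1 s before its send limit expired. By that point the request in the log is addressed to 57@kfi rather than 0@lo, which suggests the next attempt had moved on to the other NID.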



 Comments   
Comment by Gerrit Updater [ 25/Sep/23 ]

"Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52498
Subject: LU-17142 mgc: reconnection without pinger
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5fce9cfd00a4527675b10e91bbae17fd35638355

Comment by Etienne Aujames [ 27/Sep/23 ]

Hi Alexander,

Can this be related to LU-16204?

Comment by Alexander Boyko [ 28/Sep/23 ]

>Can this be related to LU-16204?
Not exactly. The description shows a problem connecting to a local MGS, when the MGC and MGS are started on the same node.

Comment by Gerrit Updater [ 18/Nov/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52498/
Subject: LU-17142 mgc: reconnection without pinger
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 867ba433e3a0fce4a1b2f8d37a91d550ada41a26

Comment by Peter Jones [ 18/Nov/23 ]

Landed for 2.16
