[LU-17142] MGC long time connection Created: 25/Sep/23 Updated: 10/Jan/24 Resolved: 18/Nov/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Upstream |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexander Boyko | Assignee: | Alexander Boyko |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Environment: |
2 MDTs in failover pair |
||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
In the initial state, n03 hosted MDT0 (combined with the MGS) and MDT1. A failover of MDT1 and a failback of MDT0 followed, with MDT1 started first.

00000100:02000400:1.0:1684425810.091309:0:67373:0:(import.c:1234:ptlrpc_connect_interpret()) Evicted from MGS (at 90@kfi) after server handle changed from 0xf8988cf2594b4a99 to 0xad30caba1bb5141f
00000100:00080000:17.0:1684425957.393672:0:988963:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 0@lo last attempt 0
00000100:00080000:17.0:1684425957.393676:0:988963:0:(import.c:615:import_select_connection()) MGC90@kfi: import 00000000e1884310 using connection MGC90@kfi_0/0@lo
00000100:00080000:15.0:1684426028.139221:0:67373:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of MGS on MGC90@kfi_0 failed (-110)
00000100:00080000:12.0:1684426028.139231:0:968287:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 0@lo last attempt 1984805
00000100:00080000:12.0:1684426028.139233:0:968287:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 57@kfi last attempt 0
00000100:00080000:12.0:1684426028.139245:0:968287:0:(import.c:606:import_select_connection()) MGC90@kfi: Connection changing to MGS (at 57@kfi)
00000100:00080000:12.0:1684426028.139246:0:968287:0:(import.c:615:import_select_connection()) MGC90@kfi: import 00000000e1884310 using connection MGC90@kfi_1/57@kfi
00000100:00080000:0.0:1684426099.819204:0:67373:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of MGS on 57@kfi failed (-110)
00000100:00080000:2.0:1684426099.819218:0:991028:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 0@lo last attempt 1984805
00000100:00080000:2.0:1684426099.819220:0:991028:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 57@kfi last attempt 1984876
00000100:00080000:2.0:1684426099.819221:0:991028:0:(import.c:581:import_select_connection()) MGC90@kfi: tried all connections, increasing latency to 66s
00000100:00080000:2.0:1684426099.819238:0:991028:0:(import.c:606:import_select_connection()) MGC90@kfi: Connection changing to MGS (at 0@lo)
00000100:00080000:2.0:1684426099.819240:0:991028:0:(import.c:615:import_select_connection()) MGC90@kfi: import 00000000e1884310 using connection MGC90@kfi_0/0@lo
20000000:00000040:11.0:1684426115.484210:0:995298:0:(mgs_handler.c:1397:mgs_init0()) MGS MGS started
00000100:00080000:4.0:1684426170.475202:0:67373:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of MGS on 90@kfi failed (-110)
00000100:00080000:5.0:1684426170.475211:0:971094:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 0@lo last attempt 1984948
00000100:00080000:5.0:1684426170.475213:0:971094:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 57@kfi last attempt 1984876
00010000:00010000:11.0:1684426186.859281:0:995298:0:(ldlm_request.c:1045:ldlm_cli_enqueue()) ### client-side enqueue START, flags 0x1000000000000 ns: MGC90@kfi lock: 0000000054f13717/0xad30caba1bb518e1 lrc: 3/1,0 mode: --/CR res: [0x32316f6d6c6a6b:0x0:0x0].0x0 rrc: 2 type: PLN flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 995298 timeout: 0 lvb_type: 0
00010000:00000040:11.0:1684426186.859283:0:995298:0:(ldlm_resource.c:1648:ldlm_resource_putref()) putref res: 000000000b5a37f1 count: 1
00010000:00010000:11.0:1684426186.859285:0:995298:0:(ldlm_request.c:1132:ldlm_cli_enqueue()) ### sending request ns: MGC90@kfi lock: 0000000054f13717/0xad30caba1bb518e1 lrc: 3/1,0 mode: --/CR res: [0x32316f6d6c6a6b:0x0:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x0 expref: -99 pid: 995298 timeout: 0 lvb_type: 0
00010000:00000040:11.0:1684426186.859287:0:995298:0:(ldlm_resource.c:1648:ldlm_resource_putref()) putref res: 000000000b5a37f1 count: 1
00000100:00000040:11.0:1684426186.859291:0:995298:0:(lustre_net.h:2404:ptlrpc_rqphase_move()) @@@ move request phase from New to Rpc req@00000000c8cf8fee x1764170257385152/t0(0) o101->MGC90@kfi@57@kfi:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl New:QU/0/ffffffff rc 0/-1 job:'mount.lustre.0'
00000100:00080000:11.0:1684426186.859294:0:995298:0:(client.c:1665:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != CONNECTING) req@00000000c8cf8fee x1764170257385152/t0(0) o101->MGC90@kfi@57@kfi:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:WQU/0/ffffffff rc 0/-1 job:'mount.lustre.0'
00000100:00080000:11.0:1684426193.003178:0:995298:0:(client.c:1260:ptlrpc_import_delay_req()) @@@ send limit expired req@00000000c8cf8fee x1764170257385152/t0(0) o101->MGC90@kfi@57@kfi:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:WQU/0/ffffffff rc 0/-1 job:'mount.lustre.0'
10000000:01000000:11.0:1684426193.003262:0:995298:0:(mgc_request.c:2136:mgc_process_log()) MGC90@kfi: configuration from log 'kjlmo12-MDT0000' failed (-5).
00000020:02020000:11.0:1684426193.003265:0:995298:0:(obd_mount.c:109:lustre_process_log()) 15c-8: MGC90@kfi: Confguration from log kjlmo12-MDT0000 failed from MGS -5. Communication error between node & MGS, a bad configuration, or other errors. See syslog for more info
00000020:00020000:11.0:1684426193.022368:0:995298:0:(obd_mount_server.c:1425:server_start_targets()) failed to start server kjlmo12-MDT0000: -5

MGC reconnection through the pinger waits for all requests to time out before making a new attempt. In some situations this prevents the MGC from connecting to the MGS on the local node, so MDT0000 fails to start. |
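The timing in the log above can be checked with a small script. The following is only a sketch for reading this ticket's log, not part of the patch: it assumes that the fourth colon-separated field of each debug line is `seconds.microseconds`, and it pairs each `using connection` line with the next `failed (-110)` line (-110 is `ETIMEDOUT` on Linux), which is an interpretation of the log rather than something the log states. The names `stamp_us`, `elapsed_us`, and `attempt_durations` are hypothetical helpers.

```python
import re

# Prefix layout assumed from the lines quoted in this ticket:
#   subsystem:mask:cpu.type:seconds.microseconds:...
STAMP = re.compile(r"^[0-9a-f]{8}:[0-9a-f]{8}:\d+\.\d+:(\d+)\.(\d{6}):")


def stamp_us(line):
    """Return the log line's timestamp as integer microseconds."""
    match = STAMP.match(line)
    if match is None:
        raise ValueError("unrecognized log prefix: " + line[:40])
    return int(match.group(1)) * 1_000_000 + int(match.group(2))


def elapsed_us(start_line, end_line):
    """Microseconds between two log lines (end minus start)."""
    return stamp_us(end_line) - stamp_us(start_line)


# (target NID, "using connection" line, following "failed (-110)" line),
# copied from the description above. The pairing is an assumption.
ATTEMPTS = [
    ("0@lo",
     "00000100:00080000:17.0:1684425957.393676:0:988963:0:(import.c:615:import_select_connection()) MGC90@kfi: import 00000000e1884310 using connection MGC90@kfi_0/0@lo",
     "00000100:00080000:15.0:1684426028.139221:0:67373:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of MGS on MGC90@kfi_0 failed (-110)"),
    ("57@kfi",
     "00000100:00080000:12.0:1684426028.139246:0:968287:0:(import.c:615:import_select_connection()) MGC90@kfi: import 00000000e1884310 using connection MGC90@kfi_1/57@kfi",
     "00000100:00080000:0.0:1684426099.819204:0:67373:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of MGS on 57@kfi failed (-110)"),
    ("0@lo",
     "00000100:00080000:2.0:1684426099.819240:0:991028:0:(import.c:615:import_select_connection()) MGC90@kfi: import 00000000e1884310 using connection MGC90@kfi_0/0@lo",
     "00000100:00080000:4.0:1684426170.475202:0:67373:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of MGS on 90@kfi failed (-110)"),
]

MGS_STARTED = "20000000:00000040:11.0:1684426115.484210:0:995298:0:(mgs_handler.c:1397:mgs_init0()) MGS MGS started"


def attempt_durations():
    """List (NID, microseconds until the attempt failed) for each attempt."""
    return [(nid, elapsed_us(start, end)) for nid, start, end in ATTEMPTS]


if __name__ == "__main__":
    for nid, us in attempt_durations():
        print(f"attempt to {nid}: {us / 1e6:.2f} s until -110")
    _, last_start, last_end = ATTEMPTS[-1]
    print(f"MGS started {elapsed_us(last_start, MGS_STARTED) / 1e6:.2f} s into the last attempt")
    print(f"MGS was up {elapsed_us(MGS_STARTED, last_end) / 1e6:.2f} s before that attempt timed out")
```

On this reading, each connect attempt runs for roughly 71 seconds before failing. The last attempt to 0@lo began about 16 seconds before the MGS started and was not abandoned until it timed out about 55 seconds after the MGS came up, which matches the behaviour the description attributes to the pinger.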
| Comments |
| Comment by Gerrit Updater [ 25/Sep/23 ] |
|
"Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52498 |
| Comment by Etienne Aujames [ 27/Sep/23 ] |
|
Hi Alexander, Can this be related to LU-16204? |
| Comment by Alexander Boyko [ 28/Sep/23 ] |
|
>Can this be related to LU-16204? |
| Comment by Gerrit Updater [ 18/Nov/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52498/ |
| Comment by Peter Jones [ 18/Nov/23 ] |
|
Landed for 2.16 |