Details
Bug
Resolution: Fixed
Major
Upstream
2 MDTs in failover pair
3
Description
In the initial state, n03 hosted MDT0 (combined with the MGS) and MDT1. After a failover of MDT1 and a failback of MDT0, MDT1 was started first.
After the MGS started, the MGC on the same node did not connect to it. Getting the config lock therefore failed, and MDT0 failed to start.
Here are the connection attempts to the MGS:
00000100:02000400:1.0:1684425810.091309:0:67373:0:(import.c:1234:ptlrpc_connect_interpret()) Evicted from MGS (at 90@kfi) after server handle changed from 0xf8988cf2594b4a99 to 0xad30caba1bb5141f
00000100:00080000:17.0:1684425957.393672:0:988963:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 0@lo last attempt 0
00000100:00080000:17.0:1684425957.393676:0:988963:0:(import.c:615:import_select_connection()) MGC90@kfi: import 00000000e1884310 using connection MGC90@kfi_0/0@lo
00000100:00080000:15.0:1684426028.139221:0:67373:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of MGS on MGC90@kfi_0 failed (-110)
00000100:00080000:12.0:1684426028.139231:0:968287:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 0@lo last attempt 1984805
00000100:00080000:12.0:1684426028.139233:0:968287:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 57@kfi last attempt 0
00000100:00080000:12.0:1684426028.139245:0:968287:0:(import.c:606:import_select_connection()) MGC90@kfi: Connection changing to MGS (at 57@kfi)
00000100:00080000:12.0:1684426028.139246:0:968287:0:(import.c:615:import_select_connection()) MGC90@kfi: import 00000000e1884310 using connection MGC90@kfi_1/57@kfi
00000100:00080000:0.0:1684426099.819204:0:67373:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of MGS on 57@kfi failed (-110)
00000100:00080000:2.0:1684426099.819218:0:991028:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 0@lo last attempt 1984805
00000100:00080000:2.0:1684426099.819220:0:991028:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 57@kfi last attempt 1984876
00000100:00080000:2.0:1684426099.819221:0:991028:0:(import.c:581:import_select_connection()) MGC90@kfi: tried all connections, increasing latency to 66s
00000100:00080000:2.0:1684426099.819238:0:991028:0:(import.c:606:import_select_connection()) MGC90@kfi: Connection changing to MGS (at 0@lo)
00000100:00080000:2.0:1684426099.819240:0:991028:0:(import.c:615:import_select_connection()) MGC90@kfi: import 00000000e1884310 using connection MGC90@kfi_0/0@lo
20000000:00000040:11.0:1684426115.484210:0:995298:0:(mgs_handler.c:1397:mgs_init0()) MGS MGS started
00000100:00080000:4.0:1684426170.475202:0:67373:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of MGS on 90@kfi failed (-110)
00000100:00080000:5.0:1684426170.475211:0:971094:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 0@lo last attempt 1984948
00000100:00080000:5.0:1684426170.475213:0:971094:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 57@kfi last attempt 1984876
00010000:00010000:11.0:1684426186.859281:0:995298:0:(ldlm_request.c:1045:ldlm_cli_enqueue()) ### client-side enqueue START, flags 0x1000000000000 ns: MGC90@kfi lock: 0000000054f13717/0xad30caba1bb518e1 lrc: 3/1,0 mode: --/CR res: [0x32316f6d6c6a6b:0x0:0x0].0x0 rrc: 2 type: PLN flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 995298 timeout: 0 lvb_type: 0
00010000:00000040:11.0:1684426186.859283:0:995298:0:(ldlm_resource.c:1648:ldlm_resource_putref()) putref res: 000000000b5a37f1 count: 1
00010000:00010000:11.0:1684426186.859285:0:995298:0:(ldlm_request.c:1132:ldlm_cli_enqueue()) ### sending request ns: MGC90@kfi lock: 0000000054f13717/0xad30caba1bb518e1 lrc: 3/1,0 mode: --/CR res: [0x32316f6d6c6a6b:0x0:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x0 expref: -99 pid: 995298 timeout: 0 lvb_type: 0
00010000:00000040:11.0:1684426186.859287:0:995298:0:(ldlm_resource.c:1648:ldlm_resource_putref()) putref res: 000000000b5a37f1 count: 1
00000100:00000040:11.0:1684426186.859291:0:995298:0:(lustre_net.h:2404:ptlrpc_rqphase_move()) @@@ move request phase from New to Rpc req@00000000c8cf8fee x1764170257385152/t0(0) o101->MGC90@kfi@57@kfi:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl New:QU/0/ffffffff rc 0/-1 job:'mount.lustre.0'
00000100:00080000:11.0:1684426186.859294:0:995298:0:(client.c:1665:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != CONNECTING) req@00000000c8cf8fee x1764170257385152/t0(0) o101->MGC90@kfi@57@kfi:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:WQU/0/ffffffff rc 0/-1 job:'mount.lustre.0'
00000100:00080000:11.0:1684426193.003178:0:995298:0:(client.c:1260:ptlrpc_import_delay_req()) @@@ send limit expired req@00000000c8cf8fee x1764170257385152/t0(0) o101->MGC90@kfi@57@kfi:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:WQU/0/ffffffff rc 0/-1 job:'mount.lustre.0'
10000000:01000000:11.0:1684426193.003262:0:995298:0:(mgc_request.c:2136:mgc_process_log()) MGC90@kfi: configuration from log 'kjlmo12-MDT0000' failed (-5).
00000020:02020000:11.0:1684426193.003265:0:995298:0:(obd_mount.c:109:lustre_process_log()) 15c-8: MGC90@kfi: Confguration from log kjlmo12-MDT0000 failed from MGS -5. Communication error between node & MGS, a bad configuration, or other errors. See syslog for more info
00000020:00020000:11.0:1684426193.022368:0:995298:0:(obd_mount_server.c:1425:server_start_targets()) failed to start server kjlmo12-MDT0000: -5
MGC reconnection through the pinger waits for all outstanding requests to time out before making a new attempt. In some situations this prevents the MGC from connecting to the MGS on the local node, so MDT0000 fails to start.
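To make the timing visible, here is a small Python sketch that works only from timestamps copied out of the log in this report. It is an illustration of the arithmetic, not part of any fix; the names ATTEMPTS, attempt_durations and attempt_in_flight are made up for this sketch, and pairing each "using connection" line with the next "failed (-110)" line is my reading of the log.

```python
# Connection attempts as (target NID, time selected, time failed with -110).
# Timestamps are copied from the debug log in this report; the pairing of
# "using connection" with the following "failed (-110)" line is an assumption.
ATTEMPTS = [
    ("0@lo",   1684425957.393676, 1684426028.139221),
    ("57@kfi", 1684426028.139246, 1684426099.819204),
    ("0@lo",   1684426099.819240, 1684426170.475202),
]
MGS_STARTED = 1684426115.484210         # mgs_init0(): "MGS MGS started"
ENQUEUE_START = 1684426186.859281       # ldlm_cli_enqueue() for the config lock
SEND_LIMIT_EXPIRED = 1684426193.003178  # ptlrpc_import_delay_req()


def attempt_durations(attempts):
    """Return (nid, seconds until failure) for each connect attempt."""
    return [(nid, failed - selected) for nid, selected, failed in attempts]


def attempt_in_flight(attempts, when):
    """Return the attempt that was still pending at time `when`, or None."""
    for attempt in attempts:
        _, selected, failed = attempt
        if selected <= when < failed:
            return attempt
    return None


if __name__ == "__main__":
    for nid, seconds in attempt_durations(ATTEMPTS):
        print(f"attempt to {nid}: {seconds:.1f} s until failure")
    nid, selected, failed = attempt_in_flight(ATTEMPTS, MGS_STARTED)
    print(f"MGS started {MGS_STARTED - selected:.1f} s into the attempt to {nid}")
    print(f"that attempt stayed pending for another {failed - MGS_STARTED:.1f} s")
    waited = SEND_LIMIT_EXPIRED - ENQUEUE_START
    print(f"config lock enqueue gave up after {waited:.1f} s")
```

With these numbers, each attempt takes roughly 71 s to fail. The MGS starts about 15.7 s into the third attempt, which was sent to 0@lo before the MGS existed, and that attempt stays pending for about 55 s more. The config lock enqueue from mount.lustre starts about 16 s after that attempt fails, finds the import in the CONNECTING state, and gives up after about 6.1 s with -5.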