  1. Lustre
  2. LU-17142

MGC long time connection

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.16.0
    • Upstream
    • 2 MDTs in failover pair
    • 3

    Description

      Initially, node n03 hosted MDT0 (combined with the MGS) and MDT1. A failover of MDT1 and a failback of MDT0 followed, and MDT1 was started first.
      After the MGS started, the MGC on the same node did not connect to it. As a result, the config lock could not be obtained and MDT0 failed to start.
      Here are the connection attempts to the MGS:

      00000100:02000400:1.0:1684425810.091309:0:67373:0:(import.c:1234:ptlrpc_connect_interpret()) Evicted from MGS (at 90@kfi) after server handle changed from 0xf8988cf2594b4a99 to 0xad30caba1bb5141f
      00000100:00080000:17.0:1684425957.393672:0:988963:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 0@lo last attempt 0
      00000100:00080000:17.0:1684425957.393676:0:988963:0:(import.c:615:import_select_connection()) MGC90@kfi: import 00000000e1884310 using connection MGC90@kfi_0/0@lo
      00000100:00080000:15.0:1684426028.139221:0:67373:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of MGS on MGC90@kfi_0 failed (-110)
      00000100:00080000:12.0:1684426028.139231:0:968287:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 0@lo last attempt 1984805
      00000100:00080000:12.0:1684426028.139233:0:968287:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 57@kfi last attempt 0
      00000100:00080000:12.0:1684426028.139245:0:968287:0:(import.c:606:import_select_connection()) MGC90@kfi: Connection changing to MGS (at 57@kfi)
      00000100:00080000:12.0:1684426028.139246:0:968287:0:(import.c:615:import_select_connection()) MGC90@kfi: import 00000000e1884310 using connection MGC90@kfi_1/57@kfi
      00000100:00080000:0.0:1684426099.819204:0:67373:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of MGS on 57@kfi failed (-110)
      00000100:00080000:2.0:1684426099.819218:0:991028:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 0@lo last attempt 1984805
      00000100:00080000:2.0:1684426099.819220:0:991028:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 57@kfi last attempt 1984876
      00000100:00080000:2.0:1684426099.819221:0:991028:0:(import.c:581:import_select_connection()) MGC90@kfi: tried all connections, increasing latency to 66s
      00000100:00080000:2.0:1684426099.819238:0:991028:0:(import.c:606:import_select_connection()) MGC90@kfi: Connection changing to MGS (at 0@lo)
      00000100:00080000:2.0:1684426099.819240:0:991028:0:(import.c:615:import_select_connection()) MGC90@kfi: import 00000000e1884310 using connection MGC90@kfi_0/0@lo
      20000000:00000040:11.0:1684426115.484210:0:995298:0:(mgs_handler.c:1397:mgs_init0()) MGS MGS started
      00000100:00080000:4.0:1684426170.475202:0:67373:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of MGS on 90@kfi failed (-110)
      00000100:00080000:5.0:1684426170.475211:0:971094:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 0@lo last attempt 1984948
      00000100:00080000:5.0:1684426170.475213:0:971094:0:(import.c:534:import_select_connection()) MGC90@kfi: connect to NID 57@kfi last attempt 1984876
      00010000:00010000:11.0:1684426186.859281:0:995298:0:(ldlm_request.c:1045:ldlm_cli_enqueue()) ### client-side enqueue START, flags 0x1000000000000 ns: MGC90@kfi lock: 0000000054f13717/0xad30caba1bb518e1 lrc: 3/1,0 mode: --/CR res: [0x32316f6d6c6a6b:0x0:0x0].0x0 rrc: 2 type: PLN flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 995298 timeout: 0 lvb_type: 0
      00010000:00000040:11.0:1684426186.859283:0:995298:0:(ldlm_resource.c:1648:ldlm_resource_putref()) putref res: 000000000b5a37f1 count: 1
      00010000:00010000:11.0:1684426186.859285:0:995298:0:(ldlm_request.c:1132:ldlm_cli_enqueue()) ### sending request ns: MGC90@kfi lock: 0000000054f13717/0xad30caba1bb518e1 lrc: 3/1,0 mode: --/CR res: [0x32316f6d6c6a6b:0x0:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x0 expref: -99 pid: 995298 timeout: 0 lvb_type: 0
      00010000:00000040:11.0:1684426186.859287:0:995298:0:(ldlm_resource.c:1648:ldlm_resource_putref()) putref res: 000000000b5a37f1 count: 1
      00000100:00000040:11.0:1684426186.859291:0:995298:0:(lustre_net.h:2404:ptlrpc_rqphase_move()) @@@ move request phase from New to Rpc  req@00000000c8cf8fee x1764170257385152/t0(0) o101->MGC90@kfi@57@kfi:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl New:QU/0/ffffffff rc 0/-1 job:'mount.lustre.0'
      00000100:00080000:11.0:1684426186.859294:0:995298:0:(client.c:1665:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != CONNECTING)  req@00000000c8cf8fee x1764170257385152/t0(0) o101->MGC90@kfi@57@kfi:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:WQU/0/ffffffff rc 0/-1 job:'mount.lustre.0'
      
      00000100:00080000:11.0:1684426193.003178:0:995298:0:(client.c:1260:ptlrpc_import_delay_req()) @@@ send limit expired  req@00000000c8cf8fee x1764170257385152/t0(0) o101->MGC90@kfi@57@kfi:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:WQU/0/ffffffff rc 0/-1 job:'mount.lustre.0'
      10000000:01000000:11.0:1684426193.003262:0:995298:0:(mgc_request.c:2136:mgc_process_log()) MGC90@kfi: configuration from log 'kjlmo12-MDT0000' failed (-5).
      00000020:02020000:11.0:1684426193.003265:0:995298:0:(obd_mount.c:109:lustre_process_log()) 15c-8: MGC90@kfi: Confguration from log kjlmo12-MDT0000 failed from MGS -5. Communication error between node & MGS, a bad configuration, or other errors. See syslog for more info
      00000020:00020000:11.0:1684426193.022368:0:995298:0:(obd_mount_server.c:1425:server_start_targets()) failed to start server kjlmo12-MDT0000: -5
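
To make the timing in the excerpt above easier to see, here is a minimal Python sketch that pulls the timestamps of the failed connect attempts out of the debug log and prints the gaps between them. The field layout in the regular expression is inferred from the lines quoted in this ticket, and the helper names (`parse_line`, `failure_times`, `gaps`) are hypothetical, not part of any Lustre tool. The error code -110 is ETIMEDOUT on Linux.

```python
import re

# Layout inferred from the excerpt in this ticket (an assumption):
# subsys:mask:cpu:seconds.usec:field:pid:field:(file:line:function()) message
LINE_RE = re.compile(
    r"^\s*(?P<subsys>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):(?P<cpu>[\d.]+):"
    r"(?P<ts>\d+\.\d+):\d+:(?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one debug-log line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = float(rec["ts"])
    return rec


def failure_times(lines):
    """Timestamps of connect attempts that ended with -110 (ETIMEDOUT)."""
    times = []
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        if rec["func"] == "ptlrpc_connect_interpret" and "failed (-110)" in rec["msg"]:
            times.append(rec["ts"])
    return times


def gaps(times):
    """Seconds between consecutive timestamps."""
    return [b - a for a, b in zip(times, times[1:])]


# Lines copied verbatim from the excerpt above.
SAMPLE = [
    "00000100:00080000:15.0:1684426028.139221:0:67373:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of MGS on MGC90@kfi_0 failed (-110)",
    "00000100:00080000:0.0:1684426099.819204:0:67373:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of MGS on 57@kfi failed (-110)",
    "20000000:00000040:11.0:1684426115.484210:0:995298:0:(mgs_handler.c:1397:mgs_init0()) MGS MGS started",
    "00000100:00080000:4.0:1684426170.475202:0:67373:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of MGS on 90@kfi failed (-110)",
]

if __name__ == "__main__":
    for gap in gaps(failure_times(SAMPLE)):
        print(f"{gap:.1f} s between failed attempts")
```

On the sample lines each attempt takes roughly 71 s to time out, which is the interval the MGC waits before it tries the next NID.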
      

      MGC reconnection through the pinger waits for all outstanding requests to time out before making a new attempt. In some situations this leads to a failure to connect to the MGS on the local node and a failure to start MDT0000.
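
The effect can be illustrated with a small Python sketch built only from values in the log above. It is a simplified model, not the actual `import_select_connection()` code: it assumes the NID with the oldest "last attempt" value is chosen next (0 meaning never tried), which matches the three selections visible in the log. The function and constant names are hypothetical.

```python
def select_connection(last_attempt):
    """Pick the NID whose last attempt is oldest (0 means never tried).

    Simplified model inferred from the log lines in this ticket.
    """
    return min(last_attempt, key=last_attempt.get)


# "last attempt" values copied from the import_select_connection() lines.
ROUNDS = [
    ({"0@lo": 1984805, "57@kfi": 0}, "57@kfi"),        # at ...028.139
    ({"0@lo": 1984805, "57@kfi": 1984876}, "0@lo"),    # at ...099.819
    ({"0@lo": 1984948, "57@kfi": 1984876}, "57@kfi"),  # at ...170.475
]

# Timestamps (seconds) copied from the log excerpt.
LOCAL_ATTEMPT_STARTED = 1684426099.819240  # using connection MGC90@kfi_0/0@lo
MGS_STARTED = 1684426115.484210            # mgs_init0(): MGS MGS started
LOCAL_ATTEMPT_FAILED = 1684426170.475202   # recovery of MGS ... failed (-110)
ENQUEUE_SENT = 1684426186.859285           # ldlm_cli_enqueue(): sending request
SEND_LIMIT_EXPIRED = 1684426193.003178     # ptlrpc_import_delay_req()


def idle_after_mgs_start():
    """Seconds the MGC kept waiting on the old request after the MGS was up."""
    return LOCAL_ATTEMPT_FAILED - MGS_STARTED


def mount_wait():
    """Seconds the config-lock enqueue waited before giving up with -5."""
    return SEND_LIMIT_EXPIRED - ENQUEUE_SENT


if __name__ == "__main__":
    for table, chosen in ROUNDS:
        print(table, "->", select_connection(table), "(log shows", chosen + ")")
    print(f"attempt on 0@lo began {MGS_STARTED - LOCAL_ATTEMPT_STARTED:.1f} s before the MGS started")
    print(f"MGC waited {idle_after_mgs_start():.1f} s more for that attempt to time out")
    print(f"mount enqueue gave up after {mount_wait():.1f} s")
```

Under this model the attempt on 0@lo was sent about 15.7 s before the MGS came up, the MGC then waited about 55 s for it to time out, and the next attempt went to 57@kfi rather than the local node. The config-lock enqueue from `mount.lustre` waited only about 6 s before failing, well before another attempt on 0@lo could happen.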

            People

              aboyko Alexander Boyko