Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14031

long time between reconnects

    XMLWordPrintable

Details

    • 3
    • 9223372036854775807

    Description

      From client vmcore

      00000800:02000000:23.0:1597079777.744738:0:2440:0:(gnilnd_stack.c:127:kgnilnd_quiesce_wait()) Quiesce start: hardware quiesce
      00000800:02000000:23.0:1597079799.248713:0:2440:0:(gnilnd_stack.c:148:kgnilnd_quiesce_wait()) Quiesce complete: hardware quiesce
      00000100:00000400:11.0F:1597079855.312701:0:9337:0:(client.c:2164:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1597079760/real 0]  req@ffff880f869d4900 x1674651545897792/t0(0) o103->snx11281-OST0004-osc-ffff88078d8f7000@10.10.100.9@o2ib3:17/18 lens 328/224 e 0 to 1 dl 1597079855 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 job:''
      00000100:00000400:23.0:1597079855.312701:0:9310:0:(client.c:2164:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1597079760/real 1597079760]  req@ffff8807c33cf200 x1674651545898176/t0(0) o103->snx11281-OST0004-osc-ffff88078d8f7000@10.10.100.9@o2ib3:17/18 lens 328/224 e 0 to 1 dl 1597079855 ref 1 fl Rpc:X/0/ffffffff rc 0/-1 job:''
      00000100:00080000:11.0:1597079855.312718:0:9337:0:(import.c:86:import_set_state_nolock()) ffff88079f339000 snx11281-OST0004_UUID: changing import state from FULL to DISCONN
      00000100:02000400:11.0:1597079855.312720:0:9337:0:(import.c:200:ptlrpc_set_import_discon()) snx11281-OST0004-osc-ffff88078d8f7000: Connection to snx11281-OST0004 (at 10.10.100.9@o2ib3) was lost; in progress operations using this service will wait for recovery to complete
      00000100:00080000:11.0:1597079855.395469:0:9337:0:(import.c:446:ptlrpc_pinger_force()) snx11281-OST0004_UUID: waking up pinger s:DISCONN
      00000100:00080000:11.0:1597079855.395485:0:224:0:(pinger.c:246:ptlrpc_pinger_process_import()) e7dcb43d-d757-d1da-940a-fa23715e018f->snx11281-OST0004_UUID: level DISCONN/3 force 1 force_next 0 deactive 0 pingable 1 suppress 0
      00000100:00080000:11.0:1597079855.395487:0:224:0:(recover.c:58:ptlrpc_initiate_recovery()) snx11281-OST0004_UUID: starting recovery
      00000100:00080000:11.0:1597079855.395488:0:224:0:(import.c:86:import_set_state_nolock()) ffff88079f339000 snx11281-OST0004_UUID: changing import state from DISCONN to CONNECTING
      00000100:00080000:11.0:1597079855.395501:0:224:0:(import.c:543:import_select_connection()) snx11281-OST0004-osc-ffff88078d8f7000: connect to NID 10.10.100.9@o2ib3 last attempt 793
      00000100:00080000:11.0:1597079855.395504:0:224:0:(import.c:619:import_select_connection()) snx11281-OST0004-osc-ffff88078d8f7000: import ffff88079f339000 using connection 10.10.100.9@o2ib3/10.10.100.9@o2ib3
      00000100:00080000:8.0:1597079855.917811:0:9328:0:(client.c:1594:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != CONNECTING)  req@ffff880f88e3d240 x1674651545898880/t0(0) o103->snx11281-OST0004-osc-ffff88078d8f7000@10.10.100.9@o2ib3:17/18 lens 328/224 e 0 to 0 dl 0 ref 1 fl Rpc:W/0/ffffffff rc 0/-1 job:''
      00000100:00080000:14.0F:1597079910.352694:0:9307:0:(import.c:86:import_set_state_nolock()) ffff88079f339000 snx11281-OST0004_UUID: changing import state from CONNECTING to DISCONN
      00000100:00080000:14.0:1597079910.352702:0:9307:0:(import.c:1393:ptlrpc_connect_interpret()) recovery of snx11281-OST0004_UUID on 10.10.100.9@o2ib3 failed (-110)
      00000100:00080000:11.0:1597079913.247074:0:9332:0:(client.c:1594:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != DISCONN)  req@ffff880f885a7240 x1674651545899200/t0(0) o103->snx11281-OST0004-osc-ffff88078d8f7000@10.10.100.9@o2ib3:17/18 lens 328/224 e 0 to 0 dl 0 ref 1 fl Rpc:W/0/ffffffff rc 0/-1 job:''
      00000100:00080000:25.0F:1597079923.789527:0:9336:0:(client.c:1594:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != DISCONN)  req@ffff880f869d4d80 x1674651545901760/t0(0) o103->snx11281-OST0004-osc-ffff88078d8f7000@10.10.100.9@o2ib3:17/18 lens 328/224 e 0 to 0 dl 0 ref 1 fl Rpc:W/0/ffffffff rc 0/-1 job:''
      00000100:00080000:8.0:1597079925.298288:0:9325:0:(client.c:1594:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != DISCONN)  req@ffff880f885a76c0 x1674651545902016/t0(0) o103->snx11281 OST0004-osc-ffff88078d8f7000@10.10.100.9@o2ib3:17/18 lens 328/224 e 0 to 0 dl 0 ref 1 fl Rpc:W/0/ffffffff rc 0/-1 job:''
      00000100:00080000:1.0:1597079998.672651:0:13448:0:(pinger.c:246:ptlrpc_pinger_process_import()) e7dcb43d-d757-d1da-940a-fa23715e018f->snx11281-OST0004_UUID: level DISCONN/3 force 0 force_next 0 deactive 0 pingable 1 suppress 0
      00000100:00080000:1.0:1597079998.672654:0:13448:0:(recover.c:58:ptlrpc_initiate_recovery()) snx11281-OST0004_UUID: starting recovery
      00000100:00080000:1.0:1597079998.672656:0:13448:0:(import.c:86:import_set_state_nolock()) ffff88079f339000 snx11281-OST0004_UUID: changing import state from DISCONN to CONNECTING
      00000100:00080000:1.0:1597079998.672659:0:13448:0:(import.c:543:import_select_connection()) snx11281-OST0004-osc-ffff88078d8f7000: connect to NID 10.10.100.9@o2ib3 last attempt 8723
      00000100:00080000:1.0:1597079998.672661:0:13448:0:(import.c:543:import_select_connection()) snx11281-OST0004-osc-ffff88078d8f7000: connect to NID 10.10.100.10@o2ib3 last attempt 0
      00000100:00080000:1.0:1597079998.672664:0:13448:0:(import.c:611:import_select_connection()) snx11281-OST0004-osc-ffff88078d8f7000: Connection changing to snx11281-OST0004 (at 10.10.100.10@o2ib3)
      00000100:00080000:1.0:1597079998.672665:0:13448:0:(import.c:619:import_select_connection()) snx11281-OST0004-osc-ffff88078d8f7000: import ffff88079f339000 using connection 10.10.100.10@o2ib3/10.10.100.10@o2ib3
      00000100:00080000:14.0:1597079998.692719:0:9307:0:(import.c:86:import_set_state_nolock()) ffff88079f339000 snx11281-OST0004_UUID: changing import state from CONNECTING to DISCONN
      00000100:00080000:14.0:1597079998.692722:0:9307:0:(import.c:1393:ptlrpc_connect_interpret()) recovery of snx11281-OST0004_UUID on 10.10.100.10@o2ib3 failed (-19)
      00000100:00080000:11.0:1597080017.424689:0:9337:0:(client.c:1594:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != DISCONN)  req@ffff880f869d5b00 x1674651547657024/t0(0) o103->snx11281-OST0004-osc-ffff88078d8f7000@10.10.100.10@o2ib3:17/18 lens 328/224 e 0 to 0 dl 0 ref 1 fl Rpc:W/0/ffffffff rc 0/-1 job:''
      00000100:00080000:28.0:1597080017.432530:0:9339:0:(client.c:1594:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != DISCONN)  req@ffff880f8bd4cdc0 x1674651547657152/t0(0) o103->snx11281-OST0004-osc-ffff88078d8f7000@10.10.100.10@o2ib3:17/18 lens 328/224 e 0 to 0 dl 0 ref 1 fl Rpc:W/0/ffffffff rc 0/-1 job:''
      00000100:00080000:1.0:1597080074.448656:0:13448:0:(pinger.c:246:ptlrpc_pinger_process_import()) e7dcb43d-d757-d1da-940a-fa23715e018f->snx11281-OST0004_UUID: level DISCONN/3 force 0 force_next 0 deactive 0 pingable 1 suppress 0
      00000100:00080000:1.0:1597080074.448659:0:13448:0:(recover.c:58:ptlrpc_initiate_recovery()) snx11281-OST0004_UUID: starting recovery
      00000100:00080000:1.0:1597080074.448660:0:13448:0:(import.c:86:import_set_state_nolock()) ffff88079f339000 snx11281-OST0004_UUID: changing import state from DISCONN to CONNECTING
      00000100:00080000:1.0:1597080074.448663:0:13448:0:(import.c:543:import_select_connection()) snx11281-OST0004-osc-ffff88078d8f7000: connect to NID 10.10.100.9@o2ib3 last attempt 8723
      00000100:00080000:1.0:1597080074.448665:0:13448:0:(import.c:543:import_select_connection()) snx11281-OST0004-osc-ffff88078d8f7000: connect to NID 10.10.100.10@o2ib3 last attempt 8866
      00000100:00080000:1.0:1597080074.448666:0:13448:0:(import.c:585:import_select_connection()) snx11281-OST0004-osc-ffff88078d8f7000: tried all connections, increasing latency to 45s
      00000100:00080000:1.0:1597080074.448668:0:13448:0:(import.c:611:import_select_connection()) snx11281-OST0004-osc-ffff88078d8f7000: Connection changing to snx11281-OST0004 (at 10.10.100.9@o2ib3)
      00000100:00080000:1.0:1597080074.448669:0:13448:0:(import.c:619:import_select_connection()) snx11281-OST0004-osc-ffff88078d8f7000: import ffff88079f339000 using connection 10.10.100.9@o2ib3/10.10.100.9@o2ib3
      00000100:00080000:14.0:1597080074.449190:0:9307:0:(import.c:1080:ptlrpc_connect_interpret()) snx11281-OST0004-osc-ffff88078d8f7000: connect to target with instance 551
      00000100:00080000:14.0:1597080074.449205:0:9307:0:(import.c:1288:ptlrpc_connect_interpret()) @@@ snx11281-OST0004-osc-ffff88078d8f7000: evicting (reconnect/recover flags not set: 4)  req@ffff8807d0ad8000 x1674651548694080/t0(0) o8->snx11281-OST0004-osc-ffff88078d8f7000@10.10.100.9@o2ib3:28/4 lens 520/416 e 0 to 0 dl 1597080134 ref 1 fl Interpret:RN/0/0 rc 0/0 job:''
      00000100:00080000:14.0:1597080074.449219:0:9307:0:(import.c:86:import_set_state_nolock()) ffff88079f339000 snx11281-OST0004_UUID: changing import state from CONNECTING to EVICTED
      00000100:02020000:14.0:1597080074.449221:0:9307:0:(import.c:1535:ptlrpc_import_recovery_state_machine()) 167-0: snx11281-OST0004-osc-ffff88078d8f7000: This client was evicted by snx11281-OST0004; in progress operations using this service will fail.
       

      The client saw requests timeout at 1597079855, and then it started recovery/reconnect. A first attempt to 10.10.100.9@o2ib3 failed with ETIMEDOUT (-110) at 1597079910. A second attempt started at 1597079998 to 10.10.100.10@o2ib3, (why do we have such timeout between two?) and failed at 1597079998 with ENODEV(-19). A third started at 1597080074 and found that client was evicted from a server. So 219 seconds from a first connect to a third with eviction, and 56 seconds from a quiesce end to a first connect (275 sec summary). And second and third attempt got reply immediately.

      Aug 10 12:18:42 snx11281n004 kernel: Lustre: snx11281-OST0000: Connection restored to e7dcb43d-d757-d1da-940a-fa23715e018f (at 166@gni6)
      (Looks like server received a connect from first attempt and connected client, but client did not get reply for connect and failed with timedout)
      ...
      Aug 10 12:19:58 snx11281n009 kernel: LustreError: 137-5: snx11281-OST0004_UUID: not available for connect from 166@gni6 (no target). If you are running an HA pair check that the target is mounted on the other server.
      (This is a second attempt to another node .10)
      ...
      Aug 10 12:20:20 snx11281n008 kernel: LustreError: 17823:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 260s: evicting client at 166@gni6  ns: filter-snx11281-OST0004_UUID lock: ffff95af61158d80/0xae761f1dcb5aa5e9 lrc: 3/0,0 mode: PR/PR res: [0xae594e92:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) gid 0 flags: 0x60000400010020 nid: 166@gni6 remote: 0x6c558346a714dd7e expref: 16193 pid: 18561 timeout: 232339 lvb_type: 1
      (Server evicted client)
      

      And a bit later client reconnected and evicted.
      Reconnection logic is based on PING_INTERVAL=obd_timeout/4, for a failed client 300/4=75 seconds. Connection request timeout is INITIAL_CONNECT_TIMEOUT and equal to 15seconds for a failed client. So request send/reply timeout should happened a much early than PING_INTERVAL. I think it is too much to wait reconnection for a 75sec, and waste of time on both side client and server.

      Attachments

        Activity

          People

            aboyko Alexander Boyko
            aboyko Alexander Boyko
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: