Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.17.0
    • None
    • None
    • 2
    • 9223372036854775807

    Description

      In case of overloaded target or overloaded network, it happens that clients are evicted too early, whereas they may not have just started the resend, or they started but not only have time to survive a switch to a failover pair, but sometimes no time even for one resend. The problem is in the ping evictor timeout which does not match the client RPC timeouts based on AT. Examples:

      Server evicts the 60@gni4 client after 90+ seconds of silence:

      Jul 22 14:50:14 snx11922n002 kernel: Lustre: snx11922-MDT0000: haven't heard from client 091e68e9-5525-41c9-9fa0-8311165b74ea (at 60@gni4) in 92 seconds. I think it's dead, and I am evicting it. exp 000000007bcf39d1, cur 1690055414 expire 1690055324 last 1690055322

      From the client side, it just gets evicted:

      [119444.413895] LustreError: 167-0: snx11922-MDT0000-mdc-ffff888f8b22a800: This client was evicted by snx11922-MDT0000; in progress operations using this service will fail.

      and there are no attempts to resend RPC from the client side.

      Another one:

      2023-08-07 23:29:19 [70984.932454] Lustre: 12111:0:(client.c:2310:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1691472474/real 1691472474] req@00000000c3fa3ae6 x1773568329048768/t0(0) o400->snx11922-MDT0001-mdc-ffff889f88a14000@10.12.2.51@o2ib4001:12/10 lens 224/224 e 0 to 1 dl 1691472560 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
      2023-08-07 23:30:32 [71057.552316] LustreError: 167-0: snx11922-MDT0001-mdc-ffff889f88a14000: This client was evicted by snx11922-MDT0001; in progress operations using this service will fail.

      the resend did not succeeded to get to the target in time.

      RPC deadline 1691472560 - RPC sent time 1691472474 = 86 s to expire ping request. and the next reconnect took 73s for the client
      but the eviction occurred after instead of ping_interval (15) * ping_evict_timeout_multiplier (6) = 90s

      another one:

      client reconnection log has 3 connection attempt
      10.12.0.51@o2ib4/10.12.2.51@o2ib4001 - 40sec, timeout
      10.12.0.50@o2ib4/10.12.0.50@o2ib4000 -30sec, failover pair, error
      10.12.0.51@o2ib4/10.12.2.51@o2ib4001 - <1 sec, success

      00000100:00080000:60.0:1691472560.304907:0:12111:0:(import.c:86:import_set_state_nolock()) 0000000016d808db snx11922-MDT0001_UUID: changing import state from FULL to DISCONN
      00000100:00080000:59.0:1691472560.335863:0:58946:0:(import.c:86:import_set_state_nolock()) 0000000016d808db snx11922-MDT0001_UUID: changing import state from DISCONN to CONNECTING
      00000100:00080000:59.0:1691472560.335869:0:58946:0:(import.c:615:import_select_connection()) snx11922-MDT0001-mdc-ffff889f88a14000: import 0000000016d808db using connection 10.12.0.51@o2ib4/10.12.2.51@o2ib4001
      00000100:00080000:36.0:1691472601.229134:0:12055:0:(import.c:86:import_set_state_nolock()) 0000000016d808db snx11922-MDT0001_UUID: changing import state from CONNECTING to DISCONN
      00000100:00080000:36.0:1691472601.229140:0:12055:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of snx11922-MDT0001_UUID on 10.12.0.51@o2ib4 failed (-110)
      00000100:00080000:0.0:1691472601.229171:0:58533:0:(import.c:86:import_set_state_nolock()) 0000000016d808db snx11922-MDT0001_UUID: changing import state from DISCONN to CONNECTING
      00000100:00080000:0.0:1691472601.229181:0:58533:0:(import.c:615:import_select_connection()) snx11922-MDT0001-mdc-ffff889f88a14000: import 0000000016d808db using connection 10.12.0.50@o2ib4/10.12.0.50@o2ib4000
      00000100:00080000:40.0:1691472632.857384:0:12055:0:(import.c:86:import_set_state_nolock()) 0000000016d808db snx11922-MDT0001_UUID: changing import state from CONNECTING to DISCONN
      00000100:00080000:40.0:1691472632.857389:0:12055:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of snx11922-MDT0001_UUID on 10.12.0.50@o2ib4 failed (-19)
      00000100:00080000:0.0:1691472632.857400:0:58533:0:(import.c:86:import_set_state_nolock()) 0000000016d808db snx11922-MDT0001_UUID: changing import state from DISCONN to CONNECTING
      00000100:00080000:0.0:1691472632.857409:0:58533:0:(import.c:615:import_select_connection()) snx11922-MDT0001-mdc-ffff889f88a14000: import 0000000016d808db using connection 10.12.0.51@o2ib4/10.12.2.51@o2ib4001
      00000100:00080000:40.0:1691472632.857826:0:12055:0:(import.c:1075:ptlrpc_connect_interpret()) snx11922-MDT0001-mdc-ffff889f88a14000: connect to target with instance 759
      00000100:00080000:40.0:1691472632.857834:0:12055:0:(import.c:1292:ptlrpc_connect_interpret()) @@@ snx11922-MDT0001-mdc-ffff889f88a14000: evicting (reconnect/recover flags not set: 4) req@00000000249bbeb0 x1773568329071424/t0(0) o38->snx11922-MDT0001-mdc-ffff889f88a14000@10.12.2.51@o2ib4001:12/10 lens 520/416 e 0 to 0 dl 1691472678 ref 1 fl Interpret:RNQU/0/0 rc 0/0 job:''
      00000100:00080000:40.0:1691472632.857840:0:12055:0:(import.c:86:import_set_state_nolock()) 0000000016d808db snx11922-MDT0001_UUID: changing import state from CONNECTING to EVICTED

      the solution must be to move the ping eviction logic to the AT as was already done for the ldlm prolong timeout.

      Attachments

        Activity

          People

            vitaly_fertman Vitaly Fertman
            vitaly_fertman Vitaly Fertman
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: