Details
-
Bug
-
Resolution: Fixed
-
Critical
-
None
-
None
-
2
-
9223372036854775807
Description
In case of overloaded target or overloaded network, it happens that clients are evicted too early, whereas they may not have just started the resend, or they started but not only have time to survive a switch to a failover pair, but sometimes no time even for one resend. The problem is in the ping evictor timeout which does not match the client RPC timeouts based on AT. Examples:
Server evicts the 60@gni4 client after 90+ seconds of silence:
Jul 22 14:50:14 snx11922n002 kernel: Lustre: snx11922-MDT0000: haven't heard from client 091e68e9-5525-41c9-9fa0-8311165b74ea (at 60@gni4) in 92 seconds. I think it's dead, and I am evicting it. exp 000000007bcf39d1, cur 1690055414 expire 1690055324 last 1690055322
From the client side, it just gets evicted:
[119444.413895] LustreError: 167-0: snx11922-MDT0000-mdc-ffff888f8b22a800: This client was evicted by snx11922-MDT0000; in progress operations using this service will fail.
and there are no attempts to resend RPC from the client side.
Another one:
2023-08-07 23:29:19 [70984.932454] Lustre: 12111:0:(client.c:2310:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1691472474/real 1691472474] req@00000000c3fa3ae6 x1773568329048768/t0(0) o400->snx11922-MDT0001-mdc-ffff889f88a14000@10.12.2.51@o2ib4001:12/10 lens 224/224 e 0 to 1 dl 1691472560 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
2023-08-07 23:30:32 [71057.552316] LustreError: 167-0: snx11922-MDT0001-mdc-ffff889f88a14000: This client was evicted by snx11922-MDT0001; in progress operations using this service will fail.
the resend did not succeeded to get to the target in time.
RPC deadline 1691472560 - RPC sent time 1691472474 = 86 s to expire ping request. and the next reconnect took 73s for the client
but the eviction occurred after instead of ping_interval (15) * ping_evict_timeout_multiplier (6) = 90s
another one:
client reconnection log has 3 connection attempt
10.12.0.51@o2ib4/10.12.2.51@o2ib4001 - 40sec, timeout
10.12.0.50@o2ib4/10.12.0.50@o2ib4000 -30sec, failover pair, error
10.12.0.51@o2ib4/10.12.2.51@o2ib4001 - <1 sec, success
00000100:00080000:60.0:1691472560.304907:0:12111:0:(import.c:86:import_set_state_nolock()) 0000000016d808db snx11922-MDT0001_UUID: changing import state from FULL to DISCONN 00000100:00080000:59.0:1691472560.335863:0:58946:0:(import.c:86:import_set_state_nolock()) 0000000016d808db snx11922-MDT0001_UUID: changing import state from DISCONN to CONNECTING 00000100:00080000:59.0:1691472560.335869:0:58946:0:(import.c:615:import_select_connection()) snx11922-MDT0001-mdc-ffff889f88a14000: import 0000000016d808db using connection 10.12.0.51@o2ib4/10.12.2.51@o2ib4001 00000100:00080000:36.0:1691472601.229134:0:12055:0:(import.c:86:import_set_state_nolock()) 0000000016d808db snx11922-MDT0001_UUID: changing import state from CONNECTING to DISCONN 00000100:00080000:36.0:1691472601.229140:0:12055:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of snx11922-MDT0001_UUID on 10.12.0.51@o2ib4 failed (-110) 00000100:00080000:0.0:1691472601.229171:0:58533:0:(import.c:86:import_set_state_nolock()) 0000000016d808db snx11922-MDT0001_UUID: changing import state from DISCONN to CONNECTING 00000100:00080000:0.0:1691472601.229181:0:58533:0:(import.c:615:import_select_connection()) snx11922-MDT0001-mdc-ffff889f88a14000: import 0000000016d808db using connection 10.12.0.50@o2ib4/10.12.0.50@o2ib4000 00000100:00080000:40.0:1691472632.857384:0:12055:0:(import.c:86:import_set_state_nolock()) 0000000016d808db snx11922-MDT0001_UUID: changing import state from CONNECTING to DISCONN 00000100:00080000:40.0:1691472632.857389:0:12055:0:(import.c:1435:ptlrpc_connect_interpret()) recovery of snx11922-MDT0001_UUID on 10.12.0.50@o2ib4 failed (-19) 00000100:00080000:0.0:1691472632.857400:0:58533:0:(import.c:86:import_set_state_nolock()) 0000000016d808db snx11922-MDT0001_UUID: changing import state from DISCONN to CONNECTING 00000100:00080000:0.0:1691472632.857409:0:58533:0:(import.c:615:import_select_connection()) snx11922-MDT0001-mdc-ffff889f88a14000: import 0000000016d808db using connection 10.12.0.51@o2ib4/10.12.2.51@o2ib4001 00000100:00080000:40.0:1691472632.857826:0:12055:0:(import.c:1075:ptlrpc_connect_interpret()) snx11922-MDT0001-mdc-ffff889f88a14000: connect to target with instance 759 00000100:00080000:40.0:1691472632.857834:0:12055:0:(import.c:1292:ptlrpc_connect_interpret()) @@@ snx11922-MDT0001-mdc-ffff889f88a14000: evicting (reconnect/recover flags not set: 4) req@00000000249bbeb0 x1773568329071424/t0(0) o38->snx11922-MDT0001-mdc-ffff889f88a14000@10.12.2.51@o2ib4001:12/10 lens 520/416 e 0 to 0 dl 1691472678 ref 1 fl Interpret:RNQU/0/0 rc 0/0 job:'' 00000100:00080000:40.0:1691472632.857840:0:12055:0:(import.c:86:import_set_state_nolock()) 0000000016d808db snx11922-MDT0001_UUID: changing import state from CONNECTING to EVICTED
the solution must be to move the ping eviction logic to the AT as was already done for the ldlm prolong timeout.