Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.16.0
-
None
-
3
-
9223372036854775807
Description
There seems to be some flaw in the idle client ping interval, but maybe I'm just missing something. An idle client sends an OBD_PING to a target every ping_interval seconds (default obd_timeout / 4 = 25 seconds), but there is no consideration of the RPC timeout, so you can end up with multiple ones in flight at same time if RPC timeout is > 25 seconds.
This can lead to odd behavior. For example, if I drop a single OBD ping on the server, then subsequent OBD pings may succeed, but when the one that was dropped hits timeout this causes a reconnect.
ping1 -> dropped
<25 seconds later>
ping2 -> succeeds
<some time later>
ping1 hits timeout, and causes client reconnect
Example showing 6 pings in flight before the first one hits timeout.
00000100:00000040:12.0:1667595564.976159:0:13921:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0 req@000000003cb0eae1 x1748600553999104/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595717 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:'' 00000100:00000040:12.0:1667595591.600170:0:13918:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0 req@00000000cf5effe6 x1748600553999424/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595744 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:'' 00000100:00000040:8.0:1667595618.224202:0:13914:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0 req@000000001e6c674d x1748600553999680/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595771 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:'' 00000100:00000040:12.0:1667595644.848234:0:13918:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0 req@0000000031efedb5 x1748600553999936/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595797 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:'' 00000100:00000040:8.0:1667595671.472178:0:13914:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0 req@00000000fe59e179 x1748600554000192/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595824 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:'' 00000100:00000040:12.0:1667595698.096210:0:13918:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0 req@0000000022f3b477 x1748600554000448/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595851 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:'' 00000100:00000400:12.0:1667595717.552079:0:13921:0:(client.c:2308:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1667595564/real 1667595564] req@000000003cb0eae1 x1748600553999104/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 1 dl 1667595717 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
Attachments
Issue Links
- is related to
-
LU-16754 replay-single: test 200 RPC did not time out
- Open