[LU-16483] Loss of idle ping causes reconnect even if subsequent ping succeeds Created: 17/Jan/23  Updated: 26/Jan/24  Resolved: 27/Jul/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Chris Horn Assignee: Alena Nikitenko
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-16754 replay-single: test 200 RPC did not t... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

There seems to be some flaw in the idle client ping interval, but maybe I'm just missing something. An idle client sends an OBD_PING to a target every ping_interval seconds (default obd_timeout / 4 = 25 seconds), but there is no consideration of the RPC timeout, so you can end up with multiple ones in flight at same time if RPC timeout is > 25 seconds.
This can lead to odd behavior. For example, if I drop a single OBD ping on the server, then subsequent OBD pings may succeed, but when the one that was dropped hits timeout this causes a reconnect.

ping1 -> dropped
<25 seconds later>
ping2 -> succeeds
<some time later>
ping1 hits timeout, and causes client reconnect

Example showing 6 pings in flight before the first one hits timeout.

00000100:00000040:12.0:1667595564.976159:0:13921:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0  req@000000003cb0eae1 x1748600553999104/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595717 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
00000100:00000040:12.0:1667595591.600170:0:13918:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0  req@00000000cf5effe6 x1748600553999424/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595744 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
00000100:00000040:8.0:1667595618.224202:0:13914:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0  req@000000001e6c674d x1748600553999680/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595771 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
00000100:00000040:12.0:1667595644.848234:0:13918:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0  req@0000000031efedb5 x1748600553999936/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595797 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
00000100:00000040:8.0:1667595671.472178:0:13914:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0  req@00000000fe59e179 x1748600554000192/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595824 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
00000100:00000040:12.0:1667595698.096210:0:13918:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0  req@0000000022f3b477 x1748600554000448/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595851 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
00000100:00000400:12.0:1667595717.552079:0:13921:0:(client.c:2308:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1667595564/real 1667595564]  req@000000003cb0eae1 x1748600553999104/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 1 dl 1667595717 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''


 Comments   
Comment by Andreas Dilger [ 20/Jan/23 ]

Seems like a real bug. The whole point of sending pings with obd_timeout/4?is to ensure at least one makes it through before the timeout, and to catch server failure early enough to reconnect during the recovery window.

Seems like the later ping should save the successful ping XID (or any successful RPC XID) in the import, and the later timeout of the earlier ping (with lower XID) should be ignored.

Comment by Gerrit Updater [ 27/Jan/23 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49807
Subject: LU-16483 ptlrpc: Track highest reply xid
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 68aa10cc110c7a66d9d2f7561d30208abfb8e5f4

Comment by Gerrit Updater [ 09/Feb/23 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49958
Subject: LU-16483 tests: Test patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: afe29157cb586727ab57c9493fd1f1e2345a897f

Comment by Gerrit Updater [ 15/Feb/23 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50011
Subject: LU-16483 tests: Test patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 22f83229cf078550f13141dbdafcca75a4e6580b

Comment by Gerrit Updater [ 22/Mar/23 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50384
Subject: LU-16483 tests: Test patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a58c08fe7e7aa99989173d8753695d225a545493

Comment by Gerrit Updater [ 18/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49807/
Subject: LU-16483 ptlrpc: Track highest reply XID
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: eb1f4a5222039be9f728839ec8f9cde904a1273f

Comment by Peter Jones [ 18/Apr/23 ]

Landed for 2.16

Comment by Andreas Dilger [ 26/Apr/23 ]

Hi Chris,
it looks like the newly-added replay-single test_200 is failing intermittently during testing since it landed to master on 2023-04-18, could you please take a look:

https://testing.whamcloud.com/search?horizon=2332800&status%5B%5D=FAIL&test_set_script_id=f6a12204-32c3-11e0-a61c-52540025f9ae&sub_test_script_id=fea9d884-926c-4e41-b86d-9679d197c5f8&source=sub_tests#redirect

Comment by Chris Horn [ 05/May/23 ]

adilger Sure, I'll take a look.

Comment by Gerrit Updater [ 05/May/23 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50869
Subject: LU-16483 tests: Test patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: caecbcb8daa3c34811cdfabaf5fddcc2f002df8c

Comment by Chris Horn [ 06/May/23 ]

I think the issue is the test assumes that idle_timeout is set to some non-zero value for all the targets.

Comment by Gerrit Updater [ 08/May/23 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50891
Subject: LU-16483 tests: replay-single test_200 fixes
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8002abc25fd6754446790835aa56b8fd0972fde0

Comment by Andreas Dilger [ 29/Jun/23 ]

Alena, would you be able to rebase Chris' patch so that this issue can be fixed.

Comment by Chris Horn [ 29/Jun/23 ]

Sorry for letting this languish, but I have cycles today to pick it up.

Comment by Gerrit Updater [ 27/Jul/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50891/
Subject: LU-16483 tests: replay-single test_200 fixes
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fdfdf5c05cf64294068a5cbfe818b64bd9e577f9

Comment by Peter Jones [ 27/Jul/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:27:24 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.