[LU-16483] Loss of idle ping causes reconnect even if subsequent ping succeeds Created: 17/Jan/23 Updated: 26/Jan/24 Resolved: 27/Jul/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.16.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Chris Horn | Assignee: | Alena Nikitenko |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
There seems to be some flaw in the idle client ping interval, but maybe I'm just missing something. An idle client sends an OBD_PING to a target every ping_interval seconds (default obd_timeout / 4 = 25 seconds), but there is no consideration of the RPC timeout, so you can end up with multiple pings in flight at the same time if the RPC timeout is > 25 seconds.

ping1 -> dropped

Example showing 6 pings in flight before the first one hits its timeout:

00000100:00000040:12.0:1667595564.976159:0:13921:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0 req@000000003cb0eae1 x1748600553999104/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595717 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
00000100:00000040:12.0:1667595591.600170:0:13918:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0 req@00000000cf5effe6 x1748600553999424/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595744 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
00000100:00000040:8.0:1667595618.224202:0:13914:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0 req@000000001e6c674d x1748600553999680/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595771 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
00000100:00000040:12.0:1667595644.848234:0:13918:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0 req@0000000031efedb5 x1748600553999936/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595797 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
00000100:00000040:8.0:1667595671.472178:0:13914:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0 req@00000000fe59e179 x1748600554000192/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595824 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
00000100:00000040:12.0:1667595698.096210:0:13918:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0 req@0000000022f3b477 x1748600554000448/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595851 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
00000100:00000400:12.0:1667595717.552079:0:13921:0:(client.c:2308:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1667595564/real 1667595564] req@000000003cb0eae1 x1748600553999104/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 1 dl 1667595717 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'' |
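For illustration only, a minimal arithmetic sketch (plain C, not Lustre code) of the situation above: with pings going out roughly every 26.6 seconds (per the log timestamps) and the first ping's deadline about 153 seconds after it was sent, six pings are already in flight when the first one expires.

```c
/*
 * Minimal arithmetic sketch, not Lustre code.  The values below are read
 * from the log above: pings are sent ~26.6s apart, and the first ping's
 * deadline (dl 1667595717) is ~153s after it was sent (1667595564).
 */
#include <stdio.h>

int main(void)
{
	double ping_interval = 26.6;	/* seconds between idle pings (from log) */
	double rpc_deadline  = 153.0;	/* seconds until the first ping times out (from log) */

	/* Pings sent and still unanswered when the first one expires. */
	int in_flight = (int)(rpc_deadline / ping_interval) + 1;

	printf("pings in flight before the first timeout: %d\n", in_flight);
	return 0;
}
```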
| Comments |
| Comment by Andreas Dilger [ 20/Jan/23 ] |
|
Seems like a real bug. The whole point of sending pings with obd_timeout/4 is to ensure at least one makes it through before the timeout, and to catch server failure early enough to reconnect during the recovery window. Seems like the later ping should save the successful ping XID (or any successful RPC XID) in the import, and the later timeout of the earlier ping (with lower XID) should be ignored. |
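A hedged sketch of that idea in C, for illustration only: the struct and field names here (idle_import, last_replied_xid) are hypothetical and do not correspond to the real struct obd_import layout or to the patch that eventually landed.

```c
/*
 * Sketch of the suggestion above: remember the XID of the newest ping (or
 * any RPC) that got a reply, and ignore the timeout of an older ping with a
 * lower XID, since a later reply proves the connection is still alive.
 * Names are hypothetical, not Lustre's actual import structures.
 */
#include <stdbool.h>
#include <stdint.h>

struct idle_import {
	uint64_t last_replied_xid;	/* highest XID that received a reply */
};

/* Called whenever an RPC on the import completes successfully. */
void import_note_reply(struct idle_import *imp, uint64_t xid)
{
	if (xid > imp->last_replied_xid)
		imp->last_replied_xid = xid;
}

/*
 * Called when a ping times out: only treat it as a lost connection (and
 * reconnect) if no ping with a higher XID has already been answered.
 */
bool ping_timeout_needs_reconnect(const struct idle_import *imp,
				  uint64_t timed_out_xid)
{
	return timed_out_xid > imp->last_replied_xid;
}
```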
| Comment by Gerrit Updater [ 27/Jan/23 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49807 |
| Comment by Gerrit Updater [ 09/Feb/23 ] |
|
|
| Comment by Gerrit Updater [ 15/Feb/23 ] |
|
|
| Comment by Gerrit Updater [ 22/Mar/23 ] |
|
|
| Comment by Gerrit Updater [ 18/Apr/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49807/ |
| Comment by Peter Jones [ 18/Apr/23 ] |
|
Landed for 2.16 |
| Comment by Andreas Dilger [ 26/Apr/23 ] |
|
Hi Chris, |
| Comment by Chris Horn [ 05/May/23 ] |
|
adilger Sure, I'll take a look. |
| Comment by Gerrit Updater [ 05/May/23 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50869 |
| Comment by Chris Horn [ 06/May/23 ] |
|
I think the issue is that the test assumes idle_timeout is set to some non-zero value for all of the targets. |
| Comment by Gerrit Updater [ 08/May/23 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50891 |
| Comment by Andreas Dilger [ 29/Jun/23 ] |
|
Alena, would you be able to rebase Chris' patch so that this issue can be fixed? |
| Comment by Chris Horn [ 29/Jun/23 ] |
|
Sorry for letting this languish, but I have cycles today to pick it up. |
| Comment by Gerrit Updater [ 27/Jul/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50891/ |
| Comment by Peter Jones [ 27/Jul/23 ] |
|
Landed for 2.16 |