[LU-16483] Loss of idle ping causes reconnect even if subsequent ping succeeds

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0
    • Lustre 2.16.0
    • None
    • 3

    Description

      There seems to be a flaw in the idle client ping interval, but maybe I'm just missing something. An idle client sends an OBD_PING to a target every ping_interval seconds (default obd_timeout / 4 = 25 seconds), but the RPC timeout is not taken into account, so you can end up with multiple pings in flight at the same time if the RPC timeout is > 25 seconds.
      This can lead to odd behavior. For example, if I drop a single OBD ping on the server, subsequent OBD pings may succeed, but when the dropped ping hits its timeout, the client reconnects anyway.

      ping1 -> dropped
      <25 seconds later>
      ping2 -> succeeds
      <some time later>
      ping1 hits timeout, and causes client reconnect
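
The sequence above can be sketched as a small timeline model. This is not Lustre code, just an illustration under stated assumptions: the names `simulate_idle_pings` and `replies_before_first_reconnect` are hypothetical, replies to non-dropped pings are treated as instantaneous, and the 153-second RPC timeout is taken from the difference between the send time and the `dl` deadline in the log excerpt in this report.

```python
def simulate_idle_pings(ping_interval, rpc_timeout, dropped, horizon):
    """Build a time-ordered event list for an idle client.

    ping_interval, rpc_timeout and horizon are in whole seconds.
    dropped is a set of zero-based ping indices the server never answers.
    Assumption: a ping that is not dropped is answered immediately.
    """
    events = []
    for index, sent in enumerate(range(0, horizon + 1, ping_interval)):
        name = f"ping{index + 1}"
        events.append((sent, f"{name} sent"))
        if index in dropped:
            deadline = sent + rpc_timeout
            if deadline <= horizon:
                # The behavior described in this report: the timeout of a
                # dropped ping triggers a reconnect regardless of later replies.
                events.append((deadline, f"{name} timed out -> reconnect"))
        else:
            events.append((sent, f"{name} replied"))
    # sorted() is stable, so events at the same time keep insertion order.
    return sorted(events, key=lambda event: event[0])


def replies_before_first_reconnect(events):
    """Return (reconnect_time, successful_replies_seen_before_it)."""
    replies = 0
    for when, what in events:
        if what.endswith("reconnect"):
            return when, replies
        if what.endswith("replied"):
            replies += 1
    return None, replies


if __name__ == "__main__":
    timeline = simulate_idle_pings(ping_interval=25, rpc_timeout=153,
                                   dropped={0}, horizon=200)
    for when, what in timeline:
        print(f"t={when:>3}s  {what}")
    print(replies_before_first_reconnect(timeline))
```

With these inputs, only the first ping is dropped, yet the model still reports a reconnect at its deadline after several later pings have already been answered.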

      Example showing 6 pings in flight before the first one hits timeout.

      00000100:00000040:12.0:1667595564.976159:0:13921:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0  req@000000003cb0eae1 x1748600553999104/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595717 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
      00000100:00000040:12.0:1667595591.600170:0:13918:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0  req@00000000cf5effe6 x1748600553999424/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595744 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
      00000100:00000040:8.0:1667595618.224202:0:13914:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0  req@000000001e6c674d x1748600553999680/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595771 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
      00000100:00000040:12.0:1667595644.848234:0:13918:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0  req@0000000031efedb5 x1748600553999936/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595797 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
      00000100:00000040:8.0:1667595671.472178:0:13914:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0  req@00000000fe59e179 x1748600554000192/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595824 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
      00000100:00000040:12.0:1667595698.096210:0:13918:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0  req@0000000022f3b477 x1748600554000448/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595851 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
      00000100:00000400:12.0:1667595717.552079:0:13921:0:(client.c:2308:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1667595564/real 1667595564]  req@000000003cb0eae1 x1748600553999104/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 1 dl 1667595717 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
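
The arithmetic behind the log excerpt above can be checked with a short script. The timestamps and `dl` deadlines are copied from the `ptl_send_rpc()` and `ptlrpc_expire_one_request()` lines; the function name `summarize` is hypothetical, and the effective RPC timeout is assumed to be `dl` minus the whole-second send time, as in the `[sent .../real ...]` field.

```python
# (send timestamp, deadline "dl") pairs copied from the ptl_send_rpc() lines.
SENDS = [
    (1667595564.976159, 1667595717),
    (1667595591.600170, 1667595744),
    (1667595618.224202, 1667595771),
    (1667595644.848234, 1667595797),
    (1667595671.472178, 1667595824),
    (1667595698.096210, 1667595851),
]

# Timestamp of the ptlrpc_expire_one_request() line for the first ping.
FIRST_EXPIRY = 1667595717.552079


def summarize(sends, first_expiry):
    """Return (gaps between sends, per-ping timeouts, pings sent before first expiry)."""
    times = [sent for sent, _ in sends]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    # Assumption: timeout = deadline minus the send time truncated to seconds.
    timeouts = [deadline - int(sent) for sent, deadline in sends]
    in_flight = sum(1 for sent in times if sent < first_expiry)
    return gaps, timeouts, in_flight


if __name__ == "__main__":
    gaps, timeouts, in_flight = summarize(SENDS, FIRST_EXPIRY)
    print("gaps between pings (s):", [round(gap, 3) for gap in gaps])
    print("timeout per ping (s):", timeouts)
    print("pings sent before the first expiry:", in_flight)
```

The pings in this excerpt go out a little under 27 seconds apart, while each carries a deadline 153 seconds after it was sent, which is why six of them are outstanding when the first one expires.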
      

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.16


            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50891/
            Subject: LU-16483 tests: replay-single test_200 fixes
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fdfdf5c05cf64294068a5cbfe818b64bd9e577f9

            hornc Chris Horn added a comment -

            Sorry for letting this languish, but I have cycles today to pick it up.


            adilger Andreas Dilger added a comment -

            Alena, would you be able to rebase Chris' patch so that this issue can be fixed?

            gerrit Gerrit Updater added a comment - - edited

            "Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50891
            Subject: LU-16483 tests: replay-single test_200 fixes
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8002abc25fd6754446790835aa56b8fd0972fde0

            hornc Chris Horn added a comment -

            I think the issue is the test assumes that idle_timeout is set to some non-zero value for all the targets.


            gerrit Gerrit Updater added a comment -

            "Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50869
            Subject: LU-16483 tests: Test patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: caecbcb8daa3c34811cdfabaf5fddcc2f002df8c

            hornc Chris Horn added a comment -

            adilger Sure, I'll take a look.


            adilger Andreas Dilger added a comment -

            Hi Chris,
            it looks like the newly added replay-single test_200 has been failing intermittently in testing since it landed on master on 2023-04-18. Could you please take a look?

            https://testing.whamcloud.com/search?horizon=2332800&status%5B%5D=FAIL&test_set_script_id=f6a12204-32c3-11e0-a61c-52540025f9ae&sub_test_script_id=fea9d884-926c-4e41-b86d-9679d197c5f8&source=sub_tests#redirect

            pjones Peter Jones added a comment -

            Landed for 2.16


            People

              anikitenko Alena Nikitenko (Inactive)
              hornc Chris Horn