  1. Lustre
  2. LU-16483

Loss of idle ping causes reconnect even if subsequent ping succeeds

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0
    • Lustre 2.16.0
    • None
    • 3

    Description

      The idle client ping interval appears to have a flaw, though I may be missing something. An idle client sends an OBD_PING to a target every ping_interval seconds (default obd_timeout / 4 = 25 seconds), but the interval does not take the RPC timeout into account, so whenever the RPC timeout exceeds 25 seconds several pings can be in flight at the same time.
      This can lead to odd behavior. For example, if a single OBD_PING is dropped on the server, subsequent pings may succeed, but when the dropped ping reaches its timeout the client still reconnects.

      ping1 -> dropped
      <25 seconds later>
      ping2 -> succeeds
      <some time later>
      ping1 hits timeout, and causes client reconnect
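
The timeline above can be reproduced with a small toy model. This is only a sketch of the timing arithmetic, not Lustre's pinger or ptlrpc code: the function names, the `reply_latency` parameter, and the assumption that pings are sent at exact multiples of the interval are all hypothetical and chosen for illustration.

```python
import math


def pings_sent_before_timeout(ping_interval: float, rpc_timeout: float) -> int:
    """Count pings sent strictly before the first ping's timeout fires.

    Assumes (for illustration) pings go out at t = 0, interval, 2*interval, ...
    """
    return math.ceil(rpc_timeout / ping_interval)


def timeline(ping_interval, rpc_timeout, dropped=(1,), reply_latency=0.1):
    """Build a time-sorted list of (time, event) tuples for the toy model.

    dropped: 1-based ping numbers whose reply never arrives.
    reply_latency: hypothetical round-trip time for pings that succeed.
    """
    events = []
    count = pings_sent_before_timeout(ping_interval, rpc_timeout)
    for i in range(1, count + 1):
        sent = (i - 1) * ping_interval
        events.append((sent, f"ping{i} sent"))
        if i in dropped:
            # The dropped ping is only noticed when its own timeout expires,
            # regardless of how many later pings have succeeded.
            events.append((sent + rpc_timeout, f"ping{i} timed out -> reconnect"))
        else:
            events.append((sent + reply_latency, f"ping{i} reply received"))
    return sorted(events)


if __name__ == "__main__":
    for when, what in timeline(ping_interval=25, rpc_timeout=153):
        print(f"{when:7.1f}s  {what}")
```

With the values seen in the log excerpt in this report (pings roughly 26.6 seconds apart, deadline 153 seconds after send), `pings_sent_before_timeout(26.624, 153)` gives 6, which matches the six sends that precede the first expiry.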

      Example showing 6 pings in flight before the first one hits timeout.

      00000100:00000040:12.0:1667595564.976159:0:13921:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0  req@000000003cb0eae1 x1748600553999104/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595717 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
      00000100:00000040:12.0:1667595591.600170:0:13918:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0  req@00000000cf5effe6 x1748600553999424/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595744 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
      00000100:00000040:8.0:1667595618.224202:0:13914:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0  req@000000001e6c674d x1748600553999680/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595771 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
      00000100:00000040:12.0:1667595644.848234:0:13918:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0  req@0000000031efedb5 x1748600553999936/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595797 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
      00000100:00000040:8.0:1667595671.472178:0:13914:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0  req@00000000fe59e179 x1748600554000192/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595824 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
      00000100:00000040:12.0:1667595698.096210:0:13918:0:(niobuf.c:939:ptl_send_rpc()) @@@ send flags=0  req@0000000022f3b477 x1748600554000448/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 0 dl 1667595851 ref 2 fl Rpc:Nr/0/ffffffff rc 0/-1 job:''
      00000100:00000400:12.0:1667595717.552079:0:13921:0:(client.c:2308:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1667595564/real 1667595564]  req@000000003cb0eae1 x1748600553999104/t0(0) o400->lustre-OST0000-osc-ffff9e68576b3800@16@kfi:28/4 lens 224/224 e 0 to 1 dl 1667595717 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
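
To check a longer debug log for the same pattern, something like the following could be used. It is a rough sketch that assumes the line layout seen in the excerpt above (timestamp in the fourth colon-separated field, `x<xid>`, `o<opcode>`, `dl <deadline>`); the regular expression and the `summarize` helper are my own and are not part of any Lustre tool. Since the excerpt shows no replies, it counts pings sent before the first expiry rather than proving each was still outstanding.

```python
import re

# Assumed layout, taken from the excerpt in this report:
# mask:mask:cpu:TIMESTAMP:...:(file.c:line:func()) ... xXID/t0(0) oOPC->... dl DEADLINE
LINE_RE = re.compile(
    r"[0-9a-f]+:[0-9a-f]+:[\d.]+:(?P<ts>\d+\.\d+):.*?"
    r"\([\w.]+:\d+:(?P<func>\w+)\(\)\).*?"
    r"\bx(?P<xid>\d+)/t\d+.*?\bo(?P<opc>\d+)->.*?\bdl (?P<dl>\d+)\b"
)

OBD_PING = "400"  # opcode shown as o400 in the log lines


def summarize(lines):
    """Summarize OBD_PING sends relative to the first expired request."""
    sends = {}           # xid -> (send timestamp, deadline)
    first_expiry = None  # (xid, timestamp) of the first timed-out ping
    for line in lines:
        m = LINE_RE.search(line)
        if not m or m["opc"] != OBD_PING:
            continue
        ts = float(m["ts"])
        if m["func"] == "ptl_send_rpc":
            sends[m["xid"]] = (ts, int(m["dl"]))
        elif m["func"] == "ptlrpc_expire_one_request" and first_expiry is None:
            first_expiry = (m["xid"], ts)

    if first_expiry is None:
        return {"sent_before_first_expiry": len(sends),
                "expired_xid": None, "timeout_budget": None}

    xid, expired_at = first_expiry
    budget = None
    if xid in sends:
        sent_at, deadline = sends[xid]
        budget = deadline - int(sent_at)  # seconds between send and deadline
    return {
        "sent_before_first_expiry":
            sum(1 for sent_at, _ in sends.values() if sent_at < expired_at),
        "expired_xid": xid,
        "timeout_budget": budget,
    }
```

Fed the seven lines above, this should report six pings sent before the first expiry and a send-to-deadline gap of 153 seconds for x1748600553999104, well over the ping interval.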
      

            People

              anikitenko Alena Nikitenko (Inactive)
              hornc Chris Horn