Lustre / LU-5534

LNET_EVENT_SEND delayed until after RPC had timed out

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.7.0
    • None
    • 3
    • 15406

    Description

      On Lola, some OBD_PING RPCs had "rq_real_sent == 0" when they timed out. This indicates that the LNET_EVENT_SEND events had not occurred by then. The RPCs were supposed to go through two routers, on which we had 0.1% message drop rules applied in both directions. One occurrence looks like this:

      Aug 18 07:07:34 lola-24 kernel: Lustre: 3738:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1408370847/real 0]  req@ffff880ff9115000 x1476487418549148/t0(0) o400->soaked-MDT0000-mdc-ffff8810329a9800@192.168.1.108@o2ib:12/10 lens 224/224 e 0 to 1 dl 1408370854 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      Aug 18 07:07:34 lola-24 kernel: Lustre: 3741:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1408370847/real 0]  req@ffff880fd1896c00 x1476487418549160/t0(0) o400->soaked-OST0001-osc-ffff8810329a9800@192.168.1.103@o2ib:28/4 lens 224/224 e 0 to 1 dl 1408370854 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      Aug 18 07:07:34 lola-24 kernel: Lustre: soaked-OST0001-osc-ffff8810329a9800: Connection to soaked-OST0001 (at 192.168.1.103@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Aug 18 07:07:34 lola-24 kernel: Lustre: 3743:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1408370847/real 0]  req@ffff880ffecaec00 x1476487418549168/t0(0) o400->soaked-OST0003-osc-ffff8810329a9800@192.168.1.105@o2ib:28/4 lens 224/224 e 0 to 1 dl 1408370854 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      Aug 18 07:07:34 lola-24 kernel: Lustre: soaked-OST0003-osc-ffff8810329a9800: Connection to soaked-OST0003 (at 192.168.1.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Aug 18 07:07:34 lola-24 kernel: Lustre: soaked-OST0001-osc-ffff8810329a9800: Connection restored to soaked-OST0001 (at 192.168.1.103@o2ib)
      Aug 18 07:07:34 lola-24 kernel: Lustre: soaked-OST0003-osc-ffff8810329a9800: Connection restored to soaked-OST0003 (at 192.168.1.105@o2ib)
      Aug 18 07:07:34 lola-24 kernel: Lustre: soaked-MDT0000-mdc-ffff8810329a9800: Connection to soaked-MDT0000 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Aug 18 07:07:34 lola-24 kernel: Lustre: soaked-MDT0000-mdc-ffff8810329a9800: Connection restored to soaked-MDT0000 (at 192.168.1.108@o2ib)
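
      To find other occurrences in a larger console log, a small filter along the following lines might help. It is only a sketch: it assumes the "real" field in the ptlrpc_expire_one_request() message corresponds to rq_real_sent, that each message sits on a single line (wrapped lines need to be rejoined first), and that the field layout matches the excerpt above. The names EXPIRE_RE and unsent_timeouts are made up for this example and are not part of Lustre.

```python
import re

# Matches the "[sent <epoch>/real <epoch>]", xid, opcode, target and "dl <epoch>"
# fields as they appear in the excerpt above (layout assumed, not guaranteed).
EXPIRE_RE = re.compile(
    r"\[sent (?P<sent>\d+)/real (?P<real>\d+)\].*?"
    r"x(?P<xid>\d+)/t\d+\(\d+\) o(?P<opc>\d+)->(?P<target>\S+?)@.*?"
    r" dl (?P<dl>\d+) "
)


def unsent_timeouts(lines):
    """Yield one dict per expired request whose 'real' sent time is still 0."""
    for line in lines:
        m = EXPIRE_RE.search(line)
        if m is None:
            continue  # not a request-expiry message
        if int(m.group("real")) != 0:
            continue  # the send event was recorded before the timeout
        yield {
            "xid": int(m.group("xid")),
            "opcode": int(m.group("opc")),
            "target": m.group("target"),
            # seconds between the recorded send time and the deadline
            "window": int(m.group("dl")) - int(m.group("sent")),
        }


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py < console.log
    for hit in unsent_timeouts(sys.stdin):
        print(hit)
```

      For the first request in the excerpt this reports opcode 400 against soaked-MDT0000-mdc-ffff8810329a9800 with a 7 second window between the sent time and the deadline.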
      

          People

            wc-triage WC Triage
            liwei Li Wei (Inactive)