  Lustre / LU-12441

Response tracker is not detached on router ping reply


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.4
    • Affects Version/s: Lustre 2.13.0, Lustre 2.12.2
    • Labels: None
    • Severity: 3

    Description

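      This leads to false-positive response "timeouts". In the excerpt below we see the response tracker attached to the MD of the router ping GET, and we see the reply arrive, but the MD is never unlinked ("unlink false"), so the response tracker is never detached. Once the tracker's deadline passes, lnet_finalize_expired_responses() reports a response timeout for a ping that was in fact answered.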

      00000400:00000200:26.0:1560633919.659270:0:2482:0:(router.c:1045:lnet_ping_router_locked()) Check: 12345-93@gni4
      00000400:00000200:26.0:1560633919.659290:0:2482:0:(lib-move.c:4844:LNetGet()) LNetGet msg ffff8807832cdc00 -> 12345-93@gni4
      00000400:00000200:26.0:1560633919.659291:0:2482:0:(lib-msg.c:364:lnet_msg_attach_md()) attached md ffff88078a861f68 to msg ffff8807832cdc00
      00000400:00000200:26.0:1560633919.659293:0:2482:0:(lib-move.c:4505:lnet_attach_rsp_tracker()) Add rspt ffff88078b5ef000 to md ffff88078a861f68 dl 1560633969s ne false
      00000400:00000200:6.0:1560633919.659345:0:2479:0:(lib-msg.c:775:lnet_msg_detach_md()) ffff88078a861f68 ref 0 fl 2 thr -1 opt 10 off 0 size 0 len 272 msg ffff8807832cdc00 unlink false
      00000400:00000200:6.0:1560633919.659466:0:2479:0:(lib-move.c:3890:lnet_parse_reply()) 60@gni4: Reply msg ffff88078c803800 from 12345-93@gni4 of length 80/80 into md 0x65931
      00000400:00000200:6.0:1560633919.659467:0:2479:0:(lib-msg.c:364:lnet_msg_attach_md()) attached md ffff88078a861f68 to msg ffff88078c803800
      00000400:00000200:6.0:1560633919.659474:0:2479:0:(router.c:120:lnet_notify_locked()) Old news
      00000400:00000200:6.0:1560633919.659475:0:2479:0:(lib-msg.c:775:lnet_msg_detach_md()) ffff88078a861f68 ref 0 fl 2 thr -1 opt 10 off 0 size 0 len 272 msg ffff88078c803800 unlink false
      00000400:00000200:6.0:1560633919.659507:0:2479:0:(router.c:120:lnet_notify_locked()) Old news
      00000400:00000200:6.0:1560633919.659578:0:2479:0:(router.c:120:lnet_notify_locked()) Old news
      00000400:00000200:6.0:1560633919.659619:0:2479:0:(router.c:120:lnet_notify_locked()) Old news
      00000400:00000100:26.0:1560633977.003290:0:2482:0:(lib-move.c:2781:lnet_finalize_expired_responses()) Response timed out: md = ffff88078a861f68: nid = 93@gni4
      


            People

              Assignee: Chris Horn (hornc)
              Reporter: Chris Horn (hornc)
              Votes: 0
              Watchers: 4
