  1. Lustre
  2. LU-3774

LNet does not respond to requests with wrong match bits


Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • None
    • None
    • 3
    • 9773

    Description

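      While investigating long timeouts with bulk requests, I found that LNet drops requests with a wrong xid without sending a response carrying an error code, so the application gets stuck in a bulk timeout loop. In the log below, the client times out and resends the bulk with a new xid (x1443516130526000), while the incoming GET still carries the old match bits (MB=0x520deca300328, i.e. 1443516130525992) and is dropped.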

      00000100:00000400:0.0:1376644405.966092:0:11157:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1376644402/real 1376644402]  req@ffff88003a96d800 x1443516130525992/t0(0) o4->lustre-OST0000-osc-ffff88001cd0a800@0@lo:6/4 lens 456/416 e 0 to 1 dl 1376644405 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      00000400:00000001:0.0:1376644405.966117:0:11157:0:(lib-msg.c:46:lnet_build_unlink_event()) Process entered
      00000400:00000001:0.0:1376644405.966118:0:11157:0:(lib-msg.c:55:lnet_build_unlink_event()) Process leaving
      00000100:00000001:0.0:1376644405.966121:0:11157:0:(events.c:95:reply_in_callback()) Process entered
      00000100:00000200:0.0:1376644405.966122:0:11157:0:(events.c:97:reply_in_callback()) @@@ type 5, status 0  req@ffff88003a96d800 x1443516130525992/t0(0) o4->lustre-OST0000-osc-ffff88001cd0a800@0@lo:6/4 lens 456/416 e 0 to 1 dl 1376644405 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      00000100:00000200:0.0:1376644405.966127:0:11157:0:(events.c:118:reply_in_callback()) @@@ unlink  req@ffff88003a96d800 x1443516130525992/t0(0) o4->lustre-OST0000-osc-ffff88001cd0a800@0@lo:6/4 lens 456/416 e 0 to 1 dl 1376644405 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      00000100:00080000:0.0:1376644405.967280:0:11157:0:(client.c:1724:ptlrpc_check_set()) resend bulk old x1443516130525992 new x1443516130526000
      00000400:00000010:1.0:1376644405.968303:0:11829:0:(lib-lnet.h:333:lnet_msg_alloc()) kmalloced 'msg': 336 at ffff88002fb85600 (tot 48234866).
      00000400:00000200:1.0:1376644405.968343:0:11829:0:(lib-move.c:163:lnet_match_md()) Request from 12345-0@lo of length 102400 into portal 8 MB=0x520deca300328
      00000400:00000100:1.0:1376644405.968345:0:11829:0:(lib-move.c:1920:lnet_parse_get()) Dropping GET from 12345-0@lo portal 8 match 1443516130525992 offset 0 length 102400
      00000400:00000010:1.0:1376644405.968346:0:11829:0:(lib-lnet.h:344:lnet_msg_free()) kfreed 'msg': 336 at ffff88002fb85600 (tot 48234530).
      00000100:00000001:1.0:1376644405.968349:0:11829:0:(events.c:381:server_bulk_callback()) Process entered
      00000100:00000200:1.0:1376644405.968350:0:11829:0:(events.c:392:server_bulk_callback()) event type 4, status 0, desc ffff88003dbe6c00
      00000100:00000001:1.0:1376644405.968352:0:11829:0:(events.c:419:server_bulk_callback()) Process leaving
      

      I think the LNet node should send a response when the xid is not registered for a transfer. That way we would avoid the long wait in the bulk loop and could send an early reply to the failed node.
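
      To make the proposal concrete, here is a minimal sketch of the idea as a toy model. It is not LNet code: `toy_portal`, `toy_parse_get()`, the `nak_on_miss` switch and the result codes are all hypothetical names invented for illustration, and the real `lnet_parse_get()` path and wire protocol are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes; these are NOT LNet's actual wire protocol. */
enum toy_get_result {
    TOY_GET_REPLY,     /* match found, bulk data would be sent */
    TOY_GET_DROP,      /* behaviour seen in the log: silent drop */
    TOY_GET_NAK_NOENT  /* proposed: tell the peer that nothing matched */
};

/* Toy stand-in for the match entries attached to one portal. */
struct toy_portal {
    const uint64_t *match_bits;  /* match bits currently registered */
    size_t          count;       /* number of registered entries */
    int             nak_on_miss; /* 0 = drop silently, 1 = answer with a NAK */
};

/* Look up the match bits of an incoming GET and decide how to answer. */
static enum toy_get_result
toy_parse_get(const struct toy_portal *portal, uint64_t mbits)
{
    size_t i;

    assert(portal != NULL);

    for (i = 0; i < portal->count; i++) {
        if (portal->match_bits[i] == mbits)
            return TOY_GET_REPLY;
    }

    /* Nothing registered for these match bits. */
    return portal->nak_on_miss ? TOY_GET_NAK_NOENT : TOY_GET_DROP;
}
```

      With only the new xid 1443516130526000 registered (an assumption about the client state at that point), a GET carrying the old match bits 0x520deca300328 yields `TOY_GET_DROP` when `nak_on_miss` is 0 and `TOY_GET_NAK_NOENT` when it is 1. In the second case the sender would learn about the mismatch immediately instead of waiting for the bulk timeout.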

          People

            wc-triage WC Triage
            shadow Alexey Lyashkov