Details
Type: Bug
Resolution: Unresolved
Priority: Major
Description
While investigating long timeouts with bulk requests, I found that LNet drops requests carrying a wrong (stale) xid without sending any response or error code, which leaves the application stuck in a bulk timeout loop.
00000100:00000400:0.0:1376644405.966092:0:11157:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1376644402/real 1376644402] req@ffff88003a96d800 x1443516130525992/t0(0) o4->lustre-OST0000-osc-ffff88001cd0a800@0@lo:6/4 lens 456/416 e 0 to 1 dl 1376644405 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
00000400:00000001:0.0:1376644405.966117:0:11157:0:(lib-msg.c:46:lnet_build_unlink_event()) Process entered
00000400:00000001:0.0:1376644405.966118:0:11157:0:(lib-msg.c:55:lnet_build_unlink_event()) Process leaving
00000100:00000001:0.0:1376644405.966121:0:11157:0:(events.c:95:reply_in_callback()) Process entered
00000100:00000200:0.0:1376644405.966122:0:11157:0:(events.c:97:reply_in_callback()) @@@ type 5, status 0 req@ffff88003a96d800 x1443516130525992/t0(0) o4->lustre-OST0000-osc-ffff88001cd0a800@0@lo:6/4 lens 456/416 e 0 to 1 dl 1376644405 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
00000100:00000200:0.0:1376644405.966127:0:11157:0:(events.c:118:reply_in_callback()) @@@ unlink req@ffff88003a96d800 x1443516130525992/t0(0) o4->lustre-OST0000-osc-ffff88001cd0a800@0@lo:6/4 lens 456/416 e 0 to 1 dl 1376644405 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
00000100:00080000:0.0:1376644405.967280:0:11157:0:(client.c:1724:ptlrpc_check_set()) resend bulk old x1443516130525992 new x1443516130526000
00000400:00000010:1.0:1376644405.968303:0:11829:0:(lib-lnet.h:333:lnet_msg_alloc()) kmalloced 'msg': 336 at ffff88002fb85600 (tot 48234866).
00000400:00000200:1.0:1376644405.968343:0:11829:0:(lib-move.c:163:lnet_match_md()) Request from 12345-0@lo of length 102400 into portal 8 MB=0x520deca300328
00000400:00000100:1.0:1376644405.968345:0:11829:0:(lib-move.c:1920:lnet_parse_get()) Dropping GET from 12345-0@lo portal 8 match 1443516130525992 offset 0 length 102400
00000400:00000010:1.0:1376644405.968346:0:11829:0:(lib-lnet.h:344:lnet_msg_free()) kfreed 'msg': 336 at ffff88002fb85600 (tot 48234530).
00000100:00000001:1.0:1376644405.968349:0:11829:0:(events.c:381:server_bulk_callback()) Process entered
00000100:00000200:1.0:1376644405.968350:0:11829:0:(events.c:392:server_bulk_callback()) event type 4, status 0, desc ffff88003dbe6c00
00000100:00000001:1.0:1376644405.968352:0:11829:0:(events.c:419:server_bulk_callback()) Process leaving
I think the LNet node should send a response when the xid in an incoming request is not registered for transfer. That would give the failed node an early reply and avoid the long wait in the bulk timeout loop.
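The proposed behaviour could look roughly like the sketch below. It is a standalone illustration, not LNet code: `match_table`, `handle_get`, and the `GET_*` result codes are hypothetical names invented for this example, and the real match logic in `lnet_match_md()` / `lnet_parse_get()` is considerably more involved. The only point shown is that an unmatched xid yields an explicit negative result the sender can act on, rather than a silent drop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch; not real LNet constants. */
enum get_result {
    GET_MATCHED = 0,      /* xid has a registered buffer, transfer proceeds */
    GET_NAK_NO_MATCH = 1  /* xid unknown: tell the peer instead of dropping */
};

/* Hypothetical table of xids (match bits) that currently have a
 * bulk buffer registered on this node. */
struct match_table {
    const uint64_t *xids;
    size_t count;
};

/* Look up the xid of an incoming GET. Instead of silently dropping an
 * unmatched request, return an explicit NAK so the caller can send an
 * error reply and the peer does not have to wait for its bulk timeout. */
static enum get_result handle_get(const struct match_table *t, uint64_t xid)
{
    for (size_t i = 0; i < t->count; i++) {
        if (t->xids[i] == xid)
            return GET_MATCHED;
    }
    return GET_NAK_NO_MATCH;
}
```

With the xids from the log above, a table holding only the resent xid 1443516130526000 would match a GET for that value, while a GET for the old xid 1443516130525992 would produce `GET_NAK_NO_MATCH` instead of the "Dropping GET" path.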