[LU-3774] LNet doesn't respond to wrong match bits Created: 19/Aug/13 Updated: 24/Jan/14 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexey Lyashkov | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9773 |
| Description |
|
While investigating long timeouts with bulk requests, I found that LNet drops requests with a wrong xid without responding with an error code, and the application gets stuck in the bulk timeout loop.

00000100:00000400:0.0:1376644405.966092:0:11157:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1376644402/real 1376644402] req@ffff88003a96d800 x1443516130525992/t0(0) o4->lustre-OST0000-osc-ffff88001cd0a800@0@lo:6/4 lens 456/416 e 0 to 1 dl 1376644405 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
00000400:00000001:0.0:1376644405.966117:0:11157:0:(lib-msg.c:46:lnet_build_unlink_event()) Process entered
00000400:00000001:0.0:1376644405.966118:0:11157:0:(lib-msg.c:55:lnet_build_unlink_event()) Process leaving
00000100:00000001:0.0:1376644405.966121:0:11157:0:(events.c:95:reply_in_callback()) Process entered
00000100:00000200:0.0:1376644405.966122:0:11157:0:(events.c:97:reply_in_callback()) @@@ type 5, status 0 req@ffff88003a96d800 x1443516130525992/t0(0) o4->lustre-OST0000-osc-ffff88001cd0a800@0@lo:6/4 lens 456/416 e 0 to 1 dl 1376644405 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
00000100:00000200:0.0:1376644405.966127:0:11157:0:(events.c:118:reply_in_callback()) @@@ unlink req@ffff88003a96d800 x1443516130525992/t0(0) o4->lustre-OST0000-osc-ffff88001cd0a800@0@lo:6/4 lens 456/416 e 0 to 1 dl 1376644405 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
00000100:00080000:0.0:1376644405.967280:0:11157:0:(client.c:1724:ptlrpc_check_set()) resend bulk old x1443516130525992 new x1443516130526000
00000400:00000010:1.0:1376644405.968303:0:11829:0:(lib-lnet.h:333:lnet_msg_alloc()) kmalloced 'msg': 336 at ffff88002fb85600 (tot 48234866).
00000400:00000200:1.0:1376644405.968343:0:11829:0:(lib-move.c:163:lnet_match_md()) Request from 12345-0@lo of length 102400 into portal 8 MB=0x520deca300328
00000400:00000100:1.0:1376644405.968345:0:11829:0:(lib-move.c:1920:lnet_parse_get()) Dropping GET from 12345-0@lo portal 8 match 1443516130525992 offset 0 length 102400
00000400:00000010:1.0:1376644405.968346:0:11829:0:(lib-lnet.h:344:lnet_msg_free()) kfreed 'msg': 336 at ffff88002fb85600 (tot 48234530).
00000100:00000001:1.0:1376644405.968349:0:11829:0:(events.c:381:server_bulk_callback()) Process entered
00000100:00000200:1.0:1376644405.968350:0:11829:0:(events.c:392:server_bulk_callback()) event type 4, status 0, desc ffff88003dbe6c00
00000100:00000001:1.0:1376644405.968352:0:11829:0:(events.c:419:server_bulk_callback()) Process leaving

I think the LNet node should send a response if the LNet xid is not registered for a transfer; in that case we would avoid the long wait in the bulk loop and the sending of early replies to the failed node. |
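A minimal sketch of the drop path the log above shows. All names here (md_table, match_md, parse_get) are simplified stand-ins, not the real LNet symbols: on a match miss the incoming GET is freed and dropped with no indication to the sender, which then waits out its full RPC timeout.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct lnet_md {
        uint64_t match_bits;   /* key the sender must present */
        bool     posted;       /* still attached to the portal? */
    } lnet_md_t;

    static lnet_md_t md_table[4];  /* toy portal: a handful of posted MDs */

    /* Look up a posted MD whose match bits equal the incoming key. */
    static lnet_md_t *match_md(uint64_t match_bits)
    {
        for (unsigned i = 0; i < 4; i++)
            if (md_table[i].posted && md_table[i].match_bits == match_bits)
                return &md_table[i];
        return NULL;
    }

    /* Incoming GET: on a match miss the message is dropped silently,
     * so the peer learns nothing and waits out its timeout. */
    static void parse_get(uint64_t match_bits)
    {
        lnet_md_t *md = match_md(match_bits);

        if (md == NULL) {
            printf("Dropping GET, match %llu: no posted MD\n",
                   (unsigned long long)match_bits);
            return;
        }
        printf("GET matched, starting transfer\n");
    }

    int main(void)
    {
        md_table[0] = (lnet_md_t){ .match_bits = 1443516130526000ULL,
                                   .posted = true };
        parse_get(1443516130525992ULL);  /* stale xid from the resent bulk */
        parse_get(1443516130526000ULL);  /* current xid: matches */
        return 0;
    }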
| Comments |
| Comment by Jodi Levi (Inactive) [ 19/Aug/13 ] |
|
Liang, |
| Comment by Liang Zhen (Inactive) [ 13/Sep/13 ] |
|
It's not a bug, and LNet is supposed to do this: the match bits are the key to a remote memory address, and it's reasonable to drop illegal accesses. Also, it would not be a small change if we wanted to send a reply/ack to return an error, because different LNDs have different GET protocols; e.g. socklnd sends a reply in LNet, while o2iblnd uses an RDMA get without sending a reply. So I tend not to make this change for now. |
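A toy illustration of the asymmetry Liang describes (the dispatch and names are illustrative, not the real LND code): socklnd completes a GET with an explicit reply message that an error could piggyback on, whereas o2iblnd moves the data by RDMA with no LNet-level reply to carry a NAK.

    #include <stdio.h>

    enum lnd_type { LND_SOCK, LND_O2IB };

    static void get_miss(enum lnd_type lnd)
    {
        switch (lnd) {
        case LND_SOCK:
            /* a reply path already exists; an error reply could slot in */
            printf("socklnd: could send a reply carrying an error\n");
            break;
        case LND_O2IB:
            /* data moves by RDMA; no reply message exists to reuse */
            printf("o2iblnd: a NAK would need a new message type\n");
            break;
        }
    }

    int main(void)
    {
        get_miss(LND_SOCK);
        get_miss(LND_O2IB);
        return 0;
    }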
| Comment by Alexey Lyashkov [ 13/Sep/13 ] |
|
From my point of view, it could be handled as a special type of LNet message, like LNet pings (similar to the ICMP protocol), where we get a reply about an access with an invalid match-bits key. Otherwise it produces a very, very large timeout in such a situation: on most production systems obd_timeout is 900s and at_max is over 3x obd_timeout. |
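A sketch of what such an ICMP-style NAK might carry; none of these names exist in LNet today, this is a hypothetical wire format only.

    #include <stdio.h>
    #include <stdint.h>

    typedef struct lnet_ctl_nak {
        uint64_t lcn_match_bits;  /* key that failed to match any posted MD */
        uint32_t lcn_portal;      /* portal index the GET/PUT targeted */
        int32_t  lcn_errno;       /* error to surface to the sender */
    } lnet_ctl_nak_t;

    /* On a match miss, the receiver would send this back instead of
     * dropping the message silently; the sender's ptlrpc layer could
     * then fail the RPC at once instead of waiting out the timeout. */
    int main(void)
    {
        lnet_ctl_nak_t nak = {
            .lcn_match_bits = 1443516130525992ULL,  /* stale xid from the log */
            .lcn_portal     = 8,
            .lcn_errno      = -2,                   /* -ENOENT */
        };
        printf("NAK: portal %u match %llu errno %d\n",
               (unsigned)nak.lcn_portal,
               (unsigned long long)nak.lcn_match_bits,
               (int)nak.lcn_errno);
        return 0;
    }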
| Comment by Alexey Lyashkov [ 13/Sep/13 ] |
|
BTW, the IB LND doesn't check a timeout for incoming requests, so it has no timeout of its own in that case (in contrast to outgoing messages, which have a timeout on the tx), so the server side will be stuck waiting indefinitely. |
| Comment by Liang Zhen (Inactive) [ 24/Sep/13 ] |
|
Shadow, sorry, I am (and have been) busy with our project milestone, but I will come back to you once I have more time to think it over. |
| Comment by Alexey Lyashkov [ 24/Sep/13 ] |
|
How much time is needed? I want to start a discussion on how we might introduce a control protocol; Eric agreed (at LAD) with the initial ideas about it. |
| Comment by Eric Barton (Inactive) [ 23/Oct/13 ] |
|
Firstly, why was there no matching ME? Was this a bug - i.e. is this needed for debug support? Normally the upper-level protocol should ensure MEs are attached before the sender can send. |
| Comment by Alexey Lyashkov [ 23/Oct/13 ] |
|
Eric, it's not a bug. It's a trivial situation, but rare, since it needs a race:
/that situation is described by the log in the description - the client unlinks the request during the timeout, but the server sends an early reply and extends the bulk timeout since no errors were hit/. The same situation can be hit if the client reboots while a BRW request is being executed on the server side. |
| Comment by Eric Barton (Inactive) [ 24/Oct/13 ] |
|
Can you confirm that you're trying to make the server abandon the RPC if the client has already abandoned it? Is the real issue that the continued existence of the RPC prevents the client from reconnecting? |
| Comment by Alexey Lyashkov [ 25/Oct/13 ] |
|
Eric, I'm not trying to make the server abandon the RPC; that is already done via the early reply. The issue is that an ost_io thread is blocked from exiting the target_bulk_io loop, because AT extends the timeout after each early reply sent from the OST to the client. So if the client could reply with some error to the early reply, we would exit the loop early; but in the current situation we wait for the transfer to finish until AT_MAX is exceeded before returning an error to the caller. You can easily replicate the issue by commenting out the single line exp->exp_abort_active_req = 1; in the target_handle_connect() function and running any recovery test with bulk resend. Is this explanation clear enough? |
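A toy model of the wait Alexey describes, assuming illustrative numbers and names (the real target_bulk_io and AT logic are more involved): each early reply extends the deadline, so the service thread keeps waiting until at_max is exhausted, because the bulk never completes once the client has unlinked it.

    #include <stdio.h>
    #include <stdbool.h>

    int main(void)
    {
        int now = 0;
        int deadline = 30;          /* initial bulk timeout (s), illustrative */
        const int at_max = 600;     /* upper bound on AT extension, illustrative */
        bool bulk_done = false;     /* never completes: client unlinked the MD */

        while (!bulk_done && now < at_max) {
            now = deadline;         /* wait expires ... */
            deadline += 30;         /* ... early reply extends the deadline */
            printf("t=%ds: timeout, early reply sent, new deadline %ds\n",
                   now, deadline);
        }
        printf("gave up at t=%ds (at_max=%ds)\n", now, at_max);
        return 0;
    }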
| Comment by Eric Barton (Inactive) [ 29/Oct/13 ] |
|
So you do want the server to abandon the request (i.e. remove it from the request queue and avoid blocking an OST service thread on a request that can only time out after AT_MAX) - right? If so, we need to evaluate... Personally, I'm a little reluctant to change the protocol if the real issue is that AT is fragile. |
| Comment by Alexey Lyashkov [ 29/Oct/13 ] |
|
Eric, no. I want some error to be returned if one LNet node sends a request with invalid match bits to the other side, like the TCP stack has. |
| Comment by Eric Barton (Inactive) [ 30/Oct/13 ] |
|
Yes, but this still does not solve the problem in the case of router failure or client failure. I don't think a protocol change like this is merited unless this issue is more common than router and/or client failure and can't be avoided by tweaking AT. |
| Comment by Alexey Lyashkov [ 31/Oct/13 ] |
|
Well, a routed environment is a different story. If you remember my presentation at the developer summit at LAD, I talked about both cases and about a solution for both.
LNet has special handling for the LNet ping - why can't we extend that special PID with two additional operations? The change would be backward compatible, as older nodes would ignore such a message and continue working as before. |
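A sketch of how that backward-compatible extension could look; the operation names and values are hypothetical, with the two new operations taken from the cases discussed later in this ticket (no matching MD, destination unreachable).

    #include <stdio.h>

    enum lnet_ctl_op {
        LNET_CTL_PING     = 0, /* existing ping handling */
        LNET_CTL_NO_MATCH = 1, /* hypothetical: no MD for these match bits */
        LNET_CTL_UNREACH  = 2, /* hypothetical: router reports peer unreachable */
    };

    /* An old node's dispatcher: unknown opcodes are simply dropped,
     * so mixed clusters keep working as before. */
    static void ctl_dispatch(int op)
    {
        if (op == LNET_CTL_PING)
            printf("ping: answered\n");
        else
            printf("op %d: unknown, dropped (old node behaviour)\n", op);
    }

    int main(void)
    {
        ctl_dispatch(LNET_CTL_PING);
        ctl_dispatch(LNET_CTL_NO_MATCH);  /* ignored by pre-change nodes */
        return 0;
    }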
| Comment by John Forgan (Inactive) [ 22/Jan/14 ] |
|
Hi, this is still an issue for us. Can someone look at Shadow's proposal please? |
| Comment by Eric Barton (Inactive) [ 24/Jan/14 ] |
|
Today's ACK and REPLY semantics are clear - the initiator receives them if... ...so receiving the ACK/REPLY is a confirmation that everything worked OK. I'm not clear what semantics you are asking for - especially in the presence of router failure. Getting a NAK when there is either no matching ME at the destination or the destination is not reachable is not so simple, since there are more failure cases...
1. The router receiving the PUT/GET crashed after receiving it but before it was forwarded
...which will result in the PUT/GET initiator still having to wait for the full timeout.
If the client is responsive, but has given up on the server and therefore has unlinked bulk and reply buffers for in-flight RPCs, will it not be trying to reconnect? Isn't successful receipt of a reconnection attempt a good and timely indication that the server should abandon its bulk/reply communications? |
| Comment by Alexey Lyashkov [ 24/Jan/14 ] |
|
Eric, I am talking about two things, both similar to the ICMP protocol in TCP/UDP:
1) the destination is not reachable. The ACK semantic introduces a long delay if the destination is not accessible, so there is a huge delay if the router does not have the destination reachable. This is found by the dead peer detector (on the router); the router may abort any transfers and may send a special message to provide an asynchronous notification, and the sender will then abort its own descriptors and return an error to the ptlrpc layer. That way we avoid the huge delay waiting for replies to outgoing requests (like LDLM callbacks, bulk transfers, or others if needed).
2) the match bits don't exist on the destination. |
| Comment by Eric Barton (Inactive) [ 24/Jan/14 ] |
|
Shadow, I do think I grok your request. However, do you grok the failure cases I listed (the 2nd set of numbered statements in my comment)? These failures will result in a timeout at the initiator (the Lustre server in this case). How will that be handled? Your suggestion to add features at the LNET level is incomplete, is it not? If the purpose is to allow a live, responsive client to reconnect promptly without having to wait for bulk and replies to its previous incarnation to time out, can we not implement a complete solution at the Lustre level? |