[LU-5116] Race between resend and reply processing Created: 28/May/14  Updated: 03/Sep/14  Resolved: 02/Jun/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1, Lustre 2.5.0, Lustre 2.6.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.2

Type: Bug Priority: Major
Reporter: Alexander Boyko Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-5554 Race between resend and reply process... Resolved
is related to LU-2232 LustreError: 9120:0:(ost_handler.c:16... Resolved
Severity: 3
Rank (Obsolete): 14104

 Description   

The server evicts the client because of an invalid request:

00000100:00100000:9.0:1400505646.197736:0:83755:0:(service.c:1734:ptlrpc_server_handle_req_in()) got req x1468534034672908
00000100:00020000:9.0:1400505646.197738:0:83755:0:(service.c:975:ptlrpc_check_req()) @@@ Invalid replay without recovery  req@ffff88079c2b0850 x1468534034672908/t0(88947828) o4->7f3cf026-15bd-c61a-088c-a943e5bce2bf@335@gni1:0/0 lens 488/0 e 0 to 0 dl 0 ref 1 fl New:/6/ffffffff rc 0/-1
00000020:00080000:9.0:1400505646.221792:0:83755:0:(genops.c:1391:class_fail_export()) disconnecting export ffff8805acac6400/7f3cf026-15bd-c61a-088c-a943e5bce2bf
00000020:00000080:10.0:1400505646.221811:0:83755:0:(genops.c:1229:class_disconnect()) disconnect: cookie 0xa74fa39ba3a7cd61
00000020:00010000:10.0:1400505646.221817:0:83755:0:(genops.c:1746:obd_stale_export_put()) Put export ffff8805acac6400: total 1
00000100:00080000:10.0:1400505646.221820:0:83755:0:(import.c:1502:ptlrpc_cleanup_imp()) ffff88054809a800 ^W: changing import state from FULL to CLOSED

On the client side we can see the race:

00000100:00080000:22.0:1400505646.246037:0:19252:0:(client.c:2487:ptlrpc_resend_req()) @@@ going to resend  req@ffff880ffea86000 x1468534034670388/t88947827(88947827) o4->snx11063-OST0050-osc-ffff881039a22400@10.149.150.25@o2ib4008:6/4 lens 488/416 e 2 to 0 dl 1400505782 ref 2 fl Interpret:R/4/0 rc 0/0

The client is going to resend the request, but the request already has the req->rq_replied flag set (the "Interpret:R" in the log line) and the MSG_REPLAY flag in req->rq_reqmsg (the "/4/").

There was a disconnect/reconnect on the client side (LNet error) and no recovery happened.

The race exists between ptlrpc_check_set() and the reconnect path's ptlrpc_resend_req(). The request still belongs to imp->imp_sending_list and has the MSG_REPLAY flag set after after_reply() in ptlrpc_check_set() and before:

                if (!cfs_list_empty(&req->rq_list)) {
                        cfs_list_del_init(&req->rq_list);
                        cfs_atomic_dec(&imp->imp_inflight);                    
                }

The reconnect code processes this list to resend requests. So it can happen that a request got its reply, after_reply() processed it and set MSG_REPLAY, but then ptlrpc_resend_req() set the rq_resend flag and the request was resent anyway. When such a request with the MSG_REPLAY flag arrives at the server outside of recovery, it causes client eviction ("Invalid replay without recovery").



 Comments   
Comment by Alexander Boyko [ 28/May/14 ]

patch http://review.whamcloud.com/10471
Xyratex-bug-id: MRP-1888

Comment by Chris Horn [ 28/May/14 ]

Cray tested this patch. Without the patch we would occasionally hit this race condition leading to eviction and job failure. With the patch we stopped seeing the evictions.

Comment by Cliff White (Inactive) [ 02/Jun/14 ]

The patch has been merged, so I will close this issue.

Comment by Chris Horn [ 02/Jun/14 ]

Additional testing revealed that the patch has not completely closed the race window. We may want to keep this ticket open to track additional improvements/fixes. Otherwise we can open a new ticket when we have something to contribute.

Comment by Cliff White (Inactive) [ 02/Jun/14 ]

It would be better to open up a new ticket, especially if you think there might be a delay. It is easy to link tickets if needed later.

Comment by Chris Horn [ 02/Jun/14 ]

Thanks, sounds good.

Comment by James Nunez (Inactive) [ 05/Jun/14 ]

patch for b2_5 at http://review.whamcloud.com/#/c/10562

Comment by Alexander Boyko [ 17/Jun/14 ]

http://review.whamcloud.com/10735 one more patch for master.

Comment by Peter Jones [ 17/Jun/14 ]

Could you please track this latest patch under a new JIRA ticket? Thanks!

Generated at Sat Feb 10 01:48:40 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.