[LU-5116] Race between resend and reply processing Created: 28/May/14 Updated: 03/Sep/14 Resolved: 02/Jun/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1, Lustre 2.5.0, Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.2 |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexander Boyko | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 14104 |
| Description |
|
The server evicts the client after receiving an invalid request:

00000100:00100000:9.0:1400505646.197736:0:83755:0:(service.c:1734:ptlrpc_server_handle_req_in()) got req x1468534034672908
00000100:00020000:9.0:1400505646.197738:0:83755:0:(service.c:975:ptlrpc_check_req()) @@@ Invalid replay without recovery req@ffff88079c2b0850 x1468534034672908/t0(88947828) o4->7f3cf026-15bd-c61a-088c-a943e5bce2bf@335@gni1:0/0 lens 488/0 e 0 to 0 dl 0 ref 1 fl New:/6/ffffffff rc 0/-1
00000020:00080000:9.0:1400505646.221792:0:83755:0:(genops.c:1391:class_fail_export()) disconnecting export ffff8805acac6400/7f3cf026-15bd-c61a-088c-a943e5bce2bf
00000020:00000080:10.0:1400505646.221811:0:83755:0:(genops.c:1229:class_disconnect()) disconnect: cookie 0xa74fa39ba3a7cd61
00000020:00010000:10.0:1400505646.221817:0:83755:0:(genops.c:1746:obd_stale_export_put()) Put export ffff8805acac6400: total 1
00000100:00080000:10.0:1400505646.221820:0:83755:0:(import.c:1502:ptlrpc_cleanup_imp()) ffff88054809a800 ^W: changing import state from FULL to CLOSED

On the client side we can see a race:

00000100:00080000:22.0:1400505646.246037:0:19252:0:(client.c:2487:ptlrpc_resend_req()) @@@ going to resend req@ffff880ffea86000 x1468534034670388/t88947827(88947827) o4->snx11063-OST0050-osc-ffff881039a22400@10.149.150.25@o2ib4008:6/4 lens 488/416 e 2 to 0 dl 1400505782 ref 2 fl Interpret:R/4/0 rc 0/0

The client is about to resend the request, but the request already has the req->rq_replied flag set (Interpret:R) and the MSG_REPLAY flag set in req->rq_reqmsg (/4). There was a disconnect/reconnect on the client side (LNet error), and no recovery happened. The race is between ptlrpc_check_set() and reconnect->ptlrpc_resend_req(). The request is still on imp->imp_sending_list and already carries the MSG_REPLAY flag in the window after after_reply() in ptlrpc_check_set() and before this code runs:

if (!cfs_list_empty(&req->rq_list)) {
        cfs_list_del_init(&req->rq_list);
        cfs_atomic_dec(&imp->imp_inflight);
}

The reconnect code walks this list to resend requests. So it can happen that a request gets a reply and after_reply() processes it and sets MSG_REPLAY, but ptlrpc_resend_req() then sets the rq_resend flag and the request is resent. When such a request, carrying the MSG_REPLAY flag, reaches the server, it causes the client to be evicted. |
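The interleaving described above can be replayed deterministically in a small single-threaded model. This is only a sketch: the toy_* names, the struct fields, and the TOY_MSG_REPLAY value are hypothetical stand-ins for the real ptlrpc structures, and the "guarded" branch (skip a request that already has a reply) is an assumption made for illustration, not a description of what the patch does.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the replay flag; the value mirrors the "/4" in the
 * client log but is used here for illustration only. */
#define TOY_MSG_REPLAY 0x4u

struct toy_req {
    bool replied;          /* models req->rq_replied */
    bool resend;           /* models req->rq_resend */
    bool on_sending_list;  /* models membership in imp->imp_sending_list */
    unsigned msg_flags;    /* models the flags kept in req->rq_reqmsg */
};

/* Models the effect of after_reply() as described in the ticket: the
 * reply is processed and the request is flagged for replay while it is
 * still linked on the sending list. */
static void toy_after_reply(struct toy_req *req)
{
    assert(req->replied && req->on_sending_list);
    req->msg_flags |= TOY_MSG_REPLAY;
}

/* Models the unlink step in ptlrpc_check_set() quoted in the ticket. */
static void toy_unlink(struct toy_req *req)
{
    req->on_sending_list = false;
}

/* Models the reconnect path visiting a request on the sending list.
 * With `guarded` set, a request that already has a reply is skipped;
 * that guard is an assumption of this sketch. */
static void toy_reconnect_resend(struct toy_req *req, bool guarded)
{
    if (!req->on_sending_list)
        return;
    if (guarded && req->replied)
        return;
    req->resend = true;
}

/* Plays the bad ordering: reply processed, reconnect runs inside the
 * window, then the request is unlinked. Returns true if a request
 * carrying the replay flag ends up marked for resend. */
static bool toy_race_resends_replay(bool guarded)
{
    struct toy_req req = { .replied = true, .on_sending_list = true };

    toy_after_reply(&req);
    toy_reconnect_resend(&req, guarded);
    toy_unlink(&req);

    return req.resend && (req.msg_flags & TOY_MSG_REPLAY) != 0;
}
```

Calling toy_race_resends_replay(false) reproduces the problematic state (resend requested on a request that carries the replay flag), while toy_race_resends_replay(true) shows that, in this model, checking for an existing reply before setting the resend flag avoids it. The real code paths run concurrently, so how any check is ordered against the list removal matters in ways this model does not capture.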
| Comments |
| Comment by Alexander Boyko [ 28/May/14 ] |
|
patch http://review.whamcloud.com/10471 |
| Comment by Chris Horn [ 28/May/14 ] |
|
Cray tested this patch. Without the patch we would occasionally hit this race condition leading to eviction and job failure. With the patch we stopped seeing the evictions. |
| Comment by Cliff White (Inactive) [ 02/Jun/14 ] |
|
The patch has been merged, so I will close this issue. |
| Comment by Chris Horn [ 02/Jun/14 ] |
|
Additional testing revealed that the patch has not completely closed the race window. We may want to keep this ticket open to track additional improvements/fixes. Otherwise we can open a new ticket when we have something to contribute. |
| Comment by Cliff White (Inactive) [ 02/Jun/14 ] |
|
It would be better to open up a new ticket, especially if you think there might be a delay. It is easy to link tickets if needed later. |
| Comment by Chris Horn [ 02/Jun/14 ] |
|
Thanks, sounds good. |
| Comment by James Nunez (Inactive) [ 05/Jun/14 ] |
|
patch for b2_5 at http://review.whamcloud.com/#/c/10562 |
| Comment by Alexander Boyko [ 17/Jun/14 ] |
|
One more patch for master: http://review.whamcloud.com/10735 |
| Comment by Peter Jones [ 17/Jun/14 ] |
|
Could you please track this latest patch under a new JIRA ticket? Thanks! |