[LU-904] import invalidation doesn't fail all requests Created: 08/Dec/11 Updated: 28/Sep/12 Resolved: 21/May/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.3.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Niu Yawei (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Sub-Tasks: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4632 |
| Description |
|
When a client invalidates an import on eviction, it aborts only the requests in the sending list (imp_sending_list) and the delayed list (imp_delayed_list). Requests that are in a request set but not linked on either list are not failed. They survive the eviction and are sent later, which could eventually cause data corruption. These leaked requests are usually retry requests. For instance, in brw_interpret(), if a request failed with a recoverable error, we generate a new request and retry it. Such a retry request is usually kept in the request set for a while (i.e. while rq_sent still holds a time in the future), see ptlrpc_send_new_req():

    if (req->rq_sent && (req->rq_sent > cfs_time_current_sec()))
            RETURN(0);

So if the import invalidation happens before the request is added to the sending or delayed list, the request will not be aborted. We probably need another list to track such requests and make sure they are failed during invalidation. |
| Comments |
| Comment by Niu Yawei (Inactive) [ 13/Jan/12 ] |
|
patch for master: http://review.whamcloud.com/1962 |
| Comment by Andreas Dilger [ 16/Jan/12 ] |
|
The patch in http://review.whamcloud.com/1962 appears to be moving in just the opposite direction: requests will be resent from the client if -EINPROGRESS is returned. How does that relate to the bug described here? Also, what implications does this have for client/server interoperability? What will an old client do with -EINPROGRESS? Since it is not in the list of recoverable error codes, the client will fail immediately instead of retrying. |
| Comment by Niu Yawei (Inactive) [ 17/Jan/12 ] |
|
Hi Andreas, the patch includes two parts:
Since the defect isn't easy to trigger with a limited retry count, I put the -EINPROGRESS changes in this patch so that we can easily inject -EINPROGRESS errors in a test script to trigger the defect. An old client will just fail the I/O with -EINPROGRESS, so when an old client interoperates with an Orion server, it will fail the I/O whenever the Orion server wants it to retry until the quota master is available. That's why we want this patch in the 2.2 client. Of course, we could instead fix the interoperability issue in the Orion server (not returning -EINPROGRESS for old client requests, but retrying quota acquisition indefinitely on the server side, as the current master code does), but that approach is obviously much more complex than patching the 2.2 client. |
| Comment by Andreas Dilger [ 17/Jan/12 ] |
|
Two issues arise in this case:
I'm not against getting interoperability support into earlier versions of Lustre, but this might be difficult to get into 2.2 at this point, and it still doesn't address interoperability with 1.8 and 2.1 clients. You should discuss this with Johann; I don't think this would be accepted into 2.1.1 either, though it might be. |
| Comment by Niu Yawei (Inactive) [ 21/May/12 ] |
|
patch landed for 2.3 |