[LU-904] import invalidation doesn't fail all requests Created: 08/Dec/11  Updated: 28/Sep/12  Resolved: 21/May/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.3.0

Type: Bug Priority: Minor
Reporter: Niu Yawei (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Sub-Tasks:
Key      Summary                                   Type            Status    Assignee
LU-1582  replay-ost-single.sh test_8b: client ...  Technical task  Resolved  Niu Yawei
Severity: 3
Rank (Obsolete): 4632

 Description   

When a client invalidates an import on eviction, it aborts only the requests on the sending list (imp_sending_list) and the delayed list (imp_delayed_list). Requests that are in a request set but not linked on either list are not failed; they survive the eviction and are sent later, which could cause data corruption.

The leaked requests are usually retry requests. For instance, in brw_interpret(), if a request failed with a recoverable error, we generate a new request to retry it. Such a retry request is usually kept in the request set for a while; see ptlrpc_send_new_req():

        if (req->rq_sent && (req->rq_sent > cfs_time_current_sec()))
                RETURN (0);

So if the import is invalidated before the request has been added to the sending or delayed list, the request will not be aborted.

We probably need another list to track those requests and make sure they are failed out during invalidation.
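
To make the window concrete, here is a minimal toy model of the behaviour described above. It is only a sketch: every toy_* name is hypothetical and none of this is ptlrpc code. It assumes a request joins the sending list only once its send time has arrived, and that invalidation fails only requests linked on a list.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: every toy_* name is hypothetical, not a ptlrpc symbol. */
enum toy_where { TOY_SET_ONLY, TOY_SENDING, TOY_DELAYED };

struct toy_req {
        enum toy_where where;  /* import list linking the request, if any */
        long           sent;   /* earliest allowed send time, like rq_sent */
        bool           failed; /* set once invalidation aborts the request */
};

/* Mirrors the early return quoted above: a request whose send time is
 * still in the future stays in its set and joins no import list. */
static int toy_send_new_req(struct toy_req *req, long now)
{
        if (req->sent && req->sent > now)
                return 0;
        req->where = TOY_SENDING;
        return 1;
}

/* Invalidation as described in this ticket: only requests linked on the
 * sending or delayed list are failed. Returns how many were missed. */
static size_t toy_invalidate(struct toy_req *reqs, size_t n)
{
        size_t missed = 0;

        for (size_t i = 0; i < n; i++) {
                if (reqs[i].where == TOY_SET_ONLY)
                        missed++;
                else
                        reqs[i].failed = true;
        }
        return missed;
}
```

In this model, a retry request with a future send time is counted as missed by toy_invalidate() and can still be passed to toy_send_new_req() successfully afterwards, which is the leak this ticket is about.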



 Comments   
Comment by Niu Yawei (Inactive) [ 13/Jan/12 ]

patch for master: http://review.whamcloud.com/1962

Comment by Andreas Dilger [ 16/Jan/12 ]

The patch in http://review.whamcloud.com/1962 appears to be moving in just the opposite direction - that requests will be resent from the client if -EINPROGRESS is returned. How does that relate to the bug described here?

Also, what effect does this have on client/server interoperability? What will an old client do with -EINPROGRESS? Since this is not in the list of recoverable error codes, the client will immediately fail instead of retrying.

Comment by Niu Yawei (Inactive) [ 17/Jan/12 ]

Hi, Andreas

The patch includes two parts:

  • the client redoes I/O indefinitely when it gets -EINPROGRESS from the server;
  • it fixes the defect described in this ticket by introducing the imp_generation_set stuff;

Since the defect isn't easy to trigger with a limited retry count, I put the -EINPROGRESS change in this patch; then we can easily inject the -EINPROGRESS error in a test script to trigger the defect.
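
This comment does not spell out how imp_generation_set works, so the following is not the code from the patch. It is only a sketch of one common way to catch stale requests, a generation stamp checked at send time. All gen_* names are hypothetical, and the choice of -EIO as the failure code is an assumption.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical names throughout; this is not the code from the patch. */
struct gen_import {
        int generation;         /* bumped on every invalidation */
};

struct gen_req {
        int generation;         /* import generation when the request was built */
        int status;             /* 0 while pending, negative errno once failed */
};

static void gen_req_init(struct gen_req *req, const struct gen_import *imp)
{
        req->generation = imp->generation;
        req->status = 0;
}

static void gen_import_invalidate(struct gen_import *imp)
{
        imp->generation++;
}

/* Called right before a request would go on the wire. A stale request is
 * failed instead of sent, even if it was on no list at invalidation time. */
static bool gen_may_send(struct gen_req *req, const struct gen_import *imp)
{
        if (req->generation != imp->generation) {
                req->status = -EIO;     /* assumed error code */
                return false;
        }
        return true;
}
```

The point of such a scheme is that the check does not depend on which list the request was linked on when the import was invalidated.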

An old client will just fail the I/O with -EINPROGRESS. So when an old client interoperates with an Orion server, the client fails the I/O when the Orion server wants it to retry until the quota master is available; that's why we want to put this patch in the 2.2 client. Of course, we could also fix the interoperability issue in the Orion server (not returning -EINPROGRESS for old client requests, but acquiring quota indefinitely on the server side, as the current master code does), but that is obviously much more complex than patching the 2.2 client.

Comment by Andreas Dilger [ 17/Jan/12 ]

Two issues arise in this case:

  • this assumes that Orion servers only need to interoperate with 2.2 clients at the oldest. I don't think that will be true for LLNL and other customers.
  • how does the server know whether the client understands -EINPROGRESS? That would need an OBD_CONNECT flag.
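
As a rough illustration of the second point, a connect-flag check might look like the sketch below. The flag bit and every demo_* name are hypothetical; no actual OBD_CONNECT flag is named in this ticket. The sketch assumes the server falls back to waiting on its own side for clients that did not advertise the flag.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bit; the ticket names no actual OBD_CONNECT flag. */
#define DEMO_CONNECT_EINPROGRESS (1ULL << 0)

enum demo_action { DEMO_REPLY_EINPROGRESS, DEMO_WAIT_ON_SERVER };

/* Server side: decide what to do when the quota master is unavailable,
 * based on the flags the client sent at connect time. */
static enum demo_action demo_quota_busy(uint64_t client_flags)
{
        if (client_flags & DEMO_CONNECT_EINPROGRESS)
                return DEMO_REPLY_EINPROGRESS;
        return DEMO_WAIT_ON_SERVER;
}

/* Client side: only a client that advertised the flag treats
 * -EINPROGRESS as a reason to retry rather than fail the I/O. */
static bool demo_client_retries(int rc, uint64_t own_flags)
{
        return rc == -EINPROGRESS &&
               (own_flags & DEMO_CONNECT_EINPROGRESS) != 0;
}
```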

I'm not against getting interoperability support into earlier versions of Lustre, but this might be difficult to get into 2.2 at this point, and it still doesn't address interoperability with 1.8 and 2.1 clients. You should discuss this with Johann, but I don't think this would be accepted into 2.1.1 either, but it might.

Comment by Niu Yawei (Inactive) [ 21/May/12 ]

patch landed for 2.3
