[LU-6631] Retry LDLM_CANCEL in ldlm_cli_cancel_req Created: 22/May/15  Updated: 26/May/15  Resolved: 26/May/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Hiroya Nozaki Assignee: WC Triage
Resolution: Not a Bug Votes: 0
Labels: patch

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

ldlm_cli_cancel_req returns immediately if the LDLM_CANCEL request it tried to send fails with -EWOULDBLOCK (-EAGAIN), and the client is evicted before long.

Many network errors are recoverable, so eviction can often be avoided. (Eviction is really annoying in a suppress_ping environment, so I want to avoid it as far as I can.)

That's why I believe ldlm_cli_cancel_req should retry LDLM_CANCEL here.
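
The retry idea can be sketched roughly as below. This is only an illustration under stated assumptions, not the uploaded patch and not Lustre code: send_cancel_once(), send_cancel_with_retry(), failures_left, and the max_retries bound are all hypothetical names, and the send function merely simulates a transient -EAGAIN failure (on Linux, EWOULDBLOCK has the same value as EAGAIN).

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for sending one LDLM_CANCEL RPC; NOT a Lustre API.
 * It fails with -EAGAIN while failures_left > 0, then succeeds. */
static int failures_left = 2;

static int send_cancel_once(void)
{
	if (failures_left > 0) {
		failures_left--;
		return -EAGAIN;
	}
	return 0;
}

/* Resend only when the error is -EAGAIN, allowing at most max_retries
 * additional attempts after the first one. Any other result (success or
 * a different error) is returned to the caller unchanged. */
static int send_cancel_with_retry(int max_retries, int *attempts)
{
	int rc;

	assert(max_retries >= 0);
	*attempts = 0;
	do {
		rc = send_cancel_once();
		(*attempts)++;
	} while (rc == -EAGAIN && *attempts <= max_retries);

	return rc;
}
```

With two simulated failures and max_retries set to 3, the third attempt succeeds and the function returns 0. If the retry budget runs out first, the caller still sees -EAGAIN. A bounded loop is used here so that a persistent failure cannot spin forever; whether a bound, a delay between attempts, or both is appropriate is an open design choice.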

By the way, the patch I'll upload has worked well in Lustre-1.8.8 based FEFS for a very long time, and likewise in Lustre-2.6 based FEFS. I haven't seen any serious issues so far, so I'm convinced it will work in Lustre-2.x too.



 Comments   
Comment by Gerrit Updater [ 22/May/15 ]

Hiroya Nozaki (nozaki.hiroya@jp.fujitsu.com) uploaded a new patch: http://review.whamcloud.com/14917
Subject: LU-6631 ptlrpc: Retry LDLM_CANCEL in ldlm_cli_cancel_req
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6e0287d739e7217a7e118dc7767e0f7584273573

Comment by Andreas Dilger [ 22/May/15 ]

Could you please explain a bit more about where the -EWOULDBLOCK error is coming from? I see this in ptlrpc_import_delay_req() but only if req->rq_no_delay is set. Is rq_no_delay being set in the LDLM_CANCEL request somewhere? I don't see it at first glance.

Comment by Hiroya Nozaki [ 26/May/15 ]

I realized that LDLM_CANCEL requests no longer have rq_no_resend and rq_no_delay set in Lustre-2.x, so this patch isn't needed.
