[LU-1239] cascading client evictions Created: 20/Mar/12 Updated: 17/Mar/15 Resolved: 25/Jul/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.3.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Vitaly Fertman | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 4565 | ||||||||
| Description |
|
Recently I found the following scenario, which may lead to cascading client reconnects, lock timeouts, evictions, etc.

1. The MDS is overloaded with enqueues; they consume all the threads on the MDS_REQUEST portal.

static int ptlrpc_server_allow_normal(struct ptlrpc_service *svc, int force)
{
#ifndef __KERNEL__
        if (1) /* always allow to handle normal request for liblustre */
                return 1;
#endif
        if (force ||
            svc->srv_n_active_reqs < svc->srv_threads_running - 2)
                return 1;

        if (svc->srv_n_active_reqs >= svc->srv_threads_running - 1)
                return 0;

        return svc->srv_n_active_hpreq > 0 || svc->srv_hpreq_handler == NULL;
}

No thread is left to handle pings, so other clients get timed-out RPCs.

fixes: |
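To make the thresholds in the check quoted above concrete, here is a minimal user-space sketch of the same decision. It is only an illustration of the scenario, not the fix. The struct `svc_model`, its `has_hpreq_handler` flag, and the helpers `allow_normal()` and `normal_reqs_admitted()` are hypothetical names introduced for this example; only the three comparisons are taken from the kernel branch of `ptlrpc_server_allow_normal()`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the few ptlrpc_service fields the check reads. */
struct svc_model {
    int srv_threads_running;  /* service threads currently running */
    int srv_n_active_reqs;    /* requests being handled right now */
    int srv_n_active_hpreq;   /* how many of those are high priority */
    int has_hpreq_handler;    /* models svc->srv_hpreq_handler != NULL */
};

/* Same comparisons as the kernel branch of the function quoted above. */
static int allow_normal(const struct svc_model *svc, int force)
{
    assert(svc != NULL);

    /* Plenty of idle threads (or caller forces it): admit. */
    if (force ||
        svc->srv_n_active_reqs < svc->srv_threads_running - 2)
        return 1;

    /* At most one idle thread left: refuse normal requests. */
    if (svc->srv_n_active_reqs >= svc->srv_threads_running - 1)
        return 0;

    /* Exactly two idle threads: admit only if a high-priority request
     * is already active or the service has no high-priority handler. */
    return svc->srv_n_active_hpreq > 0 || !svc->has_hpreq_handler;
}

/* Count how many normal requests are admitted, one after another,
 * before the check starts refusing them (no request ever completes,
 * as in the overload scenario). */
int normal_reqs_admitted(int threads_running, int has_hpreq_handler)
{
    struct svc_model svc = {
        .srv_threads_running = threads_running,
        .has_hpreq_handler = has_hpreq_handler,
    };

    while (allow_normal(&svc, 0))
        svc.srv_n_active_reqs++;
    return svc.srv_n_active_reqs;
}
```

With 32 running threads and a high-priority handler registered, `normal_reqs_admitted(32, 1)` returns 30: long-running enqueues can occupy every slot open to normal requests, and the two remaining threads are held back for high-priority work. Any request that is not classified as high priority (the description implies pings fall into this group) is then refused until an enqueue completes.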
| Comments |
| Comment by Vitaly Fertman [ 20/Mar/12 ] |
| Comment by Andreas Dilger [ 05/Apr/12 ] |
|
What version of Lustre hit this problem, and what kind of workload blocked all of the MDS threads? |
| Comment by Nathan Rutman [ 05/Apr/12 ] |
|
Lustre 2.1 server, Lustre 1.8.6 clients.
Why can't they all be high priority? They should take 0 time to process on the MDS relative to anything else, and we want a responsive client mount command. And it's simpler code. |
| Comment by Nathan Rutman [ 18/May/12 ] |
|
This patch has been sitting here for two months with no review - what should we do with it? |
| Comment by Nathan Rutman [ 18/May/12 ] |
|
Sorry, one month. I didn't realize it had gone through some revisions. |
| Comment by Peter Jones [ 25/Jul/12 ] |
|
Landed for 2.3 |
| Comment by Christopher Morrone [ 02/Aug/12 ] |
|
Is this patch suitable for 2.1? I think we're seeing the same cascading client evictions there. |
| Comment by Cory Spitz [ 03/Aug/12 ] |
|
Chris, yes the patch is suitable for 2.1. Cray initially found this bug on 2.1 and Vitaly developed the fix for Xyratex's 2.1+patches: https://github.com/Xyratex/lustre-stable/commit/afcf3cf1091c67d076ef36dc0d73cd649f84421e. |
| Comment by Nathan Rutman [ 21/Nov/12 ] |
|
Xyratex MRP-455 |