Details
- Bug
- Resolution: Fixed
- Minor
- Lustre 2.4.0, Lustre 2.1.4
- 3
- 8584
Description
Serialization of in-flight RPCs on mdc_rpc_lock makes non-modifying operations, such as calling open() on a directory, vulnerable to blocking due to a slow backend MDS filesystem. In particular, users may see long delays running ls when LDLM_ENQUEUE requests get blocked behind long-lived metadata-modifying MDS_REINT requests.
For example, in LU-3442, RPCs involving writing llog records experienced long service times due to a misbehaving backend filesystem. Therefore RPCs for operations like create(), unlink() and rename() would stay in flight for many seconds on the client. Unfortunately, these long-lived in-flight RPCs prevent LDLM_ENQUEUE requests for open() on a directory from being sent, due to the mdc_get_rpc_lock() call in mdc_enqueue(). Once issued, the LDLM_ENQUEUE request completes almost immediately since it doesn't involve synchronous I/O on the backend. It would be desirable if such non-modifying operations could be shielded from the effects of slow synchronous operations.
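The head-of-line blocking described above can be illustrated with a toy model. This is only a sketch, not Lustre code: `struct model_rpc`, `completion_serialized()` and `completion_shielded()` are hypothetical names, all requests are assumed to be queued at time 0 in array order, and service times are made-up millisecond values.

```c
#include <stddef.h>

/* Hypothetical model of a queued request; not a Lustre structure. */
struct model_rpc {
    long service_ms;  /* time the server needs to handle the request */
    int  modifying;   /* 1 = MDS_REINT-like, 0 = LDLM_ENQUEUE-like */
};

/*
 * Completion time of request idx when every request takes one shared
 * lock and holds it until its reply arrives, so all requests run
 * strictly one after another.
 */
long completion_serialized(const struct model_rpc *q, size_t idx)
{
    long t = 0;

    for (size_t i = 0; i <= idx; i++)
        t += q[i].service_ms;
    return t;
}

/*
 * Completion time of request idx when only modifying requests are
 * serialized against each other and non-modifying requests are sent
 * immediately.
 */
long completion_shielded(const struct model_rpc *q, size_t idx)
{
    long t = 0;

    if (!q[idx].modifying)
        return q[idx].service_ms;

    for (size_t i = 0; i <= idx; i++)
        if (q[i].modifying)
            t += q[i].service_ms;
    return t;
}
```

With two slow modifying requests of 10000 ms and 12000 ms queued ahead of a 5 ms non-modifying request, the serialized model finishes the last request at 22005 ms, while the shielded model finishes it at 5 ms.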
To that end, it would be helpful to clarify what the mdc_rpc_lock is protecting. mdc_enqueue() has this to say:
    /* It is important to obtain rpc_lock first (if applicable), so that
     * threads that are serialised with rpc_lock are not polluting our
     * rpcs in flight counter. We do not do flock request limiting, though*/
    if (it) {
            mdc_get_rpc_lock(obddev->u.cli.cl_rpc_lock, it);
            rc = mdc_enter_request(&obddev->u.cli);
but it's not clear to me what is meant by "polluting", or why the counter can't be protected by a separate lock that need not be held across the entire network request.
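To make the question concrete, here is a minimal userspace sketch of what I mean by a separate lock for the counter. It is not Lustre code and does not claim to show how mdc_enter_request() works: `struct rpc_slots` and the `rpc_slots_*()` functions are hypothetical, and pthreads stand in for kernel primitives. The point is only that the lock is held while the counter is updated, not while the request is on the wire.

```c
#include <assert.h>
#include <stddef.h>
#include <pthread.h>

/* Hypothetical in-flight limiter; not a Lustre structure. */
struct rpc_slots {
    pthread_mutex_t lock;       /* protects in_flight only */
    pthread_cond_t  slot_free;  /* signalled when a slot is released */
    int in_flight;
    int max_in_flight;
};

void rpc_slots_init(struct rpc_slots *s, int max_in_flight)
{
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->slot_free, NULL);
    s->in_flight = 0;
    s->max_in_flight = max_in_flight;
}

/* Non-blocking: returns 1 if a slot was taken, 0 if the limit is reached. */
int rpc_slots_try_enter(struct rpc_slots *s)
{
    int got = 0;

    pthread_mutex_lock(&s->lock);
    if (s->in_flight < s->max_in_flight) {
        s->in_flight++;
        got = 1;
    }
    pthread_mutex_unlock(&s->lock);
    return got;
}

/* Blocking: waits for a free slot, then returns with the lock dropped. */
void rpc_slots_enter(struct rpc_slots *s)
{
    pthread_mutex_lock(&s->lock);
    while (s->in_flight >= s->max_in_flight)
        pthread_cond_wait(&s->slot_free, &s->lock);
    s->in_flight++;
    pthread_mutex_unlock(&s->lock);
}

/* Called when the reply arrives; frees the slot and wakes one waiter. */
void rpc_slots_exit(struct rpc_slots *s)
{
    pthread_mutex_lock(&s->lock);
    assert(s->in_flight > 0);
    s->in_flight--;
    pthread_cond_signal(&s->slot_free);
    pthread_mutex_unlock(&s->lock);
}
```

A caller would do rpc_slots_enter(), send the request and wait for the reply with no lock held, then call rpc_slots_exit(). Whether this is compatible with whatever "polluting" refers to is exactly what I would like clarified.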
I also observe that the in-flight RPC counter for an OBD import rarely exceeds 1 or 2, and never approaches the upper limit of 8. So it seems we are not doing a good job of keeping a full pipeline of in-flight RPCs.
LLNL-bug-id: TOSS-2084
Issue Links
- is related to LU-3442 MDS performance degraded by reading of ZFS spacemaps (Resolved)
Activity
Fix Version/s | New: Lustre 2.5.0 [ 10295 ] | |
Resolution | New: Fixed [ 1 ] | |
Status | Original: Open [ 1 ] | New: Resolved [ 5 ] |
Labels | New: performance |
Assignee | Original: WC Triage [ wc-triage ] | New: Alex Zhuravlev [ bzzz ] |