Details
- Type: Technical task
- Resolution: Won't Fix
- Priority: Minor
- None
- Affects Version/s: Lustre 2.1.0, Lustre 2.2.0
- 10219
Description
In LU-874, shared single-file IOR testing was run with 512 threads on 32 clients (16 cores per client), writing 128MB chunks to a file striped over 2 OSTs. This showed clients timing out on DLM locks. The threads on a single client write to disjoint parts of the file (i.e. each thread has its own DLM extent that is not adjacent to the extents written by other threads on that client).
For example, to reproduce this workload with 4 clients (A, B, C, D) against 2 OSTs (1, 2):
Client  ABCDABCDABCD...
OST     121212121212...
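This interleave follows from the round-robin RAID-0 stripe layout. As a rough illustration (not Lustre code, and assuming the stripe size equals the 128MB chunk size so each chunk lands entirely on one OST), the mapping is simply (offset / stripe_size) % stripe_count:
{code:c}
/* Sketch of how consecutive 128MB chunks map to OSTs under plain RAID-0
 * striping (not Lustre code).  Assumes the stripe size equals the 128MB
 * chunk size, so each chunk lands entirely on one OST. */
#include <stdio.h>

#define CHUNK_SIZE   (128ULL << 20)  /* 128MB per IOR chunk */
#define STRIPE_SIZE  (128ULL << 20)  /* assumed equal to the chunk size */
#define STRIPE_COUNT 2               /* file striped over 2 OSTs */
#define NUM_CLIENTS  4               /* reproduction uses clients A, B, C, D */

int main(void)
{
	int chunk;

	printf("chunk  offset(MB)  client  OST\n");
	for (chunk = 0; chunk < 12; chunk++) {
		unsigned long long offset = (unsigned long long)chunk * CHUNK_SIZE;
		int client = chunk % NUM_CLIENTS;   /* A, B, C, D, A, ... */
		int ost = (int)((offset / STRIPE_SIZE) % STRIPE_COUNT);

		printf("%5d  %10llu  %6c  %4d\n",
		       chunk, offset >> 20, 'A' + client, ost + 1);
	}
	return 0;
}
{code}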
While this IOR test is running, other tests are also running on different clients to create a very heavy IO load on the OSTs.
It may be that some DLM locks granted by the OST are not receiving any IO requests that would refresh their timeouts:
- because the number of active DLM locks on the client for a single OST is larger than the number of RPCs in flight, some of the locks may be starved of BRW RPCs sent under them to the OST, so their lock timeouts are never refreshed
- because of the IO ordering of BRW requests on the client, all of the pages for a lower-offset extent may be sent to the OST before any pages for a higher-offset extent are sent (a toy model of this ordering follows this list)
- the high-priority request queue on the OST may not be enough to help if several locks on the client for one OST are canceled at the same time
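To make the ordering hypothesis concrete, the toy model below (illustrative numbers only, not Lustre code) simulates a client that flushes dirty extents in ascending offset order with a fixed number of RPC slots. The higher-offset locks send no BRW RPCs at all until the lower extents are completely drained, which is exactly the window in which their lock timeouts can expire:
{code:c}
/* Toy model of the starvation described above (illustrative numbers, not
 * Lustre code): a client holds more dirty extent locks on one OST than it
 * has RPC slots, and BRW RPCs are always issued in ascending offset order. */
#include <stdio.h>

#define NUM_LOCKS          16  /* active extent locks on one OST (e.g. one per core) */
#define MAX_RPCS_IN_FLIGHT  8  /* client RPC slots toward that OST */
#define RPCS_PER_LOCK       4  /* BRW RPCs needed to flush each lock's extent */

int main(void)
{
	int remaining[NUM_LOCKS];
	int first_rpc_round[NUM_LOCKS];
	int round = 0, lock, slot;

	for (lock = 0; lock < NUM_LOCKS; lock++) {
		remaining[lock] = RPCS_PER_LOCK;
		first_rpc_round[lock] = -1;
	}

	/* Each round, fill the RPC slots with the lowest-offset extents that
	 * still have dirty pages (the offset-ordered behaviour above). */
	for (;;) {
		int busy = 0;

		round++;
		for (lock = 0, slot = 0;
		     lock < NUM_LOCKS && slot < MAX_RPCS_IN_FLIGHT; lock++) {
			if (remaining[lock] == 0)
				continue;
			if (first_rpc_round[lock] < 0)
				first_rpc_round[lock] = round;
			remaining[lock]--;
			slot++;
			busy = 1;
		}
		if (!busy)
			break;
	}

	for (lock = 0; lock < NUM_LOCKS; lock++)
		printf("lock %2d: first BRW RPC sent in round %d\n",
		       lock, first_rpc_round[lock]);
	return 0;
}
{code}
With these numbers, locks 0-7 get their first RPC in round 1, while locks 8-15 send nothing until round 5, after the lower extents are fully flushed.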
Some solutions that might help (individually, or in combination):
1. increase max_rpcs_in_flight to match the client core count; however, this is probably bad in the long run since it can dramatically increase the number of RPCs that each OST must handle at one time
2. always allow at least one BRW RPC in flight for each lock that is being canceled
3. prioritize ALL BRW RPCs for a blocked lock ahead of non-blocked BRW requests (e.g. a high-priority request queue on the client)
4. both (2) and (3) may be needed to avoid starvation as the client core count increases; a sketch of how they could combine follows this list
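As a rough illustration of how (2) and (3) could combine on the client, the sketch below uses a hypothetical lock_state structure in place of the real client I/O engine; only the selection policy is the point, and none of this is existing Lustre code.
{code:c}
/* Hedged sketch of options (2) and (3): when an RPC slot frees up, prefer
 * locks that are being canceled, and among those prefer any lock that
 * currently has no BRW RPC in flight, so every canceled lock keeps
 * refreshing its timeout.  The data structures and the example in main()
 * are made up for illustration. */
#include <stdio.h>

struct lock_state {
	int blocked;         /* blocking AST received, lock is being canceled */
	int rpcs_in_flight;  /* BRW RPCs currently in flight under this lock */
	int dirty_rpcs;      /* BRW RPCs still needed to flush this extent */
};

/* Pick the lock whose BRW RPC should be sent next; locks are indexed in
 * ascending file-offset order.  Returns -1 if nothing is ready to send. */
static int pick_next_brw(struct lock_state *locks, int nlocks)
{
	int i;

	/* (2) Any canceled lock with nothing in flight goes first, so each
	 * lock under cancellation always has at least one RPC moving. */
	for (i = 0; i < nlocks; i++)
		if (locks[i].blocked && locks[i].dirty_rpcs > 0 &&
		    locks[i].rpcs_in_flight == 0)
			return i;

	/* (3) Then any other canceled lock, ahead of unblocked writeback. */
	for (i = 0; i < nlocks; i++)
		if (locks[i].blocked && locks[i].dirty_rpcs > 0)
			return i;

	/* Otherwise fall back to the normal ascending-offset order. */
	for (i = 0; i < nlocks; i++)
		if (locks[i].dirty_rpcs > 0)
			return i;
	return -1;
}

int main(void)
{
	struct lock_state locks[8];
	int i, next;

	for (i = 0; i < 8; i++) {
		locks[i].blocked = 0;
		locks[i].rpcs_in_flight = 0;
		locks[i].dirty_rpcs = 2;
	}
	/* Made-up state: locks 5 and 6 have received blocking callbacks. */
	locks[5].blocked = 1;
	locks[6].blocked = 1;

	next = pick_next_brw(locks, 8);
	printf("next BRW RPC goes to lock %d, not lock 0\n", next);
	return 0;
}
{code}
With this policy a newly freed slot never goes to ordinary offset-ordered writeback while a canceled lock still has dirty pages to flush, which addresses the starvation case modeled earlier.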