[LU-17503] IO500: improve NRS TBF to sort requests by object offset for ior-hard-write
Created: 05/Feb/24
Updated: 05/Feb/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.17.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Andreas Dilger
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: hard

Issue Links:
Related
is related to LU-8433 Maximizing Bandwidth utilization by T... Open
is related to LU-17296 NRS TBF default rules Open
Severity: 3

Description

In the IO500 benchmark, the "ior-hard-write" phase simulates many threads writing to a single large file (e.g. writing out regions of a very large array from memory). The phase runs under a stonewall timer; once the timer expires, all threads must continue to write until each thread has written the same amount of data as the farthest write offset reached by any thread.
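The catch-up rule can be illustrated with a small calculation. This is only a sketch: the thread names and byte counts are invented, and it models "farthest write offset" simply as the largest amount written by any thread when the timer expires.

```python
def remaining_after_stonewall(written_bytes):
    """Return the bytes each thread still has to write after the stonewall.

    written_bytes maps a (hypothetical) thread name to the amount of data
    it had written when the stonewall timer expired.
    """
    # Every thread must catch up to the thread that got the farthest.
    target = max(written_bytes.values())
    return {name: target - done for name, done in written_bytes.items()}


MIB = 2**20
# Made-up numbers: one "early mover" and two threads stuck behind it.
at_stonewall = {"thread0": 40 * MIB, "thread1": 12 * MIB, "thread2": 8 * MIB}
tail = remaining_after_stonewall(at_stonewall)
```

The further one thread runs ahead of the rest, the larger the amount every other thread has to write after the timer expires.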

In the current implementation, some "early mover" jobs have a large advantage when writing to the file because they are granted DLM locks for non-conflicting regions of the file, and get far ahead of other writers that must contend for the DLM locks. This causes the "ior-hard-write" phase to take a long time due to a "long tail" in which threads need to "fill in" the large gaps in the file.

Having the NRS TBF request handler sort the RPCs by file offset (in addition to arrival time) and prioritize writes with smaller offsets over writes with larger offsets would slow down the faster writers and speed up the slower ones, until they are in lockstep. Processing the writes sequentially is also beneficial for managing the server cache and for merging IO requests before submission to the underlying filesystem, so it should improve aggregate performance even though some threads are deliberately slowed down.
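A toy model of the proposed ordering, for illustration only. This is not the NRS TBF code; `WriteRpc`, `OffsetOrderedQueue` and `drain_order` are hypothetical names, and the queue stands in for the pending writes to a single object. Requests are dequeued by smallest offset first, with arrival order as the tie-breaker.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class WriteRpc:
    # Comparison uses (offset, arrival): lower offset first, then older first.
    offset: int
    arrival: int
    client: str = field(compare=False)


class OffsetOrderedQueue:
    """Hypothetical per-object queue that serves the lowest offset first."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # monotonically increasing arrival stamp

    def enqueue(self, client, offset):
        heapq.heappush(self._heap, WriteRpc(offset, next(self._seq), client))

    def dequeue(self):
        return heapq.heappop(self._heap)

    def __len__(self):
        return len(self._heap)


def drain_order(arrivals):
    """Queue (client, offset) pairs in arrival order, return service order."""
    queue = OffsetOrderedQueue()
    for client, offset in arrivals:
        queue.enqueue(client, offset)
    served = []
    while len(queue):
        rpc = queue.dequeue()
        served.append((rpc.client, rpc.offset))
    return served


# The "fast" writer arrives first with high offsets; "slow" is filling a gap.
arrivals = [("fast", 4096), ("fast", 5120), ("slow", 0),
            ("slow", 1024), ("fast", 6144)]
order = drain_order(arrivals)
```

In this sketch the gap-filling writes at offsets 0 and 1024 are served before the early mover's writes, even though they arrived later. Arrival time acts only as a tie-breaker here; how the two keys should be weighted in a real policy is not specified in this ticket.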

The NRS ORR engine already exists to order requests within an object, but handling this in a single NRS TBF policy is preferred: ORR is missing much of the functionality of TBF, and tiered request sorting is unlikely to produce an optimal result.

