Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.5.2
-
None
-
3
-
17503
Description
On our compute cluster we occasionally see a number of our standard processes run far slower than normal (taking >90s instead of the usual <10s).
The affected processes read and write files on Lustre; we have never seen them slow down this much when using other file systems. While they are slow, all the CPU cycles are spent in system time.
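To put a number on the "all CPU cycles are in system time" observation, something like the following could be run on an affected node. It is only a minimal sketch: it assumes the standard Linux `/proc/stat` layout (user, nice, system, idle, iowait, irq, softirq on the aggregate `cpu` line), and the helper names `read_cpu_times` and `system_share` are made up for illustration.

```python
def read_cpu_times(stat_text):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of jiffies."""
    fields = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return dict(zip(fields, map(int, parts[1:1 + len(fields)])))
    raise ValueError("no aggregate 'cpu' line found")


def system_share(before, after):
    """Fraction of non-idle CPU time spent in system mode between two samples."""
    delta = {key: after[key] - before[key] for key in before}
    # Busy time is everything except idle and iowait.
    busy = sum(v for k, v in delta.items() if k not in ("idle", "iowait"))
    return delta["system"] / busy if busy else 0.0


# Typical use (hypothetical 5 s sampling interval):
#   import time
#   a = read_cpu_times(open("/proc/stat").read())
#   time.sleep(5)
#   b = read_cpu_times(open("/proc/stat").read())
#   print("system share: %.0f%%" % (100 * system_share(a, b)))
```

A share close to 1.0 during a slow run, compared with a much lower value during a normal run, would back up the observation with data that can be attached to the ticket.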
If the processes are slow on a particular node, they usually stay slow on that node until we intervene manually. Currently the manual intervention consists of dropping all caches, which takes >90s on our nodes with 24GB of memory. Reducing the maximum amount that Lustre is allowed to cache, to ensure there is always free memory, doesn't improve the situation at all.
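For reference, the cache drop and its duration could be recorded roughly as follows. This is a sketch under assumptions, not our exact procedure: it assumes "dropping all caches" means writing `3` to `/proc/sys/vm/drop_caches` (page cache plus dentries and inodes) as root, and `timed_drop_caches` is a hypothetical helper name.

```python
import os
import time


def timed_drop_caches(path="/proc/sys/vm/drop_caches", mode="3"):
    """Write `mode` to the drop_caches control file and return elapsed seconds.

    Mode "3" asks the kernel to drop the page cache plus dentries and inodes.
    Needs root when pointed at the real /proc file.
    """
    os.sync()  # flush dirty pages first so more of the cache can be dropped
    start = time.monotonic()
    with open(path, "w") as control:
        control.write(mode)
    # The write only returns once the kernel has finished dropping caches.
    return time.monotonic() - start


# Typical use (as root):
#   print("drop_caches took %.1f s" % timed_drop_caches())
```

Logging this duration next to the job runtimes before and after the drop would show how the >90s cost of the intervention compares with the slowdown it removes.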
We have noticed LU-1784, which might be related. That ticket seems to suggest nothing has changed there recently; is this correct?
We have had quite a bit of success in making the application slow by copying many files and many GB of data from Lustre to Lustre until the cache is (nearly) full (as seen in /proc/fs/lustre/llite/*/max_cached_mb), but even this isn't always an indicator.
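The fill level mentioned above could be tracked with a small script such as the one below. It is a sketch that assumes the `max_cached_mb` file prints `key: value` lines including `max_cached_mb` and `used_mb`, which is what we believe this Lustre version emits; the function names and the example numbers in the comment are hypothetical.

```python
import glob


def parse_max_cached_mb(text):
    """Turn 'key: value' lines into a dict of ints, skipping anything else."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            stats[key.strip()] = int(value.strip())
    return stats


def cache_fill_ratio(stats):
    """used_mb / max_cached_mb, or 0.0 if the limit is missing or zero."""
    limit = stats.get("max_cached_mb", 0)
    return stats.get("used_mb", 0) / limit if limit else 0.0


def sample_all(pattern="/proc/fs/lustre/llite/*/max_cached_mb"):
    """Return {path: fill ratio} for every mounted Lustre client file system."""
    ratios = {}
    for path in glob.glob(pattern):
        with open(path) as handle:
            ratios[path] = cache_fill_ratio(parse_max_cached_mb(handle.read()))
    return ratios


# Example with made-up numbers:
#   parse_max_cached_mb("max_cached_mb: 12288\nused_mb: 6144\n")
#   gives {"max_cached_mb": 12288, "used_mb": 6144}, a fill ratio of 0.5
```

Sampling this periodically alongside job runtimes would show whether the slowdown actually tracks the fill level, which matters because a nearly full cache is not always an indicator.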
We are still working on a suitable reproducer, or at least a test case, that we can share. All we currently have involves custom software which we can't distribute.
While we are looking for a reproducer, we'd also like to know whether there is anything else we should watch out for on the Lustre side, any profiling we should do, or any other data we should gather.