[LU-6250] slow down of processing, cache related Created: 16/Feb/15  Updated: 07/Jun/16

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.2
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Frederik Ferner (Inactive) Assignee: Zhenyu Xu
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 17503

 Description   

On our compute cluster we occasionally see a number of our standard processes running far slower than normal (taking >90s instead of the usual <10s). The affected processes read and write files on Lustre; we have never seen them slow down this much on other file systems. While they are slow, all the CPU cycles are spent in system time.

If the processes are slow on a particular node, they usually remain slow on that node until we intervene manually. Currently the manual intervention involves dropping all caches, which takes >90s on our nodes with 24GB of RAM. Reducing the maximum amount of memory that Lustre is allowed to cache, to ensure there is always free memory, doesn't improve the situation at all.
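For reference, a sketch of the two knobs described above, assuming a standard Lustre 2.5 client (the writes need root, and the llite procfs path may differ on other versions):

```shell
# 1) Drop page cache, dentries and inodes -- the manual intervention
#    described above (root only, and itself slow on large-memory nodes):
#      sync && echo 3 > /proc/sys/vm/drop_caches
# 2) Cap the per-mount Lustre client cache, e.g. to 4096 MB (root only):
#      lctl set_param llite.*.max_cached_mb=4096
# Read back the current value/usage; falls back gracefully on nodes
# without Lustre mounts:
cat /proc/fs/lustre/llite/*/max_cached_mb 2>/dev/null || \
    echo "no llite mounts found"
```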

We have noticed LU-1784, which might be related? The ticket seems to suggest nothing has changed there recently; is this correct?

We have had quite a bit of success in making the application slow by copying many files and many GB of data from lustre to lustre until the swap is (nearly) full (as seen in /proc/fs/lustre/llite/*/max_cached_mb), but even this isn't always a reliable indicator.

We are still working on a suitable reproducer, or at least a test case, that we can share. All we currently have involves custom software which we can't distribute.

While we are looking for a reproducer, we'd also be interested to know whether there are any additional things we should watch out for on the Lustre side, any profiling to run, or any other data to gather.
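In the meantime, a minimal capture script run while a node is in the slow state could record the memory counters discussed in this ticket (the output path and filenames here are arbitrary; /proc/slabinfo is root-readable only on most kernels, so that capture is best-effort):

```shell
#!/bin/sh
# Capture memory diagnostics while a node is in the slow state.
out=/tmp/lustre-slow-capture
mkdir -p "$out"
cat /proc/meminfo > "$out/meminfo"
cat /proc/slabinfo > "$out/slabinfo" 2>/dev/null || true
vmstat 1 5 > "$out/vmstat" 2>/dev/null || true
echo "captured to $out"
```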



 Comments   
Comment by Peter Jones [ 16/Feb/15 ]

Bobijam

Could you please advise on this one?

Thanks

Peter

Comment by Andreas Dilger [ 17/Feb/15 ]

Frederik, you wrote:

We have had quite a bit of success in making the application slow by copying many files and many GB of data from lustre to lustre until the swap is (nearly) full (as seen in /proc/fs/lustre/llite/*/max_cached_mb), but even this isn't always a reliable indicator.

Do you actually mean "swap is (nearly) full" or "cache is (nearly) full"? The kernel shouldn't swap out any pages from data files to the swap device, only anonymous pages (e.g. allocated variables) from user executables. Kernel-internal memory cannot be swapped out either.

Could you please add the contents of /proc/slabinfo and /proc/meminfo to the bug so we can see where the memory is being used.

Comment by Frederik Ferner (Inactive) [ 17/Feb/15 ]

Andreas,

thanks for catching my mistake. I did mean "cache is (nearly) full". These nodes use only very little swap, if any.

I'll add /proc/slabinfo and /proc/meminfo captured in the slow state next time we manage to reproduce it.
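For the record, a quick check to confirm that swap really is mostly unused while the cache fills (a sketch; works on any Linux node):

```shell
# Print swap totals and current page-cache size from /proc/meminfo.
# On these nodes SwapFree should stay close to SwapTotal even when
# Cached grows large.
grep -E '^(SwapTotal|SwapFree|Cached):' /proc/meminfo
```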

Generated at Sat Feb 10 01:58:35 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.