[LU-6250] slow down of processing, cache related Created: 16/Feb/15 Updated: 07/Jun/16 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Frederik Ferner (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 17503 |
| Description |
|
On our compute cluster, a number of our standard processes are occasionally much slower than normal (taking >90s instead of the usual <10s). If the processes are slow on a particular node, they usually stay slow on that node until we intervene manually. Currently the manual intervention involves dropping all caches, which takes >90s on our nodes with 24GB of memory. Reducing the maximum amount that Lustre is allowed to cache, to ensure there is always free memory, doesn't improve the situation at all. We have noticed LU-1784, which might be related. That ticket seems to suggest nothing has changed there recently; is this correct? We have had quite a bit of success in making the application slow by copying many files and many GB of data from lustre to lustre until the swap is (nearly) full (as seen in /proc/fs/lustre/llite/*/max_cached_mb), but even this isn't always an indicator. We are still working on a suitable reproducer, or at least a test case, that we can share. All we currently have involves custom software which we can't distribute. While we are looking for a reproducer, we'd also be interested to hear whether there is anything else we should watch out for on the Lustre side, any profiling we should do, or any other data we should gather. |
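For watching how full the Lustre client cache is on a node, a minimal Python sketch along the following lines may help. It assumes that /proc/fs/lustre/llite/*/max_cached_mb prints "key: value" lines and that the keys include `max_cached_mb` and `used_mb`; those field names are an assumption and may differ between Lustre versions. The helper names (`parse_cache_stats`, `cache_fill_ratio`, `read_all`) are hypothetical and not part of Lustre.

```python
import glob


def parse_cache_stats(text):
    """Parse 'key: value' lines into a dict, keeping integer values only."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.lstrip("-").isdigit():
            stats[key.strip()] = int(value)
    return stats


def cache_fill_ratio(stats):
    """Return used_mb / max_cached_mb, or None if either field is missing.

    The field names are assumed, not confirmed for every Lustre release.
    """
    limit = stats.get("max_cached_mb")
    used = stats.get("used_mb")
    if not limit or used is None:
        return None
    return used / limit


def read_all(pattern="/proc/fs/lustre/llite/*/max_cached_mb"):
    """Read and parse the cache file of every mounted Lustre client."""
    result = {}
    for path in glob.glob(pattern):
        with open(path) as fh:
            result[path] = parse_cache_stats(fh.read())
    return result
```

Calling `read_all()` on a client and logging `cache_fill_ratio()` for each mount before and after a slow run would show whether the slowdown coincides with the cache reaching its limit.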
| Comments |
| Comment by Peter Jones [ 16/Feb/15 ] |
|
Bobijam, could you please advise on this one? Thanks, Peter |
| Comment by Andreas Dilger [ 17/Feb/15 ] |
|
Frederik, you wrote:
"until the swap is (nearly) full (as seen in /proc/fs/lustre/llite/*/max_cached_mb)"
Do you actually mean "swap is (nearly) full" or "cache is (nearly) full"? The kernel shouldn't swap out any pages from data files to the swap device, only pages with allocated variables from user executables. The kernel-internal memory cannot be swapped out either. Could you please add the contents of /proc/slabinfo and /proc/meminfo to the bug so we can see where the memory is being used? |
| Comment by Frederik Ferner (Inactive) [ 17/Feb/15 ] |
|
Andreas, thanks for catching my mistake. I did mean "cache is (nearly) full". These nodes use very little swap, if any. I'll add /proc/slabinfo and /proc/meminfo from a node in the slow state the next time we manage to reproduce it. |
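To make the requested /proc/meminfo and /proc/slabinfo data easier to compare between a normal and a slow node, a rough Python sketch such as the one below could summarise both files. It assumes the usual "Name: value kB" layout of /proc/meminfo and the version 2.x column order of /proc/slabinfo (name, active_objs, num_objs, objsize, ...). The function names are hypothetical, and the per-cache byte count is only an approximation (num_objs * objsize, ignoring slab overhead).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: number}.

    Most values are in kB; a few fields (e.g. HugePages_*) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def parse_slabinfo(text):
    """Return {cache_name: approximate bytes} from /proc/slabinfo text."""
    usage = {}
    for line in text.splitlines():
        # Skip the version banner and the column-description comment line.
        if line.startswith("#") or line.startswith("slabinfo"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        # Columns: name <active_objs> <num_objs> <objsize> ...
        name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
        usage[name] = num_objs * objsize
    return usage


def top_slabs(usage, n=10):
    """Return the n largest slab caches as (name, bytes), biggest first."""
    return sorted(usage.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Feeding the contents of both files to these functions on a slow node and on a healthy one, then comparing `top_slabs()` and the `Cached`, `Slab`, and `MemFree` fields, would show where the memory is going. Note that reading /proc/slabinfo typically requires root.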