Details
- Type: Bug
- Resolution: Duplicate
- Priority: Blocker
- Affects Version: Lustre 2.6.0
- 3
- 13338
Description
After commit 586e95a5b3f7b9525d78e7efc9f2949387fc9d54 we have a significant performance degradation in master. The attached graphs show the regression measured on Xeon Phi:
Attachments
- lu-4841.png (42 kB)
- lu-4841-1.png (46 kB)
- lu-4841-5.png (24 kB)
- perf_2.5.png (15 kB)
- perf_master.png (13 kB)
Activity
There are a few things to look at for this topic:
1. Lustre used to register a cache shrinker for cl_page, but destroying cl_page objects was so slow that memory could easily run out when the application was I/O intensive, so max_cached_mb on the OSC layer was revived to solve this problem. Now that the performance of destroying cl_page has improved significantly, we can probably revisit the cache shrinker option (see the sketch after this list);
2. Investigate the efficiency of SOFT_SYNC. The idea of SOFT_SYNC is good, but, probably due to the policy, it caused the problem of saturating OSTs. This is why patch 10003 was introduced, to disable unstable page tracking at the user's discretion;
3. Memory cache and readahead buffer. The readahead code lacks a way to know the current memory pressure, so useful pages are evicted by readahead, or readahead pages themselves are evicted by newer readahead pages. We need a feedback mechanism that throttles the readahead window size when memory is under pressure.
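For illustration only, here is a rough sketch of what re-registering a cl_page cache shrinker (item 1) might look like, using the generic Linux shrinker interface from 3.12+ kernels. The names cached_pages, cl_page_discard_some(), and the setup/cleanup functions are placeholders, not the real CLIO code, which would walk the per-OSC LRU lists:
```c
#include <linux/kernel.h>
#include <linux/shrinker.h>
#include <linux/atomic.h>

static atomic_long_t cached_pages;	/* placeholder for the cl_page LRU count */

/* placeholder: discard up to @nr idle cl_pages, return how many were freed */
static unsigned long cl_page_discard_some(unsigned long nr)
{
	unsigned long freed = min_t(unsigned long, nr,
				    atomic_long_read(&cached_pages));

	atomic_long_sub(freed, &cached_pages);
	return freed;
}

static unsigned long cl_page_shrink_count(struct shrinker *sk,
					  struct shrink_control *sc)
{
	/* tell the VM how many cached cl_pages could be reclaimed */
	return atomic_long_read(&cached_pages);
}

static unsigned long cl_page_shrink_scan(struct shrinker *sk,
					 struct shrink_control *sc)
{
	unsigned long freed = cl_page_discard_some(sc->nr_to_scan);

	return freed ? freed : SHRINK_STOP;
}

static struct shrinker cl_page_shrinker = {
	.count_objects	= cl_page_shrink_count,
	.scan_objects	= cl_page_shrink_scan,
	.seeks		= DEFAULT_SEEKS,
};

/* register at client setup, unregister at cleanup */
int cl_page_shrinker_setup(void)
{
	return register_shrinker(&cl_page_shrinker);
}

void cl_page_shrinker_cleanup(void)
{
	unregister_shrinker(&cl_page_shrinker);
}
```
Whether this is worthwhile now depends on how fast cl_page destruction really is; the shrinker only helps if discarding pages is cheap enough to keep up with reclaim.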
Jinshan, we need to revisit the changes in patch http://review.whamcloud.com/10003 and the original patches to see what can be done to avoid excessive memory throttling when the client doesn't have memory pressure (e.g. interactive nodes), while still throttling IO on clients when they are short of memory.
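To make the idea concrete, a purely hypothetical heuristic (not the code in 10003): only start throttling IO when the number of pinned unstable pages is large compared to the memory actually free on the node, so interactive clients with plenty of free memory are left alone. The counter and threshold below are illustrative:
```c
#include <linux/vmstat.h>
#include <linux/atomic.h>

static atomic_long_t unstable_pages;	/* pages written but not yet committed */

static bool client_should_throttle_io(void)
{
	/* global_page_state() as in 3.x kernels; newer kernels renamed it */
	unsigned long free_pages = global_page_state(NR_FREE_PAGES);
	unsigned long pinned = atomic_long_read(&unstable_pages);

	/* no memory pressure: plenty of free pages relative to what we pin */
	if (free_pages > 2 * pinned)
		return false;

	/* short of memory: make writers wait for OST commits */
	return true;
}
```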
This is not a problem that can possibly be fixed on the server side. Memory is fundamentally much faster than disk. We can easily exhaust memory on the client side, and there is no possible way that anyone can afford a system so fast that both the network and the servers can always keep up.
The 5 second txg timeout in ZFS is the txg maximum, not the typical case. Under normal load, ZFS will sync to disk much faster; it only waits the full 5 seconds when load is light. Under typical HPC loads, the speed at which servers can land data is bound by disk speed, not by the txg limit.
And so the only reasonable solution is for the client to be well behaved under the very normal real world situation where data is generated faster than it can be written to disk. Part of dealing with that situation is tracking "unstable" pages. Tracking those pages allows the Linux kernel to pause IO under low memory situations so that memory contention does not go into pathological modes.
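For reference, the mechanism being described is essentially the same accounting NFS does for its unstable pages; a simplified sketch (not the actual osc code) of how a client can make such pages visible to the kernel's writeback throttling on 3.x-era kernels:
```c
#include <linux/mm.h>
#include <linux/vmstat.h>

/* called when the pages of a write RPC are handed to the network */
static void account_unstable_pages(struct page **pages, int count)
{
	int i;

	for (i = 0; i < count; i++)
		inc_zone_page_state(pages[i], NR_UNSTABLE_NFS);
}

/* called when the OST's commit callback arrives for that transaction */
static void unaccount_unstable_pages(struct page **pages, int count)
{
	int i;

	for (i = 0; i < count; i++)
		dec_zone_page_state(pages[i], NR_UNSTABLE_NFS);
}
```
Because NR_UNSTABLE_NFS is counted against the dirty/writeback limits, balance_dirty_pages() can then pause writers when too much memory is pinned this way.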
In the real world, people buy client memory sized to fit their application. No one has the budget to buy double or triple the amount of ram for all their clients just to leave Lustre more buffer space.
I didn't mean that the memory is for the Lustre client to use as buffer space. For write RPCs, clients have to hold the written pages until the OST commits the corresponding transaction. Therefore, clients need extra memory to pin those pages, and applications cannot use it. For ZFS, the typical txg timeout is 5 seconds, which means clients will pin up to 5 seconds' worth of write data in memory; depending on the client's write throughput, this can be a lot (for example, at a sustained 1 GB/s that is roughly 5 GB of pinned pages per client).
There is really nothing we can do on the client side. We can probably do some tuning for ZFS: the I/O generated by Lustre is different from a generic workload, so we may look into the txg timeout or restrict the amount of memory used by the write cache.
By plenty, I meant there is plenty of available memory.
Yes, but how much memory is "plenty"? In the real world, memory is a finite resource. We cannot program with the assumption that there is always free memory. Lustre must behave reasonably by default when memory is under contention.
Unfortunately, the exact amount of extra memory required depends heavily on the performance and configuration of the OST.
No, it does not. Client memory is far faster than disk on a remote OST over the network pretty much by definition. Client memory under normal use cases is also under contention by actual applications, which are not represented by the naive tests that were used to create the graphs in this ticket.
In the real world, people buy client memory sized to fit their application. No one has the budget to buy double or triple the amount of ram for all their clients just to leave Lustre more buffer space.
Memory contention is normal in the real world, and Lustre's defaults should be chosen so that it functions reasonably under real-world usage.
By plenty, I meant there is plenty of available memory. Memory can be temporarily "lost" when a write has completed but the transaction is not yet committed. Therefore, if the client has enough extra memory available to hold UNSTABLE pages between two transactions, it should be able to sustain the highest possible write speed. Unfortunately, the exact amount of extra memory required depends heavily on the performance and configuration of the OST.
Unstable pages tracking should be turned off on I/O nodes with plenty of memory installed.
That statement is puzzling. How much memory is "plenty"? Our I/O nodes have 64GiB of RAM, which I would have thought would be considered "plenty".
But it also rather misses the point. In the real world, it doesn't matter how much memory is installed on the node. The people who designed the system presumably intended that memory to actually be used, not to sit idle all the time because Lustre has no sane memory management.
On an "I/O" node, that memory needs might need to be shared by function shipping buffers, system debuggers, system management tools, and other filesystem software. On normal HPC compute nodes the memory is going to be under contention with actual user applications, other filesystems, etc.
My point is that memory contention is a normal situation in the real world. It is not a corner case. If we treat it as a corner case, we'll be putting out a subpar product.
So item 2 only happens if unstable change tracking is enabled? Or all the time?
Only when unstable pages tracking is enabled.
And 3, is that only removed when unstable change tracking is disabled? Or is that removed when it is enabled as well?
It's removed.
Also, I wonder if the unstable page tracking setting is backwards. One option offers correctness under fairly typical HPC workloads, and the other sacrifices correctness for speed. Shouldn't correctness be the default, and the fast-but-pathologically-bad-in-low-memory-situations behavior be optional?
Unstable pages tracking should be turned off on I/O nodes with plenty of memory installed. I don't know what you mean by correctness. Actually, in the current implementation it doesn't get any feedback about system memory pressure; it sends SOFT_SYNC only when it's low on available LRU slots.
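In other words, the current trigger has roughly the following shape (illustrative only; the field names and the threshold are not the real osc code): the SOFT_SYNC hint on a write RPC is decided purely from the client's remaining LRU slots.
```c
/* illustrative only: decide the SOFT_SYNC hint from remaining LRU slots */
static bool osc_wants_soft_sync(long lru_slots_left, long lru_slots_max)
{
	/* e.g. hint the OST to commit sooner once most slots are used up */
	return lru_slots_left < lru_slots_max / 8;
}
```
Nothing in that predicate consults actual memory pressure, which is the feedback gap being discussed here.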
Yes, that is a serious memory management bug. We hit it at LLNL quite recently. Please open a separate ticket on that bug.