[LU-4841] Performance regression in master for threads more than 2 Created: 31/Mar/14 Updated: 08/Feb/18 Resolved: 08/Feb/18 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Dmitry Eremin (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | llnl | ||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 | ||||
| Rank (Obsolete): | 13338 | ||||
| Description |
|
After commit 586e95a5b3f7b9525d78e7efc9f2949387fc9d54 we have a significant performance degradation in master. The following pictures show the regression measured on Xeon Phi: |
| Comments |
| Comment by Oleg Drokin [ 01/Apr/14 ] |
|
This is a very serious performance drop-off that we cannot afford to have in the 2.6 release. |
| Comment by Christopher Morrone [ 01/Apr/14 ] |
|
The unstable-pages tracking patch is not a "feature", it is the only thing that makes the client at all usable on our production workloads. We can draw you a performance graph with a line at zero if that helps. I can't emphasize enough how pathologically bad the client's behavior was without the unstable-tracking patch. It is not acceptable to release 2.6 with that problem either. I am completely open to a solution that avoids the pathologically bad behavior and is also reasonably fast. |
| Comment by Prakash Surya (Inactive) [ 01/Apr/14 ] |
|
Can we get a more complete report of the issue at hand? For example: * What do the client node(s) look like?
* x86_64? Size of RAM? Number of CPU cores? NIC bandwidth?
* What do the server node(s) look like?
* x86_64? etc...
* What's the lustre version of the servers?
* What's the filesystem of the servers?
* Any pertinent configuration details?
* Why do you feel that specific commit is to blame?
* What does the {clean,dirty,unstable} page statistics look like during the test?
There's so much more information needed to make any kind of accurate judgement here. It's really frustrating when a patch that we NEED in order to run on Sequoia continually runs the gauntlet without any reasonable discussion or explanation. Depending on the setup, this patch is expected to make performance worse, but that's to prevent an OOM and crash of the client. I would really like to work up a solution that works for both us and you guys, but it just takes way too much effort to do that with these vague reports, rash decisions, uncooperative review system, and overall lack of communication from the upstream developers. Throwing up a graph of a line with a few labels is not a proper bug report. I honestly can't make any reasonable conclusion from the information provided. I really wish the unstable page patches could have been properly discussed and designed back when I was actively engaged with them a year and a half ago. I apologize for the rant, and I'm not calling anybody out specifically, but being a downstream Lustre developer with real systems to maintain can be a very frustrating experience. |
| Comment by Oleg Drokin [ 01/Apr/14 ] |
|
The Xeon Phi node has 240 HT cores (only 60 or so real ones) and 8G of RAM. Certainly when the node runs low on memory, performance is expected to take a hit; whether this is the case here I'd defer to Dmitry. |
| Comment by Prakash Surya (Inactive) [ 01/Apr/14 ] |
|
Judging solely by the number of cores and the amount of RAM on the Xeon Phi, I would not be surprised if it's hitting the exact case we hit on Sequoia (with 64 HT cores and 16G of RAM). It should be fairly easy to spot-check this just by watching the unstable page accounting stats in the llite proc handler and looking at the client stacks for the process doing the IO. IIRC, the thread will sit on a wait queue if it hits Lustre's limit (should be easy to determine if it's the unstable waitq by looking at the stack trace), or it might be temporarily halted by the kernel (which should also be easy to spot looking at the thread's backtrace). With that said, it might be educational to run the "fast" test against a ZFS backend (I'm assuming ldiskfs was used to make the report). With the default 5s sync time of ZFS, writing at 1GB/s with 8GB of RAM, you might just hit the same exact OOM issues we were seeing on Sequoia. Maybe even bump the ZFS sync timeout to 10s and see if the client falls over due to it running out of memory. |
| Comment by Oleg Drokin [ 01/Apr/14 ] |
|
We'll see if Dmitry can do that. If it's just a case of early throttling, though, we'd need to fiddle with the logic here so that things are not stalled unless really necessary, since the test did not OOM in the past either. As for how this patch was singled out - Dmitry has actually run git bisect to find it. |
| Comment by Prakash Surya (Inactive) [ 01/Apr/14 ] |
|
Sure, I'm all for making things better; whether that's changing the logic or simply finding better-suited default values. I just want to make sure any changes are well thought out, and we're not jumping to wrongful conclusions. I don't think I did much testing of those patches against ldiskfs, nor did I spend much time looking for "optimal" default values, so there's probably a lot of room for improvement on those fronts. Were the servers running with the server-side portion of that patch set during testing? If the server is ignoring the soft-sync requests, it's definitely much more likely for a client to exhaust its unstable pages and stall (at least with a ZFS backend; I'm not too familiar with ldiskfs's syncing behavior). |
| Comment by Jinshan Xiong (Inactive) [ 01/Apr/14 ] |
|
Here is a patch to optionally disable unstable page tracking: http://review.whamcloud.com/8524 Whoever is suffering from the performance loss, please apply the patch while we're working on this. From my point of view, the problem with the current unstable page tracking algorithm is that it limits writing to a really small working set. By default, each OSC can cache up to 32MB of dirty pages, which is 32 RPCs if the RPC size is 1M. Therefore, a target can have at most 32 transactions in memory for each client. Since max_rpcs_in_flight is 8 by default, a client has to wait for a commit after 4 rounds of RPCs. One way to solve this problem is to fold unstable page accounting into the cache control algorithm, instead of the dirty page accounting we use now. Usually the cache memory for Lustre is much larger than 32M, which would give us a much larger working set for writing. We can verify this by setting osc.*.max_dirty_mb to 4096, for example. In the implementation, we can add cl_lru_busy for the number of pages in transfer; the cl_lru_busy count would be subtracted after the write RPCs are committed. We can then add cl_lru_list and cl_lru_busy together to check whether the cache memory limit is exceeded. |
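A rough, standalone model of the arithmetic above (not from the ticket itself; the 32 MB max_dirty_mb default, the 1 MB RPC size, and max_rpcs_in_flight=8 come from the comment, everything else is illustrative):

/* Model of the per-OSC write pipeline described above: with a 32 MB dirty cap
 * and 1 MB RPCs there are only 32 RPCs' worth of work per OSC, so with 8 RPCs
 * in flight the client stalls waiting for commits after about 4 rounds.
 * Raising max_dirty_mb enlarges the working set proportionally. */
#include <stdio.h>

static void model(long max_dirty_mb, long rpc_size_mb, long rpcs_in_flight)
{
        long rpcs_of_work = max_dirty_mb / rpc_size_mb;

        printf("max_dirty_mb=%-4ld -> %4ld RPCs of work, ~%ld rounds before waiting on commit\n",
               max_dirty_mb, rpcs_of_work, rpcs_of_work / rpcs_in_flight);
}

int main(void)
{
        model(32, 1, 8);    /* current default */
        model(4096, 1, 8);  /* the suggested experiment */
        return 0;
}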
| Comment by Prakash Surya (Inactive) [ 02/Apr/14 ] |
|
Hmm.. I'll have to revisit the implementation details to decide if I agree with that or not. But regardless, keep in mind that the reason the limit is there in the first place is to throttle these "pinned" pages. So if we simply raise the limit, then how is that any different from removing the limit altogether? Perhaps this issue really boils down to the fact that we have this limit on a per OSC basis when we really should have the limit be on a global scale (per client of course). For a system using a "large" number of OSCs, 32M each is a lot of data to have irreclaimable. For a real example, on Sequoia we have 768 OSTs, so that means a max of 24GB of data, which is already more than the total amount of RAM on the system (16GB). For a "small" OST-count job, raising the per OSC dirty size limit might make sense, but on a "large" OSC-count job it's effectively removing the limit altogether, putting us right back into the OOM scenario. IIRC, I tried to go down the route of having a global limit on the unstable pages (or maybe it was per filesystem, I can't remember), but that was shot down in favor of the per OSC limit. We could even drop the Lustre limit altogether and rely solely on the kernel to stall IO once it hits a certain threshold; I believe it will already do this if NFS_Unstable is used for this. But if we do that, we won't be sending the soft-sync requests to the server, which are essential to maintaining any kind of reasonable performance. Now that you mention it, the OSC count of the job would definitely play a factor here. How many OSTs/OSCs were being used by the original test? I still have no idea what the client workload was for the original test... |
| Comment by Dmitry Eremin (Inactive) [ 02/Apr/14 ] |
Client was a Xeon Phi card with 61 cores (244 hardware threads) and 8 GB memory. Mellanox FDR from the host was used. LNet self-test shows 2476 RPC/s and 1232.55 MB/s.
Servers: Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz, 16 cores, 64 GB memory.
Server build: jenkins-arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel-22-g3f0084f-PRISTINE-2.6.32-431.5.1.el6_lustre.g3f0084f.x86_64
Server filesystem: ldiskfs
OSTs: ost{0,1,2,3}, ost{4,5,6,7}, ost{8,9,10,11}, ost{12,13,14,15}, ost{16}
0 UP osd-ldiskfs MGS-osd MGS-osd_UUID 5
I used git bisect to find the exact commit that introduced this degradation.
I didn't get those statistics, but I observed that the application goes to a sleep state from time to time instead of "R" or "D" in top. This was measured by "iozone -i 0 -i 1 -M -C -r 1m -s 16g -t $i" |
| Comment by Jinshan Xiong (Inactive) [ 02/Apr/14 ] |
|
Hi Dmitry, can you please try increasing the dirty page allowance to see if it helps? You can run the following command: lctl set_param osc.*.max_dirty_mb=4096 on the client nodes to set the max dirty size to 4G, for example. Jinshan |
| Comment by Prakash Surya (Inactive) [ 02/Apr/14 ] |
|
Dmitry, I really appreciate the added information. Thanks. So it looks like the workload has a max of 17 OSTs to use, which is only about 500 MB of dirty space available. It sounds like it's working as designed, but designed poorly. I agree with Jinshan; increase the per OSC dirty limit and see if things get better. It's really starting to sound like the root of the issue is using a per OSC dirty limit. I think it'd be much more desirable to have a global limit of dirty space, say 20% of RAM, that all OSCs can use. Then a single OSC can use all 20% of RAM if needed, or that can be split up between multiple OSCs; this decision being driven completely by the workload of the client. |
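A small model of the per-OSC vs. global limit trade-off discussed in this comment (not from the ticket; the 20% figure, the 32 MB per-OSC default, and the OST counts and RAM sizes are taken from the thread, the rest is illustrative):

/* Aggregate pinnable data under a fixed 32 MB per-OSC cap versus a single
 * global cap of 20% of RAM, for the two systems mentioned in this thread:
 * the Xeon Phi client (17 OSTs, 8 GB RAM) and Sequoia (768 OSTs, 16 GB RAM). */
#include <stdio.h>

static void compare(const char *name, int num_osc, double ram_gb)
{
        double per_osc_total_gb = num_osc * 32.0 / 1024.0;  /* 32 MB per OSC */
        double global_gb = ram_gb * 0.20;                   /* 20% of RAM    */

        printf("%-8s: per-OSC caps total %6.1f GB, global 20%% cap %4.1f GB, RAM %4.0f GB\n",
               name, per_osc_total_gb, global_gb, ram_gb);
}

int main(void)
{
        compare("Xeon Phi", 17, 8);
        compare("Sequoia", 768, 16);
        return 0;
}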
| Comment by Oleg Drokin [ 03/Apr/14 ] |
|
I agree the dirty limit should be global. |
| Comment by Dmitry Eremin (Inactive) [ 04/Apr/14 ] |
|
I cannot increase the max dirty size beyond 1924MB on Xeon Phi:
osc.lustrefs-OST000f-osc-ffff8801e02cac00.max_dirty_mb=1925
error: set_param: setting /proc/fs/lustre/osc/lustrefs-OST000f-osc-ffff8801e02cac00/max_dirty_mb=1925: Numerical result out of range
The statistics during the write test with max_dirty_mb=1924 are as follows:
NFS_Unstable: 1157120 kB
llite.lustrefs-ffff8801e02cac00.unstable_stats=
unstable_pages: 289280
unstable_mb: 1130
The throughput for 8 initial writers is still bounded at 48957.79 KB/sec.
The full /proc/meminfo is as follows:
MemTotal: 7882352 kB
MemFree: 2060024 kB
Buffers: 0 kB
Cached: 3860324 kB
SwapCached: 0 kB
Active: 150276 kB
Inactive: 3733736 kB
Active(anon): 150192 kB
Inactive(anon): 91688 kB
Active(file): 84 kB
Inactive(file): 3642048 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 11964 kB
Writeback: 0 kB
AnonPages: 39676 kB
Mapped: 4312 kB
Shmem: 218880 kB
Slab: 1202232 kB
SReclaimable: 45752 kB
SUnreclaim: 1156480 kB
KernelStack: 13456 kB
PageTables: 1672 kB
NFS_Unstable: 1128536 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 3941176 kB
Committed_AS: 604580 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 347656 kB
VmallocChunk: 34359090472 kB
AnonHugePages: 34816 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 10236 kB
DirectMap2M: 8097792 kB |
| Comment by Dmitry Eremin (Inactive) [ 04/Apr/14 ] |
# lctl get_param llite.*.unstable_stats
llite.lustrefs-ffff8801e1440800.unstable_stats= unstable_pages: 302592 unstable_mb: 1182
# lctl get_param osc.*.unstable_stats
osc.lustrefs-OST0000-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 0 unstable_mb: 0
osc.lustrefs-OST0001-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 0 unstable_mb: 0
osc.lustrefs-OST0002-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 38144 unstable_mb: 149
osc.lustrefs-OST0003-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 36864 unstable_mb: 144
osc.lustrefs-OST0004-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 0 unstable_mb: 0
osc.lustrefs-OST0005-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 0 unstable_mb: 0
osc.lustrefs-OST0006-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 34304 unstable_mb: 134
osc.lustrefs-OST0007-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 32768 unstable_mb: 128
osc.lustrefs-OST0008-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 0 unstable_mb: 0
osc.lustrefs-OST0009-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 34560 unstable_mb: 135
osc.lustrefs-OST000a-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 38144 unstable_mb: 149
osc.lustrefs-OST000b-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 36096 unstable_mb: 141
osc.lustrefs-OST000c-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 0 unstable_mb: 0
osc.lustrefs-OST000d-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 0 unstable_mb: 0
osc.lustrefs-OST000e-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 0 unstable_mb: 0
osc.lustrefs-OST000f-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 36608 unstable_mb: 143
# lctl set_param osc.*.max_dirty_mb=1924
# lctl get_param llite.*.unstable_stats
llite.lustrefs-ffff8801e1440800.unstable_stats= unstable_pages: 286720 unstable_mb: 1120
# lctl get_param osc.*.unstable_stats
osc.lustrefs-OST0000-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 0 unstable_mb: 0
osc.lustrefs-OST0001-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 0 unstable_mb: 0
osc.lustrefs-OST0002-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 36864 unstable_mb: 144
osc.lustrefs-OST0003-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 34816 unstable_mb: 136
osc.lustrefs-OST0004-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 0 unstable_mb: 0
osc.lustrefs-OST0005-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 0 unstable_mb: 0
osc.lustrefs-OST0006-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 38144 unstable_mb: 149
osc.lustrefs-OST0007-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 36352 unstable_mb: 142
osc.lustrefs-OST0008-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 0 unstable_mb: 0
osc.lustrefs-OST0009-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 31488 unstable_mb: 123
osc.lustrefs-OST000a-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 34304 unstable_mb: 134
osc.lustrefs-OST000b-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 37888 unstable_mb: 148
osc.lustrefs-OST000c-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 0 unstable_mb: 0
osc.lustrefs-OST000d-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 0 unstable_mb: 0
osc.lustrefs-OST000e-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 0 unstable_mb: 0
osc.lustrefs-OST000f-osc-ffff8801e1440800.unstable_stats= unstable_check: 1 unstable_pages: 36864 unstable_mb: 144 |
| Comment by Dmitry Eremin (Inactive) [ 04/Apr/14 ] |
|
| Comment by Prakash Surya (Inactive) [ 05/Apr/14 ] |
|
I'd need to verify with the code, but I think even with lctl set_param osc.*.unstable_stats=0 we increment the NFS_Unstable field, and the kernel interprets that counter as "dirty" space. So even though Lustre's unstable "waiting" will be disabled, the kernel might still slow us down. Again, I'm going off of memory here, so I might be mistaken. With that said, having just over 1/8th of total RAM un-reclaimable due to Lustre's network IO traffic seems high. As Oleg said, there are going to be real applications running on the client which could (probably will) consume a lot of RAM as well. I'm not sure how to properly address this off hand. We're pinning these pages in the client cache for recovery purposes, but I wonder if that's strictly necessary. Obviously it wouldn't be "safe" in the face of a server crash, but for the common case, I don't think anybody would notice. Perhaps a mount flag stating pages can be dropped once they hit the server cache; evictions due to server failures would become more common, but it could potentially "help" the client and allow greater performance. Idk... Just brainstorming here... |
| Comment by Christopher Morrone [ 05/Apr/14 ] |
Actually, that part is pretty essential to Lustre's recovery mechanism. Unfortunately filesystem users are not satisfied with "we won't corrupt your data most of the time"; they insist that we avoid corruption and data loss by design. |
| Comment by Prakash Surya (Inactive) [ 05/Apr/14 ] |
|
I don't think anybody should be satisfied with "we won't corrupt your data most of the time"; I definitely would not be. Although, would it be OK to return an IO error? What happens now when a client does a write, gets back a successful completion, the server fails before it is committed, and then the client is evicted by the server and can't replay that outstanding write? I was envisioning a similar case, except the client would, in a sense, evict itself (i.e. it can't perform the replay because it dropped those pages). I think I need to better understand exactly what the bottleneck is here and why. Dmitry, did you say the servers were running with the soft-sync server-side portion of that patch set? I can't recall, and didn't see you mention it while quickly flipping back through the comments. |
| Comment by Dmitry Eremin (Inactive) [ 09/Apr/14 ] |
|
No, servers were just regular 2.5.57 and 2.4.3 as I remember. I didn't patch them. |
| Comment by Jinshan Xiong (Inactive) [ 17/Apr/14 ] |
|
I'm working on this issue. |
| Comment by Andreas Dilger [ 17/Apr/14 ] |
|
Prakash, I believe the servers are running the soft-sync code, since that has landed on master. Dmitry, in your latest test results:
According to the graphs, 700MB is where the performance was for 2.5 with 8 writers. It wasn't clear from your comment whether you thought this means the patch is doing its job correctly by disabling the unstable page throttling, or whether you thought this was a regression from the 2.5 performance? |
| Comment by Dmitry Eremin (Inactive) [ 17/Apr/14 ] |
|
There were two versions of servers: one was the latest master, the other was 2.4.3. The behavior was the same. To be clear, with Jinshan's patch applied and unstable_stats unset,
the performance becomes the same as in the 2.5 branch. So the patch fixes the performance regression, and unstable page throttling can still be enabled if needed. |
| Comment by Andreas Dilger [ 17/Apr/14 ] |
|
Assign to Jinshan, since he reported he is working on this issue. |
| Comment by Jinshan Xiong (Inactive) [ 17/Apr/14 ] |
|
After taking a further look, it turned out that the attempt to tally the NR_UNSTABLE_NFS count kills performance. In osc_[inc|dec]_unstable_pages, it does: for (i = 0; i < page_count; i++)
[inc|dec]_zone_page_state(desc->bd_iov[i].kiov_page, NR_UNSTABLE_NFS);
Therefore, to handle a normal 256-page RPC, it has to perform atomic operations 512 times. After I commented that piece of code out, I couldn't see any noticeable performance regression. There is another problem with the current implementation of unstable page accounting - it decides whether to send the SOFT_SYNC flag in osc_over_unstable_soft_limit(), where it checks how many dirty pages exist: obd_upages = atomic_read(&obd_unstable_pages);
obd_dpages = atomic_read(&obd_dirty_pages);
osc_upages = atomic_read(&cli->cl_unstable_count);
return osc_upages != 0 &&
obd_upages >= (obd_max_dirty_pages - obd_dpages) / 2;
In the default configuration, obd_max_dirty_pages is set to 1/2 of total RAM and the per-client dirty limit is set to 32MB. This will cause two problems. Therefore, I'd like to control unstable pages in the cache mechanism. Prakash, what do you think? |
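To put numbers on the check quoted above, a standalone back-of-the-envelope calculation (not from the ticket; it assumes the 8 GB client and 17 OSCs from this thread, few dirty pages, and that unstable pages count against the per-OSC dirty cap as in the patch under discussion):

/* SOFT_SYNC threshold from the quoted condition vs. how much data the per-OSC
 * dirty caps let the client have in flight at once. */
#include <stdio.h>

int main(void)
{
        const long ram_mb = 8 * 1024;             /* Xeon Phi client            */
        const long obd_max_dirty_mb = ram_mb / 2; /* default: half of total RAM */
        const long obd_dirty_mb = 0;              /* assume few dirty pages     */
        const long per_osc_dirty_mb = 32;         /* default osc.*.max_dirty_mb */
        const int  num_osc = 17;

        printf("SOFT_SYNC threshold: ~%ld MB of unstable pages\n",
               (obd_max_dirty_mb - obd_dirty_mb) / 2);
        printf("per-OSC dirty caps allow at most: ~%ld MB in flight\n",
               num_osc * per_osc_dirty_mb);
        return 0;
}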
| Comment by Jinshan Xiong (Inactive) [ 17/Apr/14 ] |
|
patch to disable kernel unstable page accounting is at: http://review.whamcloud.com/10003 |
| Comment by Jinshan Xiong (Inactive) [ 18/Apr/14 ] |
|
Patch 10003 is revised to implement my idea above. Please take a look. Dmitry, please benchmark this patch. Thanks. |
| Comment by Prakash Surya (Inactive) [ 21/Apr/14 ] |
So, all that is needed to get the performance back is to remove these kernel calls? If that's the case, I don't completely agree with Jinshan's patch as it stands now (revision 2). First off, we should try to maintain the kernel's NR_UNSTABLE_NFS value as best we can. That's there for a reason, and it will allow the kernel to throttle (i.e. halt) IO if necessary. In general, we need to make an effort to work with the kernel and use the interfaces it exposes to us. If we ignore NR_UNSTABLE_NFS, then we run the risk of OOM-ing if the server decides not to sync the client's data. Think about a case where an OST's storage goes offline, but it's still accepting network traffic; the client can potentially fill all of its RAM with unstable pages, right? I don't see anything to stop this from happening, but maybe I'm overlooking something.
Actually, SOFT_SYNC will be set at obd_max_dirty_pages / 2. So with obd_max_dirty_pages = 1/2 of RAM, then SOFT_SYNC will be set at 1/4 of RAM. Why is this a problem?? How does keying off of "max_cached_mb" make things any better?
What exactly is the point of the "cache control mechanism"? I think we need to clearly and explicitly document what its intended purpose is. I thought it was to limit clean pages on the client. Am I mistaken? Does it have a different purpose? Honestly, I'm against relying on "max_cached_mb" more than is strictly necessary. If the Lustre client were more Linux-friendly, none of that infrastructure would be needed at all; so I'd rather not use it here. Which brings me back to my original question: is the performance degradation completely attributable to my sloppy handling of the [inc|dec]_zone_page_state calls? I did most of my performance testing of this patch on PPC nodes with 64K pages, so the overhead would have been less for me on those nodes. |
| Comment by Jinshan Xiong (Inactive) [ 21/Apr/14 ] |
Actually this won't happen. If the server refuses, for whatever reason, to flush transactions, the client will stop generating more cached pages and wait for free LRU slots to become available. Before this patch was introduced, it could indeed cause the OOM problem you mentioned - in the case where the number of unstable pages plus cached pages is close to total RAM. The reason for having max_cached_mb is that Lustre's cached pages are considered overhead. In other words, as long as Lustre keeps working properly, customers usually want Lustre to cache as few pages as possible so that applications can use as much memory as possible.
This was exactly what we thought before. We used to rely on the Linux kernel to release Lustre pages, which is exactly why we removed Lustre's cache control (i.e., LRU) in early versions of Lustre 2.x. It turned out not to work, so I revived a cache control mechanism for 2.x. The reason it didn't work is that the Linux kernel uses a per-zone page-releasing thread, so mostly a single thread is in charge of dropping clean pages. That works for the slow-IO case, but Lustre can produce several gigabytes per second of IO; that thread was simply overwhelmed.
Yes, this is what I have seen on the OpenSFS cluster. After I removed these lines, I couldn't see any noticeable performance regression. However, there were some problems with the network on the OpenSFS cluster and the QDR speed can't exceed 2GB/s, so I didn't have a chance to try the patch in a faster environment. |
| Comment by Prakash Surya (Inactive) [ 22/Apr/14 ] |
Is this because unstable pages also count against the "max_cached_mb" allowance?
Well, the "caching pages" should be able to be pitched when the system gets low on memory, while only the unstable pages are pinned. Also, since we're using NR_UNSTABLE_NFS, the kernel will prevent us from an OOM. The NR_UNSTABLE_NFS value is counted as "dirty" pages and the kernel will halt IO in balance_dirty_pages() with a call trace like so: fio S 00000fffae72633c 0 59338 59283 0x00000000
Call Trace:
[c0000003e0deed20] [c0000003e0deede0] 0xc0000003e0deede0 (unreliable)
[c0000003e0deeef0] [c000000000008e10] .__switch_to+0xc4/0x100
[c0000003e0deef80] [c00000000042b0e0] .schedule+0x858/0x9c0
[c0000003e0def230] [c00000000042b7c8] .schedule_timeout+0x1f8/0x240
[c0000003e0def310] [c00000000042a444] .io_schedule_timeout+0x54/0x98
[c0000003e0def3a0] [c00000000009ddfc] .balance_dirty_pages+0x294/0x390
[c0000003e0def520] [c000000000095a2c] .generic_file_buffered_write+0x268/0x354
[c0000003e0def660] [c000000000096074] .__generic_file_aio_write+0x374/0x3d8
[c0000003e0def760] [c000000000096150] .generic_file_aio_write+0x78/0xe8
[c0000003e0def810] [8000000006a7062c] .vvp_io_write_start+0xfc/0x3e0 [lustre]
[c0000003e0def8e0] [800000000249a81c] .cl_io_start+0xcc/0x220 [obdclass]
[c0000003e0def980] [80000000024a2634] .cl_io_loop+0x194/0x2c0 [obdclass]
[c0000003e0defa30] [80000000069ea208] .ll_file_io_generic+0x498/0x670 [lustre]
[c0000003e0defb30] [80000000069ea864] .ll_file_aio_write+0x1d4/0x3a0 [lustre]
[c0000003e0defc00] [80000000069eab80] .ll_file_write+0x150/0x320 [lustre]
[c0000003e0defce0] [c0000000000d1e9c] .vfs_write+0xd0/0x1c4
[c0000003e0defd80] [c0000000000d208c] .SyS_write+0x54/0x98
[c0000003e0defe30] [c000000000000580] syscall_exit+0x0/0x2c
OK, thanks for the clarification. I just wanted to make sure I understood the intention of max_cached_mb. I'm well aware that it's a necessary evil for the time being due to the high cost of Lustre client page reclamation. Since we both seem to agree that in an ideal world max_cached_mb would not exist, I suggest we try to refrain from using it as much as possible. It'd be really nice to eventually get to a position where it can be removed, and we'll never get there if we start to rely on it more and more as the client is worked on.
If that's the case, can we just make the *_zone_page_state calls smarter instead of making larger (potentially disruptive) changes? Maybe there's a way to group the per-page calls into a single (or a few) call(s) per RPC? I'm looking at the implementation of the *_zone_page_state calls, though, and that might be tougher than I thought.... I'd like to keep that tracking around, but doing a local_irq_[save|restore] and a couple of atomic_long_add()'s per page doesn't seem ideal either. Maybe we can leverage mod_zone_page_state() instead? Something like: /* These need to be implemented */
int get_index_from_zone(struct zone *zone);
struct zone *get_zone_from_index(int index);
int zone_deltas[MAX_NR_ZONES] = {0};
for (i = 0; i < page_count; i++) {
        int zone_idx = get_index_from_zone(page_zone(desc->bd_iov[i].kiov_page));
        zone_deltas[zone_idx] += 1; /* NR_UNSTABLE_NFS is counted in pages */
}
for (i = 0; i < MAX_NR_ZONES; i++) {
        struct zone *z = get_zone_from_index(i);
        mod_zone_page_state(z, NR_UNSTABLE_NFS, zone_deltas[i]);
}
Do you think something like that would be OK (or even possible)? |
| Comment by Jinshan Xiong (Inactive) [ 23/Apr/14 ] |
|
It's worth a shot. I will try it out tomorrow. Anyway, we should track Unstable pages in LRU. |
| Comment by Jinshan Xiong (Inactive) [ 23/Apr/14 ] |
|
Hi Prakash, The reason I want Unstable pages under control is that max_cached_mb should be strictly honored. It defines how much memory Lustre can use (right now) for its data cache. Unstable pages may amount to several gigabytes, which I think is unacceptable in some cases. Accounting Unstable pages may help a little bit, though - the major problem with it is that if dirty pages exceed the allowance, the Linux kernel can only wait there to throttle writing; ideally it should provide and call a BDI-specific callback to flush the disk cache. That is a problem for Lustre because we don't get a chance to issue SOFT_SYNC, so the client will have to wait for the transaction to commit, which is typically ~10 seconds. Anyway, if it won't affect performance, I will agree to still account Unstable pages. I will give it a try tomorrow. |
| Comment by Jinshan Xiong (Inactive) [ 23/Apr/14 ] |
|
Hi Prakash, With the patch below, which can be applied on top of patch 10003, I see about a 5% performance regression. I did the test on an x86_64 machine with only one NUMA node, so mostly the pages belong to the same zone.
diff --git a/lustre/osc/osc_page.c b/lustre/osc/osc_page.c
index 0f43417..7657909 100644
--- a/lustre/osc/osc_page.c
+++ b/lustre/osc/osc_page.c
@@ -891,6 +891,36 @@ out:
RETURN(rc);
}
+#ifdef __KERNEL__
+static inline void unstable_page_accounting(struct ptlrpc_bulk_desc *desc, int fact)
+{
+ obd_count page_count = desc->bd_iov_count;
+ void *zone = NULL;
+ int count = 0;
+ int i;
+
+ for (i = 0; i < page_count; i++) {
+ void *pz = page_zone(desc->bd_iov[i].kiov_page);
+
+ if (pz == zone) {
+ ++count;
+ continue;
+ }
+
+ if (count > 0) {
+ mod_zone_page_state(zone, NR_UNSTABLE_NFS, fact * count);
+ count = 0;
+ }
+ zone = pz;
+ ++count;
+ }
+ if (count > 0)
+ mod_zone_page_state(zone, NR_UNSTABLE_NFS, fact * count);
+}
+#else
+#define unstable_page_accounting(desc, fact)
+#endif
+
/* Performs "unstable" page accounting. This function balances the
* increment operations performed in osc_inc_unstable_pages. It is
* registered as the RPC request callback, and is executed when the
@@ -907,6 +937,7 @@ void osc_dec_unstable_pages(struct ptlrpc_request *req)
return;
LASSERT(page_count >= 0);
+ unstable_page_accounting(desc, -1);
atomic_sub(page_count, &cli->cl_cache->ccc_unstable_nr);
LASSERT(atomic_read(&cli->cl_cache->ccc_unstable_nr) >= 0);
@@ -933,6 +964,7 @@ void osc_inc_unstable_pages(struct ptlrpc_request *req)
return;
LASSERT(page_count >= 0);
+ unstable_page_accounting(desc, 1);
LASSERT(atomic_read(&cli->cl_cache->ccc_unstable_nr) >= 0);
atomic_add(page_count, &cli->cl_cache->ccc_unstable_nr);
|
| Comment by Jodi Levi (Inactive) [ 29/Apr/14 ] |
| Comment by Dmitry Eremin (Inactive) [ 16/May/14 ] |
|
|
| Comment by Prakash Surya (Inactive) [ 20/May/14 ] |
|
Dmitry, can you get stacks of the client threads and see if you're getting hung up in balance_dirty_pages()? I set up a small test cluster of VMs (4 clients, 2 OSS, 1 MDS, lustre 2.4.0-28chaos), and it's very clear that I'm getting throttled back by the kernel due to the NR_UNSTABLE_NFS accounting. The interesting bit is that if I take that accounting out of Lustre, I haven't seen any OOM events on the clients. I have a feeling the Lustre client just isn't set up to play nice with the Linux kernel, and we're going to have to make serious changes to get there; although that's just a hunch. It might make sense to make the accounting of NR_UNSTABLE_NFS tunable, and potentially disabled by default. It works well on our Sequoia clients, but I think it needs some work to work well in general. |
| Comment by Andreas Dilger [ 22/May/14 ] |
|
Prakash, we could go further and set the default to "on" for PPC nodes (or something more specific to BGL clients that can be determined at compile time), and "off" for non-PPC nodes. This should probably only be a module parameter, so that it cannot be changed at runtime to avoid complexity. |
| Comment by Christopher Morrone [ 22/May/14 ] |
|
I don't like that solution. The main difference on the BG/Q client (vs the x86_64 nodes you have more experience with) is that there are many low-power cores. We've already hit the transistor scaling wall, and all architectures are moving to a many-low-power-cores model. This is the harbinger of things to come, not a one-off issue that can be brushed under the carpet. Setting the default based on architecture would be the wrong way to do it. Lower core-count, lower-load nodes (like our PPC login nodes) are going to behave very much like the x86_64 nodes, in all likelihood. Lustre should make good design scaling decisions rather than forcing compile-time options. |
| Comment by Andreas Dilger [ 23/May/14 ] |
|
Chris, my comment was only in reply to Prakash's previous one where he was suggesting that the NR_UNSTABLE_NFS accounting should be off by default. I was proposing that this be improved so that you do not need to manually configure it for your BGL nodes. I'm not sure that the number of cores is the primary issue, but rather the amount of memory on those IO nodes compared to how quickly the memory is dirtied. Jinshan's latest patch reduces the overhead with NR_UNSTABLE_NFS accounting enabled, which is good even for BGL, but AFAIK it still does not return performance to the same level as without this accounting. Obviously I'd rather that this issue be resolved with a single solution that works optimally everywhere, but not at the cost of a serious performance regression on all the other systems Lustre runs on. I'd hope that we can find some better long-term solution for this, but the time has run out for major surgery in 2.6. |
| Comment by Christopher Morrone [ 23/May/14 ] |
We probably have more memory for IO on the BGQ IONs than normal compute nodes, but yes, we fill it fast and are frequently churning through pages. Our non-BG-related data mover nodes are x86_64, but will have pretty much the same issues. But I suppose we are probably saved by the fact that single client metadata change rates are terrible, so bulk IO throughput rarely comes under pressure. Real compute nodes are highly constrained for IO memory because they have real applications running on them that want most of the memory for themselves. My point is that constrained memory resources are actually the norm in the real world, not the exception. But I can see that I'm fighting a losing battle. So ok, we'll concede to a custom tuning. Now how do you propose to ensure that Lustre continues to be tested and function with that tuning enabled? |
| Comment by Christopher Morrone [ 23/May/14 ] |
|
Actually, we'll never use Lustre 2.6. So if you want to take it out, you can just do that. But you will need to open a new ticket and commit to a more general fix in time for 2.7. Does that sound reasonable? |
| Comment by Jinshan Xiong (Inactive) [ 23/May/14 ] |
|
I value this piece of work because it fixes a data corruption issue when a client is unmounted while it still has uncommitted RPCs. I'm going to work with our guys to do some performance tuning. |
| Comment by Dmitry Eremin (Inactive) [ 26/May/14 ] |
|
Writers stack:
[<ffffffffa09cce44>] osc_enter_cache+0x48d/0x999 [osc]
[<ffffffffa09c6138>] osc_queue_async_io+0xa58/0x17b0 [osc]
[<ffffffffa09b10b7>] osc_page_cache_add+0x47/0x120 [osc]
[<ffffffffa09b82ea>] osc_io_commit_async+0x10a/0x3b0 [osc]
[<ffffffffa0450b17>] cl_io_commit_async+0x77/0x130 [obdclass]
[<ffffffffa082f581>] lov_io_commit_async+0x271/0x4c0 [lov]
[<ffffffffa0450b17>] cl_io_commit_async+0x77/0x130 [obdclass]
[<ffffffffa0905796>] vvp_io_write_commit+0xf6/0x7e0 [lustre]
[<ffffffffa0905fb6>] vvp_io_write_start+0x136/0x370 [lustre]
[<ffffffffa0451de5>] cl_io_start+0x65/0x130 [obdclass]
[<ffffffffa0455294>] cl_io_loop+0xb4/0x1b0 [obdclass]
[<ffffffffa08ab1d8>] ll_file_io_generic+0x238/0x6d0 [lustre]
[<ffffffffa08ab8f5>] ll_file_aio_write+0xe5/0x1f0 [lustre]
[<ffffffffa08abb30>] ll_file_write+0x130/0x240 [lustre]
[<ffffffff810dfb9d>] vfs_write+0xac/0xf3
[<ffffffff810dfd8c>] sys_write+0x4a/0x6e
[<ffffffff81002a6b>] system_call_fastpath+0x16/0x1b
Readers stack:
[<ffffffff8109f851>] sync_page+0x4d/0x51
[<ffffffff810a012f>] sync_page_killable+0xe/0x3b
[<ffffffff8109f788>] __lock_page_killable+0x66/0x68
[<ffffffff810a1c2f>] generic_file_aio_read+0x657/0x864
[<ffffffffa0904066>] vvp_io_read_start+0x2f6/0x460 [lustre]
[<ffffffffa0451de5>] cl_io_start+0x65/0x130 [obdclass]
[<ffffffffa0455294>] cl_io_loop+0xb4/0x1b0 [obdclass]
[<ffffffffa08ab1d8>] ll_file_io_generic+0x238/0x6d0 [lustre]
[<ffffffffa08abd22>] ll_file_aio_read+0xe2/0x1f0 [lustre]
[<ffffffffa08abf60>] ll_file_read+0x130/0x240 [lustre]
[<ffffffff810dfc8d>] vfs_read+0xa9/0xf0
[<ffffffff810dfd1e>] sys_read+0x4a/0x6e
[<ffffffff81002a6b>] system_call_fastpath+0x16/0x1b |
| Comment by Dmitry Eremin (Inactive) [ 26/May/14 ] |
|
For the pure master client and a client with patch set 3, the stacks are as follows:
Writers stack:
[<ffffffff810a86dd>] balance_dirty_pages_ratelimited_nr+0x329/0x3a4
[<ffffffff810a2dc1>] generic_file_buffered_write+0x975/0xa9e
[<ffffffff810a3129>] __generic_file_aio_write+0x23f/0x270
[<ffffffff810a31b2>] generic_file_aio_write+0x58/0xaa
[<ffffffffa08f710d>] vvp_io_write_start+0xfd/0x390 [lustre]
[<ffffffffa03255c5>] cl_io_start+0x65/0x130 [obdclass]
[<ffffffffa0328a74>] cl_io_loop+0xb4/0x1b0 [obdclass]
[<ffffffffa0896be9>] ll_file_io_generic+0x3b9/0x860 [lustre]
[<ffffffffa08972f6>] ll_file_aio_write+0xd6/0x1e0 [lustre]
[<ffffffffa089751c>] ll_file_write+0x11c/0x230 [lustre]
[<ffffffff810dfb9d>] vfs_write+0xac/0xf3
[<ffffffff810dfd8c>] sys_write+0x4a/0x6e
[<ffffffff81002a6b>] system_call_fastpath+0x16/0x1b
Readers stack:
[<ffffffff8109f851>] sync_page+0x4d/0x51
[<ffffffff810a012f>] sync_page_killable+0xe/0x3b
[<ffffffff8109f788>] __lock_page_killable+0x66/0x68
[<ffffffff810a1c2f>] generic_file_aio_read+0x657/0x864
[<ffffffffa08f5216>] vvp_io_read_start+0x2f6/0x460 [lustre]
[<ffffffffa03255c5>] cl_io_start+0x65/0x130 [obdclass]
[<ffffffffa0328a74>] cl_io_loop+0xb4/0x1b0 [obdclass]
[<ffffffffa0896be9>] ll_file_io_generic+0x3b9/0x860 [lustre]
[<ffffffffa0897703>] ll_file_aio_read+0xd3/0x1e0 [lustre]
[<ffffffffa089792c>] ll_file_read+0x11c/0x230 [lustre]
[<ffffffff810dfc8d>] vfs_read+0xa9/0xf0
[<ffffffff810dfd1e>] sys_read+0x4a/0x6e
[<ffffffff81002a6b>] system_call_fastpath+0x16/0x1b |
| Comment by Dmitry Eremin (Inactive) [ 28/May/14 ] |
|
Writers stack:
[<ffffffffa09ba014>] osc_enter_cache+0x48d/0x999 [osc]
[<ffffffffa09b32f8>] osc_queue_async_io+0xa58/0x1780 [osc]
[<ffffffffa099e1d7>] osc_page_cache_add+0x47/0x120 [osc]
[<ffffffffa09a54fa>] osc_io_commit_async+0x10a/0x3b0 [osc]
[<ffffffffa03242f7>] cl_io_commit_async+0x77/0x130 [obdclass]
[<ffffffffa081d581>] lov_io_commit_async+0x271/0x4c0 [lov]
[<ffffffffa03242f7>] cl_io_commit_async+0x77/0x130 [obdclass]
[<ffffffffa08f38d6>] vvp_io_write_commit+0xf6/0x7e0 [lustre]
[<ffffffffa08f40d4>] vvp_io_write_start+0x114/0x390 [lustre]
[<ffffffffa03255c5>] cl_io_start+0x65/0x130 [obdclass]
[<ffffffffa0328a74>] cl_io_loop+0xb4/0x1b0 [obdclass]
[<ffffffffa0893be9>] ll_file_io_generic+0x3b9/0x860 [lustre]
[<ffffffffa08942f6>] ll_file_aio_write+0xd6/0x1e0 [lustre]
[<ffffffffa089451c>] ll_file_write+0x11c/0x230 [lustre]
[<ffffffff810dfb9d>] vfs_write+0xac/0xf3
[<ffffffff810dfd8c>] sys_write+0x4a/0x6e
[<ffffffff81002a6b>] system_call_fastpath+0x16/0x1b |
| Comment by Jinshan Xiong (Inactive) [ 29/May/14 ] |
|
Then let's use patch set 5 for 2.6 |
| Comment by Jodi Levi (Inactive) [ 02/Jun/14 ] |
| Comment by Jodi Levi (Inactive) [ 12/Jun/14 ] |
|
Patch landed to Master. Please reopen ticket if more work is needed. |
| Comment by Jian Yu [ 07/Nov/14 ] |
|
Here is the back-ported patch for Lustre b2_5 branch: http://review.whamcloud.com/12615 |
| Comment by Christopher Morrone [ 07/Nov/14 ] |
|
Folks, can someone explain to me the rationale behind these changes in change 10003?
Was not the entire point of the unstable page tracking patches to use the NFS unstable page tracking? The idea was that it gives the Linux kernel a back-pressure mechanism when memory is low. max_cached_mb has absolutely no regard for the actual memory usage of a running system; it only works in its own little Lustre world. And why would it be correct to make unstable pages part of the page LRU? Those pages cannot be freed. They are still in the limbo state where the server has acknowledged them, but not necessarily yet committed them to disk. When memory needs to be freed, we don't send them again, so there is no reason to walk them as part of the LRU. I am rather concerned that we have thrown away correctness in pursuit of performance. |
| Comment by Jinshan Xiong (Inactive) [ 09/Nov/14 ] |
Yes, I agree that it's more reasonable to take UNSTABLE pages out of the LRU. However, to do that, we would need to find the corresponding cl_page in the brw_commit() callback, and that would require an atomic operation. Actually the current mechanism is not that bad, because usually only a small fraction of pages is in the UNSTABLE state, and those pages should stay at the tail of the LRU list; therefore, they shouldn't be scanned at all. One of my concerns with using UNSTABLE pages is that it will make the OST commit transactions frequently. Therefore a problematic client will send lots of SOFT_SYNC RPCs to the OSTs it talks to, which will degrade system performance significantly. Anyway, this patch didn't remove the UNSTABLE page tracking feature. Users can still enable it if performance is not a problem for them. |
| Comment by Christopher Morrone [ 10/Nov/14 ] |
|
Did you, or did you not, do the following?
That was a critical part of the design. Can you explain how you fixed the memory management issues without that part of the design? |
| Comment by Jinshan Xiong (Inactive) [ 10/Nov/14 ] |
|
I didn't remove that code but just made it optional; there is an lprocfs entry to control NFS unstable page tracking, and it's off by default. |
| Comment by Christopher Morrone [ 10/Nov/14 ] |
|
Wow. In that case, that was a really terrible commit message. And probably too many things done in a single commit. So what about these two things:
So item 2 only happens if unstable change tracking is enabled? Or all the time? And 3, is that only removed when unstable change tracking is disabled? Or is that removed when it is enabled as well? |
| Comment by Christopher Morrone [ 10/Nov/14 ] |
|
Also, I might wonder if the default for the unstable page tracking setting is backwards. One offers correctness under fairly typical HPC workloads, and the other sacrifices correctness for speed. Shouldn't correctness be the default, and the speed-but-pathologically-bad-in-low-memory-situations behavior be optional? |
| Comment by Jinshan Xiong (Inactive) [ 11/Nov/14 ] |
Only when unstable pages tracking is enabled.
It's removed.
Unstable pages tracking should be turned off on I/O nodes with plenty of memory installed. I don't know what you mean by correctness. Actually, in the current implementation, it doesn't get any feedback about system memory pressure; it sends SOFT_SYNC only when it's low on available LRU slots. |
| Comment by Christopher Morrone [ 11/Nov/14 ] |
"Unstable pages tracking should be turned off on I/O nodes with plenty of memory installed." That statement is puzzling. How much memory is "plenty"? Our I/O nodes have 64GiB of RAM, which I would have thought would be considered "plenty". But it also kind of misses the point. In the real world, it doesn't matter how much memory is installed on the node. The people who designed the system probably intended the memory to actually be used, not just sit idle all the time because Lustre has no sane memory management. On an "I/O" node, that memory might need to be shared by function-shipping buffers, system debuggers, system management tools, and other filesystem software. On normal HPC compute nodes the memory is going to be under contention with actual user applications, other filesystems, etc. My point is that memory contention is a normal situation in the real world. It is not a corner case. If we treat it as a corner case, we'll be putting out a subpar product. |
| Comment by Jinshan Xiong (Inactive) [ 11/Nov/14 ] |
|
By plenty, I meant there is plenty of available memory. The memory can be temporarily `lost' when the write has completed but the transaction is not committed. Therefore, if the client has the extra memory available to hold UNSTABLE pages between two transactions, it should be able to maintain the highest sustainable write speed. Unfortunately, the exact amount of extra memory needed depends highly on the performance and configuration of the OST. |
| Comment by Gerrit Updater [ 01/Dec/14 ] |
|
N/A |
| Comment by Christopher Morrone [ 04/Jun/15 ] |
Yes, but how much memory is "plenty"? In the real world, memory is a finite resource. We can not program with the assumption that there is always free memory. Lustre must behave reasonably by default when memory is under contention.
No, it does not. Client memory is far faster than disk on a remote OST over the network, pretty much by definition. Client memory under normal use cases is also under contention from actual applications, which are not represented by the naive tests that were used to create the graphs in this ticket. In the real world, people buy client memory sized to fit their application. No one has the budget to buy double or triple the amount of RAM for all their clients just to leave Lustre more buffer space. Memory contention is normal in the real world, and Lustre's defaults should be selected so that it functions reasonably under real-world usage. |
| Comment by Jinshan Xiong (Inactive) [ 06/Jun/15 ] |
I didn't mean the memory is for the Lustre client to use as buffer space. For write RPCs, clients have to hold the written pages until the OST commits the corresponding transaction. Therefore, clients have to have extra memory to pin those pages, and applications cannot use it. For ZFS, the typical txg timeout is 5 seconds, which means clients will pin 5 seconds' worth of written data in memory; depending on the write throughput on the client side, this can be a lot (at the ~1.2 GB/s measured earlier in this ticket, roughly 6 GB on an 8 GB client). There is really nothing we can do on the client side. Probably we can do some tuning for ZFS. The I/O generated by Lustre is different from the I/O of a generic workload, so we may look into the txg timeout or restrict the memory used for the write cache. |
| Comment by Christopher Morrone [ 08/Jun/15 ] |
|
This is not a problem that can possibly be fixed on the server side. Memory is fundamentally much faster than disk. We can easily exhaust memory on the client side, and there is no possible way that anyone can afford a system that is so fast that both the network and the servers can always keep up. The 5 second txg timeout in ZFS is the typical txg maximum. Under normal load, ZFS will sync to disk much faster. It only waits the full 5 seconds when load is light. Under typical HPC loads, the speed at which servers can land data is bound by the disk speed, not by the txg limit. And so the only reasonable solution is for the client to be well behaved under the very normal real-world situation where data is generated faster than it can be written to disk. Part of dealing with that situation is tracking "unstable" pages. Tracking those pages allows the Linux kernel to pause IO under low-memory situations so that memory contention does not go into pathological modes. |
| Comment by Andreas Dilger [ 08/Jul/15 ] |
|
Jinshan, we need to revisit the changes in patch http://review.whamcloud.com/10003 and the original patches to see what can be done to avoid excessive memory throttling when the client doesn't have memory pressure (e.g. interactive nodes), but does have IO throttling on clients when they are short of memory. |
| Comment by Jinshan Xiong (Inactive) [ 09/Jul/15 ] |
|
There are a few things to look at for this topic:
2. Investigate the efficiency of SOFT_SYNC. The idea of SOFT_SYNC is good, but probably due to the policy, it caused the problem of saturating OSTs. This is why patch 10003 was introduced, to disable unstable page tracking at the user's discretion;
3. Memory cache and readahead buffer. There is no way for the readahead code to know the current state of memory pressure. This causes the problem that useful pages are evicted by readahead, or readahead pages themselves are evicted by new readahead pages. We need a feedback mechanism to throttle the readahead window size when memory is under pressure. |
| Comment by Christopher Morrone [ 09/Jul/15 ] |
Yes, that is a serious memory management bug. We hit it at LLNL quite recently. Please open a separate ticket on that bug. |
| Comment by Jinshan Xiong (Inactive) [ 13/Jul/15 ] |
|
new ticket filed at |
| Comment by Jinshan Xiong (Inactive) [ 08/Feb/18 ] |