[LU-4841] Performance regression in master for threads more than 2 Created: 31/Mar/14  Updated: 08/Feb/18  Resolved: 08/Feb/18

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Blocker
Reporter: Dmitry Eremin (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Duplicate Votes: 0
Labels: llnl

Attachments: PNG File lu-4841-1.png     PNG File lu-4841-5.png     PNG File lu-4841.png     PNG File perf_2.5.png     PNG File perf_master.png    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 13338

 Description   

After commit 586e95a5b3f7b9525d78e7efc9f2949387fc9d54 we have a significant performance degradation in master. The following pictures show the regression measured on Xeon Phi:



 Comments   
Comment by Oleg Drokin [ 01/Apr/14 ]

This is a very serious performance drop-off that we cannot afford to have in the 2.6 release.
We really need to fix the bottleneck introduced by the unstable-pages tracking patch or find some other solution.
Hopefully we can do this without gutting the unstable-pages tracking feature at the same time.

Comment by Christopher Morrone [ 01/Apr/14 ]

The unstable-pages tracking patch is not a "feature", it is the only thing that makes the client at all usable on our production workloads. We can draw you a performance graph with a line at zero if that helps.

I can't emphasize enough how pathologically bad the client's behavior was without the unstable-tracking patch. It is not acceptable to release 2.6 with that problem either.

I am completely open to a solution that avoids the pathologically bad behavior and is also reasonably fast.

Comment by Prakash Surya (Inactive) [ 01/Apr/14 ]

Can we get a more complete report of the issue at hand? For example:

 * What do the client node(s) look like?
   * x86_64? Size of RAM? Number of CPU cores? NIC bandwidth?
 * What do the server node(s) look like?
   * x86_64? etc.
 * What's the lustre version of the servers?
 * What's the filesystem of the servers?
 * Any pertinent configuration details?
 * Why do you feel that specific commit is to blame?
   * What do the {clean,dirty,unstable} page statistics look like during the test?

There's so much more information needed to make any kind of accurate judgement here.

It's really frustrating when a patch that we NEED in order to run on Sequoia continually runs the gauntlet without any reasonable discussion or explanation. Depending on the setup, this patch is expected to make performance worse, but that's to prevent an OOM and crash of the client. I would really like to work up a solution that works for both us and you guys, but it just takes way too much effort to do that with these vague reports, rash decisions, uncooperative review system, and overall lack of communication from the upstream developers.

Throwing up a graph of a line with a few labels is not a proper bug report. I honestly can't make any reasonable conclusion from the information provided. I really wish the unstable page patches could have been properly discussed and designed back when I was actively engaged with them a year and a half ago.

I apologize for the rant, and I'm not calling anybody out specifically, but being a downstream Lustre developer with real systems to maintain can be a very frustrating experience.

Comment by Oleg Drokin [ 01/Apr/14 ]

The Xeon Phi node has 240 HT cores (only 60 or so real ones) and 8 GB of RAM.

Certainly when node memory runs low, performance is expected to take a hit; whether that is the case here I'd defer to Dmitry.
I assumed (possibly incorrectly) that the performance drop-off happened due to the extra contention added by the page accounting, and that it would also be observable on other systems (which I do not possess, so that theory is probably pending verification too).
If it's a case of RAM starvation, that would be a totally different story of course.

Comment by Prakash Surya (Inactive) [ 01/Apr/14 ]

Judging solely by the number of cores and the amount of RAM on the Xeon Phi, I would not be surprised if it's hitting the exact case we hit on Sequoia (with 64 HT cores and 16G of RAM). It should be fairly easy to spot-check this just by watching the unstable page accounting stats in the llite proc handler and looking at the client stacks for the process doing the IO. IIRC, the thread will sit on a wait queue if it hits Lustre's limit (it should be easy to determine if it's the unstable waitq by looking at the stack trace), or it might be temporarily halted by the kernel (which should also be easy to spot looking at the thread's backtrace).
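
For reference, a hedged example of that spot check (the parameter names match the ones queried later in this ticket):

lctl get_param llite.*.unstable_stats
lctl get_param osc.*.unstable_stats
grep NFS_Unstable /proc/meminfo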

With that said, it might be educational to run the "fast" test against a ZFS backend (I'm assuming ldiskfs was used to make the report). With the default 5s sync time of ZFS, writing at 1GB/s with 8GB of RAM, you might just hit the exact same OOM issues we were seeing on Sequoia. Maybe even bump the ZFS sync timeout to 10s and see if the client falls over due to running out of memory.
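
(As a rough back-of-the-envelope illustration using those numbers: pinned memory ≈ write rate × commit interval, so 1 GB/s × 5 s ≈ 5 GB of unreclaimable pages on an 8 GB client, and ≈ 10 GB with a 10 s timeout, i.e. more than the node has.)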

Comment by Oleg Drokin [ 01/Apr/14 ]

We'll see if Dmitry can do that. If it's just a case of early throttling, though, we'd need to fiddle with the logic here so that things are not stalled unless really necessary, since the test did not OOM in the past either.

As for how this patch was singled out - Dmitry has actually run git bisect to find it.

Comment by Prakash Surya (Inactive) [ 01/Apr/14 ]

Sure, I'm all for making things better, whether that's changing the logic or simply finding better-suited default values. I just want to make sure any changes are well thought out, and we're not jumping to wrong conclusions. I don't think I did much testing of those patches against ldiskfs, nor did I spend much time looking for "optimal" default values, so there's probably a lot of room for improvement on those fronts.

Were the server(s) running with the server-side portion of that patch set during testing? If the server is ignoring the soft-sync requests, it's definitely much more likely for a client to exhaust its unstable pages and stall (at least with a ZFS backend; I'm not too familiar with ldiskfs's syncing behavior).

Comment by Jinshan Xiong (Inactive) [ 01/Apr/14 ]

Here is a patch to optionally disable unstable pages tracking: http://review.whamcloud.com/8524

For whoever is suffering from the performance loss, please apply the patch while we're working on this.

From my point of view, the problem with the current unstable page tracking algorithm is that it limits writing to a really small working set. By default, each OSC can cache up to 32 MB of dirty pages, which is 32 RPCs if the RPC size is 1 MB. Therefore, a target can have at most 32 transactions in memory for each client. Since max_rpcs_in_flight is 8 by default, a client has to wait for a commit after 4 rounds of RPCs (32 MB / 8 RPCs in flight).

One way to solve this problem is to fold unstable pages accounting into the cache control algorithm, instead of the dirty pages accounting we are using now. The cache memory for Lustre is usually much larger than 32 MB, which will give us a much larger working set for writing. We can verify this by setting osc.*.max_dirty_mb to 4096, for example.

In the implementation, we can add cl_lru_busy for the number of pages in transfer; the cl_lru_busy count is decremented after the write RPCs are committed. We can then add cl_lru_list and cl_lru_busy together to check whether the cache memory limit is exceeded.
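
A minimal sketch of the check this proposal implies (not actual Lustre code; the names below are illustrative stand-ins for the cl_lru_list count, cl_lru_busy, and the page budget derived from max_cached_mb):

#include <stdbool.h>

struct cache_counters {
    long lru_clean; /* clean cached pages (the cl_lru_list count) */
    long lru_busy;  /* pages pinned until their write RPC commits (cl_lru_busy) */
    long cache_max; /* total page budget derived from max_cached_mb */
};

/* Unstable (uncommitted) pages are charged against the same budget as clean
 * cached pages, so writers throttle on the cache limit rather than on the
 * much smaller per-OSC max_dirty_mb. */
static bool cache_over_limit(const struct cache_counters *c)
{
    return c->lru_clean + c->lru_busy > c->cache_max;
}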

Comment by Prakash Surya (Inactive) [ 02/Apr/14 ]

Hmm.. I'll have to revisit the implementation details to decide if I agree with that or not. But regardless, keep in mind that the reason the limit is there in the first place is to throttle these "pinned" pages. So if we simply raise the limit, how is that any different from removing the limit altogether?

Perhaps this issue really boils down to the fact that we have this limit on a per-OSC basis when we really should have the limit on a global scale (per client, of course). For a system using a "large" number of OSCs, 32 MB each is a lot of data to have unreclaimable. For a real example, on Sequoia we have 768 OSTs, so that means a max of 24 GB of pinned data, which is already more than the total amount of RAM on the system (16 GB). For a "small" OST count, raising the per-OSC dirty size limit might make sense, but for a "large" OSC count it effectively removes the limit altogether, putting us right back into the OOM scenario.

IIRC, I tried to go down the route of having a global limit on the unstable pages (or maybe it was per filesystem, I can't remember), but that was shot down in favor of the per-OSC limit. We could even drop the Lustre limit altogether and rely solely on the kernel to stall IO once it hits a certain threshold; I believe it will already do this if NFS_Unstable is used for the accounting. But if we do that, we won't be sending the soft-sync requests to the server, which are essential to maintaining any kind of reasonable performance.

Now that you mention it, the OSC count of the job would definitely be a factor here. How many OSTs/OSCs were being used by the original test? I still have no idea what the client workload was for the original test...

Comment by Dmitry Eremin (Inactive) [ 02/Apr/14 ]

What do the client node(s) look like? x86_64? Size of RAM? Number of CPU cores? NIC bandwidth?

The client was a Xeon Phi card with 61 cores (244 hardware threads) and 8 GB of memory. Mellanox FDR from the host was used. LNet selftest shows 2476 RPC/s and 1232.55 MB/s.

What do the server node(s) look like? x86_64? etc.

Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz, 16 cores, 64 GB memory.

What's the Lustre version of the servers?

build: jenkins-arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel-22-g3f0084f-PRISTINE-2.6.32-431.5.1.el6_lustre.g3f0084f.x86_64

What's the filesystem of the servers?

ldiskfs

Any pertinent configuration details?

ost{0,1,2,3}, ost{4,5,6,7}, ost{8,9,10,11}, ost{12,13,14,15}, ost{16}

0 UP osd-ldiskfs MGS-osd MGS-osd_UUID 5
1 UP mgs MGS MGS 49
2 UP mgc MGC192.168.3.1@o2ib b31c90c0-aedf-126a-fa6b-eeb6c8da0f4f 5
3 UP osd-ldiskfs lustrefs-MDT0000-osd lustrefs-MDT0000-osd_UUID 24
4 UP mgc MGC192.168.5.1@o2ib2 e5501e77-63d0-f8c0-237b-80688c90c139 5
5 UP mds MDS MDS_uuid 3
6 UP lod lustrefs-MDT0000-mdtlov lustrefs-MDT0000-mdtlov_UUID 4
7 UP mdt lustrefs-MDT0000 lustrefs-MDT0000_UUID 71
8 UP mdd lustrefs-MDD0000 lustrefs-MDD0000_UUID 4
9 UP qmt lustrefs-QMT0000 lustrefs-QMT0000_UUID 4
10 UP osp lustrefs-OST0000-osc-MDT0000 lustrefs-MDT0000-mdtlov_UUID 5
11 UP osp lustrefs-OST0001-osc-MDT0000 lustrefs-MDT0000-mdtlov_UUID 5
12 UP osp lustrefs-OST0002-osc-MDT0000 lustrefs-MDT0000-mdtlov_UUID 5
13 UP osp lustrefs-OST0003-osc-MDT0000 lustrefs-MDT0000-mdtlov_UUID 5
14 UP osp lustrefs-OST0004-osc-MDT0000 lustrefs-MDT0000-mdtlov_UUID 5
15 UP osp lustrefs-OST0005-osc-MDT0000 lustrefs-MDT0000-mdtlov_UUID 5
16 UP osp lustrefs-OST0006-osc-MDT0000 lustrefs-MDT0000-mdtlov_UUID 5
17 UP osp lustrefs-OST0007-osc-MDT0000 lustrefs-MDT0000-mdtlov_UUID 5
18 UP osp lustrefs-OST0008-osc-MDT0000 lustrefs-MDT0000-mdtlov_UUID 5
19 UP osp lustrefs-OST0009-osc-MDT0000 lustrefs-MDT0000-mdtlov_UUID 5
20 UP osp lustrefs-OST000a-osc-MDT0000 lustrefs-MDT0000-mdtlov_UUID 5
21 UP osp lustrefs-OST000b-osc-MDT0000 lustrefs-MDT0000-mdtlov_UUID 5
22 UP osp lustrefs-OST000c-osc-MDT0000 lustrefs-MDT0000-mdtlov_UUID 5
23 UP osp lustrefs-OST000d-osc-MDT0000 lustrefs-MDT0000-mdtlov_UUID 5
24 UP osp lustrefs-OST000e-osc-MDT0000 lustrefs-MDT0000-mdtlov_UUID 5
25 UP osp lustrefs-OST000f-osc-MDT0000 lustrefs-MDT0000-mdtlov_UUID 5
26 UP osp lustrefs-OST0010-osc-MDT0000 lustrefs-MDT0000-mdtlov_UUID 5
27 UP lwp lustrefs-MDT0000-lwp-MDT0000 lustrefs-MDT0000-lwp-MDT0000_UUID 5

Why do you feel that specific commit is to blame?

I used git bisect to find the exact commit which introduced this degradation.

What do the {clean,dirty,unstable} page statistics look like during the test?

I didn't get these statistics. But I observed that the application from time to time goes into a sleep state for a while instead of "R" or "D" in top. This was measured with "iozone -i 0 -i 1 -M -C -r 1m -s 16g -t $i".

Comment by Jinshan Xiong (Inactive) [ 02/Apr/14 ]

Hi Dmitry,

Can you please try increasing the dirty pages allowance to see if it helps? You can run the following command:

lctl set_param osc.*.max_dirty_mb=4096

on the client nodes to set the max dirty size to 4 GB, for example.

Jinshan

Comment by Prakash Surya (Inactive) [ 02/Apr/14 ]

Dmitry, I really appreciate the added information. Thanks.

So it looks like the workload has at most 17 OSTs to use, which is only about 500 MB of dirty space available (17 × 32 MB ≈ 544 MB). It sounds like it's working as designed, but designed poorly. I agree with Jinshan: increase the per-OSC dirty limit and see if things get better.

It's really starting to sound like the root of the issue is the per-OSC dirty limit. I think it'd be much more desirable to have a global limit of dirty space, say 20% of RAM, that all OSCs can use. Then a single OSC can use all 20% of RAM if needed, or that allowance can be split up between multiple OSCs, with the decision driven completely by the workload of the client.

Comment by Oleg Drokin [ 03/Apr/14 ]

I agree the dirty limit should be global.
I don't think we should base it on the total amount of available RAM at all (other than some absolute high watermark of, say, 50% or whatever, perhaps configurable for special cases). This is because there are user applications, and those tend to need RAM as well. It's a bad idea to make user apps fight with such stealthy RAM usage, and I suspect it could easily lead to OOMs as well. So ideally we'd also find a way to make Lustre take real free memory into account.

Comment by Dmitry Eremin (Inactive) [ 04/Apr/14 ]

I cannot increase the max dirty size beyond 1924 MB on the Xeon Phi.

osc.lustrefs-OST000f-osc-ffff8801e02cac00.max_dirty_mb=1925
error: set_param: setting /proc/fs/lustre/osc/lustrefs-OST000f-osc-ffff8801e02cac00/max_dirty_mb=1925: Numerical result out of range

The statistics during the write test with max_dirty_mb=1924 are as follows:

NFS_Unstable:    1157120 kB
llite.lustrefs-ffff8801e02cac00.unstable_stats=
unstable_pages:   289280
unstable_mb:        1130

The throughput for 8 initial writers is still capped at 48957.79 KB/sec.

The full /proc/meminfo is as follows:

MemTotal:        7882352 kB
MemFree:         2060024 kB
Buffers:               0 kB
Cached:          3860324 kB
SwapCached:            0 kB
Active:           150276 kB
Inactive:        3733736 kB
Active(anon):     150192 kB
Inactive(anon):    91688 kB
Active(file):         84 kB
Inactive(file):  3642048 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:             11964 kB
Writeback:             0 kB
AnonPages:         39676 kB
Mapped:             4312 kB
Shmem:            218880 kB
Slab:            1202232 kB
SReclaimable:      45752 kB
SUnreclaim:      1156480 kB
KernelStack:       13456 kB
PageTables:         1672 kB
NFS_Unstable:    1128536 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     3941176 kB
Committed_AS:     604580 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      347656 kB
VmallocChunk:   34359090472 kB
AnonHugePages:     34816 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       10236 kB
DirectMap2M:     8097792 kB

Comment by Dmitry Eremin (Inactive) [ 04/Apr/14 ]

# lctl get_param llite.*.unstable_stats
llite.lustrefs-ffff8801e1440800.unstable_stats=
unstable_pages:   302592
unstable_mb:        1182

# lctl get_param osc.*.unstable_stats
osc.lustrefs-OST0000-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:        0
unstable_mb:           0
osc.lustrefs-OST0001-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:        0
unstable_mb:           0
osc.lustrefs-OST0002-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:    38144
unstable_mb:         149
osc.lustrefs-OST0003-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:    36864
unstable_mb:         144
osc.lustrefs-OST0004-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:        0
unstable_mb:           0
osc.lustrefs-OST0005-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:        0
unstable_mb:           0
osc.lustrefs-OST0006-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:    34304
unstable_mb:         134
osc.lustrefs-OST0007-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:    32768
unstable_mb:         128
osc.lustrefs-OST0008-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:        0
unstable_mb:           0
osc.lustrefs-OST0009-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:    34560
unstable_mb:         135
osc.lustrefs-OST000a-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:    38144
unstable_mb:         149
osc.lustrefs-OST000b-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:    36096
unstable_mb:         141
osc.lustrefs-OST000c-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:        0
unstable_mb:           0
osc.lustrefs-OST000d-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:        0
unstable_mb:           0
osc.lustrefs-OST000e-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:        0
unstable_mb:           0
osc.lustrefs-OST000f-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:    36608
unstable_mb:         143
# lctl set_param osc.*.max_dirty_mb=1924

# lctl get_param llite.*.unstable_stats
llite.lustrefs-ffff8801e1440800.unstable_stats=
unstable_pages:   286720
unstable_mb:        1120

# lctl get_param osc.*.unstable_stats
osc.lustrefs-OST0000-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:        0
unstable_mb:           0
osc.lustrefs-OST0001-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:        0
unstable_mb:           0
osc.lustrefs-OST0002-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:    36864
unstable_mb:         144
osc.lustrefs-OST0003-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:    34816
unstable_mb:         136
osc.lustrefs-OST0004-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:        0
unstable_mb:           0
osc.lustrefs-OST0005-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:        0
unstable_mb:           0
osc.lustrefs-OST0006-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:    38144
unstable_mb:         149
osc.lustrefs-OST0007-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:    36352
unstable_mb:         142
osc.lustrefs-OST0008-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:        0
unstable_mb:           0
osc.lustrefs-OST0009-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:    31488
unstable_mb:         123
osc.lustrefs-OST000a-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:    34304
unstable_mb:         134
osc.lustrefs-OST000b-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:    37888
unstable_mb:         148
osc.lustrefs-OST000c-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:        0
unstable_mb:           0
osc.lustrefs-OST000d-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:        0
unstable_mb:           0
osc.lustrefs-OST000e-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:        0
unstable_mb:           0
osc.lustrefs-OST000f-osc-ffff8801e1440800.unstable_stats=
unstable_check:        1
unstable_pages:    36864
unstable_mb:         144

Comment by Dmitry Eremin (Inactive) [ 04/Apr/14 ]

  1. lctl set_param osc.*.unstable_stats=0
    • throughput for 8 initial writers = 710342.12 KB/sec
  2. lctl set_param osc.*.unstable_stats=1
    • throughput for 8 initial writers = 47701.20 KB/sec
  3. lctl set_param osc.*.max_dirty_mb=1924
    • throughput for 8 initial writers = 46364.41 KB/sec
Comment by Prakash Surya (Inactive) [ 05/Apr/14 ]

I'd need to verify with the code, but I think even with lctl set_param osc.*.unstable_stats=0 we still increment the NFS_Unstable field, and the kernel interprets that counter as "dirty" space. So even though Lustre's unstable "waiting" will be disabled, the kernel might still slow us down. Again, I'm going off of memory here, so I might be mistaken.

With that said, having just over 1/8th of total RAM unreclaimable due to Lustre's network IO traffic seems high. As Oleg said, there are going to be real applications running on the client which could (and probably will) consume a lot of RAM as well. I'm not sure how to properly address this off hand.

We're pinning these pages in the client cache for recovery purposes, but I wonder if that's strictly necessary. Obviously it wouldn't be "safe" in the face of a server crash, but for the common case, I don't think anybody would notice. Perhaps a mount flag stating pages can be dropped once they hit the server cache; evictions due to server failures would become more common, but it could potentially "help" the client and allow greater performance. Idk... Just brainstorming here...

Comment by Christopher Morrone [ 05/Apr/14 ]

Obviously it wouldn't be "safe" in the face of a server crash, but for the common case, I don't think anybody would notice.

Actually, that part is pretty essential to Lustre's recovery mechanism. Unfortunately filesystem users are not satisfied with "we won't corrupt your data most of the time"; they insist that we avoid corruption and data loss by design. Our users would definitely catch on eventually and raise a ruckus if we took that part out of Lustre.

Comment by Prakash Surya (Inactive) [ 05/Apr/14 ]

I don't think anybody should be satisfied with "we won't corrupt your data most of the time"; I definitely would not be. Although, would it be OK to return an IO error? What happens now when a client does a write, gets back a successful completion, the server fails before the write is committed, and then the client is evicted by the server and can't replay that outstanding write? I was envisioning a similar case, except the client would, in a sense, evict itself (i.e. it can't perform the replay because it dropped those pages).

I think I need to better understand exactly what the bottleneck is here and why.

Dmitry, did you say the servers were running with the soft-sync server-side portion of that patch set? I can't recall, and I didn't see you mention it when quickly flipping back through the comments.

Comment by Dmitry Eremin (Inactive) [ 09/Apr/14 ]

No, servers were just regular 2.5.57 and 2.4.3 as I remember. I didn't patch them.

Comment by Jinshan Xiong (Inactive) [ 17/Apr/14 ]

I'm working on this issue.

Comment by Andreas Dilger [ 17/Apr/14 ]

Prakash, I believe the servers are running the soft-sync code, since that has landed on master.

Dmitry, in your latest test results:

lctl set_param osc.*.unstable_stats=0
throughput for 8 initial writers = 710342.12 KB/sec

According to the graphs, 700 MB/s is where the performance was for 2.5 with 8 writers. It wasn't clear from your comment whether you thought this means the patch is doing its job correctly to disable the unstable page throttling, or whether you thought this was a regression from the 2.5 performance.

Comment by Dmitry Eremin (Inactive) [ 17/Apr/14 ]

There were two versions of servers. One was the latest master, the second was 2.4.3. The behavior was the same.

To be clear, with Jinshan's patch and unstable page tracking disabled via

lctl set_param osc.*.unstable_stats=0

the performance becomes the same as in the 2.5 branch. So the patch fixes the performance regression, and unstable page throttling can still be enabled if needed.

Comment by Andreas Dilger [ 17/Apr/14 ]

Assigning to Jinshan, since he reported he is working on this issue.

Comment by Jinshan Xiong (Inactive) [ 17/Apr/14 ]

After taking a further look, it turned out that the attempt to tally the NR_UNSTABLE_NFS count kills performance. In osc_[inc|dec]_unstable_pages, it does:

        for (i = 0; i < page_count; i++)
                [inc|dec]_zone_page_state(desc->bd_iov[i].kiov_page, NR_UNSTABLE_NFS);

Therefore, to handle a normal 256-page RPC, it has to perform 512 atomic operations (one increment per page when the RPC is sent and one decrement per page when it commits). After I commented that piece of code out, I couldn't see any noticeable performance change.

There is another problem with the current implementation of unstable page accounting - it decides whether to send the SOFT_SYNC flag in osc_over_unstable_soft_limit(), where it checks how many dirty pages exist:

        obd_upages = atomic_read(&obd_unstable_pages);
        obd_dpages = atomic_read(&obd_dirty_pages);

        osc_upages = atomic_read(&cli->cl_unstable_count);

        return osc_upages != 0 &&
               obd_upages >= (obd_max_dirty_pages - obd_dpages) / 2; 

In the default configuration, obd_max_dirty_pages is set to 1/2 of total RAM and the per-client dirty limit is set to 32 MB. This causes two problems:
1. the SOFT_SYNC flag will be sent only when unstable_pages reaches almost 1/2 of total RAM;
2. the cache control mechanism is not working, because Lustre is actually using max_cached_mb plus the MB of unstable_pages in total.

Therefore, I'd like to control unstable pages in the cache mechanism. Prakash, what do you think?

Comment by Jinshan Xiong (Inactive) [ 17/Apr/14 ]

The patch to disable kernel unstable page accounting is at: http://review.whamcloud.com/10003

Comment by Jinshan Xiong (Inactive) [ 18/Apr/14 ]

Patch 10003 is revised to implement my idea above. Please take a look.

Dmitry, please benchmark this patch. Thanks.

Comment by Prakash Surya (Inactive) [ 21/Apr/14 ]

After taking a further look, it turned out that the attempt to tally the NR_UNSTABLE_NFS count kills performance. In osc_[inc|dec]_unstable_pages, it does:

for (i = 0; i < page_count; i++)
[inc|dec]_zone_page_state(desc->bd_iov[i].kiov_page, NR_UNSTABLE_NFS);

Therefore, to handle a normal 256-page RPC, it has to perform 512 atomic operations. After I commented that piece of code out, I couldn't see any noticeable performance change.

So, all that is needed to get the performance back is to remove these kernel calls?

If that's the case, I don't completely agree with Jinshan's patch as it stands now (revision 2).

First off, we should try to maintain the kernel's NR_UNSTABLE_NFS value as best we can. That's there for a reason, and it will allow the kernel to throttle (i.e. halt) IO if necessary. In general, we need to make an effort to work with the kernel and use the interfaces it exposes to us. If we ignore NR_UNSTABLE_NFS, then we run the risk of OOM-ing if the server decides not to sync the client's data. Think about a case where an OST's storage goes offline, but it's still accepting network traffic; the client can potentially fill all of its RAM with unstable pages, right? I don't see anything to stop this from happening, but maybe I'm overlooking something.

1. the SOFT_SYNC flag will be sent only when the unstable_pages reaches to almost 1/2 total RAM

Actually, SOFT_SYNC will be set at obd_max_dirty_pages / 2. So with obd_max_dirty_pages = 1/2 of RAM, SOFT_SYNC will be set at 1/4 of RAM. Why is this a problem? How does keying off of "max_cached_mb" make things any better?

2. cache control mechanism is not working because Lustre is actually using max_cache_mb + mb of unstable_pages in total.

What exactly is the point of the "cache control mechanism"? I think we need to clearly and explicitly document what its intended purpose is. I thought it was to limit clean pages on the client. Am I mistaken? Does it have a different purpose?

Honestly, I'm against relying on "max_cached_mb" more than is strictly necessary. If the Lustre client were more Linux friendly, none of that infrastructure would be needed at all; so, I'd rather not use it here.

Which brings me back to my original question: is the performance degradation completely attributable to my sloppy handling of the [inc|dec]_zone_page_state calls? I did most of my performance testing of this patch on PPC nodes with 64 KB pages, so the overhead would have been lower for me on those nodes.

Comment by Jinshan Xiong (Inactive) [ 21/Apr/14 ]

If we ignore NR_UNSTABLE_NFS, then we run the risk of OOM-ing if the server decides not to sync the client's data. Think about a case where an OST's storage goes offline, but it's still accepting network traffic; the client can potentially fill all of its RAM with unstable pages, right? I don't see anything to stop this from happening, but maybe I'm overlooking something.

Actually this won't happen. If the server refuses, for whatever reason, to flush transactions, the client will stop generating more cached pages and wait for free LRU slots. Actually, before this patch was introduced, it could indeed cause the OOM problem you mentioned - in the case where the number of unstable pages plus cached pages is close to total RAM.

The reason for having max_cached_mb is that pages cached by Lustre are considered overhead. In other words, as long as Lustre keeps working properly, customers usually want Lustre to cache as few pages as possible so that applications can use as much memory as possible.

Honestly, I'm against relying on "max_cached_mb" more than is strictly necessary. If the Lustre client were more Linux friendly, none of that infrastructure would be needed at all; so, I'd rather not use it here.

This was exactly what we thought before. We used to rely on the Linux kernel to release Lustre pages, which is exactly why we removed Lustre's cache control (i.e., the LRU) in early versions of Lustre 2.x. It turned out not to work, so I revived a cache control mechanism for 2.x. The reason it didn't work is that the Linux kernel uses a per-zone page-releasing thread, so mostly a single thread is in charge of dropping clean pages. That works for the slow IO case, but Lustre can produce several gigabytes per second of IO, and that single thread was simply overwhelmed.

Which brings me back to my original question: is the performance degradation completely attributable to my sloppy handling of the [inc|dec]_zone_page_state calls?

Yes, this is what I have seen on the OpenSFS cluster. After I removed these lines, I couldn't see any noticeable performance regression. However, there were some problems with the network on the OpenSFS cluster and the QDR speed couldn't exceed 2 GB/s, so I didn't have a chance to try the patch in a faster environment.

Comment by Prakash Surya (Inactive) [ 22/Apr/14 ]

Actually this won't happen. If the server refuses, for whatever reason, to flush transactions, the client will stop generating more cached pages and wait for free LRU slots.

Is this because unstable pages also count against the "max_cached_mb" allowance?

Actually, before this patch was introduced, it could indeed cause the OOM problem you mentioned - in the case where the number of unstable pages plus cached pages is close to total RAM.

Well, the "cached pages" should be able to be dropped when the system gets low on memory, while only the unstable pages are pinned. Also, since we're using NR_UNSTABLE_NFS, the kernel will prevent us from OOM-ing. The NR_UNSTABLE_NFS value is counted as "dirty" pages and the kernel will halt IO in balance_dirty_pages() with a call trace like this:

    fio           S 00000fffae72633c     0 59338  59283 0x00000000
    Call Trace:
    [c0000003e0deed20] [c0000003e0deede0] 0xc0000003e0deede0 (unreliable)
    [c0000003e0deeef0] [c000000000008e10] .__switch_to+0xc4/0x100
    [c0000003e0deef80] [c00000000042b0e0] .schedule+0x858/0x9c0
    [c0000003e0def230] [c00000000042b7c8] .schedule_timeout+0x1f8/0x240
    [c0000003e0def310] [c00000000042a444] .io_schedule_timeout+0x54/0x98
    [c0000003e0def3a0] [c00000000009ddfc] .balance_dirty_pages+0x294/0x390
    [c0000003e0def520] [c000000000095a2c] .generic_file_buffered_write+0x268/0x354
    [c0000003e0def660] [c000000000096074] .__generic_file_aio_write+0x374/0x3d8
    [c0000003e0def760] [c000000000096150] .generic_file_aio_write+0x78/0xe8
    [c0000003e0def810] [8000000006a7062c] .vvp_io_write_start+0xfc/0x3e0 [lustre]
    [c0000003e0def8e0] [800000000249a81c] .cl_io_start+0xcc/0x220 [obdclass]
    [c0000003e0def980] [80000000024a2634] .cl_io_loop+0x194/0x2c0 [obdclass]
    [c0000003e0defa30] [80000000069ea208] .ll_file_io_generic+0x498/0x670 [lustre]
    [c0000003e0defb30] [80000000069ea864] .ll_file_aio_write+0x1d4/0x3a0 [lustre]
    [c0000003e0defc00] [80000000069eab80] .ll_file_write+0x150/0x320 [lustre]
    [c0000003e0defce0] [c0000000000d1e9c] .vfs_write+0xd0/0x1c4
    [c0000003e0defd80] [c0000000000d208c] .SyS_write+0x54/0x98
    [c0000003e0defe30] [c000000000000580] syscall_exit+0x0/0x2c

The reason for having max_cached_mb is that pages cached by Lustre are considered overhead. In other words, as long as Lustre keeps working properly, customers usually want Lustre to cache as few pages as possible so that applications can use as much memory as possible.

> Honestly, I'm against relying on "max_cached_mb" more than is strictly necessary. If the Lustre client were more Linux friendly, none of that infrastructure would be needed at all; so, I'd rather not use it here.

This was exactly what we thought before. We used to rely on the Linux kernel to release Lustre pages, which is exactly why we removed Lustre's cache control (i.e., the LRU) in early versions of Lustre 2.x. It turned out not to work, so I revived a cache control mechanism for 2.x. The reason it didn't work is that the Linux kernel uses a per-zone page-releasing thread, so mostly a single thread is in charge of dropping clean pages. That works for the slow IO case, but Lustre can produce several gigabytes per second of IO, and that single thread was simply overwhelmed.

OK, thanks for the clarification. I just wanted to make sure I understood the intention of max_cached_mb. I'm well aware that it's a necessary evil for the time being due to the high cost of Lustre client page reclamation.

Since we both seem to agree that in an ideal world max_cached_mb would not exist, I suggest we try to refrain from using it as much as possible. It'd be really nice to eventually get to a position where it can be removed, and we'll never get there if we start to rely on it more and more as the client is worked on.

> Which brings me back to my original question: is the performance degradation completely attributable to my sloppy handling of the [inc|dec]_zone_page_state calls?

Yes, this is what I have seen on the OpenSFS cluster. After I removed these lines, I couldn't see any noticeable performance regression. However, there were some problems with the network on the OpenSFS cluster and the QDR speed couldn't exceed 2 GB/s, so I didn't have a chance to try the patch in a faster environment.

If that's the case, can we just make the *_zone_page_state calls smarter instead of making larger (potentially disruptive) changes?

Maybe there's a way to group the per-page calls into a single (or a few) call(s) per RPC? I'm looking at the implementation of the *_zone_page_state calls, though, and that might be tougher than I thought... I'd like to keep that tracking around, but doing a local_irq_[save|restore] and a couple of atomic_long_add()'s per page doesn't seem ideal either. Maybe we can leverage mod_zone_page_state() instead? Something like:

/* These would need to be implemented */
int get_index_from_zone(struct zone *zone);
struct zone *get_zone_from_index(int index);

int zone_deltas[MAX_NR_ZONES] = {0};
int i;

for (i = 0; i < page_count; i++) {
    int zone_idx = get_index_from_zone(page_zone(desc->bd_iov[i].kiov_page));
    zone_deltas[zone_idx] += 1; /* NR_UNSTABLE_NFS counts pages, one per unstable page */
}

for (i = 0; i < MAX_NR_ZONES; i++) {
    struct zone *z = get_zone_from_index(i);
    mod_zone_page_state(z, NR_UNSTABLE_NFS, zone_deltas[i]);
}

Do you think something like that would be OK (or even possible)?

Comment by Jinshan Xiong (Inactive) [ 23/Apr/14 ]

It's worth a shot. I will try it out tomorrow. Anyway, we should track unstable pages in the LRU.

Comment by Jinshan Xiong (Inactive) [ 23/Apr/14 ]

Hi Prakash,

The reason I want unstable pages under control is that max_cached_mb should be strictly honored. It defines how much memory Lustre can use (right now) for data cache. Unstable pages may amount to several gigabytes, which I think is unacceptable in some cases.

Accounting unstable pages may help a little bit, though - the major problem with it is that if the dirty pages are over the allowance, the Linux kernel can only wait there to throttle writing; ideally it would provide and call a BDI-specific callback to flush the disk cache. That is a problem for Lustre because we don't get a chance to issue SOFT_SYNC, so the client will have to wait for the transaction to commit, which is typically ~10 seconds. Anyway, if it won't affect performance, I will agree to keep accounting unstable pages. I will give it a try tomorrow.

Comment by Jinshan Xiong (Inactive) [ 23/Apr/14 ]

Hi Prakash,

With the patch below, which can be applied on top of patch 10003, I see about a 5% performance regression. I did the test on an x86_64 machine with only one NUMA node, so mostly the pages belong to the same zone.

diff --git a/lustre/osc/osc_page.c b/lustre/osc/osc_page.c
index 0f43417..7657909 100644
--- a/lustre/osc/osc_page.c
+++ b/lustre/osc/osc_page.c
@@ -891,6 +891,36 @@ out:
        RETURN(rc);
 }
 
+#ifdef __KERNEL__
+static inline void unstable_page_accounting(struct ptlrpc_bulk_desc *desc, int fact)
+{
+       obd_count page_count = desc->bd_iov_count;
+       void *zone = NULL;
+       int count = 0;
+       int i;
+
+       for (i = 0; i < page_count; i++) {
+               void *pz = page_zone(desc->bd_iov[i].kiov_page);
+
+               if (pz == zone) {
+                       ++count;
+                       continue;
+               }
+
+               if (count > 0) {
+                       mod_zone_page_state(zone, NR_UNSTABLE_NFS, fact * count);
+                       count = 0;
+               }
+               zone = pz;
+               ++count;
+       }
+       if (count > 0)
+               mod_zone_page_state(zone, NR_UNSTABLE_NFS, fact * count);
+}
+#else
+#define unstable_page_accounting(desc, fact)
+#endif
+
 /* Performs "unstable" page accounting. This function balances the
  * increment operations performed in osc_inc_unstable_pages. It is
  * registered as the RPC request callback, and is executed when the
@@ -907,6 +937,7 @@ void osc_dec_unstable_pages(struct ptlrpc_request *req)
                return;
 
        LASSERT(page_count >= 0);
+       unstable_page_accounting(desc, -1);
 
        atomic_sub(page_count, &cli->cl_cache->ccc_unstable_nr);
        LASSERT(atomic_read(&cli->cl_cache->ccc_unstable_nr) >= 0);
@@ -933,6 +964,7 @@ void osc_inc_unstable_pages(struct ptlrpc_request *req)
                return;
 
        LASSERT(page_count >= 0);
+       unstable_page_accounting(desc, 1);
 
        LASSERT(atomic_read(&cli->cl_cache->ccc_unstable_nr) >= 0);
        atomic_add(page_count, &cli->cl_cache->ccc_unstable_nr);
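
(A note on the patch above: it coalesces consecutive pages that fall in the same memory zone, so on a single-NUMA-node machine the hundreds of per-page inc/dec_zone_page_state() calls collapse to roughly one mod_zone_page_state() call per RPC in each direction.)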

Comment by Jodi Levi (Inactive) [ 29/Apr/14 ]

http://review.whamcloud.com/#/c/10003/

Comment by Dmitry Eremin (Inactive) [ 16/May/14 ]


This is a comparison of the performance of 2.5.1-crt1 and the master branch with patch set #3. Writers are still capped.

Comment by Prakash Surya (Inactive) [ 20/May/14 ]

Dmitry, can you get stacks of the client threads and see if you're getting hung up in balance_dirty_pages()?
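
For reference, one way to capture those stacks (a hedged example assuming the iozone workload from the earlier comments, root access, and a kernel that exposes /proc/<pid>/stack):

for pid in $(pgrep iozone); do echo "=== $pid ==="; cat /proc/$pid/stack; done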

I set up a small test cluster of VMs (4 clients, 2 OSS, 1 MDS, Lustre 2.4.0-28chaos), and it's very clear that I'm getting throttled back by the kernel due to the NR_UNSTABLE_NFS accounting. The interesting bit is that if I take that accounting out of Lustre, I haven't seen any OOM events on the clients. I have a feeling the Lustre client just isn't set up to play nice with the Linux kernel, and we're going to have to make serious changes to get there; although that's just a hunch.

It might make sense to make the accounting of NR_UNSTABLE_NFS tunable, and potentially disabled by default. It works well on our Sequoia clients, but I think it needs some work to work well in general.

Comment by Andreas Dilger [ 22/May/14 ]

Prakash, we could go further and set the default to "on" for PPC nodes (or something more specific to BGL clients that can be determined at compile time), and "off" for non-PPC nodes. This should probably only be a module parameter, so that it cannot be changed at runtime to avoid complexity.

Comment by Christopher Morrone [ 22/May/14 ]

I don't like that solution. The main difference on the BG/Q client (vs the x86_64 nodes you have more experience with) is that there are many low power cores. We've already hit the transistor scaling wall, and all architectures are moving to a many-low-power-cores model. This is the harbinger of things to come, not a one-off issue that can be brushed under the carpet.

Setting the default based on architecture would be the wrong way to do it. Lower core-count and lower-load nodes (like our PPC login nodes) are going to behave very much like the x86_64 nodes, in all likelihood.

Lustre should make good design scaling decisions rather than forcing compile time options.

Comment by Andreas Dilger [ 23/May/14 ]

Chris, my comment was only in reply to Prakash's previous one where he was suggesting that the NR_UNSTABLE_NFS accounting should be off by default. I was proposing that this be improved so that you do not need to manually configure it for your BGL nodes.

I'm not sure that the number of cores is the primary issue, but rather the amount of memory on those IO nodes compared to how quickly the memory is dirtied. Jinshan's latest patch reduces the overhead with NR_UNSTABLE_NFS accounting enabled, which is good even for BGL, but AFAIK it still does not return performance to the same level as without this accounting.

Obviously I'd rather that this issue be resolved with a single solution that works optimally everywhere, but not at the cost of a serious performance regression on all the other systems Lustre runs on. I'd hope that we can find some better long-term solution for this, but the time has run out for major surgery in 2.6.

Comment by Christopher Morrone [ 23/May/14 ]

I'm not sure that the number of cores is the primary issue, but rather the amount of memory on those IO nodes compared to how quickly the memory is dirtied.

We probably have more memory for IO on the BGQ IONs than normal compute nodes, but yes, we fill it fast and are frequently churning through pages.

Our non-BG-related data mover nodes are x86_64, but will have pretty much the same issues. But I suppose we are probably saved by the fact that single client metadata change rates are terrible, so bulk IO throughput rarely comes under pressure.

Real compute nodes are highly constrained for IO memory because they have real applications running on them that want most of the memory for themselves.

My point is that constrained memory resources are actually the norm in the real world, not the exception.

But I can see that I'm fighting a losing battle. So OK, we'll concede to a custom tuning. Now how do you propose to ensure that Lustre continues to be tested and to function with that tuning enabled?

Comment by Christopher Morrone [ 23/May/14 ]

Actually, we'll never use Lustre 2.6. So if you want to take it out, you can just do that. But you will need to open a new ticket and commit to a more general fix in time for 2.7. Does that sound reasonable?

Comment by Jinshan Xiong (Inactive) [ 23/May/14 ]

I value this piece of work because it fixes a data corruption issue when a client is unmounted while it still has uncommitted RPCs. I'm going to work with our guys to do some performance tuning.

Comment by Dmitry Eremin (Inactive) [ 26/May/14 ]


This is a comparison of patch set 2 and patch set 3, along with the usual sleep stacks captured during the test of patch set 2.

Writers stack:

[<ffffffffa09cce44>] osc_enter_cache+0x48d/0x999 [osc]
[<ffffffffa09c6138>] osc_queue_async_io+0xa58/0x17b0 [osc]
[<ffffffffa09b10b7>] osc_page_cache_add+0x47/0x120 [osc]
[<ffffffffa09b82ea>] osc_io_commit_async+0x10a/0x3b0 [osc]
[<ffffffffa0450b17>] cl_io_commit_async+0x77/0x130 [obdclass]
[<ffffffffa082f581>] lov_io_commit_async+0x271/0x4c0 [lov]
[<ffffffffa0450b17>] cl_io_commit_async+0x77/0x130 [obdclass]
[<ffffffffa0905796>] vvp_io_write_commit+0xf6/0x7e0 [lustre]
[<ffffffffa0905fb6>] vvp_io_write_start+0x136/0x370 [lustre]
[<ffffffffa0451de5>] cl_io_start+0x65/0x130 [obdclass]
[<ffffffffa0455294>] cl_io_loop+0xb4/0x1b0 [obdclass]
[<ffffffffa08ab1d8>] ll_file_io_generic+0x238/0x6d0 [lustre]
[<ffffffffa08ab8f5>] ll_file_aio_write+0xe5/0x1f0 [lustre]
[<ffffffffa08abb30>] ll_file_write+0x130/0x240 [lustre]
[<ffffffff810dfb9d>] vfs_write+0xac/0xf3
[<ffffffff810dfd8c>] sys_write+0x4a/0x6e
[<ffffffff81002a6b>] system_call_fastpath+0x16/0x1b

Readers stack:

[<ffffffff8109f851>] sync_page+0x4d/0x51
[<ffffffff810a012f>] sync_page_killable+0xe/0x3b
[<ffffffff8109f788>] __lock_page_killable+0x66/0x68
[<ffffffff810a1c2f>] generic_file_aio_read+0x657/0x864
[<ffffffffa0904066>] vvp_io_read_start+0x2f6/0x460 [lustre]
[<ffffffffa0451de5>] cl_io_start+0x65/0x130 [obdclass]
[<ffffffffa0455294>] cl_io_loop+0xb4/0x1b0 [obdclass]
[<ffffffffa08ab1d8>] ll_file_io_generic+0x238/0x6d0 [lustre]
[<ffffffffa08abd22>] ll_file_aio_read+0xe2/0x1f0 [lustre]
[<ffffffffa08abf60>] ll_file_read+0x130/0x240 [lustre]
[<ffffffff810dfc8d>] vfs_read+0xa9/0xf0
[<ffffffff810dfd1e>] sys_read+0x4a/0x6e
[<ffffffff81002a6b>] system_call_fastpath+0x16/0x1b

Comment by Dmitry Eremin (Inactive) [ 26/May/14 ]

For a pure master client and a client with patch set 3, the stacks are as follows:

Writers stack:

[<ffffffff810a86dd>] balance_dirty_pages_ratelimited_nr+0x329/0x3a4
[<ffffffff810a2dc1>] generic_file_buffered_write+0x975/0xa9e
[<ffffffff810a3129>] __generic_file_aio_write+0x23f/0x270
[<ffffffff810a31b2>] generic_file_aio_write+0x58/0xaa
[<ffffffffa08f710d>] vvp_io_write_start+0xfd/0x390 [lustre]
[<ffffffffa03255c5>] cl_io_start+0x65/0x130 [obdclass]
[<ffffffffa0328a74>] cl_io_loop+0xb4/0x1b0 [obdclass]
[<ffffffffa0896be9>] ll_file_io_generic+0x3b9/0x860 [lustre]
[<ffffffffa08972f6>] ll_file_aio_write+0xd6/0x1e0 [lustre]
[<ffffffffa089751c>] ll_file_write+0x11c/0x230 [lustre]
[<ffffffff810dfb9d>] vfs_write+0xac/0xf3
[<ffffffff810dfd8c>] sys_write+0x4a/0x6e
[<ffffffff81002a6b>] system_call_fastpath+0x16/0x1b

Readers stack:

[<ffffffff8109f851>] sync_page+0x4d/0x51
[<ffffffff810a012f>] sync_page_killable+0xe/0x3b
[<ffffffff8109f788>] __lock_page_killable+0x66/0x68
[<ffffffff810a1c2f>] generic_file_aio_read+0x657/0x864
[<ffffffffa08f5216>] vvp_io_read_start+0x2f6/0x460 [lustre]
[<ffffffffa03255c5>] cl_io_start+0x65/0x130 [obdclass]
[<ffffffffa0328a74>] cl_io_loop+0xb4/0x1b0 [obdclass]
[<ffffffffa0896be9>] ll_file_io_generic+0x3b9/0x860 [lustre]
[<ffffffffa0897703>] ll_file_aio_read+0xd3/0x1e0 [lustre]
[<ffffffffa089792c>] ll_file_read+0x11c/0x230 [lustre]
[<ffffffff810dfc8d>] vfs_read+0xa9/0xf0
[<ffffffff810dfd1e>] sys_read+0x4a/0x6e
[<ffffffff81002a6b>] system_call_fastpath+0x16/0x1b

Comment by Dmitry Eremin (Inactive) [ 28/May/14 ]


This is the performance test of patch set 5.

Writers stack:

[<ffffffffa09ba014>] osc_enter_cache+0x48d/0x999 [osc]
[<ffffffffa09b32f8>] osc_queue_async_io+0xa58/0x1780 [osc]
[<ffffffffa099e1d7>] osc_page_cache_add+0x47/0x120 [osc]
[<ffffffffa09a54fa>] osc_io_commit_async+0x10a/0x3b0 [osc]
[<ffffffffa03242f7>] cl_io_commit_async+0x77/0x130 [obdclass]
[<ffffffffa081d581>] lov_io_commit_async+0x271/0x4c0 [lov]
[<ffffffffa03242f7>] cl_io_commit_async+0x77/0x130 [obdclass]
[<ffffffffa08f38d6>] vvp_io_write_commit+0xf6/0x7e0 [lustre]
[<ffffffffa08f40d4>] vvp_io_write_start+0x114/0x390 [lustre]
[<ffffffffa03255c5>] cl_io_start+0x65/0x130 [obdclass]
[<ffffffffa0328a74>] cl_io_loop+0xb4/0x1b0 [obdclass]
[<ffffffffa0893be9>] ll_file_io_generic+0x3b9/0x860 [lustre]
[<ffffffffa08942f6>] ll_file_aio_write+0xd6/0x1e0 [lustre]
[<ffffffffa089451c>] ll_file_write+0x11c/0x230 [lustre]
[<ffffffff810dfb9d>] vfs_write+0xac/0xf3
[<ffffffff810dfd8c>] sys_write+0x4a/0x6e
[<ffffffff81002a6b>] system_call_fastpath+0x16/0x1b

Comment by Jinshan Xiong (Inactive) [ 29/May/14 ]

Then let's use patch set 5 for 2.6.

Comment by Jodi Levi (Inactive) [ 02/Jun/14 ]

http://review.whamcloud.com/#/c/10003/

Comment by Jodi Levi (Inactive) [ 12/Jun/14 ]

Patch landed on master. Please reopen the ticket if more work is needed.

Comment by Jian Yu [ 07/Nov/14 ]

Here is the back-ported patch for Lustre b2_5 branch: http://review.whamcloud.com/12615
The patch depends on the patches for LU-3274 and LU-2139 on Lustre b2_5 branch.

Comment by Christopher Morrone [ 07/Nov/14 ]

Folks, can someone explain to me the rationale behind these changes in change 10003?

1. Remove kernel NFS unstable pages tracking because it killed performance
2. Track unstable pages as part of LRU cache. Otherwise Lustre can use much more memory than max_cached_mb
3. Remove obd_unstable_pages tracking to avoid using global atomic counter

Was not the entire point of the unstable page tracking patches to use the kernel's NFS unstable pages tracking? The idea was that it gives the Linux kernel a back-pressure mechanism when memory is low. max_cached_mb has absolutely no regard for the actual memory usage of a running system; it only works in its own little Lustre world.

And why would it be correct to make unstable pages part of the page LRU? Those pages cannot be freed. They are still in the limbo state where the server has acknowledged them, but not necessarily yet committed them to disk. When memory needs to be freed, we don't send them again, so there is no reason to walk them as part of the LRU.

I am rather concerned that we have thrown away correctness in pursuit of performance.

Comment by Jinshan Xiong (Inactive) [ 09/Nov/14 ]

And why would it be correct to make unstable pages part of the page LRU? Those pages cannot be freed. They are still in the limbo state where the server has acknowledged them, but not necessarily yet committed them to disk. When memory needs to be freed, we don't send them again, so there is no reason to walk them as part of the LRU.

Yes, I agree that it would be more reasonable to take UNSTABLE pages out of the LRU. However, to do that we would need to find the corresponding cl_page in the brw_commit() callback, and that would use an atomic operation. Actually, the current mechanism is not that bad, because usually only a small portion of pages is in the UNSTABLE state, and those pages should stay at the tail of the LRU list; therefore, they shouldn't be scanned at all.

One of my concerns with using UNSTABLE pages tracking is that it will make the OST commit transactions frequently. A problematic client will therefore send lots of SOFT_SYNC RPCs to the OSTs it talks to, which will degrade system performance significantly.

Anyway, this patch didn't remove the feature of UNSTABLE pages tracking. Users can still enable it if performance is not a problem for them.

Comment by Christopher Morrone [ 10/Nov/14 ]

Did you, or did you not, do the following?

Remove kernel NFS unstable pages tracking

That was a critical part of the design. Can you explain how you fixed the memory management issues without that part of the design?

Comment by Jinshan Xiong (Inactive) [ 10/Nov/14 ]

I didn't remove that code, I just made it optional; there is an lprocfs entry to control NFS unstable pages tracking, and it's off by default.
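
(For reference, this appears to be the same unstable_stats tunable exercised earlier in this ticket, e.g. lctl set_param osc.*.unstable_stats=1 to enable the kernel NFS unstable page tracking and =0 to disable it, though the exact entry name in the landed patch 10003 should be confirmed.)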

Comment by Christopher Morrone [ 10/Nov/14 ]

Wow. In that case, that was a really terrible commit message. And probably too many things done in a single commit. So what about these two things:

2. Track unstable pages as part of LRU cache. Otherwise Lustre
can use much more memory than max_cached_mb
3. Remove obd_unstable_pages tracking to avoid using global
atomic counter

So item 2 only happens if unstable page tracking is enabled? Or all the time?

And 3, is that only removed when unstable page tracking is disabled? Or is that removed when it is enabled as well?

Comment by Christopher Morrone [ 10/Nov/14 ]

Also, I might wonder if the unstable page tracking setting is backwards. One offers correctness under fairly typical HPC workloads, and the other sacrifices correctness for speed. Shouldn't correctness be the default, and the speedy-but-pathologically-bad-in-low-memory-situations option be optional?

Comment by Jinshan Xiong (Inactive) [ 11/Nov/14 ]

So item 2 only happens if unstable page tracking is enabled? Or all the time?

Only when unstable pages tracking is enabled.

And 3, is that only removed when unstable page tracking is disabled? Or is that removed when it is enabled as well?

It's removed.

Also, I might wonder if the unstable page tracking setting is backwards. One offers correctness under fairly typical HPC workloads, and the other sacrifices correctness for speed. Shouldn't correctness be the default, and the speedy-but-pathologically-bad-in-low-memory-situations option be optional?

Unstable pages tracking should be turned off on I/O nodes with plenty of memory installed. I don't know what you mean by correctness. Actually, in the current implementation, it doesn't get any feedback about system memory pressure; it sends SOFT_SYNC only when it's low on available LRU slots.

Comment by Christopher Morrone [ 11/Nov/14 ]
Unstable pages tracking should be turned off on I/O nodes with plenty of memory installed.

That statement is puzzling. How much memory is "plenty"? Our I/O nodes have 64GiB of RAM, which I would have thought would be considered "plenty".

But it also kind of misses the point. In the real world, it doesn't matter how much memory is installed on the node. The people who designed the system probably intended the memory to actually be used, not just sit idle all the time because Lustre has no sane memory management.

On an "I/O" node, that memory might need to be shared by function-shipping buffers, system debuggers, system management tools, and other filesystem software. On normal HPC compute nodes the memory is going to be under contention with actual user applications, other filesystems, etc.

My point is that memory contention is a normal situation in the real world. It is not a corner case. If we treat it as a corner case, we'll be putting out a subpar product.

Comment by Jinshan Xiong (Inactive) [ 11/Nov/14 ]

By plenty, I meant there is plenty of available memory. The memory can be temporarily 'lost' when the write has completed but the transaction is not yet committed. Therefore, if the client has the extra memory available to hold UNSTABLE pages between two transactions, it should be able to maintain the highest sustainable write speed. Unfortunately, the exact amount of the extra memory highly depends on the performance and configuration of the OST.

Comment by Gerrit Updater [ 01/Dec/14 ]

N/A

Comment by Christopher Morrone [ 04/Jun/15 ]

By plenty, I meant there is plenty of available memory.

Yes, but how much memory is "plenty"? In the real world, memory is a finite resource. We cannot program with the assumption that there is always free memory. Lustre must behave reasonably by default when memory is under contention.

Unfortunately, the exact amount of the extra memory highly depends on the performance and configuration of the OST.

No, it does not. Client memory is far faster than disk on a remote OST over the network, pretty much by definition. Client memory under normal use cases is also under contention from actual applications, which are not represented by the naive tests that were used to create the graphs in this ticket.

In the real world, people buy client memory sized to fit their application. No one has the budget to buy double or triple the amount of RAM for all their clients just to leave Lustre more buffer space.

Memory contention is normal in the real world, and Lustre's defaults should be selected to meet and function reasonably under real world usage.

Comment by Jinshan Xiong (Inactive) [ 06/Jun/15 ]

In the real world, people buy client memory sized to fit their application. No one has the budget to buy double or triple the amount of RAM for all their clients just to leave Lustre more buffer space.

I didn't mean the memory is for the Lustre client to use as buffer space. For write RPCs, clients have to hold the written pages until the OST commits the corresponding transaction. Therefore, clients have to have extra memory to pin those pages, and applications cannot use it. For ZFS, the typical txg timeout is 5 seconds, which means the clients will pin 5 seconds' worth of written data in memory; depending on the write throughput on the client side, this can be a lot.

There is really nothing we can do on the client side. Probably we can do some tuning for ZFS. The I/O generated by Lustre is different from the I/O of a generic workload, so we may look into the txg timeout or restrict the memory used for the write cache.

Comment by Christopher Morrone [ 08/Jun/15 ]

This is not a problem that can possibly be fixed on the server side. Memory is fundamentally much faster than disk. We can easily exhaust memory on the client side, and there is no possible way that anyone can afford a system that is so fast that both the network and the servers can always keep up.

The 5 second txg timeout in ZFS is the typical txg maximum. Under normal load, ZFS will sync to disk much faster. It only waits the full 5 seconds when load is light. Under typical HPC loads, the speed at which servers can land data is bound by the disk speed, not by the txg limit.

And so the only reasonable solution is for the client to be well behaved under the very normal real-world situation where data is generated faster than it can be written to disk. Part of dealing with that situation is tracking "unstable" pages. Tracking those pages allows the Linux kernel to pause IO under low-memory situations so that memory contention does not go into pathological modes.

Comment by Andreas Dilger [ 08/Jul/15 ]

Jinshan, we need to revisit the changes in patch http://review.whamcloud.com/10003 and the original patches to see what can be done to avoid excessive memory throttling when the client doesn't have memory pressure (e.g. interactive nodes), while still throttling IO on clients when they are short of memory.

Comment by Jinshan Xiong (Inactive) [ 09/Jul/15 ]

There are a few things to look at for this topic:
1. We used to register a cache shrinker for cl_page, but destroying cl_page was too slow, so memory could easily run out if the application was I/O intensive. max_cached_mb at the OSC layer was revived to solve this problem. Now that the performance of destroying cl_page has improved significantly, we can probably revisit the cache shrinker option;

2. Investigate the efficiency of SOFT_SYNC. The idea of SOFT_SYNC is good, but, probably due to the policy, it caused the problem of saturating OSTs. This is why patch 10003 was introduced to disable unstable page tracking at the user's discretion;

3. Memory cache and readahead buffer. There is no way for the readahead code to know the current status of memory pressure. This causes useful pages to be evicted by readahead, or readahead pages themselves to be evicted by new readahead pages. We need a feedback mechanism to throttle the readahead window size when memory is under pressure.

Comment by Christopher Morrone [ 09/Jul/15 ]

1. We used to register a cache shrinker for cl_page, but destroying cl_page was too slow, so memory could easily run out if the application was I/O intensive. max_cached_mb at the OSC layer was revived to solve this problem. Now that the performance of destroying cl_page has improved significantly, we can probably revisit the cache shrinker option;

Yes, that is a serious memory management bug. We hit it at LLNL quite recently. Please open a separate ticket on that bug.

Comment by Jinshan Xiong (Inactive) [ 13/Jul/15 ]

new ticket filed at LU-6842

Comment by Jinshan Xiong (Inactive) [ 08/Feb/18 ]

LU-6842
