[LU-8515] OSC: Send RPCs with full extents Created: 18/Aug/16 Updated: 08/Mar/18 Resolved: 17/Dec/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Patrick Farrell (Inactive) | Assignee: | Patrick Farrell (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
In Lustre 2.7 and newer, single node multi-process single-shared-file write performance is significantly slower than in Lustre 2.5. This is due to a problem in deciding when to make an RPC (i.e., the decisions made in osc_makes_rpc).

Currently, Lustre decides to send an RPC under a number of conditions; one of them is when the count of dirty pages on an OSC object reaches cl_max_pages_per_rpc. With multiple processes writing to the same object, this "count dirty pages" method will see there are enough dirty pages for a full RPC even though no single extent has reached full size, so extents are flushed while they are still being written to and the resulting RPCs are smaller than optimal.

Instead, we remove this check and add extents to a special list of full extents once they reach full (max RPC) size; an RPC is made when that list is non-empty.

With a good I/O pattern, like usually used in benchmarking, the dirty page count only reaches the limit when an extent is already full, so behavior is unchanged. In IOR tests with multiple writers to a single file, the change improves write bandwidth significantly.

Here's some specific data:

IOR: aprun -n 1 $(IOR) -w -t 4m -b 16g -C -e -E -k -u -v
Unmodified:
Modified (full extent):

Here's an example of the improvement available. We're using 8 threads and 1 GB of data per thread. (Results are similar with a larger amount of data per thread.)
Modified:
Note the above is actually a shade higher than the single thread performance, despite being at essentially the limit for the target (from this node, with these settings).

2 stripes:
1 thread, unmodified:
1 thread, modified:
8 threads, unmodified:
8 threads, modified:

8 stripes:
1 thread, unmodified:
1 thread, modified:
8 threads, unmodified:
8 threads, modified:

16 stripes:
8 threads, unmodified:
8 threads, modified: |
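As a rough sketch of the idea in the description (not the exact patch; the names oo_full_exts, oe_nr_pages and the oe_link list linkage below are assumptions for illustration), the per-object dirty page count check is replaced by a check of a list that extents join once they can fill an RPC by themselves:

{code}
/* Sketch only: oo_full_exts, oe_nr_pages and oe_link are illustrative
 * names, not necessarily the ones used by the real patch. */

/* As pages are added to an extent, move it onto the per-object list of
 * "full" extents once it can fill a whole RPC by itself. */
static void osc_extent_check_full(struct client_obd *cli,
                                  struct osc_object *osc,
                                  struct osc_extent *ext)
{
        if (ext->oe_nr_pages >= cli->cl_max_pages_per_rpc)
                list_move_tail(&ext->oe_link, &osc->oo_full_exts);
}

/* In osc_makes_rpc(), instead of firing as soon as the object has a full
 * RPC worth of dirty pages in any shape:
 *
 *     if (atomic_read(&osc->oo_nr_writes) >= cli->cl_max_pages_per_rpc)
 *             return true;
 *
 * fire only when at least one extent is ready to form a full RPC: */
static bool osc_object_has_full_extent(struct osc_object *osc)
{
        return !list_empty(&osc->oo_full_exts);
}
{code}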
| Comments |
| Comment by Gerrit Updater [ 18/Aug/16 ] |
|
Patrick Farrell (paf@cray.com) uploaded a new patch: http://review.whamcloud.com/22012 |
| Comment by Jinshan Xiong (Inactive) [ 19/Aug/16 ] |
|
Were these tests run with a 4MB transfer size, and with max_pages_per_rpc set to its default value of 256? |
| Comment by Patrick Farrell (Inactive) [ 19/Aug/16 ] |
|
Yes, they were all done with a 4 MiB transfer size and 1 MiB RPCs (so, 256 max_pages_per_rpc). I can't increase the RPC size on this system, but I can reduce the transfer size to 1 MiB, so they're matched. (Note that the stripe size is 1 MiB.) Here are (a few of) those tests repeated with a 1 MiB transfer size. If you have specific tests you'd like run, let me know and I can try to get time later. I can also get some rpc_stats data for a few tests if needed.

For the modified version, the results are the same. For the unmodified version, speed is up a bit across the board, but still a lot slower than the modified version. This is expected, since with 4 MiB transfers every write touches multiple stripes (all but guaranteeing multiple active extents in each osc object at the same time, which causes us to send small RPCs). With 1 MiB transfers this doesn't happen all the time, but it does happen some of the time. (When writers, stripe sizes, and stripe counts match up, like in the 8 stripe case, I'm not clear on why we still have problems. I would expect the writers not to interfere in that case, but the results below show they seem to. Probably needs separate investigation.)

Singly striped, 1 thread, 1 MiB transfer size. (These results should be the same. They tend to be, within the margin of error.)
Unmodified:
Modified:

Here's where we see the difference. 8 threads, 1 MiB transfer size:
Modified:

8 stripes, 8 threads, 1 MiB transfer size:
Modified:

16 stripes, 8 threads, 1 MiB transfer size:
Modified: |
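To make the size relationships above concrete, here is a back-of-the-envelope illustration (a standalone sketch assuming 4 KiB pages, not Lustre code):

{code}
/* Back-of-the-envelope illustration, assuming 4 KiB pages. */
#include <stdio.h>

int main(void)
{
        const long page_size         = 4096;      /* 4 KiB */
        const long max_pages_per_rpc = 256;       /* default */
        const long stripe_size       = 1L << 20;  /* 1 MiB */
        const long transfer_size     = 4L << 20;  /* 4 MiB */

        /* 256 pages * 4 KiB = 1 MiB: a full-size RPC is exactly one
         * stripe in this configuration. */
        printf("full RPC size: %ld bytes\n", page_size * max_pages_per_rpc);

        /* A single 4 MiB transfer covers 4 different 1 MiB stripes, so
         * each writer leaves partial extents on several OSC objects at
         * once -- the pattern that produced small RPCs above. */
        printf("stripes touched per transfer: %ld\n",
               transfer_size / stripe_size);
        return 0;
}
{code}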
| Comment by Patrick Farrell (Inactive) [ 19/Aug/16 ] |
|
By the way, from another report of what looks like the same issue:

{quote}SIDE NOTE: As an aside, I ran into what appeared to be poor RPC formation when writing to a single shared file from many different threads, and simultaneously hitting the per-OSC dirty page limit. During "Test 5" (and to a lesser extent, "Test 2" as well) I began to see many non-1M RPCs being sent to the OSS nodes, whereas with the other tests, nearly all of the RPCs were 1M in size. This effect got worse as the number of tasks increased. What I think was happening is this:

I believe this effect was only apparent in Test 2 and Test 5, because the other tests just weren't able to push data to the server fast enough to bump up against the dirty limit. It would be nice if the RPC formation engine would keep these small buffers around, waiting for them to reach a full 1M, before flushing them out to the server. This is especially harmful on ZFS backends because it can force read-modify-write operations, as opposed to only performing writes when the RPC is properly aligned at 1M.{quote}

He was partly right. It turns out the main reason for non-optimal RPC sizes was actually this choice about when to send in osc_makes_rpc. |
| Comment by Jinshan Xiong (Inactive) [ 19/Aug/16 ] |
|
The client chooses to send an RPC early because there is no more grant, or the dirty page count has reached its limit, so it can't cache more dirty data. The problem with your patch is that it may cause livelock: the server is short of space, there is no chance for this client to make a full RPC, individual threads hold their own partial extents, and nobody can move forward. |
| Comment by Patrick Farrell (Inactive) [ 19/Aug/16 ] |
|
Jinshan,

Hmmm. I don't think this change affects the behavior of the client in the case where we're out of grant or the dirty page count has reached its limit. Those cases are handled separately in osc_makes_rpc; there, we send an RPC due to cl_cache_waiters.

The check I replaced is this one:
if (atomic_read(&osc->oo_nr_writes) >= cli->cl_max_pages_per_rpc)
which just sends an RPC when we have enough dirty pages for one. We already decide to send an RPC when there are cache waiters.

Are you saying there is a case where we must write out some data, but we do not generate hp exts, urgent exts, or have cache waiters? I think that's what would be required to get a livelock. If so, then the existing check won't make us safe either. It's a per-object check that won't fire unless there are >= cl_max_pages_per_rpc dirty pages in that object. So with the current code, a large number of objects could cause the livelock you describe, if all of them had a small amount of data written to them. (Since this check wouldn't catch that.) |
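For readers following along, here is a simplified paraphrase of the write-side checks being discussed (not the literal upstream osc_makes_rpc; oo_full_exts is an assumed name, and field details may differ between trees):

{code}
/* Simplified paraphrase of the write-side checks discussed above, not the
 * literal upstream osc_makes_rpc(); oo_full_exts is an assumed name. */
static bool osc_makes_write_rpc(struct client_obd *cli,
                                struct osc_object *osc)
{
        /* High-priority and urgent extents (lock cancellation, sync
         * writes, etc.) always justify an RPC. */
        if (!list_empty(&osc->oo_hp_exts) ||
            !list_empty(&osc->oo_urgent_exts))
                return true;

        /* Someone is waiting on grant / dirty-page limits; flushing is
         * how that space gets freed, so send regardless of extent size. */
        if (!list_empty(&cli->cl_cache_waiters))
                return true;

        /* The check in question: the old code fired once the object had
         * cl_max_pages_per_rpc dirty pages in any shape; the patch fires
         * only when a single extent has reached full RPC size. */
        if (!list_empty(&osc->oo_full_exts))
                return true;

        return false;
}
{code}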
| Comment by Jinshan Xiong (Inactive) [ 19/Aug/16 ] |
The function osc_makes_rpc() just tells the I/O engine that it could be possible to issue a BRW RPC, and then the I/O engine scans the OSC page cache to try to compose one. It doesn't guarantee that it can make one. With this in mind, I think we should keep the code:
if (atomic_read(&osc->oo_nr_writes) >= cli->cl_max_pages_per_rpc)
but it's a really good idea to have a full extent list and compose RPCs from that list first. Anyway, I think your observation is valuable, and your patch makes a lot of sense to me; it just needs a little tweaking. |
| Comment by Jinshan Xiong (Inactive) [ 19/Aug/16 ] |
|
btw, is it possible to write a few test cases based on your findings so that we won't break this in the future? |
| Comment by Patrick Farrell (Inactive) [ 19/Aug/16 ] |
|
About test cases: |
| Comment by Patrick Farrell (Inactive) [ 19/Aug/16 ] |
Right.
I don't think we can do that. I can try adding it back in and running some benchmarks (let me know if you want me to do that), but we call osc_makes_rpc for every page we dirty. So we'll try to make an RPC, and (in get_write_extents) if there are no full extents, we'll still send some data. So we'll send our extents before they get to full size. |
| Comment by Jinshan Xiong (Inactive) [ 19/Aug/16 ] |
It's a bug if this happens: we should only call this when it's possible to make an RPC, for example when releasing an osc_extent, in brw_interpret(), or in other urgent cases. Yes, please do the tests with my suggestions; it will help me understand the code better.
| Comment by Patrick Farrell (Inactive) [ 19/Aug/16 ] |
|
Ah, right, sorry. I misread the code (there are a lot of ways to call osc_makes_rpc...). You're right about when we call it. So it probably is safe (in terms of performance) to add back that check, as long as we track full extents and try to send them first in get_write_extents. I'll try it - it would just replace checking list_empty in osc_makes_rpc. If it works, then it doesn't really matter which check we have there, the current one or checking the full_ext list directly. |
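A sketch of what "track full extents and try to send them first in get_write_extents" could look like (illustrative only: osc_extent_list_to_rpclist is a made-up helper that moves extents from a list onto the RPC list up to a page budget, and oo_full_exts is an assumed field name):

{code}
/* Illustrative sketch of preferring full extents in get_write_extents();
 * osc_extent_list_to_rpclist() is a made-up helper that moves extents
 * from @list onto @rpclist up to the remaining page budget and returns
 * the number of pages added. */
static unsigned int get_write_extents_sketch(struct osc_object *osc,
                                             struct list_head *rpclist,
                                             unsigned int max_pages)
{
        unsigned int pages = 0;

        /* Highest priority: extents that must go out now. */
        pages += osc_extent_list_to_rpclist(&osc->oo_hp_exts, rpclist,
                                            max_pages - pages);
        pages += osc_extent_list_to_rpclist(&osc->oo_urgent_exts, rpclist,
                                            max_pages - pages);

        /* Then extents that can already fill an RPC by themselves, so the
         * RPCs we form are full-size whenever possible. */
        pages += osc_extent_list_to_rpclist(&osc->oo_full_exts, rpclist,
                                            max_pages - pages);

        /* (Falling back to partial extents is omitted in this sketch.) */
        return pages;
}
{code}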
| Comment by Patrick Farrell (Inactive) [ 22/Aug/16 ] |
|
With this check, I think we're probably getting a few more non-optimally sized RPCs. This is because at the end of brw_interpret, osc_io_unplug is called, and that results in calling osc_check_rpcs. So each time a ptlrpcd thread completes a brw write, it calls osc_check_rpcs, which calls osc_makes_rpc. That's one case where we call osc_makes_rpc when we don't know if anything is ready for us. So I think with the oo_nr_writes check, we will sometimes send small RPCs with incomplete extents (that we're still writing to). Also, I think the oo_nr_writes >= cl_max_pages_per_rpc check is specifically trying to send when we expect to send out a full-size RPC. So I think it's better to just check for 'full extents'; it describes what we're really doing better. But if we do keep the oo_nr_writes check, I think it will work fine. |
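The call chain being described, roughly (a paraphrase in comment form, not the exact code):

{code}
/* Rough paraphrase of the completion path described above:
 *
 *   brw_interpret()          runs in a ptlrpcd thread when a BRW RPC completes
 *     -> osc_io_unplug()     kicks the I/O engine for the object
 *       -> osc_check_rpcs()
 *         -> osc_makes_rpc() asks "should another RPC be formed now?"
 *
 * With the oo_nr_writes >= cl_max_pages_per_rpc check, this opportunistic
 * call can answer "yes" while the object's extents are still being filled,
 * so partial extents get packaged into undersized RPCs.  With the
 * "full extent list non-empty" check, it answers "yes" only when at least
 * one extent can fill an RPC by itself.
 */
{code}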
| Comment by Patrick Farrell (Inactive) [ 22/Aug/16 ] |
|
Sorry about deleting that comment. It turns out I ran the tests above in FPP mode, not SSF mode. Ugh.

Rerunning the tests, I see a significant difference in favor of not using the oo_nr_writes check. Until we hit the bandwidth limit of the node (16 stripes), !list_empty(full_ext) is significantly faster. My explanation is still the call to osc_makes_rpc we get every time we call brw_interpret; I think that's generating a lot of non-optimal RPCs.

Here's some data, using 1 MiB transfer sizes (4 MiB was very similar):

!list_empty(full_ext):

8 stripes, 8 processes:
!list_empty(full_ext):

16 stripes, 8 processes:
!list_empty(full_ext):

16 stripes, 16 processes:
!list_empty(full_ext): |
| Comment by Andreas Dilger [ 20/Sep/16 ] |
|
Patrick, do you have any before/after testing done with clients doing sub-RPC write sizes (e.g. interleaved 256KB per client so that they don't merge into a single full RPC)? One problem we had in the past was sub-sized dirty pages filling up memory without being flushed in a timely manner, so I just want to make sure that the new code deferring RPCs until they are full is not over-zealous in delaying dirty data that doesn't make up a full RPC. |
| Comment by Patrick Farrell (Inactive) [ 20/Sep/16 ] |
|
No, but I will try to get some. Should be quick once machines are available today. (I am pretty confident this should be OK. The new code only waits longer than the old code in the case of multiple writers per stripe. I don't know what ensures pages go out in a timely manner, but I don't think I've modified it.) Can you clarify what you mean by interleaved? Other than by doing direct and buffered I/O, I can't think of how to prevent extents in that situation being packaged into one RPC. I suppose I/O from two different clients could sort of do that, but I think we'd actually get data written out via ldlm lock cancellation. (I could use group locks to avoid that.) |
| Comment by Andreas Dilger [ 20/Sep/16 ] |
|
By interleaved writes I mean having client A write [0,256KB), client B write [256KB,512KB), ... so that the writes cannot be merged on the client into a single 1MB RPC. That doesn't test the issue I'm wondering about if these writes are just being done by different threads on the same client. |
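A sketch of the interleaved pattern being described, as hypothetical test code run as one process per client (the RANK environment variable and the /mnt/lustre path are assumptions for illustration):

{code}
/* Hypothetical test sketch of the interleaved pattern described above:
 * each writer owns every Nth 256 KiB chunk, so no single client ever
 * accumulates a contiguous 1 MiB of dirty data to merge into a full RPC.
 * Run one copy per client with a distinct RANK environment variable. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const size_t chunk   = 256 * 1024;   /* sub-RPC sized write  */
        const int    writers = 2;            /* client A and client B */
        const int    nchunks = 64;
        const char  *env     = getenv("RANK");
        const int    rank    = env ? atoi(env) : 0;

        char *buf = malloc(chunk);
        memset(buf, 'A' + rank, chunk);

        int fd = open("/mnt/lustre/interleaved", O_WRONLY | O_CREAT, 0644);
        for (int i = 0; i < nchunks; i++) {
                /* rank 0 writes [0,256K), rank 1 writes [256K,512K), ... */
                off_t off = (off_t)(i * writers + rank) * (off_t)chunk;
                pwrite(fd, buf, chunk, off);
        }
        close(fd);
        free(buf);
        return 0;
}
{code}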
| Comment by Patrick Farrell (Inactive) [ 20/Sep/16 ] |
|
OK. My concern with that is that due to ldlm lock exchange (think lock ahead), those bytes will be written out by lock cancellation, so they won't have a chance to hang around anyway. I'll see about running such a job and checking the results. |
| Comment by Jinshan Xiong (Inactive) [ 20/Sep/16 ] |
In that case, the writeback daemon should be started to write back the pages, and those pages should appear in the urgent list, which shouldn't be affected by this patch.
| Comment by Gerrit Updater [ 17/Dec/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/22012/ |
| Comment by Peter Jones [ 17/Dec/16 ] |
|
Landed for 2.10 |