[LU-1431] Support for larger than 1MB sequential I/O RPCs Created: 22/May/12 Updated: 08/Dec/20 Resolved: 14/May/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | New Feature | Priority: | Major |
| Reporter: | James A Simmons | Assignee: | Peter Jones |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | performance |
| Issue Links: |
|
| Sub-Tasks: |
|
| Bugzilla ID: | 16900 |
| Epic: | lnet |
| Rank (Obsolete): | 4038 |
| Description |
|
Currently the maximum buffer size for an I/O RPC in Lustre is 1MB. This work changes the amount of data transferred per RPC, allowing it to be sized so that large I/O transfers achieve peak performance on the back-end disk. An additional benefit is a reduction in the number of round trips needed to send the data. |
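To illustrate the round-trip saving, here is a minimal userspace sketch (rpcs_needed() is an illustrative name, not Lustre code):

```c
#include <stdio.h>

#define ONE_MB (1UL << 20)

/* Round-up division: how many bulk RPCs a transfer of io_bytes needs
 * when each RPC carries at most brw_size bytes. */
static unsigned long rpcs_needed(unsigned long io_bytes, unsigned long brw_size)
{
    return (io_bytes + brw_size - 1) / brw_size;
}

int main(void)
{
    unsigned long io = 16 * ONE_MB;

    /* 16MB of dirty data: 16 round trips at 1MB per RPC, 4 at 4MB. */
    printf("1MB RPCs: %lu\n", rpcs_needed(io, 1 * ONE_MB));
    printf("4MB RPCs: %lu\n", rpcs_needed(io, 4 * ONE_MB));
    return 0;
}
```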
| Comments |
| Comment by James A Simmons [ 22/May/12 ] |
|
Initial patch at http://review.whamcloud.com/#change,2872 |
| Comment by Nathan Rutman [ 22/May/12 ] |
|
Thanks James. I have asked Sergii to keep this bug updated with our progress. |
| Comment by Andreas Dilger [ 04/Oct/12 ] |
|
Nathan, Shadow, if you have a patch that allows larger than 1MB bulk RPCs, it would be great to refresh http://review.whamcloud.com/2872 with a working patch. To avoid potential performance regressions under some workloads, it might make sense to land the patch initially with the default maximum bulk RPC size still at 1MB. This would allow users to test with large RPCs (load testing and performance) and provide feedback, with minimal risk. Alternatively, if you have some performance metrics that compare 1MB/4MB performance under different loads (FPP, SSF, single client, many clients), that would be great. |
| Comment by Nathan Rutman [ 05/Oct/12 ] |
|
Xyratex MRP-319 |
| Comment by Nathan Rutman [ 05/Oct/12 ] |
|
This should be finished up in the next two weeks. We split it into a second cleanup patch: clean up the layering in the current code by eliminating direct usage of PTLRPC_MAX_BRW_SIZE. |
| Comment by Andreas Dilger [ 05/Oct/12 ] |
|
Nathan, thanks for the update. Splitting the patch up definitely makes sense, since I recall there are a number of places that don't properly differentiate between PTLRPC_MAX_BRW_SIZE and cl_max_pages_per_rpc. I assume you know the MRP-nnn URLs don't work outside of the Xyratex intranet, which is fine as long as the LU ticket gets updated with the relevant information when the patches are submitted. |
| Comment by Sergii Glushchenko (Inactive) [ 25/Oct/12 ] |
|
Andreas, for the second patch (the actual one that changes the BRW size to 4MB) we need a new connect flag, OBD_CONNECT_MULTIBULK. A comment in lustre_idl.h suggests that such changes must be approved by senior engineers before even sending the patch that reserves the flag for future use. So I'm asking you for approval and will push the review request right after it. Thanks. |
| Comment by Andreas Dilger [ 26/Oct/12 ] |
|
Sergii, could you please explain more fully why the new connect flag is needed. I thought there was an existing OBD_CONNECT_BRW_SIZE feature which already allows the client and server to negotiate the maximum BRW RPC size... As for requesting a flag assignment, this is to avoid conflicting users of OBD_CONNECT flags, which would render the feature bits useless. |
| Comment by James A Simmons [ 09/Nov/12 ] |
|
Will this patch be submitted to Gerrit soon for inspection and testing? |
| Comment by Andreas Dilger [ 07/Dec/12 ] |
|
Could you please update this bug with the current status of the patch. |
| Comment by Alexey Lyashkov [ 08/Dec/12 ] |
|
We have finished the internal inspection - I will ask deen to upload the latest version on Monday. |
| Comment by Peter Jones [ 20/Dec/12 ] |
| Comment by Sergii Glushchenko (Inactive) [ 20/Dec/12 ] |
|
PTLRPC_MAX_BRW_SIZE cleanup: http://review.whamcloud.com/#change,4876
1. Instead of using one PTLRPC_MAX_BRW_SIZE all over the place, introduce MD_MAX_BRW_SIZE, OSC_MAX_BRW_SIZE, FILTER_MAX_BRW_SIZE (should it be OFD_MAX_BRW_SIZE now?) and an auxiliary ONE_MB_BRW_SIZE.
2. From the original 4MB I/O patch (bz16900), take in the code which embeds obd_connect_data into obd_export in order to store the other node's ocd_brw_size value. The idea is to have a "floating" brw size across the cluster: during connection, the nodes need to decide on a brw size suitable for both of them, which is min(node1_brw_size, node2_brw_size); see the sketch after this comment. |
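A minimal standalone sketch of that min() negotiation, assuming simplified stand-ins for the lustre_idl.h definitions (the flag value, struct layout, and negotiate_brw_size() below are illustrative, not the actual Lustre code):

```c
#include <stdint.h>

/* Illustrative stand-ins; the real definitions live in lustre_idl.h
 * and differ in value and layout. */
#define ONE_MB_BRW_SIZE      (1U << 20)
#define OBD_CONNECT_BRW_SIZE 0x1ULL   /* placeholder flag bit */

struct obd_connect_data {
    uint64_t ocd_connect_flags;
    uint32_t ocd_brw_size;   /* peer's advertised max bulk RPC size */
};

/* During connect, settle on a brw size both nodes can handle:
 * min(local limit, peer limit), falling back to 1MB for peers
 * that do not advertise OBD_CONNECT_BRW_SIZE. */
static uint32_t negotiate_brw_size(const struct obd_connect_data *ocd,
                                   uint32_t local_max)
{
    if (!(ocd->ocd_connect_flags & OBD_CONNECT_BRW_SIZE))
        return ONE_MB_BRW_SIZE;
    return ocd->ocd_brw_size < local_max ? ocd->ocd_brw_size : local_max;
}
```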
| Comment by Sergii Glushchenko (Inactive) [ 21/Dec/12 ] |
|
Updated patch has been pushed to Gerrit. |
| Comment by Sergii Glushchenko (Inactive) [ 10/Jan/13 ] |
|
The actual 4MB I/O patch: http://review.whamcloud.com/#change,4993 |
| Comment by Sergii Glushchenko (Inactive) [ 10/Jan/13 ] |
|
As part of the task, I've developed a couple of tests. The first one, test_231a, checks that in the case of a single 4MB transfer only one BRW RPC is sent. It depends on the information from /proc/fs/lustre/obdfilter/$OST/brw_stats, and it seems that there are some issues with these statistics in your codebase, because despite the I/O having taken place the file is empty (only row/column names are shown). I haven't looked into this yet, so maybe you will have some ideas. Thanks. |
| Comment by Sergii Glushchenko (Inactive) [ 14/Jan/13 ] |
|
Andreas, thank you for the hint regarding osd-ldiskfs. I've fixed the test and have just pushed an updated patch to Gerrit. It works on my local setup, so I think there will be no problems. |
| Comment by Oleg Drokin [ 30/Jan/13 ] |
|
I see that these patches fail testing at random places that never failed before, which is worrisome. |
| Comment by Sergii Glushchenko (Inactive) [ 30/Jan/13 ] |
|
I don't think these patches fail at completely random places. The latest test runs for both patches fail in common places: replay-single and sanity-quota. As for the testing, both patches differ from our original ones due to requests from Andreas, so I don't think our testing results for the original patches are relevant. |
| Comment by Nathan Rutman [ 30/Jan/13 ] |
|
Sergii, please see if you can reproduce the failures; it's our responsibility to fix any problems here. |
| Comment by Andreas Dilger [ 31/Jan/13 ] |
|
I noticed that the llite readahead window increment (RAS_INCREASE_STEP) was also based on PTLRPC_MAX_BRW_SIZE, but this is too large for PTLRPC_MAX_BRW_SIZE of 32MB when the actual cl_max_pages_per_rpc is only 1MB. Instead, limit the readahead window growth to match the inode->i_blkbits (current default min(PTLRPC_MAX_BRW_SIZE * 2, 4MB)), which is still reasonable regardless of the blocksize. This also allows tuning the readahead on a per-inode basis in the future, depending on which OSTs the file is striped over, by fixing the i_blkbits value. |
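A small standalone sketch of the idea, assuming 4KB pages (ras_increase_step() is an illustrative name, not the llite macro):

```c
#include <stdio.h>

#define PAGE_SHIFT 12   /* assuming 4KB pages */

/* Derive the readahead window increment from the inode blocksize
 * (i_blkbits) instead of PTLRPC_MAX_BRW_SIZE. */
static unsigned long ras_increase_step(unsigned int i_blkbits)
{
    return 1UL << (i_blkbits - PAGE_SHIFT);
}

int main(void)
{
    /* i_blkbits = 22 corresponds to a 4MB "block": grow by 1024 pages. */
    printf("step = %lu pages\n", ras_increase_step(22));
    return 0;
}
```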
| Comment by Andreas Dilger [ 05/Feb/13 ] |
|
Deen, sorry about the problems with the 32MB RPC size. That would have allowed much better flexibility for testing and for updated hardware in the future. As stated in the patch, there are a number of issues that still need to be addressed.
The current test results for this patch show a good improvement on FPP write, but a net loss for the other I/O loads.

IOR Single-shared file
| Date | RPC size | Clients | Write | Read |
| 2013/02/03 | 4MB | 105 | 7153 | 8200 |
| 2013/02/03 | 1MB | 105 | 7996 | 9269 |

IOR File-per-process
| Date | RPC size | Clients | Write | Read |
| 2013/02/03 | 4MB | 105 | 9283 | 6000 |
| 2013/02/03 | 1MB | 106 | 7233 | 6115 |

If this is the case, we could still e.g. default to sending 4MB write RPCs if there is enough data in cache and the client holds an exclusive DLM lock. |
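A hedged sketch of that fallback heuristic (all names are illustrative; this is not the osc code):

```c
#include <stdbool.h>
#include <stddef.h>

#define ONE_MB ((size_t)1 << 20)

/* Only batch up to a 4MB write RPC when the client holds an exclusive
 * DLM lock and has at least a full large RPC of dirty data cached;
 * otherwise stay at the conservative 1MB default. */
static size_t choose_write_rpc_size(size_t dirty_bytes, bool has_exclusive_lock)
{
    if (has_exclusive_lock && dirty_bytes >= 4 * ONE_MB)
        return 4 * ONE_MB;
    return ONE_MB;
}
```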
| Comment by Andreas Dilger [ 14/May/13 ] |
|
|
| Comment by Artem Blagodarenko (Inactive) [ 05/Jun/13 ] |
|
Our testing system shows that replay-ost-single test_5 fails:

Lustre: DEBUG MARKER: == replay-ost-single test 5: Fail OST during iozone == 21:21:13 (1369851673)
Lustre: Failing over lustre-OST0000
LustreError: 11-0: an error occurred while communicating with 0@lo. The ost_write operation failed with -19
LustreError: Skipped 1 previous similar message
Lustre: lustre-OST0000-osc-ffff8800514d3400: Connection to lustre-OST0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0000: shutting down for failover; client state will be preserved.
Lustre: OST lustre-OST0000 has stopped.
Lustre: server umount lustre-OST0000 complete
LustreError: 137-5: UUID 'lustre-OST0000_UUID' is not available for connect (no target)
LustreError: Skipped 1 previous similar message
LDISKFS-fs (loop1): mounted filesystem with ordered data mode. Opts:
LDISKFS-fs (loop1): mounted filesystem with ordered data mode. Opts:
Lustre: 16962:0:(ldlm_lib.c:2195:target_recovery_init()) RECOVERY: service lustre-OST0000, 2 recoverable clients, last_transno 1322
Lustre: lustre-OST0000: Now serving lustre-OST0000 on /dev/loop1 with recovery enabled
Lustre: 2398:0:(ldlm_lib.c:1021:target_handle_connect()) lustre-OST0000: connection from lustre-MDT0000-mdtlov_UUID@0@lo recovering/t0 exp ffff88005ca19c00 cur 1369851700 last 1369851697
Lustre: 2398:0:(ldlm_lib.c:1021:target_handle_connect()) Skipped 3 previous similar messages
Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
Lustre: lustre-OST0000: Recovery over after 0:01, of 2 clients 2 recovered and 0 were evicted.
Lustre: lustre-OST0000-osc-MDT0000: Connection restored to lustre-OST0000 (at 0@lo)
Lustre: Skipped 1 previous similar message
LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 65536 (requested 32768)
LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 2097152 (requested 1048576)
Lustre: lustre-OST0000: received MDS connection from 0@lo
Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans
Lustre: DEBUG MARKER: iozone rc=1
Lustre: DEBUG MARKER: replay-ost-single test_5: @@@@@@ FAIL: iozone failed

These messages look related to the 4MB I/O patch:

LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 65536 (requested 32768)
LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 2097152 (requested 1048576)

I believe this test also fails in Intel's master branch, but it is skipped as SLOW during testing:

test_5 SKIP 0 0 skipping SLOW test 5

Could you please run this test (it is marked as SLOW) and check whether it fails? |
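For context, a simplified standalone version of the consistency check behind those errors (check_transferred() is an illustrative name; the real check is in check_write_rcs() in osc_request.c). Note that in both log lines exactly twice the requested size was reported, which may hint at bulk data being double-counted across the replay/resend:

```c
#include <errno.h>

/* After a bulk write, the byte count the server reports moving must
 * equal what the client requested; any mismatch fails the RPC. */
static int check_transferred(long requested, long transferred)
{
    if (transferred != requested)
        return -EPROTO;
    return 0;
}
```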
| Comment by Peter Jones [ 05/Jun/13 ] |
|
Artem, could you please open a new ticket for this failure so we can track it? Thanks, Peter |
| Comment by Artem Blagodarenko (Inactive) [ 05/Jun/13 ] |
|
|
| Comment by Peter Jones [ 05/Jun/13 ] |
|
Thanks Artem! |
| Comment by Artem Blagodarenko (Inactive) [ 14/Jun/13 ] |