[LU-1431] Support for larger than 1MB sequential I/O RPCs Created: 22/May/12  Updated: 08/Dec/20  Resolved: 14/May/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: New Feature Priority: Major
Reporter: James A Simmons Assignee: Peter Jones
Resolution: Fixed Votes: 0
Labels: performance

Issue Links:
Related
is related to LUDOC-80 4MB RPC Doc Changes Resolved
is related to LU-2424 add memory limits for ptlrpc service Resolved
is related to LU-4533 rpc_stats histogram does not support ... Open
is related to LU-3308 large readdir chunk size slows unlink... Reopened
is related to LU-2791 Stuck client on server OOM/lost message Resolved
is related to LU-2816 sanity-benchmark test_bonnie slow aft... Resolved
is related to LU-2598 obdfilter-survey LBUG ASSERTION( iobu... Resolved
is related to LU-2790 Failure to allocated osd keys leads t... Resolved
is related to LU-3438 replay-ost-single test_5 failed with ... Resolved
is related to LU-2748 OSD uses kmalloc with high order to a... Resolved
Sub-Tasks:
Key
Summary
Type
Status
Assignee
LU-2702 client read RPCs do not generate read... Technical task Resolved Andreas Dilger  
LU-2756 increase OST_BUFSIZE to allow multipl... Technical task Resolved Liang Zhen  
LU-2748 OSD uses kmalloc with high order to a... Technical task Resolved Alex Zhuravlev  
Bugzilla ID: 16900
Epic: lnet
Rank (Obsolete): 4038

 Description   

Currently the maximum buffer size for a Lustre I/O RPC is 1MB. This work changes the amount of data that can be carried in a single bulk RPC so that large I/O transfers can be sized to achieve peak performance on the back-end disks. An additional benefit is a reduction in the number of round trips needed to send the same amount of data.



 Comments   
Comment by James A Simmons [ 22/May/12 ]

Initial patch at http://review.whamcloud.com/#change,2872

Comment by Nathan Rutman [ 22/May/12 ]

Thanks James. I have asked Sergii to keep this bug updated with our progress.

Comment by Andreas Dilger [ 04/Oct/12 ]

Nathan, Shadow,
any update on this ticket? The last time we spoke about this, the patch from James wasn't working properly due to bad interaction with the 1MB bulk readdir RPCs on the MDS, and some other issues.

If you have a patch that allows larger than 1MB bulk RPCs, it would be great to refresh http://review.whamcloud.com/2872 with a working patch. In order to avoid potential performance regressions under some workloads, it might make sense to land the patch initially with the default maximum bulk RPC size still at 1MB. This will allow users to test with large RPCs (load and performance testing) and provide feedback, with minimal risk. Alternatively, if you have performance metrics that compare 1MB and 4MB RPCs under different loads (FPP, SSF, single client, many clients), that would be great.

Comment by Nathan Rutman [ 05/Oct/12 ]

Xyratex MRP-319

Comment by Nathan Rutman [ 05/Oct/12 ]

This should be finished up in the next two weeks.

We split off a second cleanup patch:
MRP-687 PTLRPC_MAX_BRW_SIZE usage cleanup.

Clean up the layering in the current code by eliminating direct usage of the
PTLRPC_MAX_BRW_SIZE macro outside of the ptlrpc module. This should help
us achieve a "floating" max brw size value across the cluster, which in
turn should help with the 4MB IO task.

Comment by Andreas Dilger [ 05/Oct/12 ]

Nathan, thanks for the update.

Splitting the patch up definitely makes sense, since I recall there are a number of places that don't properly differentiate between PTLRPC_MAX_BRW_SIZE and cl_max_pages_per_rpc.

I assume you know the MRP-nnn URLs don't work outside of the Xyratex intranet, which is fine as long as the LU ticket gets updated with relevant information when the patches are submitted.

Comment by Sergii Glushchenko (Inactive) [ 25/Oct/12 ]

Andreas,

For the second patch (the actual one that changes the BRW size to 4MB) we need a new connect flag, OBD_CONNECT_MULTIBULK. The comment in lustre_idl.h says that such changes must be approved by senior engineers before even sending the patch that reserves the flag for future use. So I'm asking you for approval and will push the review request right after it. Thanks.

Comment by Andreas Dilger [ 26/Oct/12 ]

Sergii, could you please explain more fully why the new connect flag is needed? I thought there was already an existing OBD_CONNECT_BRW_SIZE feature that allows the client and server to negotiate the maximum BRW RPC size...

As for requesting a flag assignment, this is to avoid conflicting users of OBD_CONNECT flags, which would render the feature bits useless.

Comment by James A Simmons [ 09/Nov/12 ]

Will this patch be submitted to Gerrit soon for inspection and testing?

Comment by Andreas Dilger [ 07/Dec/12 ]

Could you please update this bug with the current status of the patch.

Comment by Alexey Lyashkov [ 08/Dec/12 ]

We have finished the internal inspection. I will ask Deen to upload the latest version on Monday.

Comment by Peter Jones [ 20/Dec/12 ]

http://review.whamcloud.com/#change,4876

Comment by Sergii Glushchenko (Inactive) [ 20/Dec/12 ]

PTLRPC_MAX_BRW_SIZE cleanup: http://review.whamcloud.com/#change,4876

1. Instead of using one PTLRPC_MAX_BRW_SIZE all over the place, introduce MD_MAX_BRW_SIZE, OSC_MAX_BRW_SIZE, FILTER_MAX_BRW_SIZE (should it be OFD_MAX_BRW_SIZE now?) and auxiliary ONE_MB_BRW_SIZE.
ptlrpc still uses its own PTLRPC_MAX_BRW_SIZE, while other subsystems now use their corresponding macros. The actual 4MB IO patch will change only PTLRPC_MAX_BRW_SIZE and OSC_MAX_BRW_SIZE to 4MB, leaving other subsystems intact.

2. From the original 4MB IO patch (bz16900), take in the code that embeds obd_connect_data into obd_export in order to store the other node's ocd_brw_size value. The idea is to have a "floating" brw size across the cluster: during connection, the nodes agree on a brw size suitable for both of them, which is min(node1_brw_size, node2_brw_size), as sketched below.
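
A minimal sketch of that negotiation, assuming the peer's advertised size is kept on the export; the helper name and fallback below are illustrative, not the patch code:

/*
 * Illustrative sketch only (not the actual patch): each node advertises
 * its maximum bulk size in ocd_brw_size at connect time, the peer's value
 * is remembered in the export, and every bulk transfer is capped at the
 * minimum of the two sides.
 */
unsigned int brw_size_negotiate(unsigned int local_brw_size,
                                unsigned int peer_brw_size)
{
        /* A peer that does not advertise a size gets the old 1MB limit. */
        if (peer_brw_size == 0)
                peer_brw_size = 1U << 20;

        return peer_brw_size < local_brw_size ? peer_brw_size
                                              : local_brw_size;
}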

Comment by Sergii Glushchenko (Inactive) [ 21/Dec/12 ]

Updated patch has been pushed to Gerrit.

Comment by Sergii Glushchenko (Inactive) [ 10/Jan/13 ]

The actual 4MB I/O patch: http://review.whamcloud.com/#change,4993

Comment by Sergii Glushchenko (Inactive) [ 10/Jan/13 ]

As part of the task, I've developed a couple of tests. The first one, test_231a, checks that a single 4MB transfer generates only one BRW RPC. It depends on the information from /proc/fs/lustre/obdfilter/$OST/brw_stats, and it seems there are some issues with these statistics in your codebase: despite the I/O having happened, the file is empty (only the row/column names are shown). I haven't looked into this yet, so maybe you will have some ideas. Thanks.

Comment by Sergii Glushchenko (Inactive) [ 14/Jan/13 ]

Andreas, thank you for the hint regarding osd-ldiskfs. I've fixed the test and have just pushed an updated patch to Gerrit. It works on my local setup, so I think there will be no problems.
So the current state of the task is that both patches successfully pass all the tests (well, to be 100% sure we need to wait for the test results for the latest push, but I'm fairly confident there will be no issues).

Comment by Oleg Drokin [ 30/Jan/13 ]

I see that these patches fail testing at random places that never failed before, which is worrisome.
What sort of testing did these patches see on your end?

Comment by Sergii Glushchenko (Inactive) [ 30/Jan/13 ]

I don't think these patches fail at completely random places. The latest test runs for both patches fail in the same places: replay-single and sanity-quota.

As for the testing, the thing is that both patches differ from our original ones due to requests from Andreas, so I don't think that our testing results for the original patches are relevant.

Comment by Nathan Rutman [ 30/Jan/13 ]

Sergii, please see if you can reproduce the failures; it's our responsibility to fix any problems here.

Comment by Andreas Dilger [ 31/Jan/13 ]

I noticed that the llite readahead window increment (RAS_INCREASE_STEP) was also based on PTLRPC_MAX_BRW_SIZE, but this is too large when PTLRPC_MAX_BRW_SIZE is 32MB and the actual cl_max_pages_per_rpc is only 1MB. Instead, limit the readahead window growth to match the inode->i_blkbits (current default min(PTLRPC_MAX_BRW_SIZE * 2, 4MB)), which is still reasonable regardless of the blocksize. This also allows tuning the readahead on a per-inode basis in the future, depending on which OSTs the file is striped over, by adjusting the i_blkbits value.

http://review.whamcloud.com/5230
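
Roughly, the change can be pictured as deriving the readahead step from the inode rather than from the compile-time constant. A minimal sketch with stand-in parameters, not the real llite code:

/* Sketch only: compute the readahead window increment in pages from the
 * inode's block-size shift (i_blkbits) instead of from the compile-time
 * PTLRPC_MAX_BRW_SIZE, so readahead growth follows what the file's OSTs
 * actually use. */
unsigned long ras_step_pages(unsigned int i_blkbits, unsigned int page_shift)
{
        unsigned long pages = (1UL << i_blkbits) >> page_shift;

        return pages > 0 ? pages : 1;   /* never grow by less than a page */
}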

Comment by Andreas Dilger [ 05/Feb/13 ]

Deen, sorry about the problems with the 32MB RPC size. That would have allowed us much better flexibility for testing and updated hardware in the future. As stated in the patch, there are a number of issues that still need to be addressed:

Sorry, it seems that the 32MB RPC size is too large to handle with the current code. Sorry for the confusion. It makes sense to revert PTLRPC_MAX_BRW_SIZE to 4MB for this patch, and we can resolve the problems with larger RPC size in follow-on patches.

There are a number of problems found at 32MB that can be fixed independently:

osd_thread_info.osd_iobuf.dr_blocks[] is 512kB
osd_thread_info.osd_iobuf.dr_pages[] is 64kB
osd_thread_info.oti_created[] is 32kB, is unused, and can be removed
osd_thread_info uses OBD_ALLOC() instead of OBD_ALLOC_LARGE()
all OST RPC threads allocate the same large OST_MAXREQSIZE buffers, but this is only needed for the OST_IO_PORTAL
osd_thread_info.osd_iobuf is only needed for OST_IO_PORTAL and does not need to be allocated for other threads
with the larger RPC buffers, there should be fewer total buffers allocated, see comments in http://review.whamcloud.com/4940
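
For reference, the arithmetic behind the first two items above, assuming 4KB pages, 512-byte blocks (so 8 block numbers per page), and 8-byte pointers/block numbers; the assumptions are mine, only the resulting sizes match the list:

#include <stdio.h>

/* Back-of-the-envelope sizing of the per-thread iobuf arrays at a 32MB
 * RPC, under the assumptions stated above (illustration only). */
int main(void)
{
        unsigned long rpc_size   = 32UL << 20;            /* 32MB RPC */
        unsigned long page_size  = 4096;
        unsigned long pages      = rpc_size / page_size;  /* 8192 pages */
        unsigned long blk_per_pg = page_size / 512;       /* 8 blocks/page */

        printf("dr_pages[]:  %lu kB\n", pages * 8 / 1024);              /* 64 kB */
        printf("dr_blocks[]: %lu kB\n", pages * blk_per_pg * 8 / 1024); /* 512 kB */
        return 0;
}

This is also why the OBD_ALLOC_LARGE() item matters: high-order kmalloc allocations of this size are unreliable, while OBD_ALLOC_LARGE() can fall back to vmalloc.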

The current test result for this patch shows good improvement on FPP write, but a net loss for other IO loads.
Do you have any IO performance data that confirms or contradicts the below results?

IOR, single-shared file:

Date        RPC size  Clients  Write  Read
2013/02/03  4MB       105      7153   8200
2013/02/03  1MB       105      7996   9269

IOR, file-per-process:

Date        RPC size  Clients  Write  Read
2013/02/03  4MB       105      9283   6000
2013/02/03  1MB       106      7233   6115

If this is the case, we could still, for example, default to sending 4MB write RPCs when there is enough data in the cache and the client holds an exclusive DLM lock.
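
A rough illustration of that heuristic on the client write path; the function and threshold below are hypothetical, not code from any of the patches here:

/* Hypothetical sketch: only form a 4MB write RPC when enough dirty data
 * is already cached for the object and the client holds an exclusive (PW)
 * DLM extent lock; otherwise stay at the 1MB default. */
unsigned int choose_write_rpc_size(unsigned long dirty_bytes,
                                   int holds_exclusive_lock)
{
        const unsigned int small_rpc = 1U << 20;   /* 1MB default */
        const unsigned int large_rpc = 4U << 20;   /* 4MB */

        if (holds_exclusive_lock && dirty_bytes >= large_rpc)
                return large_rpc;

        return small_rpc;
}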

Comment by Andreas Dilger [ 14/May/13 ]

LUDOC-80 landed, closing bug.

Comment by Artem Blagodarenko (Inactive) [ 05/Jun/13 ]

Our testing system shows that the test replay-ost-single test_5 fails:

Lustre: DEBUG MARKER: == replay-ost-single test 5: Fail OST during iozone == 21:21:13 (1369851673)
Lustre: Failing over lustre-OST0000
LustreError: 11-0: an error occurred while communicating with 0@lo. The ost_write operation failed with -19
LustreError: Skipped 1 previous similar message
Lustre: lustre-OST0000-osc-ffff8800514d3400: Connection to lustre-OST0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0000: shutting down for failover; client state will be preserved.
Lustre: OST lustre-OST0000 has stopped.
Lustre: server umount lustre-OST0000 complete
LustreError: 137-5: UUID 'lustre-OST0000_UUID' is not available for connect (no target)
LustreError: Skipped 1 previous similar message
LDISKFS-fs (loop1): mounted filesystem with ordered data mode. Opts: 
LDISKFS-fs (loop1): mounted filesystem with ordered data mode. Opts: 
Lustre: 16962:0:(ldlm_lib.c:2195:target_recovery_init()) RECOVERY: service lustre-OST0000, 2 recoverable clients, last_transno 1322
Lustre: lustre-OST0000: Now serving lustre-OST0000 on /dev/loop1 with recovery enabled
Lustre: 2398:0:(ldlm_lib.c:1021:target_handle_connect()) lustre-OST0000: connection from lustre-MDT0000-mdtlov_UUID@0@lo recovering/t0 exp ffff88005ca19c00 cur 1369851700 last 1369851697
Lustre: 2398:0:(ldlm_lib.c:1021:target_handle_connect()) Skipped 3 previous similar messages
Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
Lustre: lustre-OST0000: Recovery over after 0:01, of 2 clients 2 recovered and 0 were evicted.
Lustre: lustre-OST0000-osc-MDT0000: Connection restored to lustre-OST0000 (at 0@lo)
Lustre: Skipped 1 previous similar message
LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 65536 (requested 32768)
LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 2097152 (requested 1048576)
Lustre: lustre-OST0000: received MDS connection from 0@lo
Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans
Lustre: DEBUG MARKER: iozone rc=1
Lustre: DEBUG MARKER: replay-ost-single test_5: @@@@@@ FAIL: iozone failed

These messages look related to the 4MB IO patch:

LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 65536 (requested 32768)
LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 2097152 (requested 1048576)
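
These errors come from the client-side check that sums the per-chunk write results returned by the OST and compares the total against what was requested; a simplified sketch of that kind of check (not the actual osc_request.c code) is:

#include <errno.h>

/* Simplified sketch of the byte-count check behind the messages above:
 * sum the per-chunk result codes from the write reply and complain if the
 * total differs from the bytes requested.  Seeing exactly double the
 * requested size suggests the reply describes more data than the request
 * carried.  Illustrative only. */
int check_write_bytes(const int *rc_per_chunk, int nchunks, long requested)
{
        long transferred = 0;
        int i;

        for (i = 0; i < nchunks; i++) {
                if (rc_per_chunk[i] < 0)
                        return rc_per_chunk[i];
                transferred += rc_per_chunk[i];
        }

        return transferred == requested ? 0 : -EPROTO;
}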

I believe this test also fails on Intel's master branch, but it is skipped as SLOW during testing:
https://maloo.whamcloud.com/test_sets/dd033a98-7264-11e2-aad1-52540035b04c

test_5	SKIP	0	0	skipping SLOW test 5

Could you please run this test (it is marked as SLOW) and check whether it fails?

Comment by Peter Jones [ 05/Jun/13 ]

Artem

Could you please open a new ticket for this failure so we can track it?

Thanks

Peter

Comment by Artem Blagodarenko (Inactive) [ 05/Jun/13 ]

LU-3438 is created.

Comment by Peter Jones [ 05/Jun/13 ]

Thanks Artem!

Comment by Artem Blagodarenko (Inactive) [ 14/Jun/13 ]

Xyratex-bug-id: MRP-687
Xyratex-bug-id: MRP-319
