
LU-1431: Support for larger than 1MB sequential I/O RPCs

Details

    • Type: New Feature
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.4.0
    • Lustre 2.4.0
    • 16,900
    • 4038

    Description

      Currently, the maximum buffer size for a Lustre I/O RPC is 1MB. This work changes the transfer size so that each RPC can carry enough data to reach peak performance for large I/O transfers to the back-end disk. An additional benefit is a reduction in the number of round trips needed to send the same amount of data.
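      As a rough illustration, here is a minimal sketch (illustrative values only, not the actual Lustre source; a 4kB page size and a 4MB maximum bulk RPC are assumed) of how the maximum RPC size translates into pages per RPC and round trips for a large sequential transfer:

      #include <stdio.h>

      #define PAGE_SIZE_BYTES  4096u           /* assumed 4kB pages */
      #define ONE_MB           (1u << 20)
      #define MAX_BRW_SIZE     (4u * ONE_MB)   /* illustrative larger RPC size (old limit: 1MB) */

      int main(void)
      {
              unsigned int transfer = 64u * ONE_MB;   /* a 64MB sequential write */

              /* Each bulk RPC carries MAX_BRW_SIZE / PAGE_SIZE_BYTES pages. */
              printf("pages per RPC: %u\n", MAX_BRW_SIZE / PAGE_SIZE_BYTES);

              /* Fewer, larger RPCs mean fewer round trips for the same data. */
              printf("RPCs at 4MB:   %u\n", transfer / MAX_BRW_SIZE);
              printf("RPCs at 1MB:   %u\n", transfer / ONE_MB);
              return 0;
      }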

      Attachments

        Issue Links

          Activity

            [LU-1431] Support for larger than 1MB sequential I/O RPCs

            Xyratex-bug-id: MRP-687
            Xyratex-bug-id: MRP-319

            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment (edited)
            pjones Peter Jones added a comment -

            Thanks Artem!


            LU-3438 is created.

            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment (edited)
            pjones Peter Jones added a comment -

            Artem

            Could you please open a new ticket for this failure so we can track it?

            Thanks

            Peter


            Our testing system shows that test replay-ost-single.test_5 failed:

            Lustre: DEBUG MARKER: == replay-ost-single test 5: Fail OST during iozone == 21:21:13 (1369851673)
            Lustre: Failing over lustre-OST0000
            LustreError: 11-0: an error occurred while communicating with 0@lo. The ost_write operation failed with -19
            LustreError: Skipped 1 previous similar message
            Lustre: lustre-OST0000-osc-ffff8800514d3400: Connection to lustre-OST0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
            Lustre: Skipped 1 previous similar message
            Lustre: lustre-OST0000: shutting down for failover; client state will be preserved.
            Lustre: OST lustre-OST0000 has stopped.
            Lustre: server umount lustre-OST0000 complete
            LustreError: 137-5: UUID 'lustre-OST0000_UUID' is not available for connect (no target)
            LustreError: Skipped 1 previous similar message
            LDISKFS-fs (loop1): mounted filesystem with ordered data mode. Opts: 
            LDISKFS-fs (loop1): mounted filesystem with ordered data mode. Opts: 
            Lustre: 16962:0:(ldlm_lib.c:2195:target_recovery_init()) RECOVERY: service lustre-OST0000, 2 recoverable clients, last_transno 1322
            Lustre: lustre-OST0000: Now serving lustre-OST0000 on /dev/loop1 with recovery enabled
            Lustre: 2398:0:(ldlm_lib.c:1021:target_handle_connect()) lustre-OST0000: connection from lustre-MDT0000-mdtlov_UUID@0@lo recovering/t0 exp ffff88005ca19c00 cur 1369851700 last 1369851697
            Lustre: 2398:0:(ldlm_lib.c:1021:target_handle_connect()) Skipped 3 previous similar messages
            Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
            Lustre: lustre-OST0000: Recovery over after 0:01, of 2 clients 2 recovered and 0 were evicted.
            Lustre: lustre-OST0000-osc-MDT0000: Connection restored to lustre-OST0000 (at 0@lo)
            Lustre: Skipped 1 previous similar message
            LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 65536 (requested 32768)
            LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 2097152 (requested 1048576)
            Lustre: lustre-OST0000: received MDS connection from 0@lo
            Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans
            Lustre: DEBUG MARKER: iozone rc=1
            Lustre: DEBUG MARKER: replay-ost-single test_5: @@@@@@ FAIL: iozone failed
            

            These messages look related to the 4MB I/O patch:

            LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 65536 (requested 32768)
            LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 2097152 (requested 1048576)
            

            I believe this test also fails in Intel's master branch, but it is skipped as SLOW during testing:
            https://maloo.whamcloud.com/test_sets/dd033a98-7264-11e2-aad1-52540035b04c

            test_5	SKIP	0	0	skipping SLOW test 5
            

            Could you please run this test (it is marked as SLOW) and check whether it fails?

            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment

            LUDOC-80 landed, closing bug.

            adilger Andreas Dilger added a comment

            Deen, sorry about the problems with the 32MB RPC size. That would have allowed us much better flexibility for testing and updated hardware in the future. As stated in the patch, there are a number of issues that still need to be addressed:

            It seems that the 32MB RPC size is too large for the current code to handle; sorry for the confusion. It makes sense to revert PTLRPC_MAX_BRW_SIZE to 4MB for this patch, and we can resolve the problems with larger RPC sizes in follow-on patches.

            There are a number of problems found at 32MB that can be fixed independently:

            osd_thread_info.osd_iobuf.dr_blocks[] is 512kB
            osd_thread_info.osd_iobuf.dr_pages[] is 64kB (see the size sketch after this list)
            osd_thread_info.oti_created[] is 32kB and is unused and can be removed
            oti_thread_info uses OBD_ALLOC() instead of OBD_ALLOC_LARGE()
            all OST RPC threads allocate the same large OST_MAXREQSIZE buffers, but this is only needed for the OST_IO_PORTAL
            osd_thread_info.osd_iobuf is only needed for OST_IO_PORTAL and does not need to be allocated for other threads
            with the larger RPC buffers, there should be fewer total buffers allocated, see comments in http://review.whamcloud.com/4940
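            For reference, a minimal sketch of where the dr_pages[] and dr_blocks[] figures above come from at a 32MB RPC size (assumptions: 4kB pages, arrays sized for a worst-case 512-byte block size, 8-byte entries; this is not the actual osd code):

            #include <stdio.h>

            #define ONE_MB           (1u << 20)
            #define RPC_SIZE         (32u * ONE_MB)  /* the 32MB maximum under discussion */
            #define PAGE_SIZE_BYTES  4096u           /* assumed 4kB pages */
            #define MIN_BLOCK_SIZE   512u            /* assumed worst-case block size */
            #define ENTRY_SIZE       8u              /* assumed 8-byte pointers/block numbers */

            int main(void)
            {
                    unsigned int pages  = RPC_SIZE / PAGE_SIZE_BYTES;
                    unsigned int blocks = pages * (PAGE_SIZE_BYTES / MIN_BLOCK_SIZE);

                    /* One page pointer per page in the RPC. */
                    printf("dr_pages[]:  %u entries, %u kB\n",
                           pages, pages * ENTRY_SIZE / 1024);

                    /* One block number per worst-case block in the RPC. */
                    printf("dr_blocks[]: %u entries, %u kB\n",
                           blocks, blocks * ENTRY_SIZE / 1024);
                    return 0;
            }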

            The current test result for this patch shows good improvement on FPP write, but a net loss for other IO loads.
            Do you have any I/O performance data that confirms or contradicts the results below?

            IOR, single-shared file:

            Date        RPC size  Clients  Write  Read
            2013/02/03  4MB       105      7153   8200
            2013/02/03  1MB       105      7996   9269

            IOR, file-per-process:

            Date        RPC size  Clients  Write  Read
            2013/02/03  4MB       105      9283   6000
            2013/02/03  1MB       106      7233   6115
            

            If this is the case, we could still e.g. default to sending 4MB write RPCs if there is enough data in cache and the client holds an exclusive DLM lock.
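            A minimal sketch of that heuristic (purely illustrative; the structure, field names, and 4MB threshold below are assumptions, not the actual OSC code):

            #include <stdbool.h>
            #include <stdio.h>

            #define ONE_MB       (1u << 20)
            #define LARGE_RPC    (4u * ONE_MB)   /* assumed large write RPC size */
            #define DEFAULT_RPC  (1u * ONE_MB)   /* assumed default write RPC size */

            /* Hypothetical per-object write state, used only for this sketch. */
            struct write_state {
                    unsigned int dirty_bytes;    /* dirty data cached for the object */
                    bool         exclusive_lock; /* client holds an exclusive (PW) DLM lock */
            };

            /* Only build a large write RPC when a full 4MB of dirty data is
             * cached and the client holds an exclusive lock, as suggested above. */
            static unsigned int choose_rpc_size(const struct write_state *ws)
            {
                    if (ws->exclusive_lock && ws->dirty_bytes >= LARGE_RPC)
                            return LARGE_RPC;
                    return DEFAULT_RPC;
            }

            int main(void)
            {
                    struct write_state streaming = { 16u * ONE_MB, true };
                    struct write_state small_io  = { 256u * 1024u, false };

                    printf("streaming write: %u MB RPC\n", choose_rpc_size(&streaming) / ONE_MB);
                    printf("small write:     %u MB RPC\n", choose_rpc_size(&small_io) / ONE_MB);
                    return 0;
            }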

            adilger Andreas Dilger added a comment

            I noticed that the llite readahead window increment (RAS_INCREASE_STEP) was also based on PTLRPC_MAX_BRW_SIZE, but this is too large for PTLRPC_MAX_BRW_SIZE of 32MB when the actual cl_max_pages_per_rpc is only 1MB. Instead, limit the readahead window growth to match the inode->i_blkbits (current default min(PTLRPC_MAX_BRW_SIZE * 2, 4MB)), which is still reasonable regardless of the blocksize. This also allows tuning the readahead on a per-inode basis in the future, depending on which OSTs the file is striped over, by fixing the i_blkbits value.

            http://review.whamcloud.com/5230
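            A minimal sketch of the idea (illustrative only; the actual change is in the review above, and the helper name, cap, and values here are assumptions): derive the readahead increment from the inode's i_blkbits rather than from the compile-time PTLRPC_MAX_BRW_SIZE, so a 32MB maximum does not inflate the step when the effective RPC size is much smaller.

            #include <stdio.h>

            #define ONE_MB               (1ul << 20)
            #define PTLRPC_MAX_BRW_SIZE  (32ul * ONE_MB)  /* compile-time maximum under discussion */

            /* Hypothetical helper: readahead increment (bytes) for a given i_blkbits,
             * clamped to an assumed 4MB cap instead of tracking PTLRPC_MAX_BRW_SIZE. */
            static unsigned long ras_step_bytes(unsigned int i_blkbits)
            {
                    unsigned long step = 1ul << i_blkbits;
                    unsigned long cap  = 4ul * ONE_MB;

                    return step < cap ? step : cap;
            }

            int main(void)
            {
                    /* The old step tracked PTLRPC_MAX_BRW_SIZE; the i_blkbits-based
                     * step stays reasonable even when the compile-time maximum is 32MB. */
                    printf("old step:              %lu MB\n", PTLRPC_MAX_BRW_SIZE / ONE_MB);
                    printf("step for i_blkbits=22: %lu MB\n", ras_step_bytes(22) / ONE_MB);
                    printf("step for i_blkbits=12: %lu kB\n", ras_step_bytes(12) / 1024);
                    return 0;
            }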

            adilger Andreas Dilger added a comment

            Sergii, please see if you can reproduce the failures; it's our responsibility to fix any problems here.

            nrutman Nathan Rutman added a comment

            I don't think that these patches fail at completely random places. The latest test runs for both patches fail in common places: replay-single and sanity-quota.

            As for the testing, the thing is that both patches differ from our original ones due to requests from Andreas, so I don't think that our testing results for the original patches are relevant.

            deen Sergii Glushchenko (Inactive) added a comment

            People

              Assignee: pjones Peter Jones
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 19

              Dates

                Created:
                Updated:
                Resolved: