Lustre / LU-1431

Support for larger than 1MB sequential I/O RPCs

Details

    • Type: New Feature
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: Lustre 2.4.0
    • Fix Version/s: Lustre 2.4.0
    • 16,900
    • 4038

    Description

      Currently the maximum Lustre buffer size for an I/O RPC is 1MB. This work changes the amount of data transferred per RPC so that the data sent can be sized to achieve peak performance for large I/O transfers to the back-end disk. An additional benefit is a reduction in the number of round trips needed to send the data.
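
      As a sketch for context (the tunable names and the 4MB figure are assumptions based on the standard Lustre client tunables, not something stated in this ticket), the per-RPC transfer size on the client is governed by the number of pages per bulk RPC, so 4MB RPCs correspond to 1024 pages with 4KB pages:

      # check the current per-RPC limit on the client, in pages (256 pages = 1MB with 4KB pages)
      lctl get_param osc.*.max_pages_per_rpc
      # raise it to 1024 pages (4MB) once both client and server support larger RPCs
      lctl set_param osc.*.max_pages_per_rpc=1024
      # the pages-per-RPC histogram shows whether large RPCs are actually being generated
      lctl get_param osc.*.rpc_stats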

      Attachments

        Issue Links

          Activity

            [LU-1431] Support for larger than 1MB sequential I/O RPCs

            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment - edited

            Xyratex-bug-id: MRP-687
            Xyratex-bug-id: MRP-319

            pjones Peter Jones added a comment -

            Thanks Artem!


            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment - edited

            LU-3438 is created.

            pjones Peter Jones added a comment -

            Artem

            Could you please open a new ticket for this failure so we can track it?

            Thanks

            Peter


            Our testing system shows that test replay-ost-single.test_5 fails:

            Lustre: DEBUG MARKER: == replay-ost-single test 5: Fail OST during iozone == 21:21:13 (1369851673)
            Lustre: Failing over lustre-OST0000
            LustreError: 11-0: an error occurred while communicating with 0@lo. The ost_write operation failed with -19
            LustreError: Skipped 1 previous similar message
            Lustre: lustre-OST0000-osc-ffff8800514d3400: Connection to lustre-OST0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
            Lustre: Skipped 1 previous similar message
            Lustre: lustre-OST0000: shutting down for failover; client state will be preserved.
            Lustre: OST lustre-OST0000 has stopped.
            Lustre: server umount lustre-OST0000 complete
            LustreError: 137-5: UUID 'lustre-OST0000_UUID' is not available for connect (no target)
            LustreError: Skipped 1 previous similar message
            LDISKFS-fs (loop1): mounted filesystem with ordered data mode. Opts: 
            LDISKFS-fs (loop1): mounted filesystem with ordered data mode. Opts: 
            Lustre: 16962:0:(ldlm_lib.c:2195:target_recovery_init()) RECOVERY: service lustre-OST0000, 2 recoverable clients, last_transno 1322
            Lustre: lustre-OST0000: Now serving lustre-OST0000 on /dev/loop1 with recovery enabled
            Lustre: 2398:0:(ldlm_lib.c:1021:target_handle_connect()) lustre-OST0000: connection from lustre-MDT0000-mdtlov_UUID@0@lo recovering/t0 exp ffff88005ca19c00 cur 1369851700 last 1369851697
            Lustre: 2398:0:(ldlm_lib.c:1021:target_handle_connect()) Skipped 3 previous similar messages
            Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
            Lustre: lustre-OST0000: Recovery over after 0:01, of 2 clients 2 recovered and 0 were evicted.
            Lustre: lustre-OST0000-osc-MDT0000: Connection restored to lustre-OST0000 (at 0@lo)
            Lustre: Skipped 1 previous similar message
            LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 65536 (requested 32768)
            LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 2097152 (requested 1048576)
            Lustre: lustre-OST0000: received MDS connection from 0@lo
            Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans
            Lustre: DEBUG MARKER: iozone rc=1
            Lustre: DEBUG MARKER: replay-ost-single test_5: @@@@@@ FAIL: iozone failed
            

            These messages look related to the 4MB I/O patch:

            LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 65536 (requested 32768)
            LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 2097152 (requested 1048576)
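
            If the doubled byte counts come from the client and server disagreeing about the bulk transfer size, the limits on both sides can be compared with lctl (the obdfilter brw_size name is assumed from the 4MB I/O work and may not exist on older servers):

            # client-side limit, in pages (1024 pages = 4MB with 4KB pages)
            lctl get_param osc.*.max_pages_per_rpc
            # server-side bulk I/O size advertised by the OST, in MB (assumed tunable from the 4MB I/O work)
            lctl get_param obdfilter.*.brw_size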
            

            I believe this test also fails in Intel's master branch, but it is skipped as SLOW during testing:
            https://maloo.whamcloud.com/test_sets/dd033a98-7264-11e2-aad1-52540035b04c

            test_5	SKIP	0	0	skipping SLOW test 5
            

            Could you please run this test (it is marked as SLOW) and check whether it fails?
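
            For reference, a sketch of how to run it explicitly with the test-framework environment variables (assuming a configured test setup under lustre/tests):

            # run only replay-ost-single test_5, including tests normally skipped as SLOW
            cd lustre/tests
            SLOW=yes ONLY=5 sh replay-ost-single.sh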


            adilger Andreas Dilger added a comment -

            LUDOC-80 landed, closing bug.

            People

              pjones Peter Jones
              simmonsja James A Simmons
              Votes: 0
              Watchers: 19

              Dates

                Created:
                Updated:
                Resolved: