
[LU-6370] Read performance degrades with increasing read block size.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.8.0
    • Affects Versions: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.8.0
    • Environment: Clients running Cray 2.5, which contains a backport of the CLIO changes for 2.6. Problem also observed with vanilla 2.6 and 2.7 clients.
    • Severity: 3

    Description

      We're finding substantial read performance degradation with increasing read block sizes. This has been observed on eslogin nodes as well as on internal login nodes. The data provided below was gathered on an external node.

      ext7:/lustre # dd if=/dev/urandom of=3gurandomdata bs=4M count=$((256*3))
      ext7:/lustre # for i in 4K 1M 4M 16M 32M 64M 128M 256M 512M 1G 2G 3G ; do echo -en "$i\t" ; dd if=3gurandomdata bs=${i} of=/dev/null 2>&1 | egrep copied ; done

      4K 3221225472 bytes (3.2 GB) copied, 13.9569 s, 231 MB/s
      1M 3221225472 bytes (3.2 GB) copied, 4.94163 s, 652 MB/s
      4M 3221225472 bytes (3.2 GB) copied, 6.24378 s, 516 MB/s
      16M 3221225472 bytes (3.2 GB) copied, 5.24595 s, 614 MB/s
      32M 3221225472 bytes (3.2 GB) copied, 5.48208 s, 588 MB/s
      64M 3221225472 bytes (3.2 GB) copied, 5.36964 s, 600 MB/s
      128M 3221225472 bytes (3.2 GB) copied, 5.12867 s, 628 MB/s
      256M 3221225472 bytes (3.2 GB) copied, 5.1467 s, 626 MB/s
      512M 3221225472 bytes (3.2 GB) copied, 5.31232 s, 606 MB/s
      1G 3221225472 bytes (3.2 GB) copied, 12.4088 s, 260 MB/s
      2G 3221225472 bytes (3.2 GB) copied, 339.646 s, 9.5 MB/s
      3G 3221225472 bytes (3.2 GB) copied, 350.071 s, 9.2 MB/s

      This shows up on the 1008-OST striped file system, but on smaller systems the impact is not nearly so substantial. On our 56-OST system we get:
      3G 3221225472 bytes (3.2 GB) copied, 4.77246 s, 675 MB/s

      Another test case, using C code rather than dd, provided similar results based on an fread call:

      int read_size = 256*1024*1024*2;
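      /* 2^29 float elements * 4 bytes each: a single ~2 GiB fread() request */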
      fread(buffer, sizeof(float), read_size, fp_in);

      Also, file striping information on the production and TDS filesystems:
      ext8:/lustre # lfs getstripe 3gurandomdata
      3gurandomdata
      lmm_stripe_count: 4
      lmm_stripe_size: 1048576
      lmm_pattern: 1
      lmm_layout_gen: 0
      lmm_stripe_offset: 833
      obdidx objid objid group
      833 5978755 0x5b3a83 0
      834 5953949 0x5ad99d 0
      835 5958818 0x5aeca2 0
      836 5966400 0x5b0a40 0

      ext8:/lustretds # lfs getstripe 3gurandomdata
      3gurandomdata
      lmm_stripe_count: 4
      lmm_stripe_size: 1048576
      lmm_pattern: 1
      lmm_layout_gen: 0
      lmm_stripe_offset: 51
      obdidx objid objid group
      51 1451231 0x1624df 0
      52 1452258 0x1628e2 0
      53 1450278 0x162126 0
      54 1444772 0x160ba4 0

      So this appears to only be happening on wide-stripe file systems. Here's the output from 'perf top' while a 'bad' dd is running:

      8.74%  [kernel]  [k] _spin_lock
        - _spin_lock
           - 22.23% osc_ap_completion
                     osc_extent_finish
                     brw_interpret
                     ptlrpc_check_set
                     ptlrpcd_check
                     ptlrpcd
                     kthread
                     child_rip
           + 13.76% cl_env_put
           + 12.37% cl_env_get
           + 7.10% vvp_write_complete
           + 6.51% kfree
           + 4.62% osc_teardown_async_page
           + 3.96% osc_page_delete
           + 3.89% osc_lru_add_batch
           + 2.69% kmem_cache_free
           + 2.23% osc_page_init
           + 1.71% sptlrpc_import_sec_ref
           + 1.64% osc_page_transfer_add
           + 1.57% osc_io_submit
           + 1.43% cfs_percpt_lock

      Attachments

        1. LU6370_1GB_BS.lctldk.out.gz
          0.2 kB
        2. LU6370_2GB_BS.lctldk.out.gz
          0.2 kB
        3. LU6370_cpu_log_20150402.out.gz
          20 kB
        4. LU6370_max_cached_mb_20150402.out.gz
          521 kB
        5. lu-6370_perf_data.tgz
          5 kB
        6. lu-6370-perf.tgz
          4.03 MB


          Activity


            jay Jinshan Xiong (Inactive) added a comment -

            In that case, I just need the log after it has been running for a few minutes.

            lewisj John Lewis (Inactive) added a comment -

            Jinshan, what tracing options would you like to see, and for how many seconds? With "-1", I expect a bs=2G count=1 to take approximately 1 hour and to overflow any rational buffer many times over.
            yujian Jian Yu added a comment -

            "Can you read just 1 block and collect Lustre log with block size as 1G and 2G specifically?"

            Hi John,
            With a stripe count of 4 on a 1008-OST filesystem, read performance degraded substantially when the read block size was increased from 1G to 2G. You’ve reproduced the problem with only 1 block, and collected the performance data with 1G and 2G block sizes separately. Could you please reproduce the issue again and gather Lustre debug logs as per Jinshan’s suggestion above, to help him investigate further? Thank you!
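            For reference, a minimal sketch of how such a log might be captured on the client (this procedure is an assumption for illustration, not a transcript from this ticket; the debug mask, buffer size, and output file names are only examples):

            lctl set_param debug=-1          # enable all debug flags (very verbose)
            lctl set_param debug_mb=1024     # enlarge the kernel debug buffer so it wraps less
            lctl clear                       # discard anything already buffered
            dd if=3gurandomdata of=/dev/null bs=1G count=1
            lctl dk > LU6370_1GB_BS.lctldk.out
            lctl clear
            dd if=3gurandomdata of=/dev/null bs=2G count=1
            lctl dk > LU6370_2GB_BS.lctldk.out
            lctl set_param debug=0           # reset; restore the site's usual debug mask afterwards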


            jay Jinshan Xiong (Inactive) added a comment -

            Can you read just 1 block and collect Lustre log with block size as 1G and 2G specifically?
            lewisj John Lewis (Inactive) added a comment - - edited

            We've found some interesting results while running permutations on file parameters. While the problem is very repeatable in a given configuration, it's not at all stable as we change parameters. Specifically, changing the stripe count from 4 to {1,16,56} alleviates the huge performance degradation.

            The following tables list the time (in seconds, rounded to the nearest second) to copy 3 GB of data from Lustre to /dev/null using dd. Column headers indicate the block size used; row headers indicate the stripe count of the file. Tests were performed on two separate filesystems: a dedicated 56-OST TDS filesystem and a shared (not quiet) 1008-OST filesystem. Timings across multiple runs vary, but are stable to within approximately 10%.

            1008 OST fs
                            bs=512M bs=1G   bs=2G
            Count 1         15      16      23
            Count 4         5       12      384
            Count 16        4       2       2
            Count 56        4       2       2
            
            56 OST fs
                            bs=512M bs=1G   bs=2G
            Count 1         2       2       3
            Count 4         2       2       2
            Count 16        2       2       2
            Count 56        2       2       4
            

            Results above were with stripe widths of 1M. 4M widths were also tested, but are not listed as they tracked the 1M results. This testing was performed on a whitebox IB Lustre client using a Cray 2.5 client. Tests with other client builds are planned.
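            For illustration, the matrix above could be produced with a loop along these lines (a sketch under assumptions, not the reporter's actual script; the file names are arbitrary and the lfs setstripe option spelling may vary by release):

            # Pre-create a 3 GB test file at each stripe count (1M stripe size),
            # then time dd reads at each block size.
            for count in 1 4 16 56 ; do
                lfs setstripe -c ${count} -S 1M stripetest.c${count}
                dd if=/dev/urandom of=stripetest.c${count} bs=4M count=$((256*3))
            done
            for count in 1 4 16 56 ; do
                for bs in 512M 1G 2G ; do
                    echo -en "count=${count} bs=${bs}\t"
                    dd if=stripetest.c${count} of=/dev/null bs=${bs} 2>&1 | egrep copied
                done
            done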


            lewisj John Lewis (Inactive) added a comment -

            Jinshan, please see attached reports. count=1 was used for both of these files.

            jay Jinshan Xiong (Inactive) added a comment -

            I can't parse the results on my node:

            [root@mercury ~]# perf report -v -i perf.1G 
            legacy perf.data format
            magic/endian check failed
            incompatible file format (rerun with -v to learn more)[root@mercury ~]# 
            

            Can you do it on your side and post the result here?

            Obviously there are many more perf samples collected for the 2G block size. Please try to read only 1 block with block sizes of 1G and 2G specifically and see what we can get. If we can reproduce the problem with only 1 block, then please turn on debug, run it again, and post the Lustre log here. Thanks.


            lewisj John Lewis (Inactive) added a comment -

            perf results from bs={1G,2G} testing:

            ext8:/lustre # (perf record dd if=3gurandomdata of=/dev/null bs=1G ; rm perf.data) >/dev/null 2>&1
            ext8:/lustre # perf record dd if=3gurandomdata of=/dev/null bs=1G ; mv perf.data perf.1G
            3+0 records in
            3+0 records out
            3221225472 bytes (3.2 GB) copied, 10.8859 s, 296 MB/s
            [ perf record: Woken up 1 times to write data ]
            [ perf record: Captured and wrote 0.423 MB perf.data (~18500 samples) ]
            titan-ext8:/lustre/atlas/scratch/lewisj/ven004/read_test_dd # ls -lrt
            total 3145745
            -rw-r--r-- 1 root root 3221225472 Mar 16 14:21 3gurandomdata
            -rw------- 1 root root     445452 Mar 17 13:05 perf.1G
            ext8:/lustre # perf record dd if=3gurandomdata of=/dev/null bs=2G ; mv perf.data perf.2G
            0+2 records in
            0+2 records out
            3221225472 bytes (3.2 GB) copied, 416.124 s, 7.7 MB/s
            [ perf record: Woken up 61 times to write data ]
            [ perf record: Captured and wrote 15.869 MB perf.data (~693322 samples) ]


            paf Patrick Farrell (Inactive) added a comment -

            Jinshan - They're referring to LU-3321. I wouldn't call it CLIO work either.

            jay Jinshan Xiong (Inactive) added a comment -

            Since the problem can be seen on vanilla 2.6, it's not related to the CLIO simplification work. Maybe some changes between 2.5 and 2.6 caused this problem.

            From the test results, there is a huge performance degradation when the read size increases from 1G to 2G. Will you run those two tests again and use perf to collect performance data?

            paf Patrick Farrell (Inactive) added a comment -

            This definitely seems to be file system size dependent. I've tested various Intel client versions and an almost identical Cray client in a VM environment, and while I (sometimes) saw a slight drop-off at higher read sizes, it's on the order of 5-20%; nothing like that reported by ORNL.

            I've found similar results on our in house hardware. Not surprisingly, we don't have anything in house that approaches the size of the file systems connected to Titan.

            I'd be very curious to see actual test results from a 2.x client version that does not have this issue, so we can confirm this as a regression from a particular point. I've been told that's the case, but I have yet to see data.


            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 15

              Dates

                Created:
                Updated:
                Resolved: