LU-6370: Read performance degrades with increasing read block size.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.8.0
    • Environment: Clients running Cray's 2.5 client, which contains a backport of the CLIO changes from 2.6. The problem has also been observed with vanilla 2.6 and 2.7 clients.
    • Severity: 3

    Description

      We're finding substantial read performance degradation with increasing read block sizes. This has been observed on eslogin nodes as well as on internal login nodes. The data provided below was gathered on an external node.

      ext7:/lustre # dd if=/dev/urandom of=3gurandomdata bs=4M count=$((256*3))
      ext7:/lustre # for i in 4K 1M 4M 16M 32M 64M 128M 256M 512M 1G 2G 3G ; do echo -en "$i\t" ; dd if=3gurandomdata bs=${i} of=/dev/null 2>&1 | egrep copied ; done

      4K 3221225472 bytes (3.2 GB) copied, 13.9569 s, 231 MB/s
      1M 3221225472 bytes (3.2 GB) copied, 4.94163 s, 652 MB/s
      4M 3221225472 bytes (3.2 GB) copied, 6.24378 s, 516 MB/s
      16M 3221225472 bytes (3.2 GB) copied, 5.24595 s, 614 MB/s
      32M 3221225472 bytes (3.2 GB) copied, 5.48208 s, 588 MB/s
      64M 3221225472 bytes (3.2 GB) copied, 5.36964 s, 600 MB/s
      128M 3221225472 bytes (3.2 GB) copied, 5.12867 s, 628 MB/s
      256M 3221225472 bytes (3.2 GB) copied, 5.1467 s, 626 MB/s
      512M 3221225472 bytes (3.2 GB) copied, 5.31232 s, 606 MB/s
      1G 3221225472 bytes (3.2 GB) copied, 12.4088 s, 260 MB/s
      2G 3221225472 bytes (3.2 GB) copied, 339.646 s, 9.5 MB/s
      3G 3221225472 bytes (3.2 GB) copied, 350.071 s, 9.2 MB/s

      This shows up on the 1008-OST file system, but on smaller systems the impact is not nearly as substantial. On our 56-OST system we get:
      3G 3221225472 bytes (3.2 GB) copied, 4.77246 s, 675 MB/s

      Another test case, using C code with a single fread() call rather than dd, produced similar results:

      int read_size = 256*1024*1024*2;
      fread(buffer, sizeof(float), read_size, fp_in);
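
      For reference, a minimal, self-contained sketch of that single-fread() test might look like the following. The buffer allocation, error handling, and reuse of the 3gurandomdata file name from the dd test above are assumptions for illustration, not the exact reproducer that was run.

      #include <stdio.h>
      #include <stdlib.h>

      int main(void)
      {
              /* 512 Mi floats * 4 bytes each = one 2 GiB fread(), matching bs=2G */
              size_t read_size = 256UL * 1024 * 1024 * 2;
              float *buffer = malloc(read_size * sizeof(float));
              FILE *fp_in = fopen("3gurandomdata", "r");  /* assumed file name */

              if (buffer == NULL || fp_in == NULL) {
                      perror("setup");
                      return 1;
              }

              size_t nread = fread(buffer, sizeof(float), read_size, fp_in);
              printf("read %zu of %zu elements\n", nread, read_size);

              fclose(fp_in);
              free(buffer);
              return 0;
      }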

      Also, file striping information on the production and TDS filesystems:
      ext8:/lustre # lfs getstripe 3gurandomdata
      3gurandomdata
      lmm_stripe_count: 4
      lmm_stripe_size: 1048576
      lmm_pattern: 1
      lmm_layout_gen: 0
      lmm_stripe_offset: 833
      obdidx objid objid group
      833 5978755 0x5b3a83 0
      834 5953949 0x5ad99d 0
      835 5958818 0x5aeca2 0
      836 5966400 0x5b0a40 0

      ext8:/lustretds # lfs getstripe 3gurandomdata
      3gurandomdata
      lmm_stripe_count: 4
      lmm_stripe_size: 1048576
      lmm_pattern: 1
      lmm_layout_gen: 0
      lmm_stripe_offset: 51
      obdidx objid objid group
      51 1451231 0x1624df 0
      52 1452258 0x1628e2 0
      53 1450278 0x162126 0
      54 1444772 0x160ba4 0

      So this appears to be happening only on wide-stripe file systems. Here's the output from 'perf top' while a 'bad' dd is running:

      • 8.74% [kernel] [k] _spin_lock - _spin_lock
      • 22.23% osc_ap_completion
        osc_extent_finish
        brw_interpret
        ptlrpc_check_set
        ptlrpcd_check
        ptlrpcd
        kthread
        child_rip
        + 13.76% cl_env_put
        + 12.37% cl_env_get
        + 7.10% vvp_write_complete
        + 6.51% kfree
        + 4.62% osc_teardown_async_page
        + 3.96% osc_page_delete
        + 3.89% osc_lru_add_batch
        + 2.69% kmem_cache_free
        + 2.23% osc_page_init
        + 1.71% sptlrpc_import_sec_ref
        + 1.64% osc_page_transfer_add
        + 1.57% osc_io_submit
        + 1.43% cfs_percpt_lock

      Attachments

        1. LU6370_1GB_BS.lctldk.out.gz
          0.2 kB
        2. LU6370_2GB_BS.lctldk.out.gz
          0.2 kB
        3. LU6370_cpu_log_20150402.out.gz
          20 kB
        4. LU6370_max_cached_mb_20150402.out.gz
          521 kB
        5. lu-6370_perf_data.tgz
          5 kB
        6. lu-6370-perf.tgz
          4.03 MB

        Issue Links

          Activity

            Hi Dustin,

            Thanks for the debug information. From the 2G output, I can see some strange things were happening.

            The reading process is 11235.

            At time 1427805640.741591, this process had just finished a read and was then preempted and rescheduled to processor 15.

            From time 1427805640.741714 to 1427805644.468771, kiblnd_cq_completion(), which runs in softirq context, occupied the CPU for over 3 seconds.

            00000080:00000001:13.0:1427805640.741591:0:11235:0:(rw.c:1140:ll_readpage()) Process leaving (rc=0 : 0 : 0)
            00000800:00000200:15.2:1427805640.741714:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (130)++
            ....
            00000800:00000200:15.2:1427805643.543220:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (130)++
            00000800:00000200:15.2:1427805643.543271:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (131)++
            00000800:00000200:15.2:1427805643.543458:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (131)++
            00000800:00000200:15.2:1427805643.543499:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (131)++
            00000800:00000200:15.2:1427805643.543588:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (130)++
            00000800:00000200:15.2:1427805644.468771:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6ea00] (131)++
            00000020:00000001:15.0:1427805644.468787:0:11235:0:(cl_io.c:930:cl_page_list_del()) Process leaving
            

            This happened quite often: from time 1427805640 to 1427805644, and then from 1427805644 to 1427805647. Roughly speaking, the pattern was to run for 1 second and then stall for 3 seconds. I didn't see this happening for the 1G read, but I don't know why.

            Have you ever run this kind of 2G-block-size read job on a pre-2.6 client before?

            I'm still reading the log and trying to dig up more. Stay tuned.

            jay Jinshan Xiong (Inactive) added a comment

            I have gathered the requested debug data for LU6370.

            I used "-1" for LNET debugging, and ran the dd test with 1G and 2G blocksizes on a lustre-2.7 client for 180 seconds.

            Each output file is roughly 30MB, compressed.

            Thanks,
            Dustin

            dustb100 Dustin Leverman added a comment
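
            For reference, the capture procedure Dustin describes above could look roughly like the following sketch. The debug buffer size is an assumption; the 180-second window and the output file name (matching the LU6370_2GB_BS.lctldk.out.gz attachment) are taken from the comment and attachment list above, and these are not necessarily the exact commands that were run.

            # enable full Lustre/LNET debugging and clear the ring buffer
            lctl set_param debug=-1
            lctl set_param debug_mb=1024    # buffer size here is an assumption
            lctl clear

            # start the 2G-block-size read in the background
            dd if=3gurandomdata bs=2G of=/dev/null &

            # let it run for 180 seconds, then dump and compress the debug log
            sleep 180
            lctl dk > LU6370_2GB_BS.lctldk.out
            gzip LU6370_2GB_BS.lctldk.out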
            yujian Jian Yu added a comment -

            Thank you John and Dustin.


            My hosts won't be quiet and available until tomorrow; I'm coordinating with Dustin to hopefully get these uploaded today.

            lewisj John Lewis (Inactive) added a comment

            In that case, I just need the log after it has been running for a few minutes.

            jay Jinshan Xiong (Inactive) added a comment

            Jinshan, what tracing options would you like to see, and for how many seconds? With "-1", I expect a bs=2G count=1 to take approximately 1 hour and to overflow any rational buffer many times over.

            lewisj John Lewis (Inactive) added a comment
            yujian Jian Yu added a comment -

            Can you read just 1 block and collect Lustre log with block size as 1G and 2G specifically?

            Hi John,
            With a stripe count of 4 on a 1008-OST filesystem, read performance degraded substantially when the read block size was increased from 1G to 2G. You've reproduced the problem with only 1 block and collected the performance data with 1G and 2G block sizes separately. Could you please reproduce the issue again and gather Lustre debug logs per Jinshan's suggestion above, to help him investigate further? Thank you!


            Can you read just 1 block and collect Lustre log with block size as 1G and 2G specifically?

            jay Jinshan Xiong (Inactive) added a comment
            lewisj John Lewis (Inactive) added a comment (edited)

            We've found some interesting results while running permutations on file parameters. While the problem is very repeatable in a given configuration, it's not at all stable as we change parameters. Specifically, changing the stripe count from 4 to {1,16,56} alleviates the huge performance degradation.

            The following tables list the time (in seconds, rounded to the nearest second) to copy 3 GB of data from Lustre to /dev/null using dd. Column headers indicate the block size used; row headers indicate the stripe count of the file. Tests were performed on two separate filesystems: a dedicated 56-OST TDS filesystem and a shared (not quiet) 1008-OST filesystem. Timings across multiple runs vary, but are stable to within approximately 10%.

            1008 OST fs
                            bs=512M bs=1G   bs=2G
            Count 1         15      16      23
            Count 4         5       12      384
            Count 16        4       2       2
            Count 56        4       2       2
            
            56 OST fs
                            bs=512M bs=1G   bs=2G
            Count 1         2       2       3
            Count 4         2       2       2
            Count 16        2       2       2
            Count 56        2       2       4
            

            Results above were with stripe sizes of 1M. 4M stripe sizes were also tested but are not listed, as they tracked the 1M results. This testing was performed on a whitebox IB Lustre client running the Cray 2.5 client build. Tests with other client builds are planned.
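
            For reference, a sweep over stripe counts like the one described above could be scripted roughly as follows; the directory and file names are placeholders, and this is a sketch of the methodology rather than the exact commands that were used.

            # create test files with the desired stripe counts (1M stripe size), then fill them
            for count in 1 4 16 56 ; do
                lfs setstripe -c ${count} -S 1M /lustre/stripetest.${count}
                dd if=/dev/urandom of=/lustre/stripetest.${count} bs=4M count=$((256*3))
            done

            # time a full read of each file at each block size
            for count in 1 4 16 56 ; do
                for bs in 512M 1G 2G ; do
                    echo -en "count=${count} bs=${bs}\t"
                    dd if=/lustre/stripetest.${count} bs=${bs} of=/dev/null 2>&1 | egrep copied
                done
            done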


            Jinshan, please see attached reports. count=1 was used for both of these files.

            lewisj John Lewis (Inactive) added a comment

            I can't parse the results on my node:

            [root@mercury ~]# perf report -v -i perf.1G 
            legacy perf.data format
            magic/endian check failed
            incompatible file format (rerun with -v to learn more)[root@mercury ~]# 
            

            Can you do it on your side and post the result here?

            Obviously there are many more perf samples collected for the 2G block size. Please try to read only 1 block with block sizes of 1G and 2G specifically and see what we can get. If we can reproduce the problem with only 1 block, then please turn on debug, run it again, and post the Lustre log here. Thanks,

            jay Jinshan Xiong (Inactive) added a comment
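
            For reference, one way around the perf.data format mismatch, as suggested above, is to render the report to plain text on the node that recorded the samples and attach that instead; the output file name here is illustrative.

            # run on the collecting node so the perf build matches the data format
            perf report --stdio -i perf.1G > perf.1G.report.txt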

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 15

              Dates

                Created:
                Updated:
                Resolved: