[LU-6370] Read performance degrades with increasing read block size.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.8.0
    • Environment: Clients running Cray 2.5, which contains a backport of the CLIO changes from 2.6. The problem has also been observed with vanilla 2.6 and 2.7 clients.
    • Severity: 3

    Description

      We're seeing substantial read performance degradation with increasing read block size. This has been observed on eslogin nodes as well as on internal login nodes. The data provided below was gathered on an external node.

      ext7:/lustre # dd if=/dev/urandom of=3gurandomdata bs=4M count=$((256*3))
      ext7:/lustre # for i in 4K 1M 4M 16M 32M 64M 128M 256M 512M 1G 2G 3G ; do echo -en "$i\t" ; dd if=3gurandomdata bs=${i} of=/dev/null 2>&1 | egrep copied ; done

      4K 3221225472 bytes (3.2 GB) copied, 13.9569 s, 231 MB/s
      1M 3221225472 bytes (3.2 GB) copied, 4.94163 s, 652 MB/s
      4M 3221225472 bytes (3.2 GB) copied, 6.24378 s, 516 MB/s
      16M 3221225472 bytes (3.2 GB) copied, 5.24595 s, 614 MB/s
      32M 3221225472 bytes (3.2 GB) copied, 5.48208 s, 588 MB/s
      64M 3221225472 bytes (3.2 GB) copied, 5.36964 s, 600 MB/s
      128M 3221225472 bytes (3.2 GB) copied, 5.12867 s, 628 MB/s
      256M 3221225472 bytes (3.2 GB) copied, 5.1467 s, 626 MB/s
      512M 3221225472 bytes (3.2 GB) copied, 5.31232 s, 606 MB/s
      1G 3221225472 bytes (3.2 GB) copied, 12.4088 s, 260 MB/s
      2G 3221225472 bytes (3.2 GB) copied, 339.646 s, 9.5 MB/s
      3G 3221225472 bytes (3.2 GB) copied, 350.071 s, 9.2 MB/s

      This shows up on a file system with 1008 OSTs, but on smaller systems the impact is not nearly so substantial. On our 56-OST system we get:
      3G 3221225472 bytes (3.2 GB) copied, 4.77246 s, 675 MB/s
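
      One way to confirm the OST count a client sees for a given mount is sketched below; the /lustre mount point is an assumption, and lov.*.numobd may be spelled slightly differently on some versions:

      # count OST lines reported by lfs df for the /lustre mount
      lfs df /lustre | grep -c OST
      # or ask the LOV layer directly for its target count
      lctl get_param -n lov.*.numobd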

      Another test case, written in C rather than using dd, produced similar results; it reads the data with a single large fread() call:

      /* read 512M floats (2 GiB) from the test file in a single fread() call */
      int read_size = 256*1024*1024*2;
      fread(buffer, sizeof(float), read_size, fp_in);

      Also, here is the file striping information on the production and TDS file systems:
      ext8:/lustre # lfs getstripe 3gurandomdata
      3gurandomdata
      lmm_stripe_count: 4
      lmm_stripe_size: 1048576
      lmm_pattern: 1
      lmm_layout_gen: 0
      lmm_stripe_offset: 833
      obdidx objid objid group
      833 5978755 0x5b3a83 0
      834 5953949 0x5ad99d 0
      835 5958818 0x5aeca2 0
      836 5966400 0x5b0a40 0

      ext8:/lustretds # lfs getstripe 3gurandomdata
      3gurandomdata
      lmm_stripe_count: 4
      lmm_stripe_size: 1048576
      lmm_pattern: 1
      lmm_layout_gen: 0
      lmm_stripe_offset: 51
      obdidx objid objid group
      51 1451231 0x1624df 0
      52 1452258 0x1628e2 0
      53 1450278 0x162126 0
      54 1444772 0x160ba4 0
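
      For reference, an equivalent layout could presumably be recreated on another file system before regenerating the test file, along these lines (a sketch; the lfs setstripe flags are taken from the 4-way stripe count and 1 MiB stripe size shown above):

      # create the file with a 4-way stripe and 1 MiB stripe size, then fill it;
      # conv=notrunc keeps dd from truncating the freshly created (empty) file
      lfs setstripe -c 4 -S 1M 3gurandomdata
      dd if=/dev/urandom of=3gurandomdata bs=4M count=$((256*3)) conv=notrunc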

      So this appears to only be happening on wide-stripe file systems. Here's the output from 'perf top' while a 'bad' dd is running:

      • 8.74% [kernel] [k] _spin_lock - _spin_lock
      • 22.23% osc_ap_completion
        osc_extent_finish
        brw_interpret
        ptlrpc_check_set
        ptlrpcd_check
        ptlrpcd
        kthread
        child_rip
        + 13.76% cl_env_put
        + 12.37% cl_env_get
        + 7.10% vvp_write_complete
        + 6.51% kfree
        + 4.62% osc_teardown_async_page
        + 3.96% osc_page_delete
        + 3.89% osc_lru_add_batch
        + 2.69% kmem_cache_free
        + 2.23% osc_page_init
        + 1.71% sptlrpc_import_sec_ref
        + 1.64% osc_page_transfer_add
        + 1.57% osc_io_submit
        + 1.43% cfs_percpt_lock
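
      For offline analysis, a call-graph profile like the one above can be captured while the slow dd runs with something like the following (a sketch; the duration and file names are illustrative):

      # record a system-wide call-graph profile for ~30 s, then render it as text
      perf record -a -g -o lu6370.perf.data -- sleep 30
      perf report -i lu6370.perf.data --stdio > lu6370-perf.txt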

      Attachments

        1. LU6370_1GB_BS.lctldk.out.gz
          0.2 kB
        2. LU6370_2GB_BS.lctldk.out.gz
          0.2 kB
        3. LU6370_cpu_log_20150402.out.gz
          20 kB
        4. LU6370_max_cached_mb_20150402.out.gz
          521 kB
        5. lu-6370_perf_data.tgz
          5 kB
        6. lu-6370-perf.tgz
          4.03 MB


          Activity


            We can certainly do that (the GPL requires we do it), but I can't personally without permission from my boss (who's not available for a few days). As our customer, ORNL has the legal right to have the source, but I think Cray would want some sort of formal request...? If the ORNL staff feel it's important, I can look into getting permission.

            At the least, I can't do it right away, so for this bug it's likely going to be easier to proceed by examining it from a 2.7 perspective.

            Matt/Dustin - Are you able to confirm this happens in 2.6? (Sorry if you already have and I missed it) If it doesn't happen in 2.6 (but does in Cray's 2.5 and Intel's 2.7), then LU-3321 is unlikely to be the culprit.

            paf Patrick Farrell (Inactive) added a comment - edited
            ezell Matt Ezell added a comment -

            > You mentioned that you've back ported LU-3321, is it possible for me to get access to the code?

            Cray (Patrick?) would need to provide this. We can either track this down with Cray's client that includes LU-3321, or with b2_7/master that also includes this patch (plus many more things).


            Hi Dustin,

            Thanks for the debug info. You mentioned that you've back ported LU-3321, is it possible for me to get access to the code?

            jay Jinshan Xiong (Inactive) added a comment -

            Jinshan,
            I have gathered the CPU data during the run using sar. The output is kinda messy, but it is all there. I also gathered the max_cached_mb data you requested.

            Please let me know if you need additional info.

            Thanks,
            Dustin

            dustb100 Dustin Leverman added a comment -

            Sorry if I've missed this in the shuffle, but have you tested 2.6, Dustin?

            The latest Cray clients on your mainframe have the performance improvements from LU-3321 backported into their version of 2.5, so if you see it not in Intel's 2.5, but in Cray's 2.5 and Intel's 2.6 and 2.7, that would suggest a place to look.

            paf Patrick Farrell (Inactive) added a comment -

            I do not see the same issue with a block size of 4G on the 2.5.3 client (it takes ~6 seconds), but the problem still exists on the 2.7 client (I let it run for about 30 min and then killed it).

            I am working on gathering the requested CPU and LRU info. I will have it for you soon.

            -Dustin

            dustb100 Dustin Leverman added a comment -

            Can you see the same problem with an even bigger block size, for example 4G?

            jay Jinshan Xiong (Inactive) added a comment -
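
            For clarity, the 4G variant would presumably be the same dd test as in the description, just with a larger block size (a sketch; note GNU dd will try to allocate the full 4 GiB buffer, so the node needs that much free memory):

            # read the whole 3 GiB file in a single 4 GiB-block read
            dd if=3gurandomdata bs=4G of=/dev/null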

            btw, Dustin, you don't need to wait for the test to finish; just clear the log beforehand, let it run for a few minutes, and then collect the log. Thanks again.

            jay Jinshan Xiong (Inactive) added a comment -
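
            A sketch of the clear-then-collect sequence described above, using standard lctl debug commands (the run length and output path are illustrative):

            lctl clear                                # empty the kernel debug buffer before the run
            dd if=3gurandomdata bs=2G of=/dev/null &  # start the slow read in the background
            sleep 300                                 # let it run for a few minutes
            lctl dk > /tmp/LU6370_2GB_BS.lctldk.out   # dump the accumulated debug log
            kill %1                                   # stop the dd if it is still running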
            jay Jinshan Xiong (Inactive) added a comment - edited

            After taking a further look, kiblnd_cq_completion() had been running on CPU 15 for an even longer time before the reading process was rescheduled to that CPU. It had been using the idle process's stack, so the kernel scheduler considered that CPU idle; that is why the reading process was scheduled there. If a task can take this long, it should really be running in process context.

            It's still unclear why this does not happen with a 1G block size. I'm not sure if this is related to LRU management.

            Hi Dustin,

            If you have another chance to run the test, can you collect a log of CPU usage and a log of the Lustre LRU stats (the output of `lctl get_param llite.*.max_cached_mb` and `osc.*.osc_cached_mb`)?

            Thanks,

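
            A sketch of a collection loop for the CPU and LRU data requested above, using sar from sysstat and the lctl parameters named in the comment (the interval and file names are illustrative):

            # sample per-CPU utilization and the client cache counters once per second
            # while the slow dd is running
            while sleep 1; do
                date +%s                             >> LU6370_cpu_log.out
                sar -P ALL 1 1                       >> LU6370_cpu_log.out
                lctl get_param llite.*.max_cached_mb >> LU6370_max_cached_mb.out
                lctl get_param osc.*.osc_cached_mb   >> LU6370_max_cached_mb.out
            done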

            Jinshan,
            We did the same testing on the 2.5.3 client and do not see the issue.

            The data I gave you was from a client that was IB attached, but we are seeing this on our Cray systems as well (GNI attached). I wanted to give you a heads up in case that helps you with tracking down the issue.

            -Dustin

            dustb100 Dustin Leverman added a comment -

            Hi Dustin,

            Thanks for the debug information. From the 2G output, I can see some strange things were happening.

            The reading process is 11235.

            At time 1427805640.741591, this process had just finished a read, and then it was preempted and rescheduled to processor 15.

            From time 1427805640.741714 to 1427805644.468771, kiblnd_cq_completion(), which runs in softirq context, occupied the CPU for over 3 seconds.

            00000080:00000001:13.0:1427805640.741591:0:11235:0:(rw.c:1140:ll_readpage()) Process leaving (rc=0 : 0 : 0)
            00000800:00000200:15.2:1427805640.741714:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (130)++
            ....
            00000800:00000200:15.2:1427805643.543220:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (130)++
            00000800:00000200:15.2:1427805643.543271:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (131)++
            00000800:00000200:15.2:1427805643.543458:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (131)++
            00000800:00000200:15.2:1427805643.543499:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (131)++
            00000800:00000200:15.2:1427805643.543588:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (130)++
            00000800:00000200:15.2:1427805644.468771:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6ea00] (131)++
            00000020:00000001:15.0:1427805644.468787:0:11235:0:(cl_io.c:930:cl_page_list_del()) Process leaving
            

            This happened quite often: from time 1427805640 to 1427805644, and then from 1427805644 to 1427805647. Roughly speaking, the pattern was to run for 1 second and then stall for 3 seconds. I didn't see this happening for the 1G read, but I don't know why.

            Have you ever run this kind of 2G-block-size read job on a pre-2.6 client?

            I'm still reading the log and trying to dig up more. Stay tuned.

            jay Jinshan Xiong (Inactive) added a comment -
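
            For anyone following along with the attached logs, the lines quoted above can be pulled out with a filter along these lines (the PID and function names come from the analysis above; the file name is assumed to match the attachment list):

            # extract the reader's trace and the kiblnd_cq_completion storm from the debug dump
            zcat LU6370_2GB_BS.lctldk.out.gz | grep ':11235:' | grep -E 'll_readpage|kiblnd_cq_completion|cl_page_list_del'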

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 15
