  1. Lustre
  2. LU-20199

A large RPC is queued as numerous bios


Details

    • Bug
    • Resolution: Unresolved
    • Medium
    • Lustre 2.17.0
    • Lustre 2.17.0, Lustre 2.16.1, Lustre 2.15.7
    • None
    • 3

    Description

      Since LU-11526 (commit 1a9be0046b1f1772d3f934c2146dc5233c391377, https://github.com/lustre/lustre-release/commit/1a9be0046b1f1772d3f934c2146dc5233c391377), Lustre has been technically capable of serving RPCs as large as 64 MiB, although the Lustre manual at the time of writing still advertises only 16 MiB. Large RPCs come in very handy in the era of Big Data and the expanding use of transparent huge pages (usually 2 MiB on the x86_64 architecture). It is also helpful to submit as much sequential data as possible to slower storage media, such as HDDs.

       

      To illustrate the observed issue, the example below uses virtio for ease of reproduction, but it is relevant for any architecture with a 4 KiB page size and RPCs larger than 1 MiB, provided the block device underlying the Lustre OST can handle requests equal to or larger than the configured RPC size. The illustration uses ldiskfs, but the underlying caveat might affect ZFS-based server storage too. Given the following parameters:

      [root@localhost t]# mount -t lustre | awk '{ print $1; }' | xargs -I{} bash -c 'strings -f -n1 /sys/class/block/$(basename {})/queue/max*' | sort
      /sys/class/block/vdb/queue/max_discard_segments: 1
      /sys/class/block/vdb/queue/max_hw_sectors_kb: 2147483647
      /sys/class/block/vdb/queue/max_integrity_segments: 0
      /sys/class/block/vdb/queue/max_sectors_kb: 2147483647
      /sys/class/block/vdb/queue/max_segments: 254
      /sys/class/block/vdb/queue/max_segment_size: 4294967295
      /sys/class/block/vdc/queue/max_discard_segments: 1
      /sys/class/block/vdc/queue/max_hw_sectors_kb: 2147483647
      /sys/class/block/vdc/queue/max_integrity_segments: 0
      /sys/class/block/vdc/queue/max_sectors_kb: 2147483647
      /sys/class/block/vdc/queue/max_segments: 254
      /sys/class/block/vdc/queue/max_segment_size: 4294967295

      it is clear that we can submit pretty much any imaginable request, as long as the memory pages holding the data are physically contiguous enough that each segment fits within queue/max_segment_size and the total number of segments does not exceed queue/max_segments. The admin can bound the overall request size by setting queue/max_sectors_kb to any value up to queue/max_hw_sectors_kb. Depending on the block device driver implementation (virtual ones especially), many of these parameters may end up being ignored by the driver, but that is the overall contract.
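
      The limits above can be turned into a rough upper bound on a single request. The following is a minimal sketch, not taken from the kernel or from Lustre: the helper name max_request_bytes is hypothetical, and it assumes the bound is simply the smaller of (max_segments * max_segment_size) and max_sectors_kb, ignoring any driver-specific behavior.

```shell
# Hypothetical helper: rough upper bound (in bytes) for one block request.
#   $1 = queue/max_segments
#   $2 = bytes per segment (queue/max_segment_size, or less if memory is fragmented)
#   $3 = queue/max_sectors_kb
max_request_bytes() {
    by_segments=$(( $1 * $2 ))     # limit imposed by the segment count
    by_sectors=$(( $3 * 1024 ))    # limit imposed by max_sectors_kb
    if [ "$by_segments" -lt "$by_sectors" ]; then
        echo "$by_segments"
    else
        echo "$by_sectors"
    fi
}

# Best case for the vdb/vdc values shown above: every segment is full-sized.
max_request_bytes 254 4294967295 2147483647   # prints 1090921692930

# Worst case: every segment is a single, non-contiguous 4 KiB page.
max_request_bytes 254 4096 2147483647         # prints 1040384
```

      Under these assumptions, the second call suggests that with fully fragmented 4 KiB pages a single request on this device would stay just under 1 MiB, no matter how large the RPC is.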

      Starting with the kernel 5.0 release and the introduction of multi-page bvecs, it became easier than ever to compose block device requests as large as needed, from megabytes to gigabytes, as long as enough physically contiguous memory is available. For the 4.18.x kernels used by Lustre servers, the largest possible bio is still 2 MiB rather than the commonly assumed 1 MiB, because the number of pages in a single bvec array was increased from the typical 256 to 512 to accommodate swapping of transparent 2 MiB huge pages.


      Unfortunately, it appears that while a Lustre RPC is capable of transferring up to 64 MiB per request, the resulting bio requests vary chaotically in size, even when space allocation at the filesystem level is favorable and the whole RPC could fit sequentially. To simplify the illustration (and to gain some performance in production environments), we disabled the server-side page cache and increased the RPC size to the maximum:

      lctl set_param 'obdfilter.*-OST*.brw_size=64'
      lctl set_param osd-ldiskfs.*.read_cache_enable=0
      lctl set_param osd-ldiskfs.*.writethrough_cache_enable=0

      Then we used the obdfilter-survey script to illustrate the behavior without a client:

      cd $(dirname $(which obdfilter-survey))
      export nobjhi=1
      export thrhi=1
      export size=$((1 * 1024))
      export rszlo=$((64 * 1024))
      export rszhi=$((64 * 1024))
      export rszmax=$((64 * 1024))
      export case=disk
      export targets=bigreq-OST0000
      export tests_str='write'
      exec sh obdfilter-survey 

      along with logging relevant debug information:

      # cat reproduce.sh 
      lctl debug_daemon start ./lustre-dbg.txt
      lctl set_param debug=+inode
      perf record -e 'block:block_bio_queue' -- bash ~/alx/survey.sh
      lctl debug_daemon stop
      lctl set_param debug=-inode
      lctl debug_file ./lustre-dbg.txt > lustre-dbg-plain.txt 

      Then one can observe something like:

      [root@localhost t]# mount -t lustre | awk '{ print $1; }' | xargs -I{} bash -c 'echo $((0x$(stat -c %t {}))),$((0x$(stat -c %T {})))'
      253,16
      253,32
      # grep osd_submit_bio lustre-dbg-plain.txt | head
      00080000:00000002:3.0:1777377656.939748:0:3370:0:(osd_io.c:326:osd_submit_bio()) bio++ sz 3592192 vcnt 256(256) sectors 7016(2560) psg 877(254)
      00080000:00000002:3.0:1777377656.940427:0:3370:0:(osd_io.c:326:osd_submit_bio()) bio++ sz 63516672 vcnt 241(256) sectors 124056(2560) psg 15507(254)
      00080000:00000002:3.0:1777377656.950661:0:3370:0:(osd_io.c:326:osd_submit_bio()) bio++ sz 3592192 vcnt 256(256) sectors 7016(2560) psg 877(254)
      00080000:00000002:3.0:1777377656.951319:0:3370:0:(osd_io.c:326:osd_submit_bio()) bio++ sz 63516672 vcnt 241(256) sectors 124056(2560) psg 15507(254)
      00080000:00000002:3.0:1777377656.961277:0:3370:0:(osd_io.c:326:osd_submit_bio()) bio++ sz 3592192 vcnt 256(256) sectors 7016(2560) psg 877(254)
      00080000:00000002:3.0:1777377656.961935:0:3370:0:(osd_io.c:326:osd_submit_bio()) bio++ sz 63516672 vcnt 241(256) sectors 124056(2560) psg 15507(254)
      00080000:00000002:3.0F:1777377656.971789:0:3370:0:(osd_io.c:326:osd_submit_bio()) bio++ sz 3592192 vcnt 256(256) sectors 7016(2560) psg 877(254)
      00080000:00000002:3.0:1777377656.972433:0:3370:0:(osd_io.c:326:osd_submit_bio()) bio++ sz 63516672 vcnt 241(256) sectors 124056(2560) psg 15507(254)
      00080000:00000002:3.0:1777377656.982480:0:3370:0:(osd_io.c:326:osd_submit_bio()) bio++ sz 3592192 vcnt 256(256) sectors 7016(2560) psg 877(254)
      00080000:00000002:3.0:1777377656.983150:0:3370:0:(osd_io.c:326:osd_submit_bio()) bio++ sz 63516672 vcnt 241(256) sectors 124056(2560) psg 15507(254)
      
      grep -f majmin.txt perf-script.txt | head
                   lctl    3358 [000] 70943.896093: block:block_bio_queue: 253,32 RM 1532656 + 8 [lctl]
                   lctl    3358 [000] 70943.896709: block:block_bio_queue: 253,32 RM 1536032 + 8 [lctl]
                   lctl    3358 [000] 70943.896864: block:block_bio_queue: 253,32 RM 1527848 + 8 [lctl]
                   lctl    3362 [003] 70943.905005: block:block_bio_queue: 253,16 FWS 0 + 0 [lctl]
                   lctl    3362 [003] 70943.905147: block:block_bio_queue: 253,32 FWS 0 + 0 [lctl]
                   lctl    3370 [003] 70943.922705: block:block_bio_queue: 253,32 W 20709376 + 7016 [lctl]
                   lctl    3370 [003] 70943.923379: block:block_bio_queue: 253,32 W 20716392 + 124056 [lctl]
                   lctl    3370 [003] 70943.933613: block:block_bio_queue: 253,32 W 20840448 + 7016 [lctl]
                   lctl    3370 [003] 70943.934270: block:block_bio_queue: 253,32 W 20847464 + 124056 [lctl]
                   lctl    3370 [003] 70943.944229: block:block_bio_queue: 253,32 W 20447232 + 7016 [lctl]
      

      So even after cranking the RPC size up to the maximum, the bios queued in this run are 3508 KiB and 62028 KiB respectively (7016 and 124056 sectors of 512 bytes, together exactly 64 MiB). Given that the smaller request uses 256 out of 256 possible bvec entries, it is clear that the issue is related to how the pages for the bio are allocated. The pattern can vary from boot to boot and even between RPCs, as there are multiple OSS threads (at least 2, IIRC).
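
      The sizes quoted above can be derived from the perf output directly. The sketch below is only an illustration: the function name bio_sizes_kib is hypothetical, and it assumes the perf script line layout shown earlier (process name without spaces, sector count in the tenth field, 512-byte sectors).

```shell
# Hypothetical helper: read `perf script` output on stdin and print the device,
# the request flags and the size in KiB of every non-empty block_bio_queue event.
bio_sizes_kib() {
    awk '$5 == "block:block_bio_queue:" && $10 + 0 > 0 {
        # $6 = major,minor  $7 = flags (W, RM, ...)  $10 = number of 512-byte sectors
        printf "%s %s %d KiB\n", $6, $7, $10 * 512 / 1024
    }'
}

# Two of the events from the run above; the empty flush (FWS 0 + 0) is skipped.
bio_sizes_kib <<'EOF'
             lctl    3362 [003] 70943.905147: block:block_bio_queue: 253,32 FWS 0 + 0 [lctl]
             lctl    3370 [003] 70943.922705: block:block_bio_queue: 253,32 W 20709376 + 7016 [lctl]
             lctl    3370 [003] 70943.923379: block:block_bio_queue: 253,32 W 20716392 + 124056 [lctl]
EOF
# prints:
#   253,32 W 3508 KiB
#   253,32 W 62028 KiB
```

      With a saved trace, the same function could be fed from a file, for example `bio_sizes_kib < perf-script.txt`.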

      For kernels prior to 5.0 the issue is less direct, as one has to rely on luck for multiple 1-2 MiB bios (depending on which kernel the server uses) to be merged, for example in the blk-mq stack, and thus handled by the driver as one larger request. As a result, the I/O pattern will still differ significantly between the client's RPC size and the completed bios, even when the filesystem parameters, be they cluster-level (like RPC size) or architectural (like ldiskfs/ext4 bigalloc options), are favorable.

      The issue can be reproduced from a client, but that requires extra steps, such as using fio or another hugepage-capable I/O source and preallocating large (preferably 1 GiB) pages, along with setting the client-side RPC size limit to the maximum:

      # 64 MiB expressed in pages; 4096 is the PAGE_SIZE assumed here and depends on your architecture
      lctl set_param osc.*.max_pages_per_rpc=$((64 * 1024 * 1024 / 4096))
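
      Since the value is expressed in pages, it has to be derived from the RPC size in bytes and the page size. The following is a small sketch of that arithmetic; the helper name rpc_pages is hypothetical, and the last line only prints the resulting parameter string rather than running lctl.

```shell
# Hypothetical helper: number of pages needed for an RPC of the given size.
#   $1 = RPC size in MiB
#   $2 = page size in bytes
rpc_pages() {
    echo $(( $1 * 1024 * 1024 / $2 ))
}

rpc_pages 64 4096     # prints 16384 (4 KiB pages)
rpc_pages 64 65536    # prints 1024  (64 KiB pages)

# Use the page size of the running system instead of a hard-coded one.
echo "osc.*.max_pages_per_rpc=$(rpc_pages 64 "$(getconf PAGE_SIZE)")"
```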

          People

            aleksandr_dyadyushkin Aleksandr Dyadyushkin