Lustre / LU-19193

osd-zfs: data block size of ZFS objects inappropriately set to 4k


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Medium
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.15.7
    • Environment: zfs-osd OSTs with HDD pools
      lustre-2.15.7_1.llnl-1.t4.x86_64
      zfs-2.2.8_1llnl-1.t4.x86_64
    • Severity: 3

    Description

      Our test:
      The test pool contains 106 HDDs in two draid2:11d:1s:53c vdevs with two NVMe special devices.
      Run obdfilter-survey on the OST with rszlo=128k and rszhi=4M (a sketch of the invocation follows below).
      During the run, monitor ZFS I/O sizes with `zpool iostat -r 10`.
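
      For reference, the invocation looked roughly like this. This is a sketch, not the exact command: the targets value is a placeholder, and size/nobjhi/thrhi are inferred from the sz/obj/thr columns in the results below.

        # Survey one OST with record sizes from 128K to 4M
        # (targets is a placeholder; size is in MB, rszlo/rszhi in KB)
        targets="fsname-OST0000" size=262144 \
        rszlo=128 rszhi=4096 nobjhi=128 thrhi=1024 \
        case=disk obdfilter-survey

        # In a second shell, watch the per-request-size histograms
        zpool iostat -r 10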

      We observed that the overwhelming majority of I/Os reaching disk were 4K, for every record size obdfilter-survey tested. Write rates were much worse than we had previously observed on the same system.

      For rsz 128K through 4M, we saw (updated results):

      ost  1 sz 268435456K rsz  128K obj  128 thr 1024 write 1245.37 [ 246.13, 6871.38] read 3627.31 [2415.46, 22124.38] 
      ost  1 sz 268435456K rsz  256K obj  128 thr 1024 write  419.23 [ 104.22, 9325.36] read 3000.64 [ 491.82, 11144.38] 
      ost  1 sz 268435456K rsz  512K obj  128 thr 1024 write  166.58 [   0.00, 4366.21] read 1829.00 [ 237.22, 5730.30] 
      ost  1 sz 268435456K rsz 1024K obj  128 thr 1024 write  299.09 [   0.00, 8173.13] read 3357.30 [1417.14, 7969.38] 
      ost  1 sz 268435456K rsz 2048K obj  128 thr 1024 write  161.40 [   0.00, 12826.61] read 2221.06 [ 375.63, 6200.53] 
      ost  1 sz 268435456K rsz 4096K obj  128 thr 1024 write  123.78 [   0.00, 4330.61] read 1784.79 [ 347.93, 3969.97]  

      versus earlier performance:

      ost  1 sz 268435456K rsz  128K obj  128 thr 1024 write 2528.98 [1197.09, 10672.67] read 11173.41 [3197.39, 17061.16] 
      ost  1 sz 268435456K rsz  256K obj  128 thr 1024 write 3714.67 [2467.70, 3841.77] read 10762.76 [3691.20, 17293.81] 
      ost  1 sz 268435456K rsz  512K obj  128 thr 1024 write 5686.06 [3543.69, 6544.91] read 10420.38 [2314.15, 17217.78] 
      ost  1 sz 268435456K rsz 1024K obj  128 thr 1024 write 8539.14 [5295.25, 13158.89] read 12566.79 [5110.33, 17110.34] 
      ost  1 sz 268435456K rsz 2048K obj  128 thr 1024 write 13092.97 [7478.98, 13707.40] read 15171.22 [10413.99, 17550.10] 
      ost  1 sz 268435456K rsz 4096K obj  128 thr 1024 write 16943.89 [9380.09, 20107.31] read 14525.17 [5994.41, 16746.13] 
      

      We also observed that performance measured via IOR was much worse. The files created via IOR had a 4K data block size, per ZDB, like this:

      Correction: the statement above was not quite right - the example file below with the 4K data block size wasn't created via IOR; it was created by the standard RHEL 8 utility "cp".

      Dataset kern4/ost1 [ZPL], ID 644, cr_txg 28, 13.1T, 6380645 objects
      
          Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
           99201    3   128K     4K   247M     512  23.1M  100.00  ZFS plain file 
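
      For reference, per-object block sizes can be checked with zdb; the dataset and object number here are taken from the listing above.

        # Dump dnode details (lvl/iblk/dblk) for a single object
        zdb -dddd kern4/ost1 99201

        # Or list every object in the dataset and scan the dblk column
        # (slow on a dataset with millions of objects)
        zdb -dd kern4/ost1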

      We have also observed the issue with files created by the mpifileutils utility "dsync" and by "dd".

      We tested with older ZFS and Lustre versions, and identified this patch as the culprit:
      https://review.whamcloud.com/c/fs/lustre-release/+/47768 "LU-15963 osd-zfs: use contiguous chunk to grow blocksize"
      Our understanding of the mechanism: ZFS can only change an object's data block size while the object still holds at most one block, so if osd-zfs fails to grow the block size before the first 4K block is committed, the object is stuck at a 4K dblk for its lifetime.

      Testing on the same system with 2.15.7_2.llnl, which had patch 47768 reverted, restored performance and showed the expected data block sizes.
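
      A quick way to re-check this on a given OST (a sketch; the mount point, file name, pool, and dataset names are placeholders):

        # Write a large file through Lustre
        dd if=/dev/zero of=/mnt/lustre/blocktest bs=4M count=256
        sync

        # On the OSS, confirm writes reach disk in large chunks, not 4K
        zpool iostat -r kern4 10

        # And confirm new objects' dblk grew past 4K
        zdb -dd kern4/ost1 | grep 'ZFS plain file'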


People

    Assignee: bzzz Alex Zhuravlev
    Reporter: ofaaland Olaf Faaland
