LU-16501: QOS allocator not balancing space enough


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.16.0, Lustre 2.15.2
    • Labels: None
    • Severity: 3

    Description

      The LOD QOS object allocator is not giving enough priority to free space on the OSTs. Even when the OST free space is significantly imbalanced (in the example below, OST0001 has only 1/4 to 1/5 as much free space as the other OSTs), it is still being used for allocations far more often than it should be:

      # lfs df
      UUID                   1K-blocks        Used   Available Use% Mounted on
      testfs-MDT0000_UUID       125056        2276      111544   2% /mnt/testfs[MDT:0]
      testfs-OST0000_UUID       313104       86656      199288  31% /mnt/testfs[OST:0]
      testfs-OST0001_UUID       313104      242304       43640  85% /mnt/testfs[OST:1]
      testfs-OST0002_UUID       313104       77444      208500  28% /mnt/testfs[OST:2]
      testfs-OST0003_UUID       313104      118404      167540  42% /mnt/testfs[OST:3]
      
      filesystem_summary:      1252416      524808      618968  46% /mnt/testfs
      
      # for F in /mnt/testfs/junk3.{21..30}; do lfs setstripe -E 1M -c 1 -E 16M -c 1 -E eof -c 1 $F; fallocate -l 17M $F; done
      # lfs getstripe /mnt/testfs/junk3.{21..30} | awk '/l_ost_idx/ { print $5 }' | sort | uniq -c
            8 0,
            6 1,
            8 2,
            8 3,
      
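      For scale, the free-space shares implied by the lfs df output above can be computed directly (a quick yardstick only; the QOS weighting is not expected to be strictly proportional to free space, this was not part of the original test):

      # printf "199288\n43640\n208500\n167540\n" | awk '{ a[n++] = $1; t += $1 } END { for (i = 0; i < n; i++) printf "OST%04x: %4.1f%% of free space\n", i, 100*a[i]/t }'

      OST0001 holds only about 7% of the free space, yet it received 6 of the 30 objects above (20%), roughly three times its proportional share.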

      When the free space is even more imbalanced (OST0001 has 1/11, 1/17, and 1/20 of the other OSTs' free space, and at least qos_maxage seconds have passed so the statfs data has been refreshed), the OST selection continues to pick the nearly full OST0001 and eventually runs out of space, even though the 3-component layout could have been satisfied from the other OSTs:

      # lfs df
      UUID                   1K-blocks        Used   Available Use% Mounted on
      testfs-MDT0000_UUID       125056        2276      111544   2% /mnt/testfs[MDT:0]
      testfs-OST0000_UUID       313104      109184      176760  39% /mnt/testfs[OST:0]
      testfs-OST0001_UUID       313104      277120        8824  97% /mnt/testfs[OST:1]
      testfs-OST0002_UUID       313104      128644      157300  45% /mnt/testfs[OST:2]
      testfs-OST0003_UUID       313104      183940      102004  65% /mnt/testfs[OST:3]
      
      filesystem_summary:      1252416      698888      444888  62% /mnt/testfs
      
      # for F in /mnt/testfs/junk3.{31..40}; do lfs setstripe -E 1M -c 1 -E 16M -c 1 -E eof -c 1 $F; fallocate -l 17M $F; done
      fallocate: fallocate failed: No space left on device
      # lfs getstripe /mnt/testfs/junk3.{31..40} | awk '/l_ost_idx/ { print $5 }' | sort | uniq -c
           11 0,
            5 1,
            6 2,
            8 3,
      
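      For reference, the QOS behaviour exercised here is controlled by a few MDS-side tunables (qos_maxage, qos_threshold_rr, qos_prio_free). A rough sketch of inspecting them, and of temporarily biasing the weighting entirely toward free space for testing, follows; the exact parameter path (lod vs. lov, and the *-mdtlov device name) may vary by release:

      # lctl get_param lod.*.qos_maxage lod.*.qos_threshold_rr lod.*.qos_prio_free
      # lctl set_param lod.*.qos_prio_free=100

      Setting qos_prio_free=100 makes the weighting depend entirely on free space, which can help separate the free-space weighting itself from the inode and OSS factors discussed below.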

      I think there are a few contributing factors here that need to be fixed:

      • the inode usage on all of the OSTs is virtually identical and very low (only 1%), so the patch https://review.whamcloud.com/35219 "LU-11213 lod: share object alloc QoS code with LMV" may be confusing the proper OST selection when the average object size is much larger than what is needed to fill the OST. However, having both inode and space weighting makes sense for DoM, and for ldiskfs OSTs where the average file size is at or below the bytes/inode ratio (there, selecting OSTs with free inodes becomes more important than free blocks). For ZFS the free blocks and free inodes are directly proportional to each other, so this has no effect.
      • having multiple OSTs on the same OSS (as in this case) means that part of the QOS weighting comes from the total OSS free space, which artificially boosts the weight of the full OST when it should not be boosted (see the sketch after this list). It might make sense to change the QOS algorithm to keep the OSS weight separate from the OST weight, and then select first an OSS and then an OST within that OSS. That would maximize the per-OSS bandwidth while avoiding OSTs on that OSS that are full.
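
      To make the second point concrete, here is a toy calculation (it is not the real lod_qos.c weighting, which also includes inode counts and per-OST/per-OSS penalties) using the Available column from the second lfs df above, and assuming all four OSTs sit on a single OSS so that each OST is credited with an equal 1/4 share of the OSS free space:

      # printf "176760\n8824\n157300\n102004\n" | awk '{ a[n++] = $1; oss += $1 } END { for (i = 0; i < n; i++) printf "OST%04x: OST-only %4.1f%%, with-OSS-share %4.1f%%\n", i, 100*a[i]/oss, 100*(a[i] + oss/n)/(2*oss) }'

      With these numbers, OST0001's share of the total weight jumps from about 2% (OST free space only) to about 13% once the OSS share is folded in, which is consistent with OST0001 being selected far more often than its own free space would justify.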

      Attachments

        Issue Links

          Activity

            People

              Assignee: Sergey Cheremencev (scherementsev)
              Reporter: Andreas Dilger (adilger)
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: