
[LU-16501] QOS allocator not balancing space enough

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.16.0, Lustre 2.15.2

    Description

      The LOD QOS object allocator is not giving enough priority to free space on the OSTs. Even when the OST free space is significantly imbalanced (in the example below OST0001 has only 1/4 to 1/5 of the free space of the other OSTs), it is still being used for allocations far more often than it should:

      # lfs df
      UUID                   1K-blocks        Used   Available Use% Mounted on
      testfs-MDT0000_UUID       125056        2276      111544   2% /mnt/testfs[MDT:0]
      testfs-OST0000_UUID       313104       86656      199288  31% /mnt/testfs[OST:0]
      testfs-OST0001_UUID       313104      242304       43640  85% /mnt/testfs[OST:1]
      testfs-OST0002_UUID       313104       77444      208500  28% /mnt/testfs[OST:2]
      testfs-OST0003_UUID       313104      118404      167540  42% /mnt/testfs[OST:3]
      
      filesystem_summary:      1252416      524808      618968  46% /mnt/testfs
      
      # for F in /mnt/testfs/junk3.{21..30}; do lfs setstripe -E 1M -c 1 -E 16M -c 1 -E eof -c 1 $F; fallocate -l 17M $F; done
      # lfs getstripe /mnt/testfs/junk3.{21..30} | awk '/l_ost_idx/ { print $5 }' | sort | uniq -c
            8 0,
            6 1,
            8 2,
            8 3,
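
      As a rough check (a minimal Python sketch; the numbers are copied from the Available column of the lfs df output above), a purely free-space-proportional allocator would have been expected to distribute the 30 new objects roughly as follows:

      avail = {0: 199288, 1: 43640, 2: 208500, 3: 167540}   # KB available per OST
      total_objects = 30      # 10 files x 3 PFL components, one object each
      total_avail = sum(avail.values())
      for ost, kb in avail.items():
          print(f"OST{ost:04}: expected ~{total_objects * kb / total_avail:.1f} objects")

      # Expected roughly 9.7 / 2.1 / 10.1 / 8.1 objects on OST0000..OST0003,
      # but lfs getstripe shows 8 / 6 / 8 / 8, so OST0001 received about 3x
      # its proportional share.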
      

      When the free space is even more imbalanced (OST0001 has only 1/11, 1/17, and 1/20 of the free space of the other OSTs, and at least qos_maxage seconds have passed), the OST selection continues to pick the nearly full OST0001 and eventually runs out of space, even though the 3-stripe layout could have been satisfied from the other OSTs:

      # lfs df
      UUID                   1K-blocks        Used   Available Use% Mounted on
      testfs-MDT0000_UUID       125056        2276      111544   2% /mnt/testfs[MDT:0]
      testfs-OST0000_UUID       313104      109184      176760  39% /mnt/testfs[OST:0]
      testfs-OST0001_UUID       313104      277120        8824  97% /mnt/testfs[OST:1]
      testfs-OST0002_UUID       313104      128644      157300  45% /mnt/testfs[OST:2]
      testfs-OST0003_UUID       313104      183940      102004  65% /mnt/testfs[OST:3]
      
      filesystem_summary:      1252416      698888      444888  62% /mnt/testfs
      
      # for F in /mnt/testfs/junk3.{31..40}; do lfs setstripe -E 1M -c 1 -E 16M -c 1 -E eof -c 1 $F; fallocate -l 17M $F; done
      fallocate: fallocate failed: No space left on device
      # lfs getstripe /mnt/testfs/junk3.{31..40} | awk '/l_ost_idx/ { print $5 }' | sort | uniq -c
           11 0,
            5 1,
            6 2,
            8 3,
      

      I think there are a few contributing factors here that need to be fixed:

      • the inode usage on all of the OSTs is virtually identical and very low (only 1%), so the patch https://review.whamcloud.com/35219 "LU-11213 lod: share object alloc QoS code with LMV" may be confusing the proper OST selection when the average object size is much larger than what is needed to fill the OST. However, having both inode and space weighting makes sense for DoM, and for ldiskfs OSTs where the average file size is at or below the bytes/inode ratio (where selecting OSTs with free inodes becomes more important than free blocks). For ZFS the free blocks and free inodes are directly proportional to each other, so this has no effect.
      • having multiple OSTs on the same OSS (as in this case) means that part of the QOS weighting comes from the total OSS free space, which artificially boosts the weight of the full OST when it should not. It might make sense to change the QOS algorithm to keep the OSS weight separate from the OST weight, and then select first the OSS and then the OST within that OSS, as in the sketch below. That would maximize the per-OSS bandwidth while avoiding OSTs on that OSS that are full.
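
      A minimal sketch of that two-level selection (illustrative only: the free-space numbers are taken from the first lfs df output above, while the OSS grouping and helper names like pick_weighted are made up, and this is not the actual lod_qos.c code):

      import random

      # Per-OST free space (KB) grouped by the OSS assumed to serve them.
      oss_map = {
          "oss1": {0: 199288, 1: 43640},   # the nearly-full OST shares this OSS
          "oss2": {2: 208500, 3: 167540},
      }

      def pick_weighted(weights):
          # Pick one key with probability proportional to its weight.
          keys = list(weights)
          return random.choices(keys, weights=[weights[k] for k in keys])[0]

      def select_ost(oss_map):
          # Stage 1: weight each OSS by its total free space (spreads load and
          # bandwidth across servers). Stage 2: weight OSTs only by their own
          # free space, so a full OST no longer benefits from a sibling's space.
          oss_weights = {oss: sum(osts.values()) for oss, osts in oss_map.items()}
          return pick_weighted(oss_map[pick_weighted(oss_weights)])

      counts = {i: 0 for i in range(4)}
      for _ in range(10000):
          counts[select_ost(oss_map)] += 1
      print(counts)   # OST0001 ends up with only ~7% of the allocations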

      Attachments

        Issue Links

          Activity

            [LU-16501] QOS allocator not balancing space enough

            adilger Andreas Dilger added a comment -

            I filed LU-17614 to complete the work to fix the OST allocator, which currently only takes free blocks into account and does not consider free inodes at all.

            pjones Peter Jones added a comment -

            Landed for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50074/
            Subject: LU-16501 lod: add qos_ost_weights to debugfs
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a44956f0d57d45109959fc83a32764628adf4446

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50074/ Subject: LU-16501 lod: add qos_ost_weights to debugfs Project: fs/lustre-release Branch: master Current Patch Set: Commit: a44956f0d57d45109959fc83a32764628adf4446

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49977/
            Subject: LU-16501 tgt: add qos debug
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5fe45f0ff98064561be2ea584879440c26dd0334

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49977/ Subject: LU-16501 tgt: add qos debug Project: fs/lustre-release Branch: master Current Patch Set: Commit: 5fe45f0ff98064561be2ea584879440c26dd0334

            "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50074
            Subject: LU-16501 lod: add qos_ost_weights to debugfs
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4cec0e123674c5d1eb7c902343aa3f95bf7053bf

            gerrit Gerrit Updater added a comment - "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50074 Subject: LU-16501 lod: add qos_ost_weights to debugfs Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 4cec0e123674c5d1eb7c902343aa3f95bf7053bf

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49890/
            Subject: LU-16501 tgt: skip free inodes in OST weights
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 511bf2f4ccd1482d6f2380942d43cc3e08b8e25b

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49890/ Subject: LU-16501 tgt: skip free inodes in OST weights Project: fs/lustre-release Branch: master Current Patch Set: Commit: 511bf2f4ccd1482d6f2380942d43cc3e08b8e25b

            "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49977
            Subject: LU-16501 tgt: add qos debug
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ccc4eaeaed4b22effc0655debd6dc71e0618f97f

            gerrit Gerrit Updater added a comment - "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49977 Subject: LU-16501 tgt: add qos debug Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: ccc4eaeaed4b22effc0655debd6dc71e0618f97f

            adilger Andreas Dilger added a comment -

            Sergey, based on our previous discussion I think the next steps here are as follows.

            Push your patch to add debug lines for the QOS allocator to print the weights, and do minor cleanups of the other debugging (e.g. set DEBUG_SUBSYSTEM to S_LOV in all relevant code, in particular lustre/obdclass/lu_tgt_*.c) so that it can be enabled to capture only QOS debugging.

            Patch the weight and penalty calculation to reduce/exclude the blocks or inodes, depending on which one is currently "unimportant". For example, on OSTs there are typically far more free inodes than space, so the free inodes should not affect the result when calculating the weight. Conversely, on the MDTs there is usually more free space than inodes, so the free space should not affect the weight. However, in some situations (e.g. DoM or Changelogs filling MDT space, or very small objects on OSTs) these values may become important and cannot be ignored completely as in my 49890 patch.

            We cannot change the weight calculation to selectively add/remove the inodes/blocks completely, since that will change the "units" they are calculated in, and it may be more or less important for different OSTs depending on their free usage. I was thinking something along the following lines:

            • for each statfs update the following metrics can be calculated once per OBD_STATFS call:
            • calculate the "filesystem bytes per inode" as "tot_bpi = bytes_total / inodes_total" (this would match the "inode ratio" when an ldiskfs MDT or OST is formatted). I'm not totally convinced this is needed; it depends on how the algorithm is implemented.
            • calculate the "current bytes per inode" as "cur_bpi = bytes_used / inodes_used" to determine how the filesystem is actually being used. For osd-zfs the
            • limit the contribution of the free inodes OR free bytes to the weight/penalty calculation based on how the current average file size (cur_bpi) compares to the filesystem limits (tot_bpi).
            • it may be that cur_bpi has to be adjusted when the filesystem is initially empty (e.g. because the only files in use are internal config files and maybe the journal), but this may not be important in the long run unless it significantly reduces the relative weight of new/empty OSTs compared to old/full OSTs (where cur_bpi could accurately predict the expected object size). As soon as OST objects start being allocated on the OST, the cur_bpi value will quickly approach the actual usage of the filesystem over the long term.

            For example, the inode weight could be limited to ia = min(2 * bytes_avail / cur_bpi, inodes_free) >> 8 and the bytes weight could be limited to ba = min(2 * inodes_free * cur_bpi, bytes_avail) >> 16 (possibly with other scaling factors depending on OST count/size). These values represent how many inodes or bytes can be expected to be consumed by new objects based on the historical average bytes-per-inode usage of the filesystem. If a target has mostly large objects, then cur_bpi would be large, so ia would be limited by the 2 * bytes_avail / cur_bpi part and it doesn't matter how many free inodes there actually are. Conversely, if cur_bpi is small (below tot_bpi means that the inodes would run out first), then 2 * bytes_avail / cur_bpi would be large and inodes_free would be the limiting factor for allocations. In the middle, if the average object size is close to the mkfs limits, then both the free inodes and bytes would be taken into account.
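
            A minimal numeric sketch of those limits (the statfs numbers below are made up, and the >>8 / >>16 shifts are only the placeholder scaling factors mentioned above, not settled values):

            # Strawman only - not actual Lustre code.
            def qos_weight_inputs(bytes_total, bytes_used, inodes_total, inodes_used):
                bytes_avail = bytes_total - bytes_used
                inodes_free = inodes_total - inodes_used

                tot_bpi = bytes_total // inodes_total                # mkfs-time bytes/inode ratio
                cur_bpi = max(bytes_used // max(inodes_used, 1), 1)  # observed average object size

                # Limit each component by how much of it could realistically be
                # consumed, given the other resource and the observed cur_bpi.
                ia = min(2 * bytes_avail // cur_bpi, inodes_free) >> 8
                ba = min(2 * inodes_free * cur_bpi, bytes_avail) >> 16
                return tot_bpi, cur_bpi, ia, ba

            # Large-object OST: 8 TiB total, half used, 16M inodes, 40k objects so far.
            print(qos_weight_inputs(8 << 40, 4 << 40, 16 << 20, 40000))
            # cur_bpi comes out around 105 MiB, so ia is capped by the
            # 2 * bytes_avail / cur_bpi term and the millions of free inodes no
            # longer dominate the weight.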

            Finally, make a separate patch to add debugfs parameters to print weight/penalty/per-obj/per-oss for each OST/OSS in the LOV. It probably makes sense for this to be in lod.*.qos_<something> for the default pool, and "lod.*.pool.<pool>.qos_<something>" for each pool. "<something>" might be "qos_tgt_weights" or similar? It could be a YAML-formatted file containing one line per target, and somehow also the per-OSS stats, but I don't have great ideas for this yet. Maybe the per-OSS info (accumulated server penalty and per-obj) could be duplicated on each target line for that server?


            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49890
            Subject: LU-16501 tgt: skip free inodes in OST weights
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: cb46a35534548a3fe64763d217e426ab70a06ef4

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49890 Subject: LU-16501 tgt: skip free inodes in OST weights Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: cb46a35534548a3fe64763d217e426ab70a06ef4

            adilger Andreas Dilger added a comment -

            On a related note, I've wondered for some time whether the QOS space balancing shouldn't be a bit more aggressive than it currently is. The algorithm essentially prioritizes OST selection by the ratio of free space between the OSTs (e.g. if OST0001 has 2x the free space of OST0003, then OST0001 would get approximately 2x the allocations over time). However, this essentially means that the space will only be fully balanced when all of the OSTs hit 100% full (modulo the fact that QOS is disabled when the OSTs report free space within qos_threshold_rr percent of each other).
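
            A toy model of that behaviour (purely illustrative: two equal-size OSTs, one starting half full, with each round's allocation split in proportion to free space):

            # Toy model - not the lod_qos.c algorithm.
            size = 1000.0
            free = [1000.0, 500.0]                 # OST A empty, OST B half full

            for round_no in range(1, 1401):
                total = sum(free)
                for i in range(2):
                    free[i] -= free[i] / total     # proportional share of this round's unit
                if round_no % 350 == 0:
                    print(round_no, ["%.0f%% full" % (100 * (1 - f / size)) for f in free])

            # The initial 50-point usage gap only narrows to a few points once
            # both OSTs are over ~90% full.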

            One option would be to change the weighting affected by qos_prio_free (prio_wide in the code) so that qos_prio_free=100 is 100% weighted by OST free space (and nothing related to OSS space), while still keeping qos_prio_free=0 unaffected by OST/OSS free space at all.

            It may also be that the "penalty" values are far too large and prevent allocations on less-full OSTs too quickly, preventing QOS from effectively balancing the space when OST objects are being allocated quickly.

            It may also make sense to emphasize the free-space balancing more aggressively when QOS is active, to target space equilibrium at about 80% full. That would have the dual purpose of reducing (though not eliminating) allocations on OSTs that are over 80% full, while putting more emphasis on less-full OSTs. Otherwise, the QOS balancing may never bring the OSTs into equilibrium under normal usage.


            People

              Assignee: scherementsev Sergey Cheremencev
              Reporter: adilger Andreas Dilger
              Votes: 0
              Watchers: 5
