Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Lustre 2.16.0, Lustre 2.15.2
Description
The LOD QOS object allocator is not giving enough priority to free space on the OSTs. Even when the OST free space is significantly imbalanced (in the example below, OST0001 has only 1/4 to 1/5 as much free space as the other OSTs), it is still being selected for allocations far more often than it should be:
# lfs df
UUID                   1K-blocks        Used   Available Use% Mounted on
testfs-MDT0000_UUID       125056        2276      111544   2% /mnt/testfs[MDT:0]
testfs-OST0000_UUID       313104       86656      199288  31% /mnt/testfs[OST:0]
testfs-OST0001_UUID       313104      242304       43640  85% /mnt/testfs[OST:1]
testfs-OST0002_UUID       313104       77444      208500  28% /mnt/testfs[OST:2]
testfs-OST0003_UUID       313104      118404      167540  42% /mnt/testfs[OST:3]

filesystem_summary:      1252416      524808      618968  46% /mnt/testfs

# for F in /mnt/testfs/junk3.{21..30}; do lfs setstripe -E 1M -c 1 -E 16M -c 1 -E eof -c 1 $F; fallocate -l 17M $F; done
# lfs getstripe /mnt/testfs/junk3.{21..30} | awk '/l_ost_idx/ { print $5 }' | sort | uniq -c
      8 0,
      6 1,
      8 2,
      8 3,
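As a rough sanity check: if allocations simply tracked available space, OST0001 (43640 KB free out of 618968 KB total, about 7%) would be expected to receive roughly 2 of the 30 objects created above, not 6 (20%), i.e. it is getting about 3x more allocations than its share of the free space.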
When the free space is even more imbalanced (OST0001 has 1/11, 1/17, and 1/20 of the free space of the other OSTs), and after waiting at least qos_maxage seconds so the allocator is using current free-space statistics, the allocator continues to select the nearly full OST0001 and eventually runs out of space, even though the 3-stripe layout could be satisfied from the other OSTs:
# lfs df
UUID                   1K-blocks        Used   Available Use% Mounted on
testfs-MDT0000_UUID       125056        2276      111544   2% /mnt/testfs[MDT:0]
testfs-OST0000_UUID       313104      109184      176760  39% /mnt/testfs[OST:0]
testfs-OST0001_UUID       313104      277120        8824  97% /mnt/testfs[OST:1]
testfs-OST0002_UUID       313104      128644      157300  45% /mnt/testfs[OST:2]
testfs-OST0003_UUID       313104      183940      102004  65% /mnt/testfs[OST:3]

filesystem_summary:      1252416      698888      444888  62% /mnt/testfs

# for F in /mnt/testfs/junk3.{31..40}; do lfs setstripe -E 1M -c 1 -E 16M -c 1 -E eof -c 1 $F; fallocate -l 17M $F; done
fallocate: fallocate failed: No space left on device
# lfs getstripe /mnt/testfs/junk3.{31..40} | awk '/l_ost_idx/ { print $5 }' | sort | uniq -c
     11 0,
      5 1,
      6 2,
      8 3,
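Here the imbalance is even more striking: OST0001 holds only 8824 KB of the 444888 KB total free space (about 2%), so purely space-proportional selection would give it at most 1 of the 30 objects, yet it still received 5 (17%).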
I think there are a few contributing factors here that need to be fixed:
- the inode usage on all of the OSTs is virtually identical and very low (only 1%), so the patch https://review.whamcloud.com/35219 "LU-11213 lod: share object alloc QoS code with LMV" may be confusing the proper OST selection when the average object size is much larger than what is needed to fill the OST. However, having both inode and space weighting makes sense for DoM, and for ldiskfs OSTs where the average file size is at or below the bytes/inode ratio (where selecting OSTs with free inodes becomes more important than free blocks). For ZFS the free blocks and free inodes are directly proportional to each other, so this has no effect.
- having multiple OSTs on the same OSS (as in this case) means that part of the QOS weight for each OST comes from the total OSS free space, which artificially boosts the weight of the full OST when it should not be boosted. It might make sense to change the QOS algorithm to keep the OSS weight separate from the OST weight, and then select first the OSS and then an OST within that OSS (see the sketch after this list). That would maximize the per-OSS bandwidth while still avoiding OSTs on that OSS that are full.
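To make the second point concrete, here is a minimal standalone sketch of what the proposed two-stage selection could look like. This is not the actual lod_qos code; all structure and function names are hypothetical, the real weights include penalties and aging, and for illustration the four OSTs from the second lfs df above are grouped onto a single OSS. The idea is just that the OSS free space only decides which server is used, while within the chosen server only per-OST free space matters, so a nearly full OST no longer inherits weight from its emptier neighbours on the same server.

/*
 * Hypothetical two-stage QOS selection sketch: pick an OSS weighted by
 * its total free space, then pick an OST within that OSS weighted only
 * by the OST's own free space.  Not the real Lustre implementation.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct ost {
        int  index;     /* OST index */
        long free_kb;   /* free space on this OST, in KB */
};

struct oss {
        struct ost *osts;
        int         nr_osts;
};

/* Total free space over all OSTs on one OSS. */
static long oss_free_kb(const struct oss *oss)
{
        long sum = 0;
        for (int i = 0; i < oss->nr_osts; i++)
                sum += oss->osts[i].free_kb;
        return sum;
}

/* Weighted-random pick: probability of slot i is weights[i] / total.
 * Modulo bias is ignored for the purposes of this sketch. */
static int pick_weighted(const long *weights, int n)
{
        long total = 0;
        for (int i = 0; i < n; i++)
                total += weights[i];
        if (total <= 0)
                return -1;

        long r = rand() % total;
        for (int i = 0; i < n; i++) {
                if (r < weights[i])
                        return i;
                r -= weights[i];
        }
        return n - 1;
}

/* Stage 1: choose the OSS by total server free space.
 * Stage 2: choose an OST on that OSS by its own free space only. */
static const struct ost *alloc_ost(const struct oss *servers, int nr_oss)
{
        long oss_weights[nr_oss];
        for (int i = 0; i < nr_oss; i++)
                oss_weights[i] = oss_free_kb(&servers[i]);

        int s = pick_weighted(oss_weights, nr_oss);
        if (s < 0)
                return NULL;

        const struct oss *oss = &servers[s];
        long ost_weights[oss->nr_osts];
        for (int i = 0; i < oss->nr_osts; i++)
                ost_weights[i] = oss->osts[i].free_kb;

        int o = pick_weighted(ost_weights, oss->nr_osts);
        return o < 0 ? NULL : &oss->osts[o];
}

int main(void)
{
        /* Free space from the second "lfs df" above; all four OSTs are
         * placed on one OSS purely for illustration. */
        struct ost osts[] = {
                { 0, 176760 }, { 1, 8824 }, { 2, 157300 }, { 3, 102004 },
        };
        struct oss servers[] = { { osts, 4 } };
        int hits[4] = { 0 };

        srand((unsigned)time(NULL));
        for (int i = 0; i < 30; i++)
                hits[alloc_ost(servers, 1)->index]++;

        for (int i = 0; i < 4; i++)
                printf("OST%04x: %d objects\n", (unsigned)i, hits[i]);
        return 0;
}

With the free-space numbers from the second lfs df, OST0001's selection probability in this scheme is just its own free-space share (8824/444888, about 2%), rather than being propped up by the free space of the other OSTs on the same OSS.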