Details
Type: Bug
Priority: Major
Resolution: Fixed
Description
Testing for LU-13417 showed that "lfs setdirstripe -D -c 1 -i -1 /mnt/testfs" now causes subdirectories to be created on different MDTs when qos_threshold_rr is reduced. However, errors were still hit when one MDT ran out of space even though it still had free inodes (and multiple kernel errors were also reported). For mkdir this is a real problem because each directory needs at least one block, so the QOS code should completely avoid selecting MDTs with little free space (e.g. below 5% of the average MDT free space).
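The kind of check proposed here could be implemented roughly as below. This is only a sketch; mdt_qos_eligible and its parameters are hypothetical names, not the actual lod_qos code:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of the proposed QOS filter (not the actual Lustre
 * lod_qos code): an MDT is skipped for new directory creation when its
 * free blocks OR free inodes drop below min_pct percent of the average
 * free amount across all MDTs, so mkdir never lands on a nearly-full MDT.
 */
static bool mdt_qos_eligible(unsigned long long free_blocks,
			     unsigned long long free_inodes,
			     unsigned long long avg_free_blocks,
			     unsigned long long avg_free_inodes,
			     unsigned int min_pct)
{
	/* multiply through by 100 to stay in integer arithmetic */
	if (free_blocks * 100 < avg_free_blocks * min_pct)
		return false;
	if (free_inodes * 100 < avg_free_inodes * min_pct)
		return false;
	return true;
}
```

For example, with one MDT at 65560 free blocks and another at 0 (average 32780), a 5% cutoff keeps the first MDT eligible and excludes the empty one, regardless of its free inode count.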
# lfs df
UUID                  1K-blocks   Used Available Use% Mounted on
testfs-MDT0000_UUID      125368   9508    104624   9% /mnt/testfs[MDT:0]
testfs-MDT0001_UUID      125368  93560     20572  82% /mnt/testfs[MDT:1]
# lfs df -i
UUID                 Inodes IUsed IFree IUse% Mounted on
testfs-MDT0000_UUID  100000 20295 79705   21% /mnt/testfs[MDT:0]
testfs-MDT0001_UUID  100000 40580 59420   41% /mnt/testfs[MDT:1]
# ./createmany -d /mnt/testfs/dir 1000
total: 1000 mkdir in 0.57 seconds: 1768.49 ops/second
[root@centos7 tests]# lfs getdirstripe -m /mnt/testfs/dir* | sort | uniq -c
    871 0
    129 1
# ./createmany -d /mnt/testfs/dsub/d 1000
total: 1000 mkdir in 1.64 seconds: 608.97 ops/second
# lfs getdirstripe -m /mnt/testfs/dsub/d[0-9]* | sort | uniq -c
    860 0
    140 1
This showed a reasonable distribution of directories, with over 85% of them going to the less-full MDT0000.
However, when creating more directories the space balance doesn't change very much:
# ./createmany -d /mnt/testfs/dsub/d 1000 9000
 - mkdir 5742 (time 1586376135.48 total 10.00 last 574.16)
mkdir(/mnt/testfs/dsub/d9621) error: No space left on device
total: 8621 mkdir in 15.30 seconds: 563.46 ops/second
# lfs df
UUID                  1K-blocks   Used Available Use% Mounted on
testfs-MDT0000_UUID      125368  48572     65560  43% /mnt/testfs[MDT:0]
testfs-MDT0001_UUID      125368 125368         0 100% /mnt/testfs[MDT:1]
# lfs df -i
UUID                 Inodes IUsed IFree IUse% Mounted on
testfs-MDT0000_UUID  100000 29764 70236   30% /mnt/testfs[MDT:0]
testfs-MDT0001_UUID  100000 50330 49670   51% /mnt/testfs[MDT:1]
This shows that mkdir is failing with -ENOSPC in the "-i -1" directory even though MDT0000 still has plenty of free blocks and inodes. Checking the files that were created shows that the distribution didn't change very much:
# lfs getdirstripe -m /mnt/testfs/dsub/d1[0-9][0-9][0-9] | sort | uniq -c
    891 0
    109 1
# lfs getdirstripe -m /mnt/testfs/dsub/d2[0-9][0-9][0-9] | sort | uniq -c
    882 0
    118 1
# lfs getdirstripe -m /mnt/testfs/dsub/d3[0-9][0-9][0-9] | sort | uniq -c
    887 0
    113 1
# lfs getdirstripe -m /mnt/testfs/dsub/d4[0-9][0-9][0-9] | sort | uniq -c
    881 0
    119 1
# lfs getdirstripe -m /mnt/testfs/dsub/d5[0-9][0-9][0-9] | sort | uniq -c
    884 0
    116 1
# lfs getdirstripe -m /mnt/testfs/dsub/d6[0-9][0-9][0-9] | sort | uniq -c
    862 0
    138 1
# lfs getdirstripe -m /mnt/testfs/dsub/d7[0-9][0-9][0-9] | sort | uniq -c
    884 0
    116 1
# lfs getdirstripe -m /mnt/testfs/dsub/d8[0-9][0-9][0-9] | sort | uniq -c
    886 0
    114 1
# lfs getdirstripe -m /mnt/testfs/dsub/d9[0-9][0-9][0-9] | sort | uniq -c
    554 0
     67 1
I suspect this may be related to "qos_maxage=60" on the client, which prevents it from getting a fresh space update while "createmany -d" is running, combined with the relatively small amount of space on the MDTs. However, even after waiting a long time it still did not allow new directories to be created on the MDT with free space:
# ./createmany -d /mnt/testfs/dsub/d 10000 1000
mkdir(/mnt/testfs/dsub/d10001) error: No space left on device
total: 1 mkdir in 0.01 seconds: 104.60 ops/second
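The staleness effect suspected here can be modeled as below. This is an illustrative sketch only; the struct and function names are made up and this is not the actual Lustre statfs caching code:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative model of the qos_maxage behaviour (not the actual Lustre
 * code): the client caches per-MDT statfs data and only refreshes it once
 * the cached copy is older than qos_maxage seconds, so a createmany run
 * shorter than qos_maxage keeps allocating from a stale view of free space
 * and can continue to pick an MDT that has already filled up.
 */
struct cached_statfs {
	unsigned long long free_blocks;	/* free space at snapshot time */
	long stamp;			/* seconds when snapshot was taken */
};

static bool statfs_needs_refresh(const struct cached_statfs *cs, long now,
				 long qos_maxage)
{
	return now - cs->stamp >= qos_maxage;
}
```

With the default qos_maxage=60, a 15-second createmany burst never triggers a refresh, which is consistent with the unchanged distribution seen above; it does not, however, explain the persistent -ENOSPC after waiting, which points at the allocator itself.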
I think two improvements are needed:
- the QOS code should stop allocating on an MDT before it becomes too full. Any MDT whose free blocks or free inodes fall below roughly 10% of the average free space across all MDTs should be skipped. This avoids hitting -ENOSPC during creation, whether from the directory itself or from the llogs. Since directories take space, either free blocks or free inodes being much lower than average should be reason enough not to use the MDT.
- the default "qos_threshold_rr=17%" is too high to start balancing directory creation across MDTs. This might mean that a large MDT0000 is used for many millions of files and top-level directories before any balancing even starts. At that point it will be harder to restore the balance across MDTs because so many top-level directories and subdirectories have already been created on MDT0000. I think it would be better to have a smaller "qos_threshold_rr=5%" or "=10%" by default, to avoid the MDTs becoming too imbalanced before QOS starts.
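The effect of lowering the threshold can be sketched as below. This is hypothetical code, not the actual lod_qos implementation, which differs in detail:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of the round-robin vs. weighted-QOS decision:
 * allocation stays round-robin while the gap between the fullest and
 * emptiest MDT is within qos_threshold_rr percent of the largest free
 * space.  With the default 17% a lot of imbalance accumulates before
 * QOS kicks in; a 5-10% threshold reacts much earlier.
 */
static bool use_round_robin(unsigned long long max_free,
			    unsigned long long min_free,
			    unsigned int qos_threshold_rr)
{
	if (max_free == 0)
		return true;	/* all MDTs full: nothing to weight */

	return (max_free - min_free) * 100 <= max_free * qos_threshold_rr;
}
```

For a 10% free-space gap between two MDTs, a 17% threshold keeps allocation round-robin while a 5% threshold already switches to space-weighted QOS.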