[LU-12335] mb_prealloc_table table read/write code is racy Created: 24/May/19 Updated: 10/Sep/19 Resolved: 10/Sep/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.13.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Artem Blagodarenko (Inactive) | Assignee: | Artem Blagodarenko (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Preallocation table read/write code is racy. There is a possibility of accessing memory outside of allocated table. |
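For illustration only, a minimal bash sketch of how concurrent table rewrites and allocator reads could exercise the racy path. This is an assumption-based sketch, not a confirmed reproducer from this ticket; the device name loop1 and the mount point are taken from the test commands further below, and the table values are arbitrary examples.

#!/bin/bash
# Hypothetical stress sketch, not a confirmed reproducer: keep rewriting the
# preallocation table while dd forces the allocator to consult it, so a reader
# may observe a table that is concurrently being resized or replaced.
DEV=loop1                    # ldiskfs device name is an assumption
MNT=/mnt/fs2ost              # mount point taken from the test procedure below
dd if=/dev/zero of=$MNT/O/foofile bs=1048576 count=1024 conv=fsync &
DD_PID=$!
while kill -0 $DD_PID 2>/dev/null; do
    echo "16 32 64 128 256" > /proc/fs/ldiskfs/$DEV/prealloc_table
    echo "16 32 64 128 256 512 1024 2048" > /proc/fs/ldiskfs/$DEV/prealloc_table
done
wait $DD_PID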
| Comments |
| Comment by Gerrit Updater [ 24/May/19 ] |
|
Artem Blagodarenko (c17828@cray.com) uploaded a new patch: https://review.whamcloud.com/34950 |
| Comment by Andreas Dilger [ 24/May/19 ] |
|
I guess the first question is whether the preallocation table settings are even useful. We've been carrying that patch for many years without submitting it upstream, because I'm not sure whether it actually improves performance or functionality, or is just overhead for patch maintenance. Do you have a real test system where you could measure performance under load to see whether removing ext4-prealloc.patch improves or hurts performance or allocation behaviour? If there is data that shows the patch improves performance noticeably under at least some non-Lustre workloads, and doesn't hurt performance, then it would make sense to finally push the patch upstream. |
| Comment by Artem Blagodarenko (Inactive) [ 24/May/19 ] |
|
I used the preallocation table to solve allocator problems on aged systems.
We have test results for the third solution on a 140TB ldiskfs partition; will share the results.

Here is a bash script that builds a prealloc table based on mb_groups output:

[root@localhost cray-lustre]# cat build_prealloc.sh
#!/bin/bash
INPUT_FILE=$1
# columns 9 to 21 show how many free fragments are available
for index in {9..21}
do
PARAMS="'NR>1 {if (\$$index > 0) { print }}'"
REGS=`eval awk "$PARAMS" $INPUT_FILE | wc -l`
VAL=$((2 ** ($index-8)))
[ $REGS -gt 0 ] && PREALLOC_TABLE="$PREALLOC_TABLE $VAL"
done
echo "prealloc table: $PREALLOC_TABLE"

Example of how to use it:

cat /proc/fs/ldiskfs/loop1/mb_groups > table.dat
sh build_prealloc.sh table.dat > prealloc.txt
cat prealloc.txt > /proc/fs/ldiskfs/loop1/prealloc_table
tar -xf fsxfs-n24.img.tgz
cp fsxfs-n24.img fsxfs-n24-2.img

And run a test that: 1) sets a too-large preallocation table, 2) starts dd, 3) adjusts the preallocation table using the script above, 4) starts dd again.
start_mb_stats()
{
echo "1" > /sys/fs/ldiskfs/loop1/mb_stats
echo "0" > /sys/fs/ldiskfs/loop1/mb_c1_threshold
echo "0" > /sys/fs/ldiskfs/loop1/mb_c2_threshold
echo "0" > /sys/fs/ldiskfs/loop1/mb_c3_threshold
}
mount_image()
{
local IMAGE=$1
mount -t xfs -o loop $IMAGE /mnt/fs2xfs/
mount -t ldiskfs -o loop /mnt/fs2xfs/n24.raw /mnt/fs2ost/
}
umount_image()
{
umount /mnt/fs2ost/
umount /mnt/fs2xfs/
}
1. Set too large preallocation table and estimate write speed

LOAD=yes lustre/tests/llmount.sh
mount_image /lustre/mnt/staff/CAST-19722/fsxfs-n24.img
echo "256 512 1024 2048 4096 8192 16384" > /proc/fs/ldiskfs/loop1/prealloc_table
start_mb_stats
dd if=/dev/zero of=/mnt/fs2ost/O/foofile bs=1048576 count=1024 conv=fsync
cat /proc/fs/ldiskfs/loop1/mb_alloc
echo "clear" > /proc/fs/ldiskfs/loop1/mb_alloc
umount_image
mount_image /lustre/mnt/staff/CAST-19722/fsxfs-n24-2.img

2. Adjust preallocation table based on mb_groups output

cat /proc/fs/ldiskfs/loop1/mb_groups > $TMP/table.dat
sh build_prealloc.sh $TMP/table.dat > $TMP/prealloc.txt
cat $TMP/prealloc.txt > /proc/fs/ldiskfs/loop1/prealloc_table

3. Estimate performance again

dd if=/dev/zero of=/mnt/fs2ost/O/foofile bs=1048576 count=1024 conv=fsync
cat /proc/fs/ldiskfs/loop1/mb_alloc
echo "clear" > /proc/fs/ldiskfs/loop1/mb_alloc
umount_image

[root@localhost cray-lustre]# sh start.sh
Loading modules from /lustre/mnt/orig/cray-lustre/lustre/tests/..
detected 8 online CPUs by sysfs
libcfs will create CPU partition based on online CPUs
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 11.2427 s, 95.5 MB/s
mballoc: 262144 blocks 153 reqs (137 success)
mballoc: 2046 extents scanned, 127 goal hits, 1 2^N hits, 10 breaks, 0 lost
mballoc: (0, 0, 0) useless c(0,1,2) loops
mballoc: (0, 0, 0) skipped c(0,1,2) loops
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 9.22825 s, 116 MB/s
mballoc: 262143 blocks 243 reqs (240 success)
mballoc: 141 extents scanned, 113 goal hits, 129 2^N hits, 0 breaks, 0 lost
mballoc: (0, 0, 0) useless c(0,1,2) loops
mballoc: (0, 0, 0) skipped c(0,1,2) loops
[root@localhost cray-lustre]#

The test passed and shows an ~18% speed improvement.

I am going to test this approach on a 140TB ldiskfs OST soon. |
| Comment by Andreas Dilger [ 24/May/19 ] |
How hard would it be to include this into the mballoc code in the kernel directly? Having a userspace tool is OK, but suffers from a number of limitations (hard for most users to configure, can die if there are problems (e.g. OOM), become CPU starved if the server is busy, needs extra scanning to learn current filesystem state and may become out of sync with the kernel). |
| Comment by Artem Blagodarenko (Inactive) [ 25/May/19 ] |
|
The algorithm is not difficult, as you can see in the script, so it could be added to the kernel. The most difficult decision is the moment when the preallocation table needs to be reconfigured. With the script, the administrator decides when to change the configuration.

> hard for most users to configure, can die if there are problems (e.g. OOM)

My suggestion is to add it to the cluster scripts and adjust automatically.

> become CPU starved if the server is busy

Changing the preallocation table is quite a fast operation, and with patch https://review.whamcloud.com/34950 it is safe and lockless.

> needs extra scanning to learn current filesystem state and may become out of sync with the kernel

Scanning is done by the kernel; the script uses the "/proc/fs/ldiskfs/loop1/mb_groups" output. This statistic is perfect data for such a decision. Anyway, even in the kernel we would need to use the same statistics.
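For illustration, a minimal sketch of what such an automatic adjustment could look like as a cluster script, reusing build_prealloc.sh from the earlier comment. The device name, temporary paths and interval are assumptions, not part of the ticket:

#!/bin/bash
# Hypothetical wrapper, not from the ticket: periodically re-derive the
# preallocation table from current mb_groups statistics, mirroring the
# manual steps shown earlier.
DEV=loop1
while true; do
    cat /proc/fs/ldiskfs/$DEV/mb_groups > /tmp/table.dat
    sh build_prealloc.sh /tmp/table.dat > /tmp/prealloc.txt
    cat /tmp/prealloc.txt > /proc/fs/ldiskfs/$DEV/prealloc_table
    sleep 3600               # re-evaluate hourly; tune to the workload
done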
|
| Comment by Gerrit Updater [ 29/May/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34950/ |
| Comment by Peter Jones [ 29/May/19 ] |
|
Landed for 2.13 |
| Comment by James A Simmons [ 16/Jun/19 ] |
|
This fix was only ever applied to RHEL platforms. SLES and Ubuntu lack this fix. |
| Comment by Peter Jones [ 10/Sep/19 ] |
|
As per recent LWG discussion this ticket should be marked as RESOLVED and anyone wanting to keep SLES/Ubuntu servers in sync should do that under a separate ticket |
| Comment by James A Simmons [ 10/Sep/19 ] |
|
Please make sure this is pushed to the ext4 maintainers. |
| Comment by Artem Blagodarenko (Inactive) [ 10/Sep/19 ] |
|
There is no need to send this patch to ext4 upstream because the bug does not exist there; it was introduced in our ldiskfs patches. |