Details
Type: Bug
Resolution: Unresolved
Priority: Major
Severity: 3
Description
A fallocate operation on a large disk device may be very slow.
Here is an example: I created a 100T sparse file, formatted it as an ldiskfs fs, and mounted it as ldiskfs:
[root@rocky9 wc-master]# df -h /mnt/ldiskfs/
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdb        100T  8.0T   87T   9% /mnt/ldiskfs
[root@rocky9 wc-master]#
the fs was then fragmented in a specific way, so that each block group has about 8% used and 92% free space:
time for x in {0..819200}; do fallocate -o $((x * 10))M -l 10M /mnt/ldiskfs/filler.ldiskfs ; echo $x > /proc/fs/ldiskfs/vdb/mb_last_group ; (( $x % 10000 == 0 )) && { echo -n $x -- ; date; }; done
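The layout produced by the loop above can be checked with a little arithmetic. This is only a sketch: it assumes 128 MiB block groups (32768 blocks of 4 KiB each), which the report does not state explicitly but which matches 819200 groups on a 100T device.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4

GROUP_SIZE = 128 * MIB   # assumption: 32768 blocks x 4 KiB per block group
CHUNK_SIZE = 10 * MIB    # "-l 10M" in the filler loop
GROUPS = 819200          # group count quoted in the report

fs_size_tib = GROUPS * GROUP_SIZE / TIB    # total device size
used_fraction = CHUNK_SIZE / GROUP_SIZE    # share of each group that is filled
filler_tib = GROUPS * CHUNK_SIZE / TIB     # total size of filler.ldiskfs

print(f"filesystem size: {fs_size_tib:.0f} TiB")
print(f"used per group:  {used_fraction:.1%}")
print(f"filler file:     {filler_tib:.2f} TiB")
```

Under these assumptions each group is a little under 8% full, and the filler file is close to the 8.0T that df reports as used.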
after that, an attempt to fallocate a 100G file takes more than 1 min:
[root@rocky9 wc-master]# rmmod ldiskfs
[root@rocky9 wc-master]# insmod ldiskfs/ldiskfs.ko
[root@rocky9 wc-master]# mount -t ldiskfs /dev/vdb /mnt/ldiskfs/
[root@rocky9 wc-master]# echo 1 > /sys/fs/ldiskfs/vdb/mb_stats
[root@rocky9 wc-master]# rm -f /mnt/ldiskfs/big
[root@rocky9 wc-master]# time fallocate -o 0 -l 100G /mnt/ldiskfs/big

real    1m24.296s
user    0m0.001s
sys     1m24.142s
[root@rocky9 wc-master]#
here is the mb_stats for the operation:
[root@rocky9 wc-master]# cat /proc/fs/ldiskfs/vdb/mb_stats
mballoc:
    reqs: 875
    success: 4
    groups_scanned: 175074
    cr0_stats:
        hits: 0
        groups_considered: 0
        useless_loops: 0
        bad_suggestions: 0
    cr1_stats:
        hits: 4
        groups_considered: 713509254
        useless_loops: 0
        bad_suggestions: 0
    cr2_stats:
        hits: 0
        groups_considered: 713523200
        useless_loops: 871
    cr3_stats:
        hits: 871
        groups_considered: 176811
        useless_loops: 0
    extents_scanned: 175096
        goal_hits: 0
        2^n_hits: 0
        breaks: 871
        lost: 0
    buddies_generated: 820270/819200
    buddies_time_used: 1904314776
    preallocated: 0
    discarded: 0
[root@rocky9 wc-master]#
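To compare runs, it can help to turn this output into a dictionary. The parser below is a minimal sketch, not part of any Lustre tooling: `parse_mb_stats` and `CR_FIELDS` are hypothetical names, and the "key: value" layout is assumed from the output shown in this ticket only. It works on whitespace-separated tokens, so it does not care whether the output is on one line or many.

```python
import re

# Fields that appear under each crN_stats section in the output above.
CR_FIELDS = {"hits", "groups_considered", "useless_loops", "bad_suggestions"}

def parse_mb_stats(text):
    """Parse 'key: value' tokens; crN_stats fields are stored as 'crN_stats.key'."""
    stats = {}
    section = None
    tokens = text.split()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if not tok.endswith(":"):
            i += 1                      # not a key (e.g. shell prompt), skip it
            continue
        key = tok[:-1]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is None or nxt.endswith(":"):
            # A key with no value is a section header such as "cr2_stats:".
            section = key if re.fullmatch(r"cr\d+_stats", key) else None
            i += 1
            continue
        name = f"{section}.{key}" if section and key in CR_FIELDS else key
        stats[name] = int(nxt) if nxt.isdigit() else nxt
        i += 2
    return stats

# Excerpt of the output from this ticket.
SAMPLE = """mballoc: reqs: 875 success: 4 groups_scanned: 175074
cr1_stats: hits: 4 groups_considered: 713509254 useless_loops: 0 bad_suggestions: 0
cr2_stats: hits: 0 groups_considered: 713523200 useless_loops: 871
cr3_stats: hits: 871 groups_considered: 176811 useless_loops: 0
extents_scanned: 175096 buddies_generated: 820270/819200"""

stats = parse_mb_stats(SAMPLE)
print(stats["cr2_stats.useless_loops"], stats["cr2_stats.groups_considered"])
```

Values that are not plain integers, such as `buddies_generated`, are kept as strings.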
the mballoc stats show that the cr2 loop failed 800+ times, i.e. it failed for each allocation request: there were 800+ useless loops, each one across all block groups in the fs.
(800+ comes from the maximum block allocation request size, which is about 32k blocks (128MB), and the total fallocated space: 100GB ~= 128MB x 800.)
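The numbers can be cross-checked directly. The sketch below assumes a 4 KiB block size (so that 32768 blocks equal 128 MiB) and copies the cr2 counters from the mb_stats output; the variable names are illustrative only.

```python
KIB, MIB, GIB = 1024, 1024 ** 2, 1024 ** 3

BLOCK_SIZE = 4 * KIB          # assumption: 4 KiB blocks (32768 blocks = 128 MiB)
MAX_REQUEST_BLOCKS = 32768    # "about 32k blocks" per the report
GROUPS = 819200               # block groups in the test fs

max_request_bytes = MAX_REQUEST_BLOCKS * BLOCK_SIZE
expected_requests = (100 * GIB) // max_request_bytes

# Values copied from the cr2_stats section of mb_stats.
cr2_useless_loops = 871
cr2_groups_considered = 713523200

print(max_request_bytes // MIB)   # request size in MiB
print(expected_requests)          # ideal request count for 100 GiB
print(cr2_groups_considered == cr2_useless_loops * GROUPS)
```

The last line prints True: the cr2 `groups_considered` counter is exactly 871 x 819200, i.e. every one of the 871 useless loops visited every block group. The observed `reqs: 875` is somewhat above the ideal 800, and equals the 4 cr1 hits plus the 871 cr3 hits.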
the statistics for cr1 are not good either:
cr1_stats: hits: 4 groups_considered: 713509254 useless_loops: 0 bad_suggestions: 0
"groups_considered: 713509254" means the allocator also tried nearly all block groups at cr1 for each of the 800+ requests, but just didn't count those attempts as "useless loops".
My test fs has 819200 groups; real systems may have ~5M+ groups, meaning the same allocation requests would take 6-8 min or more.
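A rough projection, as a sketch only: it assumes the run time scales linearly with the number of block groups (the request count stays the same, and each request scans every group), and that essentially all of the sys time went into considering groups at cr1 and cr2. Neither assumption has been measured on a larger filesystem.

```python
GROUPS = 819200                     # block groups in the test fs
cr1_groups_considered = 713509254   # from mb_stats
cr2_groups_considered = 713523200   # from mb_stats
real_seconds = 84.296               # "real 1m24.296s"
sys_seconds = 84.142                # "sys 1m24.142s"

# How many full passes over all groups the cr1 counter corresponds to.
cr1_scans = cr1_groups_considered / GROUPS

# Average cost of considering one group, if all sys time is attributed to it.
ns_per_group = sys_seconds / (cr1_groups_considered + cr2_groups_considered) * 1e9

def projected_minutes(target_groups):
    """Linear extrapolation of the measured wall time to a larger group count."""
    return real_seconds * target_groups / GROUPS / 60

print(f"cr1 full scans:      {cr1_scans:.2f}")
print(f"cost per group:      {ns_per_group:.0f} ns")
print(f"5M groups, 100G req: {projected_minutes(5_000_000):.1f} min")
```

With these inputs the cr1 counter corresponds to just under 871 full passes, and the same 100G fallocate on a 5M-group filesystem projects to somewhere between eight and nine minutes.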
it is rather a generic problem with the ldiskfs block allocator; fallocate just makes it clearly visible because, unlike writes, fallocate requests are not limited by the BRW size.
the problem already existed in RHEL8.x, and the new, improved block allocator in RHEL9.x (linux-5.14) made it no better.
it would be useful to know exactly what requests were coming to mballoc from fallocate. the requests were probably too big; I'd suggest splitting them into smaller ones with 2^n sizes.
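One way to picture the suggestion is a greedy split of a request into power-of-two chunks. This is an illustration of the idea only, not mballoc code: `split_pow2` and its `max_chunk` parameter are hypothetical, and sizes are in blocks.

```python
def split_pow2(nblocks, max_chunk=32768):
    """Split nblocks into chunks that are powers of two and at most max_chunk."""
    if nblocks < 0 or max_chunk < 1:
        raise ValueError("nblocks must be >= 0 and max_chunk >= 1")
    chunks = []
    remaining = nblocks
    while remaining > 0:
        limit = min(remaining, max_chunk)
        chunk = 1 << (limit.bit_length() - 1)   # largest power of two <= limit
        chunks.append(chunk)
        remaining -= chunk
    return chunks

print(split_pow2(100000))
```

Each chunk is the largest power of two that fits in what is left, capped at `max_chunk`, so a request that is already a multiple of `max_chunk` simply becomes equal-sized pieces.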
given the filesystem was just mounted, mballoc had to re-initialize all those groups (note buddies_generated: 820270/819200 in the stats above), so another fallocate wouldn't take that long AFAIU.