  1. Lustre
  2. LU-19023

slow fallocate on large ldiskfs devices

Details

    • Bug
    • Resolution: Unresolved
    • Major

    Description

      A fallocate operation on a large ldiskfs device can be very slow.
      As an example, I created a 100T sparse file, formatted it as an ldiskfs fs, and mounted it as ldiskfs:

      [root@rocky9 wc-master]# df -h /mnt/ldiskfs/
      Filesystem      Size  Used Avail Use% Mounted on
      /dev/vdb        100T  8.0T   87T   9% /mnt/ldiskfs
      [root@rocky9 wc-master]# 
      

      The fs was then fragmented in a specific way, so that each block group is about 8% used and 92% free:

      time for x in {0..819200}; do fallocate -o $((x * 10))M -l 10M /mnt/ldiskfs/filler.ldiskfs ; echo $x > /proc/fs/ldiskfs/vdb/mb_last_group ; (( $x % 10000 == 0 )) && { echo -n $x -- ; date; };   done
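
A rough sanity check of the layout this loop is meant to produce, as a minimal sketch. The block group size is not stated in the report; it is derived here from the rounded 100T size shown by df and the 819200 group count, so treat it as an assumption.

```shell
#!/bin/bash
# Assumption: the device is exactly 100 TiB and holds 819200 block groups.
fs_mib=$(( 100 * 1024 * 1024 ))
groups=819200
chunk_mib=10                      # size of each fallocate call in the filler loop

# Derived block group size in MiB.
group_mib=$(( fs_mib / groups ))

# Share of one group taken by one 10 MiB chunk, scaled by 100 to keep two decimals.
used_pct_x100=$(( chunk_mib * 10000 / group_mib ))

# Total size of the filler file in GiB.
total_gib=$(( chunk_mib * groups / 1024 ))

printf 'group size: %d MiB\n' "$group_mib"
printf 'per-group usage: %d.%02d%%\n' $(( used_pct_x100 / 100 )) $(( used_pct_x100 % 100 ))
printf 'filler size: %d GiB\n' "$total_gib"
```

Under these assumptions one 10M chunk per group is a little under 8% of a group, and the filler file is about 8000 GiB, which is in line with the usage figures quoted in the report.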
      

      After that, an attempt to fallocate a 100G file takes more than a minute:

      [root@rocky9 wc-master]# rmmod ldiskfs
      [root@rocky9 wc-master]# insmod ldiskfs/ldiskfs.ko 
      [root@rocky9 wc-master]# mount -t ldiskfs /dev/vdb /mnt/ldiskfs/
      [root@rocky9 wc-master]# echo 1 > /sys/fs/ldiskfs/vdb/mb_stats 
      [root@rocky9 wc-master]# rm -f /mnt/ldiskfs/big 
      [root@rocky9 wc-master]# time fallocate -o 0  -l 100G /mnt/ldiskfs/big 
      
      real	1m24.296s
      user	0m0.001s
      sys	1m24.142s
      [root@rocky9 wc-master]# 
      

      Here are the mb_stats for the operation:

      [root@rocky9 wc-master]# cat /proc/fs/ldiskfs/vdb/mb_stats 
      mballoc:
      	reqs: 875
      	success: 4
      	groups_scanned: 175074
      	cr0_stats:
      		hits: 0
      		groups_considered: 0
      		useless_loops: 0
      		bad_suggestions: 0
      	cr1_stats:
      		hits: 4
      		groups_considered: 713509254
      		useless_loops: 0
      		bad_suggestions: 0
      	cr2_stats:
      		hits: 0
      		groups_considered: 713523200
      		useless_loops: 871
      	cr3_stats:
      		hits: 871
      		groups_considered: 176811
      		useless_loops: 0
      	extents_scanned: 175096
      		goal_hits: 0
      		2^n_hits: 0
      		breaks: 871
      		lost: 0
      	buddies_generated: 820270/819200
      	buddies_time_used: 1904314776
      	preallocated: 0
      	discarded: 0
      [root@rocky9 wc-master]# 
      

      The mballoc stats show that the cr2 loop failed 871 times (cr2 useless_loops), i.e. it failed for practically every allocation request. Each of those failures is a useless pass across all block groups in the fs.

      (The 800+ figure comes from the maximum block allocation request, which is about 32k blocks (128MB), and the total fallocated space: 100GB ~= 128MB x 800.)
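
The request-count estimate can be reproduced with a few lines of shell arithmetic. This is only a sketch: the 4 KiB block size is an assumption implied by "32k blocks (128MB)", not something the report states directly.

```shell
#!/bin/bash
block_kib=4             # assumed ldiskfs block size in KiB
max_req_blocks=32768    # "about 32k blocks" per allocation request
falloc_gib=100          # size passed to fallocate -l

# Largest single allocation request in MiB.
max_req_mib=$(( max_req_blocks * block_kib / 1024 ))

# Minimum number of requests needed to cover the whole range.
min_requests=$(( falloc_gib * 1024 / max_req_mib ))

echo "max request: ${max_req_mib} MiB"
echo "minimum requests for ${falloc_gib}G: ${min_requests}"
```

The observed counts (reqs: 875, cr2 useless_loops: 871) are somewhat above this minimum, which would fit some requests returning less than the full 128MB; that explanation is a guess, not something the stats confirm.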

      The statistics for cr1 are not good either:

      	cr1_stats:
      		hits: 4
      		groups_considered: 713509254
      		useless_loops: 0
      		bad_suggestions: 0
      

      "groups_considered: 713509254" is about 871 x 819200, which means the allocator also tried all block groups at cr1 for each of the 800+ requests; it just did not count those attempts as "useless loops".

      My test fs has 819200 block groups. Real systems may have ~5M+ groups, meaning the same allocation would take 6-8 min or more.
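
The figures above can be cross-checked from the mb_stats output and the timing. The sketch below uses only numbers from this report; the extrapolation to 5M groups assumes the time scales linearly with the number of block groups, which is an assumption and not a measurement.

```shell
#!/bin/bash
groups=819200                 # block groups in the test fs
cr1_considered=713509254      # cr1_stats groups_considered
cr2_considered=713523200      # cr2_stats groups_considered
cr2_useless=871               # cr2_stats useless_loops
elapsed_ms=84296              # real 1m24.296s

# Number of complete passes over all groups at cr2, and the leftover.
cr2_scans=$(( cr2_considered / groups ))
cr2_rem=$(( cr2_considered % groups ))

# How many groups short of the cr2 total the cr1 counter is.
cr1_short=$(( cr2_considered - cr1_considered ))

# Average wall time per failed cr2 request, in milliseconds.
ms_per_req=$(( elapsed_ms / cr2_useless ))

# Linear extrapolation to a hypothetical fs with 5M groups.
big_groups=5000000
est_s=$(( elapsed_ms * big_groups / groups / 1000 ))

echo "cr2 full scans: ${cr2_scans} (remainder ${cr2_rem})"
echo "cr1 considered ${cr1_short} fewer groups than cr2"
echo "avg time per request: ${ms_per_req} ms"
echo "estimated time with ${big_groups} groups: ${est_s} s (~$(( est_s / 60 )) min)"
```

The cr2 counter is exactly 871 full scans of 819200 groups, and cr1 is only a few thousand groups short of the same total, so almost all of the 84 seconds goes into walking the group list twice per request.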

      This is really a generic problem with the ldiskfs block allocator; fallocate just makes it clearly visible because, unlike writes, fallocate requests are not limited by BRW size.

      The problem existed in RHEL8.x, and the new, improved block allocator in RHEL9.x (linux-5.14) made it no better.

          People

            zam Alexander Zarochentsev