[LU-12988] improve mount time on huge ldiskfs filesystem Created: 20/Nov/19  Updated: 27/Feb/21  Resolved: 11/Feb/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Improvement Priority: Minor
Reporter: Alex Zhuravlev Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: LTS12

Attachments: Text File 0001-debugfs-add-fake_fill_fs-to-fill-fs-for-testing.patch    
Issue Links:
Related
is related to LU-12970 improve mballoc for huge filesystems Open
is related to LU-12103 Improve block allocation for large pa... Resolved
is related to LU-14453 LDISKFS-fs error (device xxx) in ldis... Resolved
is related to LU-13290 Write performance regression in ldisk... Resolved
Rank (Obsolete): 9223372036854775807

 Description   

during Lustre server startup a few small files need to be updated (e.g. the config backup).
at this point the buddy/bitmap cache is empty, but mballoc wants to find a big chunk of free space for group preallocation and reads bitmaps one by one.
sometimes this can take a very long time.
one possible workaround is to disable preallocation during mount.
the long-term plan is to limit scanning and to prefetch bitmaps, but this needs more effort and will be tracked separately (LU-12970)
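The workaround can be sketched with a tunable (this is only an illustration: the actual fix disables preallocation from the osd code; mb_group_prealloc is the standard ext4 tunable name, and the ldiskfs /proc path and device name here are assumptions):

```shell
# a minimal sketch, assuming ldiskfs exposes the ext4 mb_group_prealloc
# tunable under /proc/fs/ldiskfs/<dev>/. The per-device entry only appears
# once the device registers during mount, hence the background/wait pattern.
dev=loop0
mount -t lustre /dev/$dev /mnt/ost &
until [ -e /proc/fs/ldiskfs/$dev/mb_group_prealloc ]; do sleep 1; done
echo 0 > /proc/fs/ldiskfs/$dev/mb_group_prealloc   # disable group preallocation
wait
echo 512 > /proc/fs/ldiskfs/$dev/mb_group_prealloc # restore the ext4 default
```

This is still racy (the first allocation may happen before the echo lands), which is why the landed patch makes the osd skip preallocation itself during mount.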



 Comments   
Comment by Li Xi [ 20/Nov/19 ]

not sure why the following patch wasn't linked to this ticket automatically:

https://review.whamcloud.com/#/c/36704/
LU-12988 osd: do not use preallocation during mount

Comment by Andreas Dilger [ 21/Nov/19 ]

not sure why the following patch wasn't linked to this ticket automatically:

because it was originally submitted with LU-0000.

Comment by Gerrit Updater [ 27/Nov/19 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36880
Subject: LU-12988 osd: tune group preallocation
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 564583d0a2ac02281c7264e0eed722c26d3db4d9

Comment by Gerrit Updater [ 28/Nov/19 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36891
Subject: LU-12988 ldiskfs: skip uninitialized groups at cr=0
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ab9189c9fac6e80c55983cd0629ef889ca444576

Comment by Gerrit Updater [ 28/Nov/19 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36893
Subject: LU-12988 ldiskfs: mballoc to prefetch groups
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 17f0691a32427a7806e81f735a3c4e65680b04d7

Comment by Gerrit Updater [ 16/Dec/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36704/
Subject: LU-12988 osd: do not use preallocation during mount
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ae21fce625ec6cd134fa4764683f00bc692132cb

Comment by Artem Blagodarenko (Inactive) [ 25/Dec/19 ]

Probably, rather than uploading the groups into the Lustre code, we could start reading the groups in parallel while the partition is being mounted. I tried this script to accelerate mounting, but didn't get good results:

#!/bin/bash
device=$1
mount_point=$2
dev_name=$3
echo "Script is started"
mount -t lustre $device $mount_point &
mount_pid=$!
echo "Mounting is started ..."
while [ ! -e /proc/fs/ldiskfs/$dev_name/mb_groups ]; do
        echo "Waiting for mb_groups ..."
        sleep 1
done
echo "mb_groups is ready"
cat /proc/fs/ldiskfs/$dev_name/mb_groups >/dev/null 2>&1 &
cat_pid=$!
echo "cat is started ..."
wait $mount_pid
echo "Mounting is done"
wait $cat_pid
echo "Cat is finished"
echo "Script is finished"

Alex, if you believe this can be helpful, could you please check the script (with appropriate modifications) for your case?

Comment by Artem Blagodarenko (Inactive) [ 27/Dec/19 ]

BTW, there is another approach to tuning the block allocator behaviour. I created issue LU-13104 and uploaded a script there. It is not exactly about mount-time block allocator behaviour, but about the options we have to tune the allocator.

Comment by Gerrit Updater [ 07/Jan/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37155
Subject: LU-12988 osd: do not use preallocation during mount
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 984a616268ed12c8c733dae4f9d0f39b159b995d

Comment by Sarah Liu [ 09/Jan/20 ]

Thank you Andreas, I will find another node to run the test.

Comment by Sarah Liu [ 15/Jan/20 ]

set up the env on spirit-3 (MDS) and spirit-4 (OSS)
without the patches, using lustre-master-ib #360, the mount time of an empty system is

[root@spirit-4 ~]# time mount -t lustre -o loop,noinit_itable,force_over_512tb /mnt/xfs/ostfile /mnt/ost/

real    0m5.874s
user    0m0.552s
sys     0m4.116s

with the patches (https://review.whamcloud.com/#/c/37163/), the mount time of an empty system is

[root@spirit-4 ~]# time mount -t lustre -o loop,noinit_itable,force_over_512tb /mnt/xfs/ostfile /mnt/ost

real	0m15.136s
user	0m0.495s
sys	0m13.653s

I have installed the e2fsprogs from https://review.whamcloud.com/37159 and ran the command "debugfs -w -R 'fake_fill_fs 80'" against a master build; it ran for several hours (600T) without any messages and no sign of finishing. Then I stopped it; I haven't tried with a smaller filesystem.

Comment by Alex Zhuravlev [ 16/Jan/20 ]

I think a single system with just ldiskfs (no Lustre) should be enough, it just needs a big enough device.
then.. can you try to fill with fake_fill_fs again and collect vmstat 1 output for a couple of minutes?
and please collect vmstat 1 output for mounting with and without the patches.
like we discussed before, the prefetch is probably too aggressive.

Comment by Alex Zhuravlev [ 16/Jan/20 ]

on my side I'm trying to play with a virtual sparse-file based device.. in the end, I think a non-huge device isn't fatal for verification - essentially we just have to check that mballoc doesn't try to read too many bitmaps for a single allocation. that can be done using in-memory counters.

Comment by Sarah Liu [ 16/Jan/20 ]

here is the vmstat 1 output for mounting without patch, I recorded twice

[root@spirit-4 ~]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 61875204  57488 2008944    0    0    19    27   19   25  0  0 99  0  0
 0  0      0 61875204  57488 2008944    0    0     0     0  115  124  0  0 100  0  0
 0  0      0 61875204  57488 2008944    0    0     0     0  116  188  0  0 100  0  0
 2  0      0 61035732 344640 2135316    0    0 287152    21 4115 4702  0  2 98  0  0
 1  0      0 61451116 343672 2142732    0    0 286237     8 4538 5627  0  2 98  0  0
 1  0      0 61450588 343672 2142732    0    0     0     0 3361  755  0  1 99  0  0
 1  0      0 60749716 343708 2796888    0    0    36     8 2529  899  0  1 99  0  0
 2  0      0 60233744 344372 2796976    0    0   608     9 2607 1834  0  1 99  0  0
 1  0      0 60270068 345408 2796700    0    0  1024    88 1325  404  0  1 99  0  0
 0  0      0 60270736 345408 2796700    0    0     0     0  531  164  0  0 99  0  0
 0  0      0 60270736 345408 2796700    0    0     0     0  138  191  0  0 100  0  0
 0  0      0 60270736 345408 2796700    0    0     0     0   95   95  0  0 100  0  0
 0  0      0 60271240 345416 2796692    0    0     0   290  124  143  0  0 100  0  0
 0  0      0 60271240 345416 2796692    0    0     0     0   90  111  0  0 100  0  0
[root@spirit-4 ~]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 61871648  57532 2008592    0    0    23    26   18   24  0  0 99  0  0
 0  0      0 61872152  57532 2008592    0    0     0     0  128  114  0  0 100  0  0
 0  0      0 61872152  57532 2008592    0    0     0     0   98  105  0  0 100  0  0
 0  0      0 61872152  57540 2008584    0    0     0    44   96  124  0  0 100  0  0
 2  0      0 61265024 344692 2135400    0    0 287152     5 3813 10299  0  2 98  0  0
 1  0      0 61449080 343724 2141280    0    0 293065     8 4278 13637  0  2 97  0  0
 1  0      0 61449080 343724 2141280    0    0     0     0 7401  718  0  1 99  0  0
 1  0      0 60773660 343724 2781000    0    0     0     4 6915  738  0  1 99  0  0
 1  0      0 60171640 344376 2795788    0    0   596    33 4322 2158  0  1 99  0  0
 1  0      0 60267444 345460 2795880    0    0  1072    52 1356  412  0  1 99  0  0
 0  0      0 60267076 345460 2795880    0    0     0     4  943  179  0  1 99  0  0
 0  0      0 60267076 345460 2795880    0    0     0     0   74   84  0  0 100  0  0
 0  0      0 60267076 345460 2795880    0    0     0     0  137  214  0  0 100  0  0
 0  0      0 60267580 345460 2795880    0    0     0     0  125  136  0  0 100  0  0

vmstat 1 for build with patches

[root@spirit-4 ~]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 61921204  57496 1976232    0    0    21    33   21   36  0  0 99  1  0
 0  0      0 61921220  57496 1976232    0    0     0     0   79   86  0  0 100  0  0
 0  0      0 61921228  57496 1976232    0    0     0     0  153  222  0  0 100  0  0
 0  0      0 61921188  57496 1976264    0    0     0     4   50   76  0  0 100  0  0
 0  0      0 61921188  57496 1976264    0    0     0     0  216  326  0  0 100  0  0
 0  0      0 61921312  57496 1976264    0    0     0     0   62   82  0  0 100  0  0
 2  0      0 61272784 344640 2103684    0    0 287136    17 1337 2609  0  1 99  0  0
 2  0      0 61755576  80216 2113748    0    0 23377     8 3220  824  1  2 97  0  0
 1  0      0 61494584 343688 2111568    0    0 262876   104 4211 3659  0  2 98  0  0
 1  0      0 60971776 343688 2608192    0    0     0     4 3026  768  0  1 99  0  0
 2  0      0 60061228 598680 2772604    0    0  1536     0 3129 2462  0  1 99  0  0
 1  0      0 58317612 2293528 2819280    0    0  6620     0 2821 4044  0  1 99  0  0
 1  0      0 56318252 4239048 2874104    0    0  7600    64 2471 4546  0  1 99  0  0
 1  0      0 54333652 6170160 2927156    0    0  7544    12 3213 4520  0  1 99  0  0
 1  0      0 52397664 8053880 2979980    0    0  7360     0 4455 4551  0  1 99  0  0
 1  0      0 50428256 9970412 3033596    0    0  7500     4 2915 4482  0  1 99  0  0
 1  0      0 48428000 11916112 3088112    0    0  7680     0 2470 4528  0  1 99  0  0
 1  0      0 46449880 13840528 3141300    0    0  7516     0 2805 4496  0  1 99  0  0
 1  0      0 44481712 15755016 3194816    0    0  7484     0 2998 4613  0  1 99  0  0
 1  0      0 42483944 17699192 3248924    0    0  7596     4 2441 4539  0  1 99  0  0
 1  0      0 41247444 18654860 3292740    0    0  3776   139 1877 2764  0  1 99  0  0
 1  0      0 41479916 18655428 3293088    0    0   568     4 1375  464  0  1 99  0  0
 0  0      0 41480056 18655428 3292860    0    0     0     4  995  339  0  1 99  0  0
 0  0      0 41480864 18655428 3292856    0    0     0     0   44   77  0  0 100  0  0
 0  0      0 41480864 18655428 3292856    0    0     0     0   78   81  0  0 100  0  0
 0  0      0 41480864 18655428 3292856    0    0     0    28  107  121  0  0 100  0  0
 0  0      0 41480980 18655448 3292836    0    0     0   374  160  260  0  0 100  0  0
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 61921060  57556 1978800    0    0    27    31   21   35  0  0 99  0  0
 0  0      0 61920920  57556 1978800    0    0     0     0   76   93  0  0 100  0  0
 0  0      0 61920920  57556 1978800    0    0     0     0   83   87  0  0 100  0  0
 0  2      0 61917904  58592 1978520    0    0  1036     5  333  430  0  0 100  0  0
 2  0      0 61473176 344760 2104920    0    0 286168     8 2823 4226  1  2 97  0  0
 1  0      0 61496612 343740 2112496    0    0 286185     0 3012 4234  0  2 98  0  0
 1  0      0 61496392 343740 2112828    0    0     0     0 3384  717  0  1 99  0  0
 1  0      0 60807324 343740 2766700    0    0     0     4 3420  859  0  1 99  0  0
 1  0      0 59683340 967620 2783800    0    0  2996    32 4277 3053  0  1 99  0  0
 1  0      0 57865844 2735664 2833728    0    0  6904    48 3143 4228  0  1 99  0  0
 2  0      0 55972644 4577108 2884804    0    0  7196     0 3290 4358  0  1 99  0  0
 1  0      0 54088372 6410568 2936196    0    0  7160     0 6443 4436  0  1 99  0  0
 1  0      0 52091712 8352952 2989924    0    0  7600     0 3994 4557  0  1 99  0  0
 1  0      0 50088892 10301692 3044272    0    0  7616     0 2449 4539  0  1 99  0  0
 2  0      0 48083304 12253436 3099344    0    0  7704     4 2449 4552  0  1 99  0  0
 1  0      0 46080644 14201732 3153072    0    0  7612    13 2512 4685  0  1 99  0  0
 1  0      0 44079644 16148800 3207212    0    0  7608    12 2521 4612  0  1 99  0  0
 1  0      0 42080288 18094156 3261152    0    0  7600     0 3226 4576  0  1 99  0  0
 1  0      0 41481468 18654980 3293928    0    0  2284   143 2036 1935  0  1 98  0  0
 1  0      0 41228908 18655500 3294316    0    0   520     0 1349  507  0  1 99  0  0
 0  0      0 41480456 18655500 3294080    0    0     0     4  708  291  0  0 99  0  0
 1  0      0 41481156 18655500 3294084    0    0     0     0   89  139  0  0 100  0  0
 0  0      0 41481528 18655500 3294084    0    0     0     0  114  137  0  0 100  0  0
 0  0      0 41481676 18655512 3294072    0    0     0    60  182  268  0  0 100  0  0
 0  0      0 41481676 18655520 3294064    0    0     0   282  120  181  0  0 100  0  0
 0  0      0 41482436 18655520 3294088    0    0     0     0   74  123  0  0 100  0  0
 0  0      0 41483460 18655692 3293916    0    0   164    38  274  500  0  0 100  0  0
 0  0      0 41483108 18655692 3294184    0    0     0     0  167  263  0  0 100  0  0
 0  0      0 41483232 18655692 3294184    0    0     0     0   96  142  0  0 100  0  0
 0  0      0 41483232 18655692 3294184    0    0     0     0   74  122  0  0 100  0  0
Comment by Andreas Dilger [ 16/Jan/20 ]

I have installed the e2fsprogs of https://review.whamcloud.com/37159 and ran the command "debugfs -w -R 'fake_fill_fs 80'" against master build, it ran for several hours(600T) without any messages and no sign of finish.

Sarah, can you please paste the output of "dumpe2fs -h" for the filesystem (it doesn't matter if it is before or after the fake_fill command). I want to see the filesystem blocks count to correlate it to the disk IO count and memory usage, and filesystem features enabled. Having the meta_bg feature enabled in the filesystem for the test is important (this should be the default for filesystems over 256TB or 512TB, I can't remember), since this reduces the speed of loading the metadata from disk.
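To connect the blocks count with the IO and memory numbers, one can estimate the block-group count (and hence the number of on-disk block bitmaps mballoc may read) from the "dumpe2fs -h" header. A self-contained sketch; the header values inlined below are illustrative, not from spirit-4:

```shell
# est_groups parses two "dumpe2fs -h" header lines from stdin and prints
# the number of block groups; each group has one block bitmap on disk.
est_groups() {
    awk '/^Block count:/      { blocks = $3 }
         /^Blocks per group:/ { per = $4 }
         END { printf "groups=%d\n", int((blocks + per - 1) / per) }'
}

# illustrative header for a ~1 PiB filesystem with 4 KiB blocks:
printf 'Block count:              268435456000\nBlocks per group:         32768\n' | est_groups
```

On a real system this would be fed by dumpe2fs -h /dev/...; at roughly 8.2M groups and one 4 KiB bitmap each, that is on the order of 31 GiB of bitmaps, which is why reading them one by one dominates mount time.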

I think a single system with just ldiskfs (no Lustre) should be enough, it's just needs big enough device.

Alex, I think we do need to have Lustre mount the filesystem; otherwise nothing tries to do block allocations at mount time by writing out a new copy of the config llog.

It is interesting to see that not only did the amount of IO increase for the patched case (this is expected, due to aggressively reading the block bitmaps), but also the amount of memory used increased significantly (1.8GB for patched vs. 0.3GB for unpatched). That memory increase couldn't be explained by only the extra IO, which is only about 70MB of extra reads.

Comment by Alex Zhuravlev [ 17/Jan/20 ]

I can't access spirit4, waiting for help on that..
this is what I was playing with instead:

Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0      9.7T  9.5T     0 100% /mnt/huge
sparse image with filler:  754M /mnt/large/sparse-loop.img

it takes a few seconds on SSD in KVM using fallocate. unfortunately fallocate doesn't let one control the allocation pattern..

trying https://review.whamcloud.com/#/c/37159/ now..
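For reference, a sparse-file-backed loop device like the one above can be set up roughly as follows (sizes, paths, and mkfs options are illustrative; the real target would be formatted with Lustre tooling rather than plain mkfs.ext4):

```shell
# back a multi-TB "device" with a sparse file: the image consumes almost
# no real space until blocks are actually written.
truncate -s 10T /mnt/large/sparse-loop.img
dev=$(losetup --find --show /mnt/large/sparse-loop.img)
mkfs.ext4 -O uninit_bg "$dev"
mount "$dev" /mnt/huge
df -h /mnt/huge    # reports ~10T while the backing file stays small
```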

Comment by Alex Zhuravlev [ 17/Jan/20 ]

I can confirm that the patch is not something we can really use..

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 5186808 172636 280044    0    0     0     0  126   30 50  1 50  0  0
 1  0      0 5181436 172636 280044    0    0     0     0  121   23 50  0 50  0  0
 1  0      0 5175848 172636 280044    0    0     0    28  125   27 50  0 50  0  0
 1  0      0 5170252 172636 280048    0    0     0     0  121   25 50  0 50  0  1
 1  0      0 5164456 172636 280048    0    0     0     0  120   21 50  0 50  0  0
 1  0      0 5158752 172636 280048    0    0     0     0  119   21 50  0 50  0  0
 1  0      0 5153312 172636 280048    0    0     0     0  121   25 50  0 50  0  0
 1  0      0 5147732 172636 280048    0    0     0     0  120   25 50  0 50  0  0
 1  0      0 5142308 172636 280048    0    0     0     0  125   27 50  0 50  0  0
 1  0      0 5136604 172636 280048    0    0     0     0  118   25 50  0 50  0  0
 1  0      0 5131296 172636 280048    0    0     0     0  134   38 50  0 50  0  0
 1  0      0 5125832 172636 280048    0    0     0     0  121   25 49  1 50  0  0

so it's CPU bound, doing almost no IO and progressing very slowly.. will try to modify it a bit

Comment by Alex Zhuravlev [ 17/Jan/20 ]

so I rewrote the debugfs patch a bit, which saved just a minute (3m57s before, 2m51s after).. but still, it's a 10TB filesystem created with the uninit_bg option:

# df -h /mnt/huge/
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0      9.7T  4.9T  4.4T  53% /mnt/huge

notice that filling to 97% takes 40s with fallocate:

/dev/loop0      9.7T  9.5T     0 100% /mnt/huge
Comment by Alex Zhuravlev [ 18/Jan/20 ]

tried with a somewhat larger filesystem (using xfs to store the image):

Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0      128T   24K  122T   1% /mnt/huge
fresh sparse image:  2.1G /mnt/large/sparse-loop.img

but debugfs's fake_fill_fs crashed..

Comment by Alex Zhuravlev [ 19/Jan/20 ]

got another version of the fake_fill_fs patch:

Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0       63T   50T  9.3T  85% /mnt/huge
# du -hs /mnt/large/sparse-loop.img 
2.8G	/mnt/large/sparse-loop.img
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                         
 3552 root      20   0 7821172   7.5g   2112 R 100.0  38.2   2:05.89 debugfs

debugfs -w -R 'fake_fill_fs 80' took 2m12.748s, which extrapolates to ~35 min for a 1PB fs, given the sparse image is stored on SSD (NVMe in my case)

Comment by Alex Zhuravlev [ 19/Jan/20 ]

the updated patch has been attached to this ticket.

Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0      125T  100T   19T  85% /mnt/huge
fresh sparse image:  5.7G /mnt/large/sparse-loop.img
real	4m26.627s

can't test with a larger fs due to limited RAM

Comment by Alex Zhuravlev [ 19/Jan/20 ]

now some results for the mballoc patches..

so with the fs filled as above (via debugfs), we basically get a very fragmented filesystem (20 free blocks followed by 80 busy blocks):

# time dd if=/dev/zero of=/mnt/huge/f11 bs=8k count=1
1+0 records in
1+0 records out
8192 bytes (8.2 kB, 8.0 KiB) copied, 0.521538 s, 15.7 kB/s
real	0m0.524s

and extra debugging from mballoc:

[ 5762.522831] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 2174 pref [ 166911 90072 384 ]

i.e. mballoc requested 1 block, set 512 as goal, prefetched 2174 bitmaps, then found 201 extents and preallocated 20 blocks.
it took 166911 usec to issue the IO to prefetch those groups (from SSD) and to skip all uninitialized groups at cr=0.
then it took 90072 usec to skip all uninitialized groups at cr=1.
and then a few cycles to scan one group and return something.

so that shouldn't get stuck at Lustre mount..
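For what it's worth, the AC: lines can be decoded mechanically using the field layout described above (the layout comes from the interpretation in this comment, not from any stable format, so treat the field positions as assumptions):

```shell
# decode_ac turns one mballoc "AC:" debug line into named fields:
# blocks requested, goal, blocks preallocated, extents found, groups
# prefetched, and the cr=0/cr=1 scan times in usec.
decode_ac() {
    sed 's/.*AC: //' | awk '{
        printf "orig=%s goal=%s best=%s found=%s prefetched=%s cr0_us=%s cr1_us=%s\n",
               $1, $3, $5, $7, $11, $14, $15
    }'
}

echo '[ 5762.522831] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 2174 pref [ 166911 90072 384 ]' | decode_ac
```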

but I think this level of fragmentation exposes another problem very well. say all groups have finally been initialized.
now we try to write 8MB:

# time dd if=/dev/zero of=/mnt/huge/f10 bs=8M count=1
1+0 records in
1+0 records out
8388608 bytes (8.4 MB, 8.0 MiB) copied, 11.4156 s, 735 kB/s
real	0m11.418s

notice it's 11s ..

[ 5541.664107] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 0 pref [ 76235 73909 13 ]
[ 5541.814086] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 0 pref [ 75747 73771 13 ]
[ 5541.964049] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 0 pref [ 75727 73776 12 ]
[ 5542.114082] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 0 pref [ 75681 73883 13 ]
[ 5542.269864] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 0 pref [ 75796 79530 13 ]
[ 5542.420171] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 0 pref [ 75875 73870 13 ]

i.e. it scans all groups at cr=0 and cr=1, taking ~75000 usec each, and that repeats 256 times - 2048 blocks in 256 allocations, each finding 20 blocks, as this is the largest chunk we can allocate from this filesystem.
that's a clear sign that mballoc should be able to exclude groups from checking entirely, based on some fragmentation criteria - e.g. a few lists containing groups bucketed by fragmentation.
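The "few lists" idea can be sketched as a toy classifier: key each group by the average size of its free extents, so an allocation only walks the list that can plausibly satisfy it. The input format (group, free blocks, free extents) and the thresholds below are hypothetical, just to show the shape of the idea:

```shell
# bucket_groups reads "group free_blocks free_extents" lines and assigns
# each group to a list by average free-extent size (in blocks). A real
# implementation would live in mballoc and keep the lists updated as
# groups change.
bucket_groups() {
    awk '{
        avg = ($3 > 0) ? int($2 / $3) : 0
        if      (avg >= 2048) list = "large"
        else if (avg >=   64) list = "medium"
        else                  list = "small"
        print $1, list
    }'
}

# hypothetical groups: nearly contiguous, badly fragmented, in between
printf '0 32768 1\n1 2000 100\n2 6400 50\n' | bucket_groups
```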

Comment by Andreas Dilger [ 19/Jan/20 ]

Are these results with or without Artem's patch https://review.whamcloud.com/35180 "LU-12103 ldiskfs: don't search large block range if disk full"? That would skip cr=0 and cr=1 immediately because of filesystem fullness. That may not be ideal for a real-world filesystem because the 80-20 block full-free pattern is unlikely to be seen in real life, but at the same time it would also find the groups that have free chunks and allocate more there.

The one improvement that would be possible with Artem's patch would be to not make it strictly based on filesystem fullness, but also using cX_failed to decide when the filesystem is too fragmented.

Comment by Alex Zhuravlev [ 19/Jan/20 ]

Artem's patch wasn't used. I'm thinking of a few policies.. will try to write up an overview soon.

Comment by Artem Blagodarenko (Inactive) [ 21/Jan/20 ]

bzzz, do you have results comparing mount time with https://review.whamcloud.com/#/c/36704/ and without it? Please share them if you do.

Comment by Alex Zhuravlev [ 27/Jan/20 ]

echo "0 134217728000 delay /dev/loop0 0 5" | dmsetup create dm-slow

so now:

# dd if=/dev/loop0 of=/dev/null bs=4k count=12800
12800+0 records in
12800+0 records out
52428800 bytes (52 MB, 50 MiB) copied, 0.0662961 s, 791 MB/s
# dd if=/dev/mapper/dm-slow of=/dev/null bs=4k count=12800
12800+0 records in
12800+0 records out
52428800 bytes (52 MB, 50 MiB) copied, 4.05419 s, 12.9 MB/s

trying to start Lustre on this virtual device..

Comment by Alex Zhuravlev [ 27/Jan/20 ]

on 85% filled fs using "slow" devmapper with 5ms IO delay:

/dev/mapper/dm-slow                                         124T  100T   18T  85% /mnt/ost2

tried to mount with non-patched ldiskfs, but interrupted it after about half an hour:

# time mount -t lustre /dev/mapper/dm-slow /mnt/ost2
^C

real	27m42.190s
user	0m0.000s
sys	0m0.002s

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  1      0 13311820 398560 6025368    0    0   200     0   67  332  0  0 50 50  0
 0  1      0 13311200 398760 6025680    0    0   200     0   65  327  0  0 50 50  1
 0  1      0 13310456 398960 6026220    0    0   200     0   66  338  0  0 50 50  0
 0  1      0 13309836 399160 6026620    0    0   200     0  162  426  0  0 50 50  0
 0  1      0 13309092 399360 6027036    0    0   200     0  165  422  0  0 50 50  1
 0  1      0 13308472 399560 6027428    0    0   200     0  169  434  0  0 50 50  0
 0  1      0 13308100 399760 6027912    0    0   200     0  164  421  0  0 50 50  0
 0  1      0 13307356 399960 6028240    0    0   200     0  168  439  0  0 50 50  0

according to vmstat, the mount was going to take ~11 hours.

now with patched ldiskfs with cache wiped by echo 3 >/proc/sys/vm/drop_caches:

# time mount -t lustre /dev/mapper/dm-slow /mnt/ost2

real	0m20.364s
user	0m0.192s
sys	0m1.103s
Comment by Andreas Dilger [ 27/Jan/20 ]

Alex, which patch(es) were used for your test? Can you please update the commit message for your patch with these test results so we can get them landed.

Comment by Alex Zhuravlev [ 27/Jan/20 ]

both https://review.whamcloud.com/#/c/36891/ and https://review.whamcloud.com/#/c/36893/ were used; both need a refresh, working on this..

Comment by Andreas Dilger [ 27/Jan/20 ]

Do you also have Artem's mballoc patch applied? If not, should it be reverted from master, or do you think it makes sense to keep it, but with lower thresholds (e.g. 15%, 10%, 5% or whatever)?

Comment by Alex Zhuravlev [ 27/Jan/20 ]

Do you also have Artem's mballoc patch applied?

yes, AFAICS. and that didn't help, probably the fullness wasn't high enough. I'll check tomorrow.

Comment by Gerrit Updater [ 27/Jan/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37155/
Subject: LU-12988 osd: do not use preallocation during mount
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 2331cd1fa178b348d8aa048abbb5160ac9353461

Comment by Artem Blagodarenko (Inactive) [ 28/Jan/20 ]

bzzz I hope I am wrong...

I also got similarly optimistic results with my "adjusting table" script, but then I realized that only the FIRST mount is slow. I recreated the partition and then mount became slow again, even with my optimizations. bzzz, could you check your performance improvement on a freshly-created filesystem?

Comment by Alex Zhuravlev [ 28/Jan/20 ]

hmm, I did remount a few times.. a newly-created filesystem shouldn't demonstrate any degradation even without the patches, as a perfect group would be found immediately?

Comment by Gerrit Updater [ 08/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36891/
Subject: LU-12988 ldiskfs: skip non-loaded groups at cr=0/1
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6a7a700a1490dfde6b60c2fb36df92a052059866

Comment by Gerrit Updater [ 10/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36893/
Subject: LU-12988 ldiskfs: mballoc to prefetch groups
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 05f31782be20fc4c46082dba02c10bcea59539e3

Comment by Peter Jones [ 11/Feb/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 11/Feb/20 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37539
Subject: LU-12988 ldiskfs: skip non-loaded groups at cr=0/1
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: a3febe05e566ab9dc81312fa441e49cddbd20eec

Comment by Gerrit Updater [ 11/Feb/20 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37540
Subject: LU-12988 ldiskfs: mballoc to prefetch groups
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: d5735e733c84c18b632e26f482bb2663d34f2851

Comment by Gerrit Updater [ 19/Feb/20 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37619
Subject: LU-12988 ldiskfs: revert prefetch patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 353e0756371f92a7f0f2da80b2a803b278126bec

Comment by Gerrit Updater [ 20/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37619/
Subject: LU-12988 ldiskfs: revert prefetch patch
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2c5700fcb4cb15056dc901fedf97001d9b9fd845

Comment by Gerrit Updater [ 20/Feb/20 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37633
Subject: LU-12988 ldiskfs: mballoc to prefetch groups
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d8e16c01a5d2d4b6448c3a766bf83ade35268f9d

Comment by Shuichi Ihara [ 20/Feb/20 ]

Although the prefetch patch was just reverted by https://review.whamcloud.com/#/c/37619 for another reason, that patch also caused a big performance regression on a large OST (280TB).
during IOR (FPP, 1MB) there were only a few IOs and then a stall, then a few IOs and a stall again, as shown below. Is the new revised patch aware of this?

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 45456188 9647936 1017916    0    0    15 10610   27   10  0  2 98  0  0
 0  0      0 45456312 9647936 1017916    0    0     0     4  205  200  0  0 100  0  0
 0  0      0 45456312 9647936 1017916    0    0     0     4  150  142  0  0 100  0  0
 0  0      0 45456312 9647936 1017916    0    0     0     4  131  113  0  0 100  0  0
 0  0      0 45456312 9647936 1017916    0    0     0     4  269  360  0  0 100  0  0
 0  0      0 45456288 9647936 1017916    0    0     0     0  370  527  0  0 100  0  0
 1  1      0 45440948 9648208 1018852    0    0     0 1955144 19798 24610  0  5 93  2  0
 9  1      0 45375612 9648208 1018852    0    0     0 3232300 33648 35086  0 27 67  6  0
 9  0      0 45379668 9648208 1018852    0    0     0  1868 14431 8833  0 50 46  3  0
 9  0      0 45379460 9648216 1018844    0    0     0    16 8694  844  0 50 50  0  0
 9  0      0 45379460 9648216 1018852    0    0     0     0 8641  800  0 50 50  0  0
 9  0      0 45379460 9648224 1018852    0    0     0  1396 9604  838  0 50 50  0  0
 8  0      0 45379460 9648224 1018852    0    0     0 16388 12385  909  0 47 53  0  0
 2  0      0 45380196 9648224 1018856    0    0     0 79060 11568 1360  0 41 59  0  0
 6  0      0 45378456 9648232 1018852    0    0     0 640900 14445 7692  0 30 68  2  0
 6  0      0 45378784 9648232 1018856    0    0     0     0 7422  751  0 38 62  0  0
 6  0      0 45378784 9648232 1018856    0    0     0     4 7304  717  0 38 63  0  0
 4  0      0 45378784 9648232 1018856    0    0     0 61820 7182 1135  0 35 65  0  0
 7  0      0 45378832 9648240 1018860    0    0     0 692700 10136 5355  0 27 72  1  0
 7  0      0 45378832 9648240 1018860    0    0     0   284 10282  844  0 38 62  0  0
Comment by Alex Zhuravlev [ 20/Feb/20 ]

thanks for the report. did you get performance back to normal with the mballoc-prefetch patch reverted? or is it still below expectations?

Comment by Shuichi Ihara [ 23/Feb/20 ]

Alex, please see LU-13290; there is another performance regression in ldiskfs. I think there are two major performance regressions in ldiskfs. Although the patch reverted by https://review.whamcloud.com/#/c/37619 had odd behaviors, we also need to fix the other regression, which is caused by LU-12988.

Comment by Gerrit Updater [ 02/Mar/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37633/
Subject: LU-12988 ldiskfs: mballoc to prefetch groups
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b7cd65a3d1d665f1bee5eb8ad3b989b12be7de08

Comment by Gerrit Updater [ 05/Mar/20 ]

Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37811
Subject: LU-12988 ldiskfs: port ext4-mballoc-prefetch.patch to RHEL 8.1
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 89fd17997924aafb78ab0cc24debaf12b0e17b87

Comment by Gerrit Updater [ 17/Mar/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37811/
Subject: LU-12988 ldiskfs: port ext4-mballoc-prefetch.patch to RHEL 8.1
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 896e12c2e4fc98cbc15c675ec2894e9511aa92a7

Comment by Gerrit Updater [ 14/Apr/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37539/
Subject: LU-12988 ldiskfs: skip non-loaded groups at cr=0/1
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: de994667dda925109e862edadb4aa4feaecd0e6b

Generated at Sat Feb 10 02:57:23 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.