Lustre / LU-12988

improve mount time on huge ldiskfs filesystem

Details


    Description

      During Lustre server startup a few small files need to be updated (e.g. the config backup).
      At this point the buddy/bitmap cache is empty, but mballoc wants to find a big chunk of free space for group preallocation and reads the bitmaps one by one.
      Sometimes this can take a very long time.
      One possible workaround is to disable preallocation during mount.
      The long-term plan is to limit scanning and to prefetch bitmaps, but that needs more effort and will be tracked separately (LU-12970).
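
      As an illustration of the workaround, a minimal sketch from userspace, assuming the ldiskfs sysfs tree mirrors ext4's mb_group_prealloc tunable (the in-kernel change is still needed because this knob only appears once the filesystem is mounted):

        # device name is illustrative; substitute the actual ldiskfs target
        dev=dm-slow
        # group preallocation goal in blocks; 512 is the ext4 default
        cat /sys/fs/ldiskfs/$dev/mb_group_prealloc
        # assumption: writing 0 switches group preallocation off, as in ext4
        echo 0 > /sys/fs/ldiskfs/$dev/mb_group_prealloc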

          Activity


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37155/
            Subject: LU-12988 osd: do not use preallocation during mount
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 2331cd1fa178b348d8aa048abbb5160ac9353461

            gerrit Gerrit Updater added a comment

            Do you also have Artem's mballoc patch applied?

            yes, AFAICS, and that didn't help; probably the fullness wasn't high enough. I'll check tomorrow.

            bzzz Alex Zhuravlev added a comment

            Do you also have Artem's mballoc patch applied? If not, should that be reverted from master, or do you think it makes sense to keep, but with lower thresholds (e.g. 15%, 10%, 5% or whatever)?

            adilger Andreas Dilger added a comment

            both https://review.whamcloud.com/#/c/36891/ and https://review.whamcloud.com/#/c/36893/ were used and need a refresh, working on this..

            bzzz Alex Zhuravlev added a comment

            Alex, which patch(es) were used for your test? Can you please update the commit message for your patch with these test results so we can get them landed.

            adilger Andreas Dilger added a comment

            On an 85% filled fs, using a "slow" devmapper target with 5ms IO delay:

            /dev/mapper/dm-slow                                         124T  100T   18T  85% /mnt/ost2
            

            Tried to mount with non-patched ldiskfs, but interrupted it after about half an hour:

            # time mount -t lustre /dev/mapper/dm-slow /mnt/ost2
            ^C
            
            real	27m42.190s
            user	0m0.000s
            sys	0m0.002s
            
            procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
             r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
             1  1      0 13311820 398560 6025368    0    0   200     0   67  332  0  0 50 50  0
             0  1      0 13311200 398760 6025680    0    0   200     0   65  327  0  0 50 50  1
             0  1      0 13310456 398960 6026220    0    0   200     0   66  338  0  0 50 50  0
             0  1      0 13309836 399160 6026620    0    0   200     0  162  426  0  0 50 50  0
             0  1      0 13309092 399360 6027036    0    0   200     0  165  422  0  0 50 50  1
             0  1      0 13308472 399560 6027428    0    0   200     0  169  434  0  0 50 50  0
             0  1      0 13308100 399760 6027912    0    0   200     0  164  421  0  0 50 50  0
             0  1      0 13307356 399960 6028240    0    0   200     0  168  439  0  0 50 50  0
            

            According to vmstat, the mount was going to take ~11 hours.
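
            A rough cross-check of that estimate (a sketch: it counts the block groups with dumpe2fs and assumes one bitmap read per group at the raw dm-delay latency, so it is a lower bound that ignores group descriptor reads and buddy initialization):

            dev=/dev/mapper/dm-slow
            lat_ms=5        # raw dm-delay latency; the effective per-read cost seen by vmstat is higher
            groups=$(dumpe2fs -h "$dev" 2>/dev/null | awk '
                    /^Block count:/      { blocks = $3 }
                    /^Blocks per group:/ { bpg = $4 }
                    END { print int((blocks + bpg - 1) / bpg) }')
            echo "$groups block groups -> at least $((groups * lat_ms / 1000 / 60)) minutes to read the bitmaps one by one"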

            Now with patched ldiskfs, and the cache wiped by echo 3 >/proc/sys/vm/drop_caches:

            # time mount -t lustre /dev/mapper/dm-slow /mnt/ost2
            
            real	0m20.364s
            user	0m0.192s
            sys	0m1.103s
            
            bzzz Alex Zhuravlev added a comment
            echo "0 134217728000  delay /dev/loop0 0 5" | dmsetup create dm-slow
            

            so now:

            # dd if=/dev/loop0 of=/dev/null bs=4k count=12800
            12800+0 records in
            12800+0 records out
            52428800 bytes (52 MB, 50 MiB) copied, 0.0662961 s, 791 MB/s
            # dd if=/dev/mapper/dm-slow of=/dev/null bs=4k count=12800
            12800+0 records in
            12800+0 records out
            52428800 bytes (52 MB, 50 MiB) copied, 4.05419 s, 12.9 MB/s
            

            trying to start Lustre on this virtual device..
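
            For completeness, a sketch of how the /dev/loop0 backing device above could have been set up (the file name and size are illustrative, not taken from this ticket):

            # sparse backing file, attached to a loop device, then wrapped with dm-delay
            truncate -s 64T /tmp/ldiskfs-backing.img
            losetup /dev/loop0 /tmp/ldiskfs-backing.img
            # dm-delay table: <start> <length in 512-byte sectors> delay <dev> <offset> <delay ms>
            echo "0 $(blockdev --getsz /dev/loop0) delay /dev/loop0 0 5" | dmsetup create dm-slow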

            bzzz Alex Zhuravlev added a comment

            bzzz, do you have results comparing mount time with https://review.whamcloud.com/#/c/36704/ and without it? Please share them if you have.

            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment

            Artem's patch wasn't used. I'm thinking of a few policies.. will try to make an overview soon.

            bzzz Alex Zhuravlev added a comment

            Are these results with or without Artem's patch https://review.whamcloud.com/35180 "LU-12103 ldiskfs: don't search large block range if disk full"? That would skip cr=0 and cr=1 immediately because of filesystem fullness. That may not be ideal for a real-world filesystem because the 80-20 block full-free pattern is unlikely to be seen in real life, but at the same time it would also find the groups that have free chunks and allocate more there.

            One improvement that would be possible with Artem's patch would be to not make it strictly based on filesystem fullness, but to also use cX_failed to decide when the filesystem is too fragmented.

            adilger Andreas Dilger added a comment
            bzzz Alex Zhuravlev added a comment - edited

            Now some results for the mballoc patches..

            So with the fs filled as above (via debugfs), where basically we get a very fragmented filesystem (20 free blocks followed by 80 busy blocks):

            # time dd if=/dev/zero of=/mnt/huge/f11 bs=8k count=1
            1+0 records in
            1+0 records out
            8192 bytes (8.2 kB, 8.0 KiB) copied, 0.521538 s, 15.7 kB/s
            real	0m0.524s
            

            and extra debugging from mballoc:

            [ 5762.522831] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 2174 pref [ 166911 90072 384 ]
            

            i.e. mballoc requested 1 block, set 512 as goal, prefetched 2174 bitmaps, then found 201 extents and preallocated 20 blocks.
            it took 166 usec to issue IO to prefetch those groups (from SSD) and to skip all uninitialized groups at cr=0.
            then it took 90 usec to skip all uninitialized groups at cr=1.
            and then a few cycles to scan one group and return something.
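
            Assuming the three bracketed fields are per-pass times in nanoseconds (which matches the usec figures above) and come from the extra debugging patch rather than stock mballoc, a gawk one-liner like this can summarize them from dmesg:

            dmesg | gawk '/AC:/ && match($0, /\[ ([0-9]+) ([0-9]+) ([0-9]+) \]/, t) {
                    printf "cr0 %d us, cr1 %d us, later %d us\n", t[1]/1000, t[2]/1000, t[3]/1000
            }'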

            so that shouldn't get stuck at Lustre mount..

            But I think this level of fragmentation exposes another problem very well. Say all groups have finally been initialized.
            Now we try to write 8MB:

            # time dd if=/dev/zero of=/mnt/huge/f10 bs=8M count=1
            1+0 records in
            1+0 records out
            8388608 bytes (8.4 MB, 8.0 MiB) copied, 11.4156 s, 735 kB/s
            real	0m11.418s
            

            notice it's 11s ..

            [ 5541.664107] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 0 pref [ 76235 73909 13 ]
            [ 5541.814086] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 0 pref [ 75747 73771 13 ]
            [ 5541.964049] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 0 pref [ 75727 73776 12 ]
            [ 5542.114082] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 0 pref [ 75681 73883 13 ]
            [ 5542.269864] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 0 pref [ 75796 79530 13 ]
            [ 5542.420171] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 0 pref [ 75875 73870 13 ]
            

            i.e. it scans all groups at cr=0 and cr=1, taking ~75 usec each, and that repeats 256 times - 2048 blocks in 256 allocations, each finding 20 blocks, as this is the largest chunk we can allocate from this filesystem.
            That's a clear sign that mballoc should be able to exclude groups from checking entirely, based on some fragmentation criteria - e.g. a few lists containing groups.
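
            A quick way to confirm the ~20-block extent ceiling from userspace is filefrag from e2fsprogs; the path matches the dd run above, and the expected extent size is an assumption based on the numbers here:

            # each extent should be on the order of 20 blocks (80 KiB) on this filesystem
            filefrag -v /mnt/huge/f10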


            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: bzzz Alex Zhuravlev