Lustre / LU-12988

improve mount time on huge ldiskfs filesystem

Details


    Description

      During Lustre server startup a few small files need to be updated (e.g. the config backup).
      At this point the buddy/bitmap cache is empty, but mballoc wants to find a big chunk of free space for group preallocation and reads the bitmaps one by one.
      Sometimes this can take a very long time.
      One possible workaround is to disable preallocation during mount.
      The long-term plan is to limit scanning and to prefetch bitmaps, but that needs more effort and will be tracked separately (LU-12970).
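
      As an illustration of the workaround, a minimal sketch from userspace, assuming the ldiskfs sysfs tree mirrors ext4's mb_group_prealloc tunable (the in-kernel change is still needed because this knob only appears once the filesystem is mounted):

        # device name is illustrative; substitute the actual ldiskfs target
        dev=dm-slow
        # group preallocation goal in blocks; 512 is the ext4 default
        cat /sys/fs/ldiskfs/$dev/mb_group_prealloc
        # assumption: writing 0 switches group preallocation off, as in ext4
        echo 0 > /sys/fs/ldiskfs/$dev/mb_group_prealloc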

          Activity


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37155/
            Subject: LU-12988 osd: do not use preallocation during mount
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 2331cd1fa178b348d8aa048abbb5160ac9353461

            gerrit Gerrit Updater added a comment

            Do you also have Artem's mballoc patch applied?

            yes, AFAICS, and that didn't help; probably the fullness wasn't high enough. I'll check tomorrow.

            bzzz Alex Zhuravlev added a comment

            Do you also have Artem's mballoc patch applied? If not, should that be reverted from master, or do you think it makes sense to keep, but with lower thresholds (e.g. 15%, 10%, 5% or whatever)?

            adilger Andreas Dilger added a comment

            both https://review.whamcloud.com/#/c/36891/ and https://review.whamcloud.com/#/c/36893/ were used and need a refresh, working on this..

            bzzz Alex Zhuravlev added a comment

            Alex, which patch(es) were used for your test? Can you please update the commit message for your patch with these test results so we can get them landed.

            adilger Andreas Dilger added a comment

            On an 85% filled fs, using a "slow" devmapper target with 5ms IO delay:

            /dev/mapper/dm-slow                                         124T  100T   18T  85% /mnt/ost2
            

            Tried to mount with non-patched ldiskfs, but interrupted it after about half an hour:

            # time mount -t lustre /dev/mapper/dm-slow /mnt/ost2
            ^C
            
            real	27m42.190s
            user	0m0.000s
            sys	0m0.002s
            
            procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
             r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
             1  1      0 13311820 398560 6025368    0    0   200     0   67  332  0  0 50 50  0
             0  1      0 13311200 398760 6025680    0    0   200     0   65  327  0  0 50 50  1
             0  1      0 13310456 398960 6026220    0    0   200     0   66  338  0  0 50 50  0
             0  1      0 13309836 399160 6026620    0    0   200     0  162  426  0  0 50 50  0
             0  1      0 13309092 399360 6027036    0    0   200     0  165  422  0  0 50 50  1
             0  1      0 13308472 399560 6027428    0    0   200     0  169  434  0  0 50 50  0
             0  1      0 13308100 399760 6027912    0    0   200     0  164  421  0  0 50 50  0
             0  1      0 13307356 399960 6028240    0    0   200     0  168  439  0  0 50 50  0
            

            According to vmstat, the mount was going to take ~11 hours.
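
            A rough cross-check of that estimate (a sketch: it counts the block groups with dumpe2fs and assumes one bitmap read per group at the raw dm-delay latency, so it is a lower bound that ignores group descriptor reads and buddy initialization):

            dev=/dev/mapper/dm-slow
            lat_ms=5        # raw dm-delay latency; the effective per-read cost seen by vmstat is higher
            groups=$(dumpe2fs -h "$dev" 2>/dev/null | awk '
                    /^Block count:/      { blocks = $3 }
                    /^Blocks per group:/ { bpg = $4 }
                    END { print int((blocks + bpg - 1) / bpg) }')
            echo "$groups block groups -> at least $((groups * lat_ms / 1000 / 60)) minutes to read the bitmaps one by one"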

            Now with patched ldiskfs, and the cache wiped by echo 3 >/proc/sys/vm/drop_caches:

            # time mount -t lustre /dev/mapper/dm-slow /mnt/ost2
            
            real	0m20.364s
            user	0m0.192s
            sys	0m1.103s
            
            bzzz Alex Zhuravlev added a comment
            echo "0 134217728000  delay /dev/loop0 0 5" | dmsetup create dm-slow
            

            so now:

            # dd if=/dev/loop0 of=/dev/null bs=4k count=12800
            12800+0 records in
            12800+0 records out
            52428800 bytes (52 MB, 50 MiB) copied, 0.0662961 s, 791 MB/s
            # dd if=/dev/mapper/dm-slow of=/dev/null bs=4k count=12800
            12800+0 records in
            12800+0 records out
            52428800 bytes (52 MB, 50 MiB) copied, 4.05419 s, 12.9 MB/s
            

            trying to start Lustre on this virtual device..
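
            For completeness, a sketch of how the /dev/loop0 backing device above could have been set up (the file name and size are illustrative, not taken from this ticket):

            # sparse backing file, attached to a loop device, then wrapped with dm-delay
            truncate -s 64T /tmp/ldiskfs-backing.img
            losetup /dev/loop0 /tmp/ldiskfs-backing.img
            # dm-delay table: <start> <length in 512-byte sectors> delay <dev> <offset> <delay ms>
            echo "0 $(blockdev --getsz /dev/loop0) delay /dev/loop0 0 5" | dmsetup create dm-slow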

            bzzz Alex Zhuravlev added a comment

            bzzz, do you have results comparing mount time with https://review.whamcloud.com/#/c/36704/ and without it? Please share them if you have.

            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment

            Artem's patch wasn't used. I'm thinking of a few policies.. will try to make an overview soon.

            bzzz Alex Zhuravlev added a comment

            Are these results with or without Artem's patch https://review.whamcloud.com/35180 "LU-12103 ldiskfs: don't search large block range if disk full"? That would skip cr=0 and cr=1 immediately because of filesystem fullness. That may not be ideal for a real-world filesystem because the 80-20 block full-free pattern is unlikely to be seen in real life, but at the same time it would also find the groups that have free chunks and allocate more there.

            One improvement that would be possible with Artem's patch would be to not make it strictly based on filesystem fullness, but to also use cX_failed to decide when the filesystem is too fragmented.

            adilger Andreas Dilger added a comment
            bzzz Alex Zhuravlev added a comment - edited

            Now some results for the mballoc patches..

            So with the fs filled as above (via debugfs), where basically we get a very fragmented filesystem (20 free blocks followed by 80 busy blocks):

            # time dd if=/dev/zero of=/mnt/huge/f11 bs=8k count=1
            1+0 records in
            1+0 records out
            8192 bytes (8.2 kB, 8.0 KiB) copied, 0.521538 s, 15.7 kB/s
            real	0m0.524s
            

            and extra debugging from mballoc:

            [ 5762.522831] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 2174 pref [ 166911 90072 384 ]
            

            i.e. mballoc requested 1 block, set 512 as goal, prefetched 2174 bitmaps, then found 201 extents and preallocated 20 blocks.
            it took 166 usec to issue IO to prefetch those groups (from SSD) and to skip all uninitialized groups at cr=0.
            then it took 90 usec to skip all uninitialized groups at cr=1.
            and then a few cycles to scan one group and return something.
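
            Assuming the three bracketed fields are per-pass times in nanoseconds (which matches the usec figures above) and come from the extra debugging patch rather than stock mballoc, a gawk one-liner like this can summarize them from dmesg:

            dmesg | gawk '/AC:/ && match($0, /\[ ([0-9]+) ([0-9]+) ([0-9]+) \]/, t) {
                    printf "cr0 %d us, cr1 %d us, later %d us\n", t[1]/1000, t[2]/1000, t[3]/1000
            }'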

            so that shouldn't get stuck at Lustre mount..

            But I think this level of fragmentation exposes another problem very well. Say all groups have finally been initialized.
            Now we try to write 8MB:

            # time dd if=/dev/zero of=/mnt/huge/f10 bs=8M count=1
            1+0 records in
            1+0 records out
            8388608 bytes (8.4 MB, 8.0 MiB) copied, 11.4156 s, 735 kB/s
            real	0m11.418s
            

            notice it's 11s ..

            [ 5541.664107] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 0 pref [ 76235 73909 13 ]
            [ 5541.814086] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 0 pref [ 75747 73771 13 ]
            [ 5541.964049] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 0 pref [ 75727 73776 12 ]
            [ 5542.114082] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 0 pref [ 75681 73883 13 ]
            [ 5542.269864] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 0 pref [ 75796 79530 13 ]
            [ 5542.420171] AC: 1 orig, 512 goal, 20 best, 201 found @ 2 0 pref [ 75875 73870 13 ]
            

            i.e. it scans all groups at cr=0 and cr=1, taking ~75 usec each, and that repeats 256 times - 2048 blocks in 256 allocations, each finding 20 blocks, as this is the largest chunk we can allocate from this filesystem.
            That's a clear sign that mballoc should be able to exclude groups from checking entirely, based on some fragmentation criteria - e.g. a few lists containing groups.
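
            A quick way to confirm the ~20-block extent ceiling from userspace is filefrag from e2fsprogs; the path matches the dd run above, and the expected extent size is an assumption based on the numbers here:

            # each extent should be on the order of 20 blocks (80 KiB) on this filesystem
            filefrag -v /mnt/huge/f10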


            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: bzzz Alex Zhuravlev