add an interface to load ldiskfs block bitmaps

Details

    • Improvement
    • Resolution: Won't Fix
    • Minor

    Description

      During our benchmarking and testing, we found that write performance is sometimes not stable enough: small reads issued during writes can reduce write throughput.

      It turned out that loading block bitmaps adds latency here. Also, on a heavily fragmented filesystem, we might need to load many bitmaps to find free blocks.

      To improve this situation, we have a patch that loads the block bitmaps into memory and pins them there until unmount, or until we release the memory on purpose. This could stabilize write performance and improve performance on a heavily fragmented filesystem.

          Activity

            [LU-10946] add an interface to load ldiskfs block bitmaps

            Hi Nathan Dauchy,

            Sorry for the late patch. I just pushed a simple version; could you help test whether it helps your performance case?

            echo 1 > /sys/fs/ldiskfs/vdb/loadbbitmaps # load block bitmaps only, without pinning
            echo 2 > /sys/fs/ldiskfs/vdb/loadbbitmaps # pin block bitmaps in memory
            echo 0 > /sys/fs/ldiskfs/vdb/loadbbitmaps # unpin block bitmaps in memory

            Thanks,
            Shilong

            wangshilong Wang Shilong (Inactive) added a comment
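The three modes described in the comment above could be wrapped in a small helper. This is only a sketch: the `loadbbitmaps` file exists only with the proposed patch applied, and the function name `set_bitmap_mode`, the mode names, and the `sysfs_root` parameter are hypothetical, introduced here for illustration.

```python
from pathlib import Path

# Numeric values as described in the comment; the mode names are illustrative.
MODES = {"unpin": 0, "load": 1, "pin": 2}


def set_bitmap_mode(device: str, mode: str, sysfs_root: str = "/sys/fs/ldiskfs") -> Path:
    """Write the numeric mode to <sysfs_root>/<device>/loadbbitmaps.

    Assumes the proposed patch is applied; without it the attribute is
    missing and the write is expected to fail.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}, expected one of {sorted(MODES)}")
    target = Path(sysfs_root) / device / "loadbbitmaps"
    # Same effect as `echo N > .../loadbbitmaps` in the shell.
    target.write_text(f"{MODES[mode]}\n")
    return target
```

For example, `set_bitmap_mode("vdb", "pin")` corresponds to the `echo 2` line, and `set_bitmap_mode("vdb", "unpin")` to the `echo 0` line.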

            Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/32347
            Subject: LU-10946 ldiskfs: add an interface to load ldiskfs block bitmaps
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f2989ee1ac1b7ca5666fc1cf42f9e95f3da20200

            gerrit Gerrit Updater added a comment
            wangshilong Wang Shilong (Inactive) added a comment - - edited

            Nathan,

            I mean that when I downloaded the attachment it was hard for me to read; it looked like messy code, something like the following:

            #!/usr/bin/perl -w
            # $Header: /cvsroot/lustre-tools/src/read-inodes,v 1.2 2011/03/16 00:04:17 jrappley Exp $

            use strict;
            use File::Basename;
            use Getopt::Long;
            use Pod::Usage;
            use POSIX qw(ceil);
            use Fcntl 'SEEK_SET';

            my $progname = basename $0;

            # Globals
            my $group = 0;
            my $inodesPerGroup = 0;
            my $inodeBlocksPerGroup = 0;
            my @usedInodes;
            my $freeIBlocks = 0;

            # Command line options
            my %arg;
            $arg{verbose} = 0;

            sub dprint {
                if ($arg{verbose} > 1) {
                    print @_;
                }
            }

            # Parse command line options
            Getopt::Long::Configure("bundling");
            GetOptions(
                "h|help"     => \$arg{h},
                "d|device=s" => \$arg{device},
                "m|meta=s"   => \$arg{meta},
                "v|verbose+" => \$arg{verbose},
            ) or pod2usage(-exitval => 2, -verbose => 1);

            pod2usage(1) if ($arg{h});
            pod2usage(1) if (scalar(@ARGV) != 0);
            pod2usage(1) if (! defined $arg{meta});

            if (defined $arg{device}) {
                open D, "<", "$arg{device}" or die "Couldn't open $arg{device}: $!";
            }
            open M, "<", $arg{meta} or die "Couldn't open $arg{meta}: $!";

            while (<M>) {
                if (/Inodes per group:\s+(\d+)/) {
                    $inodesPerGroup = $1;
                } elsif (/Inode blocks per group:\s+(\d+)/) {
                    $inodeBlocksPerGroup = $1;
                } elsif (/^$/) {
                    last;
                }
            }

            if ($inodesPerGroup == 0 || $inodeBlocksPerGroup == 0) {
                print STDERR "Couldn't determine number of inodes per group, exiting\n";
                exit 1;
            }

            my $inodesPerBlock = $inodesPerGroup / $inodeBlocksPerGroup;
            my $firstItableBlock;

            while (<M>) {
                if (/^Group (\d+)/) {
                    $group = $1;
                }
                if (/Inode table at (\d+)/) {
                    $firstItableBlock = $1;
                }
                if (/(\d+) free inodes/) {
                    $usedInodes[$group] = $inodesPerGroup - $1;
                }
                if (/Free inodes:\s(.*)/) {
                    # next if $group > 1;
                    my @usedIBlocks;
                    my $groupFreeIBlocks = 0;
                    for my $i (0..$inodeBlocksPerGroup - 1) {
                        $usedIBlocks[$i] = $inodesPerBlock;
                    }
                    dprint "$group: ", join(" ", @usedIBlocks), "\n";
                    my @irange = split(/, /, $1);
                    dprint "Group $group: ", join(" X ", @irange), "\n";
                    foreach my $range (@irange) {
                        dprint "range: $range\n";
                        if ($range =~ /(\d+)-(\d+)/) {
                            my $low = $1 - ($group * $inodesPerGroup);
                            my $high = $2 - ($group * $inodesPerGroup);
                            dprint "marking $low..$high\n";
                            for my $inum ($low..$high) {
                                my $iBlock = ceil($inum / $inodesPerBlock) - 1;
                                dprint "inum $inum block $iBlock\n";
                                $usedIBlocks[$iBlock]--;
                            }
                        } else {
                            my $inum = $range - ($group * $inodesPerGroup);
                            my $iBlock = ceil($inum / $inodesPerBlock) - 1;
                            dprint "marking $inum block $iBlock\n";
                            $usedIBlocks[$iBlock]--;
                        }
                    }
                    for my $i (0..$#usedIBlocks) {
                        if ($usedIBlocks[$i] == 0) {
                            $groupFreeIBlocks++;
                            my $block = $firstItableBlock + $i;
                            if (0) {
                                dprint "group $group read ", $firstItableBlock + $i, "\n";
                                sysseek(D, $block * 4096, SEEK_SET);
                                sysread(D, my $foo, 4096);
                            }
                        } elsif (defined $arg{device}) {
                            my $block = $firstItableBlock + $i;
                            dprint "group $group read ", $firstItableBlock + $i, "\n";
                            sysseek(D, $block * 4096, SEEK_SET);
                            sysread(D, my $foo, 4096);
                        }
                    }
                    $freeIBlocks += $groupFreeIBlocks;
                    my @blockStr = map { sprintf "%3d", $_ } @usedIBlocks;
                    if ($arg{verbose}) {
                        printf("%6d: %3d/%3d | %s\n", $group, $usedInodes[$group], $groupFreeIBlocks, join("", @blockStr));
                    }
                }
            }

            print "Unused inode blocks: $freeIBlocks\n";
            close M;

            __END__

            =head1 NAME

            =head1 SYNOPSIS

            skeleton.pl [-h]

            =head1 DESCRIPTION

            =head1 OPTIONS

            =over 8

            =item B<-h|--help>

            Print a help message and exit.

            =back 8

            =head1 EXAMPLES

            =head1 ENVIRONMENT

            =over 8

            =item B<FOO>

            FOO is an environment variable that somehow alters the execution of this program.

            =back 8

            =head1 KNOWN BUGS

            =head1 CAVEATS

            =head1 DETAILS

            =head1 REPORTING BUGS

            =head1 AUTHOR

            =head1 SEE ALSO

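The core of the pasted script is the mapping from a free inode number to its inode-table block, `ceil(inum / inodesPerBlock) - 1`. The sketch below restates that bookkeeping in Python so it can be checked on small numbers. The function name `used_inodes_per_block` and its parameters are hypothetical, and it assumes the free ranges have already been parsed from the `Free inodes:` line into inclusive `(low, high)` pairs of absolute inode numbers.

```python
def used_inodes_per_block(group, inodes_per_group, inode_blocks_per_group, free_ranges):
    """Return the number of in-use inodes in each inode-table block of a group.

    free_ranges holds inclusive (low, high) pairs of absolute, 1-based inode
    numbers that are free in this group.
    """
    inodes_per_block = inodes_per_group // inode_blocks_per_group
    # Start by assuming every inode in every block is in use.
    used = [inodes_per_block] * inode_blocks_per_group
    base = group * inodes_per_group
    for low, high in free_ranges:
        for inum in range(low - base, high - base + 1):
            # Integer form of ceil(inum / inodes_per_block) - 1.
            block = (inum + inodes_per_block - 1) // inodes_per_block - 1
            used[block] -= 1
    return used


def fully_free_blocks(used):
    """Indices of inode-table blocks with no in-use inodes."""
    return [i for i, count in enumerate(used) if count == 0]
```

With 32 inodes per group spread over 4 inode-table blocks, a group 0 whose inodes 12 through 32 are free has blocks 2 and 3 completely unused, which is the case the script counts as "unused inode blocks".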
            ndauchy Nathan Dauchy (Inactive) added a comment - - edited

            It is just a text file, a Perl script.  The file simply uses line feed characters, but perhaps your editor (when no extension is present) expects carriage returns too.


            Nathan Dauchy,

            The read-inode attachment is in a format that is hard to read, so we had better figure out exactly which kind of cached metadata improved your performance.

            wangshilong Wang Shilong (Inactive) added a comment

            Shilong, any progress on this patch?

            adilger Andreas Dilger added a comment

            For clarification on the metadata pre-loading done at NASA, I have uploaded the scripts.  They are run from cron like:

            0 1   * * * root /usr/local/bin/read-meta.sh dump >> /var/log/lustre-read-meta.log 2>&1
            */15 * * * * root /usr/local/bin/read-meta.sh read >> /var/log/lustre-read-meta.log 2>&1
            

            The original developer has left, but it sounds like the scripts were actually created with a focus on caching inode information (such as to speed up "ls -l"), not necessarily free blocks. Perhaps they have the nice side effect of refreshing the block bitmaps in cache, and if other changes in the last few years like flex_bg have improved inode table reading, then these scripts are now unnecessary?

            ndauchy Nathan Dauchy (Inactive) added a comment

            Discussions of this at LUG brought up another issue, and possible solution... for cases of failover pairs, a server may have enough memory for pinning bitmaps in normal operation but OOM in a failover event with 2X the OST count.  To handle that case, and protect low-memory OSS in general, there could also be a configurable amount (80% by default?) of total memory threshold above which not to pin bitmaps.  I don't know if it would be better to pin as much as possible up to that threshold, or to pre-calculate whether all bitmaps for a given OST are pinnable based on total (or free) memory; but either way it should report a kernel error message if not able to pin without going over.

            ndauchy Nathan Dauchy (Inactive) added a comment
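The threshold idea in the comment above can be sketched as a simple admission check. Nothing here comes from the patch: the function names, the integer-percent parameter, and the choice of "used memory" as the input are all assumptions made for illustration. The 80% default is the figure floated in the comment.

```python
def can_pin_all(total_bytes, used_bytes, bitmap_bytes, threshold_pct=80):
    """True if pinning bitmap_bytes keeps memory use at or below the threshold.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    return (used_bytes + bitmap_bytes) * 100 <= total_bytes * threshold_pct


def pinnable_bytes(total_bytes, used_bytes, threshold_pct=80):
    """How many more bytes could be pinned before crossing the threshold."""
    limit = total_bytes * threshold_pct // 100
    return max(0, limit - used_bytes)
```

`can_pin_all` corresponds to the "pre-calculate whether all bitmaps are pinnable" variant, and `pinnable_bytes` to the "pin as much as possible up to the threshold" variant. In a failover event the same check would simply be re-run with the higher `used_bytes`, and a refusal would be the point at which to log the kernel error message mentioned above.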

            Hi Nathan Dauchy,

            Thanks for your input here; it looks like you raised a lot of interesting questions.

            1) Regarding the memory requirement: yes, pinning bitmaps will add memory pressure.
            Considering that we have one block bitmap (4K) per 128M block group, it might eat a lot
            of memory if the system is big, and the inode bitmaps have the same requirement. I guess system memory
            might not be enough to pin both inode bitmaps and block bitmaps.

            2) We tried just reading the bitmaps into memory and leaving them reclaimable, but write performance
            was still not stable, since the bitmap memory was easily reclaimed before we needed it.

            3) I don't think it is a good idea to load (pin) the full inode table, since the full inode table eats much more memory
            than the bitmaps. Maybe loading the inode table ahead on demand makes sense, but I am not sure.

            So I agree we can make the patch more configurable, with:
            1) a pre-read of the bitmaps to warm the cache without pinning;
            2) an option to pin the bitmaps.

            Thanks,
            Shilong

            wangshilong Wang Shilong (Inactive) added a comment
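The "one 4K block bitmap per 128M block group" figure in the comment above makes the memory cost easy to estimate: a 4 KiB bitmap has 32768 bits, one per 4 KiB block, so it covers 128 MiB. The sketch below works through that arithmetic. The function name and the example filesystem sizes are illustrative only, and it assumes a 4 KiB block size and counts block bitmaps alone (no inode bitmaps, no page-cache overhead).

```python
BLOCK_SIZE = 4096                           # bytes per filesystem block (assumed 4 KiB)
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE           # one bitmap bit per block -> 32768 blocks
GROUP_BYTES = BLOCK_SIZE * BLOCKS_PER_GROUP # 128 MiB covered by one bitmap block


def block_bitmap_bytes(fs_bytes):
    """Memory needed to hold every block bitmap of a filesystem of fs_bytes."""
    groups = -(-fs_bytes // GROUP_BYTES)    # ceiling division
    return groups * BLOCK_SIZE


if __name__ == "__main__":
    TiB = 1 << 40
    for size_tib in (16, 100, 1024):
        need = block_bitmap_bytes(size_tib * TiB)
        print(f"{size_tib:5d} TiB target -> {need / (1 << 30):.3f} GiB of block bitmaps")
```

The ratio is 1/32768 of the filesystem size, so a 100 TiB target needs 3.125 GiB for block bitmaps and a server carrying 1 PiB in total needs 32 GiB, which is why pinning can become a problem on large systems or after failover.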
            ndauchy Nathan Dauchy (Inactive) added a comment - - edited

            This feature is of interest to NASA. (We currently use scripts around debugfs run periodically from cron to dump the bitmap information, and also the object trees with 'debugfs -c -R "ls O/0/d$i"'.)

            I found this was discussed way back in LU-15, with memory requirement estimates (are they still valid?):
            LU-15 comment-12883
            Also, from ticket LU-3631: is reading the inode bitmaps no longer really useful?

            Regarding this patch, I think it would be helpful to include options to make it more generally configurable and usable for a wider variety of use-cases, not just pin block bitmaps. For example...

            • Just do a pre-read of the bitmaps to warm the cache, without pinning.
            • An option to either load at mount time, or load on demand.
            • Differentiate between loading (and pinning) Data Block bitmap vs. Inode bitmap vs. the full Inode Table.

            Thanks!

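The cron-driven `debugfs -c -R "ls O/0/d$i"` approach mentioned in the comment above can be sketched as a command generator. This is not the NASA script: the function name, the device path `/dev/vdb`, and the default of 32 object subdirectories are assumptions for illustration. The sketch only builds the argument lists and does not run anything.

```python
import shlex


def debugfs_ls_commands(device, subdirs=32):
    """Build one read-only debugfs invocation per object subdirectory.

    -c opens the filesystem in catastrophic (read-only) mode and -R runs a
    single request. subdirs=32 is an assumed count of O/0/d* directories.
    """
    return [
        ["debugfs", "-c", "-R", f"ls O/0/d{i}", device]
        for i in range(subdirs)
    ]


if __name__ == "__main__":
    # Print shell-quoted command lines for inspection instead of executing them.
    for argv in debugfs_ls_commands("/dev/vdb", subdirs=2):
        print(shlex.join(argv))
```

Each list can be passed to `subprocess.run` as-is; listing the directories forces their blocks to be read, which is the cache-warming side effect the scripts rely on.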
            pjones Peter Jones added a comment -

            ok - thanks wangshilong

             


            People

              wshilong Wang Shilong (Inactive)
              wangshilong Wang Shilong (Inactive)