
add an interface to load ldiskfs block bitmaps

Details

    • Improvement
    • Resolution: Won't Fix
    • Minor

    Description

      During our benchmarking and testing, we found that write performance is sometimes not stable enough: small reads occur during writes, and they can reduce write throughput.

      It turned out that loading block bitmaps adds latency here. In addition, on a heavily fragmented filesystem, we might need to load many bitmaps to find free blocks.

      To improve this situation, we have a patch that loads block bitmaps into memory and pins them there until unmount or until the memory is released on purpose. This could stabilize write performance and improve performance on a heavily fragmented filesystem.
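For a rough sense of what pinning every block bitmap costs in memory, the arithmetic can be sketched as follows. This is a minimal sketch and not part of the patch. It assumes the default ext4/ldiskfs layout without bigalloc, where each block group has one block bitmap that occupies one filesystem block and covers 8 x block size blocks. The function name `bitmap_bytes` and the 16 TiB example size are hypothetical.

```shell
#!/bin/sh
# Estimate the bytes of memory needed to hold every block bitmap of one target.
# Assumption: one bitmap block per block group, each group spanning
# (8 * block_size) blocks, as in the default ext4/ldiskfs layout.
bitmap_bytes() {
    fs_bytes=$1
    block_size=$2
    blocks=$((fs_bytes / block_size))
    blocks_per_group=$((8 * block_size))
    # Round up so that a partial last group still counts.
    groups=$(( (blocks + blocks_per_group - 1) / blocks_per_group ))
    echo $((groups * block_size))
}

# Example: a 16 TiB target with 4 KiB blocks.
bitmap_bytes $((16 * 1024 * 1024 * 1024 * 1024)) 4096
```

Under these assumptions the example prints 536870912, that is, 512 MiB of bitmaps for one 16 TiB target.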

          Activity

            [LU-10946] add an interface to load ldiskfs block bitmaps

            Hi Jay Lan,

            You can just ignore the 4 missing patches and apply my patch directly. I built it locally and it works.

            Thanks,
            Shilong

            wangshilong Wang Shilong (Inactive) added a comment
            jaylan Jay Lan (Inactive) added a comment - - edited

            Hi Shilong,

            Nathan asked me to cherry-pick the #32347 review. This patch caused conflicts in b2_10. Branch b2_10 is 4 ldiskfs kernel_patches behind the master branch. Is there a dependency on the 4 missing patches or on any other commit? If so, could you list the prerequisites of your patch?

            Thanks,
            Jay


            Hi Nathan Dauchy,

            Sorry for the late patch. I just pushed a simple version; could you help test whether it helps your performance case?

            echo 1 > /sys/fs/ldiskfs/vdb/loadbbitmaps # load block bitmaps only, do not pin
            echo 2 > /sys/fs/ldiskfs/vdb/loadbbitmaps # pin block bitmaps in memory
            echo 0 > /sys/fs/ldiskfs/vdb/loadbbitmaps # unpin block bitmaps in memory

            Thanks,
            Shilong

            wangshilong Wang Shilong (Inactive) added a comment
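The three values in the comment above could be wrapped in a small helper so that callers use names instead of numbers. This is only a sketch: the `loadbbitmaps` file and its values 0, 1, and 2 come from the comment and exist only with the patch from this ticket applied, while the function names `bbitmap_mode_value` and `set_bbitmap_mode` are hypothetical.

```shell
#!/bin/sh
# Map a readable mode name to the value written to loadbbitmaps.
# Values as given in the comment: 1 = load only, 2 = pin, 0 = unpin.
bbitmap_mode_value() {
    case "$1" in
        load)  echo 1 ;;
        pin)   echo 2 ;;
        unpin) echo 0 ;;
        *)     echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Write the mode to the sysfs file of one device (hypothetical wrapper).
set_bbitmap_mode() {
    dev=$1
    mode=$2
    path="/sys/fs/ldiskfs/$dev/loadbbitmaps"
    value=$(bbitmap_mode_value "$mode") || return 1
    # The file is absent unless the patch is applied and the device is mounted.
    if [ ! -w "$path" ]; then
        echo "$path is not writable" >&2
        return 1
    fi
    echo "$value" > "$path"
}
```

With the patch applied and `vdb` mounted, `set_bbitmap_mode vdb pin` would be equivalent to the `echo 2` line above.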

            Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/32347
            Subject: LU-10946 ldiskfs: add an interface to load ldiskfs block bitmaps
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f2989ee1ac1b7ca5666fc1cf42f9e95f3da20200

            gerrit Gerrit Updater added a comment
            wangshilong Wang Shilong (Inactive) added a comment - - edited

            Nathan,

            I mean that when I downloaded the attachment, it was hard for me to read; it displayed as somewhat messy code. Its content is the following:

            #!/usr/bin/perl -w
            # $Header: /cvsroot/lustre-tools/src/read-inodes,v 1.2 2011/03/16 00:04:17 jrappley Exp $

            use strict;
            use File::Basename;
            use Getopt::Long;
            use Pod::Usage;
            use POSIX qw(ceil);
            use Fcntl 'SEEK_SET';

            my $progname = basename $0;

            # Globals
            my $group = 0;
            my $inodesPerGroup = 0;
            my $inodeBlocksPerGroup = 0;
            my @usedInodes;
            my $freeIBlocks = 0;

            # Command line options
            my %arg;
            $arg{verbose} = 0;

            sub dprint {
                if ($arg{verbose} > 1) {
                    print @_;
                }
            }

            # Parse command line options
            Getopt::Long::Configure("bundling");
            GetOptions(
                "h|help"     => \$arg{h},
                "d|device=s" => \$arg{device},
                "m|meta=s"   => \$arg{meta},
                "v|verbose+" => \$arg{verbose},
            ) or pod2usage(-exitval => 2, -verbose => 1);

            pod2usage(1) if ($arg{h});
            pod2usage(1) if (scalar(@ARGV) != 0);
            pod2usage(1) if (!defined $arg{meta});

            if (defined $arg{device}) {
                open D, "<", "$arg{device}" or die "Couldn't open $arg{device}: $!";
            }
            open M, "<", $arg{meta} or die "Couldn't open $arg{meta}: $!";

            while (<M>) {
                if (/Inodes per group:\s+(\d+)/) {
                    $inodesPerGroup = $1;
                } elsif (/Inode blocks per group:\s+(\d+)/) {
                    $inodeBlocksPerGroup = $1;
                } elsif (/^$/) {
                    last;
                }
            }

            if ($inodesPerGroup == 0 || $inodeBlocksPerGroup == 0) {
                print STDERR "Couldn't determine number of inodes per group, exiting\n";
                exit 1;
            }

            my $inodesPerBlock = $inodesPerGroup / $inodeBlocksPerGroup;
            my $firstItableBlock;

            while (<M>) {
                if (/^Group (\d+)/) {
                    $group = $1;
                }
                if (/Inode table at (\d+)/) {
                    $firstItableBlock = $1;
                }
                if (/(\d+) free inodes/) {
                    $usedInodes[$group] = $inodesPerGroup - $1;
                }
                if (/Free inodes:\s(.*)/) {
                    # next if $group > 1;
                    my @usedIBlocks;
                    my $groupFreeIBlocks = 0;
                    for my $i (0..$inodeBlocksPerGroup - 1) {
                        $usedIBlocks[$i] = $inodesPerBlock;
                    }
                    dprint "$group: ", join(" ", @usedIBlocks), "\n";
                    my @irange = split(/, /, $1);
                    dprint "Group $group: ", join(" X ", @irange), "\n";
                    foreach my $range (@irange) {
                        dprint "range: $range\n";
                        if ($range =~ /(\d+)-(\d+)/) {
                            my $low = $1 - ($group * $inodesPerGroup);
                            my $high = $2 - ($group * $inodesPerGroup);
                            dprint "marking $low..$high\n";
                            for my $inum ($low..$high) {
                                my $iBlock = ceil($inum / $inodesPerBlock) - 1;
                                dprint "inum $inum block $iBlock\n";
                                $usedIBlocks[$iBlock]--;
                            }
                        } else {
                            my $inum = $range - ($group * $inodesPerGroup);
                            my $iBlock = ceil($inum / $inodesPerBlock) - 1;
                            dprint "marking $inum block $iBlock\n";
                            $usedIBlocks[$iBlock]--;
                        }
                    }
                    for my $i (0..$#usedIBlocks) {
                        if ($usedIBlocks[$i] == 0) {
                            $groupFreeIBlocks++;
                            my $block = $firstItableBlock + $i;
                            if (0) {
                                dprint "group $group read ", $firstItableBlock + $i, "\n";
                                sysseek(D, $block * 4096, SEEK_SET);
                                sysread(D, my $foo, 4096);
                            }
                        } elsif (defined $arg{device}) {
                            my $block = $firstItableBlock + $i;
                            dprint "group $group read ", $firstItableBlock + $i, "\n";
                            sysseek(D, $block * 4096, SEEK_SET);
                            sysread(D, my $foo, 4096);
                        }
                    }
                    $freeIBlocks += $groupFreeIBlocks;
                    my @blockStr = map { sprintf "%3d", $_ } @usedIBlocks;
                    if ($arg{verbose}) {
                        printf("%6d: %3d/%3d | %s\n", $group, $usedInodes[$group], $groupFreeIBlocks, join("", @blockStr));
                    }
                }
            }

            print "Unused inode blocks: $freeIBlocks\n";
            close M;

            __END__

            =head1 NAME

            =head1 SYNOPSIS

            skeleton.pl [-h]

            =head1 DESCRIPTION

            =head1 OPTIONS

            =over 8

            =item B<-h|--help>

            Print a help message and exit.

            =back 8

            =head1 EXAMPLES

            =head1 ENVIRONMENT

            =over 8

            =item B<FOO>

            FOO is an environment variable that somehow alters the execution of this program.

            =back 8

            =head1 KNOWN BUGS

            =head1 CAVEATS

            =head1 DETAILS

            =head1 REPORTING BUGS

            =head1 AUTHOR

            =head1 SEE ALSO

            ndauchy Nathan Dauchy (Inactive) added a comment - - edited

            It is just a text file, a Perl script. The file simply uses line feed characters, but perhaps your editor (when no extension is present) is looking for carriage returns too.


            Nathan Dauchy,

            The read-inode attachment is in a format that is hard to read, so could we instead figure out exactly which kind of cached metadata improved your performance?

            wangshilong Wang Shilong (Inactive) added a comment

            Shilong, any progress on this patch?

            adilger Andreas Dilger added a comment

            For clarification on the metadata pre-loading done at NASA, I have uploaded the scripts.  They are run from cron like:

            0 1   * * * root /usr/local/bin/read-meta.sh dump >> /var/log/lustre-read-meta.log 2>&1
            */15 * * * * root /usr/local/bin/read-meta.sh read >> /var/log/lustre-read-meta.log 2>&1
            

            The original developer has left, but it sounds like the scripts were actually created with a focus on caching inode information (such as to speed up "ls -l"), not necessarily free blocks. Perhaps they have the nice side effect of refreshing the block bitmaps in cache, and if other changes in the last few years like flex_bg have improved inode table reading, then these scripts are now unnecessary?

            ndauchy Nathan Dauchy (Inactive) added a comment

            Discussions of this at LUG brought up another issue, and possible solution... for cases of failover pairs, a server may have enough memory for pinning bitmaps in normal operation but OOM in a failover event with 2X the OST count.  To handle that case, and protect low-memory OSS in general, there could also be a configurable amount (80% by default?) of total memory threshold above which not to pin bitmaps.  I don't know if it would be better to pin as much as possible up to that threshold, or to pre-calculate whether all bitmaps for a given OST are pinnable based on total (or free) memory; but either way it should report a kernel error message if not able to pin without going over.

            ndauchy Nathan Dauchy (Inactive) added a comment
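The threshold idea from the LUG discussion could be sketched roughly as below. This is an illustration of one possible policy, not something the patch implements: it checks whether memory already in use plus the bitmaps to be pinned stays at or below a percentage of total memory, with 80% as the default mentioned above. The function name `can_pin`, the 64 GiB server, the 46 GiB in use, and the 512 MiB of bitmaps per OST are all hypothetical numbers chosen for the example.

```shell
#!/bin/sh
# Decide whether pinning is allowed under a memory threshold.
# Hypothetical policy: pin only if memory already in use plus the bitmaps
# stays at or below threshold_pct percent of total memory (default 80).
can_pin() {
    needed_kb=$1
    total_kb=$2
    used_kb=$3
    threshold_pct=${4:-80}
    limit_kb=$((total_kb * threshold_pct / 100))
    [ $((used_kb + needed_kb)) -le "$limit_kb" ]
}

# Example: 64 GiB server, 46 GiB in use, 512 MiB of bitmaps per OST.
total_kb=$((64 * 1024 * 1024))
used_kb=$((46 * 1024 * 1024))
per_ost_kb=$((512 * 1024))
for ost_count in 6 12; do   # normal operation vs. failover with 2X the OSTs
    if can_pin $((ost_count * per_ost_kb)) "$total_kb" "$used_kb"; then
        echo "$ost_count OSTs: pin"
    else
        echo "$ost_count OSTs: do not pin, report an error"
    fi
done
```

With these example numbers, the 6 OSTs of normal operation fit under the 80% limit, while the 12 OSTs of a failover event do not. This version makes an all-or-nothing decision per request; pinning as much as possible up to the threshold would need a per-bitmap loop instead.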

            People

              wshilong Wang Shilong (Inactive)
              wangshilong Wang Shilong (Inactive)