[LU-10946] add an interface to load ldiskfs block bitmaps Created: 24/Apr/18  Updated: 28/Sep/23  Resolved: 20/Apr/20

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Wang Shilong (Inactive) Assignee: Wang Shilong (Inactive)
Resolution: Won't Fix Votes: 0
Labels: patch

Attachments: HTML File read-inodes     File read-meta.sh    
Issue Links:
Related
is related to LU-17153 Random block allocation policy in ldi... Open
is related to LU-10967 MDT page cache management improvements Open
is related to LU-12970 improve mballoc for huge filesystems Open
Rank (Obsolete): 9223372036854775807

 Description   

During our benchmarking and testing, we found that write performance is sometimes unstable: small reads occur during writes and can reduce write throughput.

It turned out that loading block bitmaps adds latency here. Also, on a heavily fragmented filesystem, we might need to load many bitmaps to find free blocks.

To improve this situation, we have a patch that loads the block bitmaps into memory and pins them until unmount or until we release the memory on purpose. This could stabilize write performance and improve performance on a heavily fragmented filesystem.



 Comments   
Comment by Wang Shilong (Inactive) [ 24/Apr/18 ]

I need to clean up our internal patch a bit and will push it to master very soon.

Comment by Peter Jones [ 24/Apr/18 ]

ok - thanks wangshilong

 

Comment by Nathan Dauchy (Inactive) [ 24/Apr/18 ]

This feature is of interest to NASA. (We currently use scripts wrapped around debugfs, run periodically from cron, to dump the bitmap information and also the object trees with 'debugfs -c -R "ls O/0/d$i"'.)

I found this was discussed way back in LU-15, with memory requirement estimates (are they still valid?):
LU-15 comment-12883
Also, going by ticket LU-3631, is reading the inode bitmaps no longer really useful?

Regarding this patch, I think it would be helpful to include options to make it more generally configurable and usable for a wider variety of use-cases, not just pin block bitmaps. For example...

  • Just do a pre-read of the bitmaps to warm the cache, without pinning.
  • An option to either load at mount time, or load on demand.
  • Differentiate between loading (and pinning) Data Block bitmap vs. Inode bitmap vs. the full Inode Table.

Thanks!

Comment by Wang Shilong (Inactive) [ 24/Apr/18 ]

Hi Nathan Dauchy,

Thanks for your input; it looks like you raised a lot of interesting questions.

1) Regarding the memory requirement: yes, pinning bitmaps will add memory pressure.
Consider that there is one block bitmap (4K) per 128M block group, so pinning might eat a lot
of memory on a big system. The same applies to the inode bitmaps. I guess system memory
might not be enough to pin both the inode bitmaps and the block bitmaps.

2) We tried just reading the bitmaps into memory and leaving them reclaimable, but write performance
was still not stable, since the bitmap memory was reclaimed before we needed it.

3) I don't think it is a good idea to load (pin) the full inode table, since it eats much more memory
than the bitmaps. Maybe loading the inode table ahead on demand makes sense, but I am not sure.

So I agree we can make the patch more configurable with:
1) a pre-read of the bitmaps to warm the cache without pinning.
2) an option to pin the bitmaps.
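
The memory cost raised in point 1) can be estimated with a little arithmetic. The sketch below assumes the 4K block size and the single bitmap block per block group mentioned in this comment (which is what gives the 128M group size). The function name and the example filesystem size are hypothetical, chosen only for illustration.

```python
def block_bitmap_bytes(fs_bytes: int, block_size: int = 4096) -> int:
    """Estimate the memory needed to hold every block bitmap of one filesystem.

    Assumes one bitmap block per block group and one bit per block.
    """
    blocks_per_group = block_size * 8            # bits in one bitmap block
    group_bytes = blocks_per_group * block_size  # 128 MiB when block_size is 4096
    groups = -(-fs_bytes // group_bytes)         # ceiling division
    return groups * block_size


if __name__ == "__main__":
    one_pib = 2 ** 50  # hypothetical filesystem size, for illustration only
    print(block_bitmap_bytes(one_pib) / 2 ** 30, "GiB of block bitmaps")
```

Under these assumptions the block bitmaps take 1/32768 of the capacity, so a hypothetical 1 PiB of storage would need about 32 GiB of pinned memory for block bitmaps alone.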

Thanks,
Shilong

Comment by Nathan Dauchy (Inactive) [ 25/Apr/18 ]

Discussions of this at LUG brought up another issue, and possible solution... for cases of failover pairs, a server may have enough memory for pinning bitmaps in normal operation but OOM in a failover event with 2X the OST count.  To handle that case, and protect low-memory OSS in general, there could also be a configurable amount (80% by default?) of total memory threshold above which not to pin bitmaps.  I don't know if it would be better to pin as much as possible up to that threshold, or to pre-calculate whether all bitmaps for a given OST are pinnable based on total (or free) memory; but either way it should report a kernel error message if not able to pin without going over.
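
The "pin as much as possible up to the threshold" variant described above could look roughly like the following sketch. It is not taken from any patch: the function name, the memory figures, and the per-OST bitmap sizes are all hypothetical, and only the 80% default comes from the suggestion in this comment.

```python
GIB = 2 ** 30


def bitmaps_to_pin(bitmap_bytes_per_ost, total_mem, used_mem, threshold_pct=80):
    """Return (pinned_count, skipped_count) for the given OSTs, in order.

    Pins greedily and stops at the first OST whose bitmaps would push
    memory use past threshold_pct percent of total memory.
    """
    budget = total_mem * threshold_pct // 100 - used_mem
    pinned = 0
    for need in bitmap_bytes_per_ost:
        if need > budget:
            break
        budget -= need
        pinned += 1
    return pinned, len(bitmap_bytes_per_ost) - pinned


if __name__ == "__main__":
    # Hypothetical OSS: 64 GiB RAM, 40 GiB already in use, 2 GiB of bitmaps per OST.
    cases = {
        "normal": bitmaps_to_pin([2 * GIB] * 4, 64 * GIB, 40 * GIB),
        "failover": bitmaps_to_pin([2 * GIB] * 8, 64 * GIB, 40 * GIB),
    }
    for label, (pinned, skipped) in cases.items():
        if skipped:
            # Stands in for the kernel error message requested above.
            print(f"{label}: pinned {pinned} OSTs, {skipped} left unpinned to stay under the threshold")
        else:
            print(f"{label}: pinned all {pinned} OSTs")
```

The all-or-nothing alternative would instead compare the sum of the list against the budget once and pin either every OST or none.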

Comment by Nathan Dauchy (Inactive) [ 26/Apr/18 ]

For clarification on the metadata pre-loading done at NASA, I have uploaded the scripts.  They are run from cron like:

0 1   * * * root /usr/local/bin/read-meta.sh dump >> /var/log/lustre-read-meta.log 2>&1
*/15 * * * * root /usr/local/bin/read-meta.sh read >> /var/log/lustre-read-meta.log 2>&1

The original developer has left, but it sounds like the scripts were actually created with a focus on caching inode information (such as to speed up "ls -l"), not necessarily free blocks. Perhaps they have the nice side effect of refreshing the block bitmaps in cache, and if other changes in the last few years like flex_bg have improved inode table reading, then these scripts are now unnecessary?

Comment by Andreas Dilger [ 02/May/18 ]

Shilong, any progress on this patch?

Comment by Wang Shilong (Inactive) [ 07/May/18 ]

Nathan Dauchy,

The read-inodes attachment is in a format that is hard to read. Could we figure out exactly which kind of cached metadata improved your performance?

Comment by Nathan Dauchy (Inactive) [ 07/May/18 ]

It is just a text file, a Perl script.  The file simply uses line feed characters, but perhaps your editor (when no extension is present) is looking for carriage returns too.

Comment by Wang Shilong (Inactive) [ 07/May/18 ]

Nathan,

I mean that the attachment was hard for me to read as downloaded; the code was run together and a bit messy. It is something like the following:

#!/usr/bin/perl -w
# $Header: /cvsroot/lustre-tools/src/read-inodes,v 1.2 2011/03/16 00:04:17 jrappley Exp $

use strict;
use File::Basename;
use Getopt::Long;
use Pod::Usage;
use POSIX qw(ceil);
use Fcntl 'SEEK_SET';

my $progname = basename $0;

# Globals
my $group = 0;
my $inodesPerGroup = 0;
my $inodeBlocksPerGroup = 0;
my @usedInodes;
my $freeIBlocks = 0;

# Command line options
my %arg;
$arg{verbose} = 0;

sub dprint {
    if ($arg{verbose} > 1) {
        print @_;
    }
}

# Parse command line options
Getopt::Long::Configure("bundling");
GetOptions(
    "h|help" => \$arg{h},
    "d|device=s" => \$arg{device},
    "m|meta=s" => \$arg{meta},
    "v|verbose+" => \$arg{verbose},
) or pod2usage(-exitval => 2, -verbose => 1);

pod2usage(1) if ($arg{h});
pod2usage(1) if (scalar(@ARGV) != 0);
pod2usage(1) if (!defined $arg{meta});

if (defined $arg{device}) {
    open D, "<", "$arg{device}" or die "Couldn't open $arg{device}: $!";
}
open M, "<", $arg{meta} or die "Couldn't open $arg{meta}: $!";

while (<M>) {
    if (/Inodes per group:\s+(\d+)/) {
        $inodesPerGroup = $1;
    } elsif (/Inode blocks per group:\s+(\d+)/) {
        $inodeBlocksPerGroup = $1;
    } elsif (/^$/) {
        last;
    }
}

if ($inodesPerGroup == 0 || $inodeBlocksPerGroup == 0) {
    print STDERR "Couldn't determine number of inodes per group, exiting\n";
    exit 1;
}

my $inodesPerBlock = $inodesPerGroup / $inodeBlocksPerGroup;
my $firstItableBlock;

while (<M>) {
    if (/^Group (\d+)/) {
        $group = $1;
    }
    if (/Inode table at (\d+)/) {
        $firstItableBlock = $1;
    }
    if (/(\d+) free inodes/) {
        $usedInodes[$group] = $inodesPerGroup - $1;
    }
    if (/Free inodes:\s(.*)/) {
        # next if $group > 1;
        my @usedIBlocks;
        my $groupFreeIBlocks = 0;
        for my $i (0..$inodeBlocksPerGroup - 1) {
            $usedIBlocks[$i] = $inodesPerBlock;
        }
        dprint "$group: ", join(" ", @usedIBlocks), "\n";
        my @irange = split(/, /, $1);
        dprint "Group $group: ", join(" X ", @irange), "\n";
        foreach my $range (@irange) {
            dprint "range: $range\n";
            if ($range =~ /(\d+)-(\d+)/) {
                my $low = $1 - ($group * $inodesPerGroup);
                my $high = $2 - ($group * $inodesPerGroup);
                dprint "marking $low..$high\n";
                for my $inum ($low..$high) {
                    my $iBlock = ceil($inum / $inodesPerBlock) - 1;
                    dprint "inum $inum block $iBlock\n";
                    $usedIBlocks[$iBlock]--;
                }
            } else {
                my $inum = $range - ($group * $inodesPerGroup);
                my $iBlock = ceil($inum / $inodesPerBlock) - 1;
                dprint "marking $inum block $iBlock\n";
                $usedIBlocks[$iBlock]--;
            }
        }
        for my $i (0..$#usedIBlocks) {
            if ($usedIBlocks[$i] == 0) {
                $groupFreeIBlocks++;
                my $block = $firstItableBlock + $i;
                if (0) {
                    dprint "group $group read ", $firstItableBlock + $i, "\n";
                    sysseek(D, $block * 4096, SEEK_SET);
                    sysread(D, my $foo, 4096);
                }
            } elsif (defined $arg{device}) {
                my $block = $firstItableBlock + $i;
                dprint "group $group read ", $firstItableBlock + $i, "\n";
                sysseek(D, $block * 4096, SEEK_SET);
                sysread(D, my $foo, 4096);
            }
        }
        $freeIBlocks += $groupFreeIBlocks;
        my @blockStr = map { sprintf "%3d", $_ } @usedIBlocks;
        if ($arg{verbose}) {
            printf("%6d: %3d/%3d | %s\n", $group, $usedInodes[$group], $groupFreeIBlocks, join("", @blockStr));
        }
    }
}

print "Unused inode blocks: $freeIBlocks\n";
close M;

__END__

=head1 NAME

=head1 SYNOPSIS

skeleton.pl [-h]

=head1 DESCRIPTION

=head1 OPTIONS

=over 8

=item B<h|-help>

Print a help message and exit.

=back 8

=head1 EXAMPLES

=head1 ENVIRONMENT

=over 8

=item B FOO is an environment variable that somehow alters the execution of this program.

=back 8

=head1 KNOWN BUGS

=head1 CAVEATS

=head1 DETAILS

=head1 REPORTING BUGS

=head1 AUTHOR

=head1 SEE ALSO
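
For readers who only want the core idea of the pasted script: it takes the "Free inodes" ranges of each group from the metadata dump, counts how many inodes are still in use in each inode-table block, and then reads only the blocks that hold at least one used inode. The Python sketch below illustrates that counting step only. It is a reading of the script, not part of it, and the function name and sample values are hypothetical.

```python
def used_inodes_per_block(free_ranges, group, inodes_per_group, inode_blocks_per_group):
    """Return the number of in-use inodes in each inode-table block of one group.

    free_ranges is the text after "Free inodes:", e.g. "3, 7-8".
    Inode numbers are global and 1-based, as in the dump.
    """
    inodes_per_block = inodes_per_group // inode_blocks_per_group
    used = [inodes_per_block] * inode_blocks_per_group
    base = group * inodes_per_group
    for item in free_ranges.split(", "):
        if not item:
            continue  # nothing listed: the group has no free inodes
        low, _, high = item.partition("-")
        first = int(low) - base
        last = int(high or low) - base
        for inum in range(first, last + 1):
            # Same block index as ceil(inum / inodes_per_block) - 1 in the script.
            used[(inum - 1) // inodes_per_block] -= 1
    return used


if __name__ == "__main__":
    counts = used_inodes_per_block("3, 7-8", group=0, inodes_per_group=8, inode_blocks_per_group=2)
    worth_reading = [i for i, n in enumerate(counts) if n > 0]
    print(counts, worth_reading)
```

Blocks whose count drops to zero hold no used inodes, which is what the script totals as "Unused inode blocks".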

Comment by Gerrit Updater [ 10/May/18 ]

Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/32347
Subject: LU-10946 ldiskfs: add an interface to load ldiskfs block bitmaps
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f2989ee1ac1b7ca5666fc1cf42f9e95f3da20200

Comment by Wang Shilong (Inactive) [ 10/May/18 ]

Hi Nathan Dauchy,

Sorry for the late patch. I just pushed a simple version; could you help test whether it helps your performance case?

echo 1 > /sys/fs/ldiskfs/vdb/loadbbitmaps # load block bitmaps only, without pinning
echo 2 > /sys/fs/ldiskfs/vdb/loadbbitmaps # pin block bitmaps in memory
echo 0 > /sys/fs/ldiskfs/vdb/loadbbitmaps # unpin block bitmaps in memory
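
For scripted testing, the three values above could be wrapped in a small helper. This is only a sketch: the sysfs file and the meanings of 0, 1 and 2 are taken from this comment and exist only on a kernel with patch 32347 applied, while the enum, the function name and the `sysfs_root` parameter are hypothetical.

```python
from enum import IntEnum
from pathlib import Path


class BitmapMode(IntEnum):
    UNPIN = 0  # unpin block bitmaps
    LOAD = 1   # load block bitmaps without pinning
    PIN = 2    # load and pin block bitmaps in memory


def set_bitmap_mode(device: str, mode: BitmapMode,
                    sysfs_root: Path = Path("/sys/fs/ldiskfs")) -> Path:
    """Write the mode to <sysfs_root>/<device>/loadbbitmaps and return that path."""
    target = sysfs_root / device / "loadbbitmaps"
    target.write_text(f"{int(mode)}\n")
    return target
```

Calling `set_bitmap_mode("vdb", BitmapMode.PIN)` as root would then be equivalent to the `echo 2` line above. The `sysfs_root` parameter is there only so the helper can be tried against a scratch directory.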

Thanks,
Shilong

Comment by Jay Lan (Inactive) [ 15/May/18 ]

Hi Shilong,

Nathan asked me to cherry-pick review #32347. The patch causes conflicts in b2_10, which is 4 ldiskfs kernel_patches behind the master branch. Is there a dependency on the 4 missing patches or on any other commit? If so, could you list the prerequisites of your patch?

Thanks,
Jay

Comment by Wang Shilong (Inactive) [ 17/May/18 ]

Hi Jay Lan,

You can just ignore the 4 missing patches and apply my patch directly. I built it locally and it works.

Thanks,
Shilong

Generated at Sat Feb 10 02:39:35 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.