[LU-10946] add an interface to load ldiskfs block bitmaps Created: 24/Apr/18 Updated: 28/Sep/23 Resolved: 20/Apr/20 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Wang Shilong (Inactive) | Assignee: | Wang Shilong (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | patch | ||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||
| Description |
|
During our benchmarking and testing, we found that write performance is sometimes unstable: small reads are issued during writes, which can reduce write throughput. It turned out that loading block bitmaps adds latency here. In addition, on a heavily fragmented filesystem we might need to load many bitmaps to find free blocks. To improve this, we have a patch that loads the block bitmaps into memory and pins them until unmount, or until we release the memory on purpose. This could stabilize write performance and improve performance on heavily fragmented filesystems. |
| Comments |
| Comment by Wang Shilong (Inactive) [ 24/Apr/18 ] |
|
I need to clean up our internal patch a bit and will push it to master very soon. |
| Comment by Peter Jones [ 24/Apr/18 ] |
|
ok - thanks wangshilong
|
| Comment by Nathan Dauchy (Inactive) [ 24/Apr/18 ] |
|
This feature is of interest to NASA. (We currently use scripts around debugfs run periodically from cron to dump the bitmap information, and also the object trees with 'debugfs -c -R "ls O/0/d$i"'.) I found this was discussed way back in Regarding this patch, I think it would be helpful to include options to make it more generally configurable and usable for a wider variety of use-cases, not just pin block bitmaps. For example...
Thanks! |
| Comment by Wang Shilong (Inactive) [ 24/Apr/18 ] |
|
Hi Nathan Dauchy,
Thanks for your input; it looks like you raised a lot of interesting questions here.
1) Regarding the memory requirement: yes, pinning bitmaps will add memory pressure.
2) We tried just reading the bitmaps into memory and leaving them reclaimable, but write performances
3) I don't think it is a good idea to load (pin) the full inode table, since the full inode table consumes much more memory.
So I agree we can make the patch more configurable with:
Thanks, |
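To put a rough number on the memory pressure mentioned in point 1, here is a minimal sketch that estimates the size of all block bitmaps for one OST. It assumes the default ext4/ldiskfs layout (one bitmap block per block group, 8 * block_size blocks per group, no bigalloc). The function name and the example OST sizes are hypothetical, and the result ignores per-buffer kernel overhead, so treat it as a lower bound rather than what the patch actually consumes.

```python
TIB = 1 << 40
MIB = 1 << 20


def block_bitmap_bytes(fs_bytes: int, block_size: int = 4096) -> int:
    """Estimate the total size of all block bitmaps for a filesystem.

    Assumption: default ext4/ldiskfs geometry, where each block group has
    one bitmap block and covers 8 * block_size blocks (one bit per block).
    """
    blocks_per_group = 8 * block_size
    group_bytes = blocks_per_group * block_size
    groups = -(-fs_bytes // group_bytes)  # ceiling division
    return groups * block_size


if __name__ == "__main__":
    # Hypothetical OST sizes, for illustration only.
    for size_tib in (16, 64, 256):
        mib = block_bitmap_bytes(size_tib * TIB) / MIB
        print(f"{size_tib:4d} TiB OST: about {mib:.0f} MiB of block bitmaps")
```

With 4 KiB blocks, each group covers 128 MiB of disk and needs one 4 KiB bitmap, so the bitmaps are roughly 1/32768 of the filesystem size.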
| Comment by Nathan Dauchy (Inactive) [ 25/Apr/18 ] |
|
Discussions of this at LUG brought up another issue, and possible solution... for cases of failover pairs, a server may have enough memory for pinning bitmaps in normal operation but OOM in a failover event with 2X the OST count. To handle that case, and protect low-memory OSS in general, there could also be a configurable amount (80% by default?) of total memory threshold above which not to pin bitmaps. I don't know if it would be better to pin as much as possible up to that threshold, or to pre-calculate whether all bitmaps for a given OST are pinnable based on total (or free) memory; but either way it should report a kernel error message if not able to pin without going over. |
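The threshold idea above can be sketched as follows. This is only an illustration of the "pre-calculate per OST" variant, in which an OST's bitmaps are pinned only if all of them fit under the threshold. The function name, the OST labels, and the memory figures are hypothetical; the 80% default is the value floated in the comment, not something the patch is known to implement.

```python
GIB = 1 << 30


def plan_pinning(bitmap_bytes_by_ost, total_mem, used_mem, threshold=0.80):
    """Decide which OSTs may pin their block bitmaps.

    All-or-nothing per OST: an OST is pinned only if its whole bitmap set
    fits in what remains below threshold * total_mem. Returns the two lists
    (pinned, skipped); a real implementation would log a kernel error
    message for every skipped OST.
    """
    budget = int(total_mem * threshold) - used_mem
    pinned, skipped = [], []
    for ost, need in bitmap_bytes_by_ost.items():
        if need <= budget:
            pinned.append(ost)
            budget -= need
        else:
            skipped.append(ost)
    return pinned, skipped


if __name__ == "__main__":
    # Hypothetical OSS: 64 GiB RAM, 40 GiB in use, 4 GiB of bitmaps per OST.
    normal = {"OST0000": 4 * GIB, "OST0001": 4 * GIB}
    failover = dict(normal, OST0002=4 * GIB, OST0003=4 * GIB)
    print(plan_pinning(normal, 64 * GIB, 40 * GIB))
    print(plan_pinning(failover, 64 * GIB, 40 * GIB))
```

In the failover case the OST count doubles, so the later OSTs are skipped instead of pushing the server toward OOM.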
| Comment by Nathan Dauchy (Inactive) [ 26/Apr/18 ] |
|
For clarification on the metadata pre-loading done at NASA, I have uploaded the scripts. They are run from cron like:
0 1 * * * root /usr/local/bin/read-meta.sh dump >> /var/log/lustre-read-meta.log 2>&1
*/15 * * * * root /usr/local/bin/read-meta.sh read >> /var/log/lustre-read-meta.log 2>&1
The original developer has left, but it sounds like the scripts were actually created with a focus on caching inode information (such as to speed up "ls -l"), not necessarily free blocks. Perhaps they have the nice side effect of refreshing the block bitmaps in cache, and if other changes in the last few years like flex_bg have improved inode table reading, then these scripts are now unnecessary? |
| Comment by Andreas Dilger [ 02/May/18 ] |
|
Shilong, any progress on this patch? |
| Comment by Wang Shilong (Inactive) [ 07/May/18 ] |
|
Nathan Dauchy, the read-inode attachment is in a format that is hard to read, so we had better figure out exactly which kind of cached metadata improved your performance. |
| Comment by Nathan Dauchy (Inactive) [ 07/May/18 ] |
|
It is just a text file, perl script. The file simply uses line feed characters, but perhaps your editor (when no extension is present) is looking for carriage return too. |
| Comment by Wang Shilong (Inactive) [ 07/May/18 ] |
|
Nathan, I mean that when I downloaded the attachment it was hard for me to read. It looks something like the following:

#!/usr/bin/perl -w
# $Header: /cvsroot/lustre-tools/src/read-inodes,v 1.2 2011/03/16 00:04:17 jrappley Exp $

use strict;
use File::Basename;
use Getopt::Long;
use Pod::Usage;
use POSIX qw(ceil);
use Fcntl 'SEEK_SET';

my $progname = basename $0;

# Globals
my $group = 0;
my $inodesPerGroup = 0;
my $inodeBlocksPerGroup = 0;
my @usedInodes;
my $freeIBlocks = 0;

# Command line options
my %arg;
$arg{verbose} = 0;

sub dprint {
    if ($arg{verbose} > 1) {
        print @_;
    }
}

# Parse command line options
Getopt::Long::Configure("bundling");
GetOptions(
    "h|help"     => \$arg{h},
    "d|device=s" => \$arg{device},
    "m|meta=s"   => \$arg{meta},
    "v|verbose+" => \$arg{verbose},
) or pod2usage(-exitval => 2, -verbose => 1);

pod2usage(1) if ($arg{h});
pod2usage(1) if (scalar(@ARGV) != 0);
pod2usage(1) if (!defined $arg{meta});

if (defined $arg{device}) {
    open D, "<", "$arg{device}" or die "Couldn't open $arg{device}: $!";
}
open M, "<", $arg{meta} or die "Couldn't open $arg{meta}: $!";

while (<M>) {
    if (/Inodes per group:\s+(\d+)/) {
        $inodesPerGroup = $1;
    } elsif (/Inode blocks per group:\s+(\d+)/) {
        $inodeBlocksPerGroup = $1;
    } elsif (/^$/) {
        last;
    }
}

if ($inodesPerGroup == 0 || $inodeBlocksPerGroup == 0) {
    print STDERR "Couldn't determine number of inodes per group, exiting\n";
    exit 1;
}

my $inodesPerBlock = $inodesPerGroup / $inodeBlocksPerGroup;
my $firstItableBlock;

while (<M>) {
    if (/^Group (\d+)/) {
        $group = $1;
    }
    if (/Inode table at (\d+)/) {
        $firstItableBlock = $1;
    }
    if (/(\d+) free inodes/) {
        $usedInodes[$group] = $inodesPerGroup - $1;
    }
    if (/Free inodes:\s(.*)/) {
        # next if $group > 1;
        my @usedIBlocks;
        my $groupFreeIBlocks = 0;
        for my $i (0..$inodeBlocksPerGroup - 1) {
            $usedIBlocks[$i] = $inodesPerBlock;
        }
        dprint "$group: ", join(" ", @usedIBlocks), "\n";
        my @irange = split(/, /, $1);
        dprint "Group $group: ", join(" X ", @irange), "\n";
        foreach my $range (@irange) {
            dprint "range: $range\n";
            if ($range =~ /(\d+)-(\d+)/) {
                my $low = $1 - ($group * $inodesPerGroup);
                my $high = $2 - ($group * $inodesPerGroup);
                dprint "marking $low..$high\n";
                for my $inum ($low..$high) {
                    my $iBlock = ceil($inum / $inodesPerBlock) - 1;
                    dprint "inum $inum block $iBlock\n";
                    $usedIBlocks[$iBlock]--;
                }
            } else {
                my $inum = $range - ($group * $inodesPerGroup);
                my $iBlock = ceil($inum / $inodesPerBlock) - 1;
                dprint "marking $inum block $iBlock\n";
                $usedIBlocks[$iBlock]--;
            }
        }
        for my $i (0..$#usedIBlocks) {
            if ($usedIBlocks[$i] == 0) {
                $groupFreeIBlocks++;
                my $block = $firstItableBlock + $i;
                if (0) {
                    dprint "group $group read ", $firstItableBlock + $i, "\n";
                    sysseek(D, $block * 4096, SEEK_SET);
                    sysread(D, my $foo, 4096);
                }
            } elsif (defined $arg{device}) {
                my $block = $firstItableBlock + $i;
                dprint "group $group read ", $firstItableBlock + $i, "\n";
                sysseek(D, $block * 4096, SEEK_SET);
                sysread(D, my $foo, 4096);
            }
        }
        $freeIBlocks += $groupFreeIBlocks;
        my @blockStr = map { sprintf "%3d", $_ } @usedIBlocks;
        if ($arg{verbose}) {
            printf("%6d: %3d/%3d | %s\n", $group, $usedInodes[$group], $groupFreeIBlocks, join("", @blockStr));
        }
    }
}

print "Unused inode blocks: $freeIBlocks\n";
close M;

__END__

=head1 NAME

=head1 SYNOPSIS

skeleton.pl [-h]

=head1 DESCRIPTION

=head1 OPTIONS

=over 8

=item B< |
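For readers trying to work out which metadata the script touches: its core step maps the free inode ranges printed by dumpe2fs to blocks of each group's inode table, then reads only the blocks that still hold used inodes. The Python sketch below mirrors that mapping under the same assumptions as the Perl code (1-based inode numbers, inodes per group evenly divisible by inode blocks per group). The function names and the tiny geometry in the example are hypothetical.

```python
def itable_block_index(inum_in_group: int, inodes_per_block: int) -> int:
    """Index of the inode-table block holding a 1-based, group-local inode.

    Equivalent to the Perl expression ceil($inum / $inodesPerBlock) - 1.
    """
    return (inum_in_group - 1) // inodes_per_block


def free_itable_blocks(free_ranges, inodes_per_group, inode_blocks_per_group, group=0):
    """Return indexes of inode-table blocks that contain only free inodes.

    free_ranges is a list of (low, high) global inode numbers, inclusive,
    as listed on a "Free inodes:" line for the given group.
    """
    inodes_per_block = inodes_per_group // inode_blocks_per_group
    used = [inodes_per_block] * inode_blocks_per_group
    base = group * inodes_per_group
    for low, high in free_ranges:
        for inum in range(low - base, high - base + 1):
            used[itable_block_index(inum, inodes_per_block)] -= 1
    return [i for i, count in enumerate(used) if count == 0]


if __name__ == "__main__":
    # Toy geometry: 32 inodes per group spread over 4 inode-table blocks.
    print(free_itable_blocks([(9, 16), (20, 32)], 32, 4))
```

Blocks returned by `free_itable_blocks` are the ones the script skips; every other inode-table block is read from the device to warm the cache.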
| Comment by Gerrit Updater [ 10/May/18 ] |
|
Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/32347 |
| Comment by Wang Shilong (Inactive) [ 10/May/18 ] |
|
Hi Nathan Dauchy,
Sorry for the late patch. I just pushed a simple version; could you help test whether it helps your performance case?
echo 1 > /sys/fs/ldiskfs/vdb/loadbbitmaps # this only loads the bitmaps, it does not pin them.
Thanks, |
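For testing on several devices, the echo command above could be wrapped in a small helper like the sketch below. It assumes the `loadbbitmaps` attribute appears under `/sys/fs/ldiskfs/<device>/` only when the patch from review 32347 is applied; the function name and the `sysfs_root` parameter are hypothetical and exist so the helper can be tried against a scratch directory.

```python
from pathlib import Path


def load_block_bitmaps(device: str, sysfs_root: str = "/sys/fs/ldiskfs") -> bool:
    """Ask ldiskfs to load the block bitmaps of one device.

    Writes "1" to <sysfs_root>/<device>/loadbbitmaps, like the echo command
    in the comment. Returns False when the attribute is missing, for
    example on a kernel without the patch or for an unmounted device.
    """
    attr = Path(sysfs_root) / device / "loadbbitmaps"
    if not attr.exists():
        return False
    attr.write_text("1\n")
    return True
```

Calling `load_block_bitmaps("vdb")` as root would be equivalent to the one-line echo; a False return is a quick way to tell that the running ldiskfs module does not carry the patch.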
| Comment by Jay Lan (Inactive) [ 15/May/18 ] |
|
Hi Shilong, Nathan asked me to cherry-pick the #32347 review. This patch caused conflicts in b2_10. Branch b2_10 is 4 ldiskfs kernel_patches behind the master branch. Is there a dependency on the 4 missing patches or on any other commit? If yes, could you list the prerequisites of your patch? Thanks, |
| Comment by Wang Shilong (Inactive) [ 17/May/18 ] |
|
Hi Jay Lan, You can ignore the 4 missing patches and apply my patch directly. I built it locally and it works. Thanks, |