[LU-7922] ROOT dir created at mkfs time is using a high #d inode, >2G Created: 26/Mar/16  Updated: 17/Aug/16  Resolved: 29/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Major
Reporter: Lokesh Nagappa Jaliminche (Inactive) Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Duplicate
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Steps to reproduce issue:
====================

export LOAD=yes
sh llmount.sh
/mnt/lokesh/seagate/lustre-wc-rel/lustre/tests/../utils/mkfs.lustre --mgs --fsname=lustre --mdt --index=0 --param=sys.timeout=20 --param=lov.stripesize=1048576 --param=lov.stripecount=0 --param=mdt.identity_upcall=/mnt/lokesh/seagate/lustre-wc-rel/lustre/tests/../utils/l_getidentity --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-N 300000  -G 1" --reformat /tmp/lustre-mdt1 > /dev/null
mkdir -p /mnt/mds1; mount -t lustre -o loop /tmp/lustre-mdt1 /mnt/mds1
mount -t ldiskfs /dev/loop0 /mnt/test/
 ls -i /mnt/test/
[root@server_lokesh tests]# ls -i /mnt/test/
    97 changelog_catalog   30001 O             30 oi.16.17      39 oi.16.26      48 oi.16.35      57 oi.16.44      66 oi.16.53      75 oi.16.62           240005 ROOT
    98 changelog_users        13 oi.16.0       31 oi.16.18      40 oi.16.27      49 oi.16.36      58 oi.16.45      67 oi.16.54      76 oi.16.63               85 seq_ctl
240001 CONFIGS                14 oi.16.1       32 oi.16.19      41 oi.16.28      50 oi.16.37      59 oi.16.46      68 oi.16.55      20 oi.16.7                86 seq_srv
    84 fld                    23 oi.16.10      15 oi.16.2       42 oi.16.29      51 oi.16.38      60 oi.16.47      69 oi.16.56      21 oi.16.8
    99 hsm_actions            24 oi.16.11      33 oi.16.20      16 oi.16.3       52 oi.16.39      61 oi.16.48      70 oi.16.57      22 oi.16.9

Results
Disk_info after formatting :
====================

[root@server_lokesh tests]# dumpe2fs -h /dev/loop0
dumpe2fs 1.42.12.wc1 (15-Sep-2014)
Filesystem volume name:   lustre-MDT0000
Last mounted on:          /
Filesystem UUID:          a6926858-ad86-49a6-94ee-225ba0fc57cb
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              300000
Block count:              50000
Reserved block count:     2307
Free blocks:              7885
Free inodes:              299987
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      78
Blocks per group:         5120
Fragments per group:      5120
Inodes per group:         30000
Inode blocks per group:   3750
Filesystem created:       Tue Dec  1 15:11:34 2015
Last mount time:          Tue Dec  1 15:11:48 2015
Last write time:          Tue Dec  1 15:11:48 2015
Mount count:              3
Maximum mount count:      -1
Last checked:             Tue Dec  1 15:11:34 2015
Check interval:           0 (<none>)
Lifetime writes:          457 kB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               512
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      9483ebb9-ab24-47eb-b36f-7992baff0cd2
Journal backup:           inode blocks
User quota inode:         3
Group quota inode:        4
Journal features:         (none)
Journal size:             16M
Journal length:           4096
Journal sequence:         0x00000011
Journal start:            1

Inode allocation results :
========================

[root@server_lokesh tests]# ls -i /mnt/test/
    97 changelog_catalog   30001 O             30 oi.16.17      39 oi.16.26      48 oi.16.35      57 oi.16.44      66 oi.16.53      75 oi.16.62           240005 ROOT
    98 changelog_users        13 oi.16.0       31 oi.16.18      40 oi.16.27      49 oi.16.36      58 oi.16.45      67 oi.16.54      76 oi.16.63               85 seq_ctl
240001 CONFIGS                14 oi.16.1       32 oi.16.19      41 oi.16.28      50 oi.16.37      59 oi.16.46      68 oi.16.55      20 oi.16.7                86 seq_srv
    84 fld                    23 oi.16.10      15 oi.16.2       42 oi.16.29      51 oi.16.38      60 oi.16.47      69 oi.16.56      21 oi.16.8
    99 hsm_actions            24 oi.16.11      33 oi.16.20      16 oi.16.3       52 oi.16.39      61 oi.16.48      70 oi.16.57      22 oi.16.9
As per the above results:
Inode count: 300000
Free inodes: 299987
Inodes per group: 30000
flex_bg size: 1
ROOT inode: 240005

The ROOT inode is assigned from the 9th of the 10 groups even though there are enough free inodes in the initial groups.
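For reference, the block group an inode lands in follows directly from the inode number and the per-group inode count (standard ext4/ldiskfs numbering, where inode numbers start at 1). A minimal sketch using the values from the dumpe2fs output above; the helper name is hypothetical:

#include <stdio.h>

/* Hypothetical helper: ext4/ldiskfs inode i lives in block group (i - 1) / inodes_per_group. */
static unsigned long ino_to_group(unsigned long ino, unsigned long inodes_per_group)
{
        return (ino - 1) / inodes_per_group;
}

int main(void)
{
        unsigned long inodes_per_group = 30000;   /* from the dumpe2fs output above */
        unsigned long root_ino = 240005;          /* ROOT inode reported by ls -i */

        /* Prints "ROOT inode 240005 -> group 8", i.e. the 9th of the 10 groups. */
        printf("ROOT inode %lu -> group %lu\n",
               root_ino, ino_to_group(root_ino, inodes_per_group));
        return 0;
}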



 Comments   
Comment by Gerrit Updater [ 26/Mar/16 ]

lokesh.jaliminche (lokesh.jaliminche@seagate.com) uploaded a new patch: http://review.whamcloud.com/19161
Subject: LU-7922 osd-ldiskfs: ROOT inode allocation fix.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8bf58b712dd8744847d47740eedcbd3498151616

Comment by Andreas Dilger [ 26/Mar/16 ]

You have failed to indicate why the placement of the ROOT inode is actually a problem?

The ldiskfs inode allocator takes into account the fact that the earlier groups already have other inodes allocated, and that the journal, inode table, group descriptors, OI files, etc. are already taking space in those groups. Also, by setting the ROOT inode to the start of the filesystem it means that every new inode and subdir allocated therein will rescan all of the first groups for free inodes, even though the blocks in those groups are already consumed by the other filesystem metadata.

In the long run, it is typically better to move the ROOT/ inode to be in a separate group from the start of the filesystem to ensure that its subdirectories are not themselves stuck in groups that are already full of other filesystem metadata.

It would be informative to look at the full "dumpe2fs" output to see if the blocks of those early groups are totally full already? That is usually the case and the ldiskfs inode allocation is doing exactly the right thing for new subdirectories.

Comment by Lokesh Nagappa Jaliminche (Inactive) [ 28/Mar/16 ]

You have failed to indicate why the placement of the ROOT inode is actually a problem?

We have some in-house systems with large MDTs (61542 block groups on the MDT); on one, ROOT is in block group 5232, on the other it is in group 30864. This random behaviour is due to the following ldiskfs code (from find_group_orlov()):

if (S_ISDIR(mode) &&
            ((parent == sb->s_root->d_inode) ||
             (ext4_test_inode_flag(parent, EXT4_INODE_TOPDIR)))) {
                int best_ndir = inodes_per_group;
                int ret = -1;

                if (qstr) {
                        hinfo.hash_version = DX_HASH_HALF_MD4;
                        hinfo.seed = sbi->s_hash_seed;
                        ext4fs_dirhash(qstr->name, qstr->len, &hinfo);
                        grp = hinfo.hash;
                } else
                        get_random_bytes(&grp, sizeof(grp));

This filesystem has seen a few problems that likely stem from this high inode number, namely LU-7325. Although we have a solution for that, I imagine there could be other ramifications of >2G inode numbers, so it seems dangerous to have the bulk of the inodes in that range. So IMO it is better to keep ROOT in the initial groups.

By setting the ROOT inode to the start of the filesystem it means that every new inode and subdir allocated therein will rescan all of the first groups for free inodes, even though the blocks in those groups are already consumed by the other filesystem metadata.
In the long run, it is typically better to move the ROOT/ inode to be in a separate group from the start of the filesystem to ensure that its subdirectories are not themselves stuck in groups that are already full of other filesystem metadata.

Yes, I agree with this, but if we skip the groups that contain filesystem metadata and select one from the initial empty groups, it will serve the purpose of this ticket.

It would be informative to look at the full "dumpe2fs" output to see if the blocks of those early groups are totally full already?

I have looked at the full dumpe2fs output from my local setup, and as per the output below, most (though not all) of the blocks in group 0 are used for metadata.

dumpe2fs log:
===========

[root@dev-1 lustre-wc-rel]# dumpe2fs /dev/sdb
Filesystem volume name:   lustre-MDT0000
Last mounted on:          /
Filesystem UUID:          dd52b428-ff08-4627-8da3-e50064d7ec96
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              100000
Block count:              50000
Reserved block count:     2340
Free blocks:              33296
Free inodes:              99987
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      25
Blocks per group:         15592
Fragments per group:      15592
Inodes per group:         25000
Inode blocks per group:   3125
Flex block group size:    16
Filesystem created:       Mon Mar 28 22:53:37 2016
Last mount time:          Mon Mar 28 22:54:41 2016
Last write time:          Mon Mar 28 22:54:42 2016
Mount count:              4
Maximum mount count:      -1
Last checked:             Mon Mar 28 22:53:37 2016
Check interval:           0 (<none>)
Lifetime writes:          50 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               512
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8   
Default directory hash:   half_md4
Directory Hash Seed:      007f4c6b-a4fd-4cc8-96ab-73f63a5d5d1f
Journal backup:           inode blocks
User quota inode:         3   
Group quota inode:        4   
Journal features:         (none)
Journal size:             16M 

Journal length:           4096
Journal sequence:         0x00000018
Journal start:            1


Group 0: (Blocks 0-15591) [ITABLE_ZEROED]
  Checksum 0x7f05, unused inodes 24876
  Primary superblock at 0, Group descriptors at 1-1
  Reserved GDT blocks at 2-26
  Block bitmap at 27 (+27), Inode bitmap at 31 (+31)
  Inode table at 35-3159 (+35)
  2932 free blocks, 24876 free inodes, 2 directories, 24876 unused inodes
  Free blocks: 12620-15551
  Free inodes: 125-25000
Group 1: (Blocks 15592-31183) [ITABLE_ZEROED]
  Checksum 0x9b45, unused inodes 24952
  Backup superblock at 15592, Group descriptors at 15593-15593
  Reserved GDT blocks at 15594-15618
  Block bitmap at 28 (bg #0 + 28), Inode bitmap at 32 (bg #0 + 32)
  Inode table at 3160-6284 (bg #0 + 3160)
  15361 free blocks, 24952 free inodes, 48 directories, 24952 unused inodes
  Free blocks: 15619-16103, 16107-16615, 16817-31183
  Free inodes: 25049-50000
Group 2: (Blocks 31184-46775) [ITABLE_ZEROED]
  Checksum 0xaf65, unused inodes 24955
  Block bitmap at 29 (bg #0 + 29), Inode bitmap at 33 (bg #0 + 33)
  Inode table at 6285-9409 (bg #0 + 6285)
  11491 free blocks, 24955 free inodes, 45 directories, 24955 unused inodes
  Free blocks: 35285-46775
  Free inodes: 50046-75000
Group 3: (Blocks 46776-49999) [ITABLE_ZEROED]
  Checksum 0x0301, unused inodes 24979
  Backup superblock at 46776, Group descriptors at 46777-46777
  Reserved GDT blocks at 46778-46802
  Block bitmap at 30 (bg #0 + 30), Inode bitmap at 34 (bg #0 + 34)
  Inode table at 9410-12534 (bg #0 + 9410)
  3197 free blocks, 24979 free inodes, 21 directories, 24979 unused inodes
  Free blocks: 46803-49999
  Free inodes: 75022-100000

setup:
=====

[root@dev-1 lustre-wc-rel]# lustre/utils/mkfs.lustre --mgs --fsname=lustre --mdt --index=0 --param=sys.timeout=20 --param=lov.stripesize=1048576 --param=lov.stripecount=0 --param=mdt.identity_upcall=/mnt/lokesh/seagate/lustre-wc-rel/lustre/tests/../utils/l_getidentity --backfstype=ldiskfs --device-size=200000  --reformat /dev/sdb
[root@dev-1 lustre-wc-rel]# mkdir -p /mnt/mds1; mount -t lustre  /dev/sdb /mnt/mds1
[root@dev-1 lustre-wc-rel]# mount -t ldiskfs /dev/sdb /mnt/test/
[root@dev-1 lustre-wc-rel]# ls -i /mnt/test/
  101 BATCHID              123 lfsck_layout        26 oi.16.12     34 oi.16.20     43 oi.16.29     51 oi.16.37     59 oi.16.45     67 oi.16.53     75 oi.16.61      25041 quota_slave
  102 changelog_catalog    106 lfsck_namespace     27 oi.16.13     35 oi.16.21     17 oi.16.3      52 oi.16.38     60 oi.16.46     68 oi.16.54     76 oi.16.62      25003 REMOTE_PARENT_DIR
  103 changelog_users       11 lost+found          28 oi.16.14     36 oi.16.22     44 oi.16.30     53 oi.16.39     61 oi.16.47     69 oi.16.55     77 oi.16.63         90 reply_data
25001 CONFIGS            25026 NIDTBL_VERSIONS     29 oi.16.15     37 oi.16.23     45 oi.16.31     18 oi.16.4      62 oi.16.48     70 oi.16.56     21 oi.16.7       25043 ROOT
   86 fld                25002 O                   30 oi.16.16     38 oi.16.24     46 oi.16.32     54 oi.16.40     63 oi.16.49     71 oi.16.57     22 oi.16.8          87 seq_ctl
  104 hsm_actions           14 oi.16.0             31 oi.16.17     39 oi.16.25     47 oi.16.33     55 oi.16.41     19 oi.16.5      72 oi.16.58     23 oi.16.9          88 seq_srv
   89 last_rcvd             15 oi.16.1             32 oi.16.18     40 oi.16.26     48 oi.16.34     56 oi.16.42     64 oi.16.50     73 oi.16.59     13 OI_scrub        100 update_log
25048 LFSCK                 24 oi.16.10            33 oi.16.19     41 oi.16.27     49 oi.16.35     57 oi.16.43     65 oi.16.51     20 oi.16.6   25047 PENDING       25042 update_log_dir
  105 lfsck_bookmark        25 oi.16.11            16 oi.16.2      42 oi.16.28     50 oi.16.36     58 oi.16.44     66 oi.16.52     74 oi.16.60  25038 quota_master
Comment by Andreas Dilger [ 28/Mar/16 ]

In the test case, the ROOT and other metadata inodes are using inodes in the 25000 range in group #1, which is the best group (it avoids group #0, which is almost full, and it has the maximum free blocks). Your patch would choose group #0 for ROOT and all the other metadata directories, which is definitely sub-optimal.

I agree that there is a risk of problems like LU-7325, but isn't it better if such problems are found on test systems, or early on in a filesystem's lifetime instead of only after the filesystem has been in use a long time and becomes quite full?

I do agree that the random nature of the orlov allocator is potentially sub-optimal for the ROOT inode, and you might consider to improve the selection of the group to avoid groups that are mostly full of metadata, but the current patch is not an acceptable long-term solution just a quick hack that papers over the problem.

Comment by Lokesh Nagappa Jaliminche (Inactive) [ 29/Mar/16 ]

I do agree that the random nature of the orlov allocator is potentially sub-optimal for the ROOT inode, and you might consider to improve the selection of the group to avoid groups that are mostly full of metadata, but the current patch is not an acceptable long-term solution just a quick hack that papers over the problem.

Yes, I agree. For now I will improve the current patch with changes to select a more appropriate group (i.e. a group that does not contain blocks full of metadata).

Comment by Lokesh Nagappa Jaliminche (Inactive) [ 30/Mar/16 ]

I have two proposals for deciding the goal inode.

1. Start
2. Repeat until group g satisfies the condition below:
       if (block_count_for_current_group(g) == max_blocks_per_group)
               break;
3. if (flex_group == 1)
       GOAL_INODE = (g - 1) * inodes_per_group + 1;
   else
       GOAL_INODE = ((g * flex_size) - flex_size) * inodes_per_group + 1;

However, it is observed that metadata always occupies blocks in the initial 1 or 2 groups, so IMO it is better to assign the goal inode from the third group or flex group:

1. Start
2. if (flex_group == 1)
       GOAL_INODE = 3 * inodes_per_group;
   else
       GOAL_INODE = 3 * flex_size * inodes_per_group;

Any suggestions on this?
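A rough C sketch of proposal 2; the parameter names are placeholders (in a real implementation flex_size and inodes_per_group would come from the ldiskfs superblock), not the actual osd-ldiskfs API:

/* Sketch of proposal 2: place the goal inode at the start of the third
 * group (or third flex group), on the assumption that filesystem metadata
 * only occupies blocks in the first one or two groups. */
static unsigned long goal_inode_proposal2(unsigned long inodes_per_group,
                                           unsigned int flex_size)
{
        if (flex_size <= 1)             /* flex_bg disabled, or flex group of a single group */
                return 3 * inodes_per_group;
        return 3 * (unsigned long)flex_size * inodes_per_group;
}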

Comment by Ujjwal Lanjewar (Inactive) [ 31/Mar/16 ]

My view... the Orlov allocator tries to distribute root directory inode allocations/placements. It offers better performance by ensuring root directory inodes are distributed. Keeping a fixed group by defining the GOAL inode will cause all the root directory inodes to stay together and may defeat the purpose/cause performance issues.

Further, inode numbers can be as large as 2^32-1 (unsigned); they will not exceed this limit. Large inode numbers look OK as long as the following cases are handled:

  • An inode number should never be interpreted as a signed number. If there is such a case then it needs to be fixed.
  • Algorithms should be able to ensure that inodes under the root directories stay close to the corresponding root directory's (parent) group.

We need to validate these facts and confirm whether that is the case. If not, we need to fix those as the solution.
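As an illustration of the first bullet, a minimal sketch (generic C, not a specific Lustre code path) of how an inode number above 2^31 misbehaves if it is ever stored, compared, or printed as a signed 32-bit value:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint32_t ino = 2500000000u;             /* an inode number above 2^31 */
        int32_t  as_signed = (int32_t)ino;      /* buggy: treated as a signed number */

        /* The unsigned view is fine; the signed view becomes negative, so any
         * comparison or "%d"-style formatting of it silently misbehaves. */
        printf("unsigned: %u  signed: %d\n", ino, as_signed);
        return 0;
}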

Comment by Lokesh Nagappa Jaliminche (Inactive) [ 01/Apr/16 ]

My view... the Orlov allocator tries to distribute root directory inode allocations/placements. It offers better performance by ensuring root directory inodes are distributed. Keeping a fixed group by defining the GOAL inode will cause all the root directory inodes to stay together and may defeat the purpose/cause performance issues.

Yes, you are right: the Orlov algorithm tries to spread out "top-level" directories for better performance (directories created in the root directory of a filesystem are considered top-level directories), but it does not guarantee this distribution across the disk because of its random behaviour.

Use case: contents of the /mnt/mds1 directory after mount:

[root@dev-1 lustre-wc-rel]# lustre/utils/mkfs.lustre --mgs --fsname=lustre --mdt --index=0 --param=sys.timeout=20 --param=lov.stripesize=1048576 --param=lov.stripecount=0 --param=mdt.identity_upcall=/mnt/lokesh/seagate/lustre-wc-rel/lustre/tests/../utils/l_getidentity --backfstype=ldiskfs --device-size=200000  --reformat /dev/sdb
[root@dev-1 lustre-wc-rel]# mkdir -p /mnt/mds1; mount -t lustre  /dev/sdb /mnt/mds1
[root@dev-1 lustre-wc-rel]# mount -t ldiskfs /dev/sdb /mnt/test/
[root@dev-1 lustre-wc-rel]# ls -i /mnt/test/
  101 BATCHID              123 lfsck_layout        26 oi.16.12     34 oi.16.20     43 oi.16.29     51 oi.16.37     59 oi.16.45     67 oi.16.53     75 oi.16.61      25041 quota_slave
  102 changelog_catalog    106 lfsck_namespace     27 oi.16.13     35 oi.16.21     17 oi.16.3      52 oi.16.38     60 oi.16.46     68 oi.16.54     76 oi.16.62      25003 REMOTE_PARENT_DIR
  103 changelog_users       11 lost+found          28 oi.16.14     36 oi.16.22     44 oi.16.30     53 oi.16.39     61 oi.16.47     69 oi.16.55     77 oi.16.63         90 reply_data
25001 CONFIGS            25026 NIDTBL_VERSIONS     29 oi.16.15     37 oi.16.23     45 oi.16.31     18 oi.16.4      62 oi.16.48     70 oi.16.56     21 oi.16.7       25043 ROOT
   86 fld                25002 O                   30 oi.16.16     38 oi.16.24     46 oi.16.32     54 oi.16.40     63 oi.16.49     71 oi.16.57     22 oi.16.8          87 seq_ctl
  104 hsm_actions           14 oi.16.0             31 oi.16.17     39 oi.16.25     47 oi.16.33     55 oi.16.41     19 oi.16.5      72 oi.16.58     23 oi.16.9          88 seq_srv
   89 last_rcvd             15 oi.16.1             32 oi.16.18     40 oi.16.26     48 oi.16.34     56 oi.16.42     64 oi.16.50     73 oi.16.59     13 OI_scrub        100 update_log
25048 LFSCK                 24 oi.16.10            33 oi.16.19     41 oi.16.27     49 oi.16.35     57 oi.16.43     65 oi.16.51     20 oi.16.6   25047 PENDING       25042 update_log_dir
  105 lfsck_bookmark        25 oi.16.11            16 oi.16.2      42 oi.16.28     50 oi.16.36     58 oi.16.44     66 oi.16.52     74 oi.16.60  25038 quota_master

We can see here that the Orlov allocator (without a goal_inode) keeps the "top-level" directories in the same group, i.e. #1. This is probably because of its random behaviour. When creating a directory which is not in a top-level directory, the Orlov algorithm tries to put it into the same cylinder group as its parent.

With the use of a goal_inode (making it different for each top-level directory) we can guarantee distribution of "top-level" directories across the disk. In this way we can also guarantee performance.

Comment by Andreas Dilger [ 01/Apr/16 ]

I've also thought about this a few times - the Lustre "ROOT/" directory does not get the LDISKFS_TOPDIR_FL set at creation time like the real ldiskfs root directory "/" (inode #2), so that top-level directories created therein (in the Lustre-visible root directory) are spread across the filesystem more. It does make sense to keep all of the Lustre directories together (e.g. ROOT, PENDING, REMOTE_PARENT_DIR, etc) so that there is minimal head movement when they are updated.

Looking at the above directory listing it seems that the ldiskfs orlov allocator is doing exactly the right thing - selecting the ROOT directory to go into the first/best block group in the filesystem (i.e. most free blocks). I don't think anything needs to be changed for this. I suspect that the use of such a small MDT filesystem is also skewing the behaviour because there is only a single group which has many more free blocks than the others.

The only useful change that might be made here is to set the TOPDIR flag on ROOT so that subsequent subdirectories are also spread out over the filesystem.
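For illustration only, setting that flag when ROOT is created might look roughly like the sketch below. The ldiskfs-prefixed names are assumed from the ext4 equivalents (ext4_set_inode_flag()/EXT4_INODE_TOPDIR), and the actual call site in osd-ldiskfs is not shown:

/* Kernel-side sketch: mark the Lustre ROOT/ directory as a "top directory"
 * so the Orlov allocator spreads its subdirectories across block groups,
 * the same way it treats the real ldiskfs root (inode #2).  The ldiskfs_
 * names below are assumed to mirror the ext4 ones. */
static void osd_mark_root_topdir(struct inode *inode)
{
        ldiskfs_set_inode_flag(inode, LDISKFS_INODE_TOPDIR);
        mark_inode_dirty(inode);
}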

Comment by Lokesh Nagappa Jaliminche (Inactive) [ 03/Apr/16 ]

I've also thought about this a few times - the Lustre "ROOT/" directory does not get the LDISKFS_TOPDIR_FL set at creation time like the real ldiskfs root directory "/" (inode #2), so that top-level directories created therein (in the Lustre-visible root directory) are spread across the filesystem more.

Correct

I suspect that the use of such a small MDT filesystem is also skewing the behaviour because there is only a single group which has many more free blocks than the others.

Yes, you are right. I have tried the above test case with a bigger MDT. It does distribute all the directories across the disk, but again it depends on the random number generator in the Orlov allocator.

It does make sense to keep all of the Lustre directories together (e.g. ROOT, PENDING, REMOTE_PARENT_DIR, etc) so that there is minimal head movement when they are updated.

Yes, I was looking at the contents of all the top-level directories (i.e. under /). Most of them have a fixed number of files; only the O and ROOT directories have subdirectories.
O: used to map legacy OST objects for compatibility. Since the repeated lookup() is only called for "/O" at mount time, it will not affect overall performance.
ROOT: the directories created under it depend on the user. As per your suggestion we can set the TOPDIR flag on ROOT so that subsequent subdirectories are spread out.

We can use the goal_inode to achieve this by selecting the goal_inode from the group/flex_group having the maximum number of empty blocks. Without this change, top-level directories will be spread out, which will lead to increased head movement.

Comment by Andreas Dilger [ 03/Apr/16 ]

We can use the goal_inode to achieve this by selecting the goal_inode from the group/flex_group having the maximum number of empty blocks. Without this change, top-level directories will be spread out, which will lead to increased head movement.

Except for basic benchmarks on a newly-formatted filesystem, I don't think there is any value in this added complexity. There is always going to be head movement between directories of different users, as well as the journal, so the question is how to best optimize this. Putting all of the top-level subdirectories together may help with benchmarks but will not help in real-life usage when each of those directories may have many thousands or millions of their own files and subdirectories.

So far you haven't shown any evidence that any of the proposed changes is actually going to improve performance, reduce seeking under normal usage, or do anything beyond add some lines to the code. It is my expectation that adding EXT4_TOPDIR_FL on the ROOT/ inode will increase the spread between directories in the short term, but that has the long-term benefit of giving separate users (or projects, or whatever else is at the top level of the directory tree) more space to keep their own files together, and is more similar to how the root of a normal ext4 filesystem works.

Is your objection to the use of a random value in the Orlov group selection, or only to the initial placement of ROOT/, or something else?

Comment by Lokesh Nagappa Jaliminche (Inactive) [ 04/Apr/16 ]

Let me describe the actual problem. During my experiments I observed that most of the time the Orlov allocator did not work as expected (even for some obvious cases).

Here is the information about one such occurrence, which is based on some traces added to the code around find_group_orlov() and a comparison with the corresponding dumpe2fs output.

Refer to the sample debug logs from find_group_orlov() and the dumpe2fs output below, taken with a flex group size of 2 and 64 total groups. As per the dumpe2fs log, inodes_per_flex_group should be at most 9376 (i.e. 4688*2) and blocks_per_flex_group at most 65536 (32768*2), but the debug logs showed different values (e.g. free_inodes around 18752, exactly double the expected maximum), which appeared problematic to me.

Debug Logs from find_group_orlov():

Apr  2 20:14:11 dev-1 kernel: stats.free_inodes: 18752
Apr  2 20:14:11 dev-1 kernel: stats.free_blocks: 128720
Apr  2 20:14:11 dev-1 kernel: stats.free_inodes: 18751
Apr  2 20:14:11 dev-1 kernel: stats.free_blocks: 128719
Apr  2 20:14:11 dev-1 kernel: stats.free_inodes: 18752
Apr  2 20:14:11 dev-1 kernel: stats.free_blocks: 128720
Apr  2 20:14:11 dev-1 kernel: stats.free_inodes: 18652
Apr  2 20:14:11 dev-1 kernel: stats.free_blocks: 128620
Apr  2 20:14:11 dev-1 kernel: stats.free_inodes: 18751
Apr  2 20:14:11 dev-1 kernel: stats.free_blocks: 128719
Apr  2 20:14:11 dev-1 kernel: stats.free_inodes: 18751
Apr  2 20:14:11 dev-1 kernel: stats.free_blocks: 127693
Apr  2 20:14:11 dev-1 kernel: stats.free_inodes: 18751
Apr  2 20:14:11 dev-1 kernel: stats.free_blocks: 128719
Apr  2 20:14:11 dev-1 kernel: stats.free_inodes: 18751
Apr  2 20:14:11 dev-1 kernel: stats.free_blocks: 128719
Apr  2 20:14:11 dev-1 kernel: stats.free_inodes: 18751
Apr  2 20:14:11 dev-1 kernel: stats.free_blocks: 128719
Apr  2 20:14:11 dev-1 kernel: stats.free_inodes: 18748
Apr  2 20:14:11 dev-1 kernel: stats.free_blocks: 128712
Apr  2 20:14:11 dev-1 kernel: stats.free_inodes: 18752
Apr  2 20:14:11 dev-1 kernel: stats.free_blocks: 128720
Apr  2 20:14:11 dev-1 kernel: stats.free_inodes: 18751
Apr  2 20:14:11 dev-1 kernel: stats.free_blocks: 128719
Apr  2 20:14:11 dev-1 kernel: stats.free_inodes: 18752
Apr  2 20:14:11 dev-1 kernel: stats.free_blocks: 128720
Apr  2 20:14:11 dev-1 kernel: stats.free_inodes: 18637
Apr  2 20:14:11 dev-1 kernel: stats.free_blocks: 126484
Apr  2 20:14:11 dev-1 kernel: stats.free_inodes: 18751
Apr  2 20:14:11 dev-1 kernel: stats.free_blocks: 127693
Apr  2 20:14:11 dev-1 kernel: stats.free_inodes: 18752
Apr  2 20:14:11 dev-1 kernel: stats.free_blocks: 127694
Apr  2 20:14:11 dev-1 kernel: stats.free_inodes: 18751

Dumpe2fs log:

Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              300032
Block count:              2097152
Reserved block count:     104857
Free blocks:              1971094
Free inodes:              300019
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      511
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         4688
Inode blocks per group:   586
Flex block group size:    2
Filesystem created:       Sat Apr  2 20:13:54 2016
Last mount time:          Sat Apr  2 20:14:11 2016
Last write time:          Sat Apr  2 20:14:11 2016
Mount count:              3
Maximum mount count:      -1
Last checked:             Sat Apr  2 20:13:54 2016
Check interval:           0 (<none>)
Lifetime writes:          2981 kB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:	          512
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      f3983c1b-c083-4cad-ab48-d0bd740e31b1
Journal backup:           inode blocks
User quota inode:         3
Group quota inode:        4
Journal features:         (none)
Journal size:             327M
Journal length:           83712
Journal sequence:         0x00000012
Journal start:            1


Group 0: (Blocks 0-32767) [ITABLE_ZEROED]
  Checksum 0x15ae, unused inodes 4565
  Primary superblock at 0, Group descriptors at 1-1
  Reserved GDT blocks at 2-512
  Block bitmap at 513 (+513), Inode bitmap at 515 (+515)
  Inode table at 517-1102 (+517)
  30867 free blocks, 4565 free inodes, 2 directories, 4565 unused inodes
  Free blocks: 1700-2047, 2249-32767
  Free inodes: 124-4688
Group 1: (Blocks 32768-65535) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0xe340, unused inodes 4688
  Backup superblock at 32768, Group descriptors at 32769-32769
  Reserved GDT blocks at 32770-33280
  Block bitmap at 514 (bg #0 + 514), Inode bitmap at 516 (bg #0 + 516)
  Inode table at 1103-1688 (bg #0 + 1103)
  32255 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 33281-65535
  Free inodes: 4689-9376
Group 2: (Blocks 65536-98303) [ITABLE_ZEROED]
  Checksum 0xe1ef, unused inodes 4687
  Block bitmap at 65536 (+0), Inode bitmap at 65538 (+2)
  Inode table at 65540-66125 (+4)
  31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes
  Free blocks: 66713-98303
  Free inodes: 9378-14064
Group 3: (Blocks 98304-131071) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0x2952, unused inodes 4688
  Backup superblock at 98304, Group descriptors at 98305-98305
  Reserved GDT blocks at 98306-98816
  Block bitmap at 65537 (bg #2 + 1), Inode bitmap at 65539 (bg #2 + 3)
  Inode table at 66126-66711 (bg #2 + 590)
  32255 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 98817-131071
  Free inodes: 14065-18752
Group 4: (Blocks 131072-163839) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0x1ec7, unused inodes 4688
  Block bitmap at 131072 (+0), Inode bitmap at 131074 (+2)
  Inode table at 131076-131661 (+4)
  31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 132248-163839
  Free inodes: 18753-23440
Group 5: (Blocks 163840-196607) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0xcbf4, unused inodes 4688
  Backup superblock at 163840, Group descriptors at 163841-163841
  Reserved GDT blocks at 163842-164352
  Block bitmap at 131073 (bg #4 + 1), Inode bitmap at 131075 (bg #4 + 3)
  Inode table at 131662-132247 (bg #4 + 590)
  32255 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 164353-196607
  Free inodes: 23441-28128
Group 6: (Blocks 196608-229375) [ITABLE_ZEROED]
  Checksum 0x5d2b, unused inodes 4687
  Block bitmap at 196608 (+0), Inode bitmap at 196610 (+2)
  Inode table at 196612-197197 (+4)
  31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes
  Free blocks: 197785-229375
  Free inodes: 28130-32816
Group 7: (Blocks 229376-262143) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0x9596, unused inodes 4688
  Backup superblock at 229376, Group descriptors at 229377-229377
  Reserved GDT blocks at 229378-229888
  Block bitmap at 196609 (bg #6 + 1), Inode bitmap at 196611 (bg #6 + 3)
  Inode table at 197198-197783 (bg #6 + 590)
  32255 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 229889-262143
  Free inodes: 32817-37504
Group 8: (Blocks 262144-294911) [ITABLE_ZEROED]
  Checksum 0x8606, unused inodes 4687
  Block bitmap at 262144 (+0), Inode bitmap at 262146 (+2)
  Inode table at 262148-262733 (+4)
  31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes
  Free blocks: 263321-294911
  Free inodes: 37506-42192
Group 9: (Blocks 294912-327679) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0x4ebb, unused inodes 4688
  Backup superblock at 294912, Group descriptors at 294913-294913
  Reserved GDT blocks at 294914-295424
  Block bitmap at 262145 (bg #8 + 1), Inode bitmap at 262147 (bg #8 + 3)
  Inode table at 262734-263319 (bg #8 + 590)
  32255 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 295425-327679
  Free inodes: 42193-46880
Group 10: (Blocks 327680-360447) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0xc5ea, unused inodes 4688
  Block bitmap at 327680 (+0), Inode bitmap at 327682 (+2)
  Inode table at 327684-328269 (+4)
  31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 328856-360447
  Free inodes: 46881-51568
Group 11: (Blocks 360448-393215) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0x9409, unused inodes 4688
  Block bitmap at 327681 (bg #10 + 1), Inode bitmap at 327683 (bg #10 + 3)
  Inode table at 328270-328855 (bg #10 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 360448-393215
  Free inodes: 51569-56256
Group 12: (Blocks 393216-425983) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0x274c, unused inodes 4688
  Block bitmap at 393216 (+0), Inode bitmap at 393218 (+2)
  Inode table at 393220-393805 (+4)
  31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 394392-425983
  Free inodes: 56257-60944
Group 13: (Blocks 425984-458751) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0x76af, unused inodes 4688
  Block bitmap at 393217 (bg #12 + 1), Inode bitmap at 393219 (bg #12 + 3)
  Inode table at 393806-394391 (bg #12 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 425984-458751
  Free inodes: 60945-65632
Group 14: (Blocks 458752-491519) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0x792e, unused inodes 4688
  Block bitmap at 458752 (+0), Inode bitmap at 458754 (+2)
  Inode table at 458756-459341 (+4)
  31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 459928-491519
  Free inodes: 65633-70320
Group 15: (Blocks 491520-524287) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0x28cd, unused inodes 4688
  Block bitmap at 458753 (bg #14 + 1), Inode bitmap at 458755 (bg #14 + 3)
  Inode table at 459342-459927 (bg #14 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 491520-524287
  Free inodes: 70321-75008
Group 16: (Blocks 524288-557055) [ITABLE_ZEROED]
  Checksum 0xcc9b, unused inodes 4687
  Block bitmap at 524288 (+0), Inode bitmap at 524290 (+2)
  Inode table at 524292-524877 (+4)
  31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes
  Free blocks: 525465-557055
  Free inodes: 75010-79696
Group 17: (Blocks 557056-589823) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0x80f6, unused inodes 4688
  Block bitmap at 524289 (bg #16 + 1), Inode bitmap at 524291 (bg #16 + 3)
  Inode table at 524878-525463 (bg #16 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 557056-589823
  Free inodes: 79697-84384
Group 18: (Blocks 589824-622591) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0x8f77, unused inodes 4688
  Block bitmap at 589824 (+0), Inode bitmap at 589826 (+2)
  Inode table at 589828-590413 (+4)
  31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 591000-622591
  Free inodes: 84385-89072
Group 19: (Blocks 622592-655359) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0xde94, unused inodes 4688
  Block bitmap at 589825 (bg #18 + 1), Inode bitmap at 589827 (bg #18 + 3)
  Inode table at 590414-590999 (bg #18 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 622592-655359
  Free inodes: 89073-93760
Group 20: (Blocks 655360-688127) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0x6dd1, unused inodes 4688
  Block bitmap at 655360 (+0), Inode bitmap at 655362 (+2)
  Inode table at 655364-655949 (+4)
  31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 656536-688127
  Free inodes: 93761-98448
Group 21: (Blocks 688128-720895) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0x3c32, unused inodes 4688
  Block bitmap at 655361 (bg #20 + 1), Inode bitmap at 655363 (bg #20 + 3)
  Inode table at 655950-656535 (bg #20 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 688128-720895
  Free inodes: 98449-103136
Group 22: (Blocks 720896-753663) [ITABLE_ZEROED]
  Checksum 0x2e3d, unused inodes 4687
  Block bitmap at 720896 (+0), Inode bitmap at 720898 (+2)
  Inode table at 720900-721485 (+4)
  31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes
  Free blocks: 722073-753663
  Free inodes: 103138-107824
Group 23: (Blocks 753664-786431) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0x6250, unused inodes 4688
  Block bitmap at 720897 (bg #22 + 1), Inode bitmap at 720899 (bg #22 + 3)
  Inode table at 721486-722071 (bg #22 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 753664-786431
  Free inodes: 107825-112512
Group 24: (Blocks 786432-819199) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0xe89e, unused inodes 4688
  Block bitmap at 786432 (+0), Inode bitmap at 786434 (+2)
  Inode table at 786436-787021 (+4)
  31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 787608-819199
  Free inodes: 112513-117200
Group 25: (Blocks 819200-851967) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0x3dad, unused inodes 4688
  Backup superblock at 819200, Group descriptors at 819201-819201
  Reserved GDT blocks at 819202-819712
  Block bitmap at 786433 (bg #24 + 1), Inode bitmap at 786435 (bg #24 + 3)
  Inode table at 787022-787607 (bg #24 + 590)
  32255 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 819713-851967
  Free inodes: 117201-121888
Group 26: (Blocks 851968-884735) [ITABLE_ZEROED]
  Checksum 0xab72, unused inodes 4687
  Block bitmap at 851968 (+0), Inode bitmap at 851970 (+2)
  Inode table at 851972-852557 (+4)
  31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes
  Free blocks: 853145-884735
  Free inodes: 121890-126576
Group 27: (Blocks 884736-917503) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0x63cf, unused inodes 4688
  Backup superblock at 884736, Group descriptors at 884737-884737
  Reserved GDT blocks at 884738-885248
  Block bitmap at 851969 (bg #26 + 1), Inode bitmap at 851971 (bg #26 + 3)
  Inode table at 852558-853143 (bg #26 + 590)
  32255 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 885249-917503
  Free inodes: 126577-131264
Group 28: (Blocks 917504-950271) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0x545a, unused inodes 4688
  Block bitmap at 917504 (+0), Inode bitmap at 917506 (+2)
  Inode table at 917508-918093 (+4)
  31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 918680-950271
  Free inodes: 131265-135952
Group 29: (Blocks 950272-983039) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0x05b9, unused inodes 4688
  Block bitmap at 917505 (bg #28 + 1), Inode bitmap at 917507 (bg #28 + 3)
  Inode table at 918094-918679 (bg #28 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 950272-983039
  Free inodes: 135953-140640
Group 30: (Blocks 983040-1015807) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0x0a38, unused inodes 4688
  Block bitmap at 983040 (+0), Inode bitmap at 983042 (+2)
  Inode table at 983044-983629 (+4)
  31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 984216-1015807
  Free inodes: 140641-145328
Group 31: (Blocks 1015808-1048575) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0x5bdb, unused inodes 4688
  Block bitmap at 983041 (bg #30 + 1), Inode bitmap at 983043 (bg #30 + 3)
  Inode table at 983630-984215 (bg #30 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 1015808-1048575
  Free inodes: 145329-150016
Group 32: (Blocks 1048576-1081343) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0x442f, unused inodes 4688
  Block bitmap at 1048576 (+0), Inode bitmap at 1048578 (+2)
  Inode table at 1048580-1049165 (+4)
  31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 1049752-1081343
  Free inodes: 150017-154704
Group 33: (Blocks 1081344-1114111) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0x3a54, unused inodes 4688
  Block bitmap at 1048577 (bg #32 + 1), Inode bitmap at 1048579 (bg #32 + 3)
  Inode table at 1049166-1049751 (bg #32 + 590)
  0 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 
  Free inodes: 154705-159392
Group 34: (Blocks 1114112-1146879) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0x8f83, unused inodes 4688
  Block bitmap at 1114112 (+0), Inode bitmap at 1114114 (+2)
  Inode table at 1114116-1114701 (+4)
  0 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 
  Free inodes: 159393-164080
Group 35: (Blocks 1146880-1179647) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0xa274, unused inodes 4688
  Block bitmap at 1114113 (bg #34 + 1), Inode bitmap at 1114115 (bg #34 + 3)
  Inode table at 1114702-1115287 (bg #34 + 590)
  13333 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 1166315-1179647
  Free inodes: 164081-168768
Group 36: (Blocks 1179648-1212415) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0xf8eb, unused inodes 4688
  Block bitmap at 1179648 (+0), Inode bitmap at 1179650 (+2)
  Inode table at 1179652-1180237 (+4)
  31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 1180824-1212415
  Free inodes: 168769-173456
Group 37: (Blocks 1212416-1245183) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0xa908, unused inodes 4688
  Block bitmap at 1179649 (bg #36 + 1), Inode bitmap at 1179651 (bg #36 + 3)
  Inode table at 1180238-1180823 (bg #36 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 1212416-1245183
  Free inodes: 173457-178144
Group 38: (Blocks 1245184-1277951) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0xa689, unused inodes 4688
  Block bitmap at 1245184 (+0), Inode bitmap at 1245186 (+2)
  Inode table at 1245188-1245773 (+4)
  31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 1246360-1277951
  Free inodes: 178145-182832
Group 39: (Blocks 1277952-1310719) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0xf76a, unused inodes 4688
  Block bitmap at 1245185 (bg #38 + 1), Inode bitmap at 1245187 (bg #38 + 3)
  Inode table at 1245774-1246359 (bg #38 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 1277952-1310719
  Free inodes: 182833-187520
Group 40: (Blocks 1310720-1343487) [ITABLE_ZEROED]
  Checksum 0x602a, unused inodes 4687
  Block bitmap at 1310720 (+0), Inode bitmap at 1310722 (+2)
  Inode table at 1310724-1311309 (+4)
  31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes
  Free blocks: 1311897-1343487
  Free inodes: 187522-192208
Group 41: (Blocks 1343488-1376255) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0x2c47, unused inodes 4688
  Block bitmap at 1310721 (bg #40 + 1), Inode bitmap at 1310723 (bg #40 + 3)
  Inode table at 1311310-1311895 (bg #40 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 1343488-1376255
  Free inodes: 192209-196896
Group 42: (Blocks 1376256-1409023) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0x23c6, unused inodes 4688
  Block bitmap at 1376256 (+0), Inode bitmap at 1376258 (+2)
  Inode table at 1376260-1376845 (+4)
  31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 1377432-1409023
  Free inodes: 196897-201584
Group 43: (Blocks 1409024-1441791) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0x7225, unused inodes 4688
  Block bitmap at 1376257 (bg #42 + 1), Inode bitmap at 1376259 (bg #42 + 3)
  Inode table at 1376846-1377431 (bg #42 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 1409024-1441791
  Free inodes: 201585-206272
Group 44: (Blocks 1441792-1474559) [ITABLE_ZEROED]
  Checksum 0x9a26, unused inodes 4588
  Block bitmap at 1441792 (+0), Inode bitmap at 1441794 (+2)
  Inode table at 1441796-1442381 (+4)
  31492 free blocks, 4588 free inodes, 100 directories, 4588 unused inodes
  Free blocks: 1443068-1474559
  Free inodes: 206373-210960
Group 45: (Blocks 1474560-1507327) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0x9083, unused inodes 4688
  Block bitmap at 1441793 (bg #44 + 1), Inode bitmap at 1441795 (bg #44 + 3)
  Inode table at 1442382-1442967 (bg #44 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 1474560-1507327
  Free inodes: 210961-215648
Group 46: (Blocks 1507328-1540095) [ITABLE_ZEROED]
  Checksum 0x828c, unused inodes 4687
  Block bitmap at 1507328 (+0), Inode bitmap at 1507330 (+2)
  Inode table at 1507332-1507917 (+4)
  31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes
  Free blocks: 1508505-1540095
  Free inodes: 215650-220336
Group 47: (Blocks 1540096-1572863) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0xcee1, unused inodes 4688
  Block bitmap at 1507329 (bg #46 + 1), Inode bitmap at 1507331 (bg #46 + 3)
  Inode table at 1507918-1508503 (bg #46 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 1540096-1572863
  Free inodes: 220337-225024
Group 48: (Blocks 1572864-1605631) [ITABLE_ZEROED]
  Checksum 0x2ab7, unused inodes 4687
  Block bitmap at 1572864 (+0), Inode bitmap at 1572866 (+2)
  Inode table at 1572868-1573453 (+4)
  31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes
  Free blocks: 1574041-1605631
  Free inodes: 225026-229712
Group 49: (Blocks 1605632-1638399) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0xe20a, unused inodes 4688
  Backup superblock at 1605632, Group descriptors at 1605633-1605633
  Reserved GDT blocks at 1605634-1606144
  Block bitmap at 1572865 (bg #48 + 1), Inode bitmap at 1572867 (bg #48 + 3)
  Inode table at 1573454-1574039 (bg #48 + 590)
  32255 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 1606145-1638399
  Free inodes: 229713-234400
Group 50: (Blocks 1638400-1671167) [ITABLE_ZEROED]
  Checksum 0x74d5, unused inodes 4687
  Block bitmap at 1638400 (+0), Inode bitmap at 1638402 (+2)
  Inode table at 1638404-1638989 (+4)
  31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes
  Free blocks: 1639577-1671167
  Free inodes: 234402-239088
Group 51: (Blocks 1671168-1703935) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0x38b8, unused inodes 4688
  Block bitmap at 1638401 (bg #50 + 1), Inode bitmap at 1638403 (bg #50 + 3)
  Inode table at 1638990-1639575 (bg #50 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 1671168-1703935
  Free inodes: 239089-243776
Group 52: (Blocks 1703936-1736703) [ITABLE_ZEROED]
  Checksum 0x9673, unused inodes 4687
  Block bitmap at 1703936 (+0), Inode bitmap at 1703938 (+2)
  Inode table at 1703940-1704525 (+4)
  31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes
  Free blocks: 1705113-1736703
  Free inodes: 243778-248464
Group 53: (Blocks 1736704-1769471) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0xda1e, unused inodes 4688
  Block bitmap at 1703937 (bg #52 + 1), Inode bitmap at 1703939 (bg #52 + 3)
  Inode table at 1704526-1705111 (bg #52 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 1736704-1769471
  Free inodes: 248465-253152
Group 54: (Blocks 1769472-1802239) [ITABLE_ZEROED]
  Checksum 0xc811, unused inodes 4687
  Block bitmap at 1769472 (+0), Inode bitmap at 1769474 (+2)
  Inode table at 1769476-1770061 (+4)
  31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes
  Free blocks: 1770649-1802239
  Free inodes: 253154-257840
Group 55: (Blocks 1802240-1835007) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0x847c, unused inodes 4688
  Block bitmap at 1769473 (bg #54 + 1), Inode bitmap at 1769475 (bg #54 + 3)
  Inode table at 1770062-1770647 (bg #54 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 1802240-1835007
  Free inodes: 257841-262528
Group 56: (Blocks 1835008-1867775) [ITABLE_ZEROED]
  Checksum 0x570c, unused inodes 4686
  Block bitmap at 1835008 (+0), Inode bitmap at 1835010 (+2)
  Inode table at 1835012-1835597 (+4)
  31588 free blocks, 4686 free inodes, 1 directories, 4686 unused inodes
  Free blocks: 1836185-1836543, 1836547-1867775
  Free inodes: 262531-267216
Group 57: (Blocks 1867776-1900543) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0x5f51, unused inodes 4688
  Block bitmap at 1835009 (bg #56 + 1), Inode bitmap at 1835011 (bg #56 + 3)
  Inode table at 1835598-1836183 (bg #56 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 1867776-1900543
  Free inodes: 267217-271904
Group 58: (Blocks 1900544-1933311) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0x50d0, unused inodes 4688
  Block bitmap at 1900544 (+0), Inode bitmap at 1900546 (+2)
  Inode table at 1900548-1901133 (+4)
  31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 1901720-1933311
  Free inodes: 271905-276592
Group 59: (Blocks 1933312-1966079) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0x0133, unused inodes 4688
  Block bitmap at 1900545 (bg #58 + 1), Inode bitmap at 1900547 (bg #58 + 3)
  Inode table at 1901134-1901719 (bg #58 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 1933312-1966079
  Free inodes: 276593-281280
Group 60: (Blocks 1966080-1998847) [ITABLE_ZEROED]
  Checksum 0xaff8, unused inodes 4687
  Block bitmap at 1966080 (+0), Inode bitmap at 1966082 (+2)
  Inode table at 1966084-1966669 (+4)
  31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes
  Free blocks: 1967257-1998847
  Free inodes: 281282-285968
Group 61: (Blocks 1998848-2031615) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Checksum 0xe395, unused inodes 4688
  Block bitmap at 1966081 (bg #60 + 1), Inode bitmap at 1966083 (bg #60 + 3)
  Inode table at 1966670-1967255 (bg #60 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 1998848-2031615
  Free inodes: 285969-290656
Group 62: (Blocks 2031616-2064383) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0xec14, unused inodes 4688
  Block bitmap at 2031616 (+0), Inode bitmap at 2031618 (+2)
  Inode table at 2031620-2032205 (+4)
  31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 2032792-2064383
  Free inodes: 290657-295344
Group 63: (Blocks 2064384-2097151) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0x7a0e, unused inodes 4688
  Block bitmap at 2031617 (bg #62 + 1), Inode bitmap at 2031619 (bg #62 + 3)
  Inode table at 2032206-2032791 (bg #62 + 590)
  32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes
  Free blocks: 2064384-2097151
  Free inodes: 295345-300032

Initially I thought that this was an issue with the Orlov allocator, hence my recommendation to substitute it with a better one. However, while this discussion was in progress, I did some more debugging to find out why Orlov did not work properly. I observed that find_group_orlov() relies on the stats information supplied by get_orlov_stats().

More logs revealed that get_orlov_stats() indeed returned incorrect stats. Finally I got to the root cause and found that the culprit was ldiskfs_kvzalloc(size, GFP_KERNEL), which is called in ldiskfs_fill_flex_info() to allocate (and reset) sbi->s_flex_groups before the stats calculation. However, ldiskfs_kvzalloc() did not actually initialize s_flex_groups to 0 (leading to accumulation of previous stats).

The simple fix for this would be to use the correct flag so that it is initialized properly, i.e.:

--- ldiskfs/super.c	2016-04-04 14:53:23.343136984 +0530
+++ ldiskfs/super.c.new	2016-04-04 14:53:09.946441993 +0530
@@ -96,7 +96,7 @@ void *ldiskfs_kvzalloc(size_t size, gfp_
{
	void *ret;
-	ret = kmalloc(size, flags);
+	ret = kmalloc(size, flags | __GFP_ZERO);
	if (!ret)
-		ret = __vmalloc(size, flags | __GFP_ZERO, PAGE_KERNEL);
+		ret = __vmalloc(size, flags , PAGE_KERNEL);
	return ret;
}
Comment by Andreas Dilger [ 10/Apr/16 ]

It isn't clear why you are removing the __GFP_ZERO flag from __vmalloc()?

That should be using kzalloc() instead of __GFP_ZERO, see commit v3.0-7217-gdb9481c04. Also, there should be __GFP_NOWARN for kmalloc() (see commit v3.11-rc2-221-g8be04b937). Please copy both commit messages from these commits and include the commit hashes in your patch, so that it is clear where the patch is coming from. It looks like the patches are only needed for RHEL6, not RHEL7.
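For reference, the corresponding upstream ext4 helper after those commits is along these lines (reproduced from memory, so treat it as a sketch rather than the exact upstream text; the ldiskfs version would be the same with the ldiskfs_ prefix):

void *ext4_kvzalloc(size_t size, gfp_t flags)
{
	void *ret;

	/* kzalloc() already returns zeroed memory, and __GFP_NOWARN avoids a
	 * warning on large allocations that fall back to vmalloc below. */
	ret = kzalloc(size, flags | __GFP_NOWARN);
	if (!ret)
		ret = __vmalloc(size, flags | __GFP_ZERO, PAGE_KERNEL);
	return ret;
}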

Comment by Gerrit Updater [ 14/Apr/16 ]

lokesh.jaliminche (lokesh.jaliminche@seagate.com) uploaded a new patch: http://review.whamcloud.com/19541
Subject: LU-7922 ldiskfs: correction in ext4_kzalloc
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7fe7ede48a2e9872300fc2d95017e477e6e50bfe

Comment by Gerrit Updater [ 25/Apr/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19541/
Subject: LU-7922 ldiskfs: correction in ext4_kzalloc
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 101e729708289e46fe858a6b7162f779e24dfa5a

Comment by Peter Jones [ 28/Jun/16 ]

Should the original patch be abandoned and this ticket marked as resolved or does it need reworking?

Comment by Andreas Dilger [ 29/Jun/16 ]

Patch was landed for 2.9.0.
