Details
Type:             Bug
Resolution:       Fixed
Priority:         Major
Affects Versions: Lustre 2.8.0
Description
Steps to reproduce issue:
====================
export LOAD=yes
sh llmount.sh
/mnt/lokesh/seagate/lustre-wc-rel/lustre/tests/../utils/mkfs.lustre --mgs --fsname=lustre --mdt --index=0 --param=sys.timeout=20 --param=lov.stripesize=1048576 --param=lov.stripecount=0 --param=mdt.identity_upcall=/mnt/lokesh/seagate/lustre-wc-rel/lustre/tests/../utils/l_getidentity --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-N 300000 -G 1" --reformat /tmp/lustre-mdt1 > /dev/null
mkdir -p /mnt/mds1; mount -t lustre -o loop /tmp/lustre-mdt1 /mnt/mds1
mount -t ldiskfs /dev/loop0 /mnt/test/
ls -i /mnt/test/
[root@server_lokesh tests]# ls -i /mnt/test/
    97 changelog_catalog  30001 O         30 oi.16.17  39 oi.16.26  48 oi.16.35  57 oi.16.44  66 oi.16.53  75 oi.16.62  240005 ROOT
    98 changelog_users       13 oi.16.0   31 oi.16.18  40 oi.16.27  49 oi.16.36  58 oi.16.45  67 oi.16.54  76 oi.16.63      85 seq_ctl
240001 CONFIGS               14 oi.16.1   32 oi.16.19  41 oi.16.28  50 oi.16.37  59 oi.16.46  68 oi.16.55  20 oi.16.7       86 seq_srv
    84 fld                   23 oi.16.10  15 oi.16.2   42 oi.16.29  51 oi.16.38  60 oi.16.47  69 oi.16.56  21 oi.16.8
    99 hsm_actions           24 oi.16.11  33 oi.16.20  16 oi.16.3   52 oi.16.39  61 oi.16.48  70 oi.16.57  22 oi.16.9
Results
Disk_info after formatting :
====================
[root@server_lokesh tests]# dumpe2fs -h /dev/loop0
dumpe2fs 1.42.12.wc1 (15-Sep-2014)
Filesystem volume name:   lustre-MDT0000
Last mounted on:          /
Filesystem UUID:          a6926858-ad86-49a6-94ee-225ba0fc57cb
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              300000
Block count:              50000
Reserved block count:     2307
Free blocks:              7885
Free inodes:              299987
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      78
Blocks per group:         5120
Fragments per group:      5120
Inodes per group:         30000
Inode blocks per group:   3750
Filesystem created:       Tue Dec 1 15:11:34 2015
Last mount time:          Tue Dec 1 15:11:48 2015
Last write time:          Tue Dec 1 15:11:48 2015
Mount count:              3
Maximum mount count:      -1
Last checked:             Tue Dec 1 15:11:34 2015
Check interval:           0 (<none>)
Lifetime writes:          457 kB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               512
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      9483ebb9-ab24-47eb-b36f-7992baff0cd2
Journal backup:           inode blocks
User quota inode:         3
Group quota inode:        4
Journal features:         (none)
Journal size:             16M
Journal length:           4096
Journal sequence:         0x00000011
Journal start:            1
Inode allocation results :
========================
[root@server_lokesh tests]# ls -i /mnt/test/
    97 changelog_catalog  30001 O         30 oi.16.17  39 oi.16.26  48 oi.16.35  57 oi.16.44  66 oi.16.53  75 oi.16.62  240005 ROOT
    98 changelog_users       13 oi.16.0   31 oi.16.18  40 oi.16.27  49 oi.16.36  58 oi.16.45  67 oi.16.54  76 oi.16.63      85 seq_ctl
240001 CONFIGS               14 oi.16.1   32 oi.16.19  41 oi.16.28  50 oi.16.37  59 oi.16.46  68 oi.16.55  20 oi.16.7       86 seq_srv
    84 fld                   23 oi.16.10  15 oi.16.2   42 oi.16.29  51 oi.16.38  60 oi.16.47  69 oi.16.56  21 oi.16.8
    99 hsm_actions           24 oi.16.11  33 oi.16.20  16 oi.16.3   52 oi.16.39  61 oi.16.48  70 oi.16.57  22 oi.16.9

As per the above results:
Inode count:      300000
Free inodes:      299987
Inodes per group: 30000
flex_bg:          1
ROOT inode:       240005
The ROOT inode is assigned from the 9th of 10 groups, even though there are enough free inodes in the initial groups.
Attachments
Activity
Should the original patch be abandoned and this ticket marked as resolved, or does it need reworking?
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19541/
Subject: LU-7922 ldiskfs: correction in ext4_kzalloc
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 101e729708289e46fe858a6b7162f779e24dfa5a
lokesh.jaliminche (lokesh.jaliminche@seagate.com) uploaded a new patch: http://review.whamcloud.com/19541
Subject: LU-7922 ldiskfs: correction in ext4_kzalloc
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7fe7ede48a2e9872300fc2d95017e477e6e50bfe
It isn't clear why you are removing the __GFP_ZERO flag from __vmalloc().
That should be using kzalloc() instead of __GFP_ZERO, see commit v3.0-7217-gdb9481c04. Also, there should be __GFP_NOWARN for kmalloc() (see commit v3.11-rc2-221-g8be04b937). Please copy the commit messages from both of these commits and include the commit hashes in your patch, so that it is clear where the patch is coming from. It looks like the patches are only needed for RHEL6, not RHEL7.
Let me describe the actual problem. During my experiments I observed that most of the time the Orlov allocator did not work as expected (even in some obvious cases).
Here is the information about one such occurrence, based on some traces added to the code around find_group_orlov() and a comparison with the corresponding dumpe2fs output.
Refer to the sample debug logs from find_group_orlov() and the dumpe2fs output, with a flex group size of 2 and 64 total groups. As per the dumpe2fs log, free inodes per flex group should be at most 9376 (4688 * 2) and free blocks per flex group at most 65536 (32768 * 2), but the debug logs showed different values, which appeared problematic to me.
Debug Logs from find_group_orlov():
Apr 2 20:14:11 dev-1 kernel: stats.free_inodes: 18752
Apr 2 20:14:11 dev-1 kernel: stats.free_blocks: 128720
Apr 2 20:14:11 dev-1 kernel: stats.free_inodes: 18751
Apr 2 20:14:11 dev-1 kernel: stats.free_blocks: 128719
Apr 2 20:14:11 dev-1 kernel: stats.free_inodes: 18752
Apr 2 20:14:11 dev-1 kernel: stats.free_blocks: 128720
Apr 2 20:14:11 dev-1 kernel: stats.free_inodes: 18652
Apr 2 20:14:11 dev-1 kernel: stats.free_blocks: 128620
Apr 2 20:14:11 dev-1 kernel: stats.free_inodes: 18751
Apr 2 20:14:11 dev-1 kernel: stats.free_blocks: 128719
Apr 2 20:14:11 dev-1 kernel: stats.free_inodes: 18751
Apr 2 20:14:11 dev-1 kernel: stats.free_blocks: 127693
Apr 2 20:14:11 dev-1 kernel: stats.free_inodes: 18751
Apr 2 20:14:11 dev-1 kernel: stats.free_blocks: 128719
Apr 2 20:14:11 dev-1 kernel: stats.free_inodes: 18751
Apr 2 20:14:11 dev-1 kernel: stats.free_blocks: 128719
Apr 2 20:14:11 dev-1 kernel: stats.free_inodes: 18751
Apr 2 20:14:11 dev-1 kernel: stats.free_blocks: 128719
Apr 2 20:14:11 dev-1 kernel: stats.free_inodes: 18748
Apr 2 20:14:11 dev-1 kernel: stats.free_blocks: 128712
Apr 2 20:14:11 dev-1 kernel: stats.free_inodes: 18752
Apr 2 20:14:11 dev-1 kernel: stats.free_blocks: 128720
Apr 2 20:14:11 dev-1 kernel: stats.free_inodes: 18751
Apr 2 20:14:11 dev-1 kernel: stats.free_blocks: 128719
Apr 2 20:14:11 dev-1 kernel: stats.free_inodes: 18752
Apr 2 20:14:11 dev-1 kernel: stats.free_blocks: 128720
Apr 2 20:14:11 dev-1 kernel: stats.free_inodes: 18637
Apr 2 20:14:11 dev-1 kernel: stats.free_blocks: 126484
Apr 2 20:14:11 dev-1 kernel: stats.free_inodes: 18751
Apr 2 20:14:11 dev-1 kernel: stats.free_blocks: 127693
Apr 2 20:14:11 dev-1 kernel: stats.free_inodes: 18752
Apr 2 20:14:11 dev-1 kernel: stats.free_blocks: 127694
Apr 2 20:14:11 dev-1 kernel: stats.free_inodes: 18751
Dumpe2fs log:
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              300032
Block count:              2097152
Reserved block count:     104857
Free blocks:              1971094
Free inodes:              300019
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      511
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         4688
Inode blocks per group:   586
Flex block group size:    2
Filesystem created:       Sat Apr 2 20:13:54 2016
Last mount time:          Sat Apr 2 20:14:11 2016
Last write time:          Sat Apr 2 20:14:11 2016
Mount count:              3
Maximum mount count:      -1
Last checked:             Sat Apr 2 20:13:54 2016
Check interval:           0 (<none>)
Lifetime writes:          2981 kB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               512
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      f3983c1b-c083-4cad-ab48-d0bd740e31b1
Journal backup:           inode blocks
User quota inode:         3
Group quota inode:        4
Journal features:         (none)
Journal size:             327M
Journal length:           83712
Journal sequence:         0x00000012
Journal start:            1
Group 0: (Blocks 0-32767) [ITABLE_ZEROED] Checksum 0x15ae, unused inodes 4565 Primary superblock at 0, Group descriptors at 1-1 Reserved GDT blocks at 2-512 Block bitmap at 513 (+513), Inode bitmap at 515 (+515) Inode table at 517-1102 (+517) 30867 free blocks, 4565 free inodes, 2 directories, 4565 unused inodes Free blocks: 1700-2047, 2249-32767 Free inodes: 124-4688 Group 1: (Blocks 32768-65535) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0xe340, unused inodes 4688 Backup superblock at 32768, Group descriptors at 32769-32769 Reserved GDT blocks at 32770-33280 Block bitmap at 514 (bg #0 + 514), Inode bitmap at 516 (bg #0 + 516) Inode table at 1103-1688 (bg #0 + 1103) 32255 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 33281-65535 Free inodes: 4689-9376 Group 2: (Blocks 65536-98303) [ITABLE_ZEROED] Checksum 0xe1ef,
unused inodes 4687 Block bitmap at 65536 (+0), Inode bitmap at 65538 (+2) Inode table at 65540-66125 (+4) 31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes Free blocks: 66713-98303 Free inodes: 9378-14064 Group 3: (Blocks 98304-131071) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0x2952, unused inodes 4688 Backup superblock at 98304, Group descriptors at 98305-98305 Reserved GDT blocks at 98306-98816 Block bitmap at 65537 (bg #2 + 1), Inode bitmap at 65539 (bg #2 + 3) Inode table at 66126-66711 (bg #2 + 590) 32255 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 98817-131071 Free inodes: 14065-18752 Group 4: (Blocks 131072-163839) [INODE_UNINIT, ITABLE_ZEROED] Checksum 0x1ec7, unused inodes 4688 Block bitmap at 131072 (+0), Inode bitmap at 131074 (+2) Inode table at 131076-131661 (+4) 31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 132248-163839 Free inodes: 18753-23440 Group 5: (Blocks 163840-196607) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0xcbf4, unused inodes 4688 Backup superblock at 163840, Group descriptors at 163841-163841 Reserved GDT blocks at 163842-164352 Block bitmap at 131073 (bg #4 + 1), Inode bitmap at 131075 (bg #4 + 3) Inode table at 131662-132247 (bg #4 + 590) 32255 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 164353-196607 Free inodes: 23441-28128 Group 6: (Blocks 196608-229375) [ITABLE_ZEROED] Checksum 0x5d2b, unused inodes 4687 Block bitmap at 196608 (+0), Inode bitmap at 196610 (+2) Inode table at 196612-197197 (+4) 31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes Free blocks: 197785-229375 Free inodes: 28130-32816 Group 7: (Blocks 229376-262143) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0x9596, unused inodes 4688 Backup superblock at 229376, Group descriptors at 229377-229377 Reserved GDT blocks at 229378-229888 Block bitmap at 196609 (bg #6 + 1), Inode bitmap at 196611 
(bg #6 + 3) Inode table at 197198-197783 (bg #6 + 590) 32255 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 229889-262143 Free inodes: 32817-37504 Group 8: (Blocks 262144-294911) [ITABLE_ZEROED] Checksum 0x8606, unused inodes 4687 Block bitmap at 262144 (+0), Inode bitmap at 262146 (+2) Inode table at 262148-262733 (+4) 31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes Free blocks: 263321-294911 Free inodes: 37506-42192 Group 9: (Blocks 294912-327679) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0x4ebb, unused inodes 4688 Backup superblock at 294912, Group descriptors at 294913-294913 Reserved GDT blocks at 294914-295424 Block bitmap at 262145 (bg #8 + 1), Inode bitmap at 262147 (bg #8 + 3) Inode table at 262734-263319 (bg #8 + 590) 32255 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 295425-327679 Free inodes: 42193-46880 Group 10: (Blocks 327680-360447) [INODE_UNINIT, ITABLE_ZEROED] Checksum 0xc5ea, unused inodes 4688 Block bitmap at 327680 (+0), Inode bitmap at 327682 (+2) Inode table at 327684-328269 (+4) 31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 328856-360447 Free inodes: 46881-51568 Group 11: (Blocks 360448-393215) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0x9409, unused inodes 4688 Block bitmap at 327681 (bg #10 + 1), Inode bitmap at 327683 (bg #10 + 3) Inode table at 328270-328855 (bg #10 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 360448-393215 Free inodes: 51569-56256 Group 12: (Blocks 393216-425983) [INODE_UNINIT, ITABLE_ZEROED] Checksum 0x274c, unused inodes 4688 Block bitmap at 393216 (+0), Inode bitmap at 393218 (+2) Inode table at 393220-393805 (+4) 31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 394392-425983 Free inodes: 56257-60944 Group 13: (Blocks 425984-458751) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0x76af, 
unused inodes 4688 Block bitmap at 393217 (bg #12 + 1), Inode bitmap at 393219 (bg #12 + 3) Inode table at 393806-394391 (bg #12 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 425984-458751 Free inodes: 60945-65632 Group 14: (Blocks 458752-491519) [INODE_UNINIT, ITABLE_ZEROED] Checksum 0x792e, unused inodes 4688 Block bitmap at 458752 (+0), Inode bitmap at 458754 (+2) Inode table at 458756-459341 (+4) 31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 459928-491519 Free inodes: 65633-70320 Group 15: (Blocks 491520-524287) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0x28cd, unused inodes 4688 Block bitmap at 458753 (bg #14 + 1), Inode bitmap at 458755 (bg #14 + 3) Inode table at 459342-459927 (bg #14 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 491520-524287 Free inodes: 70321-75008 Group 16: (Blocks 524288-557055) [ITABLE_ZEROED] Checksum 0xcc9b, unused inodes 4687 Block bitmap at 524288 (+0), Inode bitmap at 524290 (+2) Inode table at 524292-524877 (+4) 31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes Free blocks: 525465-557055 Free inodes: 75010-79696 Group 17: (Blocks 557056-589823) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0x80f6, unused inodes 4688 Block bitmap at 524289 (bg #16 + 1), Inode bitmap at 524291 (bg #16 + 3) Inode table at 524878-525463 (bg #16 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 557056-589823 Free inodes: 79697-84384 Group 18: (Blocks 589824-622591) [INODE_UNINIT, ITABLE_ZEROED] Checksum 0x8f77, unused inodes 4688 Block bitmap at 589824 (+0), Inode bitmap at 589826 (+2) Inode table at 589828-590413 (+4) 31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 591000-622591 Free inodes: 84385-89072 Group 19: (Blocks 622592-655359) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0xde94, unused inodes 
4688 Block bitmap at 589825 (bg #18 + 1), Inode bitmap at 589827 (bg #18 + 3) Inode table at 590414-590999 (bg #18 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 622592-655359 Free inodes: 89073-93760 Group 20: (Blocks 655360-688127) [INODE_UNINIT, ITABLE_ZEROED] Checksum 0x6dd1, unused inodes 4688 Block bitmap at 655360 (+0), Inode bitmap at 655362 (+2) Inode table at 655364-655949 (+4) 31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 656536-688127 Free inodes: 93761-98448 Group 21: (Blocks 688128-720895) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0x3c32, unused inodes 4688 Block bitmap at 655361 (bg #20 + 1), Inode bitmap at 655363 (bg #20 + 3) Inode table at 655950-656535 (bg #20 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 688128-720895 Free inodes: 98449-103136 Group 22: (Blocks 720896-753663) [ITABLE_ZEROED] Checksum 0x2e3d, unused inodes 4687 Block bitmap at 720896 (+0), Inode bitmap at 720898 (+2) Inode table at 720900-721485 (+4) 31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes Free blocks: 722073-753663 Free inodes: 103138-107824 Group 23: (Blocks 753664-786431) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0x6250, unused inodes 4688 Block bitmap at 720897 (bg #22 + 1), Inode bitmap at 720899 (bg #22 + 3) Inode table at 721486-722071 (bg #22 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 753664-786431 Free inodes: 107825-112512 Group 24: (Blocks 786432-819199) [INODE_UNINIT, ITABLE_ZEROED] Checksum 0xe89e, unused inodes 4688 Block bitmap at 786432 (+0), Inode bitmap at 786434 (+2) Inode table at 786436-787021 (+4) 31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 787608-819199 Free inodes: 112513-117200 Group 25: (Blocks 819200-851967) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0x3dad, unused inodes 4688 
Backup superblock at 819200, Group descriptors at 819201-819201 Reserved GDT blocks at 819202-819712 Block bitmap at 786433 (bg #24 + 1), Inode bitmap at 786435 (bg #24 + 3) Inode table at 787022-787607 (bg #24 + 590) 32255 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 819713-851967 Free inodes: 117201-121888 Group 26: (Blocks 851968-884735) [ITABLE_ZEROED] Checksum 0xab72, unused inodes 4687 Block bitmap at 851968 (+0), Inode bitmap at 851970 (+2) Inode table at 851972-852557 (+4) 31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes Free blocks: 853145-884735 Free inodes: 121890-126576 Group 27: (Blocks 884736-917503) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0x63cf, unused inodes 4688 Backup superblock at 884736, Group descriptors at 884737-884737 Reserved GDT blocks at 884738-885248 Block bitmap at 851969 (bg #26 + 1), Inode bitmap at 851971 (bg #26 + 3) Inode table at 852558-853143 (bg #26 + 590) 32255 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 885249-917503 Free inodes: 126577-131264 Group 28: (Blocks 917504-950271) [INODE_UNINIT, ITABLE_ZEROED] Checksum 0x545a, unused inodes 4688 Block bitmap at 917504 (+0), Inode bitmap at 917506 (+2) Inode table at 917508-918093 (+4) 31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 918680-950271 Free inodes: 131265-135952 Group 29: (Blocks 950272-983039) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0x05b9, unused inodes 4688 Block bitmap at 917505 (bg #28 + 1), Inode bitmap at 917507 (bg #28 + 3) Inode table at 918094-918679 (bg #28 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 950272-983039 Free inodes: 135953-140640 Group 30: (Blocks 983040-1015807) [INODE_UNINIT, ITABLE_ZEROED] Checksum 0x0a38, unused inodes 4688 Block bitmap at 983040 (+0), Inode bitmap at 983042 (+2) Inode table at 983044-983629 (+4) 31592 free blocks, 4688 free inodes, 0 
directories, 4688 unused inodes Free blocks: 984216-1015807 Free inodes: 140641-145328 Group 31: (Blocks 1015808-1048575) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0x5bdb, unused inodes 4688 Block bitmap at 983041 (bg #30 + 1), Inode bitmap at 983043 (bg #30 + 3) Inode table at 983630-984215 (bg #30 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 1015808-1048575 Free inodes: 145329-150016 Group 32: (Blocks 1048576-1081343) [INODE_UNINIT, ITABLE_ZEROED] Checksum 0x442f, unused inodes 4688 Block bitmap at 1048576 (+0), Inode bitmap at 1048578 (+2) Inode table at 1048580-1049165 (+4) 31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 1049752-1081343 Free inodes: 150017-154704 Group 33: (Blocks 1081344-1114111) [INODE_UNINIT, ITABLE_ZEROED] Checksum 0x3a54, unused inodes 4688 Block bitmap at 1048577 (bg #32 + 1), Inode bitmap at 1048579 (bg #32 + 3) Inode table at 1049166-1049751 (bg #32 + 590) 0 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: Free inodes: 154705-159392 Group 34: (Blocks 1114112-1146879) [INODE_UNINIT, ITABLE_ZEROED] Checksum 0x8f83, unused inodes 4688 Block bitmap at 1114112 (+0), Inode bitmap at 1114114 (+2) Inode table at 1114116-1114701 (+4) 0 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: Free inodes: 159393-164080 Group 35: (Blocks 1146880-1179647) [INODE_UNINIT, ITABLE_ZEROED] Checksum 0xa274, unused inodes 4688 Block bitmap at 1114113 (bg #34 + 1), Inode bitmap at 1114115 (bg #34 + 3) Inode table at 1114702-1115287 (bg #34 + 590) 13333 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 1166315-1179647 Free inodes: 164081-168768 Group 36: (Blocks 1179648-1212415) [INODE_UNINIT, ITABLE_ZEROED] Checksum 0xf8eb, unused inodes 4688 Block bitmap at 1179648 (+0), Inode bitmap at 1179650 (+2) Inode table at 1179652-1180237 (+4) 31592 free blocks, 4688 free inodes, 0 directories, 
4688 unused inodes Free blocks: 1180824-1212415 Free inodes: 168769-173456 Group 37: (Blocks 1212416-1245183) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0xa908, unused inodes 4688 Block bitmap at 1179649 (bg #36 + 1), Inode bitmap at 1179651 (bg #36 + 3) Inode table at 1180238-1180823 (bg #36 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 1212416-1245183 Free inodes: 173457-178144 Group 38: (Blocks 1245184-1277951) [INODE_UNINIT, ITABLE_ZEROED] Checksum 0xa689, unused inodes 4688 Block bitmap at 1245184 (+0), Inode bitmap at 1245186 (+2) Inode table at 1245188-1245773 (+4) 31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 1246360-1277951 Free inodes: 178145-182832 Group 39: (Blocks 1277952-1310719) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0xf76a, unused inodes 4688 Block bitmap at 1245185 (bg #38 + 1), Inode bitmap at 1245187 (bg #38 + 3) Inode table at 1245774-1246359 (bg #38 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 1277952-1310719 Free inodes: 182833-187520 Group 40: (Blocks 1310720-1343487) [ITABLE_ZEROED] Checksum 0x602a, unused inodes 4687 Block bitmap at 1310720 (+0), Inode bitmap at 1310722 (+2) Inode table at 1310724-1311309 (+4) 31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes Free blocks: 1311897-1343487 Free inodes: 187522-192208 Group 41: (Blocks 1343488-1376255) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0x2c47, unused inodes 4688 Block bitmap at 1310721 (bg #40 + 1), Inode bitmap at 1310723 (bg #40 + 3) Inode table at 1311310-1311895 (bg #40 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 1343488-1376255 Free inodes: 192209-196896 Group 42: (Blocks 1376256-1409023) [INODE_UNINIT, ITABLE_ZEROED] Checksum 0x23c6, unused inodes 4688 Block bitmap at 1376256 (+0), Inode bitmap at 1376258 (+2) Inode table at 1376260-1376845 (+4) 31592 
free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 1377432-1409023 Free inodes: 196897-201584 Group 43: (Blocks 1409024-1441791) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0x7225, unused inodes 4688 Block bitmap at 1376257 (bg #42 + 1), Inode bitmap at 1376259 (bg #42 + 3) Inode table at 1376846-1377431 (bg #42 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 1409024-1441791 Free inodes: 201585-206272 Group 44: (Blocks 1441792-1474559) [ITABLE_ZEROED] Checksum 0x9a26, unused inodes 4588 Block bitmap at 1441792 (+0), Inode bitmap at 1441794 (+2) Inode table at 1441796-1442381 (+4) 31492 free blocks, 4588 free inodes, 100 directories, 4588 unused inodes Free blocks: 1443068-1474559 Free inodes: 206373-210960 Group 45: (Blocks 1474560-1507327) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0x9083, unused inodes 4688 Block bitmap at 1441793 (bg #44 + 1), Inode bitmap at 1441795 (bg #44 + 3) Inode table at 1442382-1442967 (bg #44 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 1474560-1507327 Free inodes: 210961-215648 Group 46: (Blocks 1507328-1540095) [ITABLE_ZEROED] Checksum 0x828c, unused inodes 4687 Block bitmap at 1507328 (+0), Inode bitmap at 1507330 (+2) Inode table at 1507332-1507917 (+4) 31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes Free blocks: 1508505-1540095 Free inodes: 215650-220336 Group 47: (Blocks 1540096-1572863) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0xcee1, unused inodes 4688 Block bitmap at 1507329 (bg #46 + 1), Inode bitmap at 1507331 (bg #46 + 3) Inode table at 1507918-1508503 (bg #46 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 1540096-1572863 Free inodes: 220337-225024 Group 48: (Blocks 1572864-1605631) [ITABLE_ZEROED] Checksum 0x2ab7, unused inodes 4687 Block bitmap at 1572864 (+0), Inode bitmap at 1572866 (+2) Inode table at 
1572868-1573453 (+4) 31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes Free blocks: 1574041-1605631 Free inodes: 225026-229712 Group 49: (Blocks 1605632-1638399) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0xe20a, unused inodes 4688 Backup superblock at 1605632, Group descriptors at 1605633-1605633 Reserved GDT blocks at 1605634-1606144 Block bitmap at 1572865 (bg #48 + 1), Inode bitmap at 1572867 (bg #48 + 3) Inode table at 1573454-1574039 (bg #48 + 590) 32255 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 1606145-1638399 Free inodes: 229713-234400 Group 50: (Blocks 1638400-1671167) [ITABLE_ZEROED] Checksum 0x74d5, unused inodes 4687 Block bitmap at 1638400 (+0), Inode bitmap at 1638402 (+2) Inode table at 1638404-1638989 (+4) 31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes Free blocks: 1639577-1671167 Free inodes: 234402-239088 Group 51: (Blocks 1671168-1703935) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0x38b8, unused inodes 4688 Block bitmap at 1638401 (bg #50 + 1), Inode bitmap at 1638403 (bg #50 + 3) Inode table at 1638990-1639575 (bg #50 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 1671168-1703935 Free inodes: 239089-243776 Group 52: (Blocks 1703936-1736703) [ITABLE_ZEROED] Checksum 0x9673, unused inodes 4687 Block bitmap at 1703936 (+0), Inode bitmap at 1703938 (+2) Inode table at 1703940-1704525 (+4) 31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes Free blocks: 1705113-1736703 Free inodes: 243778-248464 Group 53: (Blocks 1736704-1769471) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0xda1e, unused inodes 4688 Block bitmap at 1703937 (bg #52 + 1), Inode bitmap at 1703939 (bg #52 + 3) Inode table at 1704526-1705111 (bg #52 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 1736704-1769471 Free inodes: 248465-253152 Group 54: (Blocks 1769472-1802239) 
[ITABLE_ZEROED] Checksum 0xc811, unused inodes 4687 Block bitmap at 1769472 (+0), Inode bitmap at 1769474 (+2) Inode table at 1769476-1770061 (+4) 31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes Free blocks: 1770649-1802239 Free inodes: 253154-257840 Group 55: (Blocks 1802240-1835007) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0x847c, unused inodes 4688 Block bitmap at 1769473 (bg #54 + 1), Inode bitmap at 1769475 (bg #54 + 3) Inode table at 1770062-1770647 (bg #54 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 1802240-1835007 Free inodes: 257841-262528 Group 56: (Blocks 1835008-1867775) [ITABLE_ZEROED] Checksum 0x570c, unused inodes 4686 Block bitmap at 1835008 (+0), Inode bitmap at 1835010 (+2) Inode table at 1835012-1835597 (+4) 31588 free blocks, 4686 free inodes, 1 directories, 4686 unused inodes Free blocks: 1836185-1836543, 1836547-1867775 Free inodes: 262531-267216 Group 57: (Blocks 1867776-1900543) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0x5f51, unused inodes 4688 Block bitmap at 1835009 (bg #56 + 1), Inode bitmap at 1835011 (bg #56 + 3) Inode table at 1835598-1836183 (bg #56 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 1867776-1900543 Free inodes: 267217-271904 Group 58: (Blocks 1900544-1933311) [INODE_UNINIT, ITABLE_ZEROED] Checksum 0x50d0, unused inodes 4688 Block bitmap at 1900544 (+0), Inode bitmap at 1900546 (+2) Inode table at 1900548-1901133 (+4) 31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 1901720-1933311 Free inodes: 271905-276592 Group 59: (Blocks 1933312-1966079) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0x0133, unused inodes 4688 Block bitmap at 1900545 (bg #58 + 1), Inode bitmap at 1900547 (bg #58 + 3) Inode table at 1901134-1901719 (bg #58 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 1933312-1966079 Free inodes: 
276593-281280 Group 60: (Blocks 1966080-1998847) [ITABLE_ZEROED] Checksum 0xaff8, unused inodes 4687 Block bitmap at 1966080 (+0), Inode bitmap at 1966082 (+2) Inode table at 1966084-1966669 (+4) 31591 free blocks, 4687 free inodes, 1 directories, 4687 unused inodes Free blocks: 1967257-1998847 Free inodes: 281282-285968 Group 61: (Blocks 1998848-2031615) [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED] Checksum 0xe395, unused inodes 4688 Block bitmap at 1966081 (bg #60 + 1), Inode bitmap at 1966083 (bg #60 + 3) Inode table at 1966670-1967255 (bg #60 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 1998848-2031615 Free inodes: 285969-290656 Group 62: (Blocks 2031616-2064383) [INODE_UNINIT, ITABLE_ZEROED] Checksum 0xec14, unused inodes 4688 Block bitmap at 2031616 (+0), Inode bitmap at 2031618 (+2) Inode table at 2031620-2032205 (+4) 31592 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 2032792-2064383 Free inodes: 290657-295344 Group 63: (Blocks 2064384-2097151) [INODE_UNINIT, ITABLE_ZEROED] Checksum 0x7a0e, unused inodes 4688 Block bitmap at 2031617 (bg #62 + 1), Inode bitmap at 2031619 (bg #62 + 3) Inode table at 2032206-2032791 (bg #62 + 590) 32768 free blocks, 4688 free inodes, 0 directories, 4688 unused inodes Free blocks: 2064384-2097151 Free inodes: 295345-300032
Initially I thought that this was an issue with the Orlov allocator itself, hence my recommendation to substitute it with a better one. However, while that discussion was in progress, I did some more debugging to find out why Orlov did not work properly. I observed that find_group_orlov() relies on the stats information supplied by get_orlov_stats().
More logs revealed that get_orlov_stats() indeed returned incorrect stats. Finally I got to the root cause: the culprit was ldiskfs_kvzalloc(size, GFP_KERNEL), which is called in ldiskfs_fill_flex_info() to allocate sbi->s_flex_groups before the stats calculation. However, ldiskfs_kvzalloc() did not actually initialize s_flex_groups to zero, leading to accumulation on top of stale stats.
The simple fix for this would be to use the correct flag so that the allocation is initialized properly, i.e.
--- ldiskfs/super.c	2016-04-04 14:53:23.343136984 +0530
+++ ldiskfs/super.c.new	2016-04-04 14:53:09.946441993 +0530
@@ -96,7 +96,7 @@ void *ldiskfs_kvzalloc(size_t size, gfp_
 {
 	void *ret;

-	ret = kmalloc(size, flags);
+	ret = kmalloc(size, flags | __GFP_ZERO);
 	if (!ret)
-		ret = __vmalloc(size, flags | __GFP_ZERO, PAGE_KERNEL);
+		ret = __vmalloc(size, flags, PAGE_KERNEL);
 	return ret;
We can use the goal inode to achieve this, by selecting the goal inode from the group/flex group having the maximum number of empty blocks. Without this change, top-level directories will be spread out, which will lead to increased head movement.
Except for basic benchmarks on a newly-formatted filesystem, I don't think there is any value in this added complexity. There is always going to be head movement between directories of different users, as well as the journal, so the question is how to best optimize this. Putting all of the top-level subdirectories together may help with benchmarks but will not help in real-life usage when each of those directories may have many thousands or millions of their own files and subdirectories.
So far you haven't shown any evidence that any of the proposed changes is actually going to improve performance, reduce seeking under normal usage, or do anything beyond add some lines to the code. It is my expectation that adding EXT4_TOPDIR_FL on the ROOT/ inode will increase the spread between directories in the short term, but that has the long-term benefit of giving separate users (or projects, or whatever else is at the top level of the directory tree) more space to keep their own files together, and is more similar to how the root of a normal ext4 filesystem works.
Is your objection to the use of a random value in the Orlov group selection, or only to the initial placement of ROOT/, or something else?
I've also thought about this a few times - the Lustre "ROOT/" directory does not get the LDISKFS_TOPDIR_FL set at creation time like the real ldiskfs root directory "/" (inode #2), so that top-level directories created therein (in the Lustre-visible root directory) are spread across the filesystem more.
Correct
I suspect that the use of such a small MDT filesystem is also skewing the behaviour because there is only a single group which has many more free blocks than the others.
Yes, you are right. I have tried the above test case with a bigger MDT. It does distribute all the directories across the disk, but again this depends on the random number generator in the Orlov allocator.
It does make sense to keep all of the Lustre directories together (e.g. ROOT, PENDING, REMOTE_PARENT_DIR, etc) so that there is minimal head movement when they are updated.
Yes, I was looking at the contents of all the top-level directories (i.e. under /). Most of them have a fixed number of files; only the O and ROOT directories have subdirectories.
O: used to map legacy OST objects for compatibility. Since the repeated lookup() will only be called for "/O" at mount time, it will not affect overall performance.
ROOT: the directories created under this depend on the user. As per your suggestion, we can set the TOPDIR flag on ROOT so that subsequent subdirectories are spread out.
We can use a goal inode to achieve this, selecting the goal inode from the group/flex group that has the maximum number of empty blocks. Without this change, top-level directories will be spread out, which will lead to increased head movement.
I've also thought about this a few times - the Lustre "ROOT/" directory does not get the LDISKFS_TOPDIR_FL set at creation time like the real ldiskfs root directory "/" (inode #2), so that top-level directories created therein (in the Lustre-visible root directory) are spread across the filesystem more. It does make sense to keep all of the Lustre directories together (e.g. ROOT, PENDING, REMOTE_PARENT_DIR, etc) so that there is minimal head movement when they are updated.
Looking at the above directory listing it seems that the ldiskfs orlov allocator is doing exactly the right thing - selecting the ROOT directory to go into the first/best block group in the filesystem (i.e. most free blocks). I don't think anything needs to be changed for this. I suspect that the use of such a small MDT filesystem is also skewing the behaviour because there is only a single group which has many more free blocks than the others.
The only useful change that might be made here is to set the TOPDIR flag on ROOT so that subsequent subdirectories are also spread out over the filesystem.
My view: the Orlov allocator tries to distribute the allocation/placement of root-directory inodes, and offers better performance by ensuring they are spread out. Keeping a fixed group by defining the goal inode will cause all the root-directory inodes to stay together, which may defeat that purpose and cause performance issues.
Yes, you are right: the Orlov algorithm tries to spread out "top-level" directories for better performance (directories created in the root directory of a filesystem are considered top-level directories), but it does not guarantee this distribution across the disk because of its random behaviour.
Use case: contents of the /mnt/mds1 directory after mount:
[root@dev-1 lustre-wc-rel]# lustre/utils/mkfs.lustre --mgs --fsname=lustre --mdt --index=0 --param=sys.timeout=20 --param=lov.stripesize=1048576 --param=lov.stripecount=0 --param=mdt.identity_upcall=/mnt/lokesh/seagate/lustre-wc-rel/lustre/tests/../utils/l_getidentity --backfstype=ldiskfs --device-size=200000 --reformat /dev/sdb
[root@dev-1 lustre-wc-rel]# mkdir -p /mnt/mds1; mount -t lustre /dev/sdb /mnt/mds1
[root@dev-1 lustre-wc-rel]# mount -t ldiskfs /dev/sdb /mnt/test/
[root@dev-1 lustre-wc-rel]# ls -i /mnt/test/
101 BATCHID 123 lfsck_layout 26 oi.16.12 34 oi.16.20 43 oi.16.29 51 oi.16.37 59 oi.16.45 67 oi.16.53 75 oi.16.61 25041 quota_slave 102 changelog_catalog 106 lfsck_namespace 27 oi.16.13 35 oi.16.21 17 oi.16.3 52 oi.16.38 60 oi.16.46 68 oi.16.54 76 oi.16.62 25003 REMOTE_PARENT_DIR 103 changelog_users 11 lost+found 28 oi.16.14 36 oi.16.22 44 oi.16.30 53 oi.16.39 61 oi.16.47 69 oi.16.55 77 oi.16.63 90 reply_data 25001 CONFIGS 25026 NIDTBL_VERSIONS 29 oi.16.15 37 oi.16.23 45 oi.16.31 18 oi.16.4 62 oi.16.48 70 oi.16.56 21 oi.16.7 25043 ROOT 86 fld 25002 O 30 oi.16.16 38 oi.16.24 46 oi.16.32 54 oi.16.40 63 oi.16.49 71 oi.16.57 22 oi.16.8 87 seq_ctl 104 hsm_actions 14 oi.16.0 31 oi.16.17 39 oi.16.25 47 oi.16.33 55 oi.16.41 19 oi.16.5 72 oi.16.58 23 oi.16.9 88 seq_srv 89 last_rcvd 15 oi.16.1 32 oi.16.18 40 oi.16.26 48 oi.16.34 56 oi.16.42 64 oi.16.50 73 oi.16.59 13 OI_scrub 100 update_log 25048 LFSCK 24 oi.16.10 33 oi.16.19 41 oi.16.27 49 oi.16.35 57 oi.16.43 65 oi.16.51 20 oi.16.6 25047 PENDING 25042 update_log_dir 105 lfsck_bookmark 25 oi.16.11 16 oi.16.2 42 oi.16.28 50 oi.16.36 58 oi.16.44 66 oi.16.52 74 oi.16.60 25038 quota_master
We can see here that the Orlov allocator (without a goal inode) keeps the "top-level" directories in the same group, i.e. #1. This is probably because of its random behaviour. When creating a directory that is not in a top-level directory, the Orlov algorithm tries to put it into the same cylinder group as its parent.
By using a goal inode (a different one for each top-level directory), we can guarantee the distribution of "top-level" directories across the disk, and in this way make the resulting performance predictable.
Patch was landed for 2.9.0.