[LU-7592] Change force_over_128tb lustre mount option to force_over_256tb for ldiskfs Created: 22/Dec/15  Updated: 08/Dec/17  Resolved: 18/Apr/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Improvement Priority: Major
Reporter: Artem Blagodarenko (Inactive) Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-8974 Change force_over_256tb lustre mount ... Resolved
is related to LU-8465 parallel e2fsck performance at scale Resolved

 Description   

Currently, attempts to create an ldiskfs file system larger than 128TB fail with the message:

LDISKFS-fs does not support file systems greater than 128TB and can cause data corruption. Use "force_over_128tb" mount option to override.

Before the "force_over_128tb" option is used on production systems, the Lustre file system software should be analyzed to identify possible issues with large-disk support. This ticket covers that research into the relevant aspects of the Lustre software. Finally, a patch changing "force_over_128tb" to "force_over_256tb" should be landed. This makes it possible to use ldiskfs partitions up to 256TB without extra mount options.



 Comments   
Comment by Gerrit Updater [ 22/Dec/15 ]

Artem Blagodarenko (artem.blagodarenko@seagate.com) uploaded a new patch: http://review.whamcloud.com/17702
Subject: LU-7592 osd-ldiskfs: increase supported ldiskfs fs size limit
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1bc3e656ae7096711c2ce6310234a81b989089b8

Comment by Artem Blagodarenko (Inactive) [ 24/Dec/15 ]

Verification steps:
1. Lustre code review to verify it is ready for targets up to 256TB
2. Testing

Issues verified and tested:
1. Default ldiskfs parameters for the command
mkfs.lustre --ost --fsname=testfs --mountfsoptions='force_over_128tb' /dev/md1
which result in the following mke2fs options:

-J size=400 -I 256 -i 1048576 -q -O extents,uninit_bg,dir_nlink,huge_file,64bit,flex_bg -G 256 -E lazy_journal_init,lazy_itable_init=0 -F

2. Inode count limitation
There is an inode count limitation check in the mkfs utility (misc/mke2fs.c): "num_inodes > MAX_32_NUM".
With the current option -i 1048576, the inode count for a 256TB OST is 256M, which is less than 2^32 - 1. The smallest bytes-per-inode ratio that keeps a 256TB file system within this limit is 65537 (256TB / 65536 = 2^32 inodes); if the parameter is smaller, the inode count is truncated by the mkfs utility to the maximum possible value.
The case with -i < 32769 was successfully tested.

The required MDS inode count can be calculated: it has to be greater than the OST inode count multiplied by the number of OSTs (this calculation assumes the worst case of 1 stripe per file). With the current option -i 1048576, the inode count for a 256TB OST is 256M. The maximum inode count is 2^32 = 4294967296, so this limit is exceeded with 4294967296 / 256M = 16 OSTs.
Often this parameter is smaller for an MDS than the default (-i 4096, for example).
Such a ratio (-i 4096) cannot be used for a 256TB disk, because it exceeds the 4G inode limit. Currently, because of the inode count limitation, an MDT sometimes cannot be used fully. This is probably the time to extend this limit (should we add such a task?).
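As a cross-check, a minimal sketch of this arithmetic in plain C (the constants come from the paragraph above; this is not Lustre code):

#include <stdio.h>
#include <stdint.h>

/* Worked example of the MDT inode budget above, worst case with
 * stripe count 1: every OST object consumes one MDT inode. */
int main(void)
{
        uint64_t ost_bytes       = 256ULL << 40;                 /* 256TB OST */
        uint64_t bytes_per_inode = 1048576;                      /* mke2fs -i 1048576 */
        uint64_t ost_inodes      = ost_bytes / bytes_per_inode;  /* 256M */
        uint64_t mdt_inode_max   = 1ULL << 32;                   /* ldiskfs inode limit */

        printf("inodes per 256TB OST: %llu\n",
               (unsigned long long)ost_inodes);
        printf("OSTs until the MDT inode limit: %llu\n",
               (unsigned long long)(mdt_inode_max / ost_inodes)); /* 16 */
        return 0;
}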

3. Directories format: 32 directories with 64k files each
An OST has 32 object directories, and each of them can store 64k files. Thus, the limit on the number of files on an OST is
65536 * 32 = 2097152 files. The dir_nlink option addresses this situation. The pieces of code that implement this option are shown below:

/* When a directory accumulates more links than the 16-bit
 * i_links_count can represent, set i_nlink to 1 ("unknown/many")
 * and flag the DIR_NLINK read-only-compatible feature. */
static void ext4_inc_count(handle_t *handle, struct inode *inode)
{
        inc_nlink(inode);
        if (is_dx(inode) && inode->i_nlink > 1) {
                /* limit is 16-bit i_links_count */
                if (inode->i_nlink >= EXT4_LINK_MAX || inode->i_nlink == 2) {
                        inode->i_nlink = 1;
                        EXT4_SET_RO_COMPAT_FEATURE(inode->i_sb,
                                              EXT4_FEATURE_RO_COMPAT_DIR_NLINK);
                }
        }
}

/*
 * If a directory had nlink == 1, then we should let it be 1. This indicates
 * directory has >LDISKFS_LINK_MAX subdirs.
 */
static void ldiskfs_dec_count(handle_t *handle, struct inode *inode)
{
        if (!S_ISDIR(inode->i_mode) || inode->i_nlink > 2)
                drop_nlink(inode);
}

There are some doubts about how this code works when i_nlink becomes less than EXT4_LINK_MAX. There is the sanity test run_test 51b "exceed 64k subdirectory nlink limit", but it has some issues:
a. It tests exceeding the 64k subdirectory limit on the MDS, but an OST differs from an MDS (at least the OST goes through the VFS)
b. The test doesn't create 64k files
The requirements for test improvement were added to 3.1.3.
Test case (a standalone sketch follows below):
a. Create more than 64k files on an ldiskfs partition
b. Delete files so that the file count drops below EXT4_LINK_MAX
c. Force a file system check with fsck (OST)
Successfully tested.
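The test case above speaks of files, but a directory's i_nlink is driven by its subdirectories (each subdir's ".." counts as a link), so a minimal standalone sketch of the same check uses mkdir; /mnt/ost is a hypothetical ldiskfs mount point:

#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

/* With the dir_nlink feature, the parent's link count collapses to 1
 * ("many") once the subdirectory count passes EXT4_LINK_MAX (65000),
 * and must stay at 1 after entries are deleted. */
int main(void)
{
        char path[64];
        struct stat st;
        int i;

        mkdir("/mnt/ost/testdir", 0755);
        for (i = 0; i < 70000; i++) {
                snprintf(path, sizeof(path), "/mnt/ost/testdir/d%d", i);
                if (mkdir(path, 0755) != 0) {
                        perror("mkdir");
                        return 1;
                }
        }
        stat("/mnt/ost/testdir", &st);
        printf("nlink after 70000 subdirs: %lu\n", (unsigned long)st.st_nlink);

        rmdir("/mnt/ost/testdir/d0");
        stat("/mnt/ost/testdir", &st);
        printf("nlink after one rmdir: %lu\n", (unsigned long)st.st_nlink);
        return 0;
}

Step (c), a forced fsck of the OST, then verifies that e2fsck accepts the collapsed link count.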

4. Performance near the first and last blocks of the disk
Due to the large disk size, some performance loss at the end of the disk surface is possible. There are mkfs options that move some metadata toward the start of the disk (flex_bg and -G). These options are used in some configurations, but the numbers should be revisited. "-G 256" means that the bitmaps and inode tables of every 256 block groups are stored together. This parameter can be adjusted for the new disk size. The patch that adds the "-G" option landed in LU-6442.

5. ldiskfs data structures limitations
5.1 ext4_map_inode_page function's blocks parameter should be 64 bits wide
There is a function with the parameter "unsigned long *blocks":

int ext4_map_inode_page(struct inode *inode, struct page *page,
                        unsigned long *blocks, int create)

But ext4_bmap returns a sector_t value:

static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
blocks[i] = ext4_bmap(inode->i_mapping, iblock);

Depending on kernel configuration, sector_t can be 32 or 64 bits wide:
/*
 * The type used for indexing onto a disc or disc partition.
 *
 * Linux always considers sectors to be 512 bytes long independently
 * of the devices real block size.
 *
 * blkcnt_t is the type of the inode's block count.
 */
#ifdef CONFIG_LBDAF
typedef u64 sector_t;
typedef u64 blkcnt_t;
#else
typedef unsigned long sector_t;
typedef unsigned long blkcnt_t;
#endif

CONFIG_LBDAF: Enable block devices or files of size 2TB and larger. This option is required to support the full capacity of large (2TB+) block devices, including RAID, disk, Network Block Device, Logical Volume Manager (LVM) and loopback. This option also enables support for single files larger than 2TB. The ext4 filesystem requires that this feature be enabled in order to support filesystems that have the huge_file feature enabled. Otherwise, it will refuse to mount in the read-write mode any filesystems that use the huge_file feature, which is enabled by default by mke2fs.ext4. The GFS2 filesystem also requires this feature. If unsure, say Y.

So we need to use sector_t for this array of blocks.
The field dr_blocks in osd_iobuf and its users should be corrected as well.
This fix matters for 32-bit x86 systems only, because unsigned long is already 64 bits wide on x86_64. The fix was uploaded under LU-6464 and has landed.
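The direction of the fix, as a sketch (LU-6464 also adjusts dr_blocks in osd_iobuf; this shows the prototype change only, not the exact landed diff):

/* blocks[] must hold 64-bit block numbers even on 32-bit kernels
 * built with CONFIG_LBDAF, so sector_t replaces unsigned long: */
int ext4_map_inode_page(struct inode *inode, struct page *page,
                        sector_t *blocks, int create);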

6. Obdfilter. Block addressing etc.
Nothing suspicious.

7. Extended attribute inode probable overflow
The xattr entry stores a 32-bit inode number (as expected):
__le32 e_value_inum; /* inode in which the value is stored */
The xattr inode's blocks are addressed using local block counters.
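For context, a sketch of the on-disk xattr entry that carries this field (as in the ldiskfs large-xattr patches of this era; treat the exact field order as an assumption):

struct ext4_xattr_entry {
        __u8    e_name_len;     /* length of name */
        __u8    e_name_index;   /* attribute name index */
        __le16  e_value_offs;   /* offset in disk block of value */
        __le32  e_value_inum;   /* inode in which the value is stored */
        __le32  e_value_size;   /* size of attribute value */
        __le32  e_hash;         /* hash value of name and value */
        char    e_name[0];      /* attribute name */
};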

8. Quota limits: sizes and inodes
Nothing changed. 32-bit counters are used for inode addressing, and quotas are still ready for such counters.

9. llog: llog id limitations
The llog subsystem uses llog_logid, which contains an ost_id made of 64-bit fields.
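A simplified sketch of the identifiers involved (field names as in lustre_idl.h of this era; the real ost_id is a union that can also hold a struct lu_fid):

struct ost_id {
        __u64   oi_id;          /* 64-bit object id */
        __u64   oi_seq;         /* 64-bit sequence number */
};

struct llog_logid {
        struct ost_id   lgl_oi;         /* 64-bit id + sequence */
        __u32           lgl_ogen;       /* generation */
} __attribute__((packed));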

10. Tools: fsck and 64-bit block numbers
e2fsprogs has a 64-bit type for addressing blocks by number:
typedef __u64 __bitwise blk64_t;
and a 32-bit version:
typedef __u32 __bitwise blk_t;

1) blk_t is used for bad-block handling in the wrong way. There is a patch that changes bad-block numbers to 64 bits, http://patchwork.ozlabs.org/patch/279297/; we could port it or redo it from scratch. (LU-XXXX)
2) Some functions in the bitmap layer use blk_t, and sometimes blk_t and blk64_t are used in the same operation (a sketch of the hazard follows below). However, for large EXT4 file systems extents are used for addressing blocks, so the bitmap code is not used. (LU-XXXX)
3) Hurd translators
4) Kept for backward compatibility
(LU-XXXX)
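A minimal sketch of the hazard in point 2, using stand-ins for the e2fsprogs typedefs (plain C, not e2fsprogs code):

#include <stdio.h>
#include <stdint.h>

/* Stand-ins for the e2fsprogs block types quoted above. */
typedef uint64_t blk64_t;
typedef uint32_t blk_t;

int main(void)
{
        blk64_t big = 0x123456789ULL;   /* block number above 2^32 - 1 */
        blk_t small = (blk_t)big;       /* silently truncated on assignment */

        printf("blk64_t 0x%llx -> blk_t 0x%x\n",
               (unsigned long long)big, (unsigned)small);
        return 0;
}

On a 4KB-blocksize file system, block numbers cross 2^32 - 1 at 16TB, so any such mixed assignment silently corrupts addresses on the targets this ticket is about.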

11. e2fsprogs update
It looks like all 64-bit-related patches have landed in master (http://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/log/).

12. fsck time
fsck of a full 256TB partition without errors should complete in a reasonable time. This needs to be checked.

13. lfsck
lfsck doesn't use global block counters. There are also no other limitations.

For the points marked (LU-XXXX) above, patches will be uploaded in the near future.

Comment by Andreas Dilger [ 17/Apr/16 ]

Thank you for this detailed analysis. For some reason I don't recall reading it, maybe because it was posted on Christmas and I was on holidays for a couple of weeks and missed it on my return. In any case it looks very thorough.

Some issues I think are important in this area to discuss in advance if you plan to keep enhancing ext4 for even larger OSTs:

  • badblocks: this is generally unused, which is why the patch was rejected. That said, I don't know if Darrick went back and audited the badblocks code to properly reject block numbers larger than 2^32 or not.
  • if filesystems get any larger, we will need to force the ext4 meta_bg feature on, because the group descriptors will not be able to fit into the first group after the descriptor table grows beyond 32767 blocks, ~= 2M groups ~= 256 TB (see the worked numbers after this list). The meta_bg option is much more efficient than without, but suffers from a lack of robustness because there is only a single copy of the last group's descriptor block (unfortunately this feature was implemented and landed in private before such issues could be discussed and resolved).
  • I don't think there is much value to ldiskfs MDTs with more than 4B inodes. It is always possible to use DNE, which will give better performance and workload isolation and allow parallel e2fsck, and in any case there are relatively few systems that are even hitting the 4B limit before seeing problems with performance. If you did want to go down this route, then it makes sense to use the dirdata feature to allow optionally storing the high 32 bits of the inode number into direntries, which is what the first dirdata bit was reserved for. This would keep compatibility with existing directories, and this feature could be enabled on existing filesystems without the need to rewrite all directories with a 64-bit inode direntry, or have problems adding a 64-bit inode number to an existing directory with only 32-bit dirents.
  • probably a feature like bigalloc would be interesting for OSTs since it can speed up allocation performance, but the drawback is that this is very inefficient for small files. This might be compensated by having larger inodes (e.g. 4KB) and then using the inline data feature to store smaller files inside the inode. Another benefit of bigalloc is to avoid fragmentation of the O/0/d* directories.
  • e2fsck performance will become an issue at this scale, and it would likely need to be parallelized to be able to complete in a reasonable time. One could reasonably expect multiple disks at this scale, so having larger numbers of IOs in flight would help, as would an event-driven model with AIO that generates lists of blocks to check (itable blocks first), submits them to disk, and then processes them as they are read, generating more blocks to read (more itable blocks, indirect/index/xattr/directory blocks, etc.), and repeats.
  • I'm not sure if the 16TB extent-mapped file size limit will be important for Lustre, since it is always possible (and desirable for many reasons) to stripe a file widely long before this size is hit for a single file. With PFL it is also possible to restripe a file widely at the end to avoid this problem. True, it would be possible to fill the whole Lustre filesystem with a single file, but that has never been a concern in the past and we've had OSTs > 16TB for some time.
  • the three-level htree/2GB+ directory patch for e2fsck is relatively well understood and described in LU-1365 and seems like a good place to start. The htree limit is relatively easy to test with 1KB blocksize and long filenames with hard links (createmany -l). This has been discussed many times with the other ext4 devs and would very likely be accepted with little complaint.
  • the large xattr patch needs to be able to store 64KB xattrs directly into blocks, and is described in LU-908 in detail. Kalpak is also very aware of this, as he worked on it in the past. This might also speed up wide striped file access a bit.
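As a worked check of the meta_bg arithmetic in the second bullet, a minimal sketch (assuming 4KB blocks and 64-byte group descriptors, i.e. the 64bit feature; plain C, not ext4 code):

#include <stdio.h>
#include <stdint.h>

/* Without meta_bg the group descriptor table must fit in group 0,
 * which caps the number of groups and hence the file system size. */
int main(void)
{
        uint64_t block_size    = 4096;
        uint64_t group_blocks  = block_size * 8;              /* one block bitmap: 32768 */
        uint64_t desc_size     = 64;                          /* with the 64bit feature */
        uint64_t descs_per_blk = block_size / desc_size;      /* 64 */

        uint64_t max_groups  = group_blocks * descs_per_blk;  /* ~2M groups */
        uint64_t group_bytes = group_blocks * block_size;     /* 128MB per group */

        printf("max groups: %llu\n", (unsigned long long)max_groups);
        printf("max fs size: %lluTB\n",
               (unsigned long long)((max_groups * group_bytes) >> 40)); /* 256 */
        return 0;
}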

If you are planning to do further enhancements to ldiskfs, I'd strongly recommend to discuss them on the linux-ext4 mailing list first, so they have a chance to be improved and hopefully landed instead of being for Lustre only.

Comment by Andreas Dilger [ 17/Apr/16 ]

More on the MDT side, a couple of interesting possibilities exist:

  • the inline_data feature may be of interest on the MDT together with Data-on-MDT, or for small directories.
  • shrinking existing very large but mostly empty directories could be done efficiently. The high bits of the htree logical block pointers are reserved for storing the "fullness" of each leaf block. With the 3-level htree patch, there are 4 bits of space there, which is enough to have 1/16 gradients of fullness. The idea is that when adjacent blocks become less than, say, 1/3 or 1/4 full they could be merged when deleting files. We don't want to merge when just below 1/2 full, since this could cause repeated split/merge cycles, so some hysteresis is needed. This is actually a topic of interest for ext4 right now, because of the high latency to ls a large-but-mostly-empty directory.

PS: if you do plan on working on any new features, we should move the discussion to new tickets, if they don't already have one.

Comment by Gerrit Updater [ 22/Apr/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17702/
Subject: LU-7592 osd-ldiskfs: increase supported ldiskfs fs size limit
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5ca1a1e01d456c09d11d8a3409a83e055a7974a1

Comment by Gerrit Updater [ 26/Apr/16 ]

Artem Blagodarenko (artem.blagodarenko@seagate.com) uploaded a new patch: http://review.whamcloud.com/19788
Subject: LU-7592 osd-ldiskfs: remove force_over_128 warning
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8f912feba1ce961ab9ba060f7d0674c13968f4a0

Comment by Artem Blagodarenko (Inactive) [ 10/Feb/17 ]

https://review.whamcloud.com/#/c/19788 was abandoned because its change landed as part of https://review.whamcloud.com/#/c/24524

Comment by Andreas Dilger [ 18/Apr/17 ]

The two patches here were landed for 2.9.0.
