[LU-13533] Disable lazy_itable_init Created: 07/May/20  Updated: 15/Feb/22  Resolved: 01/Sep/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Improvement Priority: Major
Reporter: Artem Blagodarenko (Inactive) Assignee: Artem Blagodarenko (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-14767 mkfs.lustre unable to disable lazy_it... Resolved
is related to LU-8144 add lazyinit progress support Reopened
Rank (Obsolete): 9223372036854775807

 Description   

lazyinit takes more than 24 hours on a typical OST installation and generates background writes while read tests are running. This affects benchmark tests, which are usually run right after cluster installation is complete.

Testing shows that disabling this feature adds ~30 sec to formatting an OST drive, which is not a noticeable amount of time during installation.



 Comments   
Comment by Gerrit Updater [ 07/May/20 ]

Artem Blagodarenko (c17828@cray.com) uploaded a new patch: https://review.whamcloud.com/38534
Subject: LU-13533 utils: ext4lazyinit should be disabled
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 35524f6c2c2fea4f20d8fda4ddfdf2f94dfd1253

Comment by Gerrit Updater [ 01/Sep/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38534/
Subject: LU-13533 utils: ext4lazyinit should be disabled
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 701cc249594eae08f6e762eff74183e768c2cee3

Comment by Peter Jones [ 01/Sep/20 ]

Landed for 2.14

Comment by Andreas Dilger [ 13/Feb/21 ]

Artem, did you ever look into why the lazy_itable_init was taking so long in the kernel? Since mke2fs is also calling the same underlying blkdev_issue_zeroout() code (via fallocate(FALLOC_FL_ZERO_RANGE) on the block device) as the kernel (via ext4_issue_zeroout->sb_issue_zeroout()), it must be more of a scheduling issue from the ext4_lazyinit_thread.

It looks like the default is to schedule the next group to start 10x the time it took to zero out the current group (default s_li_wait_mult = EXT4_DEF_LI_WAIT_MULT = 10), so at best it could use 10% of the available disk bandwidth. It is possible to use "-o lazy_itable_init=N" to tune this delay, although it looks like setting s_li_wait_mult=0, or having the writes complete in less than 1 jiffy (maybe with flash), will cause all of the writes to be synchronous:

                timeout = jiffies;
                ret = ext4_init_inode_table(sb, group,
                                            elr->lr_timeout ? 0 : 1);
                if (elr->lr_timeout == 0) {
                        timeout = (jiffies - timeout) *
                                  elr->lr_sbi->s_li_wait_mult;
                        elr->lr_timeout = timeout;
                }

At a minimum, the above code should set elr->lr_timeout = timeout ?: 1 so that it doesn't do all writes synchronously for fast storage.
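
As a rough illustration of the point above, here is a minimal userspace sketch of the quoted timeout logic. It is not the kernel code: struct li_request, li_step() and li_duty_cycle() are hypothetical names, the clamp argument stands in for the suggested "timeout ?: 1" change, and the third argument of ext4_init_inode_table() is assumed to be a barrier/wait flag. Under the simple model "zero one group, then wait mult times as long", the zeroing thread is busy for about 1/(1 + mult) of the time, which is roughly 9% for the default multiplier of 10.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the fields used by the quoted snippet. */
struct li_request {
        unsigned long lr_timeout;   /* delay before the next group, in jiffies */
        unsigned long wait_mult;    /* models s_li_wait_mult (default 10) */
};

/*
 * Model one pass over a group that took `elapsed` jiffies to zero.
 * Returns the flag that would be passed as the third argument of
 * ext4_init_inode_table() (1 = wait for the writes, an assumption here).
 * If `clamp` is nonzero, a computed timeout of 0 is raised to 1,
 * which is the portable spelling of "timeout ?: 1".
 */
static int li_step(struct li_request *r, unsigned long elapsed, int clamp)
{
        int barrier;

        assert(r != NULL);
        barrier = r->lr_timeout ? 0 : 1;
        if (r->lr_timeout == 0) {
                unsigned long timeout = elapsed * r->wait_mult;

                if (clamp && timeout == 0)
                        timeout = 1;
                r->lr_timeout = timeout;
        }
        return barrier;
}

/* Fraction of wall time spent zeroing when every group is followed
 * by a wait of elapsed * wait_mult. Returns 1.0 when nothing waits. */
static double li_duty_cycle(unsigned long elapsed, unsigned long wait_mult)
{
        unsigned long wait = elapsed * wait_mult;

        if (elapsed + wait == 0)
                return 1.0;
        return (double)elapsed / (double)(elapsed + wait);
}
```

In this model, a group that finishes in 0 jiffies leaves lr_timeout at 0 without the clamp, so every later call to li_step() keeps returning 1. With the clamp, only the first call does.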

A better solution would be to schedule the inode table zeroing as fast as possible after mounting, and only throttle it down when the filesystem is "busy" with user requests. This could be done by using "trylock" variants for alloc_sem, etc. to detect if there is contention, or by having a simple percpu counter in some of the main incoming codepaths (e.g. lookup, read, write) and checking if this is changing between calls to detect if there is userspace activity. I'd thought about maybe using the superblock s_kbytes_written counter for this, but it is only updated at unmount time, and from the underlying block device, so that would be affected by the zeroout itself.
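
The counter-based idea above could be sketched in userspace C roughly as follows. This is only an illustration under stated assumptions: a single C11 atomic stands in for the percpu counter, and fs_note_activity(), struct zero_throttle and zero_next_delay() are hypothetical names that do not exist in ext4.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a percpu counter that the lookup, read
 * and write paths would increment. */
static atomic_ulong fs_activity;

/* Called from the (modelled) user-facing code paths. */
static void fs_note_activity(void)
{
        atomic_fetch_add_explicit(&fs_activity, 1, memory_order_relaxed);
}

/* State kept by the (modelled) background zeroing thread. */
struct zero_throttle {
        unsigned long last_seen;    /* counter value at the previous check */
        unsigned long busy_delay;   /* delay to apply when activity is seen */
};

/*
 * Decide how long to wait before zeroing the next group:
 * 0 if the counter did not change since the previous call (idle),
 * busy_delay if it did (user requests are in flight).
 */
static unsigned long zero_next_delay(struct zero_throttle *t)
{
        unsigned long now;
        unsigned long delay;

        assert(t != NULL);
        now = atomic_load_explicit(&fs_activity, memory_order_relaxed);
        delay = (now != t->last_seen) ? t->busy_delay : 0;
        t->last_seen = now;
        return delay;
}
```

With this shape, the zeroing loop runs back to back on an idle filesystem and only backs off for the interval in which the counter moved. The counter is bumped only by user-facing paths, so the zeroout's own writes do not register as activity.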

Comment by Artem Blagodarenko (Inactive) [ 15/Feb/21 ]

adilger do you have plans to re-enable lazy_itable_init for Lustre?

Comment by Andreas Dilger [ 15/Feb/21 ]

I've been thinking about using the same mechanism for doing background TRIM operations, instead of the current "-o discard" code, so I would like to know what is wrong with the current code.

Comment by Malcolm Haak (Inactive) [ 11/Oct/21 ]

While it might only add a few seconds to an OST, it means our MDTs are going to take over 2 hrs EACH to format. We have like 8... which makes formatting MDTs a multi-day affair.

Can we re-look at this?

Comment by Andreas Dilger [ 11/Oct/21 ]

haaknci, you can still use "--mkfsoptions='-E lazy_itable_init'" when formatting your MDTs. This is just a change to the default options because we had a number of complaints that performance after formatting the filesystem was slow.

Comment by Malcolm Haak (Inactive) [ 11/Oct/21 ]

Not on 2.14.0 you can't. It has a bug. I've cherry picked the 2.15 fix.

Generated at Sat Feb 10 03:02:05 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.