[LU-1248] all mdt_rdpg_* threads busy in osd_ea_fid_get() Created: 21/Mar/12  Updated: 01/Feb/14  Resolved: 26/Jun/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Ned Bass Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None
Environment:

https://github.com/chaos/lustre/tree/2.1.0-llnl


Severity: 3
Rank (Obsolete): 6428

 Description   

The load average on the MDS for a classified production 2.1 filesystem jumped to over 400. Top showed mdt_rdpg_* threads all using 4-7% CPU time. This may have been due to a pathological workload, but we were wondering if there's something like an overly contended lock in ldiskfs going on here.

Most of the stacks looked like this:

__cond_resched
_cond_resched
ifind_fast
iget_locked
ldiskfs_iget
? generic_detach_inode
osd_iget
osd_ea_fid_get
osd_it_ea_rec
mdd_readpage
cml_readpage
mdt_readpage
? mdt_unpack_req_pack_rep
mdt_handle_common
? lustre_msg_get_transno
mdt_readpage_handle
ptlrpc_main
child_rip



 Comments   
Comment by Peter Jones [ 22/Mar/12 ]

Lai

Could you please comment on this one?

Thanks

Peter

Comment by Lai Siyao [ 01/Apr/12 ]

This looks normal from the code. Basically it's an inode scalability problem: these busy threads are contending on inode_lock. Nick Piggin's inode lock scalability patches are being merged into kernel 3.x.

Currently MDT still uses directory+ea to store metadata, while IAM looks to have better performance and scalability, but I'm not clear why it's not enabled yet.

Comment by Lai Siyao [ 05/Apr/12 ]

Ned, was the system upgraded from 1.8? Normally readdir can get the FID from the dir data, so it doesn't need to read the FID from the EA, but on an upgraded system it needs to query each inode. Besides, did you see any error messages related to this dir?

Comment by Ned Bass [ 05/Apr/12 ]

Yes the system was upgraded from 1.8. Will files created after the upgrade store the fid in the dir data?

There are a few "osd_object_delete() Failed to cleanup: -2" console messages on the MDS from around that time. I don't find any other errors worth mentioning.

Comment by Lai Siyao [ 06/Apr/12 ]

Yes, on an upgraded system even a newly created dir won't store the FID in dir data; I'll check whether it's easy to implement this.

Comment by Lai Siyao [ 06/Apr/12 ]

I don't find an easy way to implement this: without a change to the disk format, there's no way to distinguish a 1.8 directory from a newly created directory. The original design for 1.8 <-> 2.x server interoperability is in bz11826.

Comment by nasf (Inactive) [ 04/Jun/12 ]

There is an incompatible format between b1_8 and b2_1: in b1_8, lvfs_dentry_params is appended after the name entry in the parent directory; but in b2_1, it is ldiskfs_dentry_param. They are different and incompatible. So when a system is upgraded from b1_8 to b2_1, a newly created file cannot append ldiskfs_dentry_param (which contains the FID) after its name entry in the parent directory; otherwise, the system cannot be downgraded back to b1_8.

But without the FID appended after the name entry in the parent directory, there is a performance regression (for dir readpage). I do not think this is a good solution, because upgrade is more common than downgrade.

We should make a patch in b2_x to support appending the FID after the name entry in the parent directory for the upgrade case, and to skip lvfs_dentry_params after the name entry for old files. On the other hand, we need another patch against b1_8_x (x >= 8) to skip the FID after the name entry in the parent directory, to support downgrading back to b1_8_x (x >= 8).

Comment by Lai Siyao [ 04/Jun/12 ]

Andreas, it looks like we need to change both the 2.x and 1.8 ldiskfs code to keep both backward and forward compatibility for this. Any suggestions?

Comment by Andreas Dilger [ 04/Jun/12 ]

The need to be able to downgrade from 2.x to 1.8 is only in the case of "simple" upgrade to 2.x that has hit problems and needs to be able to downgrade. If the upgrade has been successful, and then the admin (separately) enables the "dir_data" feature using tune2fs on the filesystem, this should be enough to allow storing FIDs in the directory entries. After that point, the filesystem should not be downgraded to 1.8 anymore.

What definitely should be avoided is any automatic enabling of the "dir_data" feature on the filesystem when it is first mounted, since this will cause problems if there are FIDs stored in the directory entries, then the filesystem is downgraded to 1.8, the FID-in-LMA is deleted upon access (reverting to IGIF for that inode), and then the filesystem is upgraded again. That would cause the FID-in-dirent to contain invalid data that OI scrub and e2fsck will not fix yet.

So, my understanding is that if you are sure there is no need to downgrade to 1.8, it should be possible with 2.1+ to use:

tune2fs -O dirdata /dev/{mdtdev}

to enable this feature, and then newly-created files/links will store the FID in the directory. I don't know if we have tested this process or not.
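Whether the feature is already set can be checked from the superblock feature list before and after running the command above. The following is a minimal sketch, not a tested procedure: `has_dirdata` is a hypothetical helper name, and it assumes that `dumpe2fs -h` from the Lustre-patched e2fsprogs prints a "Filesystem features:" line in which the feature appears as `dirdata`.

```shell
# Hypothetical helper: reads `dumpe2fs -h` output on stdin and succeeds
# only if "dirdata" is listed on the "Filesystem features:" line.
# Assumption: the feature is reported under the name "dirdata".
has_dirdata() {
    awk -F: '/^Filesystem features:/ {
        # Split the feature list on whitespace and look for an exact match.
        n = split($2, feats, " ")
        for (i = 1; i <= n; i++)
            if (feats[i] == "dirdata")
                found = 1
    }
    END { exit (found ? 0 : 1) }'
}

# Usage (the device path is a placeholder, as in the command above):
#   dumpe2fs -h /dev/{mdtdev} 2>/dev/null | has_dirdata && echo "dirdata enabled"
```

Matching whole words on the feature line avoids false positives from other lines of the superblock dump that happen to contain the same string.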

Assuming this is OK, it would then be possible in that case to "refresh" the directory with a script to re-link filenames that are expected to live for a long time, assuming they are not in use, something like:

lfs find /mnt/lustre/some/dir -type f | while IFS= read -r F; do
        # rename away and back so the directory entry is re-created
        FTMP="$F.XXXXXX.$$.$RANDOM"
        mv "$F" "$FTMP" && mv "$FTMP" "$F"
done

In a later phase of LFSCK, the FID-in-dirent data will be verified and refreshed if missing, but this is not part of the Phase I deliverable.

This existing behaviour is not a net performance loss in many use cases, since it is prefetching the inode into MDS memory for use as soon as e.g. "ls" does a stat() on the file. There would only be a visible slowdown in the case of e.g. "find" that is not accessing any of the file attributes, and only generating pathnames.

Comment by Lai Siyao [ 04/Jun/12 ]

Andreas, thanks for your detailed explanation! I'll verify later that `tune2fs` can enable the 'dir_data' feature.

Ned, are you fine with the result?

Comment by Lai Siyao [ 05/Jun/12 ]

The command should be `tune2fs -O dirdata /dev/{mdtdev}`, and I've verified that newly created dir will store FID in it.

Comment by Ned Bass [ 05/Jun/12 ]

Lai, enabling dir_data seems like a reasonable course of action. We'll start some local testing and propose it to our sysadmin team. Thanks

Comment by Lai Siyao [ 26/Jun/12 ]

If a 1.8 system is upgraded to 2.x successfully, tune2fs can be used to enable the dirdata feature; new directories will then contain the inode FID in their data.

Comment by Andreas Dilger [ 13/Jul/12 ]

The Lustre Manual should be updated to inform users about how to enable "dirdata" on an upgraded 1.8->2.x MDT, once they are sure that they will not be downgrading the MDS to 1.8 again. This will minimize performance impact on newly created files.

Comment by Peter Jones [ 16/Jul/12 ]

Cliff, could you please create an LUDOC ticket to track Andreas's request?

Comment by Cliff White (Inactive) [ 16/Jul/12 ]

http://jira.whamcloud.com/browse/LUDOC-68 has been created to track the manual changes
