
all mdt_rdpg_* threads busy in osd_ea_fid_get()

Details


    Description

      The load average on the MDS of a classified production 2.1 filesystem jumped to over 400. top showed all of the mdt_rdpg_* threads using 4-7% CPU time each. This may have been due to a pathological workload, but we were wondering whether something like an overly contended lock in ldiskfs is going on here.

      Most of the stacks looked like this:

      __cond_resched
      _cond_resched
      ifind_fast
      iget_locked
      ldiskfs_iget
      ? generic_detach_inode
      osd_iget
      osd_ea_fid_get
      osd_it_ea_rec
      mdd_readpage
      cml_readpage
      mdt_readpage
      ? mdt_unpack_req_pack_rep
      mdt_handle_common
      ? lustre_msg_get_transno
      mdt_readpage_handle
      ptlrpc_main
      child_rip

      Attachments

        Activity

          [LU-1248] all mdt_rdpg_* threads busy in osd_ea_fid_get()
          laisiyao Lai Siyao added a comment -

          Andreas, it looks like we need to change both the 2.x and 1.8 ldiskfs code to keep both backward and forward compatibility for this. Any suggestions?


          yong.fan nasf (Inactive) added a comment -

          There is a format incompatibility between b1_8 and b2_1: in b1_8, lvfs_dentry_params is appended after the name entry in the parent directory, but in b2_1 it is ldiskfs_dentry_param. The two are different and incompatible. So when a system is upgraded from b1_8 to b2_1, a newly created file cannot append ldiskfs_dentry_param (which contains the FID) after its name entry in the parent directory; otherwise the system cannot be downgraded back to b1_8.

          But without the FID appended after the name entry in the parent directory, there is a performance regression (for dir readpage). I do not think that is a good solution, because upgrade is used far more often than downgrade.

          We should make a patch for b2_x to support appending the FID after the name entry in the parent directory for the upgrade case, and to skip lvfs_dentry_params after the name entry for old files. On the other hand, another patch is needed against b1_8_x (x >= 8) to skip the FID after the name entry in the parent directory, to support downgrading back to b1_8_x (x >= 8).
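
          A minimal sketch of that incompatibility, assuming hypothetical field names and sizes (the real on-disk definitions in the two branches differ from this): both branches append a small trailer after the file name in the directory entry, but each branch only understands its own layout, so a 2.x server on an upgraded filesystem cannot write its FID trailer without breaking the ability to downgrade.

          #include <stdint.h>
          #include <stdio.h>

          /* Hypothetical trailer layouts appended after the name in a
           * directory entry; the real field names and sizes differ. */
          struct lvfs_dentry_params_18 {     /* what a b1_8 server writes/expects */
              uint32_t ldp_magic;
              uint32_t ldp_len;
              uint8_t  ldp_data[32];         /* opaque 1.8 payload */
          };

          struct ldiskfs_dentry_param_2x {   /* what a b2_x server writes/expects */
              uint32_t edp_magic;
              uint32_t edp_len;
              uint8_t  edp_data[17];         /* packed FID */
          };

          int main(void)
          {
              /* Different magics, lengths and payloads mean a server from the
               * other branch cannot parse the trailer, which is why a 2.x MDS
               * on an upgraded filesystem skips writing it and falls back to
               * per-inode FID lookups during readdir. */
              printf("1.8 trailer: %zu bytes, 2.x trailer: %zu bytes\n",
                     sizeof(struct lvfs_dentry_params_18),
                     sizeof(struct ldiskfs_dentry_param_2x));
              return 0;
          }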

          laisiyao Lai Siyao added a comment -

          I can't find an easy way to implement this: without a disk format change, there is no way to distinguish a 1.8 directory from a newly created directory. The original design for 1.8 <-> 2.x server interoperability is in bz11826.

          laisiyao Lai Siyao added a comment -

          Yes, on an upgraded system even newly created dirs won't store the fid in dir data; I'll look into whether it's easy to implement this.


          nedbass Ned Bass (Inactive) added a comment -

          Yes, the system was upgraded from 1.8. Will files created after the upgrade store the fid in the dir data?

          There are a few "osd_object_delete() Failed to cleanup: -2" console messages on the MDS from around that time. I didn't find any other errors worth mentioning.

          laisiyao Lai Siyao added a comment -

          Ned, was the system upgraded from 1.8? Normally readdir can get the fid from the dir data, so it doesn't need to read the fid from the EA, but on an upgraded system it has to query each inode. Also, did you see any error messages related to this dir?
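
          A minimal sketch of that fallback, with hypothetical struct and helper names (not the actual osd-ldiskfs code): if the FID was stored beside the name in the directory entry it can be returned directly, otherwise each entry costs an inode lookup plus an EA read, which is where the iget_locked()/ifind_fast() time in the stacks above goes.

          #include <stdio.h>

          /* Hypothetical FID and directory-entry shapes, for illustration only. */
          struct fid { unsigned long long seq; unsigned int oid, ver; };

          struct dir_entry {
              char          name[256];
              unsigned long ino;
              int           fid_in_dirdata;  /* entry created by 2.x with dir data */
              struct fid    fid;             /* valid only if fid_in_dirdata */
          };

          /* Stand-in for the slow path: iget_locked() plus reading the FID EA. */
          static int fid_from_inode_ea(unsigned long ino, struct fid *out)
          {
              /* The real server takes inode locks and reads an xattr here;
               * this just fakes a result so the example runs. */
              out->seq = 0x200000400ULL;
              out->oid = (unsigned int)ino;
              out->ver = 0;
              return 0;
          }

          static int resolve_fid(const struct dir_entry *de, struct fid *out)
          {
              if (de->fid_in_dirdata) {                /* fast path: FID beside the name */
                  *out = de->fid;
                  return 0;
              }
              return fid_from_inode_ea(de->ino, out);  /* slow path seen in the stacks */
          }

          int main(void)
          {
              struct dir_entry upgraded = { "file-from-1.8", 1234, 0, { 0 } };
              struct fid f;

              resolve_fid(&upgraded, &f);
              printf("%s -> [0x%llx:0x%x:0x%x]\n", upgraded.name, f.seq, f.oid, f.ver);
              return 0;
          }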

          laisiyao Lai Siyao added a comment -

          This looks normal from the code. Basically it's an inode scalability problem: these busy threads are contending on inode_lock, and Nick Piggin's inode lock scalability patches are getting merged into kernel 3.x.

          Currently the MDT still uses directory + EA to store metadata, while IAM looks to have better performance and scalability; I'm not clear on why it isn't enabled yet.
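
          As a rough illustration of that kind of single-lock scalability problem (a toy model only, nothing Lustre-specific): when every lookup from every service thread has to pass through one global lock, adding threads mostly adds contention rather than throughput. Build with cc -pthread.

          #include <pthread.h>
          #include <stdio.h>
          #include <time.h>

          /* Toy model: many "service threads" all funnel through one global lock,
           * the way per-entry inode lookups serialize on a single inode_lock in
           * older kernels. Not Lustre code. */

          #define NTHREADS 16
          #define NITERS   200000

          static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
          static unsigned long counter;

          static void *worker(void *arg)
          {
              (void)arg;
              for (int i = 0; i < NITERS; i++) {
                  pthread_mutex_lock(&global_lock);   /* every lookup hits this */
                  counter++;                          /* stand-in for a hash lookup */
                  pthread_mutex_unlock(&global_lock);
              }
              return NULL;
          }

          int main(void)
          {
              pthread_t tid[NTHREADS];
              struct timespec t0, t1;

              clock_gettime(CLOCK_MONOTONIC, &t0);
              for (int i = 0; i < NTHREADS; i++)
                  pthread_create(&tid[i], NULL, worker, NULL);
              for (int i = 0; i < NTHREADS; i++)
                  pthread_join(tid[i], NULL);
              clock_gettime(CLOCK_MONOTONIC, &t1);

              printf("%d threads, %lu lock acquisitions in %.2f seconds\n",
                     NTHREADS, counter,
                     (double)(t1.tv_sec - t0.tv_sec) +
                     (t1.tv_nsec - t0.tv_nsec) / 1e9);
              return 0;
          }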

          pjones Peter Jones added a comment -

          Lai

          Could you please comment on this one?

          Thanks

          Peter


          People

            Assignee: laisiyao Lai Siyao
            Reporter: nedbass Ned Bass (Inactive)
            Votes: 0
            Watchers: 8
