[LU-1248] all mdt_rdpg_* threads busy in osd_ea_fid_get() Created: 21/Mar/12 Updated: 01/Feb/14 Resolved: 26/Jun/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Ned Bass | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | |||
| Severity: | 3 |
| Rank (Obsolete): | 6428 |
| Description |
|
The load average on the MDS for a classified production 2.1 filesystem jumped to over 400. Top showed mdt_rdpg_* threads all using 4-7% CPU time. This may have been due to a pathological workload, but we were wondering if there's something like an overly contended lock in ldiskfs going on here. Most of the stacks looked like this: __cond_resched |
| Comments |
| Comment by Peter Jones [ 22/Mar/12 ] |
|
Lai, could you please comment on this one? Thanks, Peter |
| Comment by Lai Siyao [ 01/Apr/12 ] |
|
This looks normal from the code. Basically it's an inode scalability problem: these busy threads are contending on inode_lock, and Nick Piggin's inode lock scalability patches are being merged into kernel 3.x. Currently the MDT still uses directory+ea to store metadata; IAM looks to have better performance and scalability, but I'm not clear why it's not enabled yet. |
| Comment by Lai Siyao [ 05/Apr/12 ] |
|
Ned, was the system upgraded from 1.8? Normally readdir can get the FID from the dir data and then doesn't need to read the FID from the EA, but on an upgraded system it needs to query each inode. Besides, did you see any error messages related to this dir? |
| Comment by Ned Bass [ 05/Apr/12 ] |
|
Yes the system was upgraded from 1.8. Will files created after the upgrade store the fid in the dir data? There are a few "osd_object_delete() Failed to cleanup: -2" console messages on the MDS from around that time. I don't find any other errors worth mentioning. |
| Comment by Lai Siyao [ 06/Apr/12 ] |
|
Yes, on an upgraded system even a newly created dir won't store the FID in dir data; I'll check whether this is easy to implement. |
| Comment by Lai Siyao [ 06/Apr/12 ] |
|
I don't find an easy way to implement this: without a change to the disk format, there's no way to distinguish a 1.8 directory from a newly created directory. The original design for the 1.8 <-> 2.x interoperability server is in bz11826. |
| Comment by nasf (Inactive) [ 04/Jun/12 ] |
|
There is an incompatible format between b1_8 and b2_1: in b1_8, lvfs_dentry_params is appended after the name entry in the parent directory, but in b2_1 it is ldiskfs_dentry_param. They are different and incompatible. So when a system is upgraded from b1_8 to b2_1, a newly created file cannot append ldiskfs_dentry_param (which contains the FID) after its name entry in the parent directory; otherwise, the system cannot be downgraded back to b1_8. But without the FID appended after the name entry in the parent directory, there is a performance regression for dir readpage. I do not think this is a good solution, because upgrade is used more often than downgrade. We should make a patch in b2_x to support appending the FID after the name entry in the parent directory for the upgrade case, and to skip lvfs_dentry_params after the name entry for old files. On the other hand, we need another patch against b1_8_x (x >= 8) to skip the FID after the name entry in the parent directory, to support downgrading back to b1_8_x (x >= 8). |
| Comment by Lai Siyao [ 04/Jun/12 ] |
|
Andreas, it looks like we need to change both the 2.x and 1.8 ldiskfs code to keep both backward and forward compatibility here. Any suggestions? |
| Comment by Andreas Dilger [ 04/Jun/12 ] |
|
The ability to downgrade from 2.x to 1.8 is only needed in the case of a "simple" upgrade to 2.x that has hit problems. If the upgrade has been successful, and the admin then (separately) enables the "dir_data" feature using tune2fs on the filesystem, this should be enough to allow storing FIDs in the directory entries. After that point, the filesystem should not be downgraded to 1.8 anymore.

What definitely should be avoided is any automatic enabling of the "dir_data" feature when the filesystem is first mounted. This would cause problems if FIDs are stored in the directory entries, the filesystem is then downgraded to 1.8, the FID-in-LMA is deleted upon access (reverting to IGIF for that inode), and the filesystem is then upgraded again. That would leave the FID-in-dirent containing invalid data that OI scrub and e2fsck will not fix yet.

So, my understanding is that if you are sure there is no need to downgrade to 1.8, it should be possible with 2.1+ to use:

    tune2fs -O dirdata /dev/{mdtdev}

to enable this feature, and then newly-created files/links will store the FID in the directory. I don't know if we have tested this process or not. Assuming this is OK, it would then be possible to "refresh" the directory with a script that re-links filenames that are expected to live for a long time, assuming they are not in use, something like:

    # -r and IFS= keep backslashes and leading whitespace in names intact
    lfs find /mnt/lustre/some/dir -type f | while IFS= read -r F; do
        FTMP="$F.XXXXXX.$$.$RANDOM"
        mv "$F" "$FTMP" && mv "$FTMP" "$F"
    done

In a later phase of LFSCK, the FID-in-dirent data will be verified and refreshed if missing, but this is not part of the Phase I deliverable. The existing behaviour is not a net performance loss in many use cases, since it prefetches the inode into MDS memory for use as soon as e.g. "ls" does a stat() on the file. There would only be a visible slowdown in the case of e.g. "find", which does not access any of the file attributes and only generates pathnames. |
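A slightly more defensive variant of the re-link loop in the comment above might look like the sketch below. It is not a tested procedure: the helper names `relink_one` and `relink_all` are hypothetical, it still assumes the files are not in use, and it relies on the behaviour described in the comment (that renaming a file rewrites its directory entry once dirdata is enabled). It skips anything that is not a regular file, refuses to overwrite an existing file that happens to carry the temporary name, and reports a file that could not be renamed back.

```shell
# Sketch only: relink_one and relink_all are hypothetical helper names.

# Rename one file to a temporary name and back again so that its
# directory entry is rewritten. Returns non-zero on any failure.
relink_one() {
    rl_src=$1
    rl_tmp="$rl_src.relink.$$"
    # Refuse to clobber an existing file that happens to have the temp name.
    if [ -e "$rl_tmp" ]; then
        echo "relink: $rl_tmp already exists, skipping $rl_src" >&2
        return 1
    fi
    mv -- "$rl_src" "$rl_tmp" || return 1
    if ! mv -- "$rl_tmp" "$rl_src"; then
        echo "relink: could not restore $rl_src (left as $rl_tmp)" >&2
        return 1
    fi
}

# Read one path per line on stdin and re-link each regular file.
# Returns non-zero if any file failed.
relink_all() {
    rl_rc=0
    while IFS= read -r rl_path; do
        [ -f "$rl_path" ] || continue
        relink_one "$rl_path" || rl_rc=1
    done
    return "$rl_rc"
}
```

With these helpers defined, the loop from the comment would become `lfs find /mnt/lustre/some/dir -type f | relink_all`. Paths containing newlines are still not handled, since the input is line-delimited.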
| Comment by Lai Siyao [ 04/Jun/12 ] |
|
Andreas, thanks for your detailed explanation! I'll verify later that `tune2fs` can enable the 'dir_data' feature. Ned, are you fine with the result? |
| Comment by Lai Siyao [ 05/Jun/12 ] |
|
The command should be `tune2fs -O dirdata /dev/{mdtdev}`, and I've verified that a newly created dir will store the FID in it. |
| Comment by Ned Bass [ 05/Jun/12 ] |
|
Lai, enabling dir_data seems like a reasonable course of action. We'll start some local testing and propose it to our sysadmin team. Thanks |
| Comment by Lai Siyao [ 26/Jun/12 ] |
|
If a 1.8 system is upgraded to 2.x successfully, tune2fs can be used to enable the dirdata feature; a new directory will then contain the inode FID in its data. |
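Before and after running `tune2fs -O dirdata`, it may be useful to confirm whether the feature flag is actually set. The sketch below assumes that `tune2fs -l` prints a `Filesystem features:` line and that the Lustre-patched e2fsprogs lists the flag there as `dirdata`; the helper name `has_fs_feature` is hypothetical.

```shell
# Sketch only: has_fs_feature is a hypothetical helper name.
# Reads `tune2fs -l` style output on stdin and succeeds if the named
# feature appears as a whole word on the "Filesystem features:" line.
has_fs_feature() {
    hf_want=$1
    # Keep only the feature list, dropping the label and its padding.
    hf_list=$(sed -n 's/^Filesystem features:[[:space:]]*//p')
    for hf_feat in $hf_list; do
        [ "$hf_feat" = "$hf_want" ] && return 0
    done
    return 1
}
```

A possible use, with the same device placeholder as in the thread: `tune2fs -l /dev/{mdtdev} | has_fs_feature dirdata && echo "dirdata is enabled"`.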
| Comment by Andreas Dilger [ 13/Jul/12 ] |
|
The Lustre Manual should be updated to inform users about how to enable "dirdata" on an upgraded 1.8->2.x MDT, once they are sure that they will not be downgrading the MDS to 1.8 again. This will minimize performance impact on newly created files. |
| Comment by Peter Jones [ 16/Jul/12 ] |
|
Cliff, could you please create an LUDOC ticket to track Andreas's request? |
| Comment by Cliff White (Inactive) [ 16/Jul/12 ] |
|
http://jira.whamcloud.com/browse/LUDOC-68 has been created to track the manual changes |