Details
Bug
Resolution: Fixed
Minor
None
Lustre 2.1.0
None
3
6428
Description
The load average on the MDS for a classified production 2.1 filesystem jumped to over 400. Top showed mdt_rdpg_* threads all using 4-7% CPU time. This may have been due to a pathological workload, but we were wondering if there's something like an overly contended lock in ldiskfs going on here.
Most of the stacks looked like this:
__cond_resched
_cond_resched
ifind_fast
iget_locked
ldiskfs_iget
? generic_detach_inode
osd_iget
osd_ea_fid_get
osd_it_ea_rec
mdd_readpage
cml_readpage
mdt_readpage
? mdt_unpack_req_pack_rep
mdt_handle_common
? lustre_msg_get_transno
mdt_readpage_handle
ptlrpc_main
child_rip
Downgrading from 2.x to 1.8 only needs to be supported for a "simple" upgrade to 2.x that has hit problems. If the upgrade has been successful, and the admin then (separately) enables the "dir_data" feature using tune2fs on the filesystem, this should be enough to allow storing FIDs in the directory entries. After that point, the filesystem should not be downgraded to 1.8 anymore.
What definitely should be avoided is any automatic enabling of the "dir_data" feature when the filesystem is first mounted. That would cause problems in the following sequence: FIDs are stored in the directory entries, the filesystem is downgraded to 1.8, the FID-in-LMA is deleted upon access (reverting to IGIF for that inode), and the filesystem is then upgraded again. The FID-in-dirent would then contain invalid data that OI scrub and e2fsck will not fix yet.
So, my understanding is that if you are sure there is no need to downgrade to 1.8, it should be possible with 2.1+ to use tune2fs, as described above, to enable this feature, and then newly-created files/links will store the FID in the directory. I don't know if we have tested this process or not.
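The exact command did not survive in this comment, so the following is only a sketch of what the tune2fs step might look like. It assumes the Lustre-patched e2fsprogs, where the feature flag is spelled `dirdata`; the device path `/dev/mdtdev` and the helper name `enable_dirdata_cmd` are hypothetical. The helper only prints the command so it can be reviewed before anyone runs it against a real MDT.

```shell
#!/bin/sh
# Sketch only: build (not run) the tune2fs invocation that would enable
# the dirdata feature on an MDT backing device.
# Assumptions: Lustre-patched e2fsprogs, flag spelled "dirdata";
# the device path passed in is a placeholder chosen by the admin.
enable_dirdata_cmd() {
    mdt_dev="$1"
    printf 'tune2fs -O dirdata %s\n' "$mdt_dev"
}

# Print the command for review; /dev/mdtdev is a hypothetical default.
enable_dirdata_cmd "${MDT_DEV:-/dev/mdtdev}"
```

As noted above, this should only be done once there is no remaining need to downgrade to 1.8.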
Assuming this is OK, it would then be possible in that case to "refresh" the directory with a script to re-link filenames that are expected to live for a long time, assuming they are not in use, something like:
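The script itself is missing from this comment, so the following is a hedged sketch of the re-link idea rather than the original. It creates a second hard link, removes the old name, and renames the new link back, so the directory entry is recreated. The function name `relink_file` and the temporary suffix are hypothetical, and whether the recreated entry actually carries the FID on a given Lustre version is untested here. The old name is unlinked explicitly because rename() does nothing when source and target are already links to the same inode; this leaves a brief window in which the name does not exist, so it is only suitable for files that are not in use.

```shell
#!/bin/sh
# relink_file PATH: recreate the directory entry for a regular file.
# Sketch only; the ".relink.$$" temporary name is an arbitrary choice.
relink_file() {
    f="$1"
    tmp="$f.relink.$$"
    # Only handle regular files, and never follow or replace symlinks.
    [ -f "$f" ] && [ ! -L "$f" ] || return 1
    # Refuse to clobber an existing temporary name.
    [ -e "$tmp" ] && return 1
    # New hard link: a freshly created directory entry for the same inode.
    ln "$f" "$tmp" || return 1
    # Remove the old entry first, since renaming one link onto another
    # link of the same inode would be a no-op.
    rm -f "$f" || { rm -f "$tmp"; return 1; }
    # Move the new entry back to the original name.
    mv "$tmp" "$f"
}

# Hypothetical usage over long-lived files (names containing newlines
# are not handled by this loop):
#   find /mnt/lustre/some/dir -type f -mtime +30 |
#       while IFS= read -r f; do relink_file "$f"; done
```

The inode, and therefore the file data and attributes, is unchanged; only the name entry in the parent directory is rewritten.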
In a later phase of LFSCK, the FID-in-dirent data will be verified and refreshed if missing, but this is not part of the Phase I deliverable.
This existing behaviour is not a net performance loss in many use cases, since it is prefetching the inode into MDS memory for use as soon as e.g. "ls" does a stat() on the file. There would only be a visible slowdown in the case of e.g. "find" that is not accessing any of the file attributes, and only generating pathnames.