Details
- Improvement
- Resolution: Unresolved
- Minor
Description
If applications/tools do frequent llapi_fid2path() lookups, then liblustreapi should cache the results to avoid repeated RPCs to the MDS to generate similar pathnames.
My proposal is not about caching full path components, but more like keeping a cache of the linkEA FID to <parent,name> components of each directory, along with a (potentially precomputed) "match state" for each pathname.
This not only avoids repeated lookup and generation of the pathname for each entry, but also allows immediate decisions on whether a file matches the pathname rule(s) or not. This FID cache would best be accessed directly by the application, but could be used afterward for bulk llapi_fid2path() lookups as well.
In that case, we could do a path2fid lookup of /scratch (or other leaf directory) and pin its FID into the "match" hash table, and pin "ROOT/" (or other path component before the leaf directory) in the "not-match" hash table. There is a third hash table for "waiting" FIDs whose result is not yet known.
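A minimal sketch of the three hash tables, written in Python for brevity rather than as liblustreapi C code. The names `State`, `FidEntry`, `FidCache`, and `pin()` are hypothetical, and the FID strings are placeholders standing in for the results of a path2fid lookup.

```python
from enum import Enum

class State(Enum):
    WAITING = "waiting"
    MATCH = "match"
    NOT_MATCH = "not-match"

class FidEntry:
    """One cached directory: its FID plus <parent, name> from linkEA."""
    def __init__(self, fid, name=None, state=State.WAITING, pinned=False):
        self.fid = fid
        self.name = name        # unknown until the directory itself is scanned
        self.parent = None      # link to the parent pFID entry
        self.children = []      # entries whose state follows this one
        self.state = state
        self.pinned = pinned    # pinned entries never change state

class FidCache:
    def __init__(self):
        # one hash table per state, keyed by FID
        self.tables = {state: {} for state in State}

    def pin(self, fid, name, state):
        entry = FidEntry(fid, name, state, pinned=True)
        self.tables[state][fid] = entry
        return entry

    def lookup(self, fid):
        for table in self.tables.values():
            if fid in table:
                return table[fid]
        return None

# Placeholder FIDs; a real tool would obtain them via a path2fid lookup.
ROOT_FID = "[0x200000007:0x1:0x0]"
SCRATCH_FID = "[0x200000401:0x1:0x0]"

cache = FidCache()
cache.pin(ROOT_FID, "ROOT", State.NOT_MATCH)
cache.pin(SCRATCH_FID, "scratch", State.MATCH)
```

Keeping one table per state makes the "which state is this FID in" check a plain hash lookup, at the cost of moving entries between tables when their state changes.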
As the filesystem is traversed, each directory creates an FID cache entry with FID from LMA and <parent FID, name> from linkEA. It checks for its own FID in the "waiting" hash first. If found, it uses that FID cache entry and fills in the name and pFID, otherwise a new one is allocated.
It then checks whether its pFID is already in the cache, in the "waiting", "match", or "not-match" hash table. If found, it adds its FID entry to the same table, links its cache entry to its parent pFID entry, and marks its cache entry as "match" or "not-match" accordingly. If not found, it allocates a dummy entry for pFID, adds it to the "waiting" hash, and links to that.
If the self-FID entry was found in the "waiting" hash, but the pFID is in the "match" or "not-match" hash, then it needs to recursively update all child FID entries to the correct state/hash. The list of child entries is walked recursively to mark and move the FID entries to the appropriate list.
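The directory handling described above could look roughly like the following Python sketch. All names (`Entry`, `DirCache`, `add_dir()`, the `F_*` FID strings) are hypothetical, and the subtree walk uses an explicit stack instead of recursion, which is a substitution made here to avoid deep call chains.

```python
WAITING, MATCH, NOT_MATCH = "waiting", "match", "not-match"

class Entry:
    def __init__(self, fid, state=WAITING, pinned=False):
        self.fid, self.state, self.pinned = fid, state, pinned
        self.name = None      # filled in from linkEA when the directory is scanned
        self.parent = None
        self.children = []

class DirCache:
    def __init__(self):
        self.tables = {WAITING: {}, MATCH: {}, NOT_MATCH: {}}

    def find(self, fid):
        for table in self.tables.values():
            if fid in table:
                return table[fid]
        return None

    def pin(self, fid, name, state):
        entry = Entry(fid, state, pinned=True)
        entry.name = name
        self.tables[state][fid] = entry
        return entry

    def _move_subtree(self, root, state):
        # Mark and move every entry below root; pinned entries keep their state.
        stack = [root]
        while stack:
            entry = stack.pop()
            if entry.pinned:
                continue
            del self.tables[entry.state][entry.fid]
            entry.state = state
            self.tables[state][entry.fid] = entry
            stack.extend(entry.children)

    def add_dir(self, fid, pfid, name):
        entry = self.find(fid)            # may be a dummy left by a child
        if entry is None:
            entry = Entry(fid)
            self.tables[WAITING][fid] = entry
        entry.name = name
        parent = self.find(pfid)
        if parent is None:                # parent not seen yet: dummy entry
            parent = Entry(pfid)
            self.tables[WAITING][pfid] = parent
        entry.parent = parent
        parent.children.append(entry)
        if parent.state != WAITING:       # parent decided: resolve the subtree
            self._move_subtree(entry, parent.state)
        return entry

cache = DirCache()
cache.pin("ROOT", "", NOT_MATCH)
cache.pin("F_scratch", "scratch", MATCH)
cache.add_dir("F_b", "F_a", "b")          # parent unknown: both wait
cache.add_dir("F_c", "F_b", "c")
cache.add_dir("F_a", "F_scratch", "a")    # resolves F_a, F_b and F_c at once
```

In this example the directories are seen child-first, so `F_b` and `F_c` sit in the "waiting" table until `F_a` is scanned and found to live under the pinned "match" directory.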
A shorter process is done with file entries. If their pFID is found in the cache, then they can immediately be discarded if pFID is in the "not-match" state, or the full pathname can be output if pFID is in the "match" state, by walking each pFID from the cache to generate the pathname.
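As an illustration of the file case, here is a small Python sketch that assumes the directory cache is a plain dict mapping FID to `(pFID, name, state)`. The function names and the `F_*` FID strings are hypothetical, and the generated pathname is relative to the filesystem root.

```python
WAITING, MATCH, NOT_MATCH = "waiting", "match", "not-match"

def dir_path(cache, fid):
    """Build a directory pathname by walking pFID links up to the root."""
    parts = []
    while fid is not None:
        pfid, name, _state = cache[fid]
        parts.append(name)
        fid = pfid
    return "/".join(reversed(parts))

def resolve_file(cache, pfid, name):
    """Return (state, pathname); pathname is None unless the parent matches."""
    if pfid not in cache:
        return WAITING, None              # parent not decided yet
    state = cache[pfid][2]
    if state == MATCH:
        return MATCH, dir_path(cache, pfid) + "/" + name
    return state, None                    # "not-match": discard immediately

# FID -> (parent FID, name, state); the root has no parent and an empty name.
cache = {
    "ROOT": (None, "", NOT_MATCH),
    "F_scratch": ("ROOT", "scratch", MATCH),
    "F_proj": ("F_scratch", "proj", MATCH),
    "F_home": ("ROOT", "home", NOT_MATCH),
}

matched = resolve_file(cache, "F_proj", "data.bin")
skipped = resolve_file(cache, "F_home", "notes.txt")
unknown = resolve_file(cache, "F_other", "x")
```

No pathname is generated for the "not-match" file, which is where most of the saving over a per-file llapi_fid2path() call would come from.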
For files/directories that have pFID in the "waiting" hash, we can do one of two things. Either add the child name/FID (file or directory) to the "waiting" parent pFID entry and resolve it later or, if the "waiting" hash is too large, do direct linkEA lookups for the pFID (recursively if needed) to attach the entry to a known "match" or "not-match" FID entry. All of the waiting child entries can then be resolved immediately (dropping regular files from the parent entry, but keeping directories in the cache).
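The two options for "waiting" parents can be sketched as below. This is a simplified Python illustration under several assumptions: `lookup_linkea` is a hypothetical callback standing in for a direct linkEA read that returns `(pFID, name)`, `max_waiting` is an arbitrary limit, and only the children parked on the looked-up parent are flushed.

```python
WAITING, MATCH, NOT_MATCH = "waiting", "match", "not-match"

def resolve_by_lookup(fid, states, lookup_linkea):
    """Walk up through direct linkEA lookups until a decided ancestor is found."""
    chain = []
    while states.get(fid, WAITING) == WAITING:
        chain.append(fid)
        fid = lookup_linkea(fid)[0]       # (parent FID, name) -> parent FID
        if fid is None:
            raise LookupError("reached the root without a pinned ancestor")
    decided = states[fid]
    for ancestor in chain:                # cache the whole chain for later use
        states[ancestor] = decided
    return decided

def add_child(child_fid, name, is_dir, pfid, states, pending,
              lookup_linkea, max_waiting=1000):
    """Park a child under a waiting parent, or force resolution when full.

    Returns the list of (name, state) pairs resolved by this call.
    """
    pending.setdefault(pfid, []).append((child_fid, name, is_dir))
    # O(n) count for clarity; a real cache would keep a running counter.
    if sum(len(kids) for kids in pending.values()) <= max_waiting:
        return []                         # option 1: resolve later
    state = resolve_by_lookup(pfid, states, lookup_linkea)   # option 2
    resolved = []
    for cfid, cname, cdir in pending.pop(pfid):
        if cdir:
            states[cfid] = state          # directories stay in the cache
        resolved.append((cname, state))   # regular files are dropped
    return resolved

# Placeholder linkEA contents and pinned states for the example.
linkea = {"F_b": ("F_a", "b"), "F_a": ("F_scratch", "a")}
states = {"ROOT": NOT_MATCH, "F_scratch": MATCH}
pending = {}

first = add_child("F_f1", "f1", False, "F_b", states, pending,
                  linkea.__getitem__, max_waiting=1)
second = add_child("F_d1", "d1", True, "F_b", states, pending,
                   linkea.__getitem__, max_waiting=1)
```

Here the first child is parked, and the second one pushes the waiting count over the limit, which triggers the lookups for `F_b` and then `F_a` and resolves both children.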
Since ldiskfs normally allocates files in the same group as the parent, and after the parent is allocated, the on-disk inodes will typically be in "parent, child, child, ..." order. This means that pFID will normally already be in the cache, so regular file inodes can often be resolved immediately for their "matched" or "not-matched" state, and the full pathname generated from cache if needed.
For DNE, the same process applies. Depending on what the initial "match" or "not-match" pathnames are, there may or may not yet be any entries in the "match" or "not-match" hashes for each MDT. In that case, rather than just accumulating all entries into "waiting" until the cache size limit is hit, it makes sense to do some initial direct "pFID" lookups to populate the cache with "match" and "not-match" entries. It makes sense to do this for all entries that are in the REMOTE_PARENT directory, since we know the parent will not be found on this MDT.
Issue Links
- is related to
- LU-11380 IOC_MDC_GETFILEINFO returns garbage stripe info for files with long names but no striping (Closed)