It would be useful to implement TopN list support for file heat.
After an hour-long discussion with Yingjin and Shilong about the possible designs and their pros and cons, we think the following design might work:
1) In order to avoid any performance impact, TopN sorting will not add any work to the I/O path itself.
2) TopN sorting will be implemented using binary heap based on cfs_binheap.
3) A background thread scans the "s_inodes" list of "struct super_block" and adds each inode into the binary heap. Examples of this pattern are wait_sb_inodes(), add_dquot_ref(), drop_pagecache_sb(), etc.
4) The scanning frequency could be a configurable parameter which depends on total inode numbers, heat decay time, resolution ratio (the rate that user-space tool dumps the list), etc.
5) Between handling each inode, cond_resched() (or similar) should be called to avoid excessive CPU usage and interference with normal I/O or file system access.
6) The binary heap is kept in reverse order, i.e. the top node is the file with the least heat.
7) Whenever an inode is added to the binary heap, a new structure (e.g. "struct inode_heat_node") will be allocated for it, with FIDs, inode numbers and other fields properly set.
8) If the size of the binary heap becomes larger than N (i.e. size == N+1) after inserting, the inode at the top of the binary heap (the one with the least heat) will be removed.
9) Optimization: cfs_binheap_copy() will be implemented to copy the binary heap. When a user-space process starts to dump the TopN list, the binary heap will be copied so as to make the dumping quicker. cfs_binheap_copy() essentially copies the element arrays, and thus should be fast.
10) When the user-space tool starts to dump the TopN list, nodes shall be popped from the top of the copied binary heap to generate the TopN list in ascending order. The user-space tool shall reverse the list to descending order if needed.
11) Before dumping the TopN list, the administrator should configure which type of file heat (read, write, read sample or write sample) the sorting should be based on. N should also be a configurable parameter. These things could be configured through a /sys or /proc interface. If memory is not a limitation, we could maintain four TopN lists at the same time, one for each type of file heat.
12) The user-space tool shall read the TopN list and display it appropriately, refreshing the result after each time period (like the top command does).
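The heap logic in steps 2), 6) and 8) above can be sketched in plain userspace C. This is only a model: the real implementation would use the cfs_binheap API, and the struct and function names below are hypothetical.

```c
#include <assert.h>   /* for the demo assertions */

/* Hypothetical stand-in for "struct inode_heat_node" from step 7). */
struct heat_node {
        unsigned long long ihn_inum;  /* inode number (stand-in for FID) */
        long long          ihn_heat;  /* accumulated file heat */
};

struct topn_heap {
        struct heat_node *th_nodes;   /* array with room for N+1 entries */
        int               th_count;
        int               th_max;     /* "N": the TopN list size limit */
};

static void heap_swap(struct heat_node *a, struct heat_node *b)
{
        struct heat_node tmp = *a;
        *a = *b;
        *b = tmp;
}

/* Sift the newly appended element up: smaller heat bubbles toward the
 * root, so the root is always the *coldest* of the kept inodes. */
static void heap_sift_up(struct topn_heap *h, int i)
{
        while (i > 0) {
                int parent = (i - 1) / 2;

                if (h->th_nodes[parent].ihn_heat <= h->th_nodes[i].ihn_heat)
                        break;
                heap_swap(&h->th_nodes[parent], &h->th_nodes[i]);
                i = parent;
        }
}

static void heap_sift_down(struct topn_heap *h, int i)
{
        for (;;) {
                int l = 2 * i + 1, r = 2 * i + 2, min = i;

                if (l < h->th_count &&
                    h->th_nodes[l].ihn_heat < h->th_nodes[min].ihn_heat)
                        min = l;
                if (r < h->th_count &&
                    h->th_nodes[r].ihn_heat < h->th_nodes[min].ihn_heat)
                        min = r;
                if (min == i)
                        break;
                heap_swap(&h->th_nodes[i], &h->th_nodes[min]);
                i = min;
        }
}

/* Step 8): insert, then drop the top (least-hot) node if the heap has
 * grown to N+1 entries. */
void topn_insert(struct topn_heap *h, unsigned long long inum, long long heat)
{
        h->th_nodes[h->th_count].ihn_inum = inum;
        h->th_nodes[h->th_count].ihn_heat = heat;
        heap_sift_up(h, h->th_count++);

        if (h->th_count > h->th_max) {
                h->th_nodes[0] = h->th_nodes[--h->th_count];
                heap_sift_down(h, 0);
        }
}
```

Keeping the coldest entry at the root is what makes the reverse order useful: eviction after an over-full insert is O(log N), and step 10)'s ascending-order dump falls out of repeatedly popping the root.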
I've had reason to look at this ticket again, and have a potential solution that might allow the "hottest inodes" list maintenance to be efficient without scanning all of the inodes each time (which might be in the 100k-1M range). Not all of the inodes on a client can be "hottest", so doing a full inode scan would mostly be wasted effort, and we really want time and effort focused on those inodes that are actually "hot".
Instead of doing a periodic scan of all inodes to find "hot" ones, I think it would be better for the inodes to "add themselves" to the list when they become hot. This can be done easily/efficiently by keeping a "heat threshold" of the minimum file heat needed to be in the "top heat" listing. As each inode is accumulating its own heat (as normal) it will compare whether its new heat is above "heat threshold". If yes, and it is not already on the list, then it will insert itself into the "top heat" list so that it is a candidate for the next time that the "lfs heat_top" list is generated for userspace.
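The threshold check described above can be sketched as follows. All names here are hypothetical; the real code would hook into the existing heat-accumulation path and link the inode into a lock-protected candidate list rather than just counting.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-inode heat state; real code would hang this off the
 * Lustre inode alongside the existing file heat fields. */
struct heat_inode {
        long long hi_heat;
        bool      hi_on_list;   /* already a "top heat" candidate? */
};

struct heat_top {
        long long ht_threshold; /* minimum heat required to join the list */
        int       ht_listed;    /* number of candidate inodes */
};

/* Called from the I/O path after the inode's heat has been updated.
 * The only extra cost for "cold" inodes is a single comparison, which
 * is what keeps this cheaper than a periodic full-inode scan. */
static void heat_top_check(struct heat_top *top, struct heat_inode *ino,
                           long long new_heat)
{
        ino->hi_heat = new_heat;
        if (ino->hi_on_list || new_heat < top->ht_threshold)
                return;          /* fast path: nothing to do */

        ino->hi_on_list = true;  /* real code: insert into candidate list */
        top->ht_listed++;
}
```

A natural refinement (not shown) is to raise ht_threshold to the heat of the coldest listed inode once the list reaches its size limit, so the bar to join rises as hotter files accumulate.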
My thought is that the "top heat" list (and when I write "list" I mean "some data structure that can resolve to a sorted list when it is accessed by userspace") should only be lazily maintained in the kernel, to keep it efficient and low overhead. There are a few possible models for how the list could be maintained:
We don't know the "N" that userspace wants the top list for, but this could either be a parameter from the "lfs heat_top" that informs the kernel of "N" for the future (e.g. average of "N" for each access), or it is a tunable parameter (e.g. 500 by default) that can be changed as needed. We probably also want to have an "lctl get_param" interface to dump the "top heat" list and leave it to userspace to process instead of having to run "lfs heat_top" each time to scrape this information into a monitoring tool.
Depending on the implementation, the thread adding the inode to the "top heat" list could do some maintenance work each time.
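One possible shape for that per-insert maintenance, purely as a sketch: each insertion also re-checks a small batch of existing entries and evicts any whose decayed heat has dropped below the threshold, so the list is pruned gradually instead of by a dedicated scan. The flat array, names, and batch size here are all assumptions, and the sketch omits bounds checking; the real structure would be the binary heap or a linked list.

```c
#include <assert.h>

#define TOP_BATCH 2  /* entries re-checked per insert (assumed tunable) */

struct cand {
        long long c_heat;
};

struct cand_list {
        struct cand c_ents[16]; /* fixed size, no bounds check: sketch only */
        int         c_count;
        int         c_cursor;   /* round-robin maintenance position */
        long long   c_threshold;
};

/* Insert a new candidate and opportunistically prune up to TOP_BATCH
 * stale entries whose heat has decayed below the threshold.  The cost
 * is O(1) per insert, amortizing the cleanup across I/O activity. */
static void cand_insert(struct cand_list *cl, long long heat)
{
        int i;

        cl->c_ents[cl->c_count++].c_heat = heat;

        for (i = 0; i < TOP_BATCH && cl->c_count > 0; i++) {
                if (cl->c_cursor >= cl->c_count)
                        cl->c_cursor = 0;
                if (cl->c_ents[cl->c_cursor].c_heat < cl->c_threshold)
                        /* evict by replacing with the last entry */
                        cl->c_ents[cl->c_cursor] =
                                cl->c_ents[--cl->c_count];
                else
                        cl->c_cursor++;
        }
}
```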
This would likely benefit from the improved LDLM LRU handling (e.g. patch https://review.whamcloud.com/56648 "LU-11509 ldlm: Implement LFRU cache eviction") so that we don't just have strict LRU eviction of inodes from the client cache, otherwise clients may have difficulty accumulating any history of inode usage. It may be that the file heat can also help inform the DLM LFRU handling to make better choices about which inodes should stay in the client LRU, and which locks should be cancelled.