max_requests is the maximum number of active requests at the same time per coordinator, so it has nothing to do with the HSM action queue. By the way, the Lustre documentation for max_requests is correct.
actions is used to dump the action queue.
(Note: it is also accessible through "lctl get_param mdt.lustre-MDT0000.hsm.actions".)
Typical entries in actions look like:
lrh=[type=10680000 len=136 idx=342/13652] fid=[0x200034e87:0x8700:0x0] dfid=[0x200034e87:0x8700:0x0] compound/cookie=0x57d8a6d9/0x57d89f3e action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=342/13655] fid=[0x20002a3ec:0xf909:0x0] dfid=[0x20002a3ec:0xf909:0x0] compound/cookie=0x57d8a6dc/0x57d89f41 action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=SUCCEED data=[]
To feed statistics from this file into Graphite/Grafana, I use the following script, which counts the lines grouped by "action" and "status":
# Requires gawk: gensub() is a gawk extension.
prefix="$CARBON_PREFIX.$MDT.hsm.actions"
lctl get_param "mdt.$MDT.hsm.actions" | awk -v prefix="$prefix" '
BEGIN { now = systime() }
/ status=/ {
    # $7 is the action=... field, $(NF-1) is the status=... field
    action = gensub(/action=(.+)/, "\\1", "g", $7)
    status = gensub(/status=(.+)/, "\\1", "g", $(NF-1))
    arr[action "." status] += 1
}
END { for (s in arr) printf "%s.%s %s %d\n", prefix, s, arr[s], now }'
So I get lines that are valid for Graphite that look like:
srcc.sherlock.lustre.mdt.regal-MDT0000.hsm.actions.ARCHIVE.SUCCEED 3 1474463707
I am sure you do something similar for IML, as you have a graph for HSM, but it also breaks when too many HSM actions are present.
The problem is that once the actions file reaches hundreds of thousands of entries, reading it takes so long that it can no longer be parsed in a timely manner.
I think having counters per "action" and "status" could be very useful, perhaps something like:
actions_stats:
action=archive status=STARTED count=23
action=archive status=SUCCEED count=1234
action=restore status=STARTED count=0
action=restore status=SUCCEED count=1
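Until such counters exist, the suggested layout can be approximated in user space from the existing actions dump. The following is only a sketch: hsm_actions_stats is a hypothetical helper name, actions_stats is the proposed (not existing) file, and the function still has to read the whole dump, so it does not solve the scaling problem described above. It only illustrates the output format. Unlike the proposal, combinations with a count of zero are not printed.

```shell
# Hypothetical helper: reads an hsm.actions dump on stdin and prints
# per-action/per-status counters in the proposed actions_stats layout.
hsm_actions_stats() {
    echo "actions_stats:"
    awk '
        / status=/ {
            action = ""; status = ""
            # Locate the fields by name rather than by position.
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^action=/)      action = tolower(substr($i, 8))
                else if ($i ~ /^status=/) status = substr($i, 8)
            }
            if (action != "" && status != "") count[action " " status]++
        }
        END {
            for (k in count) {
                split(k, part, " ")
                printf "action=%s status=%s count=%d\n", part[1], part[2], count[k]
            }
        }' | sort   # awk iteration order is unspecified, so sort for stable output
}
```

It could be used as: lctl get_param -n "mdt.$MDT.hsm.actions" | hsm_actions_stats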
As for limiting the size of the queue, a tunable like max_actions combined with returning -EFBIG sounds good, but there could still be a problem if restore actions are triggered while the action queue is full. Maybe the best option would be a separate max_actions per action type (archive, restore, remove)?
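To make the concern concrete, here is a toy simulation of per-action limits. It is purely illustrative: hsm_admit is a hypothetical function, the cap argument stands in for the suggested max_actions tunable, and nothing here reflects how the coordinator actually admits requests.

```shell
# Hypothetical admission model: one counter per action type, each with the
# same cap. Reads one action name per line on stdin.
hsm_admit() {
    # $1 = cap applied independently to each action type
    awk -v cap="$1" '
        {
            if (queued[$1] < cap) {
                queued[$1]++
                print $1, "queued"
            } else {
                # This is where the request would fail with -EFBIG.
                print $1, "rejected"
            }
        }'
}
```

For example, printf 'ARCHIVE\nARCHIVE\nARCHIVE\nRESTORE\n' | hsm_admit 2 queues the first two ARCHIVE requests, rejects the third, and still queues the RESTORE. With a single shared limit of 2, the RESTORE would have been rejected as well, which is the situation a per-action limit is meant to avoid.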
Landed for 2.11