[LU-8626] limit number of items in HSM action queue Created: 19/Sep/16  Updated: 04/Apr/18

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major
Reporter: Andreas Dilger Assignee: Quentin Bouget
Resolution: Unresolved Votes: 2
Labels: hsm

Issue Links:
Related
is related to LU-7988 HSM: high lock contention for cdt_llo... Resolved

 Description   

Several presentations at RUG'16 mentioned that Lustre has poor performance when there are very large numbers of HSM actions outstanding on the coordinator.

Firstly, having a /proc file that exposes the number of entries currently in the HSM action list would allow RBH and monitoring scripts to easily monitor the number of entries.



 Comments   
Comment by Andreas Dilger [ 19/Sep/16 ]

I don't know enough about the details to write a good description of what needs to be done here. There is a max_requests tunable, but that appears to control the number of requests outstanding from the coordinator to a single copytool (cdt_max_requests).

Stephane, could you please add some info here about what variable(s) should be exposed via /proc, and what should be used to limit the size of the queue? What error should be returned by the coordinator if the action queue size limit is exceeded, -EFBIG?

Comment by Bruno Faccini (Inactive) [ 20/Sep/16 ]

Andreas, are you sure that [cdt_]max_requests only concerns requests from the coordinator to a single copytool? I thought (and my reading of the related code seems to confirm) that it is used to account for and limit the whole set of "active" requests being handled by a CDT, regardless of the Agent(s) concerned.

As for the performance of accessing the hsm_actions LLOG content when there is a huge backlog, I think some work has already been done, and not only in LU-7988. I will try to gather this information and add it to this ticket.

Also, as a first step, would it be acceptable to implement a simple limit on the maximum number of requests (active or not) handled by a single CDT, and simply return an error (-EFBIG?) when it is reached?

Comment by Andreas Dilger [ 21/Sep/16 ]

Hopefully Stéphane or Patrick can answer your question here. I was just trying to record the issue raised during RUG. It appears to be related to LU-7988, which is also caused by having a large number of items in the action list.

Comment by Stephane Thiell [ 21/Sep/16 ]

max_requests is the maximum number of simultaneously active requests per coordinator, so it has nothing to do with the size of the HSM action queue. By the way, the Lustre documentation for max_requests is correct.

actions is used to dump the action queue
(Note: it is also accessible through "lctl get_param mdt.lustre-MDT0000.hsm.actions")

Typical entries in actions look like:

lrh=[type=10680000 len=136 idx=342/13652] fid=[0x200034e87:0x8700:0x0] dfid=[0x200034e87:0x8700:0x0] compound/cookie=0x57d8a6d9/0x57d89f3e action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=342/13655] fid=[0x20002a3ec:0xf909:0x0] dfid=[0x20002a3ec:0xf909:0x0] compound/cookie=0x57d8a6dc/0x57d89f41 action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=SUCCEED data=[]

To retrieve statistics from this file and feed them into Graphite/Grafana, I use the following script, which counts the lines grouped by "action" and "status":

prefix="$CARBON_PREFIX.$MDT.hsm.actions"

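# Requires gawk (gensub and systime are GNU extensions).
lctl get_param "mdt.$MDT.hsm.actions" | awk -v prefix="$prefix" '
    BEGIN { now = systime() }
    # Each queue entry contains " status="; the header line does not.
    / status=/ {
        action = gensub(/action=(.+)/, "\\1", "g", $7)        # 7th field: action=...
        status = gensub(/status=(.+)/, "\\1", "g", $(NF-1))   # next-to-last field: status=...
        arr[action "." status] += 1
    }
    END {
        for (s in arr)
            printf "%s.%s %s %d\n", prefix, s, arr[s], now
    }'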

So I get lines that are valid for Graphite that look like:

srcc.sherlock.lustre.mdt.regal-MDT0000.hsm.actions.ARCHIVE.SUCCEED 3 1474463707

I am sure you do something similar for IML, as you have a graph for HSM, but it also breaks when too many HSM actions are present.

The problem is that when the actions file reaches hundreds of thousands of entries, reading it takes so long that it can no longer be parsed in a timely manner...

I think having counters per "action" and "status" could be very useful, perhaps something like:

actions_stats:

action=archive status=STARTED count=23
action=archive status=SUCCEED count=1234
action=restore status=STARTED count=0
action=restore status=SUCCEED count=1
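
A minimal sketch of how a monitoring script might consume such a file, assuming each line carries three key=value fields in the order action, status, count. The actions_stats file is only a proposal in this comment, and the function name actions_stats_to_graphite is hypothetical. It uses only POSIX awk, so it does not need gawk:

```shell
# Convert the proposed (hypothetical) actions_stats format into Graphite lines.
# Usage: actions_stats_to_graphite PREFIX TIMESTAMP < actions_stats
actions_stats_to_graphite() {
    awk -v prefix="$1" -v now="$2" '
        # Only lines with exactly three key=value fields are counters;
        # the "actions_stats:" header and blank lines are skipped.
        NF == 3 {
            split($1, action, "=")
            split($2, status, "=")
            split($3, count, "=")
            printf "%s.%s.%s %d %d\n", prefix, toupper(action[2]), status[2], count[2], now
        }'
}
```

Because the counters would already be aggregated, the cost of this step no longer depends on the number of entries in the action queue.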

As for limiting the size of the queue, a tunable like max_actions and -EFBIG sounds good, but there might still be a problem if restore actions are triggered when the action queue is full. Maybe the best option would be a max_actions limit per action (archive, restore, remove)?...
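
The per-action idea can be modelled in user space as a rough sketch. This is not coordinator code: the variable names, the limit values and the function check_action_limit are all hypothetical, and 27 is simply the numeric value of EFBIG on Linux, used here as a return code:

```shell
# Errno value of EFBIG on Linux; used here only as a return code.
EFBIG=27

# Hypothetical per-action limits (names and values are illustrative).
MAX_ARCHIVE=100000
MAX_RESTORE=10000
MAX_REMOVE=50000

# check_action_limit ACTION QUEUED
# Returns 0 if one more ACTION request fits, EFBIG if that action's
# queue is full, and 2 for an unknown action.
check_action_limit() {
    case "$1" in
        archive) limit=$MAX_ARCHIVE ;;
        restore) limit=$MAX_RESTORE ;;
        remove)  limit=$MAX_REMOVE ;;
        *) return 2 ;;
    esac
    [ "$2" -lt "$limit" ] || return "$EFBIG"
}
```

With separate limits, a full archive queue makes check_action_limit archive fail while check_action_limit restore can still succeed, which is the case a single max_actions value would not handle.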

Comment by Andreas Dilger [ 21/Sep/16 ]

I think we should stick with a single value per file, since this is required when moving stats into /sys/fs/lustre, so something like action_archive_started_count, action_archive_succeed_count, action_restore_started_count, action_restore_succeed_count.
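
A sketch of what collection could look like with one value per file. It assumes the parameter names suggested above, which are a proposal and may not match what is eventually implemented, and an MDT variable set as in the earlier script. The helpers read_counter and emit_counters are hypothetical:

```shell
# Read a single counter; -n makes lctl print only the value.
# The parameter names passed in are the ones proposed above (assumption).
read_counter() {
    lctl get_param -n "mdt.$MDT.hsm.$1"
}

# emit_counters PREFIX TIMESTAMP
# Prints one Graphite line per counter that could be read.
emit_counters() {
    prefix=$1
    now=$2
    for action in archive restore; do
        for status in started succeed; do
            name="action_${action}_${status}_count"
            # Skip counters that do not exist on this system.
            value=$(read_counter "$name") || continue
            printf '%s.%s %s %s\n' "$prefix" "$name" "$value" "$now"
        done
    done
}
```

Each read returns a single number, so no parsing of the full action list is needed.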

Comment by Gerrit Updater [ 24/Aug/17 ]

Quentin Bouget (quentin.bouget@cea.fr) uploaded a new patch: https://review.whamcloud.com/28677
Subject: LU-8626 hsm: count the number of started requests of each type
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4e84ba8aa669554b2d1b77459ebe79770aa4ad37

Comment by Gerrit Updater [ 01/Dec/17 ]

Quentin Bouget (quentin.bouget@cea.fr) uploaded a new patch: https://review.whamcloud.com/30336
Subject: LU-8626 hsm: expose the number of active hsm requests per type
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 294a0e4ff77fb09ac643b5e15af224027ded4aee

Comment by Gerrit Updater [ 17/Dec/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28677/
Subject: LU-8626 hsm: count the number of started requests of each type
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 973759d1ff3bbcb217754bd9942fdf670dec2d96

Comment by Gerrit Updater [ 04/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30336/
Subject: LU-8626 hsm: expose the number of active hsm requests per type
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 42e40555f250b83730d233dc5e22fd1f9396ccfe

Comment by Peter Jones [ 04/Jan/18 ]

Landed for 2.11

Comment by Quentin Bouget [ 04/Jan/18 ]

Hi Peter,

I am not sure this issue should be marked as resolved yet.
The patches that landed only provide information about how many requests the coordinator is currently handling; there are no built-in limits yet.

Comment by Peter Jones [ 04/Jan/18 ]

ok

Comment by Thomas Leibovici [ 03/Apr/18 ]

About the already-landed https://review.whamcloud.com/30336/:

Given that the implemented counters refer to the contents of the "active_requests" list, they should rather be named "active_archive_count", "active_restore_count", etc. instead of "archive_count", etc., to be more explicit and avoid any confusion with the contents of hsm/actions, which contains all requested actions.

This change should be made before 2.11 is released, to avoid renaming /proc entries after the feature has shipped.

Comment by Andreas Dilger [ 04/Apr/18 ]

2.11 has already been released.

Generated at Sat Feb 10 02:19:11 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.