[LU-10968] add coordinator bypass upcalls for HSM archive and remove Created: 30/Apr/18 Updated: 22/Mar/22 |
|
| Status: | Reopened |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | John Hammond | Assignee: | Nikitas Angelinas |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | None | ||
| Issue Links: |
|
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
HSM restore performance is often degraded by the presence of too many archive requests in the CDT llog or CT pipeline. Offer upcalls for archive and remove, invoked on the MDT, which allow bypassing the coordinator and better scheduling of archives and removes.

From the commit message on https://review.whamcloud.com/32212:

This change provides an HSM upcall facility to optionally bypass the
HSM coordinator (CDT) for archive and remove requests. (Release
already bypasses the CDT, and restore bypass is not supported by this
change.)
Requires updated MDT and a worker client. OSTs, compute nodes, and
copytool nodes need not be updated.
lctl set_param mdt.*.hsm.upcall_mask='ARCHIVE' # or 'ARCHIVE RESTORE', 'RESTORE', ''
lctl set_param mdt.*.hsm.upcall_path=/.../lhsm_mdt_upcall # Full path.
HSM requests whose action is set in the upcall_mask parameter will be
diverted from the coordinator and handled by the executable specified
by upcall_path. By default upcall_mask is empty which gives the normal
HSM coordinator handling behavior.
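For example, to revert to the default (coordinator) handling and confirm the settings (upcall_mask and upcall_path are the parameters introduced by this change, so they only exist with it applied):
lctl set_param mdt.*.hsm.upcall_mask=''        # back to normal CDT handling
lctl get_param mdt.*.hsm.upcall_mask mdt.*.hsm.upcall_path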
The upcall (to be supplied by the site) will be invoked by the MDT RPC
handler (it runs on the MDT as a root-privileged process with an empty
environment). Invocation will be of the form:
/.../lhsm_mdt_upcall ACTION FSNAME ARCHIVE_ID FLAGS DATA FID...
with one or more FIDs, each as a separate argument. The upcall_path
parameter can be set to the path of an arbitrary (site-supplied)
executable as long as it does the right thing. The RPC handler will
block until the upcall completes, so for safety and liveness the
upcall should not access Lustre. Instead the upcall should put the
request in an off-Lustre persistent queue or database and then exit.
The actions could be submitted to a job scheduler, but care must be
taken to ensure that this does not entail any Lustre operations. See
comments in mdt_hsm_upcall().
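For illustration only, a minimal site-supplied upcall could simply append its arguments to an off-Lustre spool file and exit (the spool location and the flock-based locking below are assumptions of this sketch, not part of this change):
#!/bin/bash
# Hypothetical lhsm_mdt_upcall sketch. The MDT invokes it as:
#   lhsm_mdt_upcall ACTION FSNAME ARCHIVE_ID FLAGS DATA FID...
# It must not touch Lustre; it only records the request and exits.
QUEUE=/var/spool/lhsm/queue        # assumed off-Lustre spool file
exec 9>>"$QUEUE" || exit 1
flock 9                            # serialize concurrent upcalls
echo "$@" >&9                      # one request per line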
A separate process (called a "worker" and also to be supplied by the
site) should read from that persistent queue and perform the
actions. The worker process does what a copytool does, but instead of
listening on a KUC pipe for actions it reads from the queue. Like
existing copytools it must interact with Lustre and with the
archive. The main difference (on the Lustre side) is that it uses
slightly modified ioctls to handle the upcalled requests. To make it
easier I added a new command ('lfs hsm_upcall') that manages the
Lustre half of an upcalled action and a sample script
lustre/utils/lhsm_worker_posix that handles the archive side (assuming
a lhsmtool_posix archive layout). The idea is that 'lfs hsm_upcall'
knows about Lustre and lhsm_worker_posix knows about the
archive. Running
lfs hsm_upcall lhsm_worker_posix ARCHIVE FSNAME ARCHIVE_ID FLAGS DATA FID...
will do the following for each FID:
1. Open the Lustre file to be archived specified by FID.
2. Send an RPC (which bypasses the CDT) to the MDT to say that ARCHIVE is starting.
3. Invoke
lhsm_worker_posix ACTION FSNAME ARCHIVE_ID FLAGS DATA FID
with stdin opened to the file to be archived.
4. Wait for lhsm_worker_posix and send an ARCHIVE completion RPC
(with the exit status of lhsm_worker_posix) to the MDT.
5. Close the file to be archived.
Remove is handled similarly but without the open or close.
See comments in lustre/utils/lhsm_worker_posix and lfs_hsm_upcall().
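For illustration, a worker could drain such a spool with a loop along the following lines ('lfs hsm_upcall' and lhsm_worker_posix are the pieces added by this change; the one-request-per-line spool format and its path are assumptions of this sketch):
#!/bin/bash
# Hypothetical worker sketch: run each spooled request through
# 'lfs hsm_upcall', which drives lhsm_worker_posix per FID.
QUEUE=/var/spool/lhsm/queue        # same assumed spool file as above
while read -r action fsname archive_id flags data fids; do
        # Each line holds: ACTION FSNAME ARCHIVE_ID FLAGS DATA FID...
        # $fids is intentionally unquoted so multiple FIDs split into
        # separate arguments.
        lfs hsm_upcall lhsm_worker_posix \
                "$action" "$fsname" "$archive_id" "$flags" "$data" $fids ||
                echo "hsm_upcall failed: $action $fids" >&2
done < "$QUEUE"
# NB: a real worker would also have to dequeue or acknowledge completed
# entries; this sketch only reads the spool.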
This may seem like a lot of moving parts but internally HSM has a lot
of parts and this was the cleanest way to decompose it that would
offer the flexibility needed.
|
| Comments |
| Comment by Gerrit Updater [ 30/Apr/18 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/32212 |
| Comment by Nathan Rutman [ 06/Sep/18 ] |
|
A couple of thoughts:
|
| Comment by Nathan Rutman [ 26/Sep/18 ] |
|
Discussion from LAD18 Dev Day.

#1. upcall vs netlink. (Cray/@BenEvans has a netlink prototype.) Upcalls make it easy to implement standalone userspace consumers, but they require a relatively heavyweight process spawn for each request.

#2. lfs hsm_upcall vs llapi_hsmaction_start|end. @JohnHammond's lfs hsm_upcall takes as an argument a script to call back for each file. This again generates a process spawn/context switch, which @NathanRutman doesn't like. Instead, two llapi calls could be called directly by the copytool, one before (llapi_hsmaction_start) and one after (llapi_hsmaction_end) each file. These calls would do all the Lustre housekeeping bits: locks, layouts, ioctls. I think John agreed this is ok.

#3. exporting restore vs retaining the coordinator. John's patch as it stands only exports archive and cancel, not restore. Restore retains its current form, using the coordinator and Lustre-registered copytools. The reason for this is MDT failure: the MDT re-enqueues the layout lock for any incomplete restores, in order to prevent the clients from obtaining the layout lock when they reconnect. We discussed 3 options, all sadly uninformed by any actual knowledge of how this works right now.

#3a. The MDS should not grant the layout lock to a client if the request comes with a "restore intent", and should enqueue the lock itself instead. It is not clear that the client sends such an intent (does it?). The MDS can't block all layout access because we expect other things to break if we do this (such as?).

#3b. Pin the client's RESTORE RPC for replay. When the MDS restarts, the client re-sends the RPC, which triggers the MDS to re-enqueue the layout lock. The problem is that the client is also replaying the layout lock request, and we're not sure which one is processed first, and also not sure this won't change in the future.

#3c. The client should re-check that the layout is not released upon being granted the layout lock. If it is still released, then loop back to the send-RESTORE-RPC step. Nobody is quite sure how this check actually happens today, but theoretically if it happens on the client once, it should be possible to check again. Requires a bit of research as to how.

Of these three options, it seems to @NathanRutman and @QuintenBouget that 3c is the most bulletproof.

#4. Combining Ben's and John's efforts. In addition to #1 above, @BenEvans has a similar prototype patch of his own. Obviously we need to consolidate down to a single implementation. I suggest Ben attach his patch here as well, and everyone comment on John's and Ben's so we can figure out how to converge.

#5. Just for fun. Nathan brought up the idea that if the archive were a real layout, it could be the second mirror, and we could possibly use the automatic reconstruction code being developed for EC layouts to populate the primary mirror. It is not quite that easy, because the reconstruction happens on the client, which would then have to ask the MDS to restore. But John one-upped him by suggesting that the archive mirror could just be a single object on a specialized archive OST. This makes its HSM status irrelevant to client code; the client just treats it as a one-stripe mirror and can read from it directly. The archive OST would have upcalls to connect to copytools. (Hmm, maybe it is a regular ldiskfs OST, except that it has a filter upcall to locally restore any missing objects to a temporary disk file.)
|
| Comment by Ben Evans (Inactive) [ 26/Sep/18 ] |
|
3a doesn't work: there may be multiple restore requests for the same file from multiple clients, so the MDS needs to keep the lock for restore.
3b looks workable, but any coordinator needs to be able to check for duplicate requests, and the MDS needs to take the layout lock only on the first restore request.
3c: an implicit restore returns -ENODATA if the layout lock gets released and there is no data. You'd have to retry the restore. |
| Comment by Gerrit Updater [ 26/Sep/18 ] |
|
Ben Evans (bevans@cray.com) uploaded a new patch: https://review.whamcloud.com/33245 |
| Comment by Quentin Bouget [ 27/Sep/18 ] |
|
More on 3c: The code we care about is located in lustre/llite/vvp_io.c::vvp_io_fini():

if (io->ci_restore_needed) {
        /* file was detected release, we need to restore it
         * before finishing the io
         */
        rc = ll_layout_restore(inode, 0, OBD_OBJECT_EOF);
        /* if restore registration failed, no restart,
         * we will return -ENODATA */
        /* The layout will change after restore, so we need to
         * block on layout lock held by the MDT
         * as MDT will not send new layout in lvb (see LU-3124)
         * we have to explicitly fetch it, all this will be done
         * by ll_layout_refresh().
         * Even if ll_layout_restore() returns zero, it doesn't mean
         * that restore has been successful. Therefore it sets
         * ci_verify_layout so that it will check layout at the end
         * of this function.
         */
        if (rc) {
                io->ci_restore_needed = 1;
                io->ci_need_restart = 0;
                io->ci_verify_layout = 0;
                io->ci_result = rc;
                GOTO(out, rc);
        }

        io->ci_restore_needed = 0;

        /* Even if ll_layout_restore() returns zero, it doesn't mean
         * that restore has been successful. Therefore it should verify
         * if there was layout change and restart I/O correspondingly.
         */
        ll_layout_refresh(inode, &gen);
        io->ci_need_restart = vio->vui_layout_gen != gen;
        if (io->ci_need_restart) {
                CDEBUG(D_VFSTRACE,
                       DFID" layout changed from %d to %d.\n",
                       PFID(lu_object_fid(&obj->co_lu)),
                       vio->vui_layout_gen, gen);
                /* today successful restore is the only possible
                 * case */
                /* restore was done, clear restoring state */
                ll_file_clear_flag(ll_i2info(vvp_object_inode(obj)),
                                   LLIF_FILE_RESTORING);
        }
        GOTO(out, 0);
}
ll_layout_restore() sends the RESTORE RPC and ll_layout_refresh() blocks waiting for the layout_lock to be released by the MDS. The comments in the code clearly suggest that the client has all the information it needs to detect that the RESTORE operation did not complete yet or that it failed (the generation number of the layout is not modified). IMO adding an "else goto retry_restore;" to the last if block and a "retry_restore" label above the call to ll_layout_restore() should be enough to make the client resilient to MDS failovers without needing the MDS to store RESTORE requests in an llog. |
| Comment by Nathan Rutman [ 02/Oct/18 ] |
|
My Gerrit/git skills are long gone, but I agree - something like this seems right. The only question is whether we should give up after, say, 10 tries, and then explicitly return -ENODATA.

diff --git a/lustre/llite/vvp_io.c b/lustre/llite/vvp_io.c
index 793ec00de1..39e3d0ac29 100644
--- a/lustre/llite/vvp_io.c
+++ b/lustre/llite/vvp_io.c
@@ -311,7 +311,7 @@ static void vvp_io_fini(const struct lu_env *env, const struct cl_io_slice *ios)
                vio->vui_layout_gen, io->ci_need_write_intent,
                io->ci_restore_needed);
 
-       if (io->ci_restore_needed) {
+       while (io->ci_restore_needed) {
                /* file was detected release, we need to restore it
                 * before finishing the io
                 */
@@ -329,15 +329,12 @@ static void vvp_io_fini(const struct lu_env *env, const struct cl_io_slice *ios)
                 * of this function.
                 */
                if (rc) {
-                       io->ci_restore_needed = 1;
                        io->ci_need_restart = 0;
                        io->ci_verify_layout = 0;
                        io->ci_result = rc;
                        GOTO(out, rc);
                }
 
-               io->ci_restore_needed = 0;
-
                /* Even if ll_layout_restore() returns zero, it doesn't mean
                 * that restore has been successful. Therefore it should verify
                 * if there was layout change and restart I/O correspondingly.
@@ -351,11 +348,13 @@ static void vvp_io_fini(const struct lu_env *env, const struct cl_io_slice *ios)
                               vio->vui_layout_gen, gen);
                        /* today successful restore is the only possible
                         * case */
+                       io->ci_restore_needed = 0;
+
                        /* restore was done, clear restoring state */
                        ll_file_clear_flag(ll_i2info(vvp_object_inode(obj)),
                                           LLIF_FILE_RESTORING);
+                       GOTO(out, 0);
                }
-               GOTO(out, 0);
        }
 
 /**
|
| Comment by Quentin Bouget [ 02/Oct/18 ] |
|
> Only question is if we should give up after say 10 tries I would say it is up to the MDS to mark the file as "lost" and maybe the decision should come from the copytool. |
| Comment by Nathan Rutman [ 02/Oct/18 ] |
|
One more point to consider: #6. Copytool connection. My assumption was that externalizing the coordinator implied that we were no longer going to use Lustre's copytool registration and request-assignment functions, but @BenEvans points out that is not mandatory. His patch actually leaves the copytool registration functionality in place (for those who want to use it - not mandatory), and thereby allows the external coordinator to pass copytool requests back through Lustre. This allows for customizing the coordinator (e.g. reprioritizing requests) without affecting anything else. (He is also providing a switch to select between the internal and external coordinator.) |
| Comment by John Hammond [ 03/Oct/18 ] |
|
Nathan, Re your one line change to vvp_io_fini(), I think there still needs to be some added check in the loop to see if the file layout is not-released. That's an issue that we have now. If the layout lock is granted but the file is still released then the client proceeds and the IO returns -ENODATA. Due to some questionable layering choices only the lov layer understands whether a layout is released or not. But only the llite layer can ask for it to be restored. |
| Comment by Ben Evans (Inactive) [ 03/Oct/18 ] |
|
I've added the restore retry to my patch and also added a COORDINATOR variable to sanity-hsm, so we should be able to test external vs. internal coordinators with the flip of a switch. Setting the mdt.*.hsm_control parameter to 'external' makes the coordinator external rather than internal. Archive, restore, release, etc. all work either through lfs commands or through the implicit-restore path. |
| Comment by Nathan Rutman [ 03/Oct/18 ] |
If the file is not-released but MDS decides to release layout lock, wouldn't ENODATA be the correct response? Say archive file is "lost" - it remains released, and MDS should probably return an error in ll_layout_restore. I think Quentin is right - let the MDS decide, and let the client return an error rather than retry. Checking for 'released' bit isn't necessarily the right thing for the client to do. |
| Comment by Ben Evans (Inactive) [ 13/Feb/19 ] |
|
I'm actively working this, and will update my patch soon, so I'm assigning it to myself. |
| Comment by James A Simmons [ 13/Feb/19 ] |
|
Just to let you know, I'm going to push another LU-9680 update. I have been talking to Amir about its application to LNet UDSP, as well as about using it for lnet selftest, so I might move the netlink handling into liblnetconfig. I will rebase the LU-7659 patch on top of LU-9680, and also push an early stats patch I developed that is not finished yet. |
| Comment by Gerrit Updater [ 18/Sep/19 ] |
|
Ben Evans (bevans@cray.com) uploaded a new patch: https://review.whamcloud.com/36235 |
| Comment by Gerrit Updater [ 18/Oct/19 ] |
|
Ben Evans (bevans@cray.com) uploaded a new patch: https://review.whamcloud.com/36492 |
| Comment by Cory Spitz [ 19/Feb/20 ] |
|
beevans, this is assigned to your old persona. |
| Comment by John Hammond [ 09/Jun/21 ] |
|
Closing since this isn't being worked on. |
| Comment by Cory Spitz [ 09/Jun/21 ] |
|
There are still two patches pending in Gerrit for this ticket. It is probably best not to abandon them. Granted it isn't a true dependency, but we've all been waiting for the netlink changes to finish before refreshing and landing the HSM/data movement patches that would be impacted. It looks like https://review.whamcloud.com/#/c/34230 finally has all the necessary +1s and it is even in master-next now, so perhaps we can reopen and resume this work soon. |
| Comment by James A Simmons [ 22/Mar/22 ] |
|
Both HPE and Microsoft are interested in this work. |
| Comment by Ben Evans (Inactive) [ 22/Mar/22 ] |
|
I think 99% of this can be sidestepped by LU-13384 and by using purely non-kernel calls: the lfs hsm ... calls would all just get routed to the external coordinator (of whatever form). The only issue is the calls that perform imperative restore on file access. I believe those can easily be added using a smaller chunk of the infrastructure in this PR. |
| Comment by James A Simmons [ 22/Mar/22 ] |
|
Why even bother with kernel space at all, then? If you want a pure userland solution then look at http://github.com/cea-hpc/coordinatool. This work is looking to improve what we are already using without creating a new interface. |
| Comment by Ben Evans (Inactive) [ 22/Mar/22 ] |
|
yep, given some improvements, coordinatool looks like a really good solution. |