[LU-4931] New feature of giving server/storage side advice of accessing file Created: 19/Apr/14 Updated: 01/Dec/17 Resolved: 05/Oct/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | New Feature | Priority: | Major |
| Reporter: | Li Xi (Inactive) | Assignee: | Li Xi (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | p4j, patch | ||
| Issue Links: |
|
| Rank (Obsolete): | 13631 | ||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
We implement a new feature which provides new APIs and utilities that allow advanced users and smart applications to give advice about the access pattern of a Lustre file, in order to improve data/metadata access performance. The idea is similar to fadvise64_64(2) or posix_fadvise(2), but the advice can be passed directly through the Lustre client to the server/storage side. Some tests show that this feature can improve performance for some applications when proper advice is given in advance. |
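As a rough illustration of the intent (a sketch only; llapi_ladvise(), struct llapi_lu_ladvise and LU_LADVISE_WILLREAD are the names that eventually landed with this ticket and should be checked against lustreapi.h):

    #include <linux/types.h>
    #include <lustre/lustreapi.h>

    /* Sketch: ask the server/storage side to prefetch [start, end) of an
     * open Lustre file.  posix_fadvise(fd, start, end - start,
     * POSIX_FADV_WILLNEED) would only populate the local client cache;
     * this call ships the advice through the client to the OST(s). */
    static int prefetch_on_server(int fd, __u64 start, __u64 end)
    {
            struct llapi_lu_ladvise advise = {
                    .lla_advice = LU_LADVISE_WILLREAD,
                    .lla_start  = start,
                    .lla_end    = end,
            };

            return llapi_ladvise(fd, 0, 1, &advise);
    }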
| Comments |
| Comment by Li Xi (Inactive) [ 19/Apr/14 ] |
|
The patch is tracked here. |
| Comment by Peter Jones [ 21/Apr/14 ] |
|
Bobijam, could you please review this suggested feature and provide feedback? Thanks, Peter |
| Comment by Andreas Dilger [ 29/Aug/14 ] |
|
While looking at the patch, I also noticed the following. For ADVICE_RANDOM, if the file is small enough that it can fit entirely into client RAM, or the objects can fit entirely into OST RAM, then it could be mapped to ADVICE_CACHE_DATA; otherwise it should disable readahead on the file/object. This would also be useful if there were a burst buffer device that could copy the file from the OSTs into SSD/NVRAM storage, but that doesn't exist yet. For ADVICE_SEQUENTIAL it would make sense to increase the readahead window for the file on both the client and the backing OST(s). |
| Comment by Li Xi (Inactive) [ 30/Aug/14 ] |
|
Hi Andreas, Thank you very much for the guidance. That is really interesting. I will add these hints when I get the chance to refresh the patch. |
| Comment by Li Xi (Inactive) [ 28/Oct/14 ] |
|
There could be different implementations for each kind of advice/hint. So, in order to make it easier to review, I separated the change into several patches. The first patch adds a framework and the other patches add advice/hint support based on that framework. Framework patch: Willneed patch: |
| Comment by Jinshan Xiong (Inactive) [ 09/Jan/15 ] |
|
The purpose of this patch wasn't clear to me. This patch may work under a few assumptions, and I'm not sure we can make those assumptions. The problem blocking us from implementing fadvise(2) is that there is no file system callback for this interface. If we take the ioctl() approach as this patch suggests, we could alternatively implement an ioctl-based fadvise(2) (which could easily be migrated once a file system callback for fadvise(2) is added to the kernel). In that case, WILLNEED would actually read ahead pages on the OFD side. What do you think? |
| Comment by Li Xi (Inactive) [ 14/Jan/15 ] |
|
Hi Jinshan, We did a bit of benchmarking with this patch for concept validation. If we prefetch data from disk into memory, we get a huge improvement in read performance, especially for small reads. And even if we prefetch the data from disk to SSD (with support from hardware APIs), the performance improvement is still significant. I am not sure, but I guess that, compared to prefetching into memory, this patch might be more useful for hybrid storage devices, because, as you said, memory size is really limited. Using the existing ioctl framework is an interesting idea. But there may be some other advice types (for example, some of the types that Andreas suggested) which don't need to be sent to the OSS/MDS side. An ioctl seems like a low-level interface which controls ldiskfs/ZFS directly. Fadvise would be more upper-level and might trigger smart reactions from different levels of the Lustre components. So I think it is a little bit different from the ioctl interface. |
| Comment by Jinshan Xiong (Inactive) [ 15/Jan/15 ] |
|
I had a conversation with Li Xi. It turned out that this is not exactly the same scenario as fadvise(); the goal of this work is not even to read WILLNEED pages into memory. The intention is to work with specialized hardware where the OST has HDDs with an SSD acting as the HDD cache, and the advice just prefetches the data from HDD to SSD so that upcoming reads find the data in the SSD. The data can be read by multiple clients, so it may not be good to transfer that data into a client's memory the way fadvise() does. In that case, it's really confusing to use the terminology of fadvise(). I would like to change the name to prefetch or something else. Li Xi, please correct me if I got this wrong. |
| Comment by Li Xi (Inactive) [ 16/Jan/15 ] |
|
Hi Jinshan, Yeah, the scenario of this work is different from traditional fadvise(), so it would be a good idea to use a different name to prevent confusion. |
| Comment by Andreas Dilger [ 16/Jan/15 ] |
|
Jinshan, I think "prefetch" is not necessarily correct either, since there may be a desire to do e.g. DONTNEED or NOREUSE to flush data from the cache. I don't mind "ladvise" as the name, since it is essentially "fadvise, but on the server rather than the client". This may also be able to integrate with DSS in the future to give cache hints to the server. |
| Comment by Jinshan Xiong (Inactive) [ 16/Jan/15 ] |
|
Hi Andreas, If I could make a choice, I'd avoid ladvise() and any terminology similar to fadvise(). Sooner or later, we're going to implement fadvise(2) for Lustre, and people will start to ask what the difference is between ladvise() and fadvise(). |
| Comment by Li Xi (Inactive) [ 20/Jan/15 ] |
|
Hi Jinshan, Do you have any better idea about the name? I am fine with ladvise(). I guess the difference between fadvise() and this mechanism has to be explained anyway, no matter which name we choose. The difference between the names ladvise() and fadvise() looks enough to alert people that they are similar but different. And names that have similar meanings, e.g. hint and intent, might conflict with existing mechanisms too, which would be worse. |
| Comment by Gerrit Updater [ 09/Feb/15 ] |
|
Li Xi (lixi@ddn.com) uploaded a new patch: http://review.whamcloud.com/13691 |
| Comment by Oleg Drokin [ 14/Feb/15 ] |
|
I just read through the patch (and left a bunch of comments). I think it leaves more questions than it answers. The very important question is: why did you not try to talk to the upstream kernel people to see if they would be willing to add a callback in the fadvise system call to call into the filesystem? If they are willing to do this, your job is suddenly much simplified, as a lot of third-party apps that currently use posix_fadvise would start magically working and we won't have one more API to think about. Have you considered that there may be no sense in sending any advice to the servers all by itself? Why not cache the advice information on the client: not only would you be able to glean this information from the file descriptor (even if the kernel guys don't agree to insert an FS callback) for compatibility reasons, there's one less RPC, and then you can also send only the relevant advice with every IO, refreshing the server's idea of your wishes every time, should the server forget (due to failover/recovery, memory pressure, or countless other reasons). I see your example of "migrate data to another tier of storage" potentially being done by a sysadmin (with an lfs ... command; lfs ladvise is too cryptic and I think it's best if you think up something more intuitive, like lfs mark fastaccess file, to give a random non-binding example); the implementation then would be to issue the same advice command (whatever the way to do this would be) and then issue a 1-byte read or some other lightweight IO in the necessary region so that your wishes are transferred to the server. Oh, also note how, if an application wants to control file access, it now needs to make two calls, to posix_fadvise and your ioctl; this is also inconvenient, I imagine, and if the kernel guys reject your approach of calling into the filesystem, you might want to call into sys_fadvise64 yourself from your ioctl. |
| Comment by Andreas Dilger [ 14/Feb/15 ] |
|
Oleg, I think you miss some of the value of this interface. The regular fadvise syscall is itself not necessarily storing any persistent state on the client either. fadvise(WILLNEED) just prefetches pages into cache, but has no guarantee they will even be loaded or kept, since it is only advice to try to optimize performance. The ladvise code is similar to fadvise, except it is like calling fadvise on the server, which isn't possible today. Even if the upstream kernel were changed to allow fadvise() to contact the filesystem, the behavior is different. The workload for ladvise is e.g. a bunch of different clients doing small random reads of a file, so prefetching pages into OSS cache with big linear reads before the random IO is a net benefit. Fetching all that data into each client cache with fadvise() may not be, due to much more data being sent to the clients. Similarly, having an ladvise DONTNEED that could flush all the cache for a specific object from OST cache may be better than only flushing it from the client cache. Even if fadvise is changed in the upstream kernel, it will take several years before that gets into widely used vendor kernels (I don't think we plan to patch client kernels), so having an interface for current systems is needed. |
| Comment by Li Xi (Inactive) [ 14/Feb/15 ] |
|
Hi Oleg, I think Andreas has pointed out all the possible reasons that I can think of. And please note that another important use of the ladvise() interface is to give DSS hints to OSTs in the future. Of course, we could hack the fadvise() interface and add DSS advice. But DSS advice is so different from the existing fadvise() framework that a separate interface seems better. |
| Comment by Oleg Drokin [ 14/Feb/15 ] |
|
Andreas: I got that. Regular fadvise does store permanent state on the client in many cases even now (not in enough detail to be useful here, though). I feel that your example of random reads from a file does not really address all the problems at hand. In reality there are three possible cases: a single client does a lot of random reads from the file, and a lot of clients do random reads from the file (causing or not causing the entire file to be read in the end). As for patching client kernels, at least in the case of RedHat they seem to be happy to backport patches from upstream that people need, as long as they were already accepted upstream. I imagine other major vendors are in a similar position. Now, back to the protocol-level changes: do we really think sending this advice separately, as opposed to as part of the IO RPC, is the right choice? Protocol changes are the most fixed in place and hardest to change, so we really need to get it right the first time around. I guess if we do decide on separate RPCs for advice, it makes sense to future-proof them a bit to allow multiple entries. |
| Comment by Andreas Dilger [ 14/Feb/15 ] |
|
I think it may be best to consider client vs. server cache management separately, even if a functional fadvise() call were available in the future. If a client calls fadvise(DONTNEED), it isn't clear whether that should also flush the cache on the server (maybe other clients are still accessing that file). If an app is sophisticated enough to use llapi_ladvise() then it can also call fadvise() as needed. Why should we entangle client and server cache management if there may be good reasons not to? One example could be an app that uses ladvise(WILLNEED) to prefetch random-IO data into server cache, reads it randomly into client cache, then calls ladvise(DONTNEED) to drop it from the server cache and start loading the next dataset in advance, while the current data stays in client cache until processing is done. The app doesn't want ladvise(DONTNEED) to flush the client cache, and if we entangle the two then such an optimization wouldn't be possible. Yes, it is possible that conflicting directives could be sent from multiple clients, but the OST could also start to ignore the advice in this case. In any case, the effect will generally be limited to a single user's files, so if they are asking for inconsistent behaviour then they get what they ask for in the end. As for adding such directives with each IO RPC, I could imagine that might also be possible (e.g. server-side prefetch for readahead), and that isn't precluded in the future, but I imagine that for most workloads the app will call ladvise() separately from read/write, so it makes sense to have a separate RPC for it today. Also, I expect that ladvise() advice will only be needed when it is not something that the kernel could detect itself (e.g. that some random IO is coming soon), so it might never be possible to generate such hints automatically from within the kernel. |
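A rough sketch of the pipelined pattern described above. Only the WILLNEED/WILLREAD side is part of the patches on this ticket; LU_LADVISE_DONTNEED is hypothetical here and only illustrates decoupled server-side cache management:

    /* Hypothetical pipeline: prefetch the next chunk into OST cache while
     * the client is still processing the current chunk from its own cache. */
    static void advance_dataset(int fd, __u64 cur_off, __u64 next_off, __u64 chunk)
    {
            struct llapi_lu_ladvise adv;

            /* Prefetch the next chunk into server cache only; client cache untouched. */
            adv = (struct llapi_lu_ladvise){ .lla_advice = LU_LADVISE_WILLREAD,
                                             .lla_start  = next_off,
                                             .lla_end    = next_off + chunk };
            llapi_ladvise(fd, 0, 1, &adv);

            /* ... read and process the current chunk from client cache ... */

            /* Hypothetical DONTNEED: drop the finished chunk from the *server*
             * cache without dropping it from the client cache.  No such advice
             * type exists in the patches discussed here; it is a placeholder. */
            adv = (struct llapi_lu_ladvise){ .lla_advice = LU_LADVISE_DONTNEED,
                                             .lla_start  = cur_off,
                                             .lla_end    = cur_off + chunk };
            llapi_ladvise(fd, 0, 1, &adv);
    }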
| Comment by Oleg Drokin [ 14/Feb/15 ] |
|
Looking at the ongoing examples in Gerrit, it looks like we really have two use cases here. As such, I imagine the current patch is mostly fine as the very first step in that direction. Protocol-wise it's OK. |
| Comment by Jinshan Xiong (Inactive) [ 16/Feb/15 ] |
|
I think this is exactly the problem this patch will address. I don't think random IO is in the scope of this patch, because a dedicated API is provided to applications and the applications should know exactly what they are doing. Applications would simply be looking for trouble if they issued the wrong IO-model advice to the OSTs. I guess one source of confusion is the function name fadvise(). This is why I'd like to avoid anything resembling advise() and use a totally different name. The comments are too long, sorry if I missed something. |
| Comment by Oleg Drokin [ 17/Feb/15 ] |
|
Actually, I guess the advise thing makes sense in a way, if we consider that it is the application giving the access advice ahead of time. It's just that the proposed RPCs only make sense as advice to tiered storage on the backend, to let it know ahead of time when to move data to different tiers. |
| Comment by Andreas Dilger [ 17/Feb/15 ] |
|
If the server cache can be considered a "storage tier" then this code is already useful. It would also be useful for ZFS with L2ARC, or for DSS, if wired in correctly. |
| Comment by Oleg Drokin [ 18/Feb/15 ] |
|
Server cache is definitely a storage tier in my book. |
| Comment by Andreas Dilger [ 09/May/15 ] |
|
I was thinking for LU_LADVISE_RANDOM that it makes sense to send the random IO blocksize with the ladvise RPC. For new file writes with ZFS this would allow selecting the blocksize of the file to match the random IO size to avoid large read-modify-writes. |
| Comment by Li Xi (Inactive) [ 11/May/15 ] |
|
Yeah, that makes sense. And in the process of adding advice type support for DSS, I also realized that extra fields in 'struct lu_ladvise' might be necessary for specifying arguments. That requires wire protocol updates. I am not sure how these arguments could be added in an extensible way, because one or two u64 padding fields does not sound like a good solution. |
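One extensible approach is to carry a few generic value fields in the wire record and interpret them according to lla_advice, e.g. a RANDOM advice could pass the expected IO blocksize in one of them. The layout below is a sketch in that direction (roughly the shape the landed struct took); names and field order are illustrative and should be checked against lustre_idl.h:

    /* Sketch of an extensible on-wire advice record.  The lla_value fields
     * are generic arguments whose meaning depends on lla_advice; unused
     * fields are zero.  Not the authoritative wire definition. */
    struct lu_ladvise {
            __u16 lla_advice;       /* advice type, e.g. LU_LADVISE_WILLREAD */
            __u16 lla_value1;       /* advice-specific argument */
            __u32 lla_value2;       /* advice-specific argument */
            __u64 lla_start;        /* first byte of the extent for the advice */
            __u64 lla_end;          /* last byte of the extent for the advice */
            __u32 lla_value3;       /* advice-specific argument */
            __u32 lla_value4;       /* advice-specific argument */
    };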
| Comment by Li Xi (Inactive) [ 30/Nov/15 ] |
|
I've cleaned up the code again. A man page for lfs-ladvise has been added. The WILLNEED advice has been renamed to WILLREAD to prevent confusion with the Linux kernel's fadvise. |
| Comment by Ben Evans (Inactive) [ 18/Jan/16 ] |
|
Would it be possible (or reasonable) to add a lockless_truncate flag into the ladvise framework? We have customers who would like more fine-grained control over the lockless_truncate flag (by file, rather than by filesystem). |
| Comment by Andreas Dilger [ 19/Jan/16 ] |
|
Ben, you could propose a patch and we can see what it looks like. It would just be a new advice type that sets a flag in the Lustre file info indicating that truncates are lockless, unless the client already has a lock. |
| Comment by Andreas Dilger [ 23/Feb/16 ] |
|
Add ticket for manual update. |
| Comment by Nathan Rutman [ 23/Feb/16 ] |
|
If the purpose of the advice is to influence backend tier selection, it probably also makes sense to include an ADVICE_ARCHIVE directive indicating that data should be down-tiered to HSM or slower Lustre storage. |
| Comment by Andreas Dilger [ 24/Feb/16 ] |
|
Is that a different interface for "lfs hsm_archive" on the file, or are you thinking of HSM archives behind individual OSTs? I'm not against adding advice types, but I think they need to have a real use case and not just something that might be used in the future. |
| Comment by Nathan Rutman [ 24/Feb/16 ] |
|
Most flags listed here are for cache / hot data; I'm suggesting it is also helpful to know about data non-reuse: ADVICE_CACHE_CLIENT -> ADVICE_CACHE_SERVER -> ADVICE_UNCACHE -> ADVICE_ARCHIVE. For example, if I'm writing a checkpoint file that I know I will not read, Lustre might choose to follow the DIO path and skip all caches. It was just a thought for consideration really; I'm not trying to push it. |
| Comment by Gerrit Updater [ 17/Apr/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/10029/ |
| Comment by Gerrit Updater [ 13/May/16 ] |
|
Li Xi (lixi@ddn.com) uploaded a new patch: http://review.whamcloud.com/20203 |
| Comment by Gerrit Updater [ 15/Aug/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12458/ |
| Comment by Gerrit Updater [ 16/Aug/16 ] |
|
Gu Zheng (gzheng@ddn.com) uploaded a new patch: http://review.whamcloud.com/21940 |
| Comment by Gerrit Updater [ 07/Sep/16 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/22361 |
| Comment by Andreas Dilger [ 08/Sep/16 ] |
|
Li Xi, I noticed just now in ofd_ladvise_prefetch() that this is allocating PTLRPC_MAX_BRW_PAGES * sizeof(niobuf_local) = 160KB for each ladvise willread call. Instead, this should be using struct tgt_thread_big_cache *tbc = req->rq_svc_thread->t_data as is done in tgt_brw_read(). |
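A minimal sketch of the suggested change, assuming the buffer in ofd_ladvise_prefetch() is the usual niobuf_local array; the surrounding code and error handling are illustrative, and only the allocation strategy (per-thread cache instead of per-call allocation) is the point:

    /* Instead of a ~160KB OBD_ALLOC_LARGE() of PTLRPC_MAX_BRW_PAGES
     * niobuf_local entries on every willread call, reuse the buffer that
     * is preallocated once per service thread, as tgt_brw_read() does. */
    struct tgt_thread_big_cache *tbc = req->rq_svc_thread->t_data;
    struct niobuf_local *lnb;

    if (tbc == NULL)
            RETURN(-ENOMEM);        /* illustrative error handling */
    lnb = tbc->local;               /* PTLRPC_MAX_BRW_PAGES entries */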
| Comment by Andreas Dilger [ 08/Sep/16 ] |
|
Also, while you are in there, can you please fix the indenting in

static int ofd_ladvise_hdl(struct tgt_session_info *tsi)
{
        :
        :
        case LU_LADVISE_WILLREAD:
                req->rq_status = ofd_ladvise_prefetch(env, fo, ladvise->lla_start, ladvise->lla_end);

to be

                req->rq_status = ofd_ladvise_prefetch(env, fo,
                                                      ladvise->lla_start,
                                                      ladvise->lla_end);
|
| Comment by Gerrit Updater [ 14/Sep/16 ] |
|
Li Xi (lixi@ddn.com) uploaded a new patch: http://review.whamcloud.com/22489 |
| Comment by Gerrit Updater [ 15/Sep/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20203/ |
| Comment by Peter Jones [ 16/Sep/16 ] |
|
As per discussion with Ihara, all further enhancements in this area will be tracked under a different JIRA ticket. |
| Comment by Andreas Dilger [ 03/Oct/16 ] |
|
Reopen to land man page for 2.9.0: Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/22910 |
| Comment by Gerrit Updater [ 05/Oct/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22910/ |
| Comment by Peter Jones [ 05/Oct/16 ] |
|
The man page has landed. Remaining patches tracked under this ID will be landed under a new ticket. |
| Comment by Gerrit Updater [ 08/Oct/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22489/ |
| Comment by Gerrit Updater [ 25/Oct/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21940/ |