[LU-5319] Support multiple slots per client in last_rcvd file Created: 10/Jul/14 Updated: 19/Oct/22 Resolved: 27/Aug/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | New Feature | Priority: | Minor |
| Reporter: | Gregoire Pichon | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | p4b, patch, performance, recovery |
| Attachments: |
|
| Issue Links: |
|
| Rank (Obsolete): | 14856 |
| Description |
|
While running the mdtest benchmark, I have observed that file creation and unlink operations from a single Lustre client quickly saturate at around 8000 IOPS: the maximum is reached with as few as 4 tasks in parallel. Looking at the code, it appears that most metadata operations are serialized by a mutex in the MDC layer. After an email discussion with Andreas Dilger, it appears that the limitation is actually on the MDS, since it cannot handle more than a single filesystem-modifying RPC at a time. There is only one slot in the MDT last_rcvd file for each client to save the state of the reply in case it is lost. The aim of this ticket is to implement multiple slots per client in the last_rcvd file so that several filesystem-modifying RPCs can be handled in parallel. Single-client metadata performance should be significantly improved while still ensuring a safe recovery mechanism. |
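To make the idea concrete, here is a purely conceptual sketch (not the actual last_rcvd on-disk format; all names and the slot count are illustrative assumptions) of the difference between one saved reply per client and a small per-client pool of reply slots:

```c
/*
 * Conceptual model only - not the real last_rcvd layout.  It illustrates
 * why one saved reply per client serializes filesystem-modifying RPCs,
 * while a small per-client pool of reply slots lets several such RPCs be
 * in flight and still be reconstructable after a server restart.
 */
#define REPLY_SLOTS_PER_CLIENT 8        /* assumed value, e.g. negotiated at connect */

struct reply_slot {
	unsigned long long rs_transno;  /* transaction the saved reply belongs to */
	unsigned long long rs_xid;      /* xid of the matching request */
	int                rs_in_use;   /* slot holds a reply not yet safe to drop */
};

/* before: effectively one slot per client; after: a small pool per client */
struct client_reply_state {
	struct reply_slot rs_slots[REPLY_SLOTS_PER_CLIENT];
};
```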
| Comments |
| Comment by Peter Jones [ 10/Jul/14 ] |
|
thanks Gregoire! |
| Comment by Peter Jones [ 10/Jul/14 ] |
|
Alex, I understand that you have done some initial experimentation in this area. Peter |
| Comment by Peter Jones [ 17/Jul/14 ] |
|
Alex, is this the prototype for this work? http://review.whamcloud.com/#/c/9871/ Peter |
| Comment by Alex Zhuravlev [ 17/Jul/14 ] |
|
sorry, missed the first comment.. yes, that's it. |
| Comment by Robert Read (Inactive) [ 28/Jul/14 ] |
|
Alex, it might be a good idea to land the client part and perhaps even the protocol changes (with backward compatibility) from that patch sooner rather than later. (Of course with maxslots = 1.) This will reduce the size of the patch, and also we'll have a larger base of potentially compatible clients once we have servers that do support multislot > 1. |
| Comment by Andreas Dilger [ 29/Jul/14 ] |
|
Robert, have you had any chance to test this patch out with workloads that might benefit? |
| Comment by Robert Read (Inactive) [ 29/Jul/14 ] |
|
No, I haven't. |
| Comment by Gregoire Pichon [ 30/Jul/14 ] |
|
I have run the mdtest benchmark on a single client in the following configuration:
Results are given in attachment: |
| Comment by Michael MacDonald (Inactive) [ 30/Jul/14 ] |
|
Gregoire, That's a pretty compelling result. Do you have a feel for how the patch performance compares to running with multiple mountpoints per client node? I would assume that the multi-mount performance would be similar (or even a bit worse), but I'm curious as to whether you have any data for that case. |
| Comment by Gregoire Pichon [ 31/Jul/14 ] |
|
Here are the results with one additional configuration: multiple mount points of the same fs on the single client node. It's slightly better for file creation and quite similar for file removal. see mdtest lustre file creation b.png |
| Comment by Robert Read (Inactive) [ 31/Jul/14 ] |
|
pichong, thanks for running that again, this is helpful. How many mounts did you use, by the way? bzzz, can we make maxslots tunable or even better have the client request a desired value? For some workloads at least I think it should even be a multiple of cores on the client. |
| Comment by Alex Zhuravlev [ 01/Aug/14 ] |
|
Robert, it was a proto. I think if we want to productize the patch, then it makes sense to collect requirements like maxslots, etc. |
| Comment by Gregoire Pichon [ 01/Aug/14 ] |
|
For the multi-mount case there was one mount point per task (24 mount points at most). I am currently writing a solution architecture document, so that we can all agree on the improvement requirements. Alexey, I have a question on your prototype... |
| Comment by Alex Zhuravlev [ 03/Aug/14 ] |
|
> Alexey, I have a question on your prototype... Why is the reply data corresponding to the export's last transno never released? The code comment says otherwise we risk losing last_committed, but I don't see what the ted_last_reply value is used for.
it's used to retain the corresponding slot in last_rcvd. see tgt_free_reply_data(). |
| Comment by Gregoire Pichon [ 04/Aug/14 ] |
|
Sorry, it's still not clear... |
| Comment by Alex Zhuravlev [ 04/Aug/14 ] |
|
if (ted->ted_last_reply != NULL) { ... } - then in tgt_free_reply_data(): ... - since then the corresponding slot in the new last_rcvd file can be re-used. |
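For readers without the prototype at hand, here is a simplified standalone model (plain C with invented types, not the actual Lustre structures) of the retention rule Alex describes: the export pins the reply data that carries its last transno, and only when a newer reply supersedes it is the old one freed so that its slot can be re-used:

```c
#include <stdlib.h>

struct reply_data {
	unsigned long long rd_transno;
	int rd_slot;			/* index of the slot in the reply file */
};

struct export_data {
	struct reply_data *ted_last_reply;	/* pinned: holds the last transno */
};

/* placeholder: clear the slot's bit in the in-memory slot bitmap */
static void slot_release(int slot) { (void)slot; }

/* free reply data and allow its on-disk slot to be re-used */
static void free_reply_data(struct reply_data *rd)
{
	slot_release(rd->rd_slot);
	free(rd);
}

/* a newer reply supersedes the pinned one, so the old slot may be re-used */
static void update_last_reply(struct export_data *ted, struct reply_data *new_rd)
{
	struct reply_data *old = ted->ted_last_reply;

	ted->ted_last_reply = new_rd;	/* keep the highest transno reachable */
	if (old != NULL)
		free_reply_data(old);	/* now its slot can be re-used */
}
```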
| Comment by Gregoire Pichon [ 05/Aug/14 ] |
|
I have attached a solution architecture document (MDTReplyReconstructionImprovement.architecture.pdf). It is a proposal that describes the functional improvements to support multiple filesystem-modifying MDT requests per client and MDT reply reconstruction in that context. It would be great to have some feedback from anyone interested in this feature. |
| Comment by Gregoire Pichon [ 01/Sep/14 ] |
|
I would like to start working on a high level design document for this feature. |
| Comment by Alex Zhuravlev [ 04/Sep/14 ] |
|
Gregoire, the document looks good. the close case is special because of the open-lock cache - an RPC being handled by the MDT (so the MDC semaphore/slot is busy) may ask for an open handle to be released (lock cancellation leading to a close RPC), which requires another slot. |
| Comment by Andreas Dilger [ 17/Sep/14 ] |
|
Alex, what is left to be done with your patch before it can move from prototype to production? |
| Comment by Andreas Dilger [ 18/Sep/14 ] |
|
Gregoire,
Lustre typically does not set tunables via mount options, since that would create a large number of different possible options. This would typically be handled by a read-write tunable, lctl set_param mdc.*.max_modify_rpcs_in_flight=num, to match the normal mdc.*.max_rpcs_in_flight=num tunable. It may be that there is no reason to limit this beyond the existing max_rpcs_in_flight=8 default value. Since OSTs are fine having 8 writes in flight at one time, there is no reason to expect MDTs to have a problem with this either. Typically, the MDS will return the maximum number for max_rpcs_in_flight in obd_connect_data, and the client is free to choose any number less than or equal to this. The default is what the server replies in ocd, or if unsupported, 1 for MDC and 8 for OSC. Setting max_rpcs_in_flight to a larger number via /proc should return an error. It probably makes sense to handle this OBD_CONNECT flag on the OSS as well, and both MDS and OSS services can tune this default parameter permanently in a single place (with lctl conf_param or lctl set_param -P).
This is https://bugzilla.lustre.org/show_bug.cgi?id=3462 (really b=3166, but that one is private). The original problem was related to sending the close while there was another RPC in flight (keeping the main slot in the last_rcvd file busy). Without this extra RPC in flight (and an extra slot to hold the reply reconstruction), the close could deadlock. The document looks good otherwise. |
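As a rough illustration of the negotiation described above (a sketch under assumptions: the field and function names here are invented, only the clamping behaviour is the point), the client would take the server-advertised maximum from the connect data as both its default and its upper bound, and reject larger values set via lctl/procfs:

```c
#include <errno.h>

struct connect_data {
	unsigned int ocd_max_mod_rpcs;	/* server-advertised maximum */
};

struct client_obd {
	unsigned int cl_max_mod_rpcs_in_flight;
	unsigned int cl_ocd_max;	/* cached from the connect reply */
};

/* on connect: take the server value as the default and the upper bound */
static void client_adapt_connect(struct client_obd *cli,
				 const struct connect_data *ocd)
{
	cli->cl_ocd_max = ocd->ocd_max_mod_rpcs ? ocd->ocd_max_mod_rpcs : 1;
	cli->cl_max_mod_rpcs_in_flight = cli->cl_ocd_max;
}

/* on "lctl set_param": reject values above what the server advertised */
static int client_set_max_mod_rpcs(struct client_obd *cli, unsigned int val)
{
	if (val < 1 || val > cli->cl_ocd_max)
		return -ERANGE;
	cli->cl_max_mod_rpcs_in_flight = val;
	return 0;
}
```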
| Comment by Gregoire Pichon [ 18/Sep/14 ] |
|
thanks for your comments. There is however an issue with using max_rpcs_in_flight to control the maximum number of modifying metadata RPCs in parallel, since the limit would also apply to non-modifying metadata RPCs (getattr, readdir, ...). Even if the MDT has a single slot, the MDC must be able to send up to max_rpcs_in_flight getattr requests in parallel; this is the current behavior. Therefore, I think it is necessary to introduce a specific parameter for the maximum number of RPCs that depend on slot allocation on the MDT, max_md_rpcs_in_flight for instance. Its value could not be higher than max_rpcs_in_flight. |
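A minimal sketch of the split Gregoire proposes (illustrative structure and names, not actual MDC code): modifying RPCs are additionally bounded by max_md_rpcs_in_flight, while non-modifying RPCs only count against max_rpcs_in_flight:

```c
struct mdc_limits {
	unsigned int max_rpcs_in_flight;	/* all RPCs */
	unsigned int max_md_rpcs_in_flight;	/* modifying RPCs only */
	unsigned int rpcs_in_flight;
	unsigned int md_rpcs_in_flight;
};

/* returns 1 if the request may be sent now, 0 if it must wait */
static int mdc_can_send(struct mdc_limits *lim, int modifies_fs)
{
	if (lim->rpcs_in_flight >= lim->max_rpcs_in_flight)
		return 0;
	if (modifies_fs &&
	    lim->md_rpcs_in_flight >= lim->max_md_rpcs_in_flight)
		return 0;
	lim->rpcs_in_flight++;
	if (modifies_fs)
		lim->md_rpcs_in_flight++;
	return 1;
}
```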
| Comment by Gregoire Pichon [ 18/Sep/14 ] |
|
Alexey, thanks. |
| Comment by Alex Zhuravlev [ 18/Sep/14 ] |
|
we do not resend REP-ACKs, so they can be lost, meaning the corresponding slot can't be re-used as long as the client is alive. I considered a few schemas, e.g. ping can bring some information, but the schema with tags seems the most trivial one to me. notice it's perfect - we don't use REP-ACKs with OST. so if 8 slots were used, and the client then uses at most 2 slots for a long time, the remaining 6 slots are "lost" again, but not till the client's eviction at least. probably ping can be used to tell the target the highest replied XID - given ping is used on an idling connection, this should be OK? |
| Comment by Gregoire Pichon [ 26/Sep/14 ] |
|
Here is a new version of the architecture document. |
| Comment by Gregoire Pichon [ 26/Nov/14 ] |
|
Here is the design document of the MDT reply reconstruction improvement (MDTReplyReconstructionImprovement.design.pdf). |
| Comment by Gerrit Updater [ 08/Jan/15 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/13297 |
| Comment by Andreas Dilger [ 08/Jan/15 ] |
|
Gregoire, thank you for the excellent design document. I was reading through this and had a few comments:
|
| Comment by Gregoire Pichon [ 12/Jan/15 ] |
|
Andreas, thank you for the design review.
|
| Comment by Alex Zhuravlev [ 12/Jan/15 ] |
|
I think we shouldn't limit this to MDT. OST does need this functionality as well. |
| Comment by Gregoire Pichon [ 13/Jan/15 ] |
|
I agree the feature could also apply to OST, and I have tried to take it into account in the design. |
| Comment by Alex Zhuravlev [ 13/Jan/15 ] |
|
given the code is shared on the server, it doesn't make sense to put the client's implementation into MDC only and limit the scope, IMHO. there should be no changes from the architecture/design point of view. the patch I mentioned before addresses this already. |
| Comment by Alex Zhuravlev [ 16/Feb/15 ] |
|
from the document it's not clear how that highest consecutive XID is maintained. RPCs aren't sent in XID order: at least with the current code it's possible to have XID=5 assigned, then have a few RPCs with XID=6,7,8,9,10 sent and replied, and only then is the RPC with XID=5 sent. if RPC(XID=10) was sent with that "mark"=9 and it races with RPC(XID=5), then we can lose the slot for XID=5? |
| Comment by Gerrit Updater [ 25/Feb/15 ] |
|
Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/13862 |
| Comment by Gerrit Updater [ 26/Feb/15 ] |
|
Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/13895 |
| Comment by Gerrit Updater [ 26/Feb/15 ] |
|
Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/13900 |
| Comment by Gerrit Updater [ 02/Mar/15 ] |
|
Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/13931 |
| Comment by Gerrit Updater [ 02/Mar/15 ] |
|
Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/13932 |
| Comment by Gregoire Pichon [ 03/Mar/15 ] |
|
Here is a new version of the design document. |
| Comment by Gregoire Pichon [ 03/Mar/15 ] |
|
Alex, I don't see the objective of the several patches you recently posted to gerrit, as the content of your patches does not implement what has been written and reviewed in the design document. For example, the more appropriate solution to release reply data on the server side has been identified in the design as the solution based on XID, not the one based on tag reuse. The patches http://review.whamcloud.com/13895 and http://review.whamcloud.com/13900 implement the tag-based solution, don't they? |
| Comment by Alex Zhuravlev [ 03/Mar/15 ] |
|
I don't think we should rely on XID only, as processing time and the order of RPCs can vary significantly. Imagine there are 32 RPCs in flight (say, with XIDs 1-32). If, for some reason, XID=32 takes a long time, we won't be able to re-use the potentially free slots 1-31. IMO this isn't very nice, because the reply_log file grows for no reason. I think the XID-based approach is needed to address a different case, when RPCs-in-flight decreases and some tags aren't used for a while. |
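For clarity, a sketch of the tag-based release Alex contrasts with the XID-only approach (simplified types and hypothetical names, not the server code): when a new request re-uses a client tag, the reply previously stored under that tag can be dropped immediately, independent of XID ordering:

```c
#include <stdlib.h>

#define MAX_TAGS 32	/* assumed per-client tag space */

struct reply_data {
	unsigned long long xid;
	/* ... reconstruction state for one reply ... */
};

struct client_export {
	struct reply_data *tag_reply[MAX_TAGS];	/* one slot per tag */
};

/*
 * On a new (non-resent) request for this tag, the reply previously stored
 * under the same tag is no longer needed for reconstruction, so its slot
 * can be re-used even if smaller XIDs are still in flight.
 * Caller guarantees tag < MAX_TAGS.
 */
static void store_reply(struct client_export *exp, unsigned int tag,
			struct reply_data *rd)
{
	if (exp->tag_reply[tag] != NULL)
		free(exp->tag_reply[tag]);
	exp->tag_reply[tag] = rd;
}
```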
| Comment by Gregoire Pichon [ 04/Mar/15 ] |
|
Alex, if we want to coordinate our efforts, as we both agreed by email last week, I think we need to fully agree on the design. So please review the design document attached to the ticket (current version is 0.3) and give your comments before continuing to work on patches. If you prefer, I am also open to setting up a call to have a live discussion on the design. |
| Comment by Alex Zhuravlev [ 04/Mar/15 ] |
|
sure.. from the design I read last time (a few weeks ago) I got to think we'll be using both methods (tag + XID). also, I observed that the XID approach has the issue I described on Feb 16. no one replied. |
| Comment by Gregoire Pichon [ 04/Mar/15 ] |
|
I agree with your comment posted on Feb 16. |
| Comment by Gerrit Updater [ 04/Mar/15 ] |
|
Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/13960 |
| Comment by Alex Zhuravlev [ 04/Mar/15 ] |
|
so, we haven't agreed on the design yet, but the patches are still coming |
| Comment by Gregoire Pichon [ 09/Mar/15 ] |
|
Here is a new version of the design document: 0.4. |
| Comment by Alex Zhuravlev [ 13/Mar/15 ] |
|
the document looks solid, a few comments:
> This will ensure the export's last transno can be rebuilt from disk in case of recovery.
better to say "highest committed transno"
> When the MDT will receive an RPC with the MSG_RESENT flag
iirc, we have to check for MSG_REPLAY as well
> This is because, without the client received xid information, the server would not be able to release the reply data in memory
probably we should stop supporting this after some future version.
> 3. received xid
as for downgrade, I'd think we shouldn't allow this if some of the reply_data is valid - otherwise recovery can break.
probably it makes sense to describe shortly what happens when a client disconnects: we have to release all the slots.
performance improvements: some optimizations for bitmap scanning?
testing should be mentioned: how do we verify that execute-once semantics is retained. |
| Comment by Gregoire Pichon [ 17/Mar/15 ] |
|
> better to say "highest committed transno"
> we have to check for MSG_REPLAY as well
> probably we should stop to support this after some future version.
> probably you should mention that currently imp_send_list isn't strictly ordered and XIDs "from the past" can join the list. also, looking at imp_send_list isn't enough - there is imp_delayed_list.
> as for downgrade, I'd think we shouldn't allow this if some of reply_data is valid
> probably it makes sense to describe shortly what happens when a client disconnects: we have to release all the slots.
> testing should be mentioned: how do we verify execute-once semantics is retained. |
| Comment by Gerrit Updater [ 17/Mar/15 ] |
|
Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/14095 |
| Comment by Alex Zhuravlev [ 18/Mar/15 ] |
|
> Are you talking about old on-disk reply data of the client that disconnects? That's a good idea. This would allow the "reply data" file to be completely cleared when all clients have disconnected, and it could then be truncated to avoid unnecessary parsing at the next MDT start. It would also avoid the use of the slot generation number and reply data generation number lrd_generation. However, I am wondering if parsing the entire "reply data" file when a client disconnects is acceptable in terms of processing time.
the primary goal is to support backward compatibility (even in some restricted form). for simplicity the schema can be as follows: when all the clients are disconnected (which is easy and cheap to detect), the whole content of reply_log can be discarded.
> testing should be mentioned: how do we verify execute-once semantics is retained.
correct. we have to verify that if RPCs A and B were sent concurrently (i.e. using different tags), the server reconstructs the corresponding replies. |
| Comment by Gerrit Updater [ 24/Mar/15 ] |
|
Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/14153 |
| Comment by Gerrit Updater [ 27/Mar/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13297/ |
| Comment by Gerrit Updater [ 31/Mar/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13960/ |
| Comment by Gregoire Pichon [ 01/Apr/15 ] |
|
Alex, |
| Comment by Alex Zhuravlev [ 01/Apr/15 ] |
|
OSP represents a remote OSD. say there are MDT and OST nodes: the MDT can talk to the OST using the regular OSD API, and OSP is a proxy transferring the API's methods over the network, including modifying methods. at the moment it has to serialize modifications due to the single-slot last_rcvd, and this is a big limitation given OSP is used for distributed metadata and other features. this is why I asked for a generic enough implementation which can easily be used by other components, including OSP. |
| Comment by Gregoire Pichon [ 02/Apr/15 ] |
|
Alex, |
| Comment by Alex Zhuravlev [ 02/Apr/15 ] |
|
there are two major features using that - LFSCK (check sanity-lfsck.sh) and DNE (in many scripts, look for "lfs mkdir"). |
| Comment by Gerrit Updater [ 06/Apr/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14095/ |
| Comment by Gerrit Updater [ 07/Apr/15 ] |
|
Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/14374 |
| Comment by Gerrit Updater [ 07/Apr/15 ] |
|
Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/14375 |
| Comment by Gerrit Updater [ 07/Apr/15 ] |
|
Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/14376 |
| Comment by Gregoire Pichon [ 09/Apr/15 ] |
|
Here is version 0.5 of the design document. |
| Comment by Gerrit Updater [ 01/May/15 ] |
|
Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/14655 |
| Comment by Gregoire Pichon [ 13/May/15 ] |
|
Here is version 1 of the test plan document. |
| Comment by Gerrit Updater [ 13/May/15 ] |
|
Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/14793 |
| Comment by Gerrit Updater [ 19/May/15 ] |
|
Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/14860 |
| Comment by Gerrit Updater [ 19/May/15 ] |
|
Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/14861 |
| Comment by Gerrit Updater [ 19/May/15 ] |
|
Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/14862 |
| Comment by Andreas Dilger [ 25/May/15 ] |
|
Grégoire, it looks like your patches are regularly causing conf-sanity test_32[abd] to fail or time out. This is not the case with other patches being tested recently, so it looks like there is a regression in your patches 14860 (test failure) and 14861 (test timeout). One failure in 14860 is:
LustreError: 14561:0:(class_obd.c:684:cleanup_obdclass()) obd_memory max: 53634235, leaked: 152
shadow-10vm8: shadow-10vm8: Memory leaks detected
Could you please investigate? |
| Comment by Gerrit Updater [ 31/May/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14153/ |
| Comment by Gerrit Updater [ 29/Jun/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14793/ |
| Comment by Gerrit Updater [ 01/Jul/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14374/ |
| Comment by Gerrit Updater [ 01/Jul/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14860/ |
| Comment by Gregoire Pichon [ 17/Jul/15 ] |
|
Here is version 2 of the test plan document, updated with tests results. |
| Comment by Gerrit Updater [ 09/Aug/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14862/ |
| Comment by Gerrit Updater [ 26/Aug/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14861/ |
| Comment by Peter Jones [ 27/Aug/15 ] |
|
Landed for 2.8 |
| Comment by Niu Yawei (Inactive) [ 09/Dec/15 ] |
|
The multi-slots implementation introduced a regression, see
To obtain the unreplied requests by scanning the sending/delayed lists, the current multi-slot implementation moved the xid assignment from the request packing stage to the request sending stage; however, that breaks the original mechanism used to coordinate timestamp updates on OST objects (needed because of out-of-order operations such as setattr, truncate and write).
To fix this regression, obd_import->imp_unreplied_list is introduced to track all the unreplied requests. All requests in the list are sorted by xid, so the client may get the known maximal replied xid by checking the first element in the list. obd_import->imp_known_replied_xid is introduced for sanity-check purposes; it is updated along with imp_unreplied_list. Once a request is built, it is inserted into the unreplied list, and when the reply is seen by the client or the request is about to be freed, the request is removed from the list.
Two tricky points are worth mentioning here:
1. Replay requests need to be added back to the unreplied list before sending. Instead of adding them back one by one during replay, we choose to add them all back together before replay; that is easier for strict sanity checking and less bug-prone.
2. The sanity check on the server side is strengthened a lot. To satisfy the stricter check, connect & disconnect requests won't carry the known replied xid anymore; see the comments in ptlrpc_send_new_req() for details. |
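A standalone model of the imp_unreplied_list mechanism described above (simplified: a plain sorted singly-linked list rather than the kernel list API; names are illustrative, not the actual ptlrpc code): requests stay on the list while unreplied and are kept sorted by xid, so the known maximal replied xid is one less than the first xid still waiting for a reply:

```c
#include <stdint.h>
#include <stddef.h>

struct req {
	uint64_t xid;
	struct req *next;		/* kept sorted by xid */
};

struct import {
	struct req *unreplied_head;	/* lowest unreplied xid first */
};

/* insert in xid order; replay requests are re-added the same way */
static void add_unreplied(struct import *imp, struct req *r)
{
	struct req **p = &imp->unreplied_head;

	while (*p != NULL && (*p)->xid < r->xid)
		p = &(*p)->next;
	r->next = *p;
	*p = r;
}

/* remove once the reply is seen by the client or the request is freed */
static void del_unreplied(struct import *imp, struct req *r)
{
	struct req **p = &imp->unreplied_head;

	while (*p != NULL && *p != r)
		p = &(*p)->next;
	if (*p == r)
		*p = r->next;
}

/* everything strictly below the first unreplied xid is known to be replied */
static uint64_t known_replied_xid(const struct import *imp)
{
	if (imp->unreplied_head == NULL)
		return 0;		/* nothing outstanding to report */
	return imp->unreplied_head->xid - 1;
}
```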