[LU-5319] Support multiple slots per client in last_rcvd file Created: 10/Jul/14  Updated: 19/Oct/22  Resolved: 27/Aug/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: New Feature Priority: Minor
Reporter: Gregoire Pichon Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: p4b, patch, performance, recovery

Attachments: PDF File MDTReplyReconstructionImprovement.architecture.pdf     PDF File MDTReplyReconstructionImprovement.design.pdf     PDF File MDTReplyReconstructionImprovement.testplan.pdf     PNG File mdtest lustre 2.5.60 file creation b.png     PNG File mdtest lustre 2.5.60 file removal b.png    
Issue Links:
Duplicate
Related
is related to LU-6386 lower transno may overwrite the bigge... Resolved
is related to LU-7185 restore flags on ptlrpc_connect_impor... Resolved
is related to LU-7410 After downgrade from 2.8 to 2.5.5, hi... Resolved
is related to LU-6864 DNE3: Support multiple modify RPCs in... Resolved
is related to LUDOC-304 Updates related to support of multipl... Resolved
is related to LU-6840 update memory reply data in DNE updat... Resolved
is related to LU-5951 sanity test_39k: mtime is lost on close Resolved
is related to LU-6981 obd_last_committed is not updated in ... Resolved
is related to LU-7729 Don't return ptlrpc_error() in proces... Resolved
is related to LU-7028 racer:kernel:BUG: spinlock bad magic ... Resolved
is related to LU-7082 conf-sanity test_90b: MDT start failed Resolved
is related to LU-7408 multislot RPC support didn't declare ... Resolved
is related to LU-6841 replay-single test_30: multiop 20786 ... Closed
is related to LU-3285 Data on MDT Resolved
is related to LU-14144 get and set Lustre module parameters ... Open
is related to LU-933 allow disabling the mdc_rpc_lock for ... Resolved
is related to LU-6753 Fix several minor improvements to mul... Resolved
Rank (Obsolete): 14856

 Description   

While running the mdtest benchmark, I have observed that file creation and unlink operations from a single Lustre client quickly saturate at around 8000 IOPS: the maximum is reached with as few as 4 tasks in parallel.
When using several Lustre mount points on a single client node, the file creation and unlink rates do scale with the number of tasks, up to the 16 cores of my client node.

Looking at the code, it appears that most metadata operations are serialized by a mutex in the MDC layer.
In the mdc_reint() routine, request posting is protected by mdc_get_rpc_lock() and mdc_put_rpc_lock(), where the lock is:
struct client_obd -> struct mdc_rpc_lock *cl_rpc_lock -> struct mutex rpcl_mutex.
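
For reference, a simplified sketch of that serialization path (abridged; error handling omitted):

/* every filesystem-modifying request takes the per-obd mutex, so only
 * one such RPC can be in flight per client mount point */
static int mdc_reint(struct ptlrpc_request *request,
                     struct mdc_rpc_lock *rpc_lock, int level)
{
        int rc;

        request->rq_send_state = level;

        mdc_get_rpc_lock(rpc_lock, NULL);   /* mutex_lock(&rpc_lock->rpcl_mutex) */
        rc = ptlrpc_queue_wait(request);    /* send the RPC and wait for its reply */
        mdc_put_rpc_lock(rpc_lock, NULL);   /* mutex_unlock(&rpc_lock->rpcl_mutex) */

        return rc;
}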

After an email discussion with Andreas Dilger, it appears that the limitation is actually on the MDS, since it cannot handle more than a single filesystem-modifying RPC at one time. There is only one slot in the MDT last_rcvd file for each client to save the state for the reply in case it is lost.
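
For context, the per-client record in the last_rcvd file only has room for one reply's worth of state (a minimal, abridged sketch of struct lsd_client_data; close-specific fields and padding omitted):

/* one slot per client: a single (transno, xid, result, data) tuple,
 * so only the very last filesystem-modifying reply can be reconstructed */
struct lsd_client_data {
        __u8  lcd_uuid[40];        /* client UUID */
        __u64 lcd_last_transno;    /* last transaction number */
        __u64 lcd_last_xid;        /* xid of the last request */
        __u32 lcd_last_result;     /* result of the last request */
        __u32 lcd_last_data;       /* per-operation data, e.g. open disposition */
        /* ... */
};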

The aim of this ticket is to implement multiple slots per client in the last_rcvd file so that several filesystem-modifying RPCs can be handled in parallel.

The single-client metadata performance should be significantly improved while still ensuring a safe recovery mechanism.



 Comments   
Comment by Peter Jones [ 10/Jul/14 ]

thanks Gregoire!

Comment by Peter Jones [ 10/Jul/14 ]

Alex

I understand that you have done some initial experimentation in this area

Peter

Comment by Peter Jones [ 17/Jul/14 ]

Alex

Is this the prototype for this work? http://review.whamcloud.com/#/c/9871/

Peter

Comment by Alex Zhuravlev [ 17/Jul/14 ]

sorry, missed the first comment.. yes, that's it.

Comment by Robert Read (Inactive) [ 28/Jul/14 ]

Alex, it might be a good idea to land the client part and perhaps even the protocol changes (with backward compatibility) from that patch sooner rather than later. (Of course with maxslots = 1.) This will reduce the size of the patch, and also we'll have a larger base of potentially compatible clients once we have servers that do support multislot > 1.

Comment by Andreas Dilger [ 29/Jul/14 ]

Robert, have you had any chance to test this patch out with workloads that might benefit?

Comment by Robert Read (Inactive) [ 29/Jul/14 ]

No, I haven't.

Comment by Gregoire Pichon [ 30/Jul/14 ]

I have run the mdtest benchmark on a single client in the following configuration:

  • lustre 2.5.60
  • lustre 2.5.60 with fail_loc=0x804 (it bypasses the mdc request serialization)
  • lustre 2.5.60 + patch #9871 from Alexey

Results are given in attachment:
mdtest lustre 2.5.60 file creation.png
mdtest lustre 2.5.60 file removal.png

Comment by Michael MacDonald (Inactive) [ 30/Jul/14 ]

Gregoire,

That's a pretty compelling result. Do you have a feel for how the patch performance compares to running with multiple mountpoints per client node? I would assume that the multi-mount performance would be similar (or even a bit worse), but I'm curious as to whether you have any data for that case.

Comment by Gregoire Pichon [ 31/Jul/14 ]

Here are the results with one additional configuration: multiple mount points of the same fs on the single client node. It's slightly better for file creation and quite similar for file removal.

see mdtest lustre file creation b.png
and mdtest lustre file removal b.png

Comment by Robert Read (Inactive) [ 31/Jul/14 ]

pichong, thanks for running that again, this is helpful. How many mounts did you use, by the way?

bzzz, can we make maxslots tunable or even better have the client request a desired value? For some workloads at least I think it should even be a multiple of cores on the client.

Comment by Alex Zhuravlev [ 01/Aug/14 ]

Robert, it was a proto. I think if we want to productize the patch, then it makes sense to collect requirements like maxslots, etc.

Comment by Gregoire Pichon [ 01/Aug/14 ]

For multi-mount case there was one mount point per task (24 mount points at max).

I am currently writing a solution architecture document, so we all agree on improvement requirements.

Alexey, I have a question on your prototype...
Why is the reply data corresponding to the export's last transno never released? The code comment says that otherwise we risk losing last_committed, but I don't see what the ted_last_reply value is used for.

Comment by Alex Zhuravlev [ 03/Aug/14 ]

> Alexey, I have a question on your prototype... Why is the reply data corresponding to the export's last transno never released? The code comment says that otherwise we risk losing last_committed, but I don't see what the ted_last_reply value is used for.

it's used to retain the corresponding slot in last_rcvd; see tgt_free_reply_data().

Comment by Gregoire Pichon [ 04/Aug/14 ]

Sorry, it's still not clear...
I see the ted_last_reply field updated to retain the reply data corresponding to the last transno, but it is not used anywhere else.
And when the client area of the last_rcvd file is written, the last transno is written from ted_transno.

Comment by Alex Zhuravlev [ 04/Aug/14 ]

if (ted->ted_last_reply != NULL) {
        tgt_free_reply_data(lut, ted, ted->ted_last_reply);
        ted->ted_last_reply = NULL;
}

then in tgt_free_reply_data():

        tgt_bitmap_clear(lut, trd->trd_index);

after which the corresponding slot in the new last_rcvd file can be re-used.
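
A minimal sketch of the slot bookkeeping being described here (illustrative only; the helper names below are not the prototype's):

/* reply slots are tracked by a bitmap on the target; freeing the reply
 * data clears its bit so the slot can be handed to the next reply */
static int tgt_find_free_slot(unsigned long *bitmap, int nslots)
{
        int idx;

        do {
                idx = find_first_zero_bit(bitmap, nslots);
                if (idx >= nslots)
                        return -ENOSPC;          /* no free slot: grow the file */
        } while (test_and_set_bit(idx, bitmap)); /* retry if another thread raced */

        return idx;
}

static void tgt_release_slot(unsigned long *bitmap, int idx)
{
        clear_bit(idx, bitmap);                  /* slot may now be re-used */
}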

Comment by Gregoire Pichon [ 05/Aug/14 ]

I have attached a solution architecture document (MDTReplyReconstructionImprovement.architecture.pdf).

It's a proposal that describes the functional improvements to support multiple filesystem-modifying MDT requests per client and MDT reply reconstruction in that context.

It would be great to have some feedback from anyone interested in this feature.

Comment by Gregoire Pichon [ 01/Sep/14 ]

I would like to start working on a high level design document for this feature.
Please send your feedback on the architecture document by the end of the week (2014/09/07), so I can take it into account.
thanks.

Comment by Alex Zhuravlev [ 04/Sep/14 ]

Gregoire, the document looks good.

the close case is specific because of the openlock cache - an RPC being handled by the MDT (so the mdc semaphore/slot is busy) may require an open handle to be released (lock cancellation leading to a close RPC), which requires another slot.

Comment by Andreas Dilger [ 17/Sep/14 ]

Alex, what is left to be done with your patch before it can move from prototype to production?

Comment by Andreas Dilger [ 18/Sep/14 ]

Gregoire,
some comments on the design doc.

The mount.lustre command will support a new option max_modify_md_rpcs_in_flight=num that specifies the desired maximum number of modify metadata RPCs in flight for that client. This will allow clients with different level of metadata activity, or different number of cores, to be able to send more or less modify metadata RPCs in parallel.

Lustre typically does not set tunables via mount options, since that would lead to a large number of different possible options. This would typically be handled by a read-write tunable, lctl set_param mdc.*.max_modify_rpcs_in_flight=num, to match the normal mdc.*.max_rpcs_in_flight=num tunable. It may be that there is no reason to limit this beyond the existing max_rpcs_in_flight=8 default value. Since OSTs are fine with having 8 writes in flight at one time, there is no reason to expect MDTs to have a problem with this either.

Typically, the MDS will return the maximum number for max_rpcs_in_flight in obd_connect_data, and the client is free to choose any number less than or equal to this. The default is whatever the server replies in the ocd, or if unsupported, 1 for MDC and 8 for OSC. Setting max_rpcs_in_flight to a larger number via /proc should return an error. It probably makes sense to handle this OBD_CONNECT flag on the OSS as well, so that both MDS and OSS services can tune this default parameter permanently in a single place (with lctl conf_param or lctl set_param [-P]).

The existing code handles the close requests (MDS_CLOSE and MDS_DONE_WRITING) differently from other metadata requests. They benefit from a separate serialization allowing a close request to be sent in parallel of other requests, probably to avoid some deadlock situations (what is the exact use case, Jira or Bugzilla id?).

This is https://bugzilla.lustre.org/show_bug.cgi?id=3462 (really b=3166, but that one is private). The original problem was related to sending the close while there was another RPC in flight (keeping the main slot in the last_rcvd file busy). Without this extra RPC in flight (and an extra slot to hold the reply reconstruction), the close could deadlock.

The document looks good otherwise.

Comment by Gregoire Pichon [ 18/Sep/14 ]

thanks for your comments.
I understand the use of an mdc configuration parameter rather than the mount option.

There is however an issue with using max_rpcs_in_flight to control the maximum number of modify metadata RPCs in parallel, since the limit would also apply to the non-modify metadata RPCs (getattr, readdir, ...). Even if the MDT has a single slot, the mdc must be able to send up to max_rpcs_in_flight getattr requests in parallel. This is the current behavior.

Therefore, I think it's necessary to introduce a specific parameter for the maximum number of RPCs that depend on slot allocation on the MDT, max_md_rpcs_in_flight for instance. Its value could not be higher than max_rpcs_in_flight.
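
A minimal sketch of that constraint (illustrative; apart from cl_max_rpcs_in_flight, the names are placeholders taken from this comment, not a final API):

/* the modify-RPC limit is a separate tunable, bounded by max_rpcs_in_flight
 * so that non-modifying RPCs (getattr, readdir, ...) keep their parallelism */
static int mdc_set_max_md_rpcs_in_flight(struct client_obd *cli, __u16 max)
{
        if (max < 1 || max > cli->cl_max_rpcs_in_flight)
                return -ERANGE;

        cli->cl_max_md_rpcs_in_flight = max;
        return 0;
}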

Comment by Gregoire Pichon [ 18/Sep/14 ]

Alexey,
Could you explain what the motivation was for using a tag for metadata requests?
What was the drawback of releasing the reply data structure and the slot in the reply bitmap only when the reply ACK is received?

thanks.

Comment by Alex Zhuravlev [ 18/Sep/14 ]

we do not resend REP-ACKs and they can be lost, meaning the corresponding slot can't be re-used as long as the client is alive. I considered a few schemes - e.g. a ping can bring some information - but the scheme with tags seems the most trivial one to me. Note it's not perfect (we don't use REP-ACKs with the OST at all): if 8 slots were used and the client then uses at most 2 slots for a long time, the remaining 6 slots are "lost" again, until the client's eviction at least. Probably a ping can be used to tell the target the highest replied XID - given that the ping is used on an idling connection, this should be OK?
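
A rough sketch of the tag idea (illustrative only; the helper names are assumptions): each modify RPC carries a small tag drawn from a pool of max_md_rpcs_in_flight values, the server keeps one reply slot per (client, tag), and reuse of a tag tells the server the previous reply stored for that tag can be dropped.

/* tags are 1-based; 0 means "no tag" (non-modifying request) */
static __u16 get_md_rpc_tag(unsigned long *tag_bitmap, __u16 max_tags)
{
        int i = find_first_zero_bit(tag_bitmap, max_tags);

        if (i >= max_tags)
                return 0;                /* all tags busy: caller must wait */
        set_bit(i, tag_bitmap);
        return i + 1;                    /* carried in the request message */
}

static void put_md_rpc_tag(unsigned long *tag_bitmap, __u16 tag)
{
        clear_bit(tag - 1, tag_bitmap);  /* reply received: tag can be re-used */
}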

Comment by Gregoire Pichon [ 26/Sep/14 ]

Here is a new version of the architecture document.
It has been updated with comments posted by Alexey Zhuravlev and Andreas Dilger.

Comment by Gregoire Pichon [ 26/Nov/14 ]

Here is the design document of the MDT reply reconstruction improvement (MDTReplyReconstructionImprovement.design.pdf).
Please review it and post comments in the JIRA ticket.
Thanks.

Comment by Gerrit Updater [ 08/Jan/15 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/13297
Subject: LU-5319 mdt: pass __u64 for storing opdata
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1ab36efb03e47adc3a48d93bfad3b77d23b67cba

Comment by Andreas Dilger [ 08/Jan/15 ]

Gregoire, thank you for the excellent design document. I was reading through this and had a few comments:

  • is there even a need or benefit to the server tightly restricting the number of modifying RPCs in flight at connect time? It seems to me that it makes sense to limit this at the client by default (like max_rpcs_in_flight), but there is no technical reason why some clients (e.g. login nodes) couldn't have, say, 128 modifying RPCs in flight. What functional purpose does ocd_max_mod_rpcs serve?

    It might be an opportunity to record the per-operation data as a 64-bit field: lcd_last_data and lcd_last_close_data are 32-bit fields whereas the intent disposition is 64 bits. The lrd_data field could be defined as 64 bits.

  • This seems possible, though it would add some complexity to the code if we ever started to use this as a 64-bit field. I notice that mdt_get_disposition() and friends are all silently converting the passed value to a signed int, so I pushed patch http://review.whamcloud.com/13297 to change that code to pass the __u64 through unchanged.
  • struct reply_data should be padded to a power-of-two size so that it fits into disk blocks evenly, and named something like tgt_reply_data or lustre_reply_data. struct reply_header should be the same size as struct reply_data so that the structures are all aligned, and also renamed in a similar manner.
  • for the Reply Data Release section, you wrote that solution #3 (received XID) is the preferred one, where the client sends "the highest consecutive XID" to the server in each RPC. Does this depend on the XID values actually being consecutive? That would be a problem, since the XIDs are global on a client and shared by all servers that the client is sending RPCs to, so they will not be consecutive for a single MDT or OST. I think what you mean is that the client will send to each server the highest RPC XID for which a reply has been received and which does not have an unreplied lower-numbered RPC XID.

Comment by Gregoire Pichon [ 12/Jan/15 ]

Andreas, thank you for the design review.

  • restricting the number of modifying RPCs in flight at connect time
    I recognize there is no strong need for that restriction. It was mentioned in the architecture document and seemed to be accepted by everyone.
    More importantly, it seems to me a way to prevent a client from flooding the MDT with its requests, which would prevent other clients from using the service at the expected performance level. But it's true there is no real checking on the server side at the moment.
    The MDT provides a service, so it seems natural that it controls and limits its usage, and the ocd_maxmodrpcs field is how it informs clients of the limit value.
  • 64-bit lrd_data field
    After adding the patch you have posted, I will try to make lrd_data field 64-bit.
  • on-disk structures
    The structures used in the reply_data file are actually named lsd_reply_header and lsd_reply_data, similarly to lsd_client_data.
    I agree about the power-of-two size rounding and the same size for both (see the sketch after this list).
  • received XID
    Actually the term "the highest consecutive XID received by a client" is not exact. I had seen that the XIDs are global on a client and shared among the client obd devices. But the only important point is that the XIDs are always increasing. The server manages this information for each client separately and releases the reply data having a lower XID for that client only.
    With "highest consecutive" I wanted to say there is no lower XID for which a reply was not received for that client obd device. I agree with your rephrasing "the highest RPC XID for which a reply has been received and does not have an unreplied lower-numbered RPC XID".
Comment by Alex Zhuravlev [ 12/Jan/15 ]

I think we shouldn't limit this to the MDT. The OST needs this functionality as well.

Comment by Gregoire Pichon [ 13/Jan/15 ]

I agree the feature could also apply to the OST, and I have tried to take that into account in the design.
But the initial goal was to improve the metadata client performance (see the ticket's initial description), and I will keep that scope since this is already a significant amount of work for me.
This does not prevent someone else from later extending the architecture solution and design to provide this functionality to the OST.

Comment by Alex Zhuravlev [ 13/Jan/15 ]

given the code is shared on the server, it doesn't make sense to put the client's implementation in the MDC and limit the scope, IMHO. There should be no changes from the architecture/design point of view. The patch I mentioned before addresses this already.

Comment by Alex Zhuravlev [ 16/Feb/15 ]

from the document it's not clear how that highest consecutive XID is maintained. RPCs aren't sent in XID order: at least with the current code it's possible to have XID=5 assigned, then have a few RPCs with XID=6,7,8,9,10 sent and replied, and only then the RPC with XID=5 is sent. If RPC(XID=10) was sent with that "mark"=9 and it races with RPC(XID=5), then we can lose the slot for XID=5?

Comment by Gerrit Updater [ 25/Feb/15 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/13862
Subject: LU-5319 ptlrpc: atomically assign XID and put on the sending list
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fb8b7fb24e83f90bbb2fa11ff9aaebcecd257e4e

Comment by Gerrit Updater [ 26/Feb/15 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/13895
Subject: LU-5319 protocol: support for multislot feature
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 94b4151ec488ed6af9a4d996b5753c767a980ef3

Comment by Gerrit Updater [ 26/Feb/15 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/13900
Subject: LU-5319 mdc: new wrappers to support multislot feature
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6bf51ad2daead1017f34bc0c28f0462ee8df9c4e

Comment by Gerrit Updater [ 02/Mar/15 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/13931
Subject: LU-5319 protocol: define reserved OBD_CONNECT bits
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: a3c540c609f8a092eb0a1c63806e71707745f3d3

Comment by Gerrit Updater [ 02/Mar/15 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/13932
Subject: LU-5319 protocol: define OBD_CONNECT_MULTISLOT bit
Project: fs/lustre-release
Branch: b2_7
Current Patch Set: 1
Commit: 8635c06ebb3f7543d7e83b12ce6213940b1cf115

Comment by Gregoire Pichon [ 03/Mar/15 ]

Here is a new version of the design document.
It has been updated with comments posted by Andreas Dilger.

Comment by Gregoire Pichon [ 03/Mar/15 ]

Alex,

I don't see the objective of the several patches you recently posted to gerrit, as their content does not implement what has been written and reviewed in the design document. For example, the more appropriate solution to release reply data on the server side has been identified in the design as the one based on XIDs, not the one based on tag reuse.

The patches http://review.whamcloud.com/13895 and http://review.whamcloud.com/13900 implement the tag-based solution, don't they?

Comment by Alex Zhuravlev [ 03/Mar/15 ]

I don't think we should rely on XIDs only, as the processing time and order of RPCs can vary significantly. Imagine there are 32 RPCs in flight (say, with XIDs 1-32): if, for some reason, XID=32 takes a long time, we won't be able to re-use the potentially free slots 1-31. IMO this isn't very nice because the reply_log file grows for no reason. I think the XID-based approach is needed to address a different case, when the number of RPCs in flight decreases and some tags aren't used for a while.

Comment by Gregoire Pichon [ 04/Mar/15 ]

Alex,
This is the kind of information you should have shared at the time of the design review!

If we want to coordinate our efforts, as we both agreed by email last week, I think we need to fully agree on the design. So please review the design document attached to the ticket (current version is 0.3) and give your comments before continuing to work on patches.

If you prefer, I am also open to setting up a call to have a live discussion on the design.
Regards,
Grégoire.

Comment by Alex Zhuravlev [ 04/Mar/15 ]

sure.. from the design I read last time (a few weeks ago) I got the impression we'd be using both methods (tag + XID). also, I observed that the XID approach has the issue I described on Feb 16; no one replied.

Comment by Gregoire Pichon [ 04/Mar/15 ]

I agree with your comment posted on Feb 16.
Hopefully, your gerrit patch #13862 "atomically assign XID and put on the sending list" addresses the issue.
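
For reference, a minimal sketch of what "atomically assign XID and put on the sending list" amounts to (a sketch only; the wrapper function name is made up):

/* assign the XID and queue the request under the same lock, so a request
 * with a smaller XID can never appear on the list after a later request
 * has already been sent carrying a "highest replied XID" that covers it */
static void queue_on_sending_list(struct obd_import *imp,
                                  struct ptlrpc_request *req)
{
        spin_lock(&imp->imp_lock);
        req->rq_xid = ptlrpc_next_xid();    /* XID chosen under imp_lock */
        list_add_tail(&req->rq_list, &imp->imp_sending_list);
        spin_unlock(&imp->imp_lock);
}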

Comment by Gerrit Updater [ 04/Mar/15 ]

Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/13960
Subject: LU-5319 ptlrpc: Add OBD_CONNECT_MULTIMODRPCS flag
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 115af1c4a7568fc84aaa6225302bc0970ea61aa2

Comment by Alex Zhuravlev [ 04/Mar/15 ]

so, we haven't agreed on the design yet, but the patches are still coming

Comment by Gregoire Pichon [ 09/Mar/15 ]

Here is a new version of the design document: 0.4.
It has been updated to take into account Alex's comments.

Comment by Alex Zhuravlev [ 13/Mar/15 ]

the document looks solid, a few comments:

> This will ensure the export's last transno can be rebuilt from disk in case of recovery.

better to say "highest committed transno"

> When the MDT will receive an RPC with the MSG_RESENT flag

iirc, we have to check for MSG_REPLAY as well

> This is because, without the client received xid information, the server would not be able to release the reply data in memory

probably we should stop supporting this after some future version.

> 3. received xid
probably you should mention that currently imp_send_list isn't strictly ordered and XIDs "from the past" can join the list.
also, looking at imp_send_list isn't enough - there is imp_delayed_list.

as for downgrade, I'd think we shouldn't allow this if some of the reply_data is valid - otherwise recovery can break.
I'd suggest introducing an incompatibility flag preventing old code from mounting a multislot-enabled fs. To enable downgrade,
a simple scheme can be used: for example, once all the clients are disconnected, the MDT can reset that incompatibility flag
on disk.
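
A possible shape for that downgrade guard (a sketch only; the flag name, its value and the helper are assumptions for illustration):

/* refuse to let old code mount while multislot reply data may still be
 * needed for recovery; clear the flag once it is safe again */
#define OBD_INCOMPAT_MULTI_RPCS 0x00000400      /* value assumed for illustration */

static void tgt_mark_multi_rpcs_incompat(struct lr_server_data *lsd, bool in_use)
{
        if (in_use)
                lsd->lsd_feature_incompat |= OBD_INCOMPAT_MULTI_RPCS;
        else    /* e.g. all clients disconnected and reply_data discarded */
                lsd->lsd_feature_incompat &= ~OBD_INCOMPAT_MULTI_RPCS;
        /* the caller then writes the last_rcvd header back to disk */
}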

probably it makes sense to briefly describe what happens when a client disconnects: we have to release all the slots.

performance improvements: some optimizations for bitmap scanning?

testing should be mentioned: how do we verify execute-once semantics is retained.

Comment by Gregoire Pichon [ 17/Mar/15 ]

> better to say "highest committed transno"
ok

> we have to check for MSG_REPLAY as well
ok

> probably we should stop supporting this after some future version.
I agree, but I don't think there is much to do at the moment.

> probably you should mention that currently imp_send_list isn't strictly ordered and XIDs "from the past" can join the list. also, looking at imp_send_list isn't enough - there is imp_delayed_list.
ok

> as for downgrade, I'd think we shouldn't allow this if some of the reply_data is valid
I agree

> probably it makes sense to briefly describe what happens when a client disconnects: we have to release all the slots.
Are you talking about the old on-disk reply data of the client that disconnects?
That's a good idea. This would allow the "reply data" file to be completely cleared when all clients have disconnected, and it could then be truncated to avoid unnecessary parsing at the next MDT start. It would also avoid the use of the slot generation number and the reply data generation number lrd_generation.
However, I am wondering whether parsing the entire "reply data" file when a client disconnects is acceptable in terms of processing time.

> testing should be mentioned: how do we verify execute-once semantics is retained.
What do you mean by "execute-once" semantics? Is it the fact that any client operation is treated only once by the server, even in case of RPC resend, target recovery, etc.?

Comment by Gerrit Updater [ 17/Mar/15 ]

Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/14095
Subject: LU-5319 ptlrpc: Add a tag field to ptlrpc messages
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4456dfe1242602fbead299c64041cb6bdbfe2408

Comment by Alex Zhuravlev [ 18/Mar/15 ]

> Are you talking about the old on-disk reply data of the client that disconnects? That's a good idea. This would allow the "reply data" file to be completely cleared when all clients have disconnected, and it could then be truncated to avoid unnecessary parsing at the next MDT start. It would also avoid the use of the slot generation number and the reply data generation number lrd_generation. However, I am wondering whether parsing the entire "reply data" file when a client disconnects is acceptable in terms of processing time.

the primary goal is to support backward compatibility (even in some restricted form). For simplicity the scheme can be as follows: when all the clients are disconnected (which is easy and cheap to detect), the whole content of the reply_log can be discarded.

> testing should be mentioned: how do we verify execute-once semantics is retained.
> What do you mean by "execute-once" semantics? Is it the fact that any client operation is treated only once by the server, even in case of RPC resend, target recovery, etc.?

correct. We have to verify that if RPCs A and B were sent concurrently (i.e. using different tags), then the server reconstructs the corresponding replies.

Comment by Gerrit Updater [ 24/Mar/15 ]

Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/14153
Subject: LU-5319 mdc: Add max modify RPCs in flight variable
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bad683c33942fbc47cb9ae6c22187f95669ab3a4

Comment by Gerrit Updater [ 27/Mar/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13297/
Subject: LU-5319 mdt: pass __u64 for storing opdata
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6cad330ded9c6ff21b35229139b7f60fbbfe80c6

Comment by Gerrit Updater [ 31/Mar/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13960/
Subject: LU-5319 ptlrpc: Add OBD_CONNECT_MULTIMODRPCS flag
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 807a40c5678109087e2c8759bd86253034804c8d

Comment by Gregoire Pichon [ 01/Apr/15 ]

Alex,
I have seen that the patch for LU-6376 "Add RPC lock for OSP update RPC" was landed last week.
What is the role of the OSP component?
What kind of metadata operations does it perform?

Comment by Alex Zhuravlev [ 01/Apr/15 ]

OSP represents a remote OSD. Say there are MDT and OST nodes: the MDT can talk to the OST using the regular OSD API, and OSP is a proxy transferring the API's methods over the network, including the modifying methods. At the moment it has to serialize modifications due to the single-slot last_rcvd, and this is a big limitation given that OSP is used for distributed metadata and other features. This is why I asked for a generic enough implementation that can easily be used by other components, including OSP.

Comment by Gregoire Pichon [ 02/Apr/15 ]

Alex,
Could you give simple use cases/examples where OSP is used to transfer modifying methods?
This will help me write testcases.
Thanks.

Comment by Alex Zhuravlev [ 02/Apr/15 ]

there are two major features using that - LFSCK (check sanity-lfsck.sh) and DNE (in many scripts, look for "lfs mkdir").

Comment by Gerrit Updater [ 06/Apr/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14095/
Subject: LU-5319 ptlrpc: Add a tag field to ptlrpc messages
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3998a8e474b58a5bb4bc47b620adc836c27ab70d

Comment by Gerrit Updater [ 07/Apr/15 ]

Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/14374
Subject: LU-5319 mdc: manage number of modify RPCs in flight
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fc37353393a36f10420c6e6396dab1a9b5f42acb

Comment by Gerrit Updater [ 07/Apr/15 ]

Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/14375
Subject: LU-5319 osp: manage number of modify RPCs in flight
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 29a0d43597a328c10d6167c734876736637db187

Comment by Gerrit Updater [ 07/Apr/15 ]

Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/14376
Subject: LU-5319 mdc: remove deprecated metadata RPC serialization
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 38e9d11ef4e4a36a80f7bb66a203e1c1c3d4f537

Comment by Gregoire Pichon [ 09/Apr/15 ]

Here is version 0.5 of the design document.
It has been updated to address Alexey's last comments.

Comment by Gerrit Updater [ 01/May/15 ]

Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/14655
Subject: LU-5319 client: support multiple modify RPCs in parallel
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5b122238c6f8f0d58ac74f2466e887098de01c2b

Comment by Gregoire Pichon [ 13/May/15 ]

Here is version 1 of the test plan document.
Please feel free to review and give some feedback.

Comment by Gerrit Updater [ 13/May/15 ]

Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/14793
Subject: LU-5319 ptlrpc: embed highest XID in each request
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5f2bf2d0b79825bd2214809fee2c679a1f010fd9

Comment by Gerrit Updater [ 19/May/15 ]

Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/14860
Subject: LU-5319 mdt: support multiple modify RCPs in parallel
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1d9870b493954d53d8cbb1ff994a99817df0b2d7

Comment by Gerrit Updater [ 19/May/15 ]

Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/14861
Subject: LU-5319 tests: testcases for multiple modify RPCs feature
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1fcb722e9b0b22a0cd5ae4640a1abf146efd1877

Comment by Gerrit Updater [ 19/May/15 ]

Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/14862
Subject: LU-5319 utils: update lr_reader to display additional data
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0738702663851f3d01f45b1725df2c623b507bb9

Comment by Andreas Dilger [ 25/May/15 ]

Grégoire, it looks like your patches are regularly causing conf-sanity test_32[abd] to fail or time out. This is not the case with other patches being tested recently, so it looks like there is a regression in your patches 14860 (test fail) and 14861 (test timeout).

One failure in 14860 is:

LustreError: 14561:0:(class_obd.c:684:cleanup_obdclass()) obd_memory max: 53634235, leaked: 152
shadow-10vm8: 
shadow-10vm8: Memory leaks detected

Could you please investigate.

Comment by Gerrit Updater [ 31/May/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14153/
Subject: LU-5319 mdc: add max modify RPCs in flight variable
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 60c05ea9f66f9bd3f5fd35942a12edb1e311c455

Comment by Gerrit Updater [ 29/Jun/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14793/
Subject: LU-5319 ptlrpc: embed highest XID in each request
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bf3e7f67cb33f3b4e0590ef8af3843ac53d0a4e8

Comment by Gerrit Updater [ 01/Jul/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14374/
Subject: LU-5319 mdc: manage number of modify RPCs in flight
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1fc013f90175d1e50d7a22b404ad6abd31a43e38

Comment by Gerrit Updater [ 01/Jul/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14860/
Subject: LU-5319 mdt: support multiple modify RCPs in parallel
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5fc7aa3687daca5c14b0e479c58146e0987daf7f

Comment by Gregoire Pichon [ 17/Jul/15 ]

Here is version 2 of the test plan document, updated with tests results.

Comment by Gerrit Updater [ 09/Aug/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14862/
Subject: LU-5319 utils: update lr_reader to display additional data
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6460ae59bf6e5175797dc66ecbe560eebc8b6333

Comment by Gerrit Updater [ 26/Aug/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14861/
Subject: LU-5319 tests: testcases for multiple modify RPCs feature
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c2d27a0f12688c0d029880919f8b002e557b540c

Comment by Peter Jones [ 27/Aug/15 ]

Landed for 2.8

Comment by Niu Yawei (Inactive) [ 09/Dec/15 ]

The multi-slots implementation introduced a regression, see LU-5951.

To get the unreplied requests by scanning the sending/delayed lists, the current multi-slot implementation moved the XID assignment from the request packing stage to the request sending stage; however, that breaks the original mechanism used to coordinate the timestamp updates on OST objects (needed for out-of-order operations such as setattr, truncate and write).

To fix this regression, LU-5951 moved the XID assignment back to the request packing stage and introduced an unreplied list to track all the unreplied requests. The following is a brief description of the LU-5951 patch:

obd_import->imp_unreplied_list is introduced to track all the unreplied requests; all requests in the list are sorted by XID, so that the client can get the known maximal replied XID by checking the first element in the list.

obd_import->imp_known_replied_xid is introduced for sanity-check purposes; it is updated along with the imp_unreplied_list.

Once a request is built, it is inserted into the unreplied list, and when the reply is seen by the client or the request is about to be freed, the request is removed from the list. Two tricky points are worth mentioning here:

1. Replay requests need to be added back to the unreplied list before sending. Instead of adding them back one by one during replay, we choose to add them all back together before replay; that is easier for strict sanity checking and less bug-prone.

2. The sanity check on the server side is strengthened a lot. To satisfy the stricter check, connect & disconnect requests won't carry the known replied XID anymore; see the comments in ptlrpc_send_new_req() for details.
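
A minimal sketch of how the known replied XID falls out of the sorted unreplied list (close to what the LU-5951 patch does; the request list-head name is an assumption):

/* the unreplied list is kept sorted by XID, so the smallest unreplied XID
 * sits at the head; every XID below it has already been replied to */
static __u64 known_replied_xid(struct obd_import *imp)
{
        struct ptlrpc_request *req;

        if (list_empty(&imp->imp_unreplied_list))
                return 0;

        req = list_entry(imp->imp_unreplied_list.next,
                         struct ptlrpc_request, rq_unreplied_list);
        return req->rq_xid - 1;
}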
