[LU-7426] DNE3: Current llog format for remote update llog Created: 14/Nov/15 Updated: 05/Jul/23
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major |
| Reporter: | Di Wang | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The current llog bitmap header and record structure is not well suited to recording the update llog for DNE, especially the remote update llog. In DNE, update records will be written by top_trans_start(). But there are still other problems. In llog_osd_write_rec(), the header information (bitmap, lgh_index, offset, etc.) is packed into the RPC buffer, but the RPC is not sent until transaction stop. This information might therefore be invalid by the time the RPC is actually sent, especially if a previous llog update fails. There are a few options to fix the problem:

1. Once any llog update RPC fails, fail all of the following RPCs in the sending list, then have OSP update the header and resend new RPCs.
2. Add a special llog API to the OSP->OUT interface, so OUT can handle these llog update records independently, instead of only using dumb writes.
3. Use a different format for the update llog.

Any thoughts?
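The stale-header race described above can be sketched with a small standalone model. All structs and names here are invented for illustration; this is not the actual Lustre code, just the shape of the problem:

```c
/* Toy model of the stale-header race (all names invented; not real
 * Lustre code).  At transaction start the OSP packs a snapshot of the
 * llog header into the RPC buffer, but the RPC is only sent at
 * transaction stop; if an earlier update fails in between, the
 * snapshot no longer matches the real header. */
#include <assert.h>
#include <string.h>

#define LLOG_BITMAP_WORDS 4

/* Header state packed into an RPC buffer at transaction start. */
struct hdr_snapshot {
	unsigned int bitmap[LLOG_BITMAP_WORDS];
	int          lgh_index;   /* record index at pack time */
	long         offset;      /* write offset at pack time */
};

/* Live header state as tracked on the OSP side. */
struct live_hdr {
	unsigned int bitmap[LLOG_BITMAP_WORDS];
	int          lgh_index;
	long         offset;
};

static void pack_snapshot(const struct live_hdr *h, struct hdr_snapshot *s)
{
	memcpy(s->bitmap, h->bitmap, sizeof(s->bitmap));
	s->lgh_index = h->lgh_index;
	s->offset = h->offset;
}

/* 1 if the snapshot still matches the live header when the RPC is
 * finally sent, 0 if it went stale. */
static int snapshot_valid(const struct live_hdr *h,
			  const struct hdr_snapshot *s)
{
	return s->lgh_index == h->lgh_index && s->offset == h->offset &&
	       memcmp(s->bitmap, h->bitmap, sizeof(s->bitmap)) == 0;
}

/* Two buffered updates; the first fails after the second has already
 * packed its header copy.  Returns 0: the second RPC is stale. */
static int snapshot_race_demo(void)
{
	struct live_hdr hdr = { .bitmap = { 0 }, .lgh_index = 1,
				.offset = 8192 };
	struct hdr_snapshot rpc1, rpc2;

	pack_snapshot(&hdr, &rpc1);   /* update 1 packs the header */
	hdr.lgh_index++;
	hdr.offset += 64;             /* OSP assumes update 1 lands */
	pack_snapshot(&hdr, &rpc2);   /* update 2 packs the new header */

	/* update 1 fails: roll the header back; rpc2 is now invalid */
	hdr.lgh_index--;
	hdr.offset -= 64;
	return snapshot_valid(&hdr, &rpc2);
}
```

The point of the sketch is that rpc2 carries index/offset values computed on the assumption that rpc1 succeeded, so any failure between pack and send invalidates every later buffered update.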
| Comments |
| Comment by Andreas Dilger [ 14/Nov/15 ] |
|
What about having an OUT write that is only one bit in size, and using fmd (from ofd) to order updates to the llog on the OSD? That way, the one bit bitmap update can always be done (not affecting other bitmap updates), and changes to the index and offset would not overwrite newer updates that had already arrived. As long as the index and offset are the highest value, it doesn't matter if an earlier update is dropped because the request was never sent. |
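The "single bit plus highest value wins" scheme suggested here could be modelled roughly as below. All names are invented for illustration; this is not the real OUT/ofd code:

```c
/* Toy model of the "one-bit write + highest-value-wins" idea
 * (all names invented; not the real OUT/ofd code). */
#include <assert.h>

struct toy_hdr {
	unsigned long long bitmap;   /* one bit per llog record slot */
	int                last_index;  /* highest record index seen */
	long               tail_offset; /* highest write offset seen */
};

/* Apply one llog-header update on the server side.  Setting a single
 * bit never conflicts with other bits, and index/offset only move
 * forward, so applying updates out of order (or dropping an older
 * one entirely) cannot overwrite newer state. */
static void apply_update(struct toy_hdr *h, int bit, int index, long offset)
{
	h->bitmap |= 1ULL << bit;
	if (index > h->last_index)
		h->last_index = index;
	if (offset > h->tail_offset)
		h->tail_offset = offset;
}

/* Newer update arrives before an older one; the final header is the
 * same as if they had arrived in order.  Returns 1 on success. */
static int out_of_order_demo(void)
{
	struct toy_hdr h = { 0, 0, 0 };

	apply_update(&h, 2, 2, 128);  /* newer update arrives first */
	apply_update(&h, 1, 1, 64);   /* older update arrives late */
	return h.last_index == 2 && h.tail_offset == 128 &&
	       h.bitmap == 0x6;
}
```

The design choice being illustrated: each update is commutative and monotone, so ordering between llog update RPCs stops mattering.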
| Comment by Di Wang [ 15/Nov/15 ] |
|
According to the llog structure, the header bit has to match the llog index, so if we do an append write, we can only determine the bit after the appending write completes on the remote OUT, i.e. it is still a special write. And we would need to 1. change llog_osd_write_rec() to treat local and remote llog objects differently, and 2. add special handling to the OSP->OUT write path. Please correct me if I missed something. Thanks.
| Comment by Alex Zhuravlev [ 16/Nov/15 ] |
|
Nothing special, please. DNE2 isn't the only llog user where we observe limitations; instead of introducing special tweaks here and there we should think of a better solution. IMO, this should be a regular index. As for the original issue, the problem isn't in llog but in OSP; consider what ldiskfs would do in this case. I think the model we imply is that if one transaction fails, then none of the subsequent transactions should succeed.
| Comment by Di Wang [ 16/Nov/15 ] |
"nothing special, please. DNE2 isn't the only user for llog where we observe limitations. instead of introducing special tweaks here and there we should think of a better solution. IMO, this should be a regular index." Yes, this bitmap header structure adds a lot constrain to the file format, not very good to update logs. Could you please specify more about this regular index file? the problem isn't in LLOG, but in OSP. consider what ldiskfs would do in this case. I am not so sure about this, even we have OSP lock, not sure how to use it protect the llog. Probably we should use new structure, if you do not want tricks here. I think the model we imply is that if one transaction failed, then none of subsequent transactions should succeed. Yes, that is what http://review.whamcloud.com/#/c/16969/ did. |
| Comment by Alex Zhuravlev [ 17/Nov/15 ] |
|
There is no need to protect the llog; it's OSP that should be taking care of correct ordering - this is needed regardless of the issue in this ticket, I think. The model we've been using so far doesn't support partial execution, where from a set of transactions A, B, C, D, E, F we execute A and B, fail C, then continue and apply D, E, F - this is not possible with ldiskfs/zfs, and I think OSP should mimic this model too. As for a new format, I'm still wondering about using an index to store records plus an additional regular file to store big blobs. For simple records like destroy/attr_set it should be OK to put those in as the index's values. The major benefits are:
| Comment by Di Wang [ 17/Nov/15 ] |
"there is no need to protect llog. it's OSP who should be taking care of correct order - this is needed regardless the issue in the ticket, I think. the model we've been using so far doesn't support partial execution when from a set of transaction A,B,C,D,E,F we can execute A,B, then fail C, then continue and apply D,E,F - this is not possible with ldiskfs/zfs and I think OSP should be mimic this model too."

That is exactly what patch 16969 does.

"as for a new format, I'm still wondering about using an index to store records."

Hmm, do you still use a bitmap? If not, how do you quickly locate the records? If yes, I do not see much difference here. Please correct me if I missed something.
| Comment by Alex Zhuravlev [ 17/Nov/15 ] |
|
What do you mean by "locate the records"? Given it's an index, you can list it or get the value for a given key.
| Comment by Di Wang [ 17/Nov/15 ] |
|
No, sorry, I actually do not mean "locate the records" specifically. I am concerned with how you would build record adding/cancelling from the normal index + read/write API, i.e. how to make the index insert and the record append atomic. Actually, our current bitmap structure is also a simple index file; we just combine it with the blobs in the same file. The reason I do not like it is that it has a strong consistency requirement between header and records, and I do not see how this proposal eliminates that. I probably miss something, please correct me.
| Comment by Alex Zhuravlev [ 17/Nov/15 ] |
|
OK, forget about blobs for a moment. It's an index managed entirely by the OSD underneath. To add a record you call dt_insert(), to cancel a record you call dt_delete(), and you have the API to iterate over items. Are there any conflicts between insert and delete? No. And locality doesn't matter. The problem with the current implementation is that it's managed by our code in a very non-concurrent way.
| Comment by Di Wang [ 17/Nov/15 ] |
|
Hmm, I am not sure what you mean by a "non-concurrent" way. For me, the problem is that the current llog format requires strong consistency (bitmap vs. record) and dependency (the current record vs. the next record), which makes things even more complicated over the wire, i.e. we need something more independent for adding/deleting records. That is why I am curious about your index file proposal; sorry, I still do not understand how it removes the dependency. For example, when adding one record, how do you make sure the index insert is consistent with the blob? By consistency I mean it is atomic and the key in the index file matches the blob location. Note: I mean building this with the normal index insert and write APIs, no special stuff; otherwise I do not see how it is better than Andreas' proposal. Please correct me if I miss something. Thanks.
| Comment by Alex Zhuravlev [ 17/Nov/15 ] |
|
How can it be inconsistent if the insert and the write are in the same transaction? The benefit of an index is that all highly concurrent operations (tree modification) are done locally, so logically there is zero concurrency. One of the problems of the current llog is that we can't modify a single bit (sorry, introducing bit-wise operations doesn't sound exciting to me); with an index this isn't an issue at all. Another problem is gaps: the current llog can't work properly with gaps. Again, with an index plus the blobs in a different file this isn't a problem, as a record itself may contain an offset into the blob file. Of course, I mean the regular indices we already have - nothing special. Probably binary keys (already supported by the API and ldiskfs/zfs) may be useful; need to check all the use cases.
| Comment by Di Wang [ 17/Nov/15 ] |
|
Oh, I do not mean transaction consistency; I mean the consistency between the index and the blob. I assume you need to write first, get the offset (or something similar), then insert the offset into the index file. How do you do that with the current OSP API? Another idea might be putting the llog format below the OSD (like IAM), so that setbit actually becomes an index insert API over OSP.
| Comment by Alex Zhuravlev [ 17/Nov/15 ] |
|
Yes, write at the offset you want (it can be the object's size, as now, or some virtual counter we maintain in memory, it doesn't matter), then insert with a value containing this offset. dt_write() + dt_insert() - should be trivial? Personally, I'm not a big fan of introducing special things like setbit. I think we have everything (or almost everything) in place to implement that.
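The write-at-a-unique-offset plus index-insert flow could be modelled roughly as below. This is a toy in-memory sketch; all structures and helper names are invented, not the real dt_write()/dt_insert() API:

```c
/* Toy in-memory model of "append the record body at a unique offset,
 * then insert key -> {offset, len} into an index", both within one
 * logical transaction.  All names invented; not the real dt API. */
#include <assert.h>
#include <string.h>

#define IDX_SLOTS 16

struct toy_log {
	char   blob[1024];   /* regular file holding record bodies */
	size_t next_off;     /* unique, monotonically growing offset */
	struct {
		int    used;
		long   key;      /* unique record key */
		size_t off, len; /* where the body lives in the blob */
	} idx[IDX_SLOTS];
};

/* "dt_write() + dt_insert()": append the body at a fresh offset,
 * then index it.  Keys and offsets are unique, so cancelled records
 * simply leave gaps that nobody has to repair. */
static int toy_add(struct toy_log *l, long key, const char *body, size_t len)
{
	int i;

	for (i = 0; i < IDX_SLOTS; i++) {
		if (!l->idx[i].used) {
			memcpy(l->blob + l->next_off, body, len);
			l->idx[i].used = 1;
			l->idx[i].key  = key;
			l->idx[i].off  = l->next_off;
			l->idx[i].len  = len;
			l->next_off   += len;
			return 0;
		}
	}
	return -1; /* toy index full */
}

/* "dt_delete()": cancelling is an independent index deletion. */
static void toy_cancel(struct toy_log *l, long key)
{
	int i;

	for (i = 0; i < IDX_SLOTS; i++)
		if (l->idx[i].used && l->idx[i].key == key)
			l->idx[i].used = 0;
}

/* Look a record body up by key; NULL if cancelled or absent. */
static const char *toy_lookup(struct toy_log *l, long key)
{
	int i;

	for (i = 0; i < IDX_SLOTS; i++)
		if (l->idx[i].used && l->idx[i].key == key)
			return l->blob + l->idx[i].off;
	return NULL;
}

/* Cancel a record in the middle; the gap does not affect the rest. */
static int toy_demo(void)
{
	struct toy_log l;

	memset(&l, 0, sizeof(l));
	toy_add(&l, 1, "destroy", 8);
	toy_add(&l, 2, "attr_set", 9);
	toy_add(&l, 3, "create", 7);
	toy_cancel(&l, 2);
	return toy_lookup(&l, 2) == NULL &&
	       strcmp(toy_lookup(&l, 3), "create") == 0;
}
```

Note how the cancel in the middle leaves a hole in both the index and the blob file, and neither lookup nor later adds have to know about it; this is the gap-tolerance property argued for above.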
| Comment by Di Wang [ 17/Nov/15 ] |
Hmm, I do not think maintaining this virtual counter over the wire is trivial; you still need to handle the failure of a previous transaction, i.e. this does not eliminate the dependency between records. Maybe adding "special" stuff to the write RPC, or adding an extra special smart RPC (append + insert), is not a bad idea. Another idea might be to use a normal lock to protect the update RPC the whole time, so even though the content of the object is already inside the RPC, it can still be covered by the lock.
| Comment by Alex Zhuravlev [ 17/Nov/15 ] |
|
No need to handle failure, as the offset is unique and we don't care about gaps. Nothing special, and no extra locks; just a unique offset and a unique key.
| Comment by Di Wang [ 17/Nov/15 ] |
|
Oh, yes, a unique offset; we only need the record size in this case. Sorry for being dumb.
| Comment by Alex Zhuravlev [ 17/Nov/15 ] |
|
no, thank you a lot for spending time on this! |