LU-7426: DNE3: improve llog format for remote update llog

Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.8.0
    • Severity: 3

    Description

      The current llog bitmap header and record structure is not well suited to recording the DNE update llog, especially the remote update llog.

      In DNE, update records are written via top_trans_start()->sub_updates_write()->llog_osd_write_rec()->osp_md_write(), where the write buffer (header bitmap + record) is packed into the RPC buffer; the RPC is then sent at transaction stop.
      To avoid the bitmap being overwritten, these RPCs need to be serialized and sent in a specific order.

      But there are still other problems. llog_osd_write_rec() packs header information (bitmap, lgh_index, offset, etc.) into the RPC buffer, but the RPC is not sent until transaction stop. That information may be invalid by the time the RPC is sent, especially if a previous llog update RPC failed. And since these RPCs are executed on the remote MDT, this will definitely corrupt the update llog.

      There are a few options to fix the problem:

      1. Once any llog update RPC fails, all of the following RPCs in the sending list fail as well; OSP then updates the header, and new RPCs are packed with the updated header. This is the 2.8 approach in patch http://review.whamcloud.com/16969 "LU-7039 llog: update llog header and size"; it is easy to implement but not very nice.

      2. Add a special llog API to the OSP->OUT interface, so that OUT can handle these llog update records independently, instead of relying only on the dumb write API driven by llog_osd_write_rec(). For example, a single llog update inside the RPC could (see the sketch after this list):
      1. append the record;
      2. get the llog index;
      3. update the record with the index;
      4. set the bitmap bit by index.

      3. Use a different format for the update llog.
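
      A hedged sketch of the option 2 handler is below. Everything in it is an assumption made for illustration: OUT_LLOG_APPEND, out_llog_append(), llog_reserve_index(), llog_write_rec_at() and llog_set_bit() are hypothetical names with simplified signatures; only struct llog_rec_hdr and its lrh_index field exist in the current code.

      /*
       * Hypothetical OUT-side handler for a new OUT_LLOG_APPEND update type.
       * All helper names and signatures are invented for this sketch; only
       * struct llog_rec_hdr / lrh_index are real.
       */
      static int out_llog_append(const struct lu_env *env,
                                 struct dt_object *llog_obj,
                                 struct llog_rec_hdr *rec,
                                 struct thandle *th)
      {
              int idx;
              int rc;

              /* steps 1 + 2: reserve the next free slot and learn its index
               * on the server, so the client never guesses it in advance */
              idx = llog_reserve_index(env, llog_obj, th);    /* hypothetical */
              if (idx < 0)
                      return idx;

              /* step 3: stamp the record with the index it actually got */
              rec->lrh_index = idx;
              rc = llog_write_rec_at(env, llog_obj, rec, idx, th); /* hypothetical */
              if (rc != 0)
                      return rc;

              /* step 4: set the bitmap bit for this index in the same
               * transaction, so record and bitmap can never disagree */
              return llog_set_bit(env, llog_obj, idx, th);    /* hypothetical */
      }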

      Any thoughts?

          Activity

            Di Wang added a comment:

            Oh, yes - unique offset. We only need the record size in this case. Sorry for being dumb.


            Alex Zhuravlev added a comment:

            No need to handle failure, as this offset is unique and we don't care about gaps. Nothing special, and no extra locks: just a unique offset and a unique key.

            Di Wang added a comment (edited):

            > Yes, write at the offset you want (can be the object's size as now, or some virtual counter we maintain in memory - doesn't matter), then insert with a value containing this offset. dt_write() + dt_insert() - should be trivial?

            Hmm, I do not think maintaining this virtual counter over the wire is trivial; you still need to handle the failure of a previous transaction, i.e. this does not eliminate the dependency between records. Maybe adding "special" stuff to the write RPC, or adding an extra special smart RPC (append + insert), is not a bad idea.

            Another idea might be to use a normal lock to protect the update RPC the whole time, so even though the content of the object is already inside the RPC, it can still be covered by the lock.


            Alex Zhuravlev added a comment:

            Yes, write at the offset you want (can be the object's size as now, or some virtual counter we maintain in memory - doesn't matter), then insert with a value containing this offset. dt_write() + dt_insert() - should be trivial?

            Personally, I'm not a big fan of introducing special things like setbit... I think we have everything (or almost everything) in place to implement that.
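
            A minimal sketch of that dt_record_write() + dt_insert() pairing, assuming a blob object plus a separate index object. next_unique_offset() is hypothetical, and the dt_* prototypes are abbreviated (the real ones carry extra arguments that vary between Lustre versions):

            /*
             * Sketch of the "unique offset + unique key" scheme. Both updates
             * share one transaction handle, so the index entry and the record
             * blob commit (or disappear) together.
             */
            static int update_log_add(const struct lu_env *env,
                                      struct dt_object *blob,
                                      struct dt_object *index,
                                      __u64 key, const struct lu_buf *buf,
                                      struct thandle *th)
            {
                    /* any unique offset works: gaps are harmless, because
                     * readers always go through the index rather than
                     * scanning the blob file */
                    loff_t off = next_unique_offset(blob);  /* hypothetical:
                                                             * i_size or an
                                                             * in-memory
                                                             * counter */
                    loff_t pos = off;
                    int rc;

                    rc = dt_record_write(env, blob, buf, &pos, th);
                    if (rc != 0)
                            return rc;

                    /* the value records where the blob landed, so a lookup
                     * by key finds the record body */
                    return dt_insert(env, index, (const struct dt_rec *)&off,
                                     (const struct dt_key *)&key, th);
            }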

            Di Wang added a comment:

            Oh, I do not mean transaction consistency. I mean the consistency between the index and the blob. I assume you need to write first, get the offset (or something similar), then insert the offset into the index file. How do you do that with the current OSP API?

            Another idea might be putting the llog format below the OSD (like IAM); then setbit could actually be an index insert API over OSP.

            Alex Zhuravlev added a comment (edited):

            How can it be inconsistent if the insert + write are in the same transaction? The benefit of an index is that all highly concurrent operations (tree modification) are done locally, and logically there is zero concurrency.

            One of the problems with the current llog is that we can't modify a single bit (sorry, introducing bit-wise operations doesn't sound exciting to me). With an index this isn't an issue at all.

            Another problem: gaps. The current llog can't work properly with gaps. Again, with the index and the blob in different files this isn't a problem, as a record itself may contain an offset into the blob.

            Of course, I mean the regular indices we already have - nothing special. Binary keys (already supported by the API and by ldiskfs/zfs) would probably be useful; need to check all the use cases.
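
            One way to picture the index entry described above, with the blob kept in a separate file; the struct name, field names and layout are invented purely for illustration:

            /*
             * Hypothetical value stored in the index, keyed by a (binary)
             * record id. Because the value carries the blob location, gaps
             * in the blob file never matter, and cancelling a record is a
             * plain index delete.
             */
            struct update_log_ent {
                    __u64   ule_blob_offset; /* where the record body starts */
                    __u32   ule_blob_size;   /* length of the record body */
                    __u32   ule_flags;       /* e.g. cancelled; pads to 16 bytes */
            };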

            Di Wang added a comment (edited):

            Hmm, I am not sure what you mean by the "non-concurrent" way. For me, the problem is that the current llog format requires strong consistency (bitmap vs. record) and dependency (the current record vs. the next record), which makes things even more complicated over the wire; i.e. we need something more independent for adding/deleting records. That is why I am curious about your index file proposal. Sorry, I still do not understand how it can remove the dependency.

            For example, when adding one record, how do you make sure the index insert is consistent with the blob? By consistency, I mean it is atomic and the key in the index file matches the blob location. Note: I mean building this with the normal index insert and write APIs, no special stuff. Otherwise I do not see how it is better than Andreas' proposal. Please correct me if I missed something. Thanks.


            Alex Zhuravlev added a comment:

            OK, forget about blobs for a moment. It's an index managed entirely by/behind the OSD. So, to add a record you call dt_insert(), to cancel a record you call dt_delete(), and then you have the API to iterate over the items. Are there any conflicts between insert and delete? No. And locality doesn't matter.

            The problem with the current implementation is that it's managed by our code in a very non-concurrent way.
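
            In terms of the existing dt API, that lifecycle would look roughly like the sketch below; the dt_* and iterator signatures are abbreviated, and the iterator return-value conventions are simplified:

            /*
             * Hedged sketch of the index-only log lifecycle; signatures
             * abbreviated, iterator conventions simplified.
             */
            static int update_log_lifecycle(const struct lu_env *env,
                                            struct dt_object *index,
                                            const struct dt_rec *val,
                                            const struct dt_key *key,
                                            struct thandle *th)
            {
                    const struct dt_it_ops *iops = &index->do_index_ops->dio_it;
                    struct dt_it *it;
                    int rc;

                    /* add a record: a single insert, no bitmap or header */
                    rc = dt_insert(env, index, val, key, th);
                    if (rc != 0)
                            return rc;

                    /* cancel a record: a single delete, no shared state */
                    rc = dt_delete(env, index, key, th);
                    if (rc != 0)
                            return rc;

                    /* scan the surviving records with the iterator API */
                    it = iops->init(env, index, 0);
                    if (IS_ERR(it))
                            return PTR_ERR(it);
                    if (iops->load(env, it, 0) > 0) {
                            do {
                                    /* iops->key()/iops->rec() yield the
                                     * current entry */
                            } while (iops->next(env, it) == 0);
                    }
                    iops->fini(env, it);
                    return 0;
            }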

            Di Wang added a comment:

            No, sorry, I do not actually mean "locate the records" specifically. I am concerned with how you build the adding/cancelling from the normal index + read/write APIs, i.e. how do you make the index insert + record append atomic?

            Actually, our current bitmap structure is also a simple index file; we just combine it with the blob in the same file. The reason I do not like it is that it has a strong consistency requirement between the header and the records. I do not see how this proposal eliminates that. I probably missed something; please correct me.


            Alex Zhuravlev added a comment:

            What do you mean by "locate the records"? Given it's an index, you can list it or get the value for a given key.

            Di Wang added a comment:

            > there is no need to protect llog. it's OSP who should be taking care of correct order - this is needed regardless of the issue in the ticket, I think. the model we've been using so far doesn't support partial execution, where from a set of transactions A, B, C, D, E, F we execute A and B, then fail C, then continue and apply D, E, F - this is not possible with ldiskfs/zfs, and I think OSP should mimic this model too.

            That is exactly what patch 16969 does.

            > as for a new format, I'm still wondering about using an index to store records.

            Hmm, do you still use a bitmap? If not, how do you quickly locate the records? If yes, I do not see much difference here. Please correct me if I missed something.


            People

              Assignee: WC Triage
              Reporter: Di Wang
              Votes: 0
              Watchers: 7
