
LU-7426: DNE3: improve llog format for remote update llog

Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.8.0

    Description

      The current llog bitmap header and record structure is not well suited to recording DNE update llogs, especially remote update llogs.

      In DNE, update records are written via top_trans_start()->sub_updates_write()->llog_osd_write_rec()->osp_md_write(), where the write buffer (header bitmap + record) is packed into the RPC buffer; the RPC itself is only sent during transaction stop.
      To avoid the bitmap being overwritten, these RPCs need to be serialized and sent in a fixed order.

      There are still other problems. llog_osd_write_rec() packs header information (bitmap, lgh_index, offset, etc.) into the RPC buffer, but the RPC is not sent until transaction stop. By the time it is sent, that information may already be stale, especially if a previous llog update RPC has failed. Since these RPCs are executed on a remote MDT, applying a stale header will definitely corrupt the update llog.
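      To make the failure mode concrete, here is a rough sketch of the kind of state that is effectively frozen into the deferred RPC; the struct and field names below are purely illustrative, not actual Lustre symbols:

      #include <stdint.h>

      /* Illustration only: llog header state that gets snapshotted into the
       * update RPC when the record is packed.  By the time the RPC is finally
       * sent at transaction stop, any of these values may no longer match the
       * llog on the remote MDT, for example if an earlier llog update RPC
       * failed. */
      struct llog_write_snapshot {
              uint32_t idx;         /* record index chosen at pack time */
              uint64_t offset;      /* file offset derived from that index */
              uint8_t  bitmap[8];   /* bitmap chunk with the bit already set */
              uint32_t count;       /* record count the header would be set to */
      };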

      There are a few options to fix the problem:

      1. Once any llog update RPC fails, all of the following RPCs in the sending list fail as well; OSP then refreshes the header, and new RPCs are packed with the updated header. This is the 2.8 approach in patch http://review.whamcloud.com/16969 "LU-7039 llog: update llog header and size"; it is easy to implement but not very nice.

      2. Add a special llog API to the OSP->OUT interface, so that OUT can handle these llog update records on its own instead of being driven only through the dumb write API used by llog_osd_write_rec(). For example, a single RPC could carry the following llog update steps (see the sketch after this list):
      1. Append the record.
      2. Get the llog index.
      3. Update the record with the index.
      4. Set the bitmap bit for that index.

      3. Use a different format for the update llog.
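
      As a rough illustration of option 2, the sketch below shows what a dedicated "llog append" handler on the OUT side might look like. All names here (out_llog_append(), the request struct, the two helpers) are hypothetical and do not exist in Lustre today; the point is only that index allocation, record stamping and the bitmap update all happen on the target, inside one local transaction, so the client never has to guess them in advance.

      /* Hypothetical OUT-side handler for a dedicated llog-append update op.
       * Types such as llog_logid, llog_rec_hdr, llog_handle and thandle are
       * real Lustre types; everything else is an illustrative sketch. */

      struct out_llog_append_req {
              struct llog_logid    ola_logid;   /* plain llog to append to */
              __u32                ola_reclen;  /* length of the record body */
              char                 ola_rec[0];  /* record, index not yet set */
      };

      static int out_llog_append(const struct lu_env *env,
                                 struct llog_handle *lgh,
                                 struct out_llog_append_req *req,
                                 struct thandle *th)
      {
              struct llog_rec_hdr *rec = (struct llog_rec_hdr *)req->ola_rec;
              int idx;

              /* steps 1+2: pick the next free index under the local llog lock */
              idx = llog_next_free_index(lgh);               /* hypothetical helper */
              if (idx < 0)
                      return idx;

              /* step 3: stamp the record with the index chosen on the target */
              rec->lrh_index = idx;

              /* step 4: set the bitmap bit and write header + record under the
               * same transaction handle */
              return llog_write_indexed_rec(env, lgh, rec, idx, th); /* hypothetical */
      }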

      Any thoughts?


          Activity

            [LU-7426] DNE3: improve llog format for remote update llog

            gerrit Gerrit Updater added a comment -

            "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57261
            Subject: LU-7426 obdclass: indexed llog
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f35ad4ef85c96432424e1625c50158d82305ae69

            di.wang Di Wang added a comment -

            I just measured the changelog performance with mdtest.
            1. mdtest -i 1 -n 5000 -F -u -a POSIX, 20 clients, 60 threads per client.
            2. Default changelog mask (MARK).
            3. sudo lctl set_param mdd.fszeta0-MDT0000.changelog_deniednext=1200
            4. Lustre 2.15.5

            Without changelog enabled:
            SUMMARY rate (in ops/sec): (of 3 iterations)
            Operation          Max          Min         Mean    Std Dev
            ---------          ---          ---         ----    -------
            File creation   120085.547   109367.150   113125.584   6033.933
            File stat       431459.420   420643.784   425807.279   5424.351
            File read       166476.390   161662.221   164014.216   2408.975
            File removal    102514.932   101913.583   102251.922    307.671
            Tree creation       16.867        9.483       14.099      4.024
            Tree removal         9.211        8.388        8.672      0.467
            -- finished at 08/29/2024 21:53:12 --

            With changelog enabled:
            SUMMARY rate (in ops/sec): (of 1 iterations)
            Operation          Max          Min         Mean    Std Dev
            ---------          ---          ---         ----    -------
            File creation    69770.840    69770.840    69770.840      0.000
            File stat       423102.635   423102.635   423102.635      0.000
            File read       160163.030   160163.030   160163.030      0.000
            File removal     62689.775    62689.775    62689.775      0.000
            Tree creation       20.055       20.055       20.055      0.000
            Tree removal         7.858        7.858        7.858      0.000
            -- finished at 09/04/2024 04:09:01 --

            So open/create drops by around 50%, and unlink drops by around 40%.

            Here is the perf trace during open/create (see attachment).

            It looks like llog_cat_add_rec()->down() is the culprit, so I think Alex's proposal might fix this performance issue.


            bzzz Alex Zhuravlev added a comment -

            no, thank you a lot for spending time on this!
            di.wang Di Wang added a comment -

            Oh, yes, a unique offset. We only need the record size in this case. Sorry for being dumb.


            bzzz Alex Zhuravlev added a comment -

            no need to handle failure, as this offset is unique and we don't care about gaps. nothing special, and no extra locks. just a unique offset and a unique key.
            di.wang Di Wang added a comment - - edited

            yes, write at the offset you want (can be object's size as now or can be some virtual counter we maintain in memory, doesn't matter), then insert with a value containing this offset. dt_write() + dt_insert - should be trivial?

            Hmm, I do not think maintaining this virtual counter over the wire is trivial; you still need to handle the failure of the previous transaction, i.e. this does not eliminate the dependency between records. Maybe adding "special" stuff to the write RPC, or adding an extra special smart RPC (append + insert), is not a bad idea.

            Another idea might be to use a normal lock to protect the update RPC the whole time, so even though the content of the object is already inside the RPC, it is still covered by the lock.


            bzzz Alex Zhuravlev added a comment -

            yes, write at the offset you want (can be object's size as now or can be some virtual counter we maintain in memory, doesn't matter), then insert with a value containing this offset. dt_write() + dt_insert - should be trivial?

            personally, I'm not a big fan of introducing some special things like setbit... I think we have everything (or almost) in place to implement that.
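
            A minimal sketch of the write-then-insert pattern described above, assuming the existing dt_record_write()/dt_insert() helpers (declare phases, credits and error paths are omitted, and exact dt_* signatures vary between Lustre versions). The value struct, the key choice and unique_offset_for() are illustrative assumptions, not existing code:

            /* Sketch: append the record blob at a per-log-unique offset, then
             * insert key -> {offset, length} into a regular index, both in the
             * same transaction.  No bitmap, no dependency on the previous record. */

            struct llog_idx_val {                  /* hypothetical index value */
                    __u64   liv_offset;            /* where the record body starts */
                    __u32   liv_len;               /* record body length */
            };

            static int llog_append_indexed(const struct lu_env *env,
                                           struct dt_object *blob,
                                           struct dt_object *idx,
                                           __u64 recno, const struct lu_buf *rec,
                                           struct thandle *th)
            {
                    struct llog_idx_val val;
                    loff_t pos = unique_offset_for(blob, recno); /* hypothetical:
                                                                  * object size or an
                                                                  * in-memory counter */
                    int rc;

                    val.liv_offset = pos;
                    val.liv_len = rec->lb_len;

                    /* write the record body at an offset that is never reused */
                    rc = dt_record_write(env, blob, rec, &pos, th);
                    if (rc != 0)
                            return rc;

                    /* same transaction: key = record number, value = offset/length */
                    return dt_insert(env, idx, (const struct dt_rec *)&val,
                                     (const struct dt_key *)&recno, th);
            }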
            di.wang Di Wang added a comment -

            Oh, I do not mean transaction consistency. I mean the consistency between the index and the blob. I assume you need to write first, get the offset (or something similar), and then insert the offset into the index file. How do you do that with the current OSP API?

            Another idea might be to put the llog format below the OSD (like IAM), so that setbit can actually become an index insert API over OSP.

            bzzz Alex Zhuravlev added a comment - - edited

            how can it be inconsistent if the insert and the write are in the same transaction? the benefit of an index is that all highly concurrent operations (tree modifications) are done locally, and logically there is zero concurrency.

            one of the problems with the current llog is that we can't modify a single bit (sorry, introducing bit-wise operations doesn't sound exciting to me). with an index this isn't an issue at all.

            another problem is gaps. the current llog can't work properly with gaps. again, with the index and the blob in different files this isn't a problem, as a record itself may contain an offset into the blob.

            of course, I mean the regular indices we already have, nothing special. binary keys (already supported by the API and by ldiskfs/zfs) may be useful; we need to check all the use cases.

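            To illustrate the point about gaps, here is a hedged sketch of record cancellation in such a scheme, using the existing dt_delete() helper; llog_cancel_indexed() and the key layout are assumed names, not existing code:

            /* Sketch only: cancelling a record is just deleting its key from the
             * index.  There is no bitmap bit to clear, no neighbouring records are
             * touched, and the stale bytes in the blob file are simply ignored by
             * readers, which iterate the index rather than the blob. */
            static int llog_cancel_indexed(const struct lu_env *env,
                                           struct dt_object *idx, __u64 recno,
                                           struct thandle *th)
            {
                    return dt_delete(env, idx, (const struct dt_key *)&recno, th);
            }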
            di.wang Di Wang added a comment - - edited

            Hmm, I am not sure what you mean by a "non-concurrent" way. For me, the problem is that the current llog format requires strong consistency (bitmap vs. record) and dependencies (the current record vs. the next record), which makes things even more complicated over the wire, i.e. we need something more independent for adding/deleting records. That is why I am curious about your index file proposal. Sorry, I still do not understand how it removes the dependency.

            For example, when adding one record, how do you make sure the index insert is consistent with the blob? By consistency I mean it is atomic and the key in the index file matches the blob location. Note: I mean building this with the normal index insert and write APIs, no special stuff. Otherwise I do not see how it is better than Andreas' proposal. Please correct me if I missed something. Thanks.


            People

              Assignee: wc-triage WC Triage
              Reporter: di.wang Di Wang
              Votes: 0
              Watchers: 7
