
Rework HSM to use llog cookie for a record modification

Details

    • Improvement
    • Resolution: Unresolved
    • Major

    Description

      llog_write could use the llog cookie to modify a single record. In that case there is no need to process all records before updating the one record. HSM regularly updates a record and cancels it only after a timeout, so many records can exist at the same time, and this introduces a bottleneck for HSM.

      The llog subsystem requires some additional functions to do this. Also, HSM needs to store a whole llog record for a modification, which leads to different in-memory structures for HSM.
      Currently, Lustre operates on and stores hsm_action_item; however, llog_agent_req_rec already includes the hai. So this change is a trade-off between memory and additional processing.
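As a rough illustration of the difference described above, the following C sketch contrasts a scan-based update with a cookie-based one on a toy in-memory log. All names here (`toy_rec`, `toy_log`, `toy_cookie`, `toy_update_by_scan`, `toy_update_by_cookie`) are hypothetical and are not the Lustre llog API; the real llog is an on-disk structure and the actual patch may work quite differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for illustration only, NOT the Lustre structures. */
enum toy_status { TOY_WAITING, TOY_STARTED, TOY_DONE };

struct toy_rec {
    uint64_t id;            /* hypothetical request identifier */
    enum toy_status status; /* the field an HSM-like caller keeps updating */
};

struct toy_log {
    struct toy_rec *recs;   /* records in append order */
    size_t count;           /* number of valid records */
};

/* Hypothetical cookie: remembers where one record lives in the log. */
struct toy_cookie {
    size_t index;
};

/*
 * Scan-based update: walk the log until the id matches.
 * Returns the number of records visited, or 0 if the id is not found.
 */
size_t toy_update_by_scan(struct toy_log *log, uint64_t id,
                          enum toy_status st)
{
    assert(log != NULL && log->recs != NULL);
    for (size_t i = 0; i < log->count; i++) {
        if (log->recs[i].id == id) {
            log->recs[i].status = st;
            return i + 1;
        }
    }
    return 0;
}

/*
 * Cookie-based update: go straight to the remembered slot.
 * Returns 1 (one record touched), or 0 if the cookie is out of range.
 */
size_t toy_update_by_cookie(struct toy_log *log, struct toy_cookie c,
                            enum toy_status st)
{
    assert(log != NULL && log->recs != NULL);
    if (c.index >= log->count)
        return 0;
    log->recs[c.index].status = st;
    return 1;
}
```

In this toy model, updating the last of 1,000,000 records costs 1,000,000 visits with the scan and a single access with the cookie. The price is that the caller has to keep the cookie (and, per the description, the whole record) in memory.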

        Activity

          [LU-18556] Rework HSM to use llog cookie for a record modification

          aboyko Alexander Boyko added a comment:

          We have test results for this patch from the perf cluster; many thanks to Nikitas.

          Test 1

          • Add 1,000,000 archive requests.
          • Start a copytool that processes requests with no operation (get the request and reply), so that copytool processing time is excluded and only Lustre is measured.
          • Measure the time the copytool needs to handle the first 1,000,000 requests.

          Test 2
          Similar to Test 1, but an additional 1,000,000 requests are queued in the MDT immediately after processing starts. The measurement is again the time the copytool needs to handle the first 1,000,000 requests.

                              base    LU-18556 patch
          Test 1 (seconds)    572     187
          Test 2 (seconds)    558     392

          The patch should offer a significant performance benefit compared to current 2.16.
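To put the timings reported above in relative terms, a small C sketch can compute the speedup factor and the share of time saved. The helper names `speedup` and `time_saved_percent` are made up for this illustration; only the four timing values come from the comment.

```c
#include <assert.h>

/* Speedup factor: how many times faster the patched run is. */
double speedup(double base_seconds, double patched_seconds)
{
    assert(patched_seconds > 0.0);
    return base_seconds / patched_seconds;
}

/* Percentage of wall-clock time saved relative to the base run. */
double time_saved_percent(double base_seconds, double patched_seconds)
{
    assert(base_seconds > 0.0);
    return 100.0 * (base_seconds - patched_seconds) / base_seconds;
}
```

With the reported figures, `speedup(572, 187)` is about 3.06 (roughly 67% less time) for Test 1, and `speedup(558, 392)` is about 1.42 (roughly 30% less time) for Test 2.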

          bzzz Alex Zhuravlev added a comment:

          "the idea of updating a llog record through llog processing has much overhead."

          There is no need to go through the llog looking for a specific record if you have something like an index.

          aboyko Alexander Boyko added a comment:

          I'll take a look. But an llog improvement would not help HSM much, because the idea of updating a llog record through llog processing has much overhead. With massive HSM archive, release, etc., it is a disaster. Llog was not designed for such use.
          We attempted to replace the HSM actions queue llog with an index in an effort to address the performance issues, but large-scale testing showed that it performed worse than llog.

          bzzz Alex Zhuravlev added a comment:

          "HSM uses local llog, so it is not related to DNE3 problem."

          That approach solves "local" problems as well: it removes the need for expensive serialization.

          aboyko Alexander Boyko added a comment:

          HSM uses local llog, so it is not related to DNE3 problem. Also, I've pushed an improvement for changelog+mdtest: https://review.whamcloud.com/c/fs/lustre-release/+/56342

          bzzz Alex Zhuravlev added a comment:

          I'd suggest looking at LU-7426.

          gerrit Gerrit Updater added a comment:

          "Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57428
          Subject: LU-18556 hsm: use direct modifying llog record
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 780d31e9b83896d18aa7cda52ec59c2a4a035f66

          People

            aboyko Alexander Boyko
            aboyko Alexander Boyko
            Votes: 0
            Watchers: 5