[LU-5939] Error: trying to overwrite bigger transno

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Lustre 2.8.0
    • Lustre 2.7.0
    • Environment: OpenSFS cluster running lustre-master tag 2.6.90 build #2745 with one MDS/MDT, three OSSs with two OSTs each, and three clients
    • 3
    • 16583

    Description

      I've been running sanity-hsm test 90 several times on this cluster, and nearly every time I run the test, I see the following in dmesg on the MDS:

      Lustre: DEBUG MARKER: == sanity-hsm test 90: Archive/restore a file list == 15:39:24 (1416440364)
      Lustre: HSM agent bb8c2497-7403-4909-0e46-6614668e8ed7 already registered
      LustreError: 26047:0:(mdt_coordinator.c:957:mdt_hsm_cdt_start()) scratch-MDT0000: Coordinator already started
      LustreError: 19956:0:(tgt_lastrcvd.c:806:tgt_last_rcvd_update()) scratch-MDT0000: trying to overwrite bigger transno:on-disk: 25769818612, new: 25769818611 replay: 0. see LU-617.
      LustreError: 19956:0:(tgt_lastrcvd.c:806:tgt_last_rcvd_update()) Skipped 5 previous similar messages
      Lustre: DEBUG MARKER: == sanity-hsm test complete, duration 37 sec == 15:39:50 (1416440390)
      

      From the kernel logs, I see:

      ...
      00000001:00020000:9.0:1416440377.839622:0:19956:0:(tgt_lastrcvd.c:806:tgt_last_rcvd_update()) scratch-MDT0000: trying to overwrite bigger transno:on-disk: 25769818612, new: 25769818611 replay: 0. see LU-617.
      ...
      00000001:00080000:8.0:1416440377.869378:0:30331:0:(tgt_lastrcvd.c:1231:tgt_txn_stop_cb()) More than one transaction 25769818612
      ...
      00000001:00080000:8.0:1416440377.869423:0:30331:0:(tgt_lastrcvd.c:1231:tgt_txn_stop_cb()) More than one transaction 25769818612
      ...
      00000001:00080000:8.0:1416440377.869508:0:30331:0:(tgt_lastrcvd.c:1231:tgt_txn_stop_cb()) More than one transaction 25769818612
      ...
      00000100:00100000:8.0:1416440377.869685:0:30331:0:(service.c:2116:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_002:bb8c2497-7403-4909-0e46-6614668e8ed7+713:21533:x1485210712561904:12345-192.168.2.111@o2ib:57 Request procesed in 30116us (30167us total) trans 25769818612 rc 0/0
      

      Similarly for other transaction numbers:

      00000001:00020000:0.0:1416440378.133498:0:19955:0:(tgt_lastrcvd.c:806:tgt_last_rcvd_update()) scratch-MDT0000: trying to overwrite bigger transno:on-disk: 25769818617, new: 25769818614 replay: 0. see LU-617.
      

      and

      00000001:00020000:1.0F:1416440378.133518:0:31313:0:(tgt_lastrcvd.c:806:tgt_last_rcvd_update()) scratch-MDT0000: trying to overwrite bigger transno:on-disk: 25769818619, new: 25769818618 replay: 0. see LU-617.
      

      Before running sanity-hsm test 90, the copytool was started on the agent, c11.
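
      For context on where the message comes from: the MDT records each client's highest committed transaction number in its slot of the last_rcvd file, and the update refuses to go backwards unless the request is a replay. Below is a minimal userspace model of that check; the struct and function names are simplified stand-ins, not the actual tgt_last_rcvd_update() code.

      #include <stdio.h>
      #include <stdint.h>
      #include <stdbool.h>

      /* Simplified stand-in for a client's slot in the last_rcvd file. */
      struct client_slot {
              uint64_t last_transno;   /* highest transno seen from this client */
      };

      /* Model of the check that prints "trying to overwrite bigger transno". */
      static void last_rcvd_update(struct client_slot *slot, uint64_t transno,
                                   bool replay)
      {
              if (transno < slot->last_transno && !replay) {
                      printf("trying to overwrite bigger transno: on-disk: %llu, new: %llu\n",
                             (unsigned long long)slot->last_transno,
                             (unsigned long long)transno);
                      return;          /* keep the larger on-disk value */
              }
              slot->last_transno = transno;
      }

      int main(void)
      {
              struct client_slot slot = { .last_transno = 0 };

              /* Two transactions belonging to the same client complete out of
               * order, e.g. because one RPC started more than one transaction: */
              last_rcvd_update(&slot, 25769818612ULL, false);
              last_rcvd_update(&slot, 25769818611ULL, false);
              return 0;
      }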

    Activity

            tappro Mikhail Pershin added a comment:

            Yes, I agree, that would be better.

            jay Jinshan Xiong (Inactive) added a comment:

            Hmm, I'm not comfortable with multiple transactions being created by HSM requests, because I'm afraid it may cause problems down the road. OUT can have multiple transactions for one RPC because it was carefully designed for this, but that can't be applied to HSM. I'd prefer an alternative way to fix this problem: limiting HSM requests to one transaction per RPC.
            tappro Mikhail Pershin added a comment (edited):

            Note, the patch above only makes HSM requests replayable; it contains no tests because I found that HSM actions can't recover, for reasons not related to this particular patch. I've created LU-6223 for HSM recovery testing.

            So this particular patch should solve just the problem with the "overwrite bigger transno" message.

            gerrit Gerrit Updater added a comment:

            Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/13684
            Subject: LU-5939 hsm: make HSM modification requests replayable
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 726362d38cb95233acabacb3fe98ed484f0bff4e

            tappro Mikhail Pershin added a comment:

            Thanks for the help. I identified the first two as well, but not MDS_HSM_REQUEST; is it a specific action that updates data on disk, or does it do that upon any request?

            adegremont Aurelien Degremont (Inactive) added a comment:

            On my side, I've identified 3 RPCs which can update data on the server:

            • MDS_HSM_PROGRESS
            • MDS_HSM_STATE_SET
            • MDS_HSM_REQUEST

            Serializing those requests can slow the request ingestion rate, but I think it is acceptable.

            If you're working on a patch, please also add the MUTABOR flag to the MDS_HSM_REQUEST RPC.
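
            As an illustration of that last suggestion: the MUTABOR flag is set in the MDT's request-handler table. A hedged sketch of the change follows; the exact macro and the companion flag are assumptions based on the 2.7-era table layout, not a quote of the committed patch.

            /* Illustrative fragment of the MDT handler table in mdt_handler.c:
             * MUTABOR marks a handler as one that modifies on-disk state, so
             * the RPC is refused on read-only exports and treated as a
             * modification.  The HABEO_CLAVIS companion flag is an assumption
             * here, not taken from the actual patch. */
            TGT_MDT_HDL(HABEO_CLAVIS | MUTABOR, MDS_HSM_REQUEST, mdt_hsm_request),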

            tappro Mikhail Pershin added a comment:

            Yes, I am already working on a patch. As for HSM agent performance, I am trying to serialize only the requests that might update data on the server.

            adilger Andreas Dilger added a comment:

            If the MDC is serializing change requests, is this going to hurt HSM performance, or is there only a single HSM request active on any client/agent node at once?

            Adding the HSM opcodes to the recovery list seems like a simple and low-risk change, and serializing the HSM requests on the MDC is also limited to HSM usage, so it doesn't seem too high risk.
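
            As a sketch of the "recovery list" change: the filter in question is tgt_filter_recovery_requests(), named in the comment further down. Its exact structure in the unified target code may differ from this fragment, which only illustrates the idea of listing the HSM opcodes among requests processed while recovery is in progress.

            switch (lustre_msg_get_opc(req->rq_reqmsg)) {
            /* ... opcodes already accepted during recovery ... */
            case MDS_HSM_PROGRESS:
            case MDS_HSM_STATE_SET:
            case MDS_HSM_REQUEST:
                    /* let (replayed) HSM requests through instead of
                     * filtering them out */
                    *process = 1;
                    RETURN(0);
            }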

            adegremont Aurelien Degremont (Inactive) added a comment:

            I think this was forgotten. If somebody thinks this is a bad idea, say so, but IMO we should replay them, and therefore serialize them.
            tappro Mikhail Pershin added a comment (edited):

            Aurelien, I decided that HSM requests don't rely on Lustre recovery (replays) because HSM requests are filtered out during recovery: check tgt_filter_recovery_requests(), there is no MDS_HSM_* opcode listed. This must be fixed if they are supposed to be replayed, and that also means we should serialize HSM requests from the MDC via mdc_rpc_lock(). This should also solve the 'overwrite bigger transno' issue.
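
            For the client-side half, the serialization would follow the same pattern mdc_reint() uses: take the import's rpc lock around the synchronous send. A hedged sketch is below; the wrapper function name is hypothetical, and the real change would land in the mdc_ioc_hsm_* paths of mdc_request.c.

            /* Hypothetical wrapper sketching serialization of a modifying HSM
             * RPC through the MDC rpc lock, as mdc_reint() does for metadata
             * updates.  Request setup and error handling are omitted. */
            static int mdc_hsm_request_send(struct obd_export *exp,
                                            struct ptlrpc_request *req)
            {
                    struct obd_device *obd = class_exp2obd(exp);
                    int rc;

                    mdc_get_rpc_lock(obd->u.cli.cl_rpc_lock, NULL);
                    rc = ptlrpc_queue_wait(req);
                    mdc_put_rpc_lock(obd->u.cli.cl_rpc_lock, NULL);

                    return rc;
            }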

            adegremont Aurelien Degremont (Inactive) added a comment:

            Sorry, I did not see this issue.

            HSM code does two kinds of disk access:

            • It manages the coordinator llog. This stores all requests and especially their status. Each time a request is added, started, or completed, the corresponding llog records are updated.
            • The file's HSM status is also updated (the LMA EA is updated) when a request is started or finished.
              IIRC, those two modifications are not in the same transaction.

            Nothing in HSM code will resend the HSM request registrations. HSM (wrongly?) relies on the Lustre recovery mechanism. HSM actions (archive this file, restore this one, etc.) are supposed to be kept across MDT restarts.
            I think those requests should be replayed and considered like any other RPC.

            Another thing that could lead to a very big transaction is sending an archive request for a big list of files. The MDT will try to add them (one llog record per file) in one transaction. So far, this file list has been limited by other Lustre limitations (LNET message size, KUC, ...). I don't know if this limitation has changed, but the test still uses only 51 files.

            "2. Don't update last_rcvd and don't assign transaction to such requests if they don't need transaction and/or recovery by replaying them."

            AFAIK, replay and recovery are tightly coupled. I do not know how you can manage replay without recovery. Or is this done by hand? Is there code somewhere in Lustre which does that?
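
            To make the "two kinds of disk access" concrete, here is a tiny userspace model (hypothetical names, not Lustre code) of why one modifying HSM RPC can end up with more than one transaction number, which is what the "More than one transaction" lines in the description show.

            #include <stdio.h>

            /* Model only: each on-disk update runs in its own transaction and
             * consumes its own transno; one HSM request does two such updates
             * (coordinator llog record + file HSM state EA). */
            static unsigned long long next_transno = 25769818610ULL;

            static void commit_in_own_transaction(const char *what)
            {
                    printf("update \"%s\" committed as transaction %llu\n",
                           what, ++next_transno);
            }

            int main(void)
            {
                    /* One HSM request, two separate transactions: */
                    commit_in_own_transaction("coordinator llog record");
                    commit_in_own_transaction("file HSM state EA");
                    /* The reply carries only one of these transnos, while
                     * last_rcvd is updated once per transaction, so updates
                     * can land out of order (see the model in the description). */
                    return 0;
            }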

            People

              Assignee: tappro Mikhail Pershin
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 14
