
Error: trying to overwrite bigger transno

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.7.0
    • Environment: OpenSFS cluster running lustre-master tag 2.6.90 build #2745 with one MDS/MDT, three OSSs with two OSTs each, and three clients.
    • 3
    • 16583

    Description

      I've been running sanity-hsm test 90 several times on this cluster, and nearly every time I run the test I see the following in dmesg on the MDS:

      Lustre: DEBUG MARKER: == sanity-hsm test 90: Archive/restore a file list == 15:39:24 (1416440364)
      Lustre: HSM agent bb8c2497-7403-4909-0e46-6614668e8ed7 already registered
      LustreError: 26047:0:(mdt_coordinator.c:957:mdt_hsm_cdt_start()) scratch-MDT0000: Coordinator already started
      LustreError: 19956:0:(tgt_lastrcvd.c:806:tgt_last_rcvd_update()) scratch-MDT0000: trying to overwrite bigger transno:on-disk: 25769818612, new: 25769818611 replay: 0. see LU-617.
      LustreError: 19956:0:(tgt_lastrcvd.c:806:tgt_last_rcvd_update()) Skipped 5 previous similar messages
      Lustre: DEBUG MARKER: == sanity-hsm test complete, duration 37 sec == 15:39:50 (1416440390)
      

      From the kernel logs, I see:

      ...
      00000001:00020000:9.0:1416440377.839622:0:19956:0:(tgt_lastrcvd.c:806:tgt_last_rcvd_update()) scratch-MDT0000: trying to overwrite bigger transno:on-disk: 25769818612, new: 25769818611 replay: 0. see LU-617.
      ...
      00000001:00080000:8.0:1416440377.869378:0:30331:0:(tgt_lastrcvd.c:1231:tgt_txn_stop_cb()) More than one transaction 25769818612
      ...
      00000001:00080000:8.0:1416440377.869423:0:30331:0:(tgt_lastrcvd.c:1231:tgt_txn_stop_cb()) More than one transaction 25769818612
      ...
      00000001:00080000:8.0:1416440377.869508:0:30331:0:(tgt_lastrcvd.c:1231:tgt_txn_stop_cb()) More than one transaction 25769818612
      ...
      00000100:00100000:8.0:1416440377.869685:0:30331:0:(service.c:2116:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_002:bb8c2497-7403-4909-0e46-6614668e8ed7+713:21533:x1485210712561904:12345-192.168.2.111@o2ib:57 Request procesed in 30116us (30167us total) trans 25769818612 rc 0/0
      

      Similarly for other transaction numbers:

      00000001:00020000:0.0:1416440378.133498:0:19955:0:(tgt_lastrcvd.c:806:tgt_last_rcvd_update()) scratch-MDT0000: trying to overwrite bigger transno:on-disk: 25769818617, new: 25769818614 replay: 0. see LU-617.
      

      and

      00000001:00020000:1.0F:1416440378.133518:0:31313:0:(tgt_lastrcvd.c:806:tgt_last_rcvd_update()) scratch-MDT0000: trying to overwrite bigger transno:on-disk: 25769818619, new: 25769818618 replay: 0. see LU-617.
      

      Before running sanity-hsm test 90, the copytool was started on the agent, c11.
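To make the message concrete, here is a minimal Python sketch (not the Lustre code) of a per-client slot that remembers the last recorded transaction number and reports an attempt to store a smaller one. The class and method names are hypothetical, and keeping the larger on-disk value is an assumption of the sketch; only the two transaction numbers are taken from the first error in the log above.

```python
from dataclasses import dataclass


@dataclass
class ClientSlot:
    """Hypothetical stand-in for one client's last-received record."""

    last_transno: int = 0

    def update(self, new_transno: int) -> bool:
        """Record new_transno; return False if it would move backwards."""
        if new_transno < self.last_transno:
            # Assumption: keep the larger on-disk value and report the conflict.
            print(
                "trying to overwrite bigger transno: "
                f"on-disk: {self.last_transno}, new: {new_transno}"
            )
            return False
        self.last_transno = new_transno
        return True


slot = ClientSlot()
# The larger number reaches the slot first, then the smaller one arrives.
first_ok = slot.update(25769818612)
second_ok = slot.update(25769818611)
```

In this sketch the second call is rejected because the updates reach the slot out of order, which is the situation the error message describes.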

          Activity

            [LU-5939] Error: trying to overwrite bigger transno
            pjones Peter Jones added a comment -

            Landed for 2.8


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13684/
            Subject: LU-5939 hsm: make HSM modification requests replayable
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 9eda825b1b449baaf2676cc80ccae79d4297cf2d

            tappro Mikhail Pershin added a comment -

            I've got a trace for the last occurrence of 'Multiple transactions'. It is not about llog; the request is HSM_PROGRESS, and the MDT makes several disk changes:

            • mdt_hsm_attr_set()
            • another mdt_hsm_attr_set()
            • mo_swap_layouts()

            This is the hsm_archive operation from sanity-hsm.sh. Each call to the lower level (MDD) causes a separate transaction. I am not sure how to solve that right now. Ideally the transaction should start in the MDT, but the mo_... interface cannot pass transaction details to the MDD. I think we might allow multiple transactions for this case and for restore, with additional checks. This is to be continued in LU-6223.
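The trace above can be pictured with a small Python sketch in which one request runs three lower-level steps, each in its own transaction with its own number. The step names come from the comment; the function, the counter, and the starting value are hypothetical and only illustrate why a single RPC ends up associated with more than one transaction.

```python
import itertools

# Lower-level calls named in the trace; each one opens its own transaction.
HSM_PROGRESS_STEPS = ("mdt_hsm_attr_set", "mdt_hsm_attr_set", "mo_swap_layouts")


def handle_request(steps, transno_source):
    """Run each step in a separate transaction and return the numbers used."""
    used = []
    for step in steps:
        transno = next(transno_source)  # one new number per transaction
        used.append((step, transno))
    return used


counter = itertools.count(100)  # arbitrary starting number for illustration
used = handle_request(HSM_PROGRESS_STEPS, counter)
more_than_one = len(used) > 1
```

Starting the transaction once at the request level, as the comment suggests would be ideal, would leave the request with a single number instead of three.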

            jay Jinshan Xiong (Inactive) added a comment -

            Hi Mike, you have the expertise on recovery, so if you think it's better to go for multiple transactions, I'm good. Sorry for the noise.

            tappro Mikhail Pershin added a comment -

            I am not sure it is about llog records only. llog_cat_add() causes a local transaction, which produces no transaction number, so there must be another update, maybe to the file's attributes or something like that. I can give more details about the HSM request type and the operations behind the multiple transnos later today. Meanwhile, llog_cat_add() should be replaced with llog_add() in any case.

            As for putting everything into a single transaction, we still have another way to go: use the same mechanism that OUT uses to control a batch of updates. This will cause a compatibility problem, but maybe it is not so difficult to solve. I mean we shouldn't rule this option out completely and should review it too. This belongs in the context of LU-6223, though.

            adegremont Aurelien Degremont (Inactive) added a comment -

            The problem is that we must declare the number of credits we need for the transaction in advance, so we also need to update the credit declaration.

            jay Jinshan Xiong (Inactive) added a comment -

            On second thought, we don't even need to add a parameter to llog_cat_add(). We just need to call the llog_add() series of interfaces instead, just as we do for changelog.
            jay Jinshan Xiong (Inactive) added a comment - edited

            Exactly. llog_cat_add() can be revised to carry a transaction handle parameter, so we can start a transaction in mdt_hsm_add_actions() and use it for all later llog operations.

            The only concern is the size of the transaction. I remember that there is a limit on it, but I'm not an OSD expert. If that is the case, we also need to take log file creation into account for the transaction size.

            adegremont Aurelien Degremont (Inactive) added a comment -

            IIRC, HSM_REQUEST stores a list of requests to be done in an llog. One RPC can request the same action (archive, restore, ...) for a list of files. One llog record is added for each file (with the same compound_id, so the request can be rebuilt later).

            Records are added using llog_cat_add(). If we want only one transaction, we need a special version that can add several records in one call, and mdt_hsm_add_actions() must be updated accordingly.
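The single-transaction idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the llog API: the Transaction class, the add_actions() function, the record layout, and the rule of one credit per record are all hypothetical. It shows one record per file, every record sharing the same compound_id, and all of them written inside one transaction whose credits are declared before it starts.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    """Hypothetical transaction: credits are declared before it starts."""

    credits: int
    started: bool = False
    writes: list = field(default_factory=list)

    def start(self):
        self.started = True

    def write(self, record):
        if not self.started:
            raise RuntimeError("transaction not started")
        if len(self.writes) >= self.credits:
            raise RuntimeError("not enough credits declared")
        self.writes.append(record)


def add_actions(action, fids, compound_id):
    """Add one record per file, all inside a single transaction."""
    # Assumption: one credit per record, declared up front.
    txn = Transaction(credits=len(fids))
    txn.start()
    for fid in fids:
        txn.write({"action": action, "fid": fid, "compound_id": compound_id})
    return txn


txn = add_actions("ARCHIVE", ["fid-1", "fid-2", "fid-3"], compound_id=7)
```

Because the credits are fixed before the first write, anything else the transaction has to do would need to be counted in the same declaration.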

            tappro Mikhail Pershin added a comment -

            Yes, I agree, that would be better

            People

              Assignee: tappro Mikhail Pershin
              Reporter: jamesanunez James Nunez (Inactive)