
HSM: restore doesn't restart for new copytool

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.5.1
    • Centos 6.5, Lustre 2.5.56
    • 3
    • 14550

    Description

      When a copytool fails to complete a restore, the restore is not restarted, and the processes waiting on it get stuck in Lustre modules.

      For instance, create a large file that takes a few seconds to archive/restore, then release it, and access it:

      # dd if=/dev/urandom of=/mnt/lustre/bigf bs=1M count=1000
      # lfs hsm_archive /mnt/lustre/bigf
      # lfs hsm_release  /mnt/lustre/bigf
      # sleep 5
      # md5sum /mnt/lustre/bigf
      

      During the restoration, kill the copytool, so no complete event is sent to the MDS.

      Note that at this point, it is possible the copytool is unkillable, and the only fix is to reboot the client running that copytool.
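
      The reproduction steps above can be sketched as a small script. This is only a sketch under stated assumptions: the `run`, `run_bg`, and `reproduce` helpers and the `DRY_RUN` variable are hypothetical names, not part of Lustre; the copytool PID is supplied by the caller; SIGKILL is an assumed choice so the copytool cannot exit cleanly (the report only says to kill it); and the one-second delay before the kill is a guess at "during the restoration".

```shell
#!/bin/sh
# Hypothetical helper: with DRY_RUN set, print the command instead of running it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: same as run, but start the command in the background.
run_bg() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$* &"
    else
        "$@" &
    fi
}

# reproduce FILE COPYTOOL_PID
# Follows the command sequence from the report, then kills the copytool
# while the implicit restore triggered by md5sum is presumably in flight.
reproduce() {
    repro_file=$1
    repro_ct_pid=$2

    run dd if=/dev/urandom of="$repro_file" bs=1M count=1000
    run lfs hsm_archive "$repro_file"
    run lfs hsm_release "$repro_file"
    run sleep 5
    # Reading a released file triggers the restore; this reader is the
    # process that ends up stuck.
    run_bg md5sum "$repro_file"
    # Assumption: one second is enough for the restore to have started.
    run sleep 1
    # Assumption: SIGKILL, so no completion event reaches the MDS.
    run kill -KILL "$repro_ct_pid"
}
```

      With `DRY_RUN=1 reproduce /mnt/lustre/bigf 4242` the function only prints the commands, which makes it possible to review the sequence before running it on a real client.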

      When the copytool restarts, nothing happens. The process trying to read the file stays stuck (apparently forever) at this point:

      # cat /proc/1675/stack 
      [<ffffffffa09dc04c>] ll_layout_refresh+0x25c/0xfe0 [lustre]
      [<ffffffffa0a28240>] vvp_io_init+0x340/0x490 [lustre]
      [<ffffffffa04dab68>] cl_io_init0+0x98/0x160 [obdclass]
      [<ffffffffa04dd794>] cl_io_init+0x64/0xe0 [obdclass]
      [<ffffffffa04debfd>] cl_io_rw_init+0x8d/0x200 [obdclass]
      [<ffffffffa09cbe38>] ll_file_io_generic+0x208/0x710 [lustre]
      [<ffffffffa09ccf8f>] ll_file_aio_read+0x13f/0x2c0 [lustre]
      [<ffffffffa09cd27c>] ll_file_read+0x16c/0x2a0 [lustre]
      [<ffffffff81189365>] vfs_read+0xb5/0x1a0
      [<ffffffff811894a1>] sys_read+0x51/0x90
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      It is also unkillable.
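
      To spot other processes blocked at the same point, the stack check above can be scripted. The sketch below assumes the `/proc/<pid>/stack` line format shown in the trace; `stack_has_frame` and `pids_in_frame` are hypothetical helper names, and reading `/proc/<pid>/stack` normally requires root.

```shell
#!/bin/sh
# stack_has_frame SYMBOL
# Reads a /proc/<pid>/stack style dump on stdin and succeeds if any frame
# is inside SYMBOL, e.g. "[<...>] ll_layout_refresh+0x25c/0xfe0 [lustre]".
stack_has_frame() {
    grep -q -E "\] $1\+0x"
}

# pids_in_frame SYMBOL
# Prints the PID of every process whose kernel stack contains SYMBOL.
# Unreadable or vanished stack files are skipped silently.
pids_in_frame() {
    for stack_file in /proc/[0-9]*/stack; do
        [ -r "$stack_file" ] || continue
        if cat "$stack_file" 2>/dev/null | stack_has_frame "$1"; then
            stack_pid=${stack_file#/proc/}
            echo "${stack_pid%/stack}"
        fi
    done
}
```

      For example, `pids_in_frame ll_layout_refresh` would list readers waiting on the layout refresh, while `stack_has_frame ll_layout_refresh < /proc/1675/stack` checks a single process.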

      If I issue "lfs hsm_restore bigf.bin", lfs gets stuck too:

      # cat /proc/1723/stack 
      [<ffffffffa06ad29a>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc]
      [<ffffffffa06ad8a7>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
      [<ffffffffa0868913>] mdc_iocontrol+0x2113/0x27f0 [mdc]
      [<ffffffffa0af5265>] obd_iocontrol+0xe5/0x360 [lmv]
      [<ffffffffa0b0c145>] lmv_iocontrol+0x1c85/0x2b10 [lmv]
      [<ffffffffa09bb235>] obd_iocontrol+0xe5/0x360 [lustre]
      [<ffffffffa09c64d7>] ll_dir_ioctl+0x4237/0x5dc0 [lustre]
      [<ffffffff8119d802>] vfs_ioctl+0x22/0xa0
      [<ffffffff8119d9a4>] do_vfs_ioctl+0x84/0x580
      [<ffffffff8119df21>] sys_ioctl+0x81/0xa0
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      However, lfs hsm operations still work on other files.

          Activity

            [LU-5216] HSM: restore doesn't restart for new copytool

            bevans Ben Evans (Inactive) added a comment -

            Sergey, I've got a patch worked up for the second part of the patch, but it depends on the first. As soon as the first lands, I'll push the second one.

            scherementsev Sergey Cheremencev added a comment -

            Could someone clarify the status of the second part of the patch? Are there plans to land it?

            Right now the 1st part of the patch (https://review.whamcloud.com/#/c/31105/) doesn't solve the issue of killing the copytool on the agent. The original sanity-hsm_62 test that demonstrated the problem was changed starting with patchset 3, and that is why it passes; before patchset 3 it always failed.

            If you plan to land the 2nd part right after the 1st, I can prepare and send that patch. I suggest adding two sanity-hsm_62 tests for the 2nd part: "a - Evicting a client should cancel its requests" and "b - Stopping a copytool should cancel its requests".

            Please let me know your position.

            Thanks, Sergey
            spitzcor Cory Spitz added a comment -

            We discussed this at the 3/8 LWG. JB, we'll try to help you land your patch and work out any issues with John H. Thanks.

            pjones Peter Jones added a comment -

            spitzcor why don't we discuss at the upcoming LWG meeting?

            spitzcor Cory Spitz added a comment - - edited

            pjones, I'd like to get this on the 2.11.0 radar. May we set the Fix Version? bevans & sergey from Cray will follow with more to better kick-start the conversation.


            riauxjb Jean-Baptiste Riaux (Inactive) added a comment -

            New patch (https://review.whamcloud.com/#/c/31105/) available, including only the first part of the previous patch (https://review.whamcloud.com/24238/, which landed and was then reverted by https://review.whamcloud.com/30615/).

            This patch keeps only the first fix:
            Unexpected client (data mover node) eviction could
            cause ongoing HSM requests to be stuck in "STARTED"
            state as the copy tool running on the data mover node
            is not available anymore and requests could not be
            finished. This patch unregisters the copy tool and
            cancels all the requests on the copytool's agent.

            gerrit Gerrit Updater added a comment -

            Jean-Baptiste Riaux (riaux.jb@intel.com) uploaded a new patch: https://review.whamcloud.com/31105
            Subject: LU-5216 hsm: cancel hsm actions running on CT when killed
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c3d17bd9e25d5d27ba55835801798e2fdea36c49

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30615/
            Subject: Revert "LU-5216 hsm: cancel hsm actions running on CT when killed"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: be4507fb45074ad24208c494f98a00da90b13665

            gerrit Gerrit Updater added a comment -

            John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/30615
            Subject: Revert "LU-5216 hsm: cancel hsm actions running on CT when killed"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a57dcdbca18f9721169f06c312519c619856a656
            pjones Peter Jones added a comment -

            Landed for 2.11


            People

              riauxjb Jean-Baptiste Riaux (Inactive)
              fzago Frank Zago (Inactive)
              Votes: 3
              Watchers: 31
