  Lustre / LU-5216

HSM: restore doesn't restart for new copytool

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.1
    • Environment: CentOS 6.5, Lustre 2.5.56
    • Severity: 3
    • 14550

    Description

      When a copytool does not complete a restore properly, the restore is not restarted, and processes get stuck in Lustre modules.

      For instance, create a large file that takes a few seconds to archive/restore, then release it, and access it:

      # dd if=/dev/urandom of=/mnt/lustre/bigf bs=1M count=1000
      # lfs hsm_archive /mnt/lustre/bigf
      # lfs hsm_release /mnt/lustre/bigf
      # sleep 5
      # md5sum /mnt/lustre/bigf
      

      During the restoration, kill the copytool, so no complete event is sent to the MDS.
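      For reference, one way to drive this with the POSIX copytool (the archive directory and archive index below are only illustrative), with bigf already archived and released as above:

      # lhsmtool_posix --daemon --hsm-root /tmp/archive --archive 1 /mnt/lustre
      # md5sum /mnt/lustre/bigf &
      # pkill -9 lhsmtool_posix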

      Note that at this point, it is possible the copytool is unkillable, and the only fix is to reboot the client running that copytool.

      When the copytool restarts, nothing happens. The process trying to read the file is stuck (apparently forever) here:

      # cat /proc/1675/stack 
      [<ffffffffa09dc04c>] ll_layout_refresh+0x25c/0xfe0 [lustre]
      [<ffffffffa0a28240>] vvp_io_init+0x340/0x490 [lustre]
      [<ffffffffa04dab68>] cl_io_init0+0x98/0x160 [obdclass]
      [<ffffffffa04dd794>] cl_io_init+0x64/0xe0 [obdclass]
      [<ffffffffa04debfd>] cl_io_rw_init+0x8d/0x200 [obdclass]
      [<ffffffffa09cbe38>] ll_file_io_generic+0x208/0x710 [lustre]
      [<ffffffffa09ccf8f>] ll_file_aio_read+0x13f/0x2c0 [lustre]
      [<ffffffffa09cd27c>] ll_file_read+0x16c/0x2a0 [lustre]
      [<ffffffff81189365>] vfs_read+0xb5/0x1a0
      [<ffffffff811894a1>] sys_read+0x51/0x90
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      It is also unkillable.
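      On the MDS the coordinator still lists the request; a quick way to check (the parameter names below assume a single MDT named lustre-MDT0000):

      # lctl get_param mdt.lustre-MDT0000.hsm.actions
      # lctl get_param mdt.lustre-MDT0000.hsm.agents

      The stuck restore stays in the actions list with status STARTED, and until the client is evicted the dead copytool typically still appears in the agents list.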

      If I issue "lfs hsm_restore bigf.bin", lfs gets stuck too:

      # cat /proc/1723/stack 
      [<ffffffffa06ad29a>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc]
      [<ffffffffa06ad8a7>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
      [<ffffffffa0868913>] mdc_iocontrol+0x2113/0x27f0 [mdc]
      [<ffffffffa0af5265>] obd_iocontrol+0xe5/0x360 [lmv]
      [<ffffffffa0b0c145>] lmv_iocontrol+0x1c85/0x2b10 [lmv]
      [<ffffffffa09bb235>] obd_iocontrol+0xe5/0x360 [lustre]
      [<ffffffffa09c64d7>] ll_dir_ioctl+0x4237/0x5dc0 [lustre]
      [<ffffffff8119d802>] vfs_ioctl+0x22/0xa0
      [<ffffffff8119d9a4>] do_vfs_ioctl+0x84/0x580
      [<ffffffff8119df21>] sys_ioctl+0x81/0xa0
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      However, lfs hsm operations still work on other files.
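      A quick way to confirm that, assuming the same mount point:

      # dd if=/dev/zero of=/mnt/lustre/otherf bs=1M count=10
      # lfs hsm_archive /mnt/lustre/otherf
      # lfs hsm_state /mnt/lustre/otherf

      The archive of otherf completes normally with the restarted copytool, while bigf stays blocked.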

          Activity


            gerrit Gerrit Updater added a comment -

            Vinayak (vinayakswami.hariharmath@seagate.com) uploaded a new patch: https://review.whamcloud.com/24238
            Subject: LU-5216 hsm: cancel hsm actions running on CT when killed
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c2c9531d7005b63b1c8afd64c2d5ed22c404526e


            vinayakh Vinayak (Inactive) added a comment -

            Hello Frank,

            >> 1. How can your fix be tested if the restore action completes all the time because Lustre refuses to let the copytool die?
            In the case of a normal kill (machine still up), we need to wait for the HSM actions to complete; otherwise the process waiting on them gets stuck. For example, md5sum on an archived and released file triggers a restore action.

            >> 2. For instance that prevents a machine from being shut down as long as a restore is in progress.
            A forced shutdown causes a client eviction; the actions running on the CT on that node are then cancelled and, in the case of an hsm restore, the layout lock is released.

            I agree that resetting the actions is better than cancelling them in the case of client eviction (2nd case), but it requires a lot of work, mainly for restore:
            1. If the restore is halfway finished when the CT gets killed, IMO we cannot simply put the request back into the queue for another CT to pick up. A restore results in layout changes, so the half-restored file would first have to be cleaned from the Lustre storage before the request can be re-queued.

            2. If there is no CT running on any client, the request will stay in the queue until a new CT with the same archive ID comes up.

            3. I am not sure about this case, but here are my thoughts: if a new CT comes online with a different archive ID, the hsm request for the same file goes into the WAITING state, since there is already an action in the list.

            Please put your thoughts forward and I will improve the patch.


            fzago Frank Zago (Inactive) added a comment -

            Hi Vinayak,

            How can your fix be tested if the restore action completes all the time because Lustre refuses to let the copytool die? I think Lustre preventing a process from being killed is a bad thing, especially when kill -KILL is used. For instance, it prevents a machine from being shut down as long as a restore is in progress. This is not introduced by this fix, though.


            vinayakh Vinayak (Inactive) added a comment -

            Hello Frank,

            Any updates/feedback from you?

            Thanks,

            vinayakh Vinayak (Inactive) added a comment - edited

            Hello Frank,

            I have considered 2 cases here

            1. Kill the copy tool (kill without -9)

            In this case (as this issue explains), if the copytool is killed during any hsm activity, the copytool un-registration waits for the ongoing hsm activity to finish before it continues.
            Ex: md5sum on an hsm_released file triggers an hsm_restore of the file. If we kill the copytool during this hsm_restore, the restore request gets stuck in state=STARTED and in turn md5sum keeps waiting for the hsm_restore to complete until the request timeout. With the patch I have submitted, un-registration waits for this activity to finish and md5sum returns cleanly.

            2. The data mover node (client) is powered down - no need to wait for the request to complete

            In this case the data mover node itself is down and no copytool is available to continue the activity, so I chose to un-register the copytool and cancel all the requests running on that CT, since the process waiting on the hsm action is also killed when the node goes down. Other copytools that want to perform hsm activity on the same file can then proceed successfully.
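            A rough way to exercise this second case (the node roles and MDT name below are illustrative) is to crash the data mover node while a restore is in flight and then watch the coordinator on the MDS:

                client# echo 1 > /proc/sys/kernel/sysrq
                client# echo c > /proc/sysrq-trigger
                mds# lctl get_param mdt.lustre-MDT0000.hsm.agents
                mds# lctl get_param mdt.lustre-MDT0000.hsm.actions

            Once the MDS evicts that client, the cancel-on-eviction behaviour described above should be observable with the patch applied.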

            Answers to your questions:
            >> Why cancel the requests? Can't they be just reset?
            1. Does reset mean adding the request back to the queue and processing it when a new copytool with the same archive id comes alive? That would still leave the process waiting for the hsm action to complete.

            2. Point #1 explains this. I am not resetting the requests but waiting for them to complete, which also lets the process waiting on the hsm activity return. The order in the patch might be confusing. Earlier I tried to hand the running requests over to another copytool with the same archive id when the current copytool gets killed, but I stopped there as it is against the present hsm design.

            I would like to bring up one more point here. HSMA_CANCEL is not yet implemented in the copytool. Even if we send HSMA_CANCEL to the copytool, it does nothing.

                    case HSMA_CANCEL:
                            CT_TRACE("cancel not implemented for file system '%s'",
                                     opt.o_mnt);
                            /* Don't report progress to coordinator for this cookie:
                             * the copy function will get ECANCELED when reporting
                             * progress. */
                            err_minor++;
                            return 0;
                            break;
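
            For reference, a cancel can still be requested from the command line; with the stock POSIX copytool the handler above only logs it, and the copy function finds out when its next progress report returns ECANCELED:

                # lfs hsm_cancel /mnt/lustre/bigf
                # lfs hsm_action /mnt/lustre/bigf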
            
            

            In many cases, a CT killed during an hsm_restore poses a layout-change problem: if the file is halfway restored when the CT is killed, we cannot really reset the request. The design is quite copytool-centred, and the sequence of actions is controlled by the copytool.

            Please correct me if I am wrong.


            fzago Frank Zago (Inactive) added a comment -

            Why cancel the requests? Can't they be just reset? IMO if a copytool fails, we should not cancel those requests, because that would force a user to restart the archiving requests, and the failed restore requests would create issues for the applications wanting to access the files.

            The archive requests can be reset in the queue, and restarted later when a new copytool shows up. As for the restore requests, they have a timeout that should be respected, and they shouldn't fail just because the copytool was terminated.
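            For context, the timeout mentioned above is most likely the coordinator's active request timeout, a tunable on the MDS (parameter name as in 2.5-era releases; the MDT name is illustrative):

                # lctl get_param mdt.lustre-MDT0000.hsm.active_request_timeout
                # lctl set_param mdt.lustre-MDT0000.hsm.active_request_timeout=900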


            fzago Frank Zago (Inactive) added a comment -

            Thanks Vinayak. I'll try to test it soon.


            vinayakh Vinayak (Inactive) added a comment -

            I have submitted a patch to address this issue (also covering the client eviction case). Please review and give your feedback so we can cover all possible scenarios.

            Patch can be tracked at : http://review.whamcloud.com/#/c/19369/


            gaurav_mahajan gaurav mahajan (Inactive) added a comment -

            Okay Frank, I got your point.
            I was thinking about the hsm storage. Even if the copytool starts again on the same node, there is no guarantee that the same hsm storage is used. I agree that storing the uuid of the CT does not make sense.

            Apart from this, do you have any suggestions, or should I proceed with this solution?


            fzago Frank Zago (Inactive) added a comment -

            But the copytool may not restart on the same node, ever. So why keep the uuid of that copytool around?


            gaurav_mahajan gaurav mahajan (Inactive) added a comment -

            Yes Frank.
            >> That won't work if the copytool on node A dies, and a new copytool is started on node B.
            That is why I want to store the md5sum of the client name in car->car_uuid (the place to store the uuid of the copytool; once the copytool dies that uuid no longer has any meaning, so we can use it as a placeholder for the client name). This helps identify the right node on which to re-queue the requests when a new copytool comes alive on the same node.

            >> If the MDS is aware that a copytool dies, it should just reset the requests (or not, depending on progress already made?), and give them to the next copytool that requests them like any new request

            Yes. This part is also handled. Old requests are added to the tail of the car_request_list with the uuid of the next copytool placed in car->car_uuid (the next copytool on the same node treats the old requests as new).

            >> The biggest code change would be to detect when a copytool is no longer reachable/functional.
            Currently a few of the signals caught by the copytool cause the coordinator to un-register the copytool. This needs more investigation, but I think the logic above works.


            People

              Assignee: riauxjb Jean-Baptiste Riaux (Inactive)
              Reporter: fzago Frank Zago (Inactive)
              Votes: 3
              Watchers: 31
