
HSM: restore doesn't restart for new copytool

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.1
    • Environment: CentOS 6.5, Lustre 2.5.56
    • Severity: 3
    • 14550

    Description

      When a copytool does not complete a restore properly, the restore is not restarted, and processes get stuck in Lustre modules.

      For instance, create a large file that takes a few seconds to archive/restore, then release it, and access it:

      # dd if=/dev/urandom of=/mnt/lustre/bigf bs=1M count=1000
      # lfs hsm_archive /mnt/lustre/bigf
      # lfs hsm_release /mnt/lustre/bigf
      # sleep 5
      # md5sum /mnt/lustre/bigf
      

      During the restoration, kill the copytool, so no complete event is sent to the MDS.

      Note that at this point, it is possible the copytool is unkillable, and the only fix is to reboot the client running that copytool.

      When the copytool restarts, nothing happens. The process trying to read the file is stuck (apparently forever) here:

      # cat /proc/1675/stack 
      [<ffffffffa09dc04c>] ll_layout_refresh+0x25c/0xfe0 [lustre]
      [<ffffffffa0a28240>] vvp_io_init+0x340/0x490 [lustre]
      [<ffffffffa04dab68>] cl_io_init0+0x98/0x160 [obdclass]
      [<ffffffffa04dd794>] cl_io_init+0x64/0xe0 [obdclass]
      [<ffffffffa04debfd>] cl_io_rw_init+0x8d/0x200 [obdclass]
      [<ffffffffa09cbe38>] ll_file_io_generic+0x208/0x710 [lustre]
      [<ffffffffa09ccf8f>] ll_file_aio_read+0x13f/0x2c0 [lustre]
      [<ffffffffa09cd27c>] ll_file_read+0x16c/0x2a0 [lustre]
      [<ffffffff81189365>] vfs_read+0xb5/0x1a0
      [<ffffffff811894a1>] sys_read+0x51/0x90
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      It is also unkillable.

      If I issue "lfs hsm_restore bigf.bin", lfs gets stuck too:

      # cat /proc/1723/stack 
      [<ffffffffa06ad29a>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc]
      [<ffffffffa06ad8a7>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
      [<ffffffffa0868913>] mdc_iocontrol+0x2113/0x27f0 [mdc]
      [<ffffffffa0af5265>] obd_iocontrol+0xe5/0x360 [lmv]
      [<ffffffffa0b0c145>] lmv_iocontrol+0x1c85/0x2b10 [lmv]
      [<ffffffffa09bb235>] obd_iocontrol+0xe5/0x360 [lustre]
      [<ffffffffa09c64d7>] ll_dir_ioctl+0x4237/0x5dc0 [lustre]
      [<ffffffff8119d802>] vfs_ioctl+0x22/0xa0
      [<ffffffff8119d9a4>] do_vfs_ioctl+0x84/0x580
      [<ffffffff8119df21>] sys_ioctl+0x81/0xa0
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      However, lfs hsm operations still work on other files.
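      To confirm on the MDS side that the request itself is stuck, the coordinator state can be dumped. A minimal sketch, assuming a single MDT and Lustre 2.5-style parameter names (paths may differ between versions):

      # On the MDS: dump the coordinator request queue; the RESTORE request
      # for the released file should appear and stay in the STARTED state.
      lctl get_param mdt.*.hsm.actions

      # On the MDS: list the registered copytool agents.
      lctl get_param mdt.*.hsm.agents

      # On the MDS: show the timeout (in seconds) after which the coordinator
      # gives up on an active request.
      lctl get_param mdt.*.hsm.active_request_timeout

      # On a client: show the current HSM action recorded for the file.
      lfs hsm_action /mnt/lustre/bigf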


          Activity

            [LU-5216] HSM: restore doesn't restart for new copytool

            Hi Vinayak,

            How can your fix be tested if the restore action completes all the time because Lustre refuses to let the copytool die? I think Lustre preventing a process from being killed is a bad thing, especially when kill -KILL is used. For instance, that prevents a machine from being shut down as long as a restore is in progress. This is not introduced by this fix, though.

            fzago Frank Zago (Inactive) added a comment

            Hello Frank,

            Any updates/feedback from you?

            Thanks,

            vinayakh Vinayak (Inactive) added a comment

            Hello Frank,

            I have considered two cases here:

            1. The copytool is killed (kill without -9)

            In this case (as this issue explains), if the copytool is killed during any HSM activity, the copytool unregistration waits for the ongoing HSM activity to finish, and only then does the unregistration continue.
            For example, md5sum on an hsm_released file triggers an hsm_restore of the file. If we kill the copytool during this hsm_restore, the restore request gets stuck in state=STARTED, and in turn md5sum keeps waiting for hsm_restore to complete until the request timeout. The patch I have submitted waits for this activity to finish, so md5sum returns smoothly.

            2. The data mover node (client) is powered down - no need to wait for the request to complete

            In this case the data mover node itself is down and no copytool is available at the moment to continue the activity. So I chose to unregister the copytool and cancel all the requests running on that CT, since the process waiting on the HSM action is also killed when the node is powered down. That way, other copytools that want to do any HSM activity on the same file can process it successfully (see the commands sketched below).
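            (Not part of the patch, just a testing aid: a rough way to tell the two cases apart from the MDS, assuming a single MDT.)

                # Case 1: the copytool was killed with a plain kill. Check whether
                # the agent is still registered while its in-flight request finishes.
                lctl get_param mdt.*.hsm.agents

                # Case 2: the whole data mover node is down; without intervention the
                # request can sit in STARTED until the active request timeout expires.
                lctl get_param mdt.*.hsm.actions
                lctl get_param mdt.*.hsm.active_request_timeout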

            Answers to your questions:
            >> Why cancel the requests? Can't they be just reset?
            1. Does "reset" mean adding the request back to the queue and processing it when a new copytool with the same archive id comes alive? That would still leave the process waiting for the HSM action to complete.

            2. Point #1 explains this. I am not resetting the requests but waiting for them to complete, which also lets the process waiting on the HSM activity return. The order in the patch might be confusing. Earlier I was trying to hand the running requests over to any other copytool with the same archive id in case the current copytool gets killed, but I stopped there because it is against the present HSM design.

            I would like to bring up one more point here: HSMA_CANCEL is not yet implemented in the copytool. Even if we send HSMA_CANCEL to the copytool, it does nothing.

                    case HSMA_CANCEL:
                            CT_TRACE("cancel not implemented for file system '%s'",
                                     opt.o_mnt);
                            /* Don't report progress to coordinator for this cookie:
                             * the copy function will get ECANCELED when reporting
                             * progress. */
                            err_minor++;
                            return 0;
                            break;
            
            

            Often, if the CT gets killed during the hsm_restore process, it poses a layout change problem. I mean, if the file is halfway restored and the CT gets killed, then we cannot really reset the request. This is quite copytool-centered, and the sequence of actions is controlled by the copytool.

            Please correct me if I am wrong.
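            (A quick way to look at the per-file state after a copytool died mid-restore, run from a client against the file from the reproducer above; just a sketch.)

                # Show the HSM flags of the file; after an interrupted restore the
                # file is typically still flagged as released (and archived).
                lfs hsm_state /mnt/lustre/bigf

                # Show whether an HSM action is still recorded for the file.
                lfs hsm_action /mnt/lustre/bigf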

            vinayakh Vinayak (Inactive) added a comment - edited

            Why cancel the requests? Can't they be just reset? IMO if a copytool fails, we should not cancel its requests, because that would force the user to restart the archive requests, and every failed restore would create issues for the application wanting to access the file.

            The archive requests can be reset in the queue and restarted later when a new copytool shows up. As for the restore requests, they have a timeout that should be respected, and they shouldn't fail just because the copytool was terminated.
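            (For reference, the restore timeout mentioned above is presumably the coordinator's active request timeout, which is tunable on the MDS; a sketch assuming Lustre 2.5-style parameter names.)

                # Read the current timeout (in seconds) for active HSM requests.
                lctl get_param mdt.*.hsm.active_request_timeout

                # Lower it temporarily for testing so a stuck restore times out
                # sooner; 300 is only an example value.
                lctl set_param mdt.*.hsm.active_request_timeout=300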

            fzago Frank Zago (Inactive) added a comment

            Thanks Vinayak. I'll try to test it soon.

            fzago Frank Zago (Inactive) added a comment

            I have submitted a patch to address this issue (also considering the client eviction case). Please review and give your feedback so we can cover all possible scenarios.

            The patch can be tracked at: http://review.whamcloud.com/#/c/19369/

            vinayakh Vinayak (Inactive) added a comment

            Okay Frank, I got your point.
            I was thinking about the HSM storage. Even if the copytool starts again on the same node, there is no guarantee that the same HSM storage is used. I agree that storing the uuid of the CT does not make sense.

            Apart from this, do you have any suggestions, or should I proceed with this solution?

            gaurav_mahajan gaurav mahajan (Inactive) added a comment

            But the copytool may not restart on the same node, ever. So why keep the uuid of that copytool around?

            fzago Frank Zago (Inactive) added a comment

            Yes Frank.
            >> That won't work if the copytool on node A dies, and a new copytool is started on node B.
            That's why I want to store the md5sum of the client name in car->car_uuid (the place that stores the uuid of the copytool; once the copytool dies that uuid no longer has any meaning, so we can use it as a placeholder for the client name). This part helps identify the right node on which to re-queue the requests when a new copytool comes alive on the same node.

            >> If the MDS is aware that a copytool dies, it should just reset the requests (or not, depending on progress already made?), and give them to the next copytool that requests them like any new request

            Yes, this part is also handled. Old requests are added to the tail of the car_request_list with the uuid of the next copytool placed in car->car_uuid (the next copytool on the same node treats the old requests as new).

            >> The biggest code change would be to detect when a copytool is no longer reachable/functional.
            Currently a few of the signals caught by the copytool tell the coordinator to unregister the copytool. I need to look into it, but I think the logic above works.

            gaurav_mahajan gaurav mahajan (Inactive) added a comment

            That won't work if the copytool on node A dies and a new copytool is started on node B (as in an HA setup, for instance). If the MDS is aware that a copytool died, it should just reset the requests (or not, depending on progress already made?) and give them to the next copytool that requests them, like any new request. The biggest code change would be detecting when a copytool is no longer reachable/functional.

            fzago Frank Zago (Inactive) added a comment

            Below is my new approach for LU-5216; please comment (see also the observation commands sketched after the steps).

            While unregistering a copytool agent from the coordinator:

            1. Find the requests that are being processed by that particular agent, using the uuid of the copytool in the car_request_list.

            2. Once the car (cdt_agent_req) of that copytool agent is found, place the client details (obtained from ptlrpc; take the md5sum of the client name) in "car->car_uuid". (This hack of storing the client name is needed so that the stuck requests can be run when a new copytool comes alive on that client, since we must assign the requests only to a copytool running on that particular client.)

            While registering a copytool with the coordinator:

            3. Check whether the registration request comes from a client on which a copytool was killed (compare the md5sum of the client name from ptlrpc with the client details already stored).

            4. If it is the same client, go through the car_request_list, get the car (cdt_agent_req), remove it from the existing list, and add the same request back to the list, placing the uuid of the new copytool in car->car_uuid.
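            (To observe the agent uuids and queued requests that these steps manipulate, the coordinator state can be dumped on the MDS; a small sketch assuming a single MDT, not part of the proposed change.)

                # List the registered copytool agents with their uuids.
                lctl get_param mdt.*.hsm.agents

                # Dump the coordinator's request queue; compare before and after
                # restarting the copytool on the client.
                lctl get_param mdt.*.hsm.actions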

            gaurav_mahajan gaurav mahajan (Inactive) added a comment

            People

              Assignee: riauxjb Jean-Baptiste Riaux (Inactive)
              Reporter: fzago Frank Zago (Inactive)
              Votes: 3
              Watchers: 31
