Lustre / LU-5216

HSM: restore doesn't restart for new copytool

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.5.1
    • CentOS 6.5, Lustre 2.5.56
    • 3
    • 14550

    Description

      When a copytool does not complete a restore properly, the restore is not restarted, and processes get stuck in Lustre modules.

      For instance, create a large file that takes a few seconds to archive/restore, then release it, and access it:

      # dd if=/dev/urandom of=/mnt/lustre/bigf bs=1M count=1000
      # lfs hsm_archive /mnt/lustre/bigf
      # lfs hsm_release /mnt/lustre/bigf
      # sleep 5
      # md5sum /mnt/lustre/bigf
      

      During the restoration, kill the copytool, so no complete event is sent to the MDS.

      Note that at this point the copytool may itself be unkillable, in which case the only fix is to reboot the client running it.
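
The reproduction steps above could be scripted roughly as follows. This is only a sketch under assumptions that the report does not state: the copytool binary is taken to be `lhsmtool_posix`, it is assumed to run on the same node as the script, and the `reproduce` function name and the 2-second delay are illustrative.

```shell
#!/bin/sh
# Sketch of the reproduction described above.
# Assumptions (not stated in the report): the copytool binary is
# lhsmtool_posix and it runs on the same node as this script.
FILE=${FILE:-/mnt/lustre/bigf}
COPYTOOL=${COPYTOOL:-lhsmtool_posix}

reproduce() {
    dd if=/dev/urandom of="$FILE" bs=1M count=1000 || return 1
    lfs hsm_archive "$FILE" || return 1
    # Poll until the archive request has completed (loops forever if it never does).
    until lfs hsm_state "$FILE" | grep -q archived; do
        sleep 1
    done
    lfs hsm_release "$FILE" || return 1
    # Reading a released file triggers an implicit restore.
    md5sum "$FILE" &
    reader=$!
    # Give the copytool a moment to start the restore, then kill it
    # so that it never reports completion to the MDS.
    sleep 2
    pkill -KILL -x "$COPYTOOL"
    echo "reader pid: $reader"
}
```

Calling `reproduce` as root on a client leaves the `md5sum` reader in the background; its PID is printed so that its kernel stack can be inspected afterwards.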

      When the copytool restarts, nothing happens. The process trying to read the file is stuck (apparently forever) here:

      # cat /proc/1675/stack 
      [<ffffffffa09dc04c>] ll_layout_refresh+0x25c/0xfe0 [lustre]
      [<ffffffffa0a28240>] vvp_io_init+0x340/0x490 [lustre]
      [<ffffffffa04dab68>] cl_io_init0+0x98/0x160 [obdclass]
      [<ffffffffa04dd794>] cl_io_init+0x64/0xe0 [obdclass]
      [<ffffffffa04debfd>] cl_io_rw_init+0x8d/0x200 [obdclass]
      [<ffffffffa09cbe38>] ll_file_io_generic+0x208/0x710 [lustre]
      [<ffffffffa09ccf8f>] ll_file_aio_read+0x13f/0x2c0 [lustre]
      [<ffffffffa09cd27c>] ll_file_read+0x16c/0x2a0 [lustre]
      [<ffffffff81189365>] vfs_read+0xb5/0x1a0
      [<ffffffff811894a1>] sys_read+0x51/0x90
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      It is also unkillable.
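
To find every process stuck at the same place, the kernel stacks under /proc can be scanned for the blocking symbol. The helper below is a minimal sketch: `find_stuck` is a hypothetical name, the default symbol is taken from the trace above, and the second argument (an alternate proc root) exists only so the function can be tried against saved stack files. Reading `/proc/<pid>/stack` normally requires root.

```shell
#!/bin/sh
# Print the PIDs whose kernel stack mentions a given symbol.
# $1: symbol to look for (default: ll_layout_refresh, as in the trace above)
# $2: proc root to scan (default: /proc)
find_stuck() {
    sym=${1:-ll_layout_refresh}
    root=${2:-/proc}
    for f in "$root"/[0-9]*/stack; do
        # Skip unreadable entries and the unexpanded glob when nothing matches.
        [ -r "$f" ] || continue
        if grep -qF -- "$sym" "$f" 2>/dev/null; then
            pid=${f%/stack}
            echo "${pid##*/}"
        fi
    done
}
```

For example, `find_stuck ll_layout_refresh` lists readers blocked like PID 1675 above, and `find_stuck ptlrpc_set_wait` lists processes waiting on an RPC.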

      If I issue "lfs hsm_restore bigf.bin", lfs gets stuck too:

      # cat /proc/1723/stack 
      [<ffffffffa06ad29a>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc]
      [<ffffffffa06ad8a7>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
      [<ffffffffa0868913>] mdc_iocontrol+0x2113/0x27f0 [mdc]
      [<ffffffffa0af5265>] obd_iocontrol+0xe5/0x360 [lmv]
      [<ffffffffa0b0c145>] lmv_iocontrol+0x1c85/0x2b10 [lmv]
      [<ffffffffa09bb235>] obd_iocontrol+0xe5/0x360 [lustre]
      [<ffffffffa09c64d7>] ll_dir_ioctl+0x4237/0x5dc0 [lustre]
      [<ffffffff8119d802>] vfs_ioctl+0x22/0xa0
      [<ffffffff8119d9a4>] do_vfs_ioctl+0x84/0x580
      [<ffffffff8119df21>] sys_ioctl+0x81/0xa0
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      However, lfs hsm operations still work on other files.
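
Since only the one file is affected, it may help to check whether the MDS coordinator still lists the interrupted restore. The sketch below filters the coordinator's request list, which is available on the MDS through `lctl get_param mdt.*.hsm.actions`. The `action=` and `status=` field names are assumed from typical output and may differ between Lustre versions; `started_restores` is a hypothetical helper name.

```shell
#!/bin/sh
# Print coordinator request lines for restores still marked STARTED.
# Reads the request list on stdin, so it works with live lctl output
# or with a saved copy of it.
# Assumption: each request line carries action=... and status=... fields.
started_restores() {
    grep 'action=RESTORE' | grep 'status=STARTED'
}
```

On the MDS this would be used as `lctl get_param -n 'mdt.*.hsm.actions' | started_restores`. On a client, `lfs hsm_action /mnt/lustre/bigf` reports the action in progress for that file.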

          Activity

            tappro Mikhail Pershin made changes -
            Link New: This issue is related to LU-11284 [ LU-11284 ]
            tappro Mikhail Pershin made changes -
            Link New: This issue is blocking LU-10175 [ LU-10175 ]
            pjones Peter Jones made changes -
            Fix Version/s Original: Lustre 2.11.0 [ 13091 ]
            jhammond John Hammond made changes -
            Resolution Original: Fixed [ 1 ]
            Status Original: Resolved [ 5 ] New: Reopened [ 4 ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.11.0 [ 13091 ]
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            riauxjb Jean-Baptiste Riaux (Inactive) made changes -
            Assignee Original: Robert Read [ rread ] New: Jean-Baptiste Riaux [ riauxjb ]
            pjones Peter Jones made changes -
            Link New: This issue is duplicated by SEA-308 [ SEA-308 ]
            parinay parinay v kondekar (Inactive) made changes -
            Link New: This issue is related to SEA-128 [ SEA-128 ]
            vinayakh Vinayak (Inactive) made changes -
            Comment [ Hello Intel,

            I am not able to cherry-pick the patches from https://review.whamcloud.com/, but git pull works fine. Can you please help me with this?

            It fails with the error below.
            {noformat}
            error: while accessing https://review.whamcloud.com/fs/lustre-release/info/refs

            fatal: HTTP request failed

            {noformat}

            Thanks,
            ]

            People

              riauxjb Jean-Baptiste Riaux (Inactive)
              fzago Frank Zago (Inactive)
              Votes: 3
              Watchers: 31
