  1. Lustre
  2. LU-5216

HSM: restore doesn't restart for new copytool


    • Type: Bug
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Lustre 2.5.1
    • Fix Version/s: None
    • Labels:
    • Environment:
      Centos 6.5, Lustre 2.5.56
    • Severity:
    • Rank (Obsolete):


      When a copytool fails to complete a restore, the restore is not restarted, and processes get stuck in the Lustre modules.

      For instance, create a large file that takes a few seconds to archive/restore, then release it, and access it:

      # dd if=/dev/urandom of=/mnt/lustre/bigf bs=1M count=1000
      # lfs hsm_archive /mnt/lustre/bigf
      # lfs hsm_release /mnt/lustre/bigf
      # sleep 5
      # md5sum /mnt/lustre/bigf

      During the restore, kill the copytool so that no completion event is sent to the MDS.
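
The steps above can be scripted roughly as follows. This is only a sketch under several assumptions: the copytool process is named lhsmtool_posix and runs on the same client as the script, the extra pause before the release is there to let the archive finish, and the one-second delay before the kill is a guess at "during the restore". The variable names and the reproduce function are hypothetical and not part of Lustre.

```shell
# Reproduction sketch. All defaults are assumptions; override them via the environment.
FILE="${FILE:-/mnt/lustre/bigf}"        # test file used in this report
COPYTOOL="${COPYTOOL:-lhsmtool_posix}"  # assumed process name of the copytool
DELAY="${DELAY:-5}"                     # pause (seconds) around the release
KILL_AFTER="${KILL_AFTER:-1}"           # seconds into the restore before the kill

reproduce() {
    dd if=/dev/urandom of="$FILE" bs=1M count=1000 || return 1
    lfs hsm_archive "$FILE" || return 1
    sleep "$DELAY"                      # assumed: give the archive time to finish
    lfs hsm_release "$FILE" || return 1
    sleep "$DELAY"
    md5sum "$FILE" &                    # reading a released file triggers the restore
    reader=$!
    sleep "$KILL_AFTER"
    pkill -9 -x "$COPYTOOL"             # SIGKILL, so the copytool gets no chance to report back
    echo "reader pid: $reader"
}
```

Calling reproduce as root on the client prints the PID of the reader, which can then be inspected under /proc.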

      Note that at this point the copytool itself may be unkillable, in which case the only fix is to reboot the client running it.

      When the copytool restarts, nothing happens. The process trying to read the file stays stuck (apparently forever) with this stack:

      # cat /proc/1675/stack 
      [<ffffffffa09dc04c>] ll_layout_refresh+0x25c/0xfe0 [lustre]
      [<ffffffffa0a28240>] vvp_io_init+0x340/0x490 [lustre]
      [<ffffffffa04dab68>] cl_io_init0+0x98/0x160 [obdclass]
      [<ffffffffa04dd794>] cl_io_init+0x64/0xe0 [obdclass]
      [<ffffffffa04debfd>] cl_io_rw_init+0x8d/0x200 [obdclass]
      [<ffffffffa09cbe38>] ll_file_io_generic+0x208/0x710 [lustre]
      [<ffffffffa09ccf8f>] ll_file_aio_read+0x13f/0x2c0 [lustre]
      [<ffffffffa09cd27c>] ll_file_read+0x16c/0x2a0 [lustre]
      [<ffffffff81189365>] vfs_read+0xb5/0x1a0
      [<ffffffff811894a1>] sys_read+0x51/0x90
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff

      It is also unkillable.
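
To find every process blocked this way, the stack files can be scanned for ll_layout_refresh. The helper below is a hypothetical sketch: the function names and the PROC_ROOT override are illustrative, and reading /proc/<pid>/stack normally requires root.

```shell
# Succeeds if the given stack dump contains ll_layout_refresh,
# the frame at the top of the stuck reader's stack in this report.
stuck_in_layout_refresh() {
    grep -q 'll_layout_refresh' "$1" 2>/dev/null
}

# Print the PID of every process whose kernel stack matches.
# PROC_ROOT is a hypothetical override, used here only to allow testing.
list_stuck_pids() {
    for stack in "${PROC_ROOT:-/proc}"/[0-9]*/stack; do
        [ -r "$stack" ] || continue
        if stuck_in_layout_refresh "$stack"; then
            pid=${stack%/stack}         # strip the trailing /stack
            echo "${pid##*/}"           # keep only the PID component
        fi
    done
}
```

Running list_stuck_pids on the affected client should list the stuck reader, assuming it is still blocked in the same function.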

      If I issue "lfs hsm_restore bigf.bin", lfs gets stuck too:

      # cat /proc/1723/stack 
      [<ffffffffa06ad29a>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc]
      [<ffffffffa06ad8a7>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
      [<ffffffffa0868913>] mdc_iocontrol+0x2113/0x27f0 [mdc]
      [<ffffffffa0af5265>] obd_iocontrol+0xe5/0x360 [lmv]
      [<ffffffffa0b0c145>] lmv_iocontrol+0x1c85/0x2b10 [lmv]
      [<ffffffffa09bb235>] obd_iocontrol+0xe5/0x360 [lustre]
      [<ffffffffa09c64d7>] ll_dir_ioctl+0x4237/0x5dc0 [lustre]
      [<ffffffff8119d802>] vfs_ioctl+0x22/0xa0
      [<ffffffff8119d9a4>] do_vfs_ioctl+0x84/0x580
      [<ffffffff8119df21>] sys_ioctl+0x81/0xa0
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff

      However, lfs hsm operations still work on other files.





              • Assignee:
                riauxjb Jean-Baptiste Riaux (Inactive)
                fzago Frank Zago
              • Votes:
                3
              • Watchers:
                31


                • Created: