  Lustre / LU-8010

lfs hsm command hangs up after lfs hsm_cancel

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.9.0
    • Affects Version: Lustre 2.8.0
    • Severity: 3
    • Component: HSM

    Description

      When an hsm_restore request is canceled, subsequent lfs hsm commands hang up.
      To reproduce, set up HSM and run one of the following
      (we found 2 patterns; they appear to hang at different points):

      (1)

      # lfs hsm_state /lustre/file
      /lustre/file: (0x0000000d) released exists archived, archive_id:1
      # lfs hsm_restore /lustre/file
      # lfs hsm_cancel /lustre/file
      # lfs hsm_restore /lustre/file (quickly after lfs hsm_cancel)
      # lfs hsm_restore /lustre/file  *hang up
      

      The call trace is as follows:

      PID: 9550   TASK: ffff880c33f86040  CPU: 13  COMMAND: "lfs"
       #0 [ffff880c1ca8f568] schedule at ffffffff81539170
       #1 [ffff880c1ca8f640] schedule_timeout at ffffffff8153a042
       #2 [ffff880c1ca8f6f0] ldlm_completion_ast at ffffffffa0847fc9 [ptlrpc]
       #3 [ffff880c1ca8f7a0] ldlm_cli_enqueue_fini at ffffffffa08420e6 [ptlrpc]
       #4 [ffff880c1ca8f840] ldlm_cli_enqueue at ffffffffa08429a1 [ptlrpc]
       #5 [ffff880c1ca8f8f0] mdc_enqueue at ffffffffa0a6e8aa [mdc]
       #6 [ffff880c1ca8fa40] lmv_enqueue at ffffffffa0a25bfb [lmv]
       #7 [ffff880c1ca8fac0] ll_layout_refresh_locked at ffffffffa0b084f6 [lustre]
       #8 [ffff880c1ca8fc00] ll_layout_refresh at ffffffffa0b09159 [lustre]
       #9 [ffff880c1ca8fc50] vvp_io_init at ffffffffa0b5227f [lustre]
      #10 [ffff880c1ca8fcc0] cl_io_init0 at ffffffffa06a8e78 [obdclass]
      #11 [ffff880c1ca8fd00] cl_io_init at ffffffffa06abdf4 [obdclass]
      #12 [ffff880c1ca8fd40] cl_glimpse_size0 at ffffffffa0b4bc05 [lustre]
      #13 [ffff880c1ca8fda0] ll_getattr at ffffffffa0b07cb8 [lustre]
      #14 [ffff880c1ca8fe40] vfs_getattr at ffffffff81197c61
      #15 [ffff880c1ca8fe80] vfs_fstatat at ffffffff81197cf4
      #16 [ffff880c1ca8fee0] vfs_lstat at ffffffff81197d9e
      #17 [ffff880c1ca8fef0] sys_newlstat at ffffffff81197dc4
      #18 [ffff880c1ca8ff80] tracesys at ffffffff8100b2e8 (via system_call)
      
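      Both reproducers start from a file that is already released and
      archived. As a convenience, the sketch below shows one way to reach
      that starting state and to turn pattern (1) into a scriptable check;
      the archive ID, the copytool wait, and the coreutils timeout wrapper
      are our assumptions, not part of the original report:

      # dd if=/dev/zero of=/lustre/file bs=1M count=1
      # lfs hsm_archive --archive 1 /lustre/file
      <wait for the copytool to finish the archive>
      # lfs hsm_release /lustre/file
      # lfs hsm_state /lustre/file
      /lustre/file: (0x0000000d) released exists archived, archive_id:1
      <run pattern (1) as above; wrapping the final restore in timeout
       turns the hang into a testable failure>
      # timeout 10 lfs hsm_restore /lustre/file || echo hsm_restore hung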

      (2)

      # lfs hsm_state /lustre/file
      /lustre/file: (0x0000000d) released exists archived, archive_id:1
      # lfs hsm_restore /lustre/file
      <MDS reset immediately>
      <MDS recovers and HSM is set up again>
      # lfs hsm_action /lustre/file
      /lustre/file: RESTORE waiting (from 0 to EOF)
      # lfs hsm_state /lustre/file
      /lustre/file: (0x0000000d) released exists archived, archive_id:1
      # lfs hsm_cancel /lustre/file
      # lfs hsm_restore /lustre/file   *hang up
      

      The call trace is as follows:

      PID: 3731   TASK: ffff880c35087520  CPU: 13  COMMAND: "lfs"
       #0 [ffff880c327df598] schedule at ffffffff81539170
       #1 [ffff880c327df670] schedule_timeout at ffffffff8153a042
       #2 [ffff880c327df720] ptlrpc_set_wait at ffffffffa0860c41 [ptlrpc]
       #3 [ffff880c327df7e0] ptlrpc_queue_wait at ffffffffa0861301 [ptlrpc]
       #4 [ffff880c327df800] mdc_ioc_hsm_request at ffffffffa0a65138 [mdc]
       #5 [ffff880c327df830] mdc_iocontrol at ffffffffa0a66df9 [mdc]
       #6 [ffff880c327df950] obd_iocontrol at ffffffffa0a16aa5 [lmv]
       #7 [ffff880c327df9a0] lmv_iocontrol at ffffffffa0a2d9dc [lmv]
       #8 [ffff880c327dfb90] obd_iocontrol at ffffffffa0af2d15 [lustre]
       #9 [ffff880c327dfbe0] ll_dir_ioctl at ffffffffa0af9ae8 [lustre]
      #10 [ffff880c327dfe60] vfs_ioctl at ffffffff811a7972
      #11 [ffff880c327dfea0] do_vfs_ioctl at ffffffff811a7b14
      #12 [ffff880c327dff30] sys_ioctl at ffffffff811a8091
      #13 [ffff880c327dff80] tracesys at ffffffff8100b2e8 (via system_call)
      
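      For pattern (2), the coordinator's view on the MDS is worth
      inspecting after recovery. A sketch, assuming root on the MDS node
      (mdt.*.hsm_control and mdt.*.hsm.actions are the standard HSM
      parameters; the exact MDT names vary per setup):

      # lctl get_param mdt.*.hsm_control
      <expect "enabled" once HSM is set up again>
      # lctl get_param mdt.*.hsm.actions
      <lists the CDT action queue; the stuck RESTORE should appear here>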

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.9

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19710/
          Subject: LU-8010 mdt: fix orphan layout_lock cases for restore
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: c13ddec8e1cd3a63c16e08f28749771200b92f1b

          takamura Tatsushi Takamura added a comment -

          Sorry for the late reply. I have checked with your new patch, and neither reproducer triggered a hang-up.
          Thanks

          bfaccini Bruno Faccini (Inactive) added a comment -

          Right, I double-checked: reproducer #2 is a different condition leading to the same consequence (an orphan layout lock), and it must be fixed separately. In a few words, my understanding is that a restore already started when an MDT failure occurs must be set back to the WAITING state so it can be replayed (if legal) upon CDT restart. This will allow it to be canceled without leaving the associated layout locked.
          I have pushed a new version/patch-set; let's see how it works.
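          If the new patch-set behaves as described, pattern (2) should now
          complete instead of blocking; a minimal verification sketch (the
          timeout wrapper is our hang detector, not part of lfs):

          # lfs hsm_action /lustre/file
          /lustre/file: RESTORE waiting (from 0 to EOF)
          # lfs hsm_cancel /lustre/file
          # timeout 10 lfs hsm_restore /lustre/file && echo no hang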

          takamura Tatsushi Takamura added a comment -

          Hello Bruno,

          I have confirmed that the command hang-up from reproducer (1) is resolved, but the restore command still hangs up with reproducer (2).
          Could you check again?

          bfaccini Bruno Faccini (Inactive) added a comment -

          Hello Tatsushi,
          In my local testing with my patch, using the reproducers you detailed, I did not trigger a new hang situation.
          Can you give it a try at your site?

          gerrit Gerrit Updater added a comment -

          Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/19710
          Subject: LU-8010 mdt: unlock layout when failing restore
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 656abdd8a342902e67dc77a86e30fb0e0a30c8da

          bfaccini Bruno Faccini (Inactive) added a comment -

          In both reproducer scenarios, the root cause is the layout lock not being released, which leads to hung threads on both the client and MDS sides while handling HSM requests on the same file.

          I think I have found a hole in the HSM code that can leave a layout lock taken for a restore unreleased. I will push a tentative patch to fix it.
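          For anyone triaging a similar hang, the client-side symptom can be
          confirmed without crash(8); a sketch using generic Linux facilities
          (the pgrep match and the namespace glob are our assumptions):

          # cat /proc/$(pgrep -x lfs)/stack
          <for pattern (1), expect ldlm_completion_ast near the top, as in
           the trace above>
          # lctl get_param ldlm.namespaces.*.lock_count
          <a count that never drains hints at the orphaned layout lock>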

          People

            Assignee: bfaccini Bruno Faccini (Inactive)
            Reporter: takamura Tatsushi Takamura
