
Lhsmtool_posix process stuck in ll_layout_refresh() when restoring

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.6.0, Lustre 2.5.1
    • 3
    • 13001

    Description

      This is easy to reproduce. I hit this problem every time I try to run the following commands.

      rm /mnt/lustre/XXXX -f;
      echo XXX > /mnt/lustre/XXXX;
      cat /mnt/lustre/XXXX;
      lfs hsm_archive --archive=5 /mnt/lustre/XXXX;
      cat /mnt/lustre/XXXX;
      lfs hsm_release /mnt/lustre/XXXX;
      cat /mnt/lustre/XXXX; # This will restore automatically
      lfs hsm_release /mnt/lustre/XXXX;
      lfs hsm_restore /mnt/lustre/XXXX; # Lhsmtool_posix actually hangs here
      cat /mnt/lustre/XXXX; # this will get stuck

      After some time, the following messages showed up.

      INFO: task flush-lustre-1:4106 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      flush-lustre- D 0000000000000005 0 4106 2 0x00000080
      ffff8808165b7830 0000000000000046 0000000000000000 0000000000000000
      0000000000013180 0000000000000000 ffff880851fc10f8 ffff88082d4e0c00
      ffff88082cb7fab8 ffff8808165b7fd8 000000000000fb88 ffff88082cb7fab8
      Call Trace:
      [<ffffffff814fc9fe>] __mutex_lock_slowpath+0x13e/0x180
      [<ffffffff814fc89b>] mutex_lock+0x2b/0x50
      [<ffffffffa0c2814c>] ll_layout_refresh+0x26c/0x1080 [lustre]
      [<ffffffff813104bb>] ? mix_pool_bytes_extract+0x16b/0x180
      [<ffffffff81135cf9>] ? zone_statistics+0x99/0xc0
      [<ffffffffa059e007>] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
      [<ffffffffa0c51230>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
      [<ffffffffa08b7450>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
      [<ffffffffa06dbba1>] ? cl_io_slice_add+0xc1/0x190 [obdclass]
      [<ffffffffa0c78410>] vvp_io_init+0x340/0x490 [lustre]
      [<ffffffffa05a11aa>] ? cfs_hash_find_or_add+0x9a/0x190 [libcfs]
      [<ffffffffa06daff8>] cl_io_init0+0x98/0x160 [obdclass]
      [<ffffffffa06ddc14>] cl_io_init+0x64/0xe0 [obdclass]
      [<ffffffffa0c1894d>] cl_sync_file_range+0x12d/0x500 [lustre]
      [<ffffffffa0c46cac>] ll_writepages+0x9c/0x220 [lustre]
      [<ffffffff81128d81>] do_writepages+0x21/0x40
      [<ffffffff811a43bd>] writeback_single_inode+0xdd/0x290
      [<ffffffff811a47ce>] writeback_sb_inodes+0xce/0x180
      [<ffffffff811a492b>] writeback_inodes_wb+0xab/0x1b0
      [<ffffffff811a4ccb>] wb_writeback+0x29b/0x3f0
      [<ffffffff814fb3a0>] ? thread_return+0x4e/0x76e
      [<ffffffff8107eb42>] ? del_timer_sync+0x22/0x30
      [<ffffffff811a4fb9>] wb_do_writeback+0x199/0x240
      [<ffffffff811a50c3>] bdi_writeback_task+0x63/0x1b0
      [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
      [<ffffffff811379e0>] ? bdi_start_fn+0x0/0x100
      [<ffffffff81137a66>] bdi_start_fn+0x86/0x100
      [<ffffffff811379e0>] ? bdi_start_fn+0x0/0x100
      [<ffffffff81091d66>] kthread+0x96/0xa0
      [<ffffffff8100c14a>] child_rip+0xa/0x20
      [<ffffffff81091cd0>] ? kthread+0x0/0xa0
      [<ffffffff8100c140>] ? child_rip+0x0/0x20

      It seems the copy tool is waiting for md_enqueue(MDS_INODELOCK_LAYOUT). Other processes trying to lock lli->lli_layout_mutex will also get stuck. The problem does not recover until the lock enqueue times out and the client reconnects.
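
      For reference, while the hang is in progress the state of the stuck restore can be checked from a second shell. This is only a rough sketch; the MDT name lustre-MDT0000 is an example and should be adjusted for the actual filesystem:

      lfs hsm_state /mnt/lustre/XXXX                  # per-file HSM flags (exists, archived, released, ...)
      lfs hsm_action /mnt/lustre/XXXX                 # in-progress HSM action, e.g. RESTORE RUNNING
      lctl get_param mdt.lustre-MDT0000.hsm.actions   # coordinator request list; the RESTORE stays in STARTED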

      Attachments

        Issue Links

          Activity

            [LU-4727] Lhsmtool_posix process stuck in ll_layout_refresh() when restoring
            lixi Li Xi (Inactive) added a comment - - edited

            Hi John,

            Sorry, the problem I got today is a different issue, and I am not sure whether it is a bug or not. It is confirmed that LU-4727 is fixed for me too.

            I got the new problem when the HSM tool (which is a private tool, not the standard lhsmtool_posix) exits right after 'lfs hsm_restore' completes. But the action status in "lctl get_param mdt.server1-MDT0000.hsm.actions" is still STARTED, not SUCCEED. As long as the tool keeps itself alive until the action status becomes SUCCEED, this problem is gone.

            However, this might be an issue too, because the HSM tool can exit at any time, and even if it restarts, the action will never continue. I don't know any way to recover the HSM status other than rebooting the machine. Do you have any advice? Thanks!
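
            If it helps as a stopgap, the "keep the tool alive until SUCCEED" workaround can be scripted. A rough sketch, assuming the MDT name server1-MDT0000 from the get_param line above and that matching the actions list by FID is reliable enough:

            file=/mnt/lustre/XXXX
            fid=$(lfs path2fid "$file")
            lfs hsm_restore "$file"
            # Poll the coordinator's action list until this file's request leaves STARTED;
            # only stop the copytool after that.
            while lctl get_param -n mdt.server1-MDT0000.hsm.actions | grep -F -- "$fid" | grep -q STARTED; do
                sleep 1
            done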

            jhammond John Hammond added a comment -

            > Hmm, I still got the following error on a recent master branch with the 13750 patch.
            > ...

            Please post your reproducer.


            lixi Li Xi (Inactive) added a comment -

            Hmm, I still got the following error on a recent master branch with the 13750 patch.

            May 6 19:13:44 server1 kernel: INFO: task test.py:32411 blocked for more than 120 seconds.
            May 6 19:13:44 server1 kernel: Not tainted 2.6.32-431.29.2.el6_lustre.2.5.29.ddnpf4.x86_64 #1
            May 6 19:13:44 server1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            May 6 19:13:44 server1 kernel: test.py D 0000000000000000 0 32411 23005 0x00000080
            May 6 19:13:44 server1 kernel: ffff88055238fb08 0000000000000082 0000000000000000 ffff88055351c5c0
            May 6 19:13:44 server1 kernel: ffff88055238fab8 ffffffffa0e0685e ffff88086d4a49d8 ffff880597c9b980
            May 6 19:13:44 server1 kernel: ffff880553ee85f8 ffff88055238ffd8 000000000000fbc8 ffff880553ee85f8
            May 6 19:13:44 server1 kernel: Call Trace:
            May 6 19:13:44 server1 kernel: [<ffffffffa0e0685e>] ? mdc_set_open_replay_data+0x29e/0x500 [mdc]
            May 6 19:13:44 server1 kernel: [<ffffffff8152b9de>] __mutex_lock_slowpath+0x13e/0x180
            May 6 19:13:44 server1 kernel: [<ffffffff8152b87b>] mutex_lock+0x2b/0x50
            May 6 19:13:44 server1 kernel: [<ffffffffa0ec7ea1>] ll_layout_refresh+0x191/0x300 [lustre]
            May 6 19:13:44 server1 kernel: [<ffffffffa0ed78ac>] ll_setattr_raw+0x5cc/0xbe0 [lustre]
            May 6 19:13:44 server1 kernel: [<ffffffffa0ed7f25>] ll_setattr+0x65/0xd0 [lustre]
            May 6 19:13:44 server1 kernel: [<ffffffff811a8158>] notify_change+0x168/0x340
            May 6 19:13:44 server1 kernel: [<ffffffff81187374>] do_truncate+0x64/0xa0
            May 6 19:13:44 server1 kernel: [<ffffffff8119c4b1>] do_filp_open+0x861/0xd20
            May 6 19:13:44 server1 kernel: [<ffffffff81290c5a>] ? strncpy_from_user+0x4a/0x90
            May 6 19:13:44 server1 kernel: [<ffffffff811a9032>] ? alloc_fd+0x92/0x160
            May 6 19:13:44 server1 kernel: [<ffffffff81186039>] do_sys_open+0x69/0x140
            May 6 19:13:44 server1 kernel: [<ffffffff81186150>] sys_open+0x20/0x30
            May 6 19:13:44 server1 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
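
            The stack above goes through do_truncate(), so the trigger appears to be an open with O_TRUNC on a released file. A rough, hypothetical minimal equivalent of the step that hangs:

            lfs hsm_release /mnt/lustre/XXXX
            # Opening the released file with O_TRUNC forces a truncate; per the trace,
            # ll_setattr_raw() then takes the ll_layout_refresh() path, so the open
            # blocks until the implicit restore completes.
            : > /mnt/lustre/XXXX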

            pjones Peter Jones added a comment -

            This patch landed quite a while back. Please speak up (or open a new ticket) if there are still residual issues to address.


            jay Jinshan Xiong (Inactive) added a comment -

            Yes, the patch can fix this problem.


            fzago Frank Zago (Inactive) added a comment -

            The process is stat, which does a stat() on a file.
            Is it the problem your patch should fix, or is it fixing something else?


            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: lixi Li Xi (Inactive)
              Votes: 0
              Watchers: 19

              Dates

                Created:
                Updated:
                Resolved: