Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4727

Lhsmtool_posix process stuck in ll_layout_refresh() when restoring

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.8.0
    • Lustre 2.6.0, Lustre 2.5.1
    • 3
    • 13001

    Description

      This is easy to reproduce. I hit this problem every time when I trying to run following commands.

      rm /mnt/lustre/XXXX -f;
      echo XXX > /mnt/lustre/XXXX;
      cat /mnt/lustre/XXXX;
      lfs hsm_archive --archive=5 /mnt/lustre/XXXX;
      cat /mnt/lustre/XXXX;
      lfs hsm_release /mnt/lustre/XXXX;
      cat /mnt/lustre/XXXX; # This will restore automatically
      lfs hsm_release /mnt/lustre/XXXX;
      lfs hsm_restore /mnt/lustre/XXXX; # Lhsmtool_posix actually hang here
      cat /mnt/lustre/XXXX; # this will stuck

      And after some time, following messages shew up.

      INFO: task flush-lustre-1:4106 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      flush-lustre- D 0000000000000005 0 4106 2 0x00000080
      ffff8808165b7830 0000000000000046 0000000000000000 0000000000000000
      0000000000013180 0000000000000000 ffff880851fc10f8 ffff88082d4e0c00
      ffff88082cb7fab8 ffff8808165b7fd8 000000000000fb88 ffff88082cb7fab8
      Call Trace:
      [<ffffffff814fc9fe>] __mutex_lock_slowpath+0x13e/0x180
      [<ffffffff814fc89b>] mutex_lock+0x2b/0x50
      [<ffffffffa0c2814c>] ll_layout_refresh+0x26c/0x1080 [lustre]
      [<ffffffff813104bb>] ? mix_pool_bytes_extract+0x16b/0x180
      [<ffffffff81135cf9>] ? zone_statistics+0x99/0xc0
      [<ffffffffa059e007>] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
      [<ffffffffa0c51230>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
      [<ffffffffa08b7450>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
      [<ffffffffa06dbba1>] ? cl_io_slice_add+0xc1/0x190 [obdclass]
      [<ffffffffa0c78410>] vvp_io_init+0x340/0x490 [lustre]
      [<ffffffffa05a11aa>] ? cfs_hash_find_or_add+0x9a/0x190 [libcfs]
      [<ffffffffa06daff8>] cl_io_init0+0x98/0x160 [obdclass]
      [<ffffffffa06ddc14>] cl_io_init+0x64/0xe0 [obdclass]
      [<ffffffffa0c1894d>] cl_sync_file_range+0x12d/0x500 [lustre]
      [<ffffffffa0c46cac>] ll_writepages+0x9c/0x220 [lustre]
      [<ffffffff81128d81>] do_writepages+0x21/0x40
      [<ffffffff811a43bd>] writeback_single_inode+0xdd/0x290
      [<ffffffff811a47ce>] writeback_sb_inodes+0xce/0x180
      [<ffffffff811a492b>] writeback_inodes_wb+0xab/0x1b0
      [<ffffffff811a4ccb>] wb_writeback+0x29b/0x3f0
      [<ffffffff814fb3a0>] ? thread_return+0x4e/0x76e
      [<ffffffff8107eb42>] ? del_timer_sync+0x22/0x30
      [<ffffffff811a4fb9>] wb_do_writeback+0x199/0x240
      [<ffffffff811a50c3>] bdi_writeback_task+0x63/0x1b0
      [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
      [<ffffffff811379e0>] ? bdi_start_fn+0x0/0x100
      [<ffffffff81137a66>] bdi_start_fn+0x86/0x100
      [<ffffffff811379e0>] ? bdi_start_fn+0x0/0x100
      [<ffffffff81091d66>] kthread+0x96/0xa0
      [<ffffffff8100c14a>] child_rip+0xa/0x20
      [<ffffffff81091cd0>] ? kthread+0x0/0xa0
      [<ffffffff8100c140>] ? child_rip+0x0/0x20

      It seems copy tool is waiting for md_enqueue(MDS_INODELOCK_LAYOUT). Other processes who are trying to lock lli->lli_layout_mutex will be stuck. This problem won't recover until lock enque times out and client reconnects.

      Attachments

        Issue Links

          Activity

            [LU-4727] Lhsmtool_posix process stuck in ll_layout_refresh() when restoring
            pjones Peter Jones added a comment -

            This patch landed quite some while back. Please speak up (or open a new ticket) if there are still residual issues to address.

            pjones Peter Jones added a comment - This patch landed quite some while back. Please speak up (or open a new ticket) if there are still residual issues to address.

            Yes, the patch can fix this problem

            jay Jinshan Xiong (Inactive) added a comment - Yes, the patch can fix this problem

            The process is stat, which does a stat() on a file.
            Is it the problem your patch should fix, or is it fixing something else?

            fzago Frank Zago (Inactive) added a comment - The process is stat, which does a stat() on a file. Is it the problem your patch should fix, or is it fixing something else?

            What's this process? From the name I guess it's not copy tool. John's patch can only fix copy tool case.

            You will need patch 13138 too to address this case if the restoring will take longer than 120 seconds.

            jay Jinshan Xiong (Inactive) added a comment - What's this process? From the name I guess it's not copy tool. John's patch can only fix copy tool case. You will need patch 13138 too to address this case if the restoring will take longer than 120 seconds.

            Jinshan,

            I got this trace when using stat on a file I was restoring (or was restored). I haven't been able to reproduce it so far. Is it the error you mention in your patch? John's patch is applied on that tree, and works well otherwise.

            Mar 20 15:00:55 tasclient01 kernel: INFO: task stat:951 blocked for more than 120 seconds.
            Mar 20 15:00:55 tasclient01 kernel:      Tainted: P           ---------------    2.6.32-431.17.1.el6.x86_64 #1
            Mar 20 15:00:55 tasclient01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables thi
            s message.
            Mar 20 15:00:55 tasclient01 kernel: stat          D 0000000000000000     0   951  29769 0x00000084
            Mar 20 15:00:55 tasclient01 kernel: ffff8801d0471a58 0000000000000082 0000000000000000 000000000000000d
            Mar 20 15:00:55 tasclient01 kernel: 0000000000000004 ffff880237fee800 ffff880116db5610 0000000000000630
            Mar 20 15:00:55 tasclient01 kernel: ffff8802143f1098 ffff8801d0471fd8 000000000000fbc8 ffff8802143f1098
            Mar 20 15:00:55 tasclient01 kernel: Call Trace:
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8152935e>] __mutex_lock_slowpath+0x13e/0x180
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffff815291fb>] mutex_lock+0x2b/0x50
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa09e009a>] ll_layout_refresh+0x1da/0xc60 [lustre]
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffff810f0e46>] ? ftrace_test_stop_func+0x16/0x20
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8100ad96>] ? ftrace_call+0x5/0x2b
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffff810f0e46>] ? ftrace_test_stop_func+0x16/0x20
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa0a04ab0>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa06996a0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa04dfe31>] ? cl_io_slice_add+0xc1/0x190 [obdclass]
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa0a2d8f0>] vvp_io_init+0x340/0x490 [lustre]
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffff810f0e46>] ? ftrace_test_stop_func+0x16/0x20
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8100ad96>] ? ftrace_call+0x5/0x2b
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa04df278>] cl_io_init0+0x98/0x160 [obdclass]
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa04e1ea4>] cl_io_init+0x64/0xe0 [obdclass]
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa0a23161>] cl_glimpse_size0+0x91/0x1d0 [lustre]
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa09d4c25>] ll_inode_revalidate_it+0x1a5/0x1d0 [lustre]
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa09d4c99>] ll_getattr_it+0x49/0x170 [lustre]
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa09d4df7>] ll_getattr+0x37/0x40 [lustre]
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffff81227163>] ? security_inode_getattr+0x23/0x30
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8118e631>] vfs_getattr+0x51/0x80
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8118e6c4>] vfs_fstatat+0x64/0xa0
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffff810f0e46>] ? ftrace_test_stop_func+0x16/0x20
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8118e76e>] vfs_lstat+0x1e/0x20
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8118e794>] sys_newlstat+0x24/0x50
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffff810e1cc7>] ? audit_syscall_entry+0x1d7/0x200
            Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
            
            fzago Frank Zago (Inactive) added a comment - Jinshan, I got this trace when using stat on a file I was restoring (or was restored). I haven't been able to reproduce it so far. Is it the error you mention in your patch? John's patch is applied on that tree, and works well otherwise. Mar 20 15:00:55 tasclient01 kernel: INFO: task stat:951 blocked for more than 120 seconds. Mar 20 15:00:55 tasclient01 kernel: Tainted: P --------------- 2.6.32-431.17.1.el6.x86_64 #1 Mar 20 15:00:55 tasclient01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables thi s message. Mar 20 15:00:55 tasclient01 kernel: stat D 0000000000000000 0 951 29769 0x00000084 Mar 20 15:00:55 tasclient01 kernel: ffff8801d0471a58 0000000000000082 0000000000000000 000000000000000d Mar 20 15:00:55 tasclient01 kernel: 0000000000000004 ffff880237fee800 ffff880116db5610 0000000000000630 Mar 20 15:00:55 tasclient01 kernel: ffff8802143f1098 ffff8801d0471fd8 000000000000fbc8 ffff8802143f1098 Mar 20 15:00:55 tasclient01 kernel: Call Trace: Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8152935e>] __mutex_lock_slowpath+0x13e/0x180 Mar 20 15:00:55 tasclient01 kernel: [<ffffffff815291fb>] mutex_lock+0x2b/0x50 Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa09e009a>] ll_layout_refresh+0x1da/0xc60 [lustre] Mar 20 15:00:55 tasclient01 kernel: [<ffffffff810f0e46>] ? ftrace_test_stop_func+0x16/0x20 Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8100ad96>] ? ftrace_call+0x5/0x2b Mar 20 15:00:55 tasclient01 kernel: [<ffffffff810f0e46>] ? ftrace_test_stop_func+0x16/0x20 Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa0a04ab0>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre] Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa06996a0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc] Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa04dfe31>] ? cl_io_slice_add+0xc1/0x190 [obdclass] Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa0a2d8f0>] vvp_io_init+0x340/0x490 [lustre] Mar 20 15:00:55 tasclient01 kernel: [<ffffffff810f0e46>] ? ftrace_test_stop_func+0x16/0x20 Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8100ad96>] ? ftrace_call+0x5/0x2b Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa04df278>] cl_io_init0+0x98/0x160 [obdclass] Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa04e1ea4>] cl_io_init+0x64/0xe0 [obdclass] Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa0a23161>] cl_glimpse_size0+0x91/0x1d0 [lustre] Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa09d4c25>] ll_inode_revalidate_it+0x1a5/0x1d0 [lustre] Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa09d4c99>] ll_getattr_it+0x49/0x170 [lustre] Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa09d4df7>] ll_getattr+0x37/0x40 [lustre] Mar 20 15:00:55 tasclient01 kernel: [<ffffffff81227163>] ? security_inode_getattr+0x23/0x30 Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8118e631>] vfs_getattr+0x51/0x80 Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8118e6c4>] vfs_fstatat+0x64/0xa0 Mar 20 15:00:55 tasclient01 kernel: [<ffffffff810f0e46>] ? ftrace_test_stop_func+0x16/0x20 Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8118e76e>] vfs_lstat+0x1e/0x20 Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8118e794>] sys_newlstat+0x24/0x50 Mar 20 15:00:55 tasclient01 kernel: [<ffffffff810e1cc7>] ? audit_syscall_entry+0x1d7/0x200 Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13750/
            Subject: LU-4727 hsm: use IOC_MDC_GETFILEINFO in restore
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 96dbac2eaef7a5d1090807bedc9951279c06d037

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13750/ Subject: LU-4727 hsm: use IOC_MDC_GETFILEINFO in restore Project: fs/lustre-release Branch: master Current Patch Set: Commit: 96dbac2eaef7a5d1090807bedc9951279c06d037

            John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/13750
            Subject: LU-4727 hsm: use IOC_MDC_GETFILEINFO in restore
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8ec6354ded37f3e1f39d6e0336c9e17b1a97785b

            gerrit Gerrit Updater added a comment - John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/13750 Subject: LU-4727 hsm: use IOC_MDC_GETFILEINFO in restore Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 8ec6354ded37f3e1f39d6e0336c9e17b1a97785b

            the patch has been in Gerrit for a long time. Please let me know what I can do to move this forward, sigh.

            jay Jinshan Xiong (Inactive) added a comment - the patch has been in Gerrit for a long time. Please let me know what I can do to move this forward, sigh.

            http://review.whamcloud.com/13138 solves the problem on single node setup on local vm. Thanks for the patch Jinshan

            vinayak_clogeny Vinayak Hariharmath (Inactive) added a comment - http://review.whamcloud.com/13138 solves the problem on single node setup on local vm. Thanks for the patch Jinshan

            Please try patch 13138 and check if it can fix the problem.

            jay Jinshan Xiong (Inactive) added a comment - Please try patch 13138 and check if it can fix the problem.

            Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/13138
            Subject: LU-4727 hsm: flush UPDATE lock for restore
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bf2e4b958f60cb7eda9303ad0c079fd23ff2d16b

            gerrit Gerrit Updater added a comment - Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/13138 Subject: LU-4727 hsm: flush UPDATE lock for restore Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: bf2e4b958f60cb7eda9303ad0c079fd23ff2d16b

            People

              jay Jinshan Xiong (Inactive)
              lixi Li Xi (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              19 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: