Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4727

Lhsmtool_posix process stuck in ll_layout_refresh() when restoring

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.8.0
    • Lustre 2.6.0, Lustre 2.5.1
    • 3
    • 13001

    Description

      This is easy to reproduce. I hit this problem every time when I trying to run following commands.

      rm /mnt/lustre/XXXX -f;
      echo XXX > /mnt/lustre/XXXX;
      cat /mnt/lustre/XXXX;
      lfs hsm_archive --archive=5 /mnt/lustre/XXXX;
      cat /mnt/lustre/XXXX;
      lfs hsm_release /mnt/lustre/XXXX;
      cat /mnt/lustre/XXXX; # This will restore automatically
      lfs hsm_release /mnt/lustre/XXXX;
      lfs hsm_restore /mnt/lustre/XXXX; # Lhsmtool_posix actually hang here
      cat /mnt/lustre/XXXX; # this will stuck

      And after some time, following messages shew up.

      INFO: task flush-lustre-1:4106 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      flush-lustre- D 0000000000000005 0 4106 2 0x00000080
      ffff8808165b7830 0000000000000046 0000000000000000 0000000000000000
      0000000000013180 0000000000000000 ffff880851fc10f8 ffff88082d4e0c00
      ffff88082cb7fab8 ffff8808165b7fd8 000000000000fb88 ffff88082cb7fab8
      Call Trace:
      [<ffffffff814fc9fe>] __mutex_lock_slowpath+0x13e/0x180
      [<ffffffff814fc89b>] mutex_lock+0x2b/0x50
      [<ffffffffa0c2814c>] ll_layout_refresh+0x26c/0x1080 [lustre]
      [<ffffffff813104bb>] ? mix_pool_bytes_extract+0x16b/0x180
      [<ffffffff81135cf9>] ? zone_statistics+0x99/0xc0
      [<ffffffffa059e007>] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
      [<ffffffffa0c51230>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
      [<ffffffffa08b7450>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
      [<ffffffffa06dbba1>] ? cl_io_slice_add+0xc1/0x190 [obdclass]
      [<ffffffffa0c78410>] vvp_io_init+0x340/0x490 [lustre]
      [<ffffffffa05a11aa>] ? cfs_hash_find_or_add+0x9a/0x190 [libcfs]
      [<ffffffffa06daff8>] cl_io_init0+0x98/0x160 [obdclass]
      [<ffffffffa06ddc14>] cl_io_init+0x64/0xe0 [obdclass]
      [<ffffffffa0c1894d>] cl_sync_file_range+0x12d/0x500 [lustre]
      [<ffffffffa0c46cac>] ll_writepages+0x9c/0x220 [lustre]
      [<ffffffff81128d81>] do_writepages+0x21/0x40
      [<ffffffff811a43bd>] writeback_single_inode+0xdd/0x290
      [<ffffffff811a47ce>] writeback_sb_inodes+0xce/0x180
      [<ffffffff811a492b>] writeback_inodes_wb+0xab/0x1b0
      [<ffffffff811a4ccb>] wb_writeback+0x29b/0x3f0
      [<ffffffff814fb3a0>] ? thread_return+0x4e/0x76e
      [<ffffffff8107eb42>] ? del_timer_sync+0x22/0x30
      [<ffffffff811a4fb9>] wb_do_writeback+0x199/0x240
      [<ffffffff811a50c3>] bdi_writeback_task+0x63/0x1b0
      [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
      [<ffffffff811379e0>] ? bdi_start_fn+0x0/0x100
      [<ffffffff81137a66>] bdi_start_fn+0x86/0x100
      [<ffffffff811379e0>] ? bdi_start_fn+0x0/0x100
      [<ffffffff81091d66>] kthread+0x96/0xa0
      [<ffffffff8100c14a>] child_rip+0xa/0x20
      [<ffffffff81091cd0>] ? kthread+0x0/0xa0
      [<ffffffff8100c140>] ? child_rip+0x0/0x20

      It seems copy tool is waiting for md_enqueue(MDS_INODELOCK_LAYOUT). Other processes who are trying to lock lli->lli_layout_mutex will be stuck. This problem won't recover until lock enque times out and client reconnects.

      Attachments

        Issue Links

          Activity

            People

              jay Jinshan Xiong (Inactive)
              lixi Li Xi (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              19 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: