Lustre / LU-5196

HSM: client task stuck waiting for mutex in ll_layout_refresh

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • None
    • Affects Version/s: Lustre 2.5.1
    • Severity: 3
    • 14514

    Description

      Internal Cray bug report (from Frank Zago):
      After trying to migrate a file back to Lustre from the TAS backend, I noticed nothing was happening. I pressed Ctrl-C on the TAS copytool (CMM), but it did not exit.

      The kernel has the following trace:

      Lustre: Layout lock feature supported.
      Lustre: Mounted tas01-client
      INFO: task tas_cmm:18139 blocked for more than 120 seconds.
      Tainted: P --------------- 2.6.32-431.17.1.el6.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      tas_cmm D 0000000000000000 0 18139 14601 0x00000080
      ffff8801052c5a58 0000000000000082 0000000000000000 0000000000000000
      ffff880106ef8580 0000000000000000 ffff8801313c70b8 ffff8801389e8800
      ffff88007e070638 ffff8801052c5fd8 000000000000fbc8 ffff88007e070638
      Call Trace:
      [<ffffffff8152935e>] __mutex_lock_slowpath+0x13e/0x180
      [<ffffffff815291fb>] mutex_lock+0x2b/0x50
      [<ffffffffa09dc04c>] ll_layout_refresh+0x25c/0xfe0 [lustre]
      [<ffffffffa067bb1a>] ? ldlm_lock_add_to_lru_nolock+0x4a/0x110 [ptlrpc]
      [<ffffffffa03830d7>] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
      [<ffffffffa0a01040>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
      [<ffffffffa0694550>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      [<ffffffffa04db721>] ? cl_io_slice_add+0xc1/0x190 [obdclass]
      [<ffffffffa0a28240>] vvp_io_init+0x340/0x490 [lustre]
      [<ffffffffa04dab68>] cl_io_init0+0x98/0x160 [obdclass]
      [<ffffffffa04ce995>] ? cl_env_get+0x195/0x350 [obdclass]
      [<ffffffffa04dd794>] cl_io_init+0x64/0xe0 [obdclass]
      [<ffffffffa0a1f751>] cl_glimpse_size0+0x91/0x1d0 [lustre]
      [<ffffffffa09d0855>] ll_inode_revalidate_it+0x1a5/0x1d0 [lustre]
      [<ffffffff81196666>] ? final_putname+0x26/0x50
      [<ffffffffa09d08c9>] ll_getattr_it+0x49/0x170 [lustre]
      [<ffffffffa09d0a27>] ll_getattr+0x37/0x40 [lustre]
      [<ffffffff81227163>] ? security_inode_getattr+0x23/0x30
      [<ffffffff8118e631>] vfs_getattr+0x51/0x80
      [<ffffffff8118e6c4>] vfs_fstatat+0x64/0xa0
      [<ffffffff8118e82b>] vfs_stat+0x1b/0x20
      [<ffffffff8118e854>] sys_newstat+0x24/0x50
      [<ffffffff810e1cc7>] ? audit_syscall_entry+0x1d7/0x200
      [<ffffffff810e1abe>] ? __audit_syscall_exit+0x25e/0x290
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
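
      The hung-task watchdog only emits a trace like the one above after a task has been blocked for 120 seconds. For a task that is already stuck, the same kernel stack can usually be read on demand from /proc/<pid>/stack (typically root-only and dependent on CONFIG_STACKTRACE). The following is a minimal sketch of such a reader, offered only as a debugging aid and not part of any Lustre tooling:

      /*
       * Minimal sketch: print the kernel stack of a (possibly stuck) task by
       * reading /proc/<pid>/stack.  The output has the same form as the
       * hung-task trace above.  Build with: cc -o taskstack taskstack.c
       */
      #include <stdio.h>

      int main(int argc, char **argv)
      {
              char path[64];
              char line[256];
              FILE *f;

              if (argc != 2) {
                      fprintf(stderr, "usage: %s <pid>\n", argv[0]);
                      return 1;
              }
              snprintf(path, sizeof(path), "/proc/%s/stack", argv[1]);
              f = fopen(path, "r");
              if (f == NULL) {
                      perror(path);
                      return 1;
              }
              /* Copy the kernel stack trace to stdout, one frame per line. */
              while (fgets(line, sizeof(line), f) != NULL)
                      fputs(line, stdout);
              fclose(f);
              return 0;
      }

      Running it against the stuck copytool (PID 18139 here) shows where it is blocked without waiting for the next watchdog report.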

          Activity

            pjones Peter Jones added a comment -

            ok. Thanks Cory

            spitzcor Cory Spitz added a comment -

            This issue should be resolved as a dup of LU-4727.


            fzago Frank Zago (Inactive) added a comment -

            Looks like a dup of LU-4727.

            paf Patrick Farrell (Inactive) added a comment -

            My apologies...

            After further investigation, I think we're not going to have enough information to debug this one. I hadn't looked closely enough. The other thread, which is holding the mutex, is waiting for a response from the MDS, and we do not have an MDS dump.

            Here's the stack trace for that other thread:

            PID: 18218 TASK: ffff88013c8b8aa0 CPU: 0 COMMAND: "md5sum"
            #0 [ffff8801152396d8] schedule at ffffffff81527bb0
            #1 [ffff8801152397a0] ldlm_completion_ast at ffffffffa0694a95 [ptlrpc]
            #2 [ffff880115239830] ldlm_cli_enqueue_fini at ffffffffa068efe6 [ptlrpc]
            #3 [ffff8801152398d0] ldlm_cli_enqueue at ffffffffa068f8c5 [ptlrpc]
            #4 [ffff880115239980] mdc_enqueue at ffffffffa0876d1e [mdc]
            #5 [ffff880115239ac0] lmv_enqueue at ffffffffa0b04a84 [lmv]
            #6 [ffff880115239b90] ll_layout_refresh at ffffffffa09dc305 [lustre]
            #7 [ffff880115239cd0] vvp_io_fini at ffffffffa0a27b33 [lustre]
            #8 [ffff880115239d30] vvp_io_read_fini at ffffffffa0a27cec [lustre]
            #9 [ffff880115239d60] cl_io_fini at ffffffffa04dd947 [obdclass]
            #10 [ffff880115239d90] ll_file_io_generic at ffffffffa09cbd27 [lustre]
            #11 [ffff880115239e20] ll_file_aio_read at ffffffffa09ccf8f [lustre]
            #12 [ffff880115239e80] ll_file_read at ffffffffa09cd27c [lustre]
            #13 [ffff880115239ef0] vfs_read at ffffffff81189365
            #14 [ffff880115239f30] sys_read at ffffffff811894a1
            #15 [ffff880115239f80] system_call_fastpath at ffffffff8100b072

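            Taken together, the two traces describe a simple blocking chain: "md5sum" holds the mutex taken in ll_layout_refresh and sleeps in ldlm_completion_ast waiting for the MDS to answer the layout lock enqueue, while "tas_cmm" sleeps in mutex_lock waiting for "md5sum" to finish. The following is a minimal userspace sketch of that chain, not Lustre code; lock_holder, mutex_waiter, late_reply and the 5-second delay are hypothetical stand-ins:

            /*
             * Sketch of the chain above, assuming only what the two stacks show.
             * lock_holder plays "md5sum": it takes the mutex and then waits for a
             * reply.  mutex_waiter plays "tas_cmm": it blocks on the mutex for
             * exactly as long as that reply takes.
             * Build with: cc -o layout_chain layout_chain.c -lpthread
             */
            #include <pthread.h>
            #include <stdbool.h>
            #include <stdio.h>
            #include <unistd.h>

            static pthread_mutex_t layout_mutex = PTHREAD_MUTEX_INITIALIZER;
            static pthread_mutex_t reply_lock   = PTHREAD_MUTEX_INITIALIZER;
            static pthread_cond_t  reply_cond   = PTHREAD_COND_INITIALIZER;
            static bool reply_arrived;

            static void *lock_holder(void *arg)   /* the "md5sum" analogue */
            {
                    pthread_mutex_lock(&layout_mutex);
                    pthread_mutex_lock(&reply_lock);
                    while (!reply_arrived)        /* ldlm_completion_ast analogue */
                            pthread_cond_wait(&reply_cond, &reply_lock);
                    pthread_mutex_unlock(&reply_lock);
                    pthread_mutex_unlock(&layout_mutex);
                    return NULL;
            }

            static void *mutex_waiter(void *arg)  /* the "tas_cmm" analogue */
            {
                    pthread_mutex_lock(&layout_mutex);
                    printf("waiter: got the mutex only after the reply arrived\n");
                    pthread_mutex_unlock(&layout_mutex);
                    return NULL;
            }

            static void *late_reply(void *arg)    /* the MDS reply, delivered late */
            {
                    sleep(5);
                    pthread_mutex_lock(&reply_lock);
                    reply_arrived = true;
                    pthread_cond_signal(&reply_cond);
                    pthread_mutex_unlock(&reply_lock);
                    return NULL;
            }

            int main(void)
            {
                    pthread_t h, w, r;

                    pthread_create(&h, NULL, lock_holder, NULL);
                    sleep(1);                     /* let the holder take the mutex first */
                    pthread_create(&w, NULL, mutex_waiter, NULL);
                    pthread_create(&r, NULL, late_reply, NULL);
                    pthread_join(h, NULL);
                    pthread_join(w, NULL);
                    pthread_join(r, NULL);
                    return 0;
            }

            In the reported hang the reply never arrives, so both tasks remain in uninterruptible sleep, which is why Ctrl-C on the copytool has no effect.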

            paf Patrick Farrell (Inactive) added a comment -

            Dump uploading to ftp.whamcloud.com:/uploads/LU-5196/LU-5196_ll_layout_refresh.tar.gz


            People

              Assignee: wc-triage WC Triage
              Reporter: paf Patrick Farrell (Inactive)
              Votes: 0
              Watchers: 6
