Lustre / LU-3621

during failover testing, statahead hangs


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.0, Lustre 2.1.5
    • Labels: None
    • Environment: RHEL 6.4 clients
    • Severity: 3
    • Rank: 9320

    Description

      KIT has recently been doing failover testing with 2.4.0 clients and 2.1.5 servers. During a tree copy and deletion test, the client hung in the deletion phase, with many statahead stack traces such as:
      Jul 15 15:37:13 ucbn001 kernel: INFO: task ll_sa_71331:90417 blocked for more than 120 seconds.
      Jul 15 15:37:13 ucbn001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Jul 15 15:37:13 ucbn001 kernel: ll_sa_71331 D 000000000000001e 0 90417 2 0x00000080
      Jul 15 15:37:13 ucbn001 kernel: ffff881030ce7b10 0000000000000046 ffff881030ce7a80 ffffffff00000050
      Jul 15 15:37:13 ucbn001 kernel: ffff881030ce7ab0 0000000000000246 0000000000000010 ffff88086aac23c0
      Jul 15 15:37:13 ucbn001 kernel: ffff881042c59af8 ffff881030ce7fd8 000000000000fb88 ffff881042c59af8
      Jul 15 15:37:13 ucbn001 kernel: Call Trace:
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffff81096f8e>] ? prepare_to_wait+0x4e/0x80
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8119c928>] __wait_on_freeing_inode+0x98/0xc0
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffff81096ce0>] ? wake_bit_function+0x0/0x50
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b1f630>] ? ll_test_inode+0x0/0x90 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8119cad4>] find_inode+0x64/0x90
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b1f630>] ? ll_test_inode+0x0/0x90 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8119dc1d>] ifind+0x4d/0xd0
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b1f630>] ? ll_test_inode+0x0/0x90 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8119e099>] iget5_locked+0x59/0x1b0
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b1fe00>] ? ll_set_inode+0x0/0x1a0 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b1fb05>] ll_iget+0x65/0x360 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0bf88d3>] ? lmv_get_lustre_md+0x153/0x3d0 [lmv]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa076d9a1>] ? ldlm_revalidate_lock_handle+0x81/0x250 [ptlrpc]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b13080>] ll_prep_inode+0x6e0/0xf60 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0c0d8c4>] ? lmv_revalidate_lock+0x2b4/0x550 [lmv]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b36a47>] ll_post_statahead+0x2f7/0xa80 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b3b4b8>] ll_statahead_thread+0xd38/0xfa0 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffff81063310>] ? default_wake_function+0x0/0x20
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b3a780>] ? ll_statahead_thread+0x0/0xfa0 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b3a780>] ? ll_statahead_thread+0x0/0xfa0 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b3a780>] ? ll_statahead_thread+0x0/0xfa0 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

      The client hung until the directory was rm'd on a different client, at which point the rm finished on the original client. There don't appear to be any messages in the server logs except failover-related messages.
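
      Since the blocked thread is an ll_sa statahead thread, it may be worth sampling the client's statahead state the next time this reproduces. A minimal sketch, assuming a standard client mount (the instance name matched by llite.* will differ per system):

      # Inspect statahead counters on the affected client (run as root).
      lctl get_param llite.*.statahead_stats
      lctl get_param llite.*.statahead_max

      # As an experiment only: disable statahead and see whether rm still hangs.
      lctl set_param llite.*.statahead_max=0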

      Here is a description of the test from KIT:

      I did two failover tests.

      First, I rebooted oss2 at Jul 15 14:30. During the reboot the file tree was copied multiple times. The delete process started around Jul 15 14:35 on ucbn001. The following message corresponds to the hanging situation:
      Jul 15 15:37:13 ucbn001 kernel: INFO: task ll_sa_71331:90417 blocked for more than 120 seconds.
      Ctrl+C did not stop the rm process. Around Jul 15 15:44 I deleted the rest of the file tree on ucbn006 and the rm process completed.

      Second, I started all the tests again and rebooted mds2 around Jul 15 15:55. This time the delete process started around Jul 15 16:35 on ucbn001. The following message corresponds to the hanging situation:
      Jul 15 16:37:13 ucbn001 kernel: INFO: task rm:91092 blocked for more than 120 seconds.
      This time I waited for half an hour, but the rm did not complete. Around Jul 15 17:00 I deleted the rest of the file tree on ucbn006 and the rm process completed.
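
      For reference, the client-side workload in these tests boils down to roughly the following; the paths and the repeat count here are placeholders, not the actual KIT script:

      # Rough outline of the workload only; /lustre/src-tree, the copy count,
      # and the target names are placeholders.
      for i in $(seq 1 10); do
          cp -r /lustre/src-tree /lustre/copy.$i    # copy phase (server rebooted during this)
      done
      rm -rf /lustre/copy.*                         # delete phase, where the rm hangs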

      Are there any debug logs we should try to get?
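
      For the next attempt, the client-side capture would presumably look something like the following; this is only a sketch, and the debug mask and file names are suggestions open to correction:

      # On the hung client, before clearing the hang:
      lctl set_param debug=-1                        # enable full Lustre debug (temporarily)
      lctl set_param debug_mb=512                    # enlarge the debug buffer
      # (reproduce the hang, or leave this running while it is hung)
      lctl dk /tmp/lustre-debug.$(hostname).log      # dump the Lustre kernel debug log
      echo 1 > /proc/sys/kernel/sysrq
      echo t > /proc/sysrq-trigger                   # dump all task stacks to dmesg
      dmesg > /tmp/dmesg.$(hostname).log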

      Thanks.

      Attachments

        1. mds1.kern (4 kB)
        2. mds2.kern (4 kB)
        3. oss1.kern (4 kB)
        4. oss2.kern (6 kB)
        5. ucbn001.log (30 kB)


          People

            Assignee: Zhenyu Xu (bobijam)
            Reporter: Rajeshwaran Ganesan (rganesan@ddn.com)
            Votes: 0
            Watchers: 7
