Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-3621

during failover testing, statahead hangs

    XMLWordPrintable

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.4.0, Lustre 2.1.5
    • None
    • RHEL 6.4 clients
    • 3
    • 9320

    Description

      KIT has recently been doing failover testing with 2.4.0 clients and 2.1.5 servers. During a tree copy and deletion test, the client hung in the deletion phase with a lot of statahead stack traces, like:
      Jul 15 15:37:13 ucbn001 kernel: INFO: task ll_sa_71331:90417 blocked for more than 120 seconds.
      Jul 15 15:37:13 ucbn001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Jul 15 15:37:13 ucbn001 kernel: ll_sa_71331 D 000000000000001e 0 90417 2 0x00000080
      Jul 15 15:37:13 ucbn001 kernel: ffff881030ce7b10 0000000000000046 ffff881030ce7a80 ffffffff00000050
      Jul 15 15:37:13 ucbn001 kernel: ffff881030ce7ab0 0000000000000246 0000000000000010 ffff88086aac23c0
      Jul 15 15:37:13 ucbn001 kernel: ffff881042c59af8 ffff881030ce7fd8 000000000000fb88 ffff881042c59af8
      Jul 15 15:37:13 ucbn001 kernel: Call Trace:
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffff81096f8e>] ? prepare_to_wait+0x4e/0x80
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8119c928>] __wait_on_freeing_inode+0x98/0xc0
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffff81096ce0>] ? wake_bit_function+0x0/0x50
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b1f630>] ? ll_test_inode+0x0/0x90 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8119cad4>] find_inode+0x64/0x90
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b1f630>] ? ll_test_inode+0x0/0x90 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8119dc1d>] ifind+0x4d/0xd0
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b1f630>] ? ll_test_inode+0x0/0x90 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8119e099>] iget5_locked+0x59/0x1b0
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b1fe00>] ? ll_set_inode+0x0/0x1a0 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b1fb05>] ll_iget+0x65/0x360 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0bf88d3>] ? lmv_get_lustre_md+0x153/0x3d0 [lmv]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa076d9a1>] ? ldlm_revalidate_lock_handle+0x81/0x250 [ptlrpc]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b13080>] ll_prep_inode+0x6e0/0xf60 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0c0d8c4>] ? lmv_revalidate_lock+0x2b4/0x550 [lmv]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b36a47>] ll_post_statahead+0x2f7/0xa80 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b3b4b8>] ll_statahead_thread+0xd38/0xfa0 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffff81063310>] ? default_wake_function+0x0/0x20
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b3a780>] ? ll_statahead_thread+0x0/0xfa0 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b3a780>] ? ll_statahead_thread+0x0/0xfa0 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b3a780>] ? ll_statahead_thread+0x0/0xfa0 [lustre]
      Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

      The client hung until the directory was rm'd on a different client, at which point the rm finished on the original client. There don't appear to be any messages in the server logs, except failover messages.

      Here is a description of the test from KIT:

      I did 2 failover tests.

      First I rebooted oss2 at Jul 15 14:30. During the reboot the file tree
      was copied multiple times. The delete process started around
      Jul 15 14:35 on ucbn001. The following message corresponds to the
      hanging situation:
      Jul 15 15:37:13 ucbn001 kernel: INFO: task ll_sa_71331:90417 blocked for
      more than 120 seconds.
      Ctrl+C did not stop the rm process. Around Jul 15 15:44 I deleted the
      rest of the file tree on ucbn006 and the rm process completed.

      Second I started all tests again and rebooted mds2 near Jul 15 15:55.
      This time the delete process started around Jul 15 16:35 on ucbn001.
      The following message corresponds to the hanging situation:
      Jul 15 16:37:13 ucbn001 kernel: INFO: task rm:91092 blocked for more
      than 120 seconds.
      This time I waited for half an hour but the rm did not complete.
      Around Jul 15 17:00 I deleted the rest of the file tree on ucbn006
      and the rm process completed.

      Are there any debug logs we should try to get?

      Thanks.

      Attachments

        1. ucbn001.log
          30 kB
        2. oss2.kern
          6 kB
        3. oss1.kern
          4 kB
        4. mds2.kern
          4 kB
        5. mds1.kern
          4 kB

        Activity

          People

            bobijam Zhenyu Xu
            rganesan@ddn.com Rajeshwaran Ganesan
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: