Details
Type: Bug
Resolution: Cannot Reproduce
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.1.5
Labels: None
Environment: RHEL 6.4 clients
Severity: 3
Rank (Obsolete): 9320
Description
KIT has recently been doing failover testing with 2.4.0 clients and 2.1.5 servers. During a tree copy and deletion test, the client hung in the deletion phase with a lot of statahead stack traces, like:
Jul 15 15:37:13 ucbn001 kernel: INFO: task ll_sa_71331:90417 blocked for more than 120 seconds.
Jul 15 15:37:13 ucbn001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 15 15:37:13 ucbn001 kernel: ll_sa_71331 D 000000000000001e 0 90417 2 0x00000080
Jul 15 15:37:13 ucbn001 kernel: ffff881030ce7b10 0000000000000046 ffff881030ce7a80 ffffffff00000050
Jul 15 15:37:13 ucbn001 kernel: ffff881030ce7ab0 0000000000000246 0000000000000010 ffff88086aac23c0
Jul 15 15:37:13 ucbn001 kernel: ffff881042c59af8 ffff881030ce7fd8 000000000000fb88 ffff881042c59af8
Jul 15 15:37:13 ucbn001 kernel: Call Trace:
Jul 15 15:37:13 ucbn001 kernel: [<ffffffff81096f8e>] ? prepare_to_wait+0x4e/0x80
Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8119c928>] __wait_on_freeing_inode+0x98/0xc0
Jul 15 15:37:13 ucbn001 kernel: [<ffffffff81096ce0>] ? wake_bit_function+0x0/0x50
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b1f630>] ? ll_test_inode+0x0/0x90 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8119cad4>] find_inode+0x64/0x90
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b1f630>] ? ll_test_inode+0x0/0x90 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8119dc1d>] ifind+0x4d/0xd0
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b1f630>] ? ll_test_inode+0x0/0x90 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8119e099>] iget5_locked+0x59/0x1b0
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b1fe00>] ? ll_set_inode+0x0/0x1a0 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b1fb05>] ll_iget+0x65/0x360 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0bf88d3>] ? lmv_get_lustre_md+0x153/0x3d0 [lmv]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa076d9a1>] ? ldlm_revalidate_lock_handle+0x81/0x250 [ptlrpc]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b13080>] ll_prep_inode+0x6e0/0xf60 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0c0d8c4>] ? lmv_revalidate_lock+0x2b4/0x550 [lmv]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b36a47>] ll_post_statahead+0x2f7/0xa80 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b3b4b8>] ll_statahead_thread+0xd38/0xfa0 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffff81063310>] ? default_wake_function+0x0/0x20
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b3a780>] ? ll_statahead_thread+0x0/0xfa0 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b3a780>] ? ll_statahead_thread+0x0/0xfa0 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b3a780>] ? ll_statahead_thread+0x0/0xfa0 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
The client hung until the directory was rm'd on a different client, at which point the rm finished on the original client. There don't appear to be any messages in the server logs, except failover messages.
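For context on where the statahead thread is stuck: the trace shows iget5_locked() walking the inode hash via ifind()/find_inode(), and when that walk hits an inode that is currently being evicted, the caller sleeps in __wait_on_freeing_inode() and retries. If the eviction never completes, the ll_sa thread blocks indefinitely, which matches both the 120-second hung-task reports and the fact that the rm only finished once the tree was removed from another client. Below is a minimal, hypothetical C sketch of that wait-and-retry loop, paraphrased from the 2.6.32-era VFS rather than the actual RHEL 6.4 sources; find_inode_sketch() and its comments are illustrative only:

#include <linux/fs.h>
#include <linux/list.h>

/* Kernel-internal helper visible in the trace above: sleeps the caller
 * in TASK_UNINTERRUPTIBLE on the inode's state waitqueue until eviction
 * finishes, then returns so the lookup can retry.  (Defined in
 * fs/inode.c; declared here only for the sketch.) */
static void __wait_on_freeing_inode(struct inode *inode);

/* Hypothetical, simplified rendering of the find_inode() hash-chain
 * walk seen in the trace; not the exact RHEL 6.4 code. */
static struct inode *find_inode_sketch(struct super_block *sb,
				       struct hlist_head *head,
				       int (*test)(struct inode *, void *),
				       void *data)
{
	struct hlist_node *node;
	struct inode *inode;

repeat:
	hlist_for_each_entry(inode, node, head, i_hash) {
		if (inode->i_sb != sb)
			continue;
		if (!test(inode, data))	/* e.g. ll_test_inode() */
			continue;
		if (inode->i_state & (I_FREEING | I_WILL_FREE)) {
			/* Inode is mid-eviction: sleep uninterruptibly
			 * and rescan the chain from the top. */
			__wait_on_freeing_inode(inode);
			goto repeat;
		}
		return inode;	/* live match found in cache */
	}
	return NULL;		/* not in cache */
}

Note the uninterruptible sleep in the retry path: if that is indeed where the thread sits, it would also explain why Ctrl+C could not stop the rm in the test description below.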
Here is a description of the test from KIT:
I did two failover tests.

First, I rebooted oss2 at Jul 15 14:30. During the reboot the file tree was copied multiple times. The delete process started around Jul 15 14:35 on ucbn001. The following message corresponds to the hanging situation:
Jul 15 15:37:13 ucbn001 kernel: INFO: task ll_sa_71331:90417 blocked for more than 120 seconds.
Ctrl+C did not stop the rm process. Around Jul 15 15:44 I deleted the rest of the file tree on ucbn006 and the rm process completed.

Second, I started all tests again and rebooted mds2 near Jul 15 15:55. This time the delete process started around Jul 15 16:35 on ucbn001. The following message corresponds to the hanging situation:
Jul 15 16:37:13 ucbn001 kernel: INFO: task rm:91092 blocked for more than 120 seconds.
This time I waited for half an hour but the rm did not complete. Around Jul 15 17:00 I deleted the rest of the file tree on ucbn006 and the rm process completed.
Are there any debug logs we should try to get?
Thanks.