[LU-3621] during failover testing, statahead hangs Created: 23/Jul/13  Updated: 08/Aug/16  Resolved: 08/Aug/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.1.5
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Rajeshwaran Ganesan Assignee: Zhenyu Xu
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

RHEL 6.4 clients


Attachments: mds1.kern, mds2.kern, oss1.kern, oss2.kern, ucbn001.log
Issue Links:
Related
Severity: 3
Rank (Obsolete): 9320

 Description   

KIT has recently been doing failover testing with Lustre 2.4.0 clients and 2.1.5 servers. During a tree copy and deletion test, the client hung in the deletion phase and logged many statahead stack traces like the following:
Jul 15 15:37:13 ucbn001 kernel: INFO: task ll_sa_71331:90417 blocked for more than 120 seconds.
Jul 15 15:37:13 ucbn001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 15 15:37:13 ucbn001 kernel: ll_sa_71331 D 000000000000001e 0 90417 2 0x00000080
Jul 15 15:37:13 ucbn001 kernel: ffff881030ce7b10 0000000000000046 ffff881030ce7a80 ffffffff00000050
Jul 15 15:37:13 ucbn001 kernel: ffff881030ce7ab0 0000000000000246 0000000000000010 ffff88086aac23c0
Jul 15 15:37:13 ucbn001 kernel: ffff881042c59af8 ffff881030ce7fd8 000000000000fb88 ffff881042c59af8
Jul 15 15:37:13 ucbn001 kernel: Call Trace:
Jul 15 15:37:13 ucbn001 kernel: [<ffffffff81096f8e>] ? prepare_to_wait+0x4e/0x80
Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8119c928>] __wait_on_freeing_inode+0x98/0xc0
Jul 15 15:37:13 ucbn001 kernel: [<ffffffff81096ce0>] ? wake_bit_function+0x0/0x50
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b1f630>] ? ll_test_inode+0x0/0x90 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8119cad4>] find_inode+0x64/0x90
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b1f630>] ? ll_test_inode+0x0/0x90 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8119dc1d>] ifind+0x4d/0xd0
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b1f630>] ? ll_test_inode+0x0/0x90 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8119e099>] iget5_locked+0x59/0x1b0
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b1fe00>] ? ll_set_inode+0x0/0x1a0 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b1fb05>] ll_iget+0x65/0x360 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0bf88d3>] ? lmv_get_lustre_md+0x153/0x3d0 [lmv]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa076d9a1>] ? ldlm_revalidate_lock_handle+0x81/0x250 [ptlrpc]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b13080>] ll_prep_inode+0x6e0/0xf60 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0c0d8c4>] ? lmv_revalidate_lock+0x2b4/0x550 [lmv]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b36a47>] ll_post_statahead+0x2f7/0xa80 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b3b4b8>] ll_statahead_thread+0xd38/0xfa0 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffff81063310>] ? default_wake_function+0x0/0x20
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b3a780>] ? ll_statahead_thread+0x0/0xfa0 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b3a780>] ? ll_statahead_thread+0x0/0xfa0 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffffa0b3a780>] ? ll_statahead_thread+0x0/0xfa0 [lustre]
Jul 15 15:37:13 ucbn001 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

The client hung until the directory was removed from a different client, at which point the rm on the original client completed. There do not appear to be any messages in the server logs other than failover messages.

Here is a description of the test from KIT:

I ran two failover tests.

First, I rebooted oss2 at Jul 15 14:30. During the reboot the file tree was copied multiple times. The delete process started around Jul 15 14:35 on ucbn001. The following message corresponds to the hanging situation:
Jul 15 15:37:13 ucbn001 kernel: INFO: task ll_sa_71331:90417 blocked for more than 120 seconds.
Ctrl+C did not stop the rm process. Around Jul 15 15:44 I deleted the rest of the file tree on ucbn006 and the rm process completed.

Second, I started all tests again and rebooted mds2 near Jul 15 15:55. This time the delete process started around Jul 15 16:35 on ucbn001. The following message corresponds to the hanging situation:
Jul 15 16:37:13 ucbn001 kernel: INFO: task rm:91092 blocked for more than 120 seconds.
This time I waited for half an hour but the rm did not complete. Around Jul 15 17:00 I deleted the rest of the file tree on ucbn006 and the rm process completed.
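
For reference, the client-side part of this procedure amounts to roughly the following sketch; the mount point and directory names are placeholders (not taken from the ticket), and the oss2/mds2 reboot is triggered separately on the server side:

# Copy phase, run while the server is being rebooted (paths are hypothetical)
for i in 1 2 3 4 5; do
    cp -a /lustre/source_tree /lustre/testdir/copy_$i
done

# Delete phase, where the client hung
rm -rf /lustre/testdir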

Are there any debug logs we should try to get?

Thanks.
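
For what it is worth, a common way to capture client-side Lustre debug logs around a reproduction is roughly the following (run as root on the hung client; the debug mask and output path are examples only, not something already agreed in this ticket):

# Grow the debug buffer and add verbose debug flags (example mask)
lctl set_param debug_mb=512
lctl set_param debug="+vfstrace +rpctrace +dlmtrace"
lctl clear

# ... reproduce the hang ...

# Dump the kernel debug buffer to a file for upload
lctl dk /tmp/lustre-debug.$(hostname).log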



 Comments   
Comment by Peter Jones [ 24/Jul/13 ]

Bobijam

Could you please help with this one?

Thanks

Peter

Comment by Kit Westneat (Inactive) [ 26/Jul/13 ]

I was wondering what the status of this was. Is there anything else we should get?

Thanks

Comment by Zhenyu Xu [ 26/Jul/13 ]

Can you get stack traces for all processes on ucbn001?
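
For reference, an all-task stack dump of this kind is usually collected with sysrq-t; a minimal sketch, run as root on ucbn001, with the output ending up in the kernel log:

# Enable the sysrq interface if it is not already on
echo 1 > /proc/sys/kernel/sysrq
# Dump every task's stack into the kernel ring buffer
echo t > /proc/sysrq-trigger
# Save the kernel log for upload
dmesg > /tmp/sysrq-t.$(hostname).txt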

Comment by Kit Westneat (Inactive) [ 26/Jul/13 ]

Hi Xu,

I have asked the customer to reproduce the issue and capture sysrq-t.

Thanks.

Comment by Chris Hunter (Inactive) [ 08/Aug/16 ]

Please close this case. The customer upgraded to b2.5 and was unable to reproduce the issue after the upgrade.

Comment by Peter Jones [ 08/Aug/16 ]

ok - thanks Chris
