[LU-5816] Silence misleading kernel message"task tgt_recov:XXX blocked for more than 120 seconds" Created: 28/Oct/14  Updated: 04/Aug/15  Resolved: 03/Feb/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.4
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: llnl

Issue Links:
Related
is related to LU-5805 tgt_recov blocked and "waking for gap... Resolved
Severity: 3
Rank (Obsolete): 16310

 Description   

when recovery takes prolonged time due to waiting for all of the clients to reconnect, the recovery task triggers a the hangcheck timer.

There is likely a way to tell the kernel we are fine or otherwise have some sort of a ping happen to avoid this message that unnecessary scares users.



 Comments   
Comment by Andreas Dilger [ 28/Oct/14 ]

Oleg, can you please paste the stack into the bug, so we can see where it is blocked?

Comment by Jodi Levi (Inactive) [ 28/Oct/14 ]

Bob,
Could you have a look at this one and comment please?
Thank you!

Comment by Bob Glossman (Inactive) [ 28/Oct/14 ]

not a lot of info to work with here yet. the "blocked for more than" message comes from check_hung_task() in the kernel when some task is in an uninterruptible sleep for too long. need some clue as to where lustre is sleeping. don't know if there is any kernel internal API to delay or suppress these warnings.

Oh, I see Andreas already requested a stack trace on the hang. I agree that would be very useful info.

Comment by Matt Ezell [ 31/Oct/14 ]

Messages during recovery like

[  844.502040] INFO: task tgt_recov:19764 blocked for more than 120 seconds.
[  844.509756]       Not tainted 2.6.32-431.17.1.el6.wc.x86_64 #1
[  844.516404] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  844.525393] tgt_recov     D 0000000000000003     0 19764      2 0x00000000
[  844.533247]  ffff881fcaed1da0 0000000000000046 0000000000000000 ffff881fcaed1d64
[  844.541824]  0000009100000000 ffff88207fc28800 ffff88011c456880 0000000000000400
[  844.550425]  ffff881fcaecb058 ffff881fcaed1fd8 000000000000fbc8 ffff881fcaecb058
[  844.559077] Call Trace:
[  844.561987]  [<ffffffffa07ca620>] ? check_for_clients+0x0/0x70 [ptlrpc]
[  844.569538]  [<ffffffffa07cbc8d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
[  844.577874]  [<ffffffffa07ca310>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
[  844.587551]  [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
[  844.594898]  [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
[  844.603226]  [<ffffffffa07d2a90>] target_recovery_thread+0x540/0x1920 [ptlrpc]
[  844.611534]  [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
[  844.618690]  [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
[  844.627004]  [<ffffffff8109ab56>] kthread+0x96/0xa0
[  844.632593]  [<ffffffff8100c20a>] child_rip+0xa/0x20
[  844.638274]  [<ffffffff8109aac0>] ? kthread+0x0/0xa0
[  844.643947]  [<ffffffff8100c200>] ? child_rip+0x0/0x20

and

[ 1805.700512] INFO: task tgt_recov:19764 blocked for more than 120 seconds.
[ 1805.708227]       Not tainted 2.6.32-431.17.1.el6.wc.x86_64 #1
[ 1805.714870] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1805.723897] tgt_recov     D 000000000000000b     0 19764      2 0x00000000
[ 1805.731776]  ffff881fcaed1da0 0000000000000046 0000000000000000 ffff883f9ac0ba21
[ 1805.740357]  ffffc900802298f8 ffff88402a8af430 0000000000004ed2 0000000000004ece
[ 1805.748926]  ffff881fcaecb058 ffff881fcaed1fd8 000000000000fbc8 ffff881fcaecb058
[ 1805.757503] Call Trace:
[ 1805.760426]  [<ffffffffa07cea60>] ? check_for_next_transno+0x0/0x590 [ptlrpc]
[ 1805.768553]  [<ffffffffa07cbc8d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
[ 1805.776888]  [<ffffffffa07ca330>] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]
[ 1805.784921]  [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
[ 1805.792266]  [<ffffffffa07d2cba>] target_recovery_thread+0x76a/0x1920 [ptlrpc]
[ 1805.800565]  [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
[ 1805.807732]  [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
[ 1805.816039]  [<ffffffff8109ab56>] kthread+0x96/0xa0
[ 1805.821621]  [<ffffffff8100c20a>] child_rip+0xa/0x20
[ 1805.827287]  [<ffffffff8109aac0>] ? kthread+0x0/0xa0
[ 1805.832961]  [<ffffffff8100c200>] ? child_rip+0x0/0x20

target_recovery_overseer() calls wait_event(), but it may take more than 120 seconds for the event to happen. The solution might be to use wait_event_timeout() in a loop to wake back up after 90% of sysctl_hung_task_timeout_secs has elapsed.

Comment by Hongchao Zhang [ 11/Nov/14 ]

the patch is tracked at http://review.whamcloud.com/#/c/12672/

Comment by Jian Yu [ 16/Dec/14 ]

While testing patch http://review.whamcloud.com/13046 on Lustre b2_5 branch, conf-sanity test 47 hung.

On OSS:

06:13:45:INFO: task tgt_recov:3507 blocked for more than 120 seconds.
06:13:45:      Not tainted 2.6.32-431.29.2.el6_lustre.gbb46ddc.x86_64 #1
06:13:45:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
06:13:45:tgt_recov     D 0000000000000000     0  3507      2 0x00000080
06:13:45: ffff880066fdbda0 0000000000000046 0000000000000000 ffff880074f078c4
06:13:45: 00000053ffffff0a ffffffffa07f79a6 0000000000000064 0008000000010000
06:13:45: ffff880071556638 ffff880066fdbfd8 000000000000fbc8 ffff880071556638
06:13:45:Call Trace:
06:13:45: [<ffffffff8109b2ce>] ? prepare_to_wait+0x4e/0x80
06:13:45: [<ffffffffa073b910>] ? check_for_next_lock+0x0/0x270 [ptlrpc]
06:13:45: [<ffffffffa073cc9d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
06:13:45: [<ffffffffa073b370>] ? exp_lock_replay_healthy+0x0/0x30 [ptlrpc]
06:13:45: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
06:13:45: [<ffffffffa074121e>] target_recovery_thread+0x96e/0x1960 [ptlrpc]
06:13:45: [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
06:13:45: [<ffffffffa07408b0>] ? target_recovery_thread+0x0/0x1960 [ptlrpc]
06:13:45: [<ffffffff8109abf6>] kthread+0x96/0xa0
06:13:45: [<ffffffff8100c20a>] child_rip+0xa/0x20
06:13:45: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
06:13:45: [<ffffffff8100c200>] ? child_rip+0x0/0x20

Maloo report: https://testing.hpdd.intel.com/test_sets/97a07574-8473-11e4-8915-5254006e85c2

Comment by Gerrit Updater [ 03/Feb/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12672/
Subject: LU-5816 target: don't trigger watchdog waiting in recovery
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ba0a1b36870807e8182189bcb08f7b105aff6c57

Comment by Peter Jones [ 03/Feb/15 ]

Landed for 2.7

Comment by Gerrit Updater [ 04/Feb/15 ]

Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/13623
Subject: LU-5816 target: don't trigger watchdog waiting in recovery
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: c1d6ebbaaf7d598622fd7a6319c97bb29ce68805

Generated at Sat Feb 10 01:54:46 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.