[LU-5816] Silence misleading kernel message"task tgt_recov:XXX blocked for more than 120 seconds" Created: 28/Oct/14 Updated: 04/Aug/15 Resolved: 03/Feb/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.4 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Oleg Drokin | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 16310 | ||||||||
| Description |
|
when recovery takes prolonged time due to waiting for all of the clients to reconnect, the recovery task triggers a the hangcheck timer. There is likely a way to tell the kernel we are fine or otherwise have some sort of a ping happen to avoid this message that unnecessary scares users. |
| Comments |
| Comment by Andreas Dilger [ 28/Oct/14 ] |
|
Oleg, can you please paste the stack into the bug, so we can see where it is blocked? |
| Comment by Jodi Levi (Inactive) [ 28/Oct/14 ] |
|
Bob, |
| Comment by Bob Glossman (Inactive) [ 28/Oct/14 ] |
|
not a lot of info to work with here yet. the "blocked for more than" message comes from check_hung_task() in the kernel when some task is in an uninterruptible sleep for too long. need some clue as to where lustre is sleeping. don't know if there is any kernel internal API to delay or suppress these warnings. Oh, I see Andreas already requested a stack trace on the hang. I agree that would be very useful info. |
| Comment by Matt Ezell [ 31/Oct/14 ] |
|
Messages during recovery like [ 844.502040] INFO: task tgt_recov:19764 blocked for more than 120 seconds. [ 844.509756] Not tainted 2.6.32-431.17.1.el6.wc.x86_64 #1 [ 844.516404] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 844.525393] tgt_recov D 0000000000000003 0 19764 2 0x00000000 [ 844.533247] ffff881fcaed1da0 0000000000000046 0000000000000000 ffff881fcaed1d64 [ 844.541824] 0000009100000000 ffff88207fc28800 ffff88011c456880 0000000000000400 [ 844.550425] ffff881fcaecb058 ffff881fcaed1fd8 000000000000fbc8 ffff881fcaecb058 [ 844.559077] Call Trace: [ 844.561987] [<ffffffffa07ca620>] ? check_for_clients+0x0/0x70 [ptlrpc] [ 844.569538] [<ffffffffa07cbc8d>] target_recovery_overseer+0x9d/0x230 [ptlrpc] [ 844.577874] [<ffffffffa07ca310>] ? exp_connect_healthy+0x0/0x20 [ptlrpc] [ 844.587551] [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40 [ 844.594898] [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc] [ 844.603226] [<ffffffffa07d2a90>] target_recovery_thread+0x540/0x1920 [ptlrpc] [ 844.611534] [<ffffffff81061d12>] ? default_wake_function+0x12/0x20 [ 844.618690] [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc] [ 844.627004] [<ffffffff8109ab56>] kthread+0x96/0xa0 [ 844.632593] [<ffffffff8100c20a>] child_rip+0xa/0x20 [ 844.638274] [<ffffffff8109aac0>] ? kthread+0x0/0xa0 [ 844.643947] [<ffffffff8100c200>] ? child_rip+0x0/0x20 and [ 1805.700512] INFO: task tgt_recov:19764 blocked for more than 120 seconds. [ 1805.708227] Not tainted 2.6.32-431.17.1.el6.wc.x86_64 #1 [ 1805.714870] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1805.723897] tgt_recov D 000000000000000b 0 19764 2 0x00000000 [ 1805.731776] ffff881fcaed1da0 0000000000000046 0000000000000000 ffff883f9ac0ba21 [ 1805.740357] ffffc900802298f8 ffff88402a8af430 0000000000004ed2 0000000000004ece [ 1805.748926] ffff881fcaecb058 ffff881fcaed1fd8 000000000000fbc8 ffff881fcaecb058 [ 1805.757503] Call Trace: [ 1805.760426] [<ffffffffa07cea60>] ? check_for_next_transno+0x0/0x590 [ptlrpc] [ 1805.768553] [<ffffffffa07cbc8d>] target_recovery_overseer+0x9d/0x230 [ptlrpc] [ 1805.776888] [<ffffffffa07ca330>] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc] [ 1805.784921] [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40 [ 1805.792266] [<ffffffffa07d2cba>] target_recovery_thread+0x76a/0x1920 [ptlrpc] [ 1805.800565] [<ffffffff81061d12>] ? default_wake_function+0x12/0x20 [ 1805.807732] [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc] [ 1805.816039] [<ffffffff8109ab56>] kthread+0x96/0xa0 [ 1805.821621] [<ffffffff8100c20a>] child_rip+0xa/0x20 [ 1805.827287] [<ffffffff8109aac0>] ? kthread+0x0/0xa0 [ 1805.832961] [<ffffffff8100c200>] ? child_rip+0x0/0x20 target_recovery_overseer() calls wait_event(), but it may take more than 120 seconds for the event to happen. The solution might be to use wait_event_timeout() in a loop to wake back up after 90% of sysctl_hung_task_timeout_secs has elapsed. |
| Comment by Hongchao Zhang [ 11/Nov/14 ] |
|
the patch is tracked at http://review.whamcloud.com/#/c/12672/ |
| Comment by Jian Yu [ 16/Dec/14 ] |
|
While testing patch http://review.whamcloud.com/13046 on Lustre b2_5 branch, conf-sanity test 47 hung. On OSS: 06:13:45:INFO: task tgt_recov:3507 blocked for more than 120 seconds. 06:13:45: Not tainted 2.6.32-431.29.2.el6_lustre.gbb46ddc.x86_64 #1 06:13:45:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 06:13:45:tgt_recov D 0000000000000000 0 3507 2 0x00000080 06:13:45: ffff880066fdbda0 0000000000000046 0000000000000000 ffff880074f078c4 06:13:45: 00000053ffffff0a ffffffffa07f79a6 0000000000000064 0008000000010000 06:13:45: ffff880071556638 ffff880066fdbfd8 000000000000fbc8 ffff880071556638 06:13:45:Call Trace: 06:13:45: [<ffffffff8109b2ce>] ? prepare_to_wait+0x4e/0x80 06:13:45: [<ffffffffa073b910>] ? check_for_next_lock+0x0/0x270 [ptlrpc] 06:13:45: [<ffffffffa073cc9d>] target_recovery_overseer+0x9d/0x230 [ptlrpc] 06:13:45: [<ffffffffa073b370>] ? exp_lock_replay_healthy+0x0/0x30 [ptlrpc] 06:13:45: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40 06:13:45: [<ffffffffa074121e>] target_recovery_thread+0x96e/0x1960 [ptlrpc] 06:13:45: [<ffffffff81061d12>] ? default_wake_function+0x12/0x20 06:13:45: [<ffffffffa07408b0>] ? target_recovery_thread+0x0/0x1960 [ptlrpc] 06:13:45: [<ffffffff8109abf6>] kthread+0x96/0xa0 06:13:45: [<ffffffff8100c20a>] child_rip+0xa/0x20 06:13:45: [<ffffffff8109ab60>] ? kthread+0x0/0xa0 06:13:45: [<ffffffff8100c200>] ? child_rip+0x0/0x20 Maloo report: https://testing.hpdd.intel.com/test_sets/97a07574-8473-11e4-8915-5254006e85c2 |
| Comment by Gerrit Updater [ 03/Feb/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12672/ |
| Comment by Peter Jones [ 03/Feb/15 ] |
|
Landed for 2.7 |
| Comment by Gerrit Updater [ 04/Feb/15 ] |
|
Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/13623 |