Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-5816

Silence misleading kernel message"task tgt_recov:XXX blocked for more than 120 seconds"

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.7.0
    • Lustre 2.5.4
    • 3
    • 16310

    Description

      when recovery takes prolonged time due to waiting for all of the clients to reconnect, the recovery task triggers a the hangcheck timer.

      There is likely a way to tell the kernel we are fine or otherwise have some sort of a ping happen to avoid this message that unnecessary scares users.

      Attachments

        Issue Links

          Activity

            [LU-5816] Silence misleading kernel message"task tgt_recov:XXX blocked for more than 120 seconds"

            Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/13623
            Subject: LU-5816 target: don't trigger watchdog waiting in recovery
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: c1d6ebbaaf7d598622fd7a6319c97bb29ce68805

            gerrit Gerrit Updater added a comment - Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/13623 Subject: LU-5816 target: don't trigger watchdog waiting in recovery Project: fs/lustre-release Branch: b2_5 Current Patch Set: 1 Commit: c1d6ebbaaf7d598622fd7a6319c97bb29ce68805
            pjones Peter Jones added a comment -

            Landed for 2.7

            pjones Peter Jones added a comment - Landed for 2.7

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12672/
            Subject: LU-5816 target: don't trigger watchdog waiting in recovery
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ba0a1b36870807e8182189bcb08f7b105aff6c57

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12672/ Subject: LU-5816 target: don't trigger watchdog waiting in recovery Project: fs/lustre-release Branch: master Current Patch Set: Commit: ba0a1b36870807e8182189bcb08f7b105aff6c57
            yujian Jian Yu added a comment -

            While testing patch http://review.whamcloud.com/13046 on Lustre b2_5 branch, conf-sanity test 47 hung.

            On OSS:

            06:13:45:INFO: task tgt_recov:3507 blocked for more than 120 seconds.
            06:13:45:      Not tainted 2.6.32-431.29.2.el6_lustre.gbb46ddc.x86_64 #1
            06:13:45:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            06:13:45:tgt_recov     D 0000000000000000     0  3507      2 0x00000080
            06:13:45: ffff880066fdbda0 0000000000000046 0000000000000000 ffff880074f078c4
            06:13:45: 00000053ffffff0a ffffffffa07f79a6 0000000000000064 0008000000010000
            06:13:45: ffff880071556638 ffff880066fdbfd8 000000000000fbc8 ffff880071556638
            06:13:45:Call Trace:
            06:13:45: [<ffffffff8109b2ce>] ? prepare_to_wait+0x4e/0x80
            06:13:45: [<ffffffffa073b910>] ? check_for_next_lock+0x0/0x270 [ptlrpc]
            06:13:45: [<ffffffffa073cc9d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
            06:13:45: [<ffffffffa073b370>] ? exp_lock_replay_healthy+0x0/0x30 [ptlrpc]
            06:13:45: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
            06:13:45: [<ffffffffa074121e>] target_recovery_thread+0x96e/0x1960 [ptlrpc]
            06:13:45: [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
            06:13:45: [<ffffffffa07408b0>] ? target_recovery_thread+0x0/0x1960 [ptlrpc]
            06:13:45: [<ffffffff8109abf6>] kthread+0x96/0xa0
            06:13:45: [<ffffffff8100c20a>] child_rip+0xa/0x20
            06:13:45: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
            06:13:45: [<ffffffff8100c200>] ? child_rip+0x0/0x20
            

            Maloo report: https://testing.hpdd.intel.com/test_sets/97a07574-8473-11e4-8915-5254006e85c2

            yujian Jian Yu added a comment - While testing patch http://review.whamcloud.com/13046 on Lustre b2_5 branch, conf-sanity test 47 hung. On OSS: 06:13:45:INFO: task tgt_recov:3507 blocked for more than 120 seconds. 06:13:45: Not tainted 2.6.32-431.29.2.el6_lustre.gbb46ddc.x86_64 #1 06:13:45:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 06:13:45:tgt_recov D 0000000000000000 0 3507 2 0x00000080 06:13:45: ffff880066fdbda0 0000000000000046 0000000000000000 ffff880074f078c4 06:13:45: 00000053ffffff0a ffffffffa07f79a6 0000000000000064 0008000000010000 06:13:45: ffff880071556638 ffff880066fdbfd8 000000000000fbc8 ffff880071556638 06:13:45:Call Trace: 06:13:45: [<ffffffff8109b2ce>] ? prepare_to_wait+0x4e/0x80 06:13:45: [<ffffffffa073b910>] ? check_for_next_lock+0x0/0x270 [ptlrpc] 06:13:45: [<ffffffffa073cc9d>] target_recovery_overseer+0x9d/0x230 [ptlrpc] 06:13:45: [<ffffffffa073b370>] ? exp_lock_replay_healthy+0x0/0x30 [ptlrpc] 06:13:45: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40 06:13:45: [<ffffffffa074121e>] target_recovery_thread+0x96e/0x1960 [ptlrpc] 06:13:45: [<ffffffff81061d12>] ? default_wake_function+0x12/0x20 06:13:45: [<ffffffffa07408b0>] ? target_recovery_thread+0x0/0x1960 [ptlrpc] 06:13:45: [<ffffffff8109abf6>] kthread+0x96/0xa0 06:13:45: [<ffffffff8100c20a>] child_rip+0xa/0x20 06:13:45: [<ffffffff8109ab60>] ? kthread+0x0/0xa0 06:13:45: [<ffffffff8100c200>] ? child_rip+0x0/0x20 Maloo report: https://testing.hpdd.intel.com/test_sets/97a07574-8473-11e4-8915-5254006e85c2
            hongchao.zhang Hongchao Zhang added a comment - the patch is tracked at http://review.whamcloud.com/#/c/12672/
            ezell Matt Ezell added a comment -

            Messages during recovery like

            [  844.502040] INFO: task tgt_recov:19764 blocked for more than 120 seconds.
            [  844.509756]       Not tainted 2.6.32-431.17.1.el6.wc.x86_64 #1
            [  844.516404] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [  844.525393] tgt_recov     D 0000000000000003     0 19764      2 0x00000000
            [  844.533247]  ffff881fcaed1da0 0000000000000046 0000000000000000 ffff881fcaed1d64
            [  844.541824]  0000009100000000 ffff88207fc28800 ffff88011c456880 0000000000000400
            [  844.550425]  ffff881fcaecb058 ffff881fcaed1fd8 000000000000fbc8 ffff881fcaecb058
            [  844.559077] Call Trace:
            [  844.561987]  [<ffffffffa07ca620>] ? check_for_clients+0x0/0x70 [ptlrpc]
            [  844.569538]  [<ffffffffa07cbc8d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
            [  844.577874]  [<ffffffffa07ca310>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
            [  844.587551]  [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
            [  844.594898]  [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
            [  844.603226]  [<ffffffffa07d2a90>] target_recovery_thread+0x540/0x1920 [ptlrpc]
            [  844.611534]  [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
            [  844.618690]  [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
            [  844.627004]  [<ffffffff8109ab56>] kthread+0x96/0xa0
            [  844.632593]  [<ffffffff8100c20a>] child_rip+0xa/0x20
            [  844.638274]  [<ffffffff8109aac0>] ? kthread+0x0/0xa0
            [  844.643947]  [<ffffffff8100c200>] ? child_rip+0x0/0x20
            

            and

            [ 1805.700512] INFO: task tgt_recov:19764 blocked for more than 120 seconds.
            [ 1805.708227]       Not tainted 2.6.32-431.17.1.el6.wc.x86_64 #1
            [ 1805.714870] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [ 1805.723897] tgt_recov     D 000000000000000b     0 19764      2 0x00000000
            [ 1805.731776]  ffff881fcaed1da0 0000000000000046 0000000000000000 ffff883f9ac0ba21
            [ 1805.740357]  ffffc900802298f8 ffff88402a8af430 0000000000004ed2 0000000000004ece
            [ 1805.748926]  ffff881fcaecb058 ffff881fcaed1fd8 000000000000fbc8 ffff881fcaecb058
            [ 1805.757503] Call Trace:
            [ 1805.760426]  [<ffffffffa07cea60>] ? check_for_next_transno+0x0/0x590 [ptlrpc]
            [ 1805.768553]  [<ffffffffa07cbc8d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
            [ 1805.776888]  [<ffffffffa07ca330>] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]
            [ 1805.784921]  [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
            [ 1805.792266]  [<ffffffffa07d2cba>] target_recovery_thread+0x76a/0x1920 [ptlrpc]
            [ 1805.800565]  [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
            [ 1805.807732]  [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
            [ 1805.816039]  [<ffffffff8109ab56>] kthread+0x96/0xa0
            [ 1805.821621]  [<ffffffff8100c20a>] child_rip+0xa/0x20
            [ 1805.827287]  [<ffffffff8109aac0>] ? kthread+0x0/0xa0
            [ 1805.832961]  [<ffffffff8100c200>] ? child_rip+0x0/0x20
            

            target_recovery_overseer() calls wait_event(), but it may take more than 120 seconds for the event to happen. The solution might be to use wait_event_timeout() in a loop to wake back up after 90% of sysctl_hung_task_timeout_secs has elapsed.

            ezell Matt Ezell added a comment - Messages during recovery like [ 844.502040] INFO: task tgt_recov:19764 blocked for more than 120 seconds. [ 844.509756] Not tainted 2.6.32-431.17.1.el6.wc.x86_64 #1 [ 844.516404] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 844.525393] tgt_recov D 0000000000000003 0 19764 2 0x00000000 [ 844.533247] ffff881fcaed1da0 0000000000000046 0000000000000000 ffff881fcaed1d64 [ 844.541824] 0000009100000000 ffff88207fc28800 ffff88011c456880 0000000000000400 [ 844.550425] ffff881fcaecb058 ffff881fcaed1fd8 000000000000fbc8 ffff881fcaecb058 [ 844.559077] Call Trace: [ 844.561987] [<ffffffffa07ca620>] ? check_for_clients+0x0/0x70 [ptlrpc] [ 844.569538] [<ffffffffa07cbc8d>] target_recovery_overseer+0x9d/0x230 [ptlrpc] [ 844.577874] [<ffffffffa07ca310>] ? exp_connect_healthy+0x0/0x20 [ptlrpc] [ 844.587551] [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40 [ 844.594898] [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc] [ 844.603226] [<ffffffffa07d2a90>] target_recovery_thread+0x540/0x1920 [ptlrpc] [ 844.611534] [<ffffffff81061d12>] ? default_wake_function+0x12/0x20 [ 844.618690] [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc] [ 844.627004] [<ffffffff8109ab56>] kthread+0x96/0xa0 [ 844.632593] [<ffffffff8100c20a>] child_rip+0xa/0x20 [ 844.638274] [<ffffffff8109aac0>] ? kthread+0x0/0xa0 [ 844.643947] [<ffffffff8100c200>] ? child_rip+0x0/0x20 and [ 1805.700512] INFO: task tgt_recov:19764 blocked for more than 120 seconds. [ 1805.708227] Not tainted 2.6.32-431.17.1.el6.wc.x86_64 #1 [ 1805.714870] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1805.723897] tgt_recov D 000000000000000b 0 19764 2 0x00000000 [ 1805.731776] ffff881fcaed1da0 0000000000000046 0000000000000000 ffff883f9ac0ba21 [ 1805.740357] ffffc900802298f8 ffff88402a8af430 0000000000004ed2 0000000000004ece [ 1805.748926] ffff881fcaecb058 ffff881fcaed1fd8 000000000000fbc8 ffff881fcaecb058 [ 1805.757503] Call Trace: [ 1805.760426] [<ffffffffa07cea60>] ? check_for_next_transno+0x0/0x590 [ptlrpc] [ 1805.768553] [<ffffffffa07cbc8d>] target_recovery_overseer+0x9d/0x230 [ptlrpc] [ 1805.776888] [<ffffffffa07ca330>] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc] [ 1805.784921] [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40 [ 1805.792266] [<ffffffffa07d2cba>] target_recovery_thread+0x76a/0x1920 [ptlrpc] [ 1805.800565] [<ffffffff81061d12>] ? default_wake_function+0x12/0x20 [ 1805.807732] [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc] [ 1805.816039] [<ffffffff8109ab56>] kthread+0x96/0xa0 [ 1805.821621] [<ffffffff8100c20a>] child_rip+0xa/0x20 [ 1805.827287] [<ffffffff8109aac0>] ? kthread+0x0/0xa0 [ 1805.832961] [<ffffffff8100c200>] ? child_rip+0x0/0x20 target_recovery_overseer() calls wait_event(), but it may take more than 120 seconds for the event to happen. The solution might be to use wait_event_timeout() in a loop to wake back up after 90% of sysctl_hung_task_timeout_secs has elapsed.
            bogl Bob Glossman (Inactive) added a comment - - edited

            not a lot of info to work with here yet. the "blocked for more than" message comes from check_hung_task() in the kernel when some task is in an uninterruptible sleep for too long. need some clue as to where lustre is sleeping. don't know if there is any kernel internal API to delay or suppress these warnings.

            Oh, I see Andreas already requested a stack trace on the hang. I agree that would be very useful info.

            bogl Bob Glossman (Inactive) added a comment - - edited not a lot of info to work with here yet. the "blocked for more than" message comes from check_hung_task() in the kernel when some task is in an uninterruptible sleep for too long. need some clue as to where lustre is sleeping. don't know if there is any kernel internal API to delay or suppress these warnings. Oh, I see Andreas already requested a stack trace on the hang. I agree that would be very useful info.

            Bob,
            Could you have a look at this one and comment please?
            Thank you!

            jlevi Jodi Levi (Inactive) added a comment - Bob, Could you have a look at this one and comment please? Thank you!

            Oleg, can you please paste the stack into the bug, so we can see where it is blocked?

            adilger Andreas Dilger added a comment - Oleg, can you please paste the stack into the bug, so we can see where it is blocked?

            People

              hongchao.zhang Hongchao Zhang
              green Oleg Drokin
              Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: