Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-5816

Silence misleading kernel message"task tgt_recov:XXX blocked for more than 120 seconds"

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.7.0
    • Lustre 2.5.4
    • 3
    • 16310

    Description

      when recovery takes prolonged time due to waiting for all of the clients to reconnect, the recovery task triggers a the hangcheck timer.

      There is likely a way to tell the kernel we are fine or otherwise have some sort of a ping happen to avoid this message that unnecessary scares users.

      Attachments

        Issue Links

          Activity

            [LU-5816] Silence misleading kernel message"task tgt_recov:XXX blocked for more than 120 seconds"

            Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/13623
            Subject: LU-5816 target: don't trigger watchdog waiting in recovery
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: c1d6ebbaaf7d598622fd7a6319c97bb29ce68805

            gerrit Gerrit Updater added a comment - Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/13623 Subject: LU-5816 target: don't trigger watchdog waiting in recovery Project: fs/lustre-release Branch: b2_5 Current Patch Set: 1 Commit: c1d6ebbaaf7d598622fd7a6319c97bb29ce68805
            pjones Peter Jones added a comment -

            Landed for 2.7

            pjones Peter Jones added a comment - Landed for 2.7

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12672/
            Subject: LU-5816 target: don't trigger watchdog waiting in recovery
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ba0a1b36870807e8182189bcb08f7b105aff6c57

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12672/ Subject: LU-5816 target: don't trigger watchdog waiting in recovery Project: fs/lustre-release Branch: master Current Patch Set: Commit: ba0a1b36870807e8182189bcb08f7b105aff6c57
            yujian Jian Yu added a comment -

            While testing patch http://review.whamcloud.com/13046 on Lustre b2_5 branch, conf-sanity test 47 hung.

            On OSS:

            06:13:45:INFO: task tgt_recov:3507 blocked for more than 120 seconds.
            06:13:45:      Not tainted 2.6.32-431.29.2.el6_lustre.gbb46ddc.x86_64 #1
            06:13:45:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            06:13:45:tgt_recov     D 0000000000000000     0  3507      2 0x00000080
            06:13:45: ffff880066fdbda0 0000000000000046 0000000000000000 ffff880074f078c4
            06:13:45: 00000053ffffff0a ffffffffa07f79a6 0000000000000064 0008000000010000
            06:13:45: ffff880071556638 ffff880066fdbfd8 000000000000fbc8 ffff880071556638
            06:13:45:Call Trace:
            06:13:45: [<ffffffff8109b2ce>] ? prepare_to_wait+0x4e/0x80
            06:13:45: [<ffffffffa073b910>] ? check_for_next_lock+0x0/0x270 [ptlrpc]
            06:13:45: [<ffffffffa073cc9d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
            06:13:45: [<ffffffffa073b370>] ? exp_lock_replay_healthy+0x0/0x30 [ptlrpc]
            06:13:45: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
            06:13:45: [<ffffffffa074121e>] target_recovery_thread+0x96e/0x1960 [ptlrpc]
            06:13:45: [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
            06:13:45: [<ffffffffa07408b0>] ? target_recovery_thread+0x0/0x1960 [ptlrpc]
            06:13:45: [<ffffffff8109abf6>] kthread+0x96/0xa0
            06:13:45: [<ffffffff8100c20a>] child_rip+0xa/0x20
            06:13:45: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
            06:13:45: [<ffffffff8100c200>] ? child_rip+0x0/0x20
            

            Maloo report: https://testing.hpdd.intel.com/test_sets/97a07574-8473-11e4-8915-5254006e85c2

            yujian Jian Yu added a comment - While testing patch http://review.whamcloud.com/13046 on Lustre b2_5 branch, conf-sanity test 47 hung. On OSS: 06:13:45:INFO: task tgt_recov:3507 blocked for more than 120 seconds. 06:13:45: Not tainted 2.6.32-431.29.2.el6_lustre.gbb46ddc.x86_64 #1 06:13:45:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 06:13:45:tgt_recov D 0000000000000000 0 3507 2 0x00000080 06:13:45: ffff880066fdbda0 0000000000000046 0000000000000000 ffff880074f078c4 06:13:45: 00000053ffffff0a ffffffffa07f79a6 0000000000000064 0008000000010000 06:13:45: ffff880071556638 ffff880066fdbfd8 000000000000fbc8 ffff880071556638 06:13:45:Call Trace: 06:13:45: [<ffffffff8109b2ce>] ? prepare_to_wait+0x4e/0x80 06:13:45: [<ffffffffa073b910>] ? check_for_next_lock+0x0/0x270 [ptlrpc] 06:13:45: [<ffffffffa073cc9d>] target_recovery_overseer+0x9d/0x230 [ptlrpc] 06:13:45: [<ffffffffa073b370>] ? exp_lock_replay_healthy+0x0/0x30 [ptlrpc] 06:13:45: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40 06:13:45: [<ffffffffa074121e>] target_recovery_thread+0x96e/0x1960 [ptlrpc] 06:13:45: [<ffffffff81061d12>] ? default_wake_function+0x12/0x20 06:13:45: [<ffffffffa07408b0>] ? target_recovery_thread+0x0/0x1960 [ptlrpc] 06:13:45: [<ffffffff8109abf6>] kthread+0x96/0xa0 06:13:45: [<ffffffff8100c20a>] child_rip+0xa/0x20 06:13:45: [<ffffffff8109ab60>] ? kthread+0x0/0xa0 06:13:45: [<ffffffff8100c200>] ? child_rip+0x0/0x20 Maloo report: https://testing.hpdd.intel.com/test_sets/97a07574-8473-11e4-8915-5254006e85c2
            hongchao.zhang Hongchao Zhang added a comment - the patch is tracked at http://review.whamcloud.com/#/c/12672/
            ezell Matt Ezell added a comment -

            Messages during recovery like

            [  844.502040] INFO: task tgt_recov:19764 blocked for more than 120 seconds.
            [  844.509756]       Not tainted 2.6.32-431.17.1.el6.wc.x86_64 #1
            [  844.516404] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [  844.525393] tgt_recov     D 0000000000000003     0 19764      2 0x00000000
            [  844.533247]  ffff881fcaed1da0 0000000000000046 0000000000000000 ffff881fcaed1d64
            [  844.541824]  0000009100000000 ffff88207fc28800 ffff88011c456880 0000000000000400
            [  844.550425]  ffff881fcaecb058 ffff881fcaed1fd8 000000000000fbc8 ffff881fcaecb058
            [  844.559077] Call Trace:
            [  844.561987]  [<ffffffffa07ca620>] ? check_for_clients+0x0/0x70 [ptlrpc]
            [  844.569538]  [<ffffffffa07cbc8d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
            [  844.577874]  [<ffffffffa07ca310>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
            [  844.587551]  [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
            [  844.594898]  [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
            [  844.603226]  [<ffffffffa07d2a90>] target_recovery_thread+0x540/0x1920 [ptlrpc]
            [  844.611534]  [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
            [  844.618690]  [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
            [  844.627004]  [<ffffffff8109ab56>] kthread+0x96/0xa0
            [  844.632593]  [<ffffffff8100c20a>] child_rip+0xa/0x20
            [  844.638274]  [<ffffffff8109aac0>] ? kthread+0x0/0xa0
            [  844.643947]  [<ffffffff8100c200>] ? child_rip+0x0/0x20
            

            and

            [ 1805.700512] INFO: task tgt_recov:19764 blocked for more than 120 seconds.
            [ 1805.708227]       Not tainted 2.6.32-431.17.1.el6.wc.x86_64 #1
            [ 1805.714870] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [ 1805.723897] tgt_recov     D 000000000000000b     0 19764      2 0x00000000
            [ 1805.731776]  ffff881fcaed1da0 0000000000000046 0000000000000000 ffff883f9ac0ba21
            [ 1805.740357]  ffffc900802298f8 ffff88402a8af430 0000000000004ed2 0000000000004ece
            [ 1805.748926]  ffff881fcaecb058 ffff881fcaed1fd8 000000000000fbc8 ffff881fcaecb058
            [ 1805.757503] Call Trace:
            [ 1805.760426]  [<ffffffffa07cea60>] ? check_for_next_transno+0x0/0x590 [ptlrpc]
            [ 1805.768553]  [<ffffffffa07cbc8d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
            [ 1805.776888]  [<ffffffffa07ca330>] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]
            [ 1805.784921]  [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
            [ 1805.792266]  [<ffffffffa07d2cba>] target_recovery_thread+0x76a/0x1920 [ptlrpc]
            [ 1805.800565]  [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
            [ 1805.807732]  [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
            [ 1805.816039]  [<ffffffff8109ab56>] kthread+0x96/0xa0
            [ 1805.821621]  [<ffffffff8100c20a>] child_rip+0xa/0x20
            [ 1805.827287]  [<ffffffff8109aac0>] ? kthread+0x0/0xa0
            [ 1805.832961]  [<ffffffff8100c200>] ? child_rip+0x0/0x20
            

            target_recovery_overseer() calls wait_event(), but it may take more than 120 seconds for the event to happen. The solution might be to use wait_event_timeout() in a loop to wake back up after 90% of sysctl_hung_task_timeout_secs has elapsed.

            ezell Matt Ezell added a comment - Messages during recovery like [ 844.502040] INFO: task tgt_recov:19764 blocked for more than 120 seconds. [ 844.509756] Not tainted 2.6.32-431.17.1.el6.wc.x86_64 #1 [ 844.516404] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 844.525393] tgt_recov D 0000000000000003 0 19764 2 0x00000000 [ 844.533247] ffff881fcaed1da0 0000000000000046 0000000000000000 ffff881fcaed1d64 [ 844.541824] 0000009100000000 ffff88207fc28800 ffff88011c456880 0000000000000400 [ 844.550425] ffff881fcaecb058 ffff881fcaed1fd8 000000000000fbc8 ffff881fcaecb058 [ 844.559077] Call Trace: [ 844.561987] [<ffffffffa07ca620>] ? check_for_clients+0x0/0x70 [ptlrpc] [ 844.569538] [<ffffffffa07cbc8d>] target_recovery_overseer+0x9d/0x230 [ptlrpc] [ 844.577874] [<ffffffffa07ca310>] ? exp_connect_healthy+0x0/0x20 [ptlrpc] [ 844.587551] [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40 [ 844.594898] [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc] [ 844.603226] [<ffffffffa07d2a90>] target_recovery_thread+0x540/0x1920 [ptlrpc] [ 844.611534] [<ffffffff81061d12>] ? default_wake_function+0x12/0x20 [ 844.618690] [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc] [ 844.627004] [<ffffffff8109ab56>] kthread+0x96/0xa0 [ 844.632593] [<ffffffff8100c20a>] child_rip+0xa/0x20 [ 844.638274] [<ffffffff8109aac0>] ? kthread+0x0/0xa0 [ 844.643947] [<ffffffff8100c200>] ? child_rip+0x0/0x20 and [ 1805.700512] INFO: task tgt_recov:19764 blocked for more than 120 seconds. [ 1805.708227] Not tainted 2.6.32-431.17.1.el6.wc.x86_64 #1 [ 1805.714870] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1805.723897] tgt_recov D 000000000000000b 0 19764 2 0x00000000 [ 1805.731776] ffff881fcaed1da0 0000000000000046 0000000000000000 ffff883f9ac0ba21 [ 1805.740357] ffffc900802298f8 ffff88402a8af430 0000000000004ed2 0000000000004ece [ 1805.748926] ffff881fcaecb058 ffff881fcaed1fd8 000000000000fbc8 ffff881fcaecb058 [ 1805.757503] Call Trace: [ 1805.760426] [<ffffffffa07cea60>] ? check_for_next_transno+0x0/0x590 [ptlrpc] [ 1805.768553] [<ffffffffa07cbc8d>] target_recovery_overseer+0x9d/0x230 [ptlrpc] [ 1805.776888] [<ffffffffa07ca330>] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc] [ 1805.784921] [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40 [ 1805.792266] [<ffffffffa07d2cba>] target_recovery_thread+0x76a/0x1920 [ptlrpc] [ 1805.800565] [<ffffffff81061d12>] ? default_wake_function+0x12/0x20 [ 1805.807732] [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc] [ 1805.816039] [<ffffffff8109ab56>] kthread+0x96/0xa0 [ 1805.821621] [<ffffffff8100c20a>] child_rip+0xa/0x20 [ 1805.827287] [<ffffffff8109aac0>] ? kthread+0x0/0xa0 [ 1805.832961] [<ffffffff8100c200>] ? child_rip+0x0/0x20 target_recovery_overseer() calls wait_event(), but it may take more than 120 seconds for the event to happen. The solution might be to use wait_event_timeout() in a loop to wake back up after 90% of sysctl_hung_task_timeout_secs has elapsed.

            People

              hongchao.zhang Hongchao Zhang
              green Oleg Drokin
              Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: