Details

    • Improvement
    • Resolution: Won't Fix
    • Minor
    • None
    • None

    Description

      We cannot monitor the debug trace all the time, and in particular we cannot know in advance when an error will happen. It would therefore be very helpful if the trace dump could start as soon as a CERROR occurs, so that we can collect as much debug information as possible.

          Activity

            [LU-8145] start dump trace thread once CERROR

            simmonsja James A Simmons added a comment -

            Can we close this ticket?

            cengku9660 Gu Zheng (Inactive) added a comment -

            Yeah, it may not be very helpful in a production environment, but, IMO, it is useful for debugging, especially when we hit an error in production that is hard to reproduce.
            green Oleg Drokin added a comment -

            Note that Fujitsu has a similar thing where all Lustre messages are dumped into a debug-daemon-like buffer so as not to clog dmesg.

            But anyway, the biggest problem here is that we run with a very lean debug mask by default, so there's hardly anything you get from it beyond the CERROR/CWARN messages that are already logged in syslog/dmesg.

            Now you could increase the debug level, but if you do it by default, this suddenly makes your FS slower and nobody likes that. If you do it on the first CERROR, it's kind of too late.

            adilger Andreas Dilger added a comment -

            I thought that messages like:

            Lustre: myth-OST0000: Will be in recovery for at least 5:00, or until 2 clients reconnect

            were printed with LustreError, but they are not. Maybe it will be OK. It would be best to check some of your customer systems for LustreError in the console logs, starting at boot time, to see whether enabling debug_daemon after a CERROR() would be useful for debugging, or whether just dumping the debug log once would do. I suspect that in most cases the valuable debugging information will have been logged before the CERROR() and not after, so dumping the current logs would be enough.

            Note also that there are already module parameters like dump_on_eviction, dump_on_timeout, and dump_on_peer_timeout that will dump the logs once without having to enable debug_daemon to collect a large amount of debug information.
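The console-log check suggested above could be sketched roughly as follows. This is a minimal illustration and not part of the patch: it assumes console messages carry a `Lustre:` or `LustreError:` prefix (possibly after a timestamp), and the function names and the sample error line are hypothetical.

```python
import re
from collections import Counter

# Assumption: each Lustre console message contains "Lustre:" or
# "LustreError:" somewhere near the start of the line. Adjust the
# pattern if your syslog/dmesg format differs.
PREFIX_RE = re.compile(r"\b(LustreError|Lustre):")


def count_lustre_messages(lines):
    """Count console lines by their first Lustre prefix."""
    counts = Counter()
    for line in lines:
        match = PREFIX_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def first_error_index(lines):
    """Return the index of the first LustreError line, or None."""
    for index, line in enumerate(lines):
        match = PREFIX_RE.search(line)
        if match and match.group(1) == "LustreError":
            return index
    return None


# Hypothetical sample; the second line is made up for illustration.
sample = [
    "Lustre: myth-OST0000: Will be in recovery for at least 5:00, "
    "or until 2 clients reconnect",
    "LustreError: hypothetical error message for illustration",
    "unrelated kernel message",
]

counts = count_lustre_messages(sample)
```

If `first_error_index` returns a small index on most nodes, a LustreError shows up shortly after boot, which would suggest that a daemon triggered by CERROR() ends up running almost everywhere.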

            cengku9660 Gu Zheng (Inactive) added a comment -

            Hi Andreas,
            Thanks for your comments.
            I have only run simple tests so far. It works well there, but it has not been used in production yet.
            Regarding "It seems to me that virtually every time that Lustre is started": is there a CERROR message in the dumped log?

            adilger Andreas Dilger added a comment -

            Have you been using this patch in production anywhere? It seems to me that virtually every time that Lustre is started, especially after a recovery, it will print a CERROR() to the console and start the debug daemon. In that case, it will always be running on all of the nodes.

            gerrit Gerrit Updater added a comment -

            Gu Zheng (gzheng@ddn.com) uploaded a new patch: http://review.whamcloud.com/20218
            Subject: LU-8145 libcfs: add dump debug trace on error support
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 831bd3adc5294a5d6a91bc46440062f9499f9bf6

            People

              jgmitter Joseph Gmitter (Inactive)
              cengku9660 Gu Zheng (Inactive)
              Votes: 0
              Watchers: 5
