[LU-8145] start dump trace thread once CERROR Created: 16/May/16  Updated: 07/Jun/18  Resolved: 07/Jun/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Gu Zheng (Inactive) Assignee: Joseph Gmitter (Inactive)
Resolution: Won't Fix Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-8980 Add tracepoint support to Lustre Reopened

 Description   

We cannot monitor the debug trace all the time, and in particular we cannot know in advance when an error will happen. If the trace dump could start as soon as a CERROR occurs, that would be very helpful, because we could collect as much debug information as possible.



 Comments   
Comment by Gerrit Updater [ 16/May/16 ]

Gu Zheng (gzheng@ddn.com) uploaded a new patch: http://review.whamcloud.com/20218
Subject: LU-8145 libcfs: add dump debug trace on error support
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 831bd3adc5294a5d6a91bc46440062f9499f9bf6

Comment by Andreas Dilger [ 16/May/16 ]

Have you been using this patch in production anywhere? It seems to me that virtually every time that Lustre is started, especially after a recovery, it will print a CERROR() to the console and start debug daemon. In that case, it will always be running on all of the nodes.

Comment by Gu Zheng (Inactive) [ 17/May/16 ]

Hi Andreas,
Thanks for your comments.
I have only run simple tests, where it works well; it has not been used in production yet.
Regarding “It seems to me that virtually every time that Lustre is started”:
is there a CERROR message in the dumped log?

Comment by Andreas Dilger [ 17/May/16 ]

I thought that messages like:

Lustre: myth-OST0000: Will be in recovery for at least 5:00, or until 2 clients reconnect

were printed with LustreError, but they are not. Maybe it will be OK. It would be best to check some of your customer systems for LustreError in the console logs, starting at boot time, to see whether enabling debug_daemon after a CERROR() would be useful for debugging, or whether just dumping the debug log once would do. I suspect that in most cases the valuable debugging information will have happened before the CERROR() and not after, so dumping the current logs would be enough.

Note also that there are already module parameters like dump_on_eviction, dump_on_timeout, and dump_on_peer_timeout that will dump the logs once without having to enable debug_daemon to collect a large amount of debug information.
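
The "dump the current logs once" idea above can be sketched as a small simulation. This is only an illustration, not Lustre code: `TraceBuffer`, its `capacity`, and the level strings are hypothetical names. It assumes the debug log behaves like a fixed-size ring buffer, so a single dump taken at the first error already contains the messages that led up to it.

```python
from collections import deque


class TraceBuffer:
    """Hypothetical in-memory ring buffer standing in for a debug log."""

    def __init__(self, capacity):
        self.ring = deque(maxlen=capacity)  # oldest entries drop off when full
        self.dumps = []                     # each dump is a snapshot of the ring
        self.dumped_on_error = False

    def log(self, level, msg):
        self.ring.append(f"{level}: {msg}")
        # Dump once, on the first error only, instead of starting a daemon.
        if level == "ERROR" and not self.dumped_on_error:
            self.dumped_on_error = True
            self.dumps.append(list(self.ring))


buf = TraceBuffer(capacity=4)
for i in range(5):
    buf.log("DEBUG", f"step {i}")
buf.log("ERROR", "request failed")   # triggers the one-time dump
buf.log("DEBUG", "after error")      # not captured by that dump
buf.log("ERROR", "second failure")   # no second dump
```

In this sketch the single dump holds the last few pre-error messages plus the error itself, which is the information the comment suggests is usually the most valuable.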

Comment by Oleg Drokin [ 17/May/16 ]

Note that Fujitsu has a similar thing where all Lustre messages are dumped into a debug-daemon-like buffer so as not to clog dmesg.

But anyway, the biggest problem here is that we run with a very lean debug mask by default, so there's hardly anything you get from it beyond the CERROR/CWARN messages that are already logged in syslog/dmesg.

Now you could increase the debug level, but if you do it by default, this suddenly makes your FS slower and nobody likes that. If you do it on first CERROR, it's kind of too late too.
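
The "too late" point can be made concrete with a toy model. This is a sketch under stated assumptions, not the libcfs implementation: `MaskedLog` and the `MASK_*` constants are hypothetical, and their bit values are arbitrary. It assumes a message is recorded only if its flag is set in the current mask, and that the mask is widened when the first error is seen.

```python
# Hypothetical mask bits; the values are arbitrary and not taken from Lustre.
MASK_ERROR, MASK_WARNING, MASK_INFO, MASK_TRACE = 1, 2, 4, 8


class MaskedLog:
    """Toy log that records a message only if its flag is in the mask."""

    def __init__(self, mask):
        self.mask = mask
        self.entries = []

    def log(self, flag, msg):
        if flag & self.mask:
            self.entries.append(msg)
        if flag == MASK_ERROR:
            # Raise verbosity only once an error has already happened.
            self.mask |= MASK_INFO | MASK_TRACE


log = MaskedLog(MASK_ERROR | MASK_WARNING)  # lean default mask
log.log(MASK_TRACE, "enter lookup")   # filtered out: mask too lean
log.log(MASK_INFO, "lock granted")    # filtered out
log.log(MASK_ERROR, "lookup failed")  # recorded, then mask is widened
log.log(MASK_TRACE, "enter retry")    # recorded only because of the wider mask
```

Only the error and what follows it are recorded; the trace and info messages that preceded the error were never stored, so widening the mask afterwards cannot bring them back.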

Comment by Gu Zheng (Inactive) [ 20/May/16 ]

Yeah, it may not be very helpful in a production environment, but IMO it is useful for debugging, especially when we hit an error in production that is hard to reproduce.

Comment by James A Simmons [ 01/Dec/17 ]

Can we close this ticket?
