[LU-8145] start dump trace thread once CERROR Created: 16/May/16 Updated: 07/Jun/18 Resolved: 07/Jun/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Gu Zheng (Inactive) | Assignee: | Joseph Gmitter (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | patch |
| Issue Links: |
|
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
As we cannot monitor the debug trace all the time, especially when |
| Comments |
| Comment by Gerrit Updater [ 16/May/16 ] |
|
Gu Zheng (gzheng@ddn.com) uploaded a new patch: http://review.whamcloud.com/20218 |
| Comment by Andreas Dilger [ 16/May/16 ] |
|
Have you been using this patch in production anywhere? It seems to me that virtually every time Lustre is started, especially after a recovery, it will print a CERROR() to the console and start the debug daemon. In that case, the daemon will always be running on all of the nodes. |
| Comment by Gu Zheng (Inactive) [ 17/May/16 ] |
|
Hi Andreas, |
| Comment by Andreas Dilger [ 17/May/16 ] |
|
I thought that messages like "Lustre: myth-OST0000: Will be in recovery for at least 5:00, or until 2 clients reconnect" were printed with LustreError, but they are not. Maybe it will be OK. It would be best to check some of your customer systems for LustreError in the console logs starting at boot time, to see whether enabling debug_daemon after CERROR() would be useful for debugging, or whether just dumping the debug log once would do. I suspect that in most cases the valuable debugging information will have been logged before the CERROR() and not after, so dumping the current logs would be enough. Note also that there are already module parameters like dump_on_eviction, dump_on_timeout, and dump_on_peer_timeout that will dump the logs once without having to enable debug_daemon to collect a large amount of debug information. |
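The point about the useful information preceding the error can be illustrated with a toy model. The sketch below is not Lustre code: `TraceBuffer`, its `capacity`, and the `demo` helper are hypothetical names invented for illustration. It assumes a bounded in-memory log that is snapshotted once, on the first error, and shows that the snapshot holds the messages leading up to the failure.

```python
from collections import deque


class TraceBuffer:
    """Toy stand-in for a bounded in-memory debug log (hypothetical, not Lustre)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry when full.
        self.entries = deque(maxlen=capacity)
        self.dumps = []           # snapshots taken so far
        self.dumped_once = False  # dump only on the first error

    def log(self, message, is_error=False):
        self.entries.append(message)
        if is_error and not self.dumped_once:
            # Snapshot what is currently buffered: the history before the error.
            self.dumps.append(list(self.entries))
            self.dumped_once = True


def demo():
    buf = TraceBuffer(capacity=4)
    for i in range(6):
        buf.log(f"debug {i}")
    buf.log("ERROR: request failed", is_error=True)
    buf.log("debug after error")
    buf.log("ERROR: second failure", is_error=True)
    return buf.dumps


if __name__ == "__main__":
    for snapshot in demo():
        print(snapshot)
```

In this model the single snapshot contains the last few messages before the first error plus the error itself. Nothing logged afterwards is needed to produce it, which is the case for a one-shot dump over a continuously running collector.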
| Comment by Oleg Drokin [ 17/May/16 ] |
|
Note that Fujitsu has a similar thing where all Lustre messages are dumped into a debug-daemon-like buffer so as not to clog dmesg. But anyway, the biggest problem here is that we run with a very lean debug mask by default, so there's hardly anything you get from it outside of CERROR/CWARN that is not already logged in syslog/dmesg. Now you could increase the debug level, but if you do it by default, this suddenly makes your FS slower and nobody likes that. If you do it on the first CERROR, it's kind of too late too. |
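The trade-off described above can be sketched with a small simulation. This is an illustrative model only: the flag names and values (`ERROR`, `WARNING`, `TRACE`), the `capture` function, and the sample events are assumptions made up for this example and do not correspond to Lustre's actual debug mask bits.

```python
# Hypothetical debug flags; the values are illustrative only.
ERROR = 0x1
WARNING = 0x2
TRACE = 0x4  # stands in for any verbose debug category

EVENTS = [
    (TRACE, "trace: lock enqueued"),
    (TRACE, "trace: rpc sent"),
    (ERROR, "error: rpc timed out"),
    (TRACE, "trace: retrying rpc"),
]


def capture(events, initial_mask, raise_on_first_error):
    """Return the messages that pass the mask, optionally widening it after an error."""
    mask = initial_mask
    captured = []
    for flag, text in events:
        if flag & mask:
            captured.append(text)
        if flag == ERROR and raise_on_first_error:
            # Widening the mask here only affects later messages.
            mask |= TRACE
    return captured


if __name__ == "__main__":
    lean = ERROR | WARNING
    print(capture(EVENTS, lean, raise_on_first_error=False))
    print(capture(EVENTS, lean, raise_on_first_error=True))
    print(capture(EVENTS, lean | TRACE, raise_on_first_error=False))
```

With the lean mask, only the error itself is captured. Widening the mask on the first error adds the retry trace but still misses the two trace messages that preceded the failure. Only the always-verbose mask captures everything, and that is the configuration with the performance cost.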
| Comment by Gu Zheng (Inactive) [ 20/May/16 ] |
|
Yeah, it may not be very helpful in a production environment, but IMO it is useful for debugging, especially when we hit an error in production that is hard to reproduce. |
| Comment by James A Simmons [ 01/Dec/17 ] |
|
Can we close this ticket? |