Prepare lustre for adoption into the linux kernel
(LU-9679)
|
|
| Status: | Reopened |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | Technical task | Priority: | Critical |
| Reporter: | James A Simmons | Assignee: | James A Simmons |
| Resolution: | Unresolved | Votes: | 2 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||||||
| Epic/Theme: | upstream | ||||||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||||||
| Description |
|
One of the major obstacles to the movement of the upstream client out of staging is our debug logging system. The libcfs debugging infrastructure was developed long before trace point existed. |
| Comments |
| Comment by James A Simmons [ 28/Dec/16 ] |
|
I do have a first patch but I need to add the event filtering base on the debug flags. Here is the sample output I'm seeing. # tracer: nop # # entries-in-buffer/entries-written: 2121/11382 #P:8 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | modprobe-10993 [007] .... 6584.386683: lnet_config: (linux-crypto.c:377:cfs_crypto_performance_test) Crypto hash algorithm crc32 speed = 1427 MB/s ..... |
| Comment by Gerrit Updater [ 30/Dec/16 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/24554 |
| Comment by James A Simmons [ 31/Dec/16 ] |
|
Also have llite in the works. Here is the output I have so far. kworker/u17:2-5181 [001] .... 2135.478773: llite_vfstrace: (vvp_io.c:306:vvp_io_fini) [0x200005222:0x13a4:0x0] ignore/verify layout 0/0, layout version 0 restore needed 0 sendfile_groupl-23957 [002] .... 2135.480731: llite_vfstrace: (file.c:3992:ll_layout_conf) [0x200005222:0x13a4:0x0]: layout version change: 0 -> 0 |
| Comment by Gerrit Updater [ 31/Dec/16 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/24560 |
| Comment by James A Simmons [ 19/Jan/17 ] |
|
Just pushed a version of tracepoint that handles printing messages to the console. Most of the infrastructure is now in place. Only a few issues to resolve. 1) Currently ratelimting is per module. For the lustre debugging code it has a individual rate limit for every single message. I doubt that would be acceptable upstream. 2) Need to fine tune the printk rate limiting based on module parameters in libcfs. 3) Some of the debug message mask are conditional i.e "CDEBUG(warn ? D_WARNING : D_INFO". Might need to create special macros for this. Haven't figured a clean solution. Another option is break it up into two separate trace events. For the most part its ready for review. Peter who would be good choices to look at this work? |
| Comment by Dmitry Eremin (Inactive) [ 07/Feb/17 ] |
|
James, This is not I expected to see when I vote for this feature. I expect to see correct function name in "FUNCTION" field instead of category like "lnet_info". # tracer: nop
#
# entries-in-buffer/entries-written: 58/61 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
ldlm_bl_01-15147 [003] .... 189369.402891: lnet_info: (linux-crypto.c:119:cfs_crypto_hash_alloc) Using crypto hash: adler32 (adler32-zlib) speed 2454 MB/s
ll_ost_io00_002-15366 [001] .... 189369.403051: lnet_info: (linux-crypto.c:119:cfs_crypto_hash_alloc) Using crypto hash: adler32 (adler32-zlib) speed 2454 MB/s
##### CPU 0 buffer started ####
ptlrpcd_00_03-14400 [000] .... 189383.834528: lnet_info: (linux-crypto.c:119:cfs_crypto_hash_alloc) Using crypto hash: adler32 (adler32-zlib) speed 2454 MB/s
ll_ost_io00_002-15366 [001] .... 189383.835278: lnet_info: (linux-crypto.c:119:cfs_crypto_hash_alloc) Using crypto hash: adler32 (adler32-zlib) speed 2454 MB/s
##### CPU 2 buffer started ####
ptlrpcd_00_01-14398 [002] .... 189383.848857: lnet_info: (linux-crypto.c:119:cfs_crypto_hash_alloc) Using crypto hash: adler32 (adler32-zlib) speed 2454 MB/s
It would be very beneficially to be able to select appropriate function name for tracing: # perf list | grep lnet lnet:lnet_config [Tracepoint event] lnet:lnet_console [Tracepoint event] lnet:lnet_emerg [Tracepoint event] lnet:lnet_error [Tracepoint event] lnet:lnet_ha [Tracepoint event] lnet:lnet_info [Tracepoint event] lnet:lnet_ioctl [Tracepoint event] lnet:lnet_neterror [Tracepoint event] lnet:lnet_other [Tracepoint event] lnet:lnet_warning [Tracepoint event] Right now, I can select for tracing only category instead of exact function I interested in. |
| Comment by James A Simmons [ 09/Feb/17 ] |
|
Function in this case means the name the compiler generates from the TRACE_EVENT macro. For example DEFINE_EVENT(lnet_log_msg, lnet_config, ... --> becomes lnet_config(...) in the source file. So they are not regular function but more like event markers. One of major goals of this project is to replace the current lctl debugging infrastructure since this is a requirement upstream. So I worked the code to map to the behavior lctl debug is doing which is category based. You will see CDEBUG(D_HA,...) maps to lnet_ha(...) etc. By the way the reason I used names like lnet_info, lnet_ioctl is to see clearly in the logs which subsys the event is showing up in. Perhaps ptlrpcd_00_01-14398 [002] .... 189383.848857: info: lnet:(linux-crypto.c:119:cfs_crypto_hash_alloc) Using crypto hash: ... I have to look at the logs perf creates to see if it makes sense to the user. As for limiting debugging to an exact function you can do that with filtering. |
| Comment by Dmitry Eremin (Inactive) [ 09/Feb/17 ] |
|
Unfortunately tracepoints don't allow to filter by "const char *" fields. I was not able to specify filter by file or function name because of it's interpreted as integer. Can we change the type to fixed array of char or dynamic array of char? I encountered that only such types allow to specify filter with wild card characters (*,?).
|
| Comment by Dmitry Eremin (Inactive) [ 10/Feb/17 ] |
|
This patch allow string filtering by file and function name. diff --git a/libcfs/include/libcfs/libcfs_trace.h b/libcfs/include/libcfs/libcfs_trace.h
index c2aecdb..5d210cb 100644
--- a/libcfs/include/libcfs/libcfs_trace.h
+++ b/libcfs/include/libcfs/libcfs_trace.h
@@ -67,20 +67,20 @@ DECLARE_EVENT_CLASS(libcfs_log_msg,
struct va_format *vaf),
TP_ARGS(msg_file, msg_line, msg_fn, vaf),
TP_STRUCT__entry(
- __field(const char *, msg_file)
+ __string(msg_file, msg_file ? msg_file : "<nofile>")
__field(int, msg_line)
- __field(const char *, msg_fn)
+ __string(msg_func, msg_fn ? msg_fn : "<nofunc>")
__dynamic_array(char, msg, CFS_TRACE_CONSOLE_BUFFER_SIZE)
),
TP_fast_assign(
- __entry->msg_file = msg_file;
+ __assign_str(msg_file, msg_file);
__entry->msg_line = msg_line;
- __entry->msg_fn = msg_fn;
+ __assign_str(msg_func, msg_fn);
vsnprintf(__get_dynamic_array(msg),
CFS_TRACE_CONSOLE_BUFFER_SIZE, vaf->fmt, *vaf->va);
),
- TP_printk("(%s:%d:%s) %s", __entry->msg_file, __entry->msg_line,
- __entry->msg_fn, __get_str(msg))
+ TP_printk("(%s:%d:%s) %s", __get_str(msg_file), __entry->msg_line,
+ __get_str(msg_func), __get_str(msg))
);
DEFINE_EVENT(libcfs_log_msg, libcfs_config,
@@ -118,20 +118,20 @@ DECLARE_EVENT_CLASS(libcfs_log_msg_limited,
const char *msg_fn, struct va_format *vaf),
TP_ARGS(rs, msg_file, msg_line, msg_fn, vaf),
TP_STRUCT__entry(
- __field(const char *, msg_file)
+ __string(msg_file, msg_file ? msg_file : "<nofile>")
__field(int, msg_line)
- __field(const char *, msg_fn)
+ __string(msg_func, msg_fn ? msg_fn : "<nofunc>")
__dynamic_array(char, msg, CFS_TRACE_CONSOLE_BUFFER_SIZE)
),
TP_fast_assign(
- __entry->msg_file = msg_file;
+ __assign_str(msg_file, msg_file);
__entry->msg_line = msg_line;
- __entry->msg_fn = msg_fn;
+ __assign_str(msg_func, msg_fn);
vsnprintf(__get_dynamic_array(msg),
CFS_TRACE_CONSOLE_BUFFER_SIZE, vaf->fmt, *vaf->va);
),
- TP_printk("(%s:%d:%s) %s", __entry->msg_file, __entry->msg_line,
- __entry->msg_fn, __get_str(msg))
+ TP_printk("(%s:%d:%s) %s", __get_str(msg_file), __entry->msg_line,
+ __get_str(msg_func), __get_str(msg))
);
DEFINE_EVENT_CONDITION(libcfs_log_msg_limited, libcfs_console,
Probably with this patch the implementation will be suitable enough to replace current logging system. |
| Comment by James A Simmons [ 10/Feb/17 ] |
|
Oh now I understand what you mean by your filtering. Thanks for making it clearer. BTW its really awesome to be able to filter that way. |
| Comment by James A Simmons [ 13/Feb/17 ] |
|
Dmitry how does it look to you now. BTW does anyone even know what the error codes for LCONSOLE_ERROR mean anymore? |
| Comment by James A Simmons [ 31/Mar/17 ] |
|
I pushed a new patch that has been updated and I fixed the build issue on RHEL6. Only the CERROR need to be ported but I like to have feedback on the patch before I complete the remaining work. |
| Comment by James A Simmons [ 04/Apr/17 ] |
|
It lives. I have completed a functional patch with a complete port of all the trace point events. Please give it a review. |
| Comment by Gerrit Updater [ 28/Oct/23 ] |
|
"Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52875 |