[LU-11566] sanity test_60aa: llog_print_cb()) not enough space for print log records Created: 24/Oct/18  Updated: 01/Mar/22  Resolved: 09/Sep/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.10.6
Fix Version/s: Lustre 2.13.0, Lustre 2.12.3

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Ben Evans (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-11373 CERROR/CWARN messages are not throttled Resolved
is related to LU-4939 Need to be able to sanely query and c... Resolved
is related to LU-12285 Wrong variable scope used in jt_llog_... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/473e7234-d712-11e8-ad90-52540065bddc

test_60aa produced the following stack trace:

Lustre: DEBUG MARKER: /usr/sbin/lctl --device %MGS llog_print \$lustre-client
------------[ cut here ]------------
WARNING: CPU: 1 PID: 5457 at lib/vsprintf.c:1741 vsnprintf+0x691/0x6a0
CPU: 1 PID: 5457 Comm: llog_process_th Kernel: 3.10.0-862.14.4.el7_lustre.x86_64 #1
Call Trace:
 dump_stack+0x19/0x1b
 warn_slowpath_null+0x1d/0x20
 vsnprintf+0x691/0x6a0
 snprintf+0x49/0x70
 class_config_yaml_output+0x22a/0x430 [obdclass]
 llog_print_cb+0x415/0x4f0 [obdclass]
 llog_process_thread+0x892/0x15a0 [obdclass]
 llog_process_thread_daemonize+0x9f/0xe0 [obdclass]
 kthread+0xd1/0xe0
---[ end trace a7e0036714283adf ]---
LustreError: 5457:0:(llog_ioctl.c:264:llog_print_cb()) not enough space for print log records

It looks like this is a problem in how class_config_yaml_output() is written:

��������        if (LUSTRE_CFG_BUFLEN(lcfg, 0) > 0)
               ����������������ptr += snprintf(ptr, end - ptr, ", device: %s",
                              ��������������������������������lustre_cfg_string(lcfg, 0));

If ptr ever exceeds end then a negative value is passed to snprintf(), and triggers this WARN_ONCE(). This can happen if the message is too long, because snprintf() will return the total length that would have been printed, not the actual length, so it keeps growing.

Two fixes are needed here:

  • increase the buffer size so we don't overflow it, and userspace gets the records it wants
  • fix the code to stop printing, or compute the remaining buffer space differently so it doesn't become negative

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_60b - CDEBUG_LIMIT not limiting messages (103)



 Comments   
Comment by Andreas Dilger [ 24/Oct/18 ]

This looks a bit worse than I thought. It seems that the current mechanism for printing the config logs is to print the whole log into a single buffer in the kernel. This means any config with a lot of records is going to overflow the 8192-byte buffer (I thought it was holding at most a single line of output).

It seems there are a couple of options here:

  • rework how the logs are printed to be one-per-line
  • allow the call to dump logs to be continued at the next record
  • completely change how logs are dumped, and use debugfs

Probably the third option is best. This would involve creating a debugfs entry for each config record that generates the YAML-formatted output, then have llog_print access that file (for compatibility, users could also then use "lctl get_param mgs.param.<config>" or similar.

Note that the MGS should only generate debugfs entries for well-known config files, like param, <fsname>-MDTxxxx, <fsname>-OSTxxxx}, and whatever the sptlrpc file is named, and not any cruft in the CONFIGS directory such as backups or other files in that directory.

Comment by Andreas Dilger [ 25/Oct/18 ]

Ben, it looks like this will affect your patch to backup/restore the config logs. If the config log is large, then the supplied buffer will not be large enough to hold all of the config records.

As a short term solution, it might be enough to increase MAX_IOC_BUFLEN to e.g. 16KB, but I don't think that is a long-term solution. See my previous comment about using debugfs for the longer-term solution. I think that would be more suitable for 2.13, since it will involve some non-trivial changes to this code.

Comment by Gerrit Updater [ 30/Oct/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33511
Subject: LU-11566 utils: increase llog_print buffer size
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0027b58de29e1d8574098eed892377ad15b6d0bf

Comment by James A Simmons [ 30/Oct/18 ]

I do like the idea of using debugfs but remember its limited to root only. Will this matter to admins or for logging tools? In case we do need non root, but still privileged access, we can use the generic netlink API. For the generic netlink case we can send up to 32K size packets and each packet could contain a seq number so the packets can be pieced together. Just another option for the 2.13 window.

Comment by Andreas Dilger [ 31/Oct/18 ]

We already have an ioctl, I don't think netlink is an improvement for this case. For llog_print this should really be a root-only operation anyway, so debugfs is OK.

Comment by Andreas Dilger [ 20/Nov/18 ]

Ben, is there any chance you could look into this? I think that without some kind of fix here, your (IMHO very useful) LU-4939 patch to allow config backup/restore will only be useful for small configurations that don't have very many config records.

Comment by Ben Evans (Inactive) [ 20/Nov/18 ]

Yep, it's on my list of things to do.  I think the short-term solution is to read each file in chunks.  Long term, debugfs should work fine.

Comment by Andreas Dilger [ 20/Nov/18 ]

If you can work up a mechanism to read the file in chunks, that is fine too. Probably the easiest would be to print a single record per call, so that the buffer is always large enough. A bit more complex would be to pack multiple whole records into the buffer, and stop when the last one doesn't fit. Splitting a single record across multiple calls would be hard to handle in both the kernel and userspace.

Strangely, I didn't know that llog_print allows specifying the start and end record, so that makes the solution relatively easy to implement. Just specify a series of start/end pairs internally until there are no more records. It would be relatively safe to assume e.g. 256 bytes/record and then plan accordingly.

Note that there is a minor bug in jt_llog_print() in that it doesn't allow specifying only the starting record (i.e. argc == 3). There is also a small bug in the lctl-llog_print.8 man page, in that it should show "lctl llog_print <logname> [start_index [end_index]]" (i.e. "end_index" is only allowed if "start_index" is specified, instead of implying that "end_index" can be used independently). It would also be useful if llog_print checked for named arguments like --start and --end (or similar) instead of depending on positional parameters, since that makes parsing more complex as new options are added.

Comment by Gerrit Updater [ 10/Dec/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33815
Subject: LU-11566 utils: lctl llog_print for large configurations
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 039ed418fa9a7e075af8e3e45b22776227259abd

Comment by Gerrit Updater [ 14/Dec/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33863
Subject: LU-11566 utils: fix lctl llog_print for large configs
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 79d5c1f0aac2459070c0837835a2aac3e75ec4c5

Comment by Gerrit Updater [ 10/Jan/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34004
Subject: LU-11566 utils: improve usage/docs for lctl llog commands
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 32082e82469176652f90b261457993708aa4e1d6

Comment by Gerrit Updater [ 30/Jan/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33815/
Subject: LU-11566 utils: fix lctl llog_print for large configs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3783aa285b15a811081a8de829d52f7f83e91209

Comment by Gerrit Updater [ 23/Feb/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33863/
Subject: LU-11566 utils: fix lctl llog_print for large configs
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: a8abec1d666c92eb5003e020488f10d10da82787

Comment by Gerrit Updater [ 28/Mar/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34523
Subject: LU-11566 llog: consolidate argument validation
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: eb63a2f319a8e087f9223db3c41f00bc3f126747

Comment by Gerrit Updater [ 08/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34004/
Subject: LU-11566 utils: improve usage/docs for lctl llog commands
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2ec11b04dd76ea176ba9c76f809f576a04dafc1a

Comment by Gerrit Updater [ 10/May/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34850
Subject: LU-11566 utils: fix lctl llog_print for large configs
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: ca403753ec60503ab599fb7292db95789bda858e

Comment by Gerrit Updater [ 08/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34850/
Subject: LU-11566 utils: fix lctl llog_print for large configs
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: c1b11bf102ea73860fcb65afceb99db5b4fcc2e2

Comment by Patrick Farrell (Inactive) [ 09/Sep/19 ]

Main patches landed, moved remaining cleanup to LU-12737.

Comment by Gerrit Updater [ 11/Feb/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46510
Subject: LU-11566 utils: improve usage/docs for lctl llog commands
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: af53bb19913ea41a9fdbee386cb5e1ef3eaf0cae

Generated at Sat Feb 10 02:44:59 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.