[LU-5202] LFSCK 5: LFSCK needs to log all changes and errors found Created: 16/Jun/14 Updated: 12/Aug/21 |
|
| Status: | In Progress |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.5.5 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major |
| Reporter: | Andreas Dilger | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | mq115 | ||
| Issue Links: |
|
||||||||||||||||||||
| Rank (Obsolete): | 14526 | ||||||||||||||||||||
| Description |
|
LFSCK needs to log with D_LFSCK all fixes that it makes and any inconsistencies that it finds that it does not repair (e.g. unknown LOV magic layouts). Otherwise it will be making secret changes to the filesystem and when there are problems they will be impossible to debug. There should be a mechanism for logging D_LFSCK messages to a separate log file for administrators to review, so that the kernel debug messages are not lost. A simple mechanism would be debug_daemon to log all messages to a file, then "lctl filter all; lctl show lfsck" (or similar) to filter all except lfsck messages into a text log file. The problem with this approach is that debug_daemon will consume all debug messages while it is running, and it will log a lot more than just D_LFSCK messages to disk. We may want to consider some other logging mechanism to capture just the D_LFSCK messages. |
| Comments |
| Comment by nasf (Inactive) [ 17/Jun/14 ] |
|
Why not use |
| Comment by Andreas Dilger [ 17/Jun/14 ] |
|
Because |
| Comment by Andreas Dilger [ 07/Jan/15 ] |
|
On 2.5.3-ge835226 I ran lctl lfsck_start -M myth-MDT0000 -t namespace and enabled D_LFSCK printing to the console via lctl set_param printk=lfsck. This generated a lot of console messages that weren't very useful: Jan 4 21:36:33 mookie kernel: Lustre: 3402:0:(osd_scrub.c:1240:osd_otable_it_preload()) OSD pre-loaded: max = 2621440, preload = 2621438, rc = 0 Jan 4 21:36:33 mookie kernel: Lustre: 3402:0:(osd_scrub.c:1240:osd_otable_it_preload()) OSD pre-loaded: max = 2621440, preload = 2621439, rc = 0 Jan 4 21:36:33 mookie kernel: Lustre: 3402:0:(osd_scrub.c:1240:osd_otable_it_preload()) OSD pre-loaded: max = 2621440, preload = 2621440, rc = 0 Jan 4 21:36:33 mookie kernel: Lustre: 3402:0:(osd_scrub.c:1240:osd_otable_it_preload()) OSD pre-loaded: max = 2621440, preload = 2621441, rc = 1 Jan 4 21:36:33 mookie kernel: Lustre: 3403:0:(osd_scrub.c:1301:osd_scrub_main()) OI scrub: stop, rc = 1, pos = 2621441 Jan 4 21:36:33 mookie kernel: Lustre: 3402:0:(lfsck_engine.c:340:lfsck_master_engine()) LFSCK exit: oit_flags = 0xc0002, dir_flags = 0xc004, oit_cookie = 2130251, dir_cookie = 3397299287892376453, parent = [0x20814b:0x17d4596b:0x0], pid = 3402, rc = 1 This should be fixed for b2_5. For 2.7, we discussed adding lctl lfsck_start -v and -q options to enable the D_LFSCK messages to print to the console or not. Also, major messages like start/stop, urgent scrub, and first message of non-upgrade error should always be printed to the console via D_CONSOLE. |
| Comment by Johann Lombardi (Inactive) [ 25/Feb/15 ] |
|
In the meantime, it would be pretty useful to have a procfs tunable to dump lustre debug logs when an inconsistency is found by lfsck (similar to dump_on_eviction). It would help us debugging on lola. |
| Comment by Gerrit Updater [ 25/Feb/15 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13864 |
| Comment by Andreas Dilger [ 25/Feb/15 ] |
|
I've just been using lctl set_param printk=+lfsck to have lfsck messages go to the console. That is ok with master, but b2_5 needs to be fixed so all the OI scrub status messages are not using D_LFSCK. |
| Comment by nasf (Inactive) [ 26/Feb/15 ] |
|
The patch 13864 is used as temporary solution for lola test. |
| Comment by Cliff White (Inactive) [ 17/Mar/15 ] |
|
The patch current has two negative reviews and has failed Maloo - we cannot use this on soak test in this state. |
| Comment by nasf (Inactive) [ 04/Mar/16 ] |
Sorry, there is not suitable patch yet, we need more consideration for that. |
| Comment by Nathan Dauchy (Inactive) [ 27/May/16 ] |
|
While we are waiting for a revised patch and new feature... can you please clarify the procedure for gathering the logs and perhaps deciphering them? What I have been able to come up with so far is... lctl set_param printk=+lfsck lctl filter all_subs lctl show lfsck lctl clear lctl debug_daemon start /var/log/lfsck.debug lctl lfsck_start -A -M nbptest-MDT0000 -c on -C on -o lctl get_param -n osd-ldiskfs.*.oi_scrub | grep status lctl debug_daemon stop lctl set_param printk=-lfsck lctl debug_file /var/log/lfsck.debug | grep lfsck | egrep -v "(namespace LFSCK add flag| RPC pname)" > /var/log/lfsck.log lctl show all_subs I'm not sure that the "filter" commands are actually doing anything useful though. And I would like to know if that method leaves out anything useful from the lfsck.log file. |
| Comment by Andreas Dilger [ 28/May/16 ] |
|
What you've got there is belt and suspenders. The printk setting will itself dump all lfsck messages to the console (dmesg & /var/log/messages). If you don't want that, then debug_daemon is the alternative. The filter setting is indeed irrelevant since all the subsystems are already enabled by default. |