[LU-5202] LFSCK 5: LFSCK needs to log all changes and errors found Created: 16/Jun/14  Updated: 12/Aug/21

Status: In Progress
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.5.5
Fix Version/s: None

Type: Improvement Priority: Major
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: mq115

Issue Links:
Duplicate
is duplicated by LU-8169 More extensive and accessible logging... Resolved
Related
is related to LU-4610 LFSCK needs to log all changes and er... Closed
is related to LU-6102 OI scrub is triggered during racer test Resolved
Rank (Obsolete): 14526

 Description   

LFSCK needs to log with D_LFSCK all fixes that it makes and any inconsistencies that it finds that it does not repair (e.g. unknown LOV magic layouts). Otherwise it will be making secret changes to the filesystem and when there are problems they will be impossible to debug.

There should be a mechanism for logging D_LFSCK messages to a separate log file for administrators to review, so that the kernel debug messages are not lost. A simple mechanism would be debug_daemon to log all messages to a file, then "lctl filter all; lctl show lfsck" (or similar) to filter all except lfsck messages into a text log file.

The problem with this approach is that debug_daemon will consume all debug messages while it is running, and it will log a lot more than just D_LFSCK messages to disk. We may want to consider some other logging mechanism to capture just the D_LFSCK messages.



 Comments   
Comment by nasf (Inactive) [ 17/Jun/14 ]

Why not use LU-4610?

Comment by Andreas Dilger [ 17/Jun/14 ]

Because LU-4610 needs to be closed for 2.6.0 to be released, and it isn't likely that this code will be landed for 2.6.0.

Comment by Andreas Dilger [ 07/Jan/15 ]

On 2.5.3-ge835226 I ran lctl lfsck_start -M myth-MDT0000 -t namespace and enabled D_LFSCK printing to the console via lctl set_param printk=lfsck. This generated a lot of console messages that weren't very useful:

Jan  4 21:36:33 mookie kernel: Lustre: 3402:0:(osd_scrub.c:1240:osd_otable_it_preload()) OSD pre-loaded: max = 2621440, preload = 2621438, rc = 0
Jan  4 21:36:33 mookie kernel: Lustre: 3402:0:(osd_scrub.c:1240:osd_otable_it_preload()) OSD pre-loaded: max = 2621440, preload = 2621439, rc = 0
Jan  4 21:36:33 mookie kernel: Lustre: 3402:0:(osd_scrub.c:1240:osd_otable_it_preload()) OSD pre-loaded: max = 2621440, preload = 2621440, rc = 0
Jan  4 21:36:33 mookie kernel: Lustre: 3402:0:(osd_scrub.c:1240:osd_otable_it_preload()) OSD pre-loaded: max = 2621440, preload = 2621441, rc = 1
Jan  4 21:36:33 mookie kernel: Lustre: 3403:0:(osd_scrub.c:1301:osd_scrub_main()) OI scrub: stop, rc = 1, pos = 2621441
Jan  4 21:36:33 mookie kernel: Lustre: 3402:0:(lfsck_engine.c:340:lfsck_master_engine()) LFSCK exit: oit_flags = 0xc0002, dir_flags = 0xc004, oit_cookie = 2130251, dir_cookie = 3397299287892376453, parent = [0x20814b:0x17d4596b:0x0], pid 
= 3402, rc = 1

This should be fixed for b2_5.

For 2.7, we discussed adding lctl lfsck_start -v and -q options to enable the D_LFSCK messages to print to the console or not. Also, major messages like start/stop, urgent scrub, and first message of non-upgrade error should always be printed to the console via D_CONSOLE.

Comment by Johann Lombardi (Inactive) [ 25/Feb/15 ]

In the meantime, it would be pretty useful to have a procfs tunable to dump lustre debug logs when an inconsistency is found by lfsck (similar to dump_on_eviction). It would help us debugging on lola.

Comment by Gerrit Updater [ 25/Feb/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13864
Subject: LU-5202 lfsck: dump LFSCK debug log automatically
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 06777610a8e5b16cfb66cd0a54bc616e965c28d2

Comment by Andreas Dilger [ 25/Feb/15 ]

I've just been using lctl set_param printk=+lfsck to have lfsck messages go to the console. That is ok with master, but b2_5 needs to be fixed so all the OI scrub status messages are not using D_LFSCK.

Comment by nasf (Inactive) [ 26/Feb/15 ]

The patch 13864 is used as temporary solution for lola test.

Comment by Cliff White (Inactive) [ 17/Mar/15 ]

The patch current has two negative reviews and has failed Maloo - we cannot use this on soak test in this state.
Is there a proper fix that is going to land somewhere?

Comment by nasf (Inactive) [ 04/Mar/16 ]

The patch current has two negative reviews and has failed Maloo - we cannot use this on soak test in this state.
Is there a proper fix that is going to land somewhere?

Sorry, there is not suitable patch yet, we need more consideration for that.

Comment by Nathan Dauchy (Inactive) [ 27/May/16 ]

While we are waiting for a revised patch and new feature... can you please clarify the procedure for gathering the logs and perhaps deciphering them?

What I have been able to come up with so far is...

 lctl set_param printk=+lfsck
 lctl filter all_subs
 lctl show lfsck
 lctl clear
 lctl debug_daemon start /var/log/lfsck.debug
 
 lctl lfsck_start -A -M nbptest-MDT0000 -c on -C on -o
 lctl get_param -n osd-ldiskfs.*.oi_scrub | grep status
 
 lctl debug_daemon stop
 lctl set_param printk=-lfsck
 lctl debug_file /var/log/lfsck.debug | grep lfsck | egrep -v "(namespace LFSCK add flag| RPC pname)" > /var/log/lfsck.log
 lctl show all_subs

I'm not sure that the "filter" commands are actually doing anything useful though. And I would like to know if that method leaves out anything useful from the lfsck.log file.

Comment by Andreas Dilger [ 28/May/16 ]

What you've got there is belt and suspenders. The printk setting will itself dump all lfsck messages to the console (dmesg & /var/log/messages). If you don't want that, then debug_daemon is the alternative. The filter setting is indeed irrelevant since all the subsystems are already enabled by default.

Generated at Sat Feb 10 01:49:24 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.