[LU-9187] LFSCK needs to handle parameter "failout" and "dryrun" properly Created: 07/Mar/17  Updated: 23/May/17  Resolved: 09/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Critical
Reporter: nasf (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9548 No debug info from lctl set_param deb... Open
is related to LU-9545 report inconsistent instead of "fixe... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The current implementation does not handle the LFSCK parameters "failout" and "dryrun" properly: they only take effect for OI scrub, and do NOT work for namespace LFSCK or layout LFSCK.



 Comments   
Comment by Gerrit Updater [ 07/Mar/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/25849
Subject: LU-9187 lfsck: handle parameters properly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b2a84a3bf8a9dc68616e5a6db65d483e28d349aa

Comment by Gerrit Updater [ 19/Apr/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25849/
Subject: LU-9187 lfsck: handle parameters properly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 03ab4771dfc6dafa8e6cd72509d8a6ac6c7046da

Comment by James A Simmons [ 28/Apr/17 ]

The last dry run for LFSCK on our atlas production systems completed, but the output still reports entries as fixed even though we have the “dryrun” patch applied (see output below). So the patch that landed for LU-9187 only partially fixed the issue.

Also, it looks like no debug data could be captured for the LFSCK run using the debug buffer, even though lfsck output was configured to be captured. I don’t think the debug buffer was overwritten, because the dump file was only 3.1 MB and the buffer was set to hold 721 MB before overwriting (see output below). We would like, at a minimum, to get output of what was fixed, even if we can’t run in dryrun mode.

[output]
name: lfsck_namespace
magic: 0xa0621a0b
version: 2
status: completed
flags: inconsistent
param: dryrun,all_targets,orphan
last_completed_time: 1493374063
time_since_last_completed: 3809 seconds
latest_start_time: 1493035314
time_since_latest_start: 342558 seconds
last_checkpoint_time: 1493374063
time_since_last_checkpoint: 3809 seconds
latest_start_position: 77, N/A, N/A
last_checkpoint_position: 1073740039, N/A, N/A
first_failure_position: 930672790, N/A, N/A
checked_phase1: 468966333
checked_phase2: 272516
updated_phase1: 1
updated_phase2: 0
failed_phase1: 0
failed_phase2: 14
directories: 66928508
dirent_repaired: 0
linkea_repaired: 1
nlinks_repaired: 0
multiple_linked_checked: 547668
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 0
dangling_repaired: 0
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 0
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 0
striped_shards_repaired: 0
striped_shards_failed: 0
striped_shards_skipped: 0
name_hash_repaired: 0
linkea_overflow_cleared: 0
success_count: 3
run_time_phase1: 338639 seconds
run_time_phase2: 102 seconds
average_speed_phase1: 1384 items/sec
average_speed_phase2: 2671 objs/sec
average_speed_total: 1385 items/sec
real_time_speed_phase1: N/A
real_time_speed_phase2: N/A
current_position: N/A

We have the dumps if you want them as well.

Comment by nasf (Inactive) [ 29/Apr/17 ]

Under 'dryrun' mode, the "fixed" items in the LFSCK output do not mean real fixes; instead, they count the inconsistent items that were found. You can verify whether an inconsistency was actually fixed by running another dryrun-mode LFSCK: if the inconsistency had been fixed during the first dryrun LFSCK, it should NOT be found again during the second dryrun LFSCK.
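
To make that reading of the counters concrete, here is a minimal sketch of a helper that parses status text like the output quoted above and labels the "*_repaired" counters according to whether "dryrun" appears in the param line. The function names and the label wording are hypothetical and are not part of Lustre; the sample text is a subset of the fields from the output in this ticket.

```python
def parse_lfsck_status(text):
    """Parse 'key: value' lines of an LFSCK status dump into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as "77, N/A, N/A" stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def summarize_repairs(fields):
    """Return (label, counts) for the non-zero '*_repaired' counters.

    Assumption taken from the comment above: when 'dryrun' is listed in
    'param', these counters mean "inconsistency found", not "fixed".
    """
    params = [p.strip() for p in fields.get("param", "").split(",") if p.strip()]
    dryrun = "dryrun" in params
    label = "found (dryrun, not repaired)" if dryrun else "repaired"
    counts = {}
    for key, value in fields.items():
        if key.endswith("_repaired") and value.isdigit() and int(value) > 0:
            counts[key] = int(value)
    return label, counts


# Subset of the status fields reported in this ticket.
SAMPLE = """\
name: lfsck_namespace
param: dryrun,all_targets,orphan
updated_phase1: 1
dirent_repaired: 0
linkea_repaired: 1
nlinks_repaired: 0
"""

label, counts = summarize_repairs(parse_lfsck_status(SAMPLE))
print(label, counts)
```

For the output in this ticket, the only non-zero counter is linkea_repaired, so under this interpretation the run found one linkEA inconsistency and left it unrepaired.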

Comment by Andreas Dilger [ 04/May/17 ]

I think it would be clearer for users if "repaired" only indicated items that were actually fixed, and there were a separate field counting the errors found. That should probably be a separate ticket.

Comment by Peter Jones [ 09/May/17 ]

It sounds like any remaining work will be tracked under a new ticket.
