
LFSCK needs to handle parameter "failout" and "dryrun" properly

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.10.0

    Description

      The current implementation does not properly handle the LFSCK parameters "failout" and "dryrun": they take effect only for OI scrub, and do NOT work for namespace LFSCK or layout LFSCK.

          Activity

            [LU-9187] LFSCK needs to handle parameter "failout" and "dryrun" properly
            pjones Peter Jones added a comment -

            It sounds like any remaining work will be tracked under a new ticket


            adilger Andreas Dilger added a comment -

            I think it would be clearer for users if "repaired" only indicated items that were actually fixed, and there were a separate field accounting for the errors found. That should probably be a separate ticket.

            yong.fan nasf (Inactive) added a comment -

            Under 'dryrun' mode, the "fixed" items in the LFSCK output do not mean real fixes; instead, they are the inconsistent items that were found. You can verify whether an inconsistency has actually been fixed by running another dryrun mode LFSCK. If the inconsistency had been fixed during the first dryrun LFSCK, it would NOT be found again during the second dryrun LFSCK.
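
            The interpretation above can be applied mechanically to the status output. The following is a minimal sketch, not part of Lustre: it assumes the status text is a list of "key: value" lines like the lfsck_namespace output reported in this ticket, and the helper names parse_lfsck_status and inconsistencies_reported are hypothetical. It lists every nonzero "*_repaired" counter and labels it as found rather than repaired when "dryrun" appears in the param field.

```python
def parse_lfsck_status(text):
    """Parse 'key: value' lines of LFSCK status output into a dict of strings."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:  # skip lines that are not 'key: value'
            status[key.strip()] = value.strip()
    return status


def inconsistencies_reported(status):
    """Return (is_dryrun, counters), where counters maps each nonzero
    '*_repaired' field to its integer value."""
    params = [p.strip() for p in status.get("param", "").split(",")]
    is_dryrun = "dryrun" in params
    counters = {}
    for key, value in status.items():
        if key.endswith("_repaired") and value.isdigit() and int(value) > 0:
            counters[key] = int(value)
    return is_dryrun, counters


# Excerpt of the lfsck_namespace output quoted in this ticket.
sample = """\
name: lfsck_namespace
param: dryrun,all_targets,orphan
dirent_repaired: 0
linkea_repaired: 1
nlinks_repaired: 0
"""

is_dryrun, counters = inconsistencies_reported(parse_lfsck_status(sample))
# Under dryrun, a nonzero counter means "found", not "repaired".
label = "found (not repaired)" if is_dryrun else "repaired"
for name, count in counters.items():
    print(f"{name}: {count} {label}")
```

            Running the same check on the output of a second dryrun LFSCK shows whether the same counters are still nonzero. The counters do not identify individual objects, so this is only a coarse comparison.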

            simmonsja James A Simmons added a comment -

            The last dry run of LFSCK on our atlas production systems completed, but we are still seeing entries that were fixed even though we have the “dryrun” patch (see output below). So the patch that landed for LU-9187 only partially fixed the issue.

            Also, it looks like no debug data for LFSCK was captured in the debug buffer, even though LFSCK output was configured to be captured. I don’t think the debug buffer was overwritten, because the dump file was only 3.1 MB and the buffer was set to hold 721 MB before overwriting (see output below). We would like, at a minimum, to get output of what was fixed even if we can’t run in dryrun mode.

            [output]
            name: lfsck_namespace
            magic: 0xa0621a0b
            version: 2
            status: completed
            flags: inconsistent
            param: dryrun,all_targets,orphan
            last_completed_time: 1493374063
            time_since_last_completed: 3809 seconds
            latest_start_time: 1493035314
            time_since_latest_start: 342558 seconds
            last_checkpoint_time: 1493374063
            time_since_last_checkpoint: 3809 seconds
            latest_start_position: 77, N/A, N/A
            last_checkpoint_position: 1073740039, N/A, N/A
            first_failure_position: 930672790, N/A, N/A
            checked_phase1: 468966333
            checked_phase2: 272516
            updated_phase1: 1
            updated_phase2: 0
            failed_phase1: 0
            failed_phase2: 14
            directories: 66928508
            dirent_repaired: 0
            linkea_repaired: 1
            nlinks_repaired: 0
            multiple_linked_checked: 547668
            multiple_linked_repaired: 0
            unknown_inconsistency: 0
            unmatched_pairs_repaired: 0
            dangling_repaired: 0
            multiple_referenced_repaired: 0
            bad_file_type_repaired: 0
            lost_dirent_repaired: 0
            local_lost_found_scanned: 0
            local_lost_found_moved: 0
            local_lost_found_skipped: 0
            local_lost_found_failed: 0
            striped_dirs_scanned: 0
            striped_dirs_repaired: 0
            striped_dirs_failed: 0
            striped_dirs_disabled: 0
            striped_dirs_skipped: 0
            striped_shards_scanned: 0
            striped_shards_repaired: 0
            striped_shards_failed: 0
            striped_shards_skipped: 0
            name_hash_repaired: 0
            linkea_overflow_cleared: 0
            success_count: 3
            run_time_phase1: 338639 seconds
            run_time_phase2: 102 seconds
            average_speed_phase1: 1384 items/sec
            average_speed_phase2: 2671 objs/sec
            average_speed_total: 1385 items/sec
            real_time_speed_phase1: N/A
            real_time_speed_phase2: N/A
            current_position: N/A

            We have the dumps if you want them as well.
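
            As a sanity check on the reported figures, the counters and timings in the output above can be cross-checked with simple arithmetic. This sketch assumes (it is not confirmed by the ticket) that each average speed is the checked count divided by the run time, truncated to an integer; the variable names simply mirror the output fields.

```python
# Values copied from the lfsck_namespace output in this ticket.
checked_phase1 = 468966333
checked_phase2 = 272516
run_time_phase1 = 338639  # seconds
run_time_phase2 = 102     # seconds
latest_start_time = 1493035314
last_completed_time = 1493374063

# Assumption: speeds are integer (truncating) divisions of count by time.
avg_phase1 = checked_phase1 // run_time_phase1
avg_phase2 = checked_phase2 // run_time_phase2
avg_total = (checked_phase1 + checked_phase2) // (run_time_phase1 + run_time_phase2)

# Wall-clock duration between start and completion timestamps.
wall_clock = last_completed_time - latest_start_time

print(avg_phase1, avg_phase2, avg_total)  # 1384 2671 1385
print(wall_clock)                         # 338749
```

            Under that assumption the computed values match average_speed_phase1, average_speed_phase2 and average_speed_total, and the wall-clock duration (338749 seconds, a little under four days) is within a few seconds of the sum of the two phase run times.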

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25849/
            Subject: LU-9187 lfsck: handle parameters properly
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 03ab4771dfc6dafa8e6cd72509d8a6ac6c7046da

            gerrit Gerrit Updater added a comment -

            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/25849
            Subject: LU-9187 lfsck: handle parameters properly
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b2a84a3bf8a9dc68616e5a6db65d483e28d349aa

            People

              yong.fan nasf (Inactive)
              yong.fan nasf (Inactive)