Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.5
    • Affects Version/s: Lustre 2.10.3

    Description

      Hi,

      we tested the warble1 hardware as thoroughly as we could for about a week and found no hardware issues. we also replaced the SAS cards and cables just to be safe.

      warble1 is now running kernel 3.10.0-693.21.1.el7.x86_64 and zfs 0.7.8, and has these patches applied:

      usr/src/lustre-2.10.3/lu10212-estale.patch
      usr/src/lustre-2.10.3/lu10707-ksocklnd-revert-jiffies.patch
      usr/src/lustre-2.10.3/lu10707-lnet-route-jiffies.patch
      usr/src/lustre-2.10.3/lu10887-lfsck.patch
      usr/src/lustre-2.10.3/lu8990-put-root.patch
      

      when the dagg MDTs were mounted on warble1, recovery COMPLETED ok, and then about 5 seconds later it hit an LBUG in lfsck.

      ...
      2018-05-02 22:06:06 [ 2919.828067] Lustre: dagg-MDT0000: Client 22c84389-af1f-9970-0e9b-70c3a4861afd (at 10.8.49.155@tcp201) reconnecting
      2018-05-02 22:06:06 [ 2919.828113] Lustre: dagg-MDT0002: Recovery already passed deadline 0:31. If you do not want to wait more, please abort the recovery by force.
      2018-05-02 22:06:38 [ 2951.686211] Lustre: dagg-MDT0002: recovery is timed out, evict stale exports
      2018-05-02 22:06:38 [ 2951.694197] Lustre: dagg-MDT0002: disconnecting 1 stale clients
      2018-05-02 22:06:38 [ 2951.736799] Lustre: 24680:0:(ldlm_lib.c:2544:target_recovery_thread()) too long recovery - read logs
      2018-05-02 22:06:38 [ 2951.746774] Lustre: dagg-MDT0002: Recovery over after 6:24, of 125 clients 124 recovered and 1 was evicted.
      2018-05-02 22:06:38 [ 2951.746775] LustreError: dumping log to /tmp/lustre-log.1525262798.24680
      2018-05-02 22:06:44 [ 2957.910031] LustreError: 33236:0:(dt_object.c:213:dt_mode_to_dft()) LBUG
      2018-05-02 22:06:44 [ 2957.917615] Pid: 33236, comm: lfsck_namespace
      2018-05-02 22:06:44 [ 2957.922760]
      2018-05-02 22:06:44 [ 2957.922760] Call Trace:
      2018-05-02 22:06:44 [ 2957.928142]  [<ffffffffc06457ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
      2018-05-02 22:06:44 [ 2957.935374]  [<ffffffffc064583c>] lbug_with_loc+0x4c/0xb0 [libcfs]
      2018-05-02 22:06:44 [ 2957.942270]  [<ffffffffc0d82573>] dt_mode_to_dft+0x73/0x80 [obdclass]
      2018-05-02 22:06:44 [ 2957.949398]  [<ffffffffc115ac81>] lfsck_namespace_repair_dangling+0x621/0xf40 [lfsck]
      2018-05-02 22:06:44 [ 2957.957911]  [<ffffffffc0d7ea22>] ? htable_lookup+0x102/0x180 [obdclass]
      2018-05-02 22:06:44 [ 2957.965289]  [<ffffffffc1186f4a>] lfsck_namespace_striped_dir_rescan+0x86a/0x1220 [lfsck]
      2018-05-02 22:06:44 [ 2957.974129]  [<ffffffffc115ce71>] lfsck_namespace_assistant_handler_p1+0x18d1/0x1f40 [lfsck]
      2018-05-02 22:06:44 [ 2957.983217]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
      2018-05-02 22:06:44 [ 2957.989375]  [<ffffffffc114098e>] lfsck_assistant_engine+0x3ce/0x20b0 [lfsck]
      2018-05-02 22:06:44 [ 2957.997154]  [<ffffffff810cb0b5>] ? sched_clock_cpu+0x85/0xc0
      2018-05-02 22:06:44 [ 2958.003538]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
      2018-05-02 22:06:44 [ 2958.009648]  [<ffffffff810c7c70>] ? default_wake_function+0x0/0x20
      2018-05-02 22:06:44 [ 2958.016449]  [<ffffffffc11405c0>] ? lfsck_assistant_engine+0x0/0x20b0 [lfsck]
      2018-05-02 22:06:44 [ 2958.024186]  [<ffffffff810b4031>] kthread+0xd1/0xe0
      2018-05-02 22:06:44 [ 2958.029662]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      2018-05-02 22:06:44 [ 2958.035220]  [<ffffffff816c055d>] ret_from_fork+0x5d/0xb0
      2018-05-02 22:06:44 [ 2958.041197]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      2018-05-02 22:06:44 [ 2958.046723]
      2018-05-02 22:06:44 [ 2958.048771] Kernel panic - not syncing: LBUG
      2018-05-02 22:06:44 [ 2958.053576] CPU: 2 PID: 33236 Comm: lfsck_namespace Tainted: P           OE  ------------   3.10.0-693.21.1.el7.x86_64 #1
      2018-05-02 22:06:44 [ 2958.065051] Hardware name: Dell Inc. PowerEdge R740/0JM3W2, BIOS 1.3.7 02/08/2018
      2018-05-02 22:06:44 [ 2958.073066] Call Trace:
      2018-05-02 22:06:44 [ 2958.076060]  [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
      2018-05-02 22:06:44 [ 2958.081738]  [<ffffffff816a8634>] panic+0xe8/0x21f
      2018-05-02 22:06:44 [ 2958.087058]  [<ffffffffc0645854>] lbug_with_loc+0x64/0xb0 [libcfs]
      2018-05-02 22:06:44 [ 2958.093781]  [<ffffffffc0d82573>] dt_mode_to_dft+0x73/0x80 [obdclass]
      2018-05-02 22:06:44 [ 2958.100741]  [<ffffffffc115ac81>] lfsck_namespace_repair_dangling+0x621/0xf40 [lfsck]
      2018-05-02 22:06:45 [ 2958.109091]  [<ffffffffc0d7ea22>] ? htable_lookup+0x102/0x180 [obdclass]
      2018-05-02 22:06:45 [ 2958.116294]  [<ffffffffc1186f4a>] lfsck_namespace_striped_dir_rescan+0x86a/0x1220 [lfsck]
      2018-05-02 22:06:45 [ 2958.124963]  [<ffffffffc115ce71>] lfsck_namespace_assistant_handler_p1+0x18d1/0x1f40 [lfsck]
      2018-05-02 22:06:45 [ 2958.133889]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
      2018-05-02 22:06:45 [ 2958.139866]  [<ffffffffc114098e>] lfsck_assistant_engine+0x3ce/0x20b0 [lfsck]
      2018-05-02 22:06:45 [ 2958.147492]  [<ffffffff810cb0b5>] ? sched_clock_cpu+0x85/0xc0
      2018-05-02 22:06:45 [ 2958.153724]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
      2018-05-02 22:06:45 [ 2958.159689]  [<ffffffff810c7c70>] ? wake_up_state+0x20/0x20
      2018-05-02 22:06:45 [ 2958.165742]  [<ffffffffc11405c0>] ? lfsck_master_engine+0x1310/0x1310 [lfsck]
      2018-05-02 22:06:45 [ 2958.173343]  [<ffffffff810b4031>] kthread+0xd1/0xe0
      2018-05-02 22:06:45 [ 2958.178685]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      2018-05-02 22:06:45 [ 2958.185227]  [<ffffffff816c055d>] ret_from_fork+0x5d/0xb0
      2018-05-02 22:06:45 [ 2958.191065]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      2018-05-02 22:06:45 [ 2958.197613] Kernel Offset: disabled
      

      I've failed the MDTs back to warble2 and mounted them by hand with -o skip_lfsck.
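
For triage, the LBUG line in a console log like the one above can be pulled out mechanically. The following is a minimal Python sketch and not part of the original report: the regular expression is an assumption based only on the `LustreError: <pid>:0:(<file>:<line>:<function>()) <message>` shape visible in this log, and `find_lbugs` and `SAMPLE` are hypothetical names used for illustration.

```python
import re

# Assumed message shape, taken from the log above:
#   LustreError: <pid>:0:(<file>:<line>:<function>()) <message>
LBUG_RE = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)"
)


def find_lbugs(log_text):
    """Return one dict per console line whose message contains 'LBUG'."""
    hits = []
    for raw in log_text.splitlines():
        match = LBUG_RE.search(raw)
        if match and "LBUG" in match.group("msg"):
            hits.append({
                "pid": int(match.group("pid")),
                "file": match.group("file"),
                "line": int(match.group("line")),
                "func": match.group("func"),
            })
    return hits


# Three lines copied from the console log in this report.
SAMPLE = """\
2018-05-02 22:06:38 [ 2951.746775] LustreError: dumping log to /tmp/lustre-log.1525262798.24680
2018-05-02 22:06:44 [ 2957.910031] LustreError: 33236:0:(dt_object.c:213:dt_mode_to_dft()) LBUG
2018-05-02 22:06:44 [ 2957.917615] Pid: 33236, comm: lfsck_namespace
"""

if __name__ == "__main__":
    for hit in find_lbugs(SAMPLE):
        print(hit)
```

The "dumping log" line is skipped because it lacks the `<pid>:0:(...)` prefix, so only the assertion site in `dt_mode_to_dft()` is reported.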

      cheers,
      robin

          Activity

            [LU-10988] LBUG in lfsck

            gerrit Gerrit Updater added a comment -

            John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/32522/
            Subject: LU-10988 lfsck: load object attr when prepare LFSCK request
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: 21d33c11abdb2908fa5fa43bb942a97615c6f7a2

            gerrit Gerrit Updater added a comment -

            Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/32522
            Subject: LU-10988 lfsck: load object attr when prepare LFSCK request
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 6b3b1a51fe759f769f86f8ff968290754a467ff1

            pjones Peter Jones added a comment -

            Landed for 2.12


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/32245/
            Subject: LU-10988 lfsck: load object attr when prepare LFSCK request
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: f7c354096a810df0a9333dace2f538d6dfbe486f

            yong.fan nasf (Inactive) added a comment -

            striped_shards_repaired: 1

            That means a corrupted striped directory has been fixed. The status became "completed", which is expected.

            striped_shards_skipped: 2

            That means the namespace LFSCK could not confirm whether the related shards are inconsistent, because there was not enough evidence.

            Generally, the system is in a healthy state. As for those uncertain items, if they turn out to be corrupted, we can consider fixing them if we actually hit trouble in the future.

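
The counters discussed in this comment can be picked out of a long `lfsck_namespace` dump automatically. The sketch below is an illustration only and assumes the plain `key: value` layout shown in the `lctl get_param` output in this ticket; `parse_counters`, `noteworthy` and `SAMPLE` are hypothetical names, and which counters count as "noteworthy" is a choice made here, not a rule defined by Lustre.

```python
def parse_counters(text):
    """Parse 'key: value' lines into a dict; purely numeric values become ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # not a 'key: value' line
        value = value.strip()
        counters[key] = int(value) if value.isdigit() else value
    return counters


def noteworthy(counters):
    """Return the non-zero repaired/skipped/failed counters (assumed naming)."""
    suffixes = ("_repaired", "_skipped", "_failed")
    return {
        key: value
        for key, value in counters.items()
        if isinstance(value, int) and value > 0
        and (key.endswith(suffixes) or key.startswith("failed_"))
    }


# A few lines copied from the first record of the re-run output in this ticket.
SAMPLE = """\
status: completed
failed_phase1: 0
striped_shards_scanned: 1036368
striped_shards_repaired: 1
striped_shards_failed: 0
striped_shards_skipped: 2
name_hash_repaired: 0
"""

if __name__ == "__main__":
    print(noteworthy(parse_counters(SAMPLE)))
```

Values such as `963 seconds` or the position triples stay as strings, so only the plain integer counters are compared against zero.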
            scadmin SC Admin added a comment -

            yes, I know 'stopped' is not correct. these 3 MDTs have been 'stopped', and 'crashed', 'crashed' respectively for a week now, and mounted with -o skip_lfsck (for obvious reasons).

            now that lfsck is fixed, the 'crashed' ones continued automatically as soon as they were mounted, and then finished. the 'stopped' one (MDT0) remained stopped, as that has been its state for a week or more.

            I have now re-run the whole namespace lfsck again, and output is now this:

            [warble1]root:   lctl get_param -n mdd.dagg-MDT000*.lfsck_namespace
            name: lfsck_namespace
            magic: 0xa0621a0b
            version: 2
            status: completed
            flags:
            param: all_targets,create_mdtobj
            last_completed_time: 1525344435
            time_since_last_completed: 963 seconds
            latest_start_time: 1525342746
            time_since_latest_start: 2652 seconds
            last_checkpoint_time: 1525344435
            time_since_last_checkpoint: 963 seconds
            latest_start_position: 266, N/A, N/A
            last_checkpoint_position: 35184372088832, N/A, N/A
            first_failure_position: N/A, N/A, N/A
            checked_phase1: 8060227
            checked_phase2: 3437356
            updated_phase1: 0
            updated_phase2: 0
            failed_phase1: 0
            failed_phase2: 0
            directories: 1393898
            dirent_repaired: 0
            linkea_repaired: 0
            nlinks_repaired: 0
            multiple_linked_checked: 53149
            multiple_linked_repaired: 0
            unknown_inconsistency: 0
            unmatched_pairs_repaired: 0
            dangling_repaired: 0
            multiple_referenced_repaired: 0
            bad_file_type_repaired: 0
            lost_dirent_repaired: 0
            local_lost_found_scanned: 0
            local_lost_found_moved: 0
            local_lost_found_skipped: 0
            local_lost_found_failed: 0
            striped_dirs_scanned: 0
            striped_dirs_repaired: 0
            striped_dirs_failed: 0
            striped_dirs_disabled: 0
            striped_dirs_skipped: 0
            striped_shards_scanned: 1036368
            striped_shards_repaired: 1
            striped_shards_failed: 0
            striped_shards_skipped: 2
            name_hash_repaired: 0
            linkea_overflow_cleared: 0
            success_count: 6
            run_time_phase1: 1039 seconds
            run_time_phase2: 650 seconds
            average_speed_phase1: 7757 items/sec
            average_speed_phase2: 5288 objs/sec
            average_speed_total: 6807 items/sec
            real_time_speed_phase1: N/A
            real_time_speed_phase2: N/A
            current_position: N/A
            name: lfsck_namespace
            magic: 0xa0621a0b
            version: 2
            status: completed
            flags:
            param: all_targets,create_mdtobj
            last_completed_time: 1525344401
            time_since_last_completed: 997 seconds
            latest_start_time: 1525342746
            time_since_latest_start: 2652 seconds
            last_checkpoint_time: 1525344401
            time_since_last_checkpoint: 997 seconds
            latest_start_position: 266, N/A, N/A
            last_checkpoint_position: 35184372088832, N/A, N/A
            first_failure_position: N/A, N/A, N/A
            checked_phase1: 7743558
            checked_phase2: 3250035
            updated_phase1: 0
            updated_phase2: 0
            failed_phase1: 0
            failed_phase2: 0
            directories: 1381748
            dirent_repaired: 0
            linkea_repaired: 0
            nlinks_repaired: 0
            multiple_linked_checked: 52656
            multiple_linked_repaired: 0
            unknown_inconsistency: 0
            unmatched_pairs_repaired: 0
            dangling_repaired: 0
            multiple_referenced_repaired: 0
            bad_file_type_repaired: 0
            lost_dirent_repaired: 0
            local_lost_found_scanned: 0
            local_lost_found_moved: 0
            local_lost_found_skipped: 0
            local_lost_found_failed: 0
            striped_dirs_scanned: 0
            striped_dirs_repaired: 0
            striped_dirs_failed: 0
            striped_dirs_disabled: 0
            striped_dirs_skipped: 0
            striped_shards_scanned: 1036341
            striped_shards_repaired: 0
            striped_shards_failed: 0
            striped_shards_skipped: 2
            name_hash_repaired: 0
            linkea_overflow_cleared: 0
            success_count: 7
            run_time_phase1: 638 seconds
            run_time_phase2: 616 seconds
            average_speed_phase1: 12137 items/sec
            average_speed_phase2: 5276 objs/sec
            average_speed_total: 8766 items/sec
            real_time_speed_phase1: N/A
            real_time_speed_phase2: N/A
            current_position: N/A
            name: lfsck_namespace
            magic: 0xa0621a0b
            version: 2
            status: completed
            flags:
            param: all_targets,create_mdtobj
            last_completed_time: 1525344398
            time_since_last_completed: 1000 seconds
            latest_start_time: 1525342746
            time_since_latest_start: 2652 seconds
            last_checkpoint_time: 1525344398
            time_since_last_checkpoint: 1000 seconds
            latest_start_position: 266, N/A, N/A
            last_checkpoint_position: 35184372088832, N/A, N/A
            first_failure_position: N/A, N/A, N/A
            checked_phase1: 7746926
            checked_phase2: 3246941
            updated_phase1: 2
            updated_phase2: 0
            failed_phase1: 0
            failed_phase2: 0
            directories: 1382139
            dirent_repaired: 0
            linkea_repaired: 0
            nlinks_repaired: 0
            multiple_linked_checked: 52862
            multiple_linked_repaired: 0
            unknown_inconsistency: 0
            unmatched_pairs_repaired: 0
            dangling_repaired: 0
            multiple_referenced_repaired: 0
            bad_file_type_repaired: 0
            lost_dirent_repaired: 0
            local_lost_found_scanned: 0
            local_lost_found_moved: 0
            local_lost_found_skipped: 0
            local_lost_found_failed: 0
            striped_dirs_scanned: 0
            striped_dirs_repaired: 0
            striped_dirs_failed: 0
            striped_dirs_disabled: 0
            striped_dirs_skipped: 0
            striped_shards_scanned: 1036341
            striped_shards_repaired: 1
            striped_shards_failed: 0
            striped_shards_skipped: 2
            name_hash_repaired: 2
            linkea_overflow_cleared: 0
            success_count: 7
            run_time_phase1: 641 seconds
            run_time_phase2: 613 seconds
            average_speed_phase1: 12085 items/sec
            average_speed_phase2: 5296 objs/sec
            average_speed_total: 8767 items/sec
            real_time_speed_phase1: N/A
            real_time_speed_phase2: N/A
            current_position: N/A
            

            look ok to you?

            anything else we should be doing to repair the damage from LU-10877 or LU-10677?

            cheers,
            robin

            yong.fan nasf (Inactive) added a comment -

            "stopped" is not the expected status. Please re-run the LFSCK with LFSCK debug enabled (lctl set_param debug+=lfsck); then, if it fails again, we can analyze the debug logs.

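
When `lctl get_param -n` is run against several MDTs at once, as in this ticket, the records are printed back to back with no target names, so a status such as "stopped" is easy to miss. The following sketch is an illustration under stated assumptions: it treats every `name:` line as the start of a new record (as in the output in this ticket) and identifies records only by their position in the output. `split_records`, `unfinished` and `SAMPLE` are hypothetical names.

```python
def split_records(text):
    """Split concatenated 'key: value' output into one dict per record.

    A new record is assumed to start at every line whose key is 'name'.
    """
    records = []
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # skip lines that are not 'key: value'
        if key == "name":
            records.append({})
        if records:
            records[-1][key] = value.strip()
    return records


def unfinished(records):
    """Return (position, status) for every record whose status is not 'completed'."""
    return [
        (index, record.get("status"))
        for index, record in enumerate(records)
        if record.get("status") != "completed"
    ]


# Abbreviated from the three-MDT output in this ticket.
SAMPLE = """\
name: lfsck_namespace
status: stopped
failed_phase1: 1
name: lfsck_namespace
status: completed
failed_phase1: 0
name: lfsck_namespace
status: completed
failed_phase1: 0
"""

if __name__ == "__main__":
    for index, status in unfinished(split_records(SAMPLE)):
        print(f"record {index}: status={status}")
```

Mapping a record position back to a specific MDT is left to the reader, since the `-n` output itself carries no target name.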
            scadmin SC Admin added a comment -

            yup, that worked. thanks!

            lfsck on the MDT1,MDT2 where it was 'crashed' has now completed:

            [warble1]root:   lctl get_param -n mdd.dagg-MDT000*.lfsck_namespace
            name: lfsck_namespace
            magic: 0xa0621a0b
            version: 2
            status: stopped
            flags:
            param: all_targets,create_mdtobj
            last_completed_time: 1523258154
            time_since_last_completed: 2084076 seconds
            latest_start_time: 1523283745
            time_since_latest_start: 2058485 seconds
            last_checkpoint_time: 1523284186
            time_since_last_checkpoint: 2058044 seconds
            latest_start_position: 3643308, [0x20001072c:0x16502:0x0], 0x75092e6bed520000
            last_checkpoint_position: 3643308, [0x20001072c:0x16502:0x0], 0x750934ffbf07ffff
            first_failure_position: 3643307, N/A, N/A
            checked_phase1: 2426567
            checked_phase2: 0
            updated_phase1: 0
            updated_phase2: 0
            failed_phase1: 1
            failed_phase2: 0
            directories: 202176
            dirent_repaired: 0
            linkea_repaired: 0
            nlinks_repaired: 0
            multiple_linked_checked: 14706
            multiple_linked_repaired: 0
            unknown_inconsistency: 0
            unmatched_pairs_repaired: 0
            dangling_repaired: 0
            multiple_referenced_repaired: 0
            bad_file_type_repaired: 0
            lost_dirent_repaired: 0
            local_lost_found_scanned: 0
            local_lost_found_moved: 0
            local_lost_found_skipped: 0
            local_lost_found_failed: 0
            striped_dirs_scanned: 0
            striped_dirs_repaired: 0
            striped_dirs_failed: 0
            striped_dirs_disabled: 0
            striped_dirs_skipped: 1
            striped_shards_scanned: 142443
            striped_shards_repaired: 0
            striped_shards_failed: 0
            striped_shards_skipped: 0
            name_hash_repaired: 0
            linkea_overflow_cleared: 0
            success_count: 5
            run_time_phase1: 180 seconds
            run_time_phase2: 0 seconds
            average_speed_phase1: 13480 items/sec
            average_speed_phase2: 0 objs/sec
            average_speed_total: 13480 items/sec
            real_time_speed_phase1: N/A
            real_time_speed_phase2: N/A
            current_position: N/A
            name: lfsck_namespace
            magic: 0xa0621a0b
            version: 2
            status: completed
            flags:
            param: all_targets,create_mdtobj
            last_completed_time: 1525341840
            time_since_last_completed: 390 seconds
            latest_start_time: 1525339672
            time_since_latest_start: 2558 seconds
            last_checkpoint_time: 1525341840
            time_since_last_checkpoint: 390 seconds
            latest_start_position: 2934356, [0x28001c941:0x9c24:0x0], 0x0
            last_checkpoint_position: 35184372088832, N/A, N/A
            first_failure_position: N/A, N/A, N/A
            checked_phase1: 7946529
            checked_phase2: 3312630
            updated_phase1: 1
            updated_phase2: 1
            failed_phase1: 0
            failed_phase2: 0
            directories: 1414222
            dirent_repaired: 0
            linkea_repaired: 0
            nlinks_repaired: 0
            multiple_linked_checked: 52655
            multiple_linked_repaired: 0
            unknown_inconsistency: 0
            unmatched_pairs_repaired: 0
            dangling_repaired: 3
            multiple_referenced_repaired: 0
            bad_file_type_repaired: 0
            lost_dirent_repaired: 1
            local_lost_found_scanned: 0
            local_lost_found_moved: 0
            local_lost_found_skipped: 0
            local_lost_found_failed: 0
            striped_dirs_scanned: 0
            striped_dirs_repaired: 0
            striped_dirs_failed: 0
            striped_dirs_disabled: 0
            striped_dirs_skipped: 0
            striped_shards_scanned: 1060672
            striped_shards_repaired: 0
            striped_shards_failed: 0
            striped_shards_skipped: 0
            name_hash_repaired: 0
            linkea_overflow_cleared: 0
            success_count: 6
            run_time_phase1: 1488 seconds
            run_time_phase2: 931 seconds
            average_speed_phase1: 5340 items/sec
            average_speed_phase2: 3558 objs/sec
            average_speed_total: 4654 items/sec
            real_time_speed_phase1: N/A
            real_time_speed_phase2: N/A
            current_position: N/A
            name: lfsck_namespace
            magic: 0xa0621a0b
            version: 2
            status: completed
            flags:
            param: all_targets,create_mdtobj
            last_completed_time: 1525341827
            time_since_last_completed: 403 seconds
            latest_start_time: 1525339070
            time_since_latest_start: 3160 seconds
            last_checkpoint_time: 1525341827
            time_since_last_checkpoint: 403 seconds
            latest_start_position: 2965156, N/A, N/A
            last_checkpoint_position: 35184372088832, N/A, N/A
            first_failure_position: N/A, N/A, N/A
            checked_phase1: 7917065
            checked_phase2: 3318839
            updated_phase1: 0
            updated_phase2: 1
            failed_phase1: 0
            failed_phase2: 0
            directories: 1425028
            dirent_repaired: 0
            linkea_repaired: 0
            nlinks_repaired: 0
            multiple_linked_checked: 52856
            multiple_linked_repaired: 0
            unknown_inconsistency: 0
            unmatched_pairs_repaired: 0
            dangling_repaired: 0
            multiple_referenced_repaired: 0
            bad_file_type_repaired: 0
            lost_dirent_repaired: 1
            local_lost_found_scanned: 0
            local_lost_found_moved: 0
            local_lost_found_skipped: 0
            local_lost_found_failed: 0
            striped_dirs_scanned: 0
            striped_dirs_repaired: 0
            striped_dirs_failed: 0
            striped_dirs_disabled: 0
            striped_dirs_skipped: 0
            striped_shards_scanned: 1068469
            striped_shards_repaired: 0
            striped_shards_failed: 0
            striped_shards_skipped: 0
            name_hash_repaired: 0
            linkea_overflow_cleared: 0
            success_count: 6
            run_time_phase1: 2016 seconds
            run_time_phase2: 919 seconds
            average_speed_phase1: 3927 items/sec
            average_speed_phase2: 3611 objs/sec
            average_speed_total: 3828 items/sec
            real_time_speed_phase1: N/A
            real_time_speed_phase2: N/A
            current_position: N/A
            

            MDT0 still needs to be re-scanned I think though 'cos it was 'stopped' as part of the crashing stuff in LU-10887

            I'll start a whole new scan again now...

            lctl lfsck_start -M dagg-MDT0000 -t namespace -A -r -C
            

            cheers,
            robin


            yong.fan nasf (Inactive) added a comment -

            scadmin, thanks for the new test results. I have updated the patch https://review.whamcloud.com/32245, which should fix the problem you hit. Would you please try again with patch set 2 or newer? Thanks!
            scadmin SC Admin added a comment -

            no probs.

            2018-05-03 01:02:53 [10209.454437] LustreError: 35645:0:(dt_object.c:213:dt_mode_to_dft()) ASSERTION( 0 ) failed: invalid mode 0
            2018-05-03 01:02:53 [10209.464798] LustreError: 35645:0:(dt_object.c:213:dt_mode_to_dft()) LBUG
            2018-05-03 01:02:53 [10209.472260] Pid: 35645, comm: lfsck_namespace
            2018-05-03 01:02:53 [10209.477336] 
            2018-05-03 01:02:53 [10209.477336] Call Trace:
            2018-05-03 01:02:53 [10209.482611]  [<ffffffffc05f67ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
            2018-05-03 01:02:53 [10209.489784]  [<ffffffffc05f683c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            2018-05-03 01:02:53 [10209.496628]  [<ffffffffc0b275a1>] dt_mode_to_dft+0xa1/0xb0 [obdclass]
            2018-05-03 01:02:53 [10209.503721]  [<ffffffffc14b5c81>] lfsck_namespace_repair_dangling+0x621/0xf40 [lfsck]
            2018-05-03 01:02:53 [10209.512193]  [<ffffffffc0b23a22>] ? htable_lookup+0x102/0x180 [obdclass]
            2018-05-03 01:02:53 [10209.519533]  [<ffffffffc14e1f4a>] lfsck_namespace_striped_dir_rescan+0x86a/0x1220 [lfsck]
            2018-05-03 01:02:53 [10209.528336]  [<ffffffffc14b7e71>] lfsck_namespace_assistant_handler_p1+0x18d1/0x1f40 [lfsck]
            2018-05-03 01:02:53 [10209.537370]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
            2018-05-03 01:02:53 [10209.543458]  [<ffffffffc149b98e>] lfsck_assistant_engine+0x3ce/0x20b0 [lfsck]
            2018-05-03 01:02:53 [10209.551182]  [<ffffffff810cb0b5>] ? sched_clock_cpu+0x85/0xc0
            2018-05-03 01:02:53 [10209.557512]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
            2018-05-03 01:02:53 [10209.563565]  [<ffffffff810c7c70>] ? default_wake_function+0x0/0x20
            2018-05-03 01:02:53 [10209.570305]  [<ffffffffc149b5c0>] ? lfsck_assistant_engine+0x0/0x20b0 [lfsck]
            2018-05-03 01:02:53 [10209.578004]  [<ffffffff810b4031>] kthread+0xd1/0xe0
            2018-05-03 01:02:53 [10209.583438]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
            2018-05-03 01:02:53 [10209.588968]  [<ffffffff816c055d>] ret_from_fork+0x5d/0xb0
            2018-05-03 01:02:53 [10209.594914]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
            2018-05-03 01:02:53 [10209.600428] 
            2018-05-03 01:02:53 [10209.602473] Kernel panic - not syncing: LBUG
            

            cheers,
            robin
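
            For anyone comparing this trace with the one in the description, here is a minimal Python sketch that pulls the call chain out of console output like the above. It assumes the frame layout shown in this ticket (`[<address>] function+0xoff/0xsize [module]`, with a leading `?` on frames the kernel could not verify as part of the call chain). The names `FRAME_RE`, `parse_frames`, `reliable_chain` and `SAMPLE` are made up for illustration and are not part of any Lustre tool.

```python
import re

# One kernel stack frame, for example:
#   [<ffffffffc0b275a1>] dt_mode_to_dft+0xa1/0xb0 [obdclass]
#   [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
# The "[10209.496628]" uptime stamp is not matched because it has no "<".
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(log_text):
    """Return (function, module, reliable) for every stack frame found."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((
                match.group("func"),
                match.group("module"),  # None for frames in the core kernel
                match.group("unreliable") is None,
            ))
    return frames


def reliable_chain(log_text, module=None):
    """Function names of frames without '?', optionally limited to one module."""
    return [
        func
        for func, mod, reliable in parse_frames(log_text)
        if reliable and (module is None or mod == module)
    ]


# A few lines copied from the trace in this comment.
SAMPLE = """\
2018-05-03 01:02:53 [10209.489784]  [<ffffffffc05f683c>] lbug_with_loc+0x4c/0xb0 [libcfs]
2018-05-03 01:02:53 [10209.496628]  [<ffffffffc0b275a1>] dt_mode_to_dft+0xa1/0xb0 [obdclass]
2018-05-03 01:02:53 [10209.503721]  [<ffffffffc14b5c81>] lfsck_namespace_repair_dangling+0x621/0xf40 [lfsck]
2018-05-03 01:02:53 [10209.512193]  [<ffffffffc0b23a22>] ? htable_lookup+0x102/0x180 [obdclass]
2018-05-03 01:02:53 [10209.519533]  [<ffffffffc14e1f4a>] lfsck_namespace_striped_dir_rescan+0x86a/0x1220 [lfsck]
"""

if __name__ == "__main__":
    for name in reliable_chain(SAMPLE):
        print(name)
```

            Restricting the chain to the `lfsck` module, for example `reliable_chain(SAMPLE, module="lfsck")`, leaves only the LFSCK functions that led into `dt_mode_to_dft()`, which makes it easier to see whether two crashes came through the same path.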


            People

              yong.fan nasf (Inactive)
              scadmin SC Admin
