Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6109

LFSCK gets "inconsistent" flag and won't stop

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.7.0, Lustre 2.8.0
    • Lustre 2.7.0
    • OpenSFS cluster with 2 MDSs with 1MDT each, 3 OSSs with two OSTs each and three clients running Lustre (master) build # 2812
    • 3
    • 17016

    Description

      While running the performance test (3.3.2) from the LFSCK Phase 3 test plan, I’ve run into a problem. I installed Lustre on all nodes and created a new file system. I don’t know if this matters, but the file system is empty when I run this test. I then run test 3.3.2 from the test plan; set fail_loc on the MDSs, create a small number of objects and run LFSCK. For some reason, LFSCk has flag “inconsistent” on both MDSs.

      On mds01:

      # cat /proc/fs/lustre/mdd/scratch-MDT0000/lfsck_namespace 
      name: lfsck_namespace
      magic: 0xa0629d03
      version: 2
      status: scanning-phase1
      flags: inconsistent
      param: all_targets,create_ostobj,create_mdtobj
      time_since_last_completed: 1155 seconds
      time_since_latest_start: 27 seconds
      time_since_last_checkpoint: N/A
      latest_start_position: 77, N/A, N/A
      last_checkpoint_position: N/A, N/A, N/A
      first_failure_position: N/A, N/A, N/A
      checked_phase1: 461
      checked_phase2: 0
      updated_phase1: 208
      updated_phase2: 0
      failed_phase1: 0
      failed_phase2: 0
      directories: 27
      dirent_repaired: 0
      linkea_repaired: 208
      nlinks_repaired: 0
      multiple_linked_checked: 187
      multiple_linked_repaired: 0
      unknown_inconsistency: 0
      unmatched_pairs_repaired: 0
      dangling_repaired: 0
      multiple_referenced_repaired: 0
      bad_file_type_repaired: 0
      lost_dirent_repaired: 0
      local_lost_found_scanned: 0
      local_lost_found_moved: 0
      local_lost_found_skipped: 0
      local_lost_found_failed: 0
      striped_dirs_scanned: 0
      striped_dirs_repaired: 0
      striped_dirs_failed: 0
      striped_dirs_disabled: 0
      striped_dirs_skipped: 0
      striped_shards_scanned: 2
      striped_shards_repaired: 0
      striped_shards_failed: 0
      striped_shards_skipped: 0
      name_hash_repaired: 0
      success_count: 1
      run_time_phase1: 28 seconds
      run_time_phase2: 0 seconds
      average_speed_phase1: 16 items/sec
      average_speed_phase2: N/A
      real_time_speed_phase1: 16 items/sec
      real_time_speed_phase2: N/A
      current_position: 238551041, N/A, N/A
      

      There is nothing in dmesg except confirmation that fail_loc was set on the MDTs:

      # dmesg
      Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x0000000380000400-0x00000003c0000400):0:mdt
      Lustre: *** cfs_fail_loc=1603, val=0***
      Lustre: *** cfs_fail_loc=1603, val=0***
      Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x00000003c0000400-0x0000000400000400):1:mdt
      Lustre: *** cfs_fail_loc=1603, val=0***
      Lustre: Skipped 101 previous similar messages
      Lustre: *** cfs_fail_loc=1603, val=0***
      Lustre: Skipped 3 previous similar messages
      Lustre: *** cfs_fail_loc=1603, val=0***
      Lustre: Skipped 100 previous similar messages
      Lustre: *** cfs_fail_loc=1603, val=0***
      Lustre: *** cfs_fail_loc=1603, val=0***
      Lustre: Skipped 101 previous similar messages
      Lustre: *** cfs_fail_loc=1603, val=0***
      Lustre: Skipped 3 previous similar messages
      

      If I wait a few minutes, the following is in dmesg:

      INFO: task lfsck_namespace:469 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      lfsck_namespa D 0000000000000006     0   469      2 0x00000080
       ffff880478e87ba0 0000000000000046 0000000000000000 ffff880a9849b7e0
       ffff880a9849b7e0 ffff880d7d69f000 ffff880478e87ba0 ffffffffa06d57e9
       ffff880478e66638 ffff880478e87fd8 000000000000fbc8 ffff880478e66638
      Call Trace:
       [<ffffffffa06d57e9>] ? lu_object_find_try+0x99/0x2b0 [obdclass]
       [<ffffffffa06d5a3d>] lu_object_find_at+0x3d/0xe0 [obdclass]
       [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
       [<ffffffffa06d5b1f>] lu_object_find_slice+0x1f/0x80 [obdclass]
       [<ffffffffa0f33619>] lfsck_namespace_assistant_handler_p1+0x2d9/0x1f70 [lfsck]
       [<ffffffffa057d472>] ? cfs_hash_bd_from_key+0x42/0xd0 [libcfs]
       [<ffffffffa0f23347>] lfsck_assistant_engine+0x497/0x1e00 [lfsck]
       [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
       [<ffffffffa0f22eb0>] ? lfsck_assistant_engine+0x0/0x1e00 [lfsck]
       [<ffffffff8109abf6>] kthread+0x96/0xa0
       [<ffffffff8100c20a>] child_rip+0xa/0x20
       [<ffffffff8109ab60>] ? kthread+0x0/0xa0
       [<ffffffff8100c200>] ? child_rip+0x0/0x20
      INFO: task lfsck_namespace:469 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      lfsck_namespa D 0000000000000006     0   469      2 0x00000080
       ffff880478e87ba0 0000000000000046 0000000000000000 ffff880a9849b7e0
       ffff880a9849b7e0 ffff880d7d69f000 ffff880478e87ba0 ffffffffa06d57e9
       ffff880478e66638 ffff880478e87fd8 000000000000fbc8 ffff880478e66638
      Call Trace:
       [<ffffffffa06d57e9>] ? lu_object_find_try+0x99/0x2b0 [obdclass]
       [<ffffffffa06d5a3d>] lu_object_find_at+0x3d/0xe0 [obdclass]
       [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
       [<ffffffffa06d5b1f>] lu_object_find_slice+0x1f/0x80 [obdclass]
       [<ffffffffa0f33619>] lfsck_namespace_assistant_handler_p1+0x2d9/0x1f70 [lfsck]
       [<ffffffffa057d472>] ? cfs_hash_bd_from_key+0x42/0xd0 [libcfs]
       [<ffffffffa0f23347>] lfsck_assistant_engine+0x497/0x1e00 [lfsck]
       [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
       [<ffffffffa0f22eb0>] ? lfsck_assistant_engine+0x0/0x1e00 [lfsck]
       [<ffffffff8109abf6>] kthread+0x96/0xa0
       [<ffffffff8100c20a>] child_rip+0xa/0x20
       [<ffffffff8109ab60>] ? kthread+0x0/0xa0
       [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      On mds02, the following is the state of LFSCK on mds02:

      # cat /proc/fs/lustre/mdd/scratch-MDT0001/lfsck_namespace 
      name: lfsck_namespace
      magic: 0xa0629d03
      version: 2
      status: scanning-phase2
      flags: scanned-once,inconsistent
      param: all_targets,create_ostobj,create_mdtobj
      time_since_last_completed: 2533 seconds
      time_since_latest_start: 1405 seconds
      time_since_last_checkpoint: 1405 seconds
      latest_start_position: 77, N/A, N/A
      last_checkpoint_position: 260571137, [0x3c0000401:0x4:0x0], 0x5d2c710b01284458
      first_failure_position: N/A, N/A, N/A
      checked_phase1: 16
      checked_phase2: 0
      updated_phase1: 0
      updated_phase2: 0
      failed_phase1: 0
      failed_phase2: 0
      directories: 6
      dirent_repaired: 0
      linkea_repaired: 4
      nlinks_repaired: 0
      multiple_linked_checked: 0
      multiple_linked_repaired: 0
      unknown_inconsistency: 0
      unmatched_pairs_repaired: 0
      dangling_repaired: 0
      multiple_referenced_repaired: 0
      bad_file_type_repaired: 0
      lost_dirent_repaired: 0
      local_lost_found_scanned: 0
      local_lost_found_moved: 0
      local_lost_found_skipped: 0
      local_lost_found_failed: 0
      striped_dirs_scanned: 0
      striped_dirs_repaired: 0
      striped_dirs_failed: 0
      striped_dirs_disabled: 0
      striped_dirs_skipped: 0
      striped_shards_scanned: 2
      striped_shards_repaired: 0
      striped_shards_failed: 0
      striped_shards_skipped: 0
      name_hash_repaired: 0
      success_count: 1
      run_time_phase1: 0 seconds
      run_time_phase2: 1405 seconds
      average_speed_phase1: 16 items/sec
      average_speed_phase2: 0 objs/sec
      real_time_speed_phase1: N/A
      real_time_speed_phase2: 0 objs/sec
      current_position: [0x0:0x0:0x0]
      

      I cannot stop nor start LFSCK again. So, I’m not sure what state it is in.

      Trying to start LFSCK again fails:

      # lctl lfsck_start -A -M scratch-MDT0000 -c -C --type namespace 
      Fail to start LFSCK: Operation already in progress
      

      Stopping LFSCK works on mds02:

      # cat /proc/fs/lustre/mdd/scratch-MDT0001/lfsck_namespace 
      name: lfsck_namespace
      magic: 0xa0629d03
      version: 2
      status: stopped
      flags: scanned-once,inconsistent
      param: all_targets,create_ostobj,create_mdtobj
      time_since_last_completed: 2890 seconds
      time_since_latest_start: 1762 seconds
      time_since_last_checkpoint: 10 seconds
      latest_start_position: 77, N/A, N/A
      last_checkpoint_position: 260571137, [0x3c0000401:0x4:0x0], 0x5d2c710b01284458
      first_failure_position: N/A, N/A, N/A
      checked_phase1: 16
      checked_phase2: 0
      updated_phase1: 0
      updated_phase2: 0
      failed_phase1: 0
      failed_phase2: 0
      directories: 6
      dirent_repaired: 0
      linkea_repaired: 4
      nlinks_repaired: 0
      multiple_linked_checked: 0
      multiple_linked_repaired: 0
      unknown_inconsistency: 0
      unmatched_pairs_repaired: 0
      dangling_repaired: 0
      multiple_referenced_repaired: 0
      bad_file_type_repaired: 0
      lost_dirent_repaired: 0
      local_lost_found_scanned: 0
      local_lost_found_moved: 0
      local_lost_found_skipped: 0
      local_lost_found_failed: 0
      striped_dirs_scanned: 0
      striped_dirs_repaired: 0
      striped_dirs_failed: 0
      striped_dirs_disabled: 0
      striped_dirs_skipped: 0
      striped_shards_scanned: 2
      striped_shards_repaired: 0
      striped_shards_failed: 0
      striped_shards_skipped: 0
      name_hash_repaired: 0
      success_count: 1
      run_time_phase1: 0 seconds
      run_time_phase2: 1753 seconds
      average_speed_phase1: 16 items/sec
      average_speed_phase2: 0 objs/sec
      real_time_speed_phase1: N/A
      real_time_speed_phase2: N/A
      current_position: N/A
      

      but does not stop LFSCK on mds01

      # cat /proc/fs/lustre/mdd/scratch-MDT0000/lfsck_namespace 
      name: lfsck_namespace
      magic: 0xa0629d03
      version: 2
      status: scanning-phase1
      flags: inconsistent
      param: all_targets,create_ostobj,create_mdtobj
      time_since_last_completed: 2953 seconds
      time_since_latest_start: 1825 seconds
      time_since_last_checkpoint: N/A
      latest_start_position: 77, N/A, N/A
      last_checkpoint_position: N/A, N/A, N/A
      first_failure_position: N/A, N/A, N/A
      checked_phase1: 461
      checked_phase2: 0
      updated_phase1: 208
      updated_phase2: 0
      failed_phase1: 0
      failed_phase2: 0
      directories: 27
      dirent_repaired: 0
      linkea_repaired: 208
      nlinks_repaired: 0
      multiple_linked_checked: 187
      multiple_linked_repaired: 0
      unknown_inconsistency: 0
      unmatched_pairs_repaired: 0
      dangling_repaired: 0
      multiple_referenced_repaired: 0
      bad_file_type_repaired: 0
      lost_dirent_repaired: 0
      local_lost_found_scanned: 0
      local_lost_found_moved: 0
      local_lost_found_skipped: 0
      local_lost_found_failed: 0
      striped_dirs_scanned: 0
      striped_dirs_repaired: 0
      striped_dirs_failed: 0
      striped_dirs_disabled: 0
      striped_dirs_skipped: 0
      striped_shards_scanned: 2
      striped_shards_repaired: 0
      striped_shards_failed: 0
      striped_shards_skipped: 0
      name_hash_repaired: 0
      success_count: 1
      run_time_phase1: 1826 seconds
      run_time_phase2: 0 seconds
      average_speed_phase1: 0 items/sec
      average_speed_phase2: N/A
      real_time_speed_phase1: 0 items/sec
      real_time_speed_phase2: N/A
      current_position: 238551041, N/A, N/A
      

      Attachments

        1. lfsck_log_mds01_13511.txt
          402 kB
        2. lfsck_log_mds01.txt
          403 kB
        3. lfsck_log_mds02_13511.txt
          11 kB
        4. lfsck_log_mds02.txt
          10 kB
        5. lfsck_logs.tgz
          0.2 kB

        Activity

          People

            yong.fan nasf (Inactive)
            jamesanunez James Nunez (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: