LFSCK 3: ‘lctl lfsck_start -t namespace’ Not Progressing Under Remove Workload

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.7.0
    • Lustre 2.7.0
    • OpenSFS cluster with two MDSs with one MDT each, three OSSs and three clients. Lustre tag 2.6.54 build 2725
    • 3
    • 16456

    Description

      While running the LFSCK Phase 3 test plan, I created 10,000 objects (files, remote directories, local directories, and links), then ran:

      # lctl lfsck_start -A -M scratch-MDT0000 -r -t namespace -c -C
      Started LFSCK on the device scratch-MDT0000: scrub namespace
      

      On the client, I then deleted all files and directories in the file system. At some point LFSCK hung; ‘lctl lfsck_stop’ does not stop it and itself appears to hang. LFSCK progresses to a certain point and then stalls: the time counters keep advancing, but none of the other counters increase, and the status stays at “scanning-phase1”.

      # cat /proc/fs/lustre/mdd/scratch-MDT0000/lfsck_namespace 
      name: lfsck_namespace
      magic: 0xa0629d03
      version: 2
      status: scanning-phase1
      flags:
      param: all_targets,create_ostobj,
      time_since_last_completed: 59865 seconds
      time_since_latest_start: 8714 seconds
      time_since_last_checkpoint: N/A
      latest_start_position: 77, N/A, N/A
      last_checkpoint_position: N/A, N/A, N/A
      first_failure_position: N/A, N/A, N/A
      checked_phase1: 3347202
      checked_phase2: 0
      updated_phase1: 0
      updated_phase2: 0
      failed_phase1: 0
      failed_phase2: 0
      directories: 182634
      dirent_repaired: 0
      linkea_repaired: 0
      nlinks_repaired: 0
      multiple_linked_checked: 0
      multiple_linked_repaired: 0
      unknown_inconsistency: 0
      unmatched_pairs_repaired: 0
      dangling_repaired: 0
      multiple_referenced_repaired: 0
      bad_file_type_repaired: 0
      lost_dirent_repaired: 0
      local_lost_found_scanned: 0
      local_lost_found_moved: 0
      local_lost_found_skipped: 0
      local_lost_found_failed: 0
      striped_dirs_scanned: 0
      striped_dirs_repaired: 0
      striped_dirs_failed: 0
      striped_dirs_disabled: 0
      striped_dirs_skipped: 0
      striped_shards_scanned: 1560
      striped_shards_repaired: 0
      striped_shards_failed: 0
      striped_shards_skipped: 0
      name_hash_repaired: 0
      success_count: 23
      run_time_phase1: 8714 seconds
      run_time_phase2: 0 seconds
      average_speed_phase1: 384 items/sec
      average_speed_phase2: N/A
      real_time_speed_phase1: 384 items/sec
      real_time_speed_phase2: N/A
      current_position: 180358673, N/A, N/A
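
The stall symptom described above (time counters advancing while the progress counters stay frozen) can be checked mechanically by sampling lfsck_namespace twice. The following is only a minimal sketch, not part of LFSCK or lctl: the helper names and the choice of which counters to compare are assumptions, and the second sample is invented for illustration.

```python
def parse_lfsck_status(text):
    """Turn 'key: value' lines of lfsck_namespace output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def leading_int(value):
    """Return the leading integer of e.g. '8714 seconds', or None for 'N/A' or ''."""
    parts = value.split()
    token = parts[0].rstrip(",") if parts else ""
    return int(token) if token.isdigit() else None


def looks_stalled(before_text, after_text,
                  counters=("checked_phase1", "checked_phase2", "current_position")):
    """True if the scan is still in a scanning state, time moved on, and no counter changed."""
    before = parse_lfsck_status(before_text)
    after = parse_lfsck_status(after_text)
    still_scanning = after.get("status", "").startswith("scanning")
    t0 = leading_int(before.get("time_since_latest_start", ""))
    t1 = leading_int(after.get("time_since_latest_start", ""))
    clock_advanced = t0 is not None and t1 is not None and t1 > t0
    counters_frozen = all(before.get(name) == after.get(name) for name in counters)
    return still_scanning and clock_advanced and counters_frozen


# First sample: a few fields copied from the output above.
FIRST = """\
status: scanning-phase1
time_since_latest_start: 8714 seconds
checked_phase1: 3347202
checked_phase2: 0
current_position: 180358673, N/A, N/A
"""
# Hypothetical later samples (values invented for illustration).
SECOND = FIRST.replace("8714 seconds", "8774 seconds")
PROGRESSING = SECOND.replace("checked_phase1: 3347202", "checked_phase1: 3370000")

print(looks_stalled(FIRST, SECOND))       # True
print(looks_stalled(FIRST, PROGRESSING))  # False
```

In practice the two texts would come from reading /proc/fs/lustre/mdd/scratch-MDT0000/lfsck_namespace at two different times; the interval between samples is a judgment call.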
      

      On the MDT with index 0, dmesg contains:

      INFO: task lfsck_namespace:1210 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.29.2.el6_lustre.g8fab48a.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      lfsck_namespa D 0000000000000001     0  1210      2 0x00000080
       ffff880485cfbac0 0000000000000046 0000000000000000 ffff88050b8c13e0
       ffff88050b8c13e0 ffff881023077000 ffff880485cfbac0 ffffffffa06d4e39
       ffff88047443c638 ffff880485cfbfd8 000000000000fbc8 ffff88047443c638
      Call Trace:
       [<ffffffffa06d4e39>] ? lu_object_find_try+0x99/0x2b0 [obdclass]
       [<ffffffffa06d5085>] lu_object_find_at+0x35/0x100 [obdclass]
       [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
       [<ffffffffa04f14b3>] ? ldiskfs_mark_inode_dirty+0x83/0x1f0 [ldiskfs]
       [<ffffffffa06d518f>] lu_object_find_slice+0x1f/0x80 [obdclass]
       [<ffffffffa0f8f958>] lfsck_namespace_handle_striped_master+0x118/0xb10 [lfsck]
       [<ffffffffa0b5de4c>] ? fld_local_lookup+0x6c/0x290 [fld]
       [<ffffffffa0f5d23f>] lfsck_namespace_assistant_handler_p1+0x5bf/0x1f40 [lfsck]
       [<ffffffffa06d3743>] ? lu_object_free+0x113/0x1a0 [obdclass]
       [<ffffffffa057b482>] ? cfs_hash_bd_from_key+0x42/0xd0 [libcfs]
       [<ffffffff81283a85>] ? _atomic_dec_and_lock+0x55/0x80
       [<ffffffffa0f4d197>] lfsck_assistant_engine+0x497/0x1c50 [lfsck]
       [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
       [<ffffffffa0f4cd00>] ? lfsck_assistant_engine+0x0/0x1c50 [lfsck]
       [<ffffffff8109abf6>] kthread+0x96/0xa0
       [<ffffffff8100c20a>] child_rip+0xa/0x20
       [<ffffffff8109ab60>] ? kthread+0x0/0xa0
       [<ffffffff8100c200>] ? child_rip+0x0/0x20
      [The same hung-task report and call trace for lfsck_namespace:1210 repeats nine more times.]
      

      Similar stack traces can be found on the second MDS/MDT, which is also stuck in “scanning-phase1”.
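
When comparing hung-task output across MDSs, it can help to reduce each report to the blocked task and its call chain. The sketch below is an illustration only: the function and regular expressions are hypothetical helpers, and the sample is an abbreviated excerpt of the trace above. Frames prefixed with "?" are skipped because the kernel marks them as unreliable stack entries.

```python
import re
from collections import Counter

HUNG_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_hung_tasks(dmesg_text):
    """Count hung-task reports, keyed by ((task, pid), reliable call chain)."""
    reports = Counter()
    task = None   # (name, pid) of the report currently being read
    chain = []    # reliable frames seen so far, innermost first
    for line in dmesg_text.splitlines():
        header = HUNG_RE.search(line)
        if header:
            if task is not None:
                reports[(task, tuple(chain))] += 1
            task = (header.group(1), int(header.group(2)))
            chain = []
            continue
        frame = FRAME_RE.search(line)
        # group(1) is set only for '?' frames, which are skipped.
        if task is not None and frame and frame.group(1) is None:
            chain.append(frame.group(2))
    if task is not None:
        reports[(task, tuple(chain))] += 1
    return reports


# Abbreviated excerpt of the report above, repeated twice for the demo.
REPORT = """\
INFO: task lfsck_namespace:1210 blocked for more than 120 seconds.
Call Trace:
 [<ffffffffa06d4e39>] ? lu_object_find_try+0x99/0x2b0 [obdclass]
 [<ffffffffa06d5085>] lu_object_find_at+0x35/0x100 [obdclass]
 [<ffffffffa06d518f>] lu_object_find_slice+0x1f/0x80 [obdclass]
 [<ffffffffa0f8f958>] lfsck_namespace_handle_striped_master+0x118/0xb10 [lfsck]
 [<ffffffffa0f5d23f>] lfsck_namespace_assistant_handler_p1+0x5bf/0x1f40 [lfsck]
 [<ffffffffa0f4d197>] lfsck_assistant_engine+0x497/0x1c50 [lfsck]
 [<ffffffff8109abf6>] kthread+0x96/0xa0
"""
SUMMARY = summarize_hung_tasks(REPORT * 2)

for ((name, pid), chain), count in SUMMARY.items():
    print(f"{name}:{pid} reported {count}x, blocked in {chain[0]}")
    print("  called via: " + " <- ".join(chain[1:]))
```

For this excerpt the innermost reliable frame is lu_object_find_at, reached from lfsck_namespace_handle_striped_master, which matches where the assistant thread is blocked in the full trace.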

          Activity

            [LU-5885] LFSCK 3: ‘lctl lfsck_start -t namespace’ Not Progressing Under Remove Workload

            yong.fan nasf (Inactive) added a comment:

            The patch has been landed to master.

            gerrit Gerrit Updater added a comment:

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12741/
            Subject: LU-5885 lfsck: deadlock when remove striped dir
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: f0137d89fd40ae66aa1d3a180e4e5a6240009dcc

            yong.fan nasf (Inactive) added a comment:

            Thanks James for the verification!

            jamesanunez James Nunez (Inactive) added a comment:

            With your patch, http://review.whamcloud.com/#/c/12741/ , I can run the remove workload and create files/directories/etc. and LFSCK does not hang. I've tried this four times and cannot get LFSCK to hang. So, this patch fixed the LFSCK hang problem.

            yong.fan nasf (Inactive) added a comment:

            James, would you please verify the patch http://review.whamcloud.com/#/c/12741/ ? Thanks!

            gerrit Gerrit Updater added a comment:

            Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/12741
            Subject: LU-5885 lfsck: deadlock when remove striped dir
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4ab1b1b15835879a145002221bb4cc492e57c791

            jamesanunez James Nunez (Inactive) added a comment:

            I ran this test again for lustre-master tag 2.6.90 build #2734 and was able to reproduce this issue very quickly. I used a workload similar to the one described above: I ran test 3.3.3, creating about 130 directories with 10,000 objects each, then ran the same workload in a different directory, started LFSCK on both MDSs, and then went back and removed the directories/objects created by test 3.3.3.

            I captured kernel logs on both MDSs. They are at uploads/LU-5885/lfsck_log_1.txt (MDS0) and lfsck_log_2.txt (MDS1).

            When looking at lfsck_namespace, there might be something wrong with the real-time timers that calculate the object scanning rate: real_time_speed_phase1 never decreases, but average_speed_phase1 does. In this case, where LFSCK seems to hang, meaning it is no longer scanning objects, I’d expect real_time_speed to decrease, but it just keeps growing:

            real_time_speed_phase1: 21441823787665 items/sec
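
The expectation in the comment above can be made concrete with a little arithmetic. This is a sketch under stated assumptions, not LFSCK's actual calculation: it assumes the average speed is checked_phase1 divided by run_time_phase1 (which agrees with the figures in the description, 3347202 items over 8714 seconds giving 384 items/sec), and it models the real-time speed the way the reporter expected it to behave, as the change in checked items over a sampling interval. Both function names are hypothetical.

```python
def average_speed(checked, run_time_seconds):
    """Items per second over the whole run (integer division)."""
    if run_time_seconds <= 0:
        return 0
    return checked // run_time_seconds


def interval_speed(checked_before, checked_after, seconds_between):
    """Items per second between two samples; zero when nothing new was checked."""
    if seconds_between <= 0:
        return 0
    return max(checked_after - checked_before, 0) // seconds_between


# Figures from the description: 3347202 items checked in 8714 seconds.
print(average_speed(3347202, 8714))            # 384

# If the scan stays frozen for another hour, the average decays...
print(average_speed(3347202, 8714 + 3600))     # 271

# ...and an interval-based rate over that hour would be zero.
print(interval_speed(3347202, 3347202, 3600))  # 0
```

Under these assumptions a stalled scan should show a falling average and a real-time rate near zero, so a real_time_speed_phase1 that keeps growing is inconsistent with the frozen counters.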

            jlevi Jodi Levi (Inactive) added a comment:

            Fan Yong,
            Could you take a look at this one?
            Thank you!

            People

              yong.fan nasf (Inactive)
              jamesanunez James Nunez (Inactive)