[LU-5885] LFSCK 3: ‘lctl lfsck_start -t namespace’ Not Progressing Under Remove Workload Created: 07/Nov/14  Updated: 23/Dec/15  Resolved: 10/Dec/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Blocker
Reporter: James Nunez (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: MB, lfsck
Environment:

OpenSFS cluster with two MDSs with one MDT each, three OSSs and three clients. Lustre tag 2.6.54 build 2725


Issue Links:
Related
is related to LU-5774 lu_object_find_at() should NOT test c... Resolved
Severity: 3
Rank (Obsolete): 16456

 Description   

While running the LFSCK Phase 3 test plan, I created 10,000 objects (files, remote directories, local directories, links), then ran:

# lctl lfsck_start -A -M scratch-MDT0000 -r -t namespace -c -C
Started LFSCK on the device scratch-MDT0000: scrub namespace

On the client, I then deleted all files and directories in the file system. At some point LFSCK hung, and ‘lctl lfsck_stop’ does not stop it; the stop command itself appears to hang. LFSCK progresses to a certain point and then stalls: the time counters keep advancing, but none of the other counters increase, and the status stays at “scanning-phase1”.

# cat /proc/fs/lustre/mdd/scratch-MDT0000/lfsck_namespace 
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: scanning-phase1
flags:
param: all_targets,create_ostobj,
time_since_last_completed: 59865 seconds
time_since_latest_start: 8714 seconds
time_since_last_checkpoint: N/A
latest_start_position: 77, N/A, N/A
last_checkpoint_position: N/A, N/A, N/A
first_failure_position: N/A, N/A, N/A
checked_phase1: 3347202
checked_phase2: 0
updated_phase1: 0
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
directories: 182634
dirent_repaired: 0
linkea_repaired: 0
nlinks_repaired: 0
multiple_linked_checked: 0
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 0
dangling_repaired: 0
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 0
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 1560
striped_shards_repaired: 0
striped_shards_failed: 0
striped_shards_skipped: 0
name_hash_repaired: 0
success_count: 23
run_time_phase1: 8714 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 384 items/sec
average_speed_phase2: N/A
real_time_speed_phase1: 384 items/sec
real_time_speed_phase2: N/A
current_position: 180358673, N/A, N/A
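The stall described above (run time advancing while checked_phase1 stays flat) can be checked mechanically by comparing two samples of the lfsck_namespace output. The following is a minimal sketch, not part of LFSCK or lctl; the helper names are made up for illustration, and the second sample is hypothetical rather than captured from the cluster.

```python
def parse_lfsck_status(text):
    """Parse 'key: value' lines of an lfsck_namespace dump into a dict."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def leading_int(value):
    """Return the leading integer of a value such as '8714 seconds'."""
    return int(value.split()[0])


def looks_stalled(earlier, later):
    """True if phase 1 run time advanced but checked_phase1 did not."""
    return (
        later["status"] == "scanning-phase1"
        and leading_int(later["run_time_phase1"])
        > leading_int(earlier["run_time_phase1"])
        and leading_int(later["checked_phase1"])
        == leading_int(earlier["checked_phase1"])
    )


# First sample: three fields taken from the dump above.
first = parse_lfsck_status("""status: scanning-phase1
checked_phase1: 3347202
run_time_phase1: 8714 seconds
""")
# Hypothetical second sample: only the time counter has moved.
second = dict(first, run_time_phase1="8774 seconds")
print(looks_stalled(first, second))
```

In practice the two samples would come from reading /proc/fs/lustre/mdd/scratch-MDT0000/lfsck_namespace twice, some seconds apart.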

On the MDT with index 0, dmesg contains:

INFO: task lfsck_namespace:1210 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.29.2.el6_lustre.g8fab48a.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lfsck_namespa D 0000000000000001     0  1210      2 0x00000080
 ffff880485cfbac0 0000000000000046 0000000000000000 ffff88050b8c13e0
 ffff88050b8c13e0 ffff881023077000 ffff880485cfbac0 ffffffffa06d4e39
 ffff88047443c638 ffff880485cfbfd8 000000000000fbc8 ffff88047443c638
Call Trace:
 [<ffffffffa06d4e39>] ? lu_object_find_try+0x99/0x2b0 [obdclass]
 [<ffffffffa06d5085>] lu_object_find_at+0x35/0x100 [obdclass]
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffffa04f14b3>] ? ldiskfs_mark_inode_dirty+0x83/0x1f0 [ldiskfs]
 [<ffffffffa06d518f>] lu_object_find_slice+0x1f/0x80 [obdclass]
 [<ffffffffa0f8f958>] lfsck_namespace_handle_striped_master+0x118/0xb10 [lfsck]
 [<ffffffffa0b5de4c>] ? fld_local_lookup+0x6c/0x290 [fld]
 [<ffffffffa0f5d23f>] lfsck_namespace_assistant_handler_p1+0x5bf/0x1f40 [lfsck]
 [<ffffffffa06d3743>] ? lu_object_free+0x113/0x1a0 [obdclass]
 [<ffffffffa057b482>] ? cfs_hash_bd_from_key+0x42/0xd0 [libcfs]
 [<ffffffff81283a85>] ? _atomic_dec_and_lock+0x55/0x80
 [<ffffffffa0f4d197>] lfsck_assistant_engine+0x497/0x1c50 [lfsck]
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffffa0f4cd00>] ? lfsck_assistant_engine+0x0/0x1c50 [lfsck]
 [<ffffffff8109abf6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

Similar stack traces appear on the second MDS/MDT, which is also stuck in “scanning-phase1”.
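In a Linux kernel call trace, frames prefixed with `?` are addresses found on the stack that the unwinder could not confirm as part of the call chain, so the blocked path is easier to read with those frames filtered out. The sketch below does that for an excerpt of the trace above; the regular expression and function name are illustrative assumptions, not an existing tool.

```python
import re

# Matches lines such as:
#  [<ffffffffa06d5085>] lu_object_find_at+0x35/0x100 [obdclass]
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+(\?\s+)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def reliable_frames(trace):
    """Return function names of frames not marked with '?', innermost first."""
    names = []
    for line in trace.splitlines():
        match = FRAME.search(line)
        if match and match.group(1) is None:
            names.append(match.group(2))
    return names


# Excerpt of the call trace from the description.
trace = """\
 [<ffffffffa06d4e39>] ? lu_object_find_try+0x99/0x2b0 [obdclass]
 [<ffffffffa06d5085>] lu_object_find_at+0x35/0x100 [obdclass]
 [<ffffffffa06d518f>] lu_object_find_slice+0x1f/0x80 [obdclass]
 [<ffffffffa0f8f958>] lfsck_namespace_handle_striped_master+0x118/0xb10 [lfsck]
 [<ffffffffa0f5d23f>] lfsck_namespace_assistant_handler_p1+0x5bf/0x1f40 [lfsck]
 [<ffffffffa0f4d197>] lfsck_assistant_engine+0x497/0x1c50 [lfsck]
"""
print(" <- ".join(reliable_frames(trace)))
```

Read this way, the LFSCK assistant thread is blocked in lu_object_find_at(), reached from lfsck_namespace_handle_striped_master().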



 Comments   
Comment by Jodi Levi (Inactive) [ 10/Nov/14 ]

Fan Yong,
Could you take a look at this one?
Thank you!

Comment by James Nunez (Inactive) [ 13/Nov/14 ]

I ran this test again with lustre-master tag 2.6.90 build #2734 and was able to reproduce this issue very quickly. I used a workload similar to the one described above: I ran test 3.3.3, creating about 130 directories with 10,000 objects each, then ran the same workload in a different directory, started LFSCK on both MDSs, and then went back and removed the directories/objects created by test 3.3.3.

I captured kernel logs on both the MDSs. They are at uploads/LU-5885/lfsck_log_1.txt (MDS0) and lfsck_log_2.txt (MDS1)

When looking at lfsck_namespace, there might be something wrong with the real-time counter that calculates the rate of scanning objects: real_time_speed_phase1 never decreases, although average_speed_phase1 does. In this case, where LFSCK seems to hang (meaning it is no longer scanning objects), I’d expect real_time_speed_phase1 to decrease, but it just keeps growing:

real_time_speed_phase1: 21441823787665 items/sec
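To make the expectation concrete, here is a rough sketch of the two rates as one might compute them from the counters. It is not taken from the LFSCK source, and how LFSCK actually derives real_time_speed_phase1 is not stated in this ticket; the function names and the 60-second window are assumptions for illustration.

```python
def average_speed(checked, run_time_seconds):
    """Items per second over the whole run (integer division)."""
    return checked // run_time_seconds if run_time_seconds > 0 else 0


def window_speed(checked_then, checked_now, seconds_between):
    """Items per second over the most recent sampling window."""
    if seconds_between <= 0:
        return 0
    return max(checked_now - checked_then, 0) // seconds_between


# Values from the lfsck_namespace dump in the description.
print(average_speed(3347202, 8714))        # 384
# Hypothetical 60-second window in which nothing new was checked.
print(window_speed(3347202, 3347202, 60))  # 0
```

The first result matches the average_speed_phase1 of 384 items/sec reported in the description. A rate measured over a window with no newly checked items would be 0, which is why a value such as 21441823787665 items/sec looks like a calculation problem rather than real progress.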

Comment by Gerrit Updater [ 16/Nov/14 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/12741
Subject: LU-5885 lfsck: deadlock when remove striped dir
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4ab1b1b15835879a145002221bb4cc492e57c791

Comment by nasf (Inactive) [ 16/Nov/14 ]

James, would you please verify the patch http://review.whamcloud.com/#/c/12741/ ? Thanks!

Comment by James Nunez (Inactive) [ 19/Nov/14 ]

With your patch, http://review.whamcloud.com/#/c/12741/ , I can run the remove workload and create files/directories/etc. and LFSCK does not hang. I've tried this four times and cannot get LFSCK to hang. So, this patch fixed the LFSCK hang problem.

Comment by nasf (Inactive) [ 19/Nov/14 ]

Thanks James for the verification!

Comment by Gerrit Updater [ 10/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12741/
Subject: LU-5885 lfsck: deadlock when remove striped dir
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f0137d89fd40ae66aa1d3a180e4e5a6240009dcc

Comment by nasf (Inactive) [ 10/Dec/14 ]

The patch has been landed to master.

Generated at Sat Feb 10 01:55:21 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.