[LU-5885] LFSCK 3: ‘lctl lfsck_start -t namespace’ Not Progressing Under Remove Workload Created: 07/Nov/14 Updated: 23/Dec/15 Resolved: 10/Dec/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | James Nunez (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | MB, lfsck | ||
| Environment: |
OpenSFS cluster with two MDSs with one MDT each, three OSSs and three clients. Lustre tag 2.6.54 build 2725 |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 16456 | ||||||||
| Description |
|
While running the LFSCK Phase 3 test plan, I created 10,000 objects (files, remote directories, local directories, and links) and then ran:

# lctl lfsck_start -A -M scratch-MDT0000 -r -t namespace -c -C
Started LFSCK on the device scratch-MDT0000: scrub namespace

On the client, I then deleted all files and directories in the file system. At some point LFSCK hung; ‘lctl lfsck_stop’ does not stop LFSCK and itself appears to hang. LFSCK progresses to a certain point and then stalls: the time counters keep advancing, but none of the other counters increase and we are stuck in “scanning-phase1”.

# cat /proc/fs/lustre/mdd/scratch-MDT0000/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: scanning-phase1
flags:
param: all_targets,create_ostobj,
time_since_last_completed: 59865 seconds
time_since_latest_start: 8714 seconds
time_since_last_checkpoint: N/A
latest_start_position: 77, N/A, N/A
last_checkpoint_position: N/A, N/A, N/A
first_failure_position: N/A, N/A, N/A
checked_phase1: 3347202
checked_phase2: 0
updated_phase1: 0
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
directories: 182634
dirent_repaired: 0
linkea_repaired: 0
nlinks_repaired: 0
multiple_linked_checked: 0
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 0
dangling_repaired: 0
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 0
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 1560
striped_shards_repaired: 0
striped_shards_failed: 0
striped_shards_skipped: 0
name_hash_repaired: 0
success_count: 23
run_time_phase1: 8714 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 384 items/sec
average_speed_phase2: N/A
real_time_speed_phase1: 384 items/sec
real_time_speed_phase2: N/A
current_position: 180358673, N/A, N/A

On the MDT with index 0, dmesg contains:

INFO: task lfsck_namespace:1210 blocked for more than 120 seconds.
Not tainted 2.6.32-431.29.2.el6_lustre.g8fab48a.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lfsck_namespa D 0000000000000001 0 1210 2 0x00000080
ffff880485cfbac0 0000000000000046 0000000000000000 ffff88050b8c13e0
ffff88050b8c13e0 ffff881023077000 ffff880485cfbac0 ffffffffa06d4e39
ffff88047443c638 ffff880485cfbfd8 000000000000fbc8 ffff88047443c638
Call Trace:
[<ffffffffa06d4e39>] ? lu_object_find_try+0x99/0x2b0 [obdclass]
[<ffffffffa06d5085>] lu_object_find_at+0x35/0x100 [obdclass]
[<ffffffff81061d00>] ? default_wake_function+0x0/0x20
[<ffffffffa04f14b3>] ? ldiskfs_mark_inode_dirty+0x83/0x1f0 [ldiskfs]
[<ffffffffa06d518f>] lu_object_find_slice+0x1f/0x80 [obdclass]
[<ffffffffa0f8f958>] lfsck_namespace_handle_striped_master+0x118/0xb10 [lfsck]
[<ffffffffa0b5de4c>] ? fld_local_lookup+0x6c/0x290 [fld]
[<ffffffffa0f5d23f>] lfsck_namespace_assistant_handler_p1+0x5bf/0x1f40 [lfsck]
[<ffffffffa06d3743>] ? lu_object_free+0x113/0x1a0 [obdclass]
[<ffffffffa057b482>] ? cfs_hash_bd_from_key+0x42/0xd0 [libcfs]
[<ffffffff81283a85>] ? _atomic_dec_and_lock+0x55/0x80
[<ffffffffa0f4d197>] lfsck_assistant_engine+0x497/0x1c50 [lfsck]
[<ffffffff81061d00>] ? default_wake_function+0x0/0x20
[<ffffffffa0f4cd00>] ? lfsck_assistant_engine+0x0/0x1c50 [lfsck]
[<ffffffff8109abf6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffff8109ab60>] ? kthread+0x0/0xa0
[<ffffffff8100c200>] ? child_rip+0x0/0x20
Similar stack traces can be found on the second MDS/MDT, which is also stuck in “scanning-phase1”. |
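The stall described above (run time advancing while checked_phase1 and current_position stay fixed) can be checked mechanically by sampling the lfsck_namespace file twice. The following is a minimal Python sketch, assuming the `key: value` layout shown in the output above. The helper names `parse_lfsck_status`, `leading_int`, and `looks_stalled` are hypothetical and are not part of Lustre.

```python
def parse_lfsck_status(text):
    """Parse 'key: value' lines from lfsck_namespace output into a dict."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def leading_int(value):
    """Return the leading integer of a field such as '8714 seconds'."""
    return int(value.split()[0].rstrip(","))


def looks_stalled(before, after):
    """True if phase-1 run time advanced but no new objects were checked."""
    time_advanced = (leading_int(after["run_time_phase1"])
                     > leading_int(before["run_time_phase1"]))
    no_progress = (
        leading_int(after["checked_phase1"])
        == leading_int(before["checked_phase1"])
        and after["current_position"] == before["current_position"]
    )
    return (after["status"] == "scanning-phase1"
            and time_advanced and no_progress)
```

In practice the two samples would be the contents of /proc/fs/lustre/mdd/scratch-MDT0000/lfsck_namespace read some seconds apart on the MDS.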
| Comments |
| Comment by Jodi Levi (Inactive) [ 10/Nov/14 ] |
|
Fan Yong, |
| Comment by James Nunez (Inactive) [ 13/Nov/14 ] |
|
I ran this test again with lustre-master tag 2.6.90 build #2734 and was able to reproduce this issue very quickly. I used a workload similar to the one described above: I ran test 3.3.3, creating about 130 directories with 10,000 objects each, then ran the same workload in a different directory, started LFSCK on both MDSs, and then went back and removed the directories/objects created by test 3.3.3. I captured kernel logs on both MDSs. They are at uploads/

Looking at lfsck_namespace, there might be something wrong with the real-time timers that calculate the object scanning rate: real_time_speed_phase1 never decreases, but average_speed_phase1 does. In this case, where LFSCK seems to hang (meaning it is no longer scanning objects), I’d expect real_time_speed to decrease, but it just keeps growing:

real_time_speed_phase1: 21441823787665 items/sec |
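For reference, the two rates can be recomputed from the counters. The sketch below assumes that the average speed is the total checked count divided by the phase-1 run time, and that a real-time speed is the change in the checked count over a sampling interval. Both formulas and both function names are assumptions for illustration, not taken from the LFSCK source. With the values from the description (3347202 objects in 8714 seconds), the assumed average formula reproduces the reported 384 items/sec.

```python
def average_speed(checked, run_time_secs):
    """Items per second over the whole phase; 0 if no time has elapsed."""
    return checked // run_time_secs if run_time_secs > 0 else 0


def interval_speed(checked_before, checked_after, interval_secs):
    """Items per second over one sampling interval (assumed definition)."""
    if interval_secs <= 0:
        raise ValueError("interval must be positive")
    return (checked_after - checked_before) // interval_secs


# Values from the description: 3347202 objects checked in 8714 seconds.
print(average_speed(3347202, 8714))           # 384
# A hung scan: the checked count does not change over a 120-second interval.
print(interval_speed(3347202, 3347202, 120))  # 0
```

Under these assumptions a hung scan yields an interval speed of 0, which matches the expectation stated in the comment; a value on the order of 21441823787665 items/sec could not come from counters of this size.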
| Comment by Gerrit Updater [ 16/Nov/14 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/12741 |
| Comment by nasf (Inactive) [ 16/Nov/14 ] |
|
James, would you please verify the patch http://review.whamcloud.com/#/c/12741/ ? Thanks! |
| Comment by James Nunez (Inactive) [ 19/Nov/14 ] |
|
With your patch, http://review.whamcloud.com/#/c/12741/ , I can run the remove workload and create files/directories/etc. and LFSCK does not hang. I've tried this four times and cannot get LFSCK to hang. So, this patch fixed the LFSCK hang problem. |
| Comment by nasf (Inactive) [ 19/Nov/14 ] |
|
Thanks James for the verification! |
| Comment by Gerrit Updater [ 10/Dec/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12741/ |
| Comment by nasf (Inactive) [ 10/Dec/14 ] |
|
The patch has landed on master. |