Details
- Bug
- Resolution: Fixed
- Critical
- Lustre 2.7.0
- OpenSFS cluster with 2 MDSs (1 MDT each), 3 OSSs (2 OSTs each), and 3 clients running Lustre (master) build #2812
- 3
- 17016
Description
While running the performance test (3.3.2) from the LFSCK Phase 3 test plan, I’ve run into a problem. I installed Lustre on all nodes and created a new file system. I don’t know if this matters, but the file system is empty when I run this test. I then run test 3.3.2 from the test plan: set fail_loc on the MDSs, create a small number of objects, and run LFSCK. For some reason, LFSCK shows the “inconsistent” flag on both MDSs.
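For reference, the steps boil down to roughly the following. This is only a sketch, not the exact test script: the mount point and file count are illustrative, and the fail_loc value is taken from the dmesg output below (assumed to be hex, as cfs_fail_loc normally prints):

# on each MDS, inject the fault seen in dmesg below
mds# lctl set_param fail_loc=0x1603
# on a client, create a small number of objects in the empty file system (mount point assumed)
client# mkdir -p /mnt/scratch/d0 && for i in $(seq 200); do touch /mnt/scratch/d0/f.$i; done
# start namespace LFSCK on all targets; -c/-C match the create_ostobj/create_mdtobj params shown below
mds01# lctl lfsck_start -A -M scratch-MDT0000 -c -C --type namespace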
On mds01:
# cat /proc/fs/lustre/mdd/scratch-MDT0000/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: scanning-phase1
flags: inconsistent
param: all_targets,create_ostobj,create_mdtobj
time_since_last_completed: 1155 seconds
time_since_latest_start: 27 seconds
time_since_last_checkpoint: N/A
latest_start_position: 77, N/A, N/A
last_checkpoint_position: N/A, N/A, N/A
first_failure_position: N/A, N/A, N/A
checked_phase1: 461
checked_phase2: 0
updated_phase1: 208
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
directories: 27
dirent_repaired: 0
linkea_repaired: 208
nlinks_repaired: 0
multiple_linked_checked: 187
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 0
dangling_repaired: 0
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 0
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 2
striped_shards_repaired: 0
striped_shards_failed: 0
striped_shards_skipped: 0
name_hash_repaired: 0
success_count: 1
run_time_phase1: 28 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 16 items/sec
average_speed_phase2: N/A
real_time_speed_phase1: 16 items/sec
real_time_speed_phase2: N/A
current_position: 238551041, N/A, N/A
There is nothing in dmesg except confirmation that fail_loc was set on the MDTs:
# dmesg
Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x0000000380000400-0x00000003c0000400):0:mdt
Lustre: *** cfs_fail_loc=1603, val=0***
Lustre: *** cfs_fail_loc=1603, val=0***
Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x00000003c0000400-0x0000000400000400):1:mdt
Lustre: *** cfs_fail_loc=1603, val=0***
Lustre: Skipped 101 previous similar messages
Lustre: *** cfs_fail_loc=1603, val=0***
Lustre: Skipped 3 previous similar messages
Lustre: *** cfs_fail_loc=1603, val=0***
Lustre: Skipped 100 previous similar messages
Lustre: *** cfs_fail_loc=1603, val=0***
Lustre: *** cfs_fail_loc=1603, val=0***
Lustre: Skipped 101 previous similar messages
Lustre: *** cfs_fail_loc=1603, val=0***
Lustre: Skipped 3 previous similar messages
If I wait a few minutes, the following is in dmesg:
INFO: task lfsck_namespace:469 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lfsck_namespa D 0000000000000006 0 469 2 0x00000080
 ffff880478e87ba0 0000000000000046 0000000000000000 ffff880a9849b7e0
 ffff880a9849b7e0 ffff880d7d69f000 ffff880478e87ba0 ffffffffa06d57e9
 ffff880478e66638 ffff880478e87fd8 000000000000fbc8 ffff880478e66638
Call Trace:
 [<ffffffffa06d57e9>] ? lu_object_find_try+0x99/0x2b0 [obdclass]
 [<ffffffffa06d5a3d>] lu_object_find_at+0x3d/0xe0 [obdclass]
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffffa06d5b1f>] lu_object_find_slice+0x1f/0x80 [obdclass]
 [<ffffffffa0f33619>] lfsck_namespace_assistant_handler_p1+0x2d9/0x1f70 [lfsck]
 [<ffffffffa057d472>] ? cfs_hash_bd_from_key+0x42/0xd0 [libcfs]
 [<ffffffffa0f23347>] lfsck_assistant_engine+0x497/0x1e00 [lfsck]
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffffa0f22eb0>] ? lfsck_assistant_engine+0x0/0x1e00 [lfsck]
 [<ffffffff8109abf6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
INFO: task lfsck_namespace:469 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lfsck_namespa D 0000000000000006 0 469 2 0x00000080
 ffff880478e87ba0 0000000000000046 0000000000000000 ffff880a9849b7e0
 ffff880a9849b7e0 ffff880d7d69f000 ffff880478e87ba0 ffffffffa06d57e9
 ffff880478e66638 ffff880478e87fd8 000000000000fbc8 ffff880478e66638
Call Trace:
 [<ffffffffa06d57e9>] ? lu_object_find_try+0x99/0x2b0 [obdclass]
 [<ffffffffa06d5a3d>] lu_object_find_at+0x3d/0xe0 [obdclass]
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffffa06d5b1f>] lu_object_find_slice+0x1f/0x80 [obdclass]
 [<ffffffffa0f33619>] lfsck_namespace_assistant_handler_p1+0x2d9/0x1f70 [lfsck]
 [<ffffffffa057d472>] ? cfs_hash_bd_from_key+0x42/0xd0 [libcfs]
 [<ffffffffa0f23347>] lfsck_assistant_engine+0x497/0x1e00 [lfsck]
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffffa0f22eb0>] ? lfsck_assistant_engine+0x0/0x1e00 [lfsck]
 [<ffffffff8109abf6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
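When the thread is stuck like this, its state and kernel stack can be double-checked directly. This is only illustrative: pid 469 is the one shown in the trace above, and /proc/<pid>/stack is available only if the kernel exposes it:

mds01# ps -eo pid,stat,wchan:30,comm | grep lfsck
mds01# cat /proc/469/stack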
On mds02, the state of LFSCK is the following:
# cat /proc/fs/lustre/mdd/scratch-MDT0001/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: scanning-phase2
flags: scanned-once,inconsistent
param: all_targets,create_ostobj,create_mdtobj
time_since_last_completed: 2533 seconds
time_since_latest_start: 1405 seconds
time_since_last_checkpoint: 1405 seconds
latest_start_position: 77, N/A, N/A
last_checkpoint_position: 260571137, [0x3c0000401:0x4:0x0], 0x5d2c710b01284458
first_failure_position: N/A, N/A, N/A
checked_phase1: 16
checked_phase2: 0
updated_phase1: 0
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
directories: 6
dirent_repaired: 0
linkea_repaired: 4
nlinks_repaired: 0
multiple_linked_checked: 0
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 0
dangling_repaired: 0
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 0
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 2
striped_shards_repaired: 0
striped_shards_failed: 0
striped_shards_skipped: 0
name_hash_repaired: 0
success_count: 1
run_time_phase1: 0 seconds
run_time_phase2: 1405 seconds
average_speed_phase1: 16 items/sec
average_speed_phase2: 0 objs/sec
real_time_speed_phase1: N/A
real_time_speed_phase2: 0 objs/sec
current_position: [0x0:0x0:0x0]
I can neither stop LFSCK nor start it again, so I’m not sure what state it is in.
Trying to start LFSCK again fails:
# lctl lfsck_start -A -M scratch-MDT0000 -c -C --type namespace
Fail to start LFSCK: Operation already in progress
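The exact stop invocation is not shown above; with the standard syntax, per-MDT stops would presumably look something like:

mds02# lctl lfsck_stop -M scratch-MDT0001
mds01# lctl lfsck_stop -M scratch-MDT0000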
Stopping LFSCK works on mds02:
# cat /proc/fs/lustre/mdd/scratch-MDT0001/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: stopped
flags: scanned-once,inconsistent
param: all_targets,create_ostobj,create_mdtobj
time_since_last_completed: 2890 seconds
time_since_latest_start: 1762 seconds
time_since_last_checkpoint: 10 seconds
latest_start_position: 77, N/A, N/A
last_checkpoint_position: 260571137, [0x3c0000401:0x4:0x0], 0x5d2c710b01284458
first_failure_position: N/A, N/A, N/A
checked_phase1: 16
checked_phase2: 0
updated_phase1: 0
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
directories: 6
dirent_repaired: 0
linkea_repaired: 4
nlinks_repaired: 0
multiple_linked_checked: 0
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 0
dangling_repaired: 0
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 0
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 2
striped_shards_repaired: 0
striped_shards_failed: 0
striped_shards_skipped: 0
name_hash_repaired: 0
success_count: 1
run_time_phase1: 0 seconds
run_time_phase2: 1753 seconds
average_speed_phase1: 16 items/sec
average_speed_phase2: 0 objs/sec
real_time_speed_phase1: N/A
real_time_speed_phase2: N/A
current_position: N/A
but stopping LFSCK does not work on mds01:
# cat /proc/fs/lustre/mdd/scratch-MDT0000/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: scanning-phase1
flags: inconsistent
param: all_targets,create_ostobj,create_mdtobj
time_since_last_completed: 2953 seconds
time_since_latest_start: 1825 seconds
time_since_last_checkpoint: N/A
latest_start_position: 77, N/A, N/A
last_checkpoint_position: N/A, N/A, N/A
first_failure_position: N/A, N/A, N/A
checked_phase1: 461
checked_phase2: 0
updated_phase1: 208
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
directories: 27
dirent_repaired: 0
linkea_repaired: 208
nlinks_repaired: 0
multiple_linked_checked: 187
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 0
dangling_repaired: 0
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 0
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 2
striped_shards_repaired: 0
striped_shards_failed: 0
striped_shards_skipped: 0
name_hash_repaired: 0
success_count: 1
run_time_phase1: 1826 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 0 items/sec
average_speed_phase2: N/A
real_time_speed_phase1: 0 items/sec
real_time_speed_phase2: N/A
current_position: 238551041, N/A, N/A
There was a memory leak in the former LFSCK run, so the subsequent LFSCK was blocked by some stale data. The root cause of the memory leak is that LFSCK wrongly used the OSD device to handle the OSP (remote MDT) inconsistency. Such issues have been fixed by the
LU-5791 patch: http://review.whamcloud.com/13392. So I have updated the 13511 patch (http://review.whamcloud.com/#/c/13511/) to be based on the 13392 patch.
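For anyone who wants to confirm the leak independently, the usual Lustre malloc-tracing approach should show the unreleased allocations. This is only a sketch; the leak_finder.pl path comes from the lustre-tests package and may differ on your nodes:

mds01# lctl set_param debug=+malloc
(reproduce the LFSCK run as above)
mds01# lctl dk /tmp/lustre-debug.log
mds01# perl /usr/lib64/lustre/tests/leak_finder.pl /tmp/lustre-debug.log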
James, would you please retry with the latest patch? Thanks!
BTW, I have verified with your method and can no longer reproduce the failure.
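With the patched build in place, a quick way to confirm the retest succeeds would be to repeat the steps from the description and check that both MDTs finish cleanly, e.g.:

mds01# grep -E '^(status|flags):' /proc/fs/lustre/mdd/scratch-MDT0000/lfsck_namespace
mds02# grep -E '^(status|flags):' /proc/fs/lustre/mdd/scratch-MDT0001/lfsck_namespace

Once phase 2 finishes, both should report status: completed and should no longer show the inconsistent flag.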