Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.10.4
-
None
-
x86_64, zfs 0.7.9, centos7.4, 2.10.4 plus the "lfsck: load object attr when prepare LFSCK request" patch from LU-10988
-
3
-
9223372036854775807
Description
Hi,
I hit this just now when doing a lfsck for LU-10677.
lctl lfsck_start -M dagg-MDT0000 -t namespace -A -r -C
2018-07-02 19:00:24 [699080.215320] Lustre: debug daemon will attempt to start writing to /mnt/root/root/lfsck.debug_daemon.dagg-MDTs.20180702.1 (20480000kB max)
2018-07-02 19:51:27 [702136.419050] LustreError: 199045:0:(mdd_orphans.c:205:orph_index_insert()) ASSERTION( !(obj->mod_flags & ORPHAN_OBJ) ) failed:
2018-07-02 19:51:27 [702136.432255] LustreError: 199045:0:(mdd_orphans.c:205:orph_index_insert()) LBUG
2018-07-02 19:51:27 [702136.440349] Pid: 199045, comm: mdt01_013
2018-07-02 19:51:27 [702136.445132]
2018-07-02 19:51:27 [702136.445132] Call Trace:
2018-07-02 19:51:27 [702136.450734] [<ffffffffc05267ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
2018-07-02 19:51:27 [702136.458083] [<ffffffffc052683c>] lbug_with_loc+0x4c/0xb0 [libcfs]
2018-07-02 19:51:27 [702136.465078] [<ffffffffc11e3827>] __mdd_orphan_add+0x8e7/0xa20 [mdd]
2018-07-02 19:51:27 [702136.472242] [<ffffffffc09f720c>] ? osd_write_locked+0x3c/0x60 [osd_zfs]
2018-07-02 19:51:27 [702136.479747] [<ffffffffc11d33ce>] mdd_finish_unlink+0x17e/0x410 [mdd]
2018-07-02 19:51:27 [702136.486988] [<ffffffffc11d56e4>] mdd_unlink+0xae4/0xbf0 [mdd]
2018-07-02 19:51:27 [702136.493616] [<ffffffffc10bc728>] mdt_reint_unlink+0xc28/0x11d0 [mdt]
2018-07-02 19:51:27 [702136.500846] [<ffffffffc10bfb33>] mdt_reint_rec+0x83/0x210 [mdt]
2018-07-02 19:51:27 [702136.507628] [<ffffffffc10a137b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
2018-07-02 19:51:27 [702136.514928] [<ffffffffc10acf07>] mdt_reint+0x67/0x140 [mdt]
2018-07-02 19:51:28 [702136.521402] [<ffffffffc0e752ba>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
2018-07-02 19:51:28 [702136.529074] [<ffffffffc0e1de2b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
2018-07-02 19:51:28 [702136.537493] [<ffffffffc0e1a458>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
2018-07-02 19:51:28 [702136.545015] [<ffffffff810c7c82>] ? default_wake_function+0x12/0x20
2018-07-02 19:51:28 [702136.552008] [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
2018-07-02 19:51:28 [702136.558588] [<ffffffffc0e21572>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
2018-07-02 19:51:28 [702136.565605] [<ffffffffc0e20ae0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
2018-07-02 19:51:28 [702136.572580] [<ffffffff810b4031>] kthread+0xd1/0xe0
2018-07-02 19:51:28 [702136.578163] [<ffffffff810b3f60>] ? kthread+0x0/0xe0
2018-07-02 19:51:28 [702136.583819] [<ffffffff816c055d>] ret_from_fork+0x5d/0xb0
2018-07-02 19:51:28 [702136.589900] [<ffffffff810b3f60>] ? kthread+0x0/0xe0
2018-07-02 19:51:28 [702136.595542]
2018-07-02 19:51:28 [702136.597694] Kernel panic - not syncing: LBUG
2018-07-02 19:51:28 [702136.602600] CPU: 9 PID: 199045 Comm: mdt01_013 Tainted: P OE ------------ 3.10.0-693.21.1.el7.x86_64 #1
2018-07-02 19:51:28 [702136.613744] Hardware name: Dell Inc. PowerEdge R740/0JM3W2, BIOS 1.3.7 02/08/2018
2018-07-02 19:51:28 [702136.621857] Call Trace:
2018-07-02 19:51:28 [702136.624946] [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
2018-07-02 19:51:28 [702136.630704] [<ffffffff816a8634>] panic+0xe8/0x21f
2018-07-02 19:51:28 [702136.636117] [<ffffffffc0526854>] lbug_with_loc+0x64/0xb0 [libcfs]
2018-07-02 19:51:28 [702136.642927] [<ffffffffc11e3827>] __mdd_orphan_add+0x8e7/0xa20 [mdd]
2018-07-02 19:51:28 [702136.649895] [<ffffffffc09f720c>] ? osd_write_locked+0x3c/0x60 [osd_zfs]
2018-07-02 19:51:28 [702136.657201] [<ffffffffc11d33ce>] mdd_finish_unlink+0x17e/0x410 [mdd]
2018-07-02 19:51:28 [702136.664232] [<ffffffffc11d56e4>] mdd_unlink+0xae4/0xbf0 [mdd]
2018-07-02 19:51:28 [702136.670661] [<ffffffffc10bc728>] mdt_reint_unlink+0xc28/0x11d0 [mdt]
2018-07-02 19:51:28 [702136.677681] [<ffffffffc10bfb33>] mdt_reint_rec+0x83/0x210 [mdt]
2018-07-02 19:51:28 [702136.684247] [<ffffffffc10a137b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
2018-07-02 19:51:28 [702136.691331] [<ffffffffc10acf07>] mdt_reint+0x67/0x140 [mdt]
2018-07-02 19:51:28 [702136.697583] [<ffffffffc0e752ba>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
2018-07-02 19:51:28 [702136.705025] [<ffffffffc0e1de2b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
2018-07-02 19:51:28 [702136.713235] [<ffffffffc0e1a458>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
2018-07-02 19:51:28 [702136.720538] [<ffffffff810c7c82>] ? default_wake_function+0x12/0x20
2018-07-02 19:51:28 [702136.727314] [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
2018-07-02 19:51:28 [702136.733666] [<ffffffffc0e21572>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
2018-07-02 19:51:28 [702136.740438] [<ffffffffc0e20ae0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
2018-07-02 19:51:28 [702136.748310] [<ffffffff810b4031>] kthread+0xd1/0xe0
2018-07-02 19:51:28 [702136.753661] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
2018-07-02 19:51:28 [702136.760227] [<ffffffff816c055d>] ret_from_fork+0x5d/0xb0
2018-07-02 19:51:28 [702136.766108] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
2018-07-02 19:51:28 [702136.772696] Kernel Offset: disabled
The debug_daemon was running at the time, but the file is 6.1G and probably truncated because of the crash. Let me know if you want it anyway.
I've mounted the MDTs back up again with -o skip_lfsck.
I think the crash happened somewhere in phase2 of the namespace lfsck. The lfsck state after the reboot is below.
[warble2]root: lctl get_param -n mdd.dagg-MDT000*.lfsck_namespace
name: lfsck_namespace
magic: 0xa0621a0b
version: 2
status: crashed
flags: scanned-once,inconsistent
param: all_targets,create_mdtobj
last_completed_time: 1528222041
time_since_last_completed: 2305149 seconds
latest_start_time: 1530521735
time_since_latest_start: 5455 seconds
last_checkpoint_time: 1530525083
time_since_last_checkpoint: 2107 seconds
latest_start_position: 266, N/A, N/A
last_checkpoint_position: 35184372088832, N/A, N/A
first_failure_position: N/A, N/A, N/A
checked_phase1: 14945199
checked_phase2: 4683016
updated_phase1: 0
updated_phase2: 1
failed_phase1: 0
failed_phase2: 0
directories: 2205524
dirent_repaired: 0
linkea_repaired: 0
nlinks_repaired: 0
multiple_linked_checked: 324817
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 1
dangling_repaired: 92
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 0
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 1639566
striped_shards_repaired: 1
striped_shards_failed: 0
striped_shards_skipped: 3
name_hash_repaired: 0
linkea_overflow_cleared: 0
success_count: 9
run_time_phase1: 2357 seconds
run_time_phase2: 960 seconds
average_speed_phase1: 6340 items/sec
average_speed_phase2: 4878 objs/sec
average_speed_total: 5917 items/sec
real_time_speed_phase1: N/A
real_time_speed_phase2: N/A
current_position: N/A

name: lfsck_namespace
magic: 0xa0621a0b
version: 2
status: crashed
flags: scanned-once,inconsistent
param: all_targets,create_mdtobj
last_completed_time: 1528222004
time_since_last_completed: 2305186 seconds
latest_start_time: 1530521735
time_since_latest_start: 5455 seconds
last_checkpoint_time: 1530525083
time_since_last_checkpoint: 2107 seconds
latest_start_position: 266, N/A, N/A
last_checkpoint_position: 35184372088832, N/A, N/A
first_failure_position: N/A, N/A, N/A
checked_phase1: 14356291
checked_phase2: 4705808
updated_phase1: 0
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
directories: 2185653
dirent_repaired: 0
linkea_repaired: 0
nlinks_repaired: 0
multiple_linked_checked: 323999
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 0
dangling_repaired: 11
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 0
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 1639415
striped_shards_repaired: 0
striped_shards_failed: 0
striped_shards_skipped: 4
name_hash_repaired: 0
linkea_overflow_cleared: 0
success_count: 10
run_time_phase1: 2374 seconds
run_time_phase2: 960 seconds
average_speed_phase1: 6047 items/sec
average_speed_phase2: 4901 objs/sec
average_speed_total: 5717 items/sec
real_time_speed_phase1: N/A
real_time_speed_phase2: N/A
current_position: N/A

name: lfsck_namespace
magic: 0xa0621a0b
version: 2
status: crashed
flags: scanned-once,inconsistent
param: all_targets,create_mdtobj
last_completed_time: 1528221995
time_since_last_completed: 2305195 seconds
latest_start_time: 1530521735
time_since_latest_start: 5455 seconds
last_checkpoint_time: 1530525083
time_since_last_checkpoint: 2107 seconds
latest_start_position: 266, N/A, N/A
last_checkpoint_position: 35184372088832, N/A, N/A
first_failure_position: N/A, N/A, N/A
checked_phase1: 14368285
checked_phase2: 4739156
updated_phase1: 2
updated_phase2: 1
failed_phase1: 0
failed_phase2: 0
directories: 2187046
dirent_repaired: 0
linkea_repaired: 1
nlinks_repaired: 0
multiple_linked_checked: 324797
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 0
dangling_repaired: 88
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 1
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 1639666
striped_shards_repaired: 0
striped_shards_failed: 0
striped_shards_skipped: 4
name_hash_repaired: 0
linkea_overflow_cleared: 0
success_count: 10
run_time_phase1: 2387 seconds
run_time_phase2: 960 seconds
average_speed_phase1: 6019 items/sec
average_speed_phase2: 4936 objs/sec
average_speed_total: 5708 items/sec
real_time_speed_phase1: N/A
real_time_speed_phase2: N/A
current_position: N/A
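For anyone comparing the three MDT records above, a minimal Python sketch for splitting this kind of output into one dictionary per MDT might look like the following. It assumes the text has one `key: value` pair per line and that each record starts with a `name:` line; the function names `parse_lfsck_records` and `summarize` are hypothetical helpers, not part of Lustre, and the sample holds only a few fields copied from the first two records in this report.

```python
def parse_lfsck_records(text):
    """Split lfsck_namespace text into one dict per record.

    Assumption: one "key: value" pair per line, and every record
    begins with a "name:" line.
    """
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue  # skip blank lines and anything that is not key: value
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "name":
            records.append({})  # a "name" key opens a new record
        if records:
            records[-1][key] = value
    return records


def summarize(records):
    """Return (status, checked_phase2, dangling_repaired) per record."""
    return [
        (
            r.get("status"),
            int(r.get("checked_phase2", 0)),
            int(r.get("dangling_repaired", 0)),
        )
        for r in records
    ]


# A few fields from the first two records in this report.
SAMPLE = """\
name: lfsck_namespace
status: crashed
checked_phase2: 4683016
dangling_repaired: 92
name: lfsck_namespace
status: crashed
checked_phase2: 4705808
dangling_repaired: 11
"""

if __name__ == "__main__":
    for index, row in enumerate(summarize(parse_lfsck_records(SAMPLE))):
        print(index, row)
```

Values such as `2305149 seconds` or `266, N/A, N/A` are kept as plain strings; only the fields named in `summarize` are converted to integers.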
cheers,
robin
Issue Links
- is duplicated by
-
LU-12127 Improve LFSCK orphan handling
- Resolved
- is related to
-
LU-11239 sanity-lfsck test 36a fails with 'Fail to split mirror'
- Resolved
-
LU-11419 lfsck does not complete phase2
- Resolved
-
LU-11516 ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed: LBUG
- Resolved
- is related to
-
LU-10888 'lctl abort_recovery' allow aborting recovery between MDTs
- Resolved