[LU-16826] MDS nodes panicked running lfsck repair create lost objects: (osd_handler.c:6260:osd_index_declare_ea_insert()) ASSERTION( fid != ((void *)0) ) failed Created: 12/May/23 Updated: 23/Dec/23 Resolved: 20/Dec/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexander Zarochentsev | Assignee: | Alexander Zarochentsev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Full stack trace:
[81891.829222] LustreError: 2384348:0:(osd_handler.c:6260:osd_index_declare_ea_insert()) ASSERTION( fid != ((void *)0) ) failed:
[81891.842752] LustreError: 2384348:0:(osd_handler.c:6260:osd_index_declare_ea_insert()) LBUG
[81891.851987] Pid: 2384348, comm: lfsck_namespace 4.18.0-305.10.2.x6.4.010.32.x86_64 #1 SMP Thu Apr 27 19:48:12 MDT 2023
[81891.863654] Call Trace TBD:
[81891.867456] [<0>] libcfs_call_trace+0x6f/0x90 [libcfs]
[81891.873549] [<0>] lbug_with_loc+0x43/0x80 [libcfs]
[81891.879328] [<0>] osd_index_declare_ea_insert+0x3d4/0x480 [osd_ldiskfs]
[81891.886923] [<0>] lod_sub_declare_insert+0xef/0x240 [lod]
[81891.893314] [<0>] lfsck_namespace_repair_dangling+0xe75/0x1370 [lfsck]
[81891.900770] [<0>] lfsck_namespace_assistant_handler_p1+0x13b1/0x2020 [lfsck]
[81891.908732] [<0>] lfsck_assistant_engine+0x359/0x1c20 [lfsck]
[81891.915378] [<0>] kthread+0x116/0x130
[81891.919931] [<0>] ret_from_fork+0x1f/0x40
[81891.924807] Kernel panic - not syncing: LBUG
[81891.929939] CPU: 24 PID: 2384348 Comm: lfsck_namespace Kdump: loaded Tainted: G OE --------- - - 4.18.0-305.10.2.x6.4.010.32.x86_64 #1
[81891.944936] Hardware name: Viking Enterprise Solutions VSSEP1EC/VSSEP1EC, BIOS RWH3LJ-10.07.00 08/29/2022
[81891.955347] Call Trace:
[81891.958645] dump_stack+0x5c/0x80
[81891.962787] panic+0xe7/0x2a9
[81891.966564] ? ret_from_fork+0x1f/0x40
[81891.971112] lbug_with_loc.cold.10+0x18/0x18 [libcfs]
[81891.976956] osd_index_declare_ea_insert+0x3d4/0x480 [osd_ldiskfs]
[81891.983914] ? osd_index_declare_ea_delete+0x1cd/0x2f0 [osd_ldiskfs]
[81891.991040] lod_sub_declare_insert+0xef/0x240 [lod]
[81891.996762] lfsck_namespace_repair_dangling+0xe75/0x1370 [lfsck]
[81892.003700] ? dt_lookup_dir+0x80/0x190 [obdclass]
[81892.009229] lfsck_namespace_assistant_handler_p1+0x13b1/0x2020 [lfsck]
[81892.016561] ? __schedule+0x2cc/0x700
[81892.020938] lfsck_assistant_engine+0x359/0x1c20 [lfsck]
[81892.026945] ? __switch_to+0x10c/0x480
[81892.031371] ? __schedule+0x2cc/0x700
[81892.035689] ? finish_wait+0x80/0x80
[81892.039917] ? lfsck_master_engine+0xcd0/0xcd0 [lfsck]
[81892.045680] kthread+0x116/0x130
[81892.049530] ? kthread_flush_work_fn+0x10/0x10
[81892.054580] ret_from_fork+0x1f/0x40 |
| Comments |
| Comment by Alexander Zarochentsev [ 12/May/23 ] |
lfsck_namespace_repair_dangling(...):
        ...
        /* 7a. if child is remote, delete and insert to generate local agent */
        if (dt_object_remote(child)) {
                rc = dt_declare_delete(env, parent,
                                       (const struct dt_key *)lnr->lnr_name,
                                       th);
                if (rc)
                        GOTO(stop, rc);
===>            rc = dt_declare_insert(env, parent, (const struct dt_rec *)rec,
                                       (const struct dt_key *)lnr->lnr_name,
                                       th);
                if (rc)
                        GOTO(stop, rc);
        }
It looks like the 7a code path had never been exercised (or the crash had not been reported yet). It does not initialise rec->rec_fid before calling dt_declare_insert(), which causes an assertion failure in |
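The failure mode described in this comment can be illustrated with a small standalone model. This is only a sketch under stated assumptions: every `model_*` name below is hypothetical, the structs are simplified stand-ins rather than the real Lustre definitions, and the real declare step asserts (LBUG) where this model returns `-EINVAL` so the condition can be observed without aborting. It is not the content of the patches linked in this ticket.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a Lustre FID. */
struct model_fid {
	unsigned long long seq;
	unsigned int oid;
	unsigned int ver;
};

/* Hypothetical stand-in for the insert record passed to the declare step.
 * The field name rec_fid is an assumption made for this model. */
struct model_insert_rec {
	const struct model_fid *rec_fid;	/* FID the new entry points at */
	unsigned int rec_type;			/* file type bits of the entry */
};

/* Models the declare-insert step. The real code asserts that the FID
 * pointer is non-NULL; the model reports the same condition as -EINVAL. */
int model_declare_insert(const struct model_insert_rec *rec, const char *name)
{
	if (rec == NULL || name == NULL)
		return -EINVAL;
	if (rec->rec_fid == NULL)
		return -EINVAL;		/* the case that panicked the MDS */
	return 0;
}

/* Models the "delete and re-insert" path with the record filled in
 * before the insert is declared. */
int model_declare_reinsert(const struct model_fid *child_fid,
			   unsigned int type, const char *name)
{
	struct model_insert_rec rec = {
		.rec_fid = child_fid,
		.rec_type = type,
	};

	return model_declare_insert(&rec, name);
}

/* Usage: an uninitialised record is rejected, a filled-in one is accepted. */
void model_self_check(void)
{
	struct model_insert_rec empty = { 0 };
	struct model_fid fid = { 0x200000401ULL, 1, 0 };

	assert(model_declare_insert(&empty, "foo") == -EINVAL);
	assert(model_declare_reinsert(&fid, 0100000, "foo") == 0);
}
```

The point of the model is ordering: the record has to carry a valid FID pointer before the insert is declared, not only before the insert is executed.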
| Comment by Gerrit Updater [ 12/May/23 ] |
|
"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50980 |
| Comment by Alexander Zarochentsev [ 14/May/23 ] |
|
Reproducer:
MDSCOUNT=2 sh llmount.sh
../utils/lfs mkdir -i 0 /mnt/lustre/mdt0dir
../utils/lfs mkdir -i 1 /mnt/lustre/mdt1dir
touch /mnt/lustre/mdt0dir/foo
mv /mnt/lustre/mdt0dir/foo /mnt/lustre/mdt1dir/
FOOFID=$(../utils/lfs path2fid /mnt/lustre/mdt1dir/foo | sed -E 's/^.(.*).$/\1/')
echo $FOOFID
sync
umount /mnt/lustre-mds1
umount /mnt/lustre-mds2
echo "rm /REMOTE_PARENT_DIR/$FOOFID" | debugfs -w /dev/mapper/mds1_flakey
mount -t lustre /dev/mapper/mds1_flakey /mnt/lustre-mds1/
mount -t lustre /dev/mapper/mds2_flakey /mnt/lustre-mds2/
../utils/lctl lfsck_start -M lustre-MDT0000 -C
../utils/lctl lfsck_start -M lustre-MDT0001 -C |
| Comment by Gerrit Updater [ 15/May/23 ] |
|
"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50998 |
| Comment by Gerrit Updater [ 31/May/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50980/ |
| Comment by Gerrit Updater [ 20/Dec/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50998/ |
| Comment by Peter Jones [ 20/Dec/23 ] |
|
Landed for 2.16 |