[LU-16826] MDS nodes panicked running lfsck repair create lost objects: (osd_handler.c:6260:osd_index_declare_ea_insert()) ASSERTION( fid != ((void *)0) ) failed Created: 12/May/23  Updated: 23/Dec/23  Resolved: 20/Dec/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Major
Reporter: Alexander Zarochentsev Assignee: Alexander Zarochentsev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-17385 sanity-lfsck test_26a: only 3 of 4 MD... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Full stack trace is:

[81891.829222] LustreError: 2384348:0:(osd_handler.c:6260:osd_index_declare_ea_insert()) ASSERTION( fid != ((void *)0) ) failed:
[81891.842752] LustreError: 2384348:0:(osd_handler.c:6260:osd_index_declare_ea_insert()) LBUG
[81891.851987] Pid: 2384348, comm: lfsck_namespace 4.18.0-305.10.2.x6.4.010.32.x86_64 #1 SMP Thu Apr 27 19:48:12 MDT 2023
[81891.863654] Call Trace TBD:
[81891.867456] [<0>] libcfs_call_trace+0x6f/0x90 [libcfs]
[81891.873549] [<0>] lbug_with_loc+0x43/0x80 [libcfs]
[81891.879328] [<0>] osd_index_declare_ea_insert+0x3d4/0x480 [osd_ldiskfs]
[81891.886923] [<0>] lod_sub_declare_insert+0xef/0x240 [lod]
[81891.893314] [<0>] lfsck_namespace_repair_dangling+0xe75/0x1370 [lfsck]
[81891.900770] [<0>] lfsck_namespace_assistant_handler_p1+0x13b1/0x2020 [lfsck]
[81891.908732] [<0>] lfsck_assistant_engine+0x359/0x1c20 [lfsck]
[81891.915378] [<0>] kthread+0x116/0x130
[81891.919931] [<0>] ret_from_fork+0x1f/0x40
[81891.924807] Kernel panic - not syncing: LBUG
[81891.929939] CPU: 24 PID: 2384348 Comm: lfsck_namespace Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-305.10.2.x6.4.010.32.x86_64 #1
[81891.944936] Hardware name: Viking Enterprise Solutions VSSEP1EC/VSSEP1EC, BIOS RWH3LJ-10.07.00 08/29/2022
[81891.955347] Call Trace:
[81891.958645]  dump_stack+0x5c/0x80
[81891.962787]  panic+0xe7/0x2a9
[81891.966564]  ? ret_from_fork+0x1f/0x40
[81891.971112]  lbug_with_loc.cold.10+0x18/0x18 [libcfs]
[81891.976956]  osd_index_declare_ea_insert+0x3d4/0x480 [osd_ldiskfs]
[81891.983914]  ? osd_index_declare_ea_delete+0x1cd/0x2f0 [osd_ldiskfs]
[81891.991040]  lod_sub_declare_insert+0xef/0x240 [lod]
[81891.996762]  lfsck_namespace_repair_dangling+0xe75/0x1370 [lfsck]
[81892.003700]  ? dt_lookup_dir+0x80/0x190 [obdclass]
[81892.009229]  lfsck_namespace_assistant_handler_p1+0x13b1/0x2020 [lfsck]
[81892.016561]  ? __schedule+0x2cc/0x700
[81892.020938]  lfsck_assistant_engine+0x359/0x1c20 [lfsck]
[81892.026945]  ? __switch_to+0x10c/0x480
[81892.031371]  ? __schedule+0x2cc/0x700
[81892.035689]  ? finish_wait+0x80/0x80
[81892.039917]  ? lfsck_master_engine+0xcd0/0xcd0 [lfsck]
[81892.045680]  kthread+0x116/0x130
[81892.049530]  ? kthread_flush_work_fn+0x10/0x10
[81892.054580]  ret_from_fork+0x1f/0x40 


 Comments   
Comment by Alexander Zarochentsev [ 12/May/23 ]
lfsck_namespace_repair_dangling(...):
...
        /* 7a. if child is remote, delete and insert to generate local agent */
        if (dt_object_remote(child)) {
                rc = dt_declare_delete(env, parent,
                                       (const struct dt_key *)lnr->lnr_name,
                                       th);
                if (rc)
                        GOTO(stop, rc);

===>        rc = dt_declare_insert(env, parent, (const struct dt_rec *)rec,
                                       (const struct dt_key *)lnr->lnr_name,
                                       th);
                if (rc)
                        GOTO(stop, rc);
        }

It looks like the 7a code path had never been exercised (or the crash had not been reported yet). It does not initialise rec->rec_fid before calling dt_declare_insert(), which causes the assertion failure in osd_index_declare_ea_insert().
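
A minimal standalone sketch of the failure mode described above, not the Lustre code itself. All demo_* names are hypothetical stand-ins introduced for illustration; the only field borrowed from the ticket is rec_fid. The real osd_index_declare_ea_insert() asserts and panics the node, whereas this model returns -EINVAL so the two paths can be compared without aborting.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a Lustre FID. */
struct demo_fid {
	unsigned long long seq;
	unsigned int oid;
	unsigned int ver;
};

/* Hypothetical, simplified stand-in for the insert record passed as dt_rec. */
struct demo_insert_rec {
	const struct demo_fid *rec_fid;
};

/*
 * Models the declare step: the lower layer requires a non-NULL FID.
 * Assumption: the real code asserts here; this sketch returns -EINVAL.
 */
static int demo_declare_insert(const struct demo_insert_rec *rec)
{
	assert(rec != NULL);
	if (rec->rec_fid == NULL)
		return -EINVAL;
	return 0;
}

/* Path resembling 7a as reported: rec_fid is never set before the declare. */
static int demo_repair_unset(const struct demo_fid *child_fid)
{
	struct demo_insert_rec rec = { NULL };

	(void)child_fid;	/* the FID is available but not stored in rec */
	return demo_declare_insert(&rec);
}

/* Same path with rec_fid initialised before the declare. */
static int demo_repair_set(const struct demo_fid *child_fid)
{
	struct demo_insert_rec rec = { NULL };

	rec.rec_fid = child_fid;
	return demo_declare_insert(&rec);
}
```

In this model demo_repair_unset() fails the FID check while demo_repair_set() passes it, which is the distinction the patch subject "init rec_fid before declare_insert" points at.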

Comment by Gerrit Updater [ 12/May/23 ]

"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50980
Subject: LU-16826 lfsck: init rec_fid before declare_insert
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5dd21e56cbd9a695bf5444218bdec7206c346afe

Comment by Alexander Zarochentsev [ 14/May/23 ]

Reproducer:

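# Set up a test filesystem with two MDTs
MDSCOUNT=2 sh llmount.sh

# Create one directory on each MDT
../utils/lfs mkdir -i 0 /mnt/lustre/mdt0dir
../utils/lfs mkdir -i 1 /mnt/lustre/mdt1dir

# Create a file on MDT0, then move it under the MDT1 directory
touch /mnt/lustre/mdt0dir/foo
mv /mnt/lustre/mdt0dir/foo /mnt/lustre/mdt1dir/
# path2fid prints the FID in brackets; sed strips the first and last character
FOOFID=$(../utils/lfs path2fid /mnt/lustre/mdt1dir/foo | sed -E 's/^.(.*).$/\1/')
echo "$FOOFID"

sync
umount /mnt/lustre-mds1
umount /mnt/lustre-mds2

# Remove the object on the mds1 device so the name entry becomes dangling
echo "rm /REMOTE_PARENT_DIR/$FOOFID" | debugfs -w /dev/mapper/mds1_flakey

mount -t lustre /dev/mapper/mds1_flakey /mnt/lustre-mds1/
mount -t lustre /dev/mapper/mds2_flakey /mnt/lustre-mds2/

# Start LFSCK on both MDTs with -C (create lost objects)
../utils/lctl lfsck_start -M lustre-MDT0000 -C
../utils/lctl lfsck_start -M lustre-MDT0001 -C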

Comment by Gerrit Updater [ 15/May/23 ]

"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50998
Subject: LU-16826 tests: lfsck to repair a dangling remote entry
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: afbf69ada658ca63c7a5953f3de12beb49d3a62b

Comment by Gerrit Updater [ 31/May/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50980/
Subject: LU-16826 lfsck: init rec_fid before declare_insert
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 02ac821653a0b2d897442e276d0afc31755064a4

Comment by Gerrit Updater [ 20/Dec/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50998/
Subject: LU-16826 tests: lfsck to repair a dangling remote entry
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 07e02a600e5707de30e1441ce56b68b0cbc3c260

Comment by Peter Jones [ 20/Dec/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:30:20 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.