Details
-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0, Lustre 2.14.0
-
None
-
Lustre 2.12.6 with ldiskfs (with MDT0000 formatted in Lustre 2.7)
-
3
-
9223372036854775807
Description
This crash has occurred in production.
A admin executed script sending fid2path requests with malformed FID. This caused the following crash on the MDT0000.
[65226.320307] LustreError: 24497:0:(osd_handler.c:1368:osd_fid_lookup()) ASSERTION( !updated ) failed: [65226.321549] LustreError: 24497:0:(osd_handler.c:1368:osd_fid_lookup()) LBUG [65226.322160] Pid: 24497, comm: mdt00_004 3.10.0-1127.8.2.el7_lustre.x86_64 #1 SMP Mon Jun 8 13:48:45 UTC 2020 [65226.322161] Call Trace: [65226.322171] [<ffffffffc09c57cc>] libcfs_call_trace+0x8c/0xc0 [libcfs] [65226.322180] [<ffffffffc09c587c>] lbug_with_loc+0x4c/0xa0 [libcfs] [65226.322184] [<ffffffffc113c4e1>] osd_fid_lookup+0x17e1/0x1d90 [osd_ldiskfs] [65226.322192] [<ffffffffc113caf1>] osd_object_init+0x61/0x110 [osd_ldiskfs] [65226.322198] [<ffffffffc0c0d82b>] lu_object_start.isra.35+0x8b/0x120 [obdclass] [65226.322221] [<ffffffffc0c11744>] lu_object_find_at+0x234/0xab0 [obdclass] [65226.322235] [<ffffffffc0c11fd6>] lu_object_find+0x16/0x20 [obdclass] [65226.322248] [<ffffffffc12e146b>] mdt_object_find+0x4b/0x170 [mdt] [65226.322259] [<ffffffffc12ed0a5>] mdt_rpc_fid2path+0x4a5/0x17c0 [mdt] [65226.322267] [<ffffffffc12ee770>] mdt_get_info+0x3b0/0x490 [mdt] [65226.322275] [<ffffffffc0f5489a>] tgt_request_handle+0xada/0x1570 [ptlrpc] [65226.322319] [<ffffffffc0ef973b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [65226.322343] [<ffffffffc0efd0a4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] [65226.322366] [<ffffffffa2ac6691>] kthread+0xd1/0xe0 [65226.322369] [<ffffffffa3192d37>] ret_from_fork_nospec_end+0x0/0x39 [65226.322372] [<ffffffffffffffff>] 0xffffffffffffffff
After some troubleshooting, I was able to reproduce the issue on a 2.12.7.
Here are the following steps:
- Format filesystem (ldiskfs and mount FS
- umount MDT0000
- Mount MDT0000 as ldiskfs and modify OI_scrub file (binary) by clearing the SIF_NO_HANDLE_OLD_FID in sf_internal_flags
- Remount MDT0000 as lustre
- Get the inode number of a Lustre system file:
debugfs -c -R 'stat lfsck_layout' /dev/mapper/mds1_flakey
Inode: 132 .... - Execute fid2path with inode as FID (IGIF FID):
lfs fid2path /media/lustrefs/client 0x84 - Crash
Dk log extracted from the crash:
00000004:00000040:0.0:1626701081.815693:0:24497:0:(mdt_handler.c:5736:mdt_object_init()) object init, fid = [0x84:0x0:0x0] 00000004:00000010:0.0:1626701081.815695:0:24497:0:(mdd_object.c:277:mdd_object_alloc()) slab-alloced 'mdd_obj': 96 at ffff9c86d6d6a120. 00000004:00000001:0.0:1626701081.815697:0:24497:0:(mdt_handler.c:5745:mdt_object_init()) Process leaving (rc=0 : 0 : 0) 00000004:00000001:0.0:1626701081.815698:0:24497:0:(mdd_object.c:299:mdd_object_init()) Process entered 00000004:00000001:0.0:1626701081.815699:0:24497:0:(lod_dev.c:212:lod_object_alloc()) Process entered 00000004:00000010:0.0:1626701081.815700:0:24497:0:(lod_dev.c:214:lod_object_alloc()) slab-alloced 'lod_obj': 152 at ffff9c8699f347b8. 00000004:00000001:0.0:1626701081.815701:0:24497:0:(lod_dev.c:224:lod_object_alloc()) Process leaving (rc=18446634700655249336 : -109373054302280 : ffff9c8699f347b8) 00000004:00000001:0.0:1626701081.815701:0:24497:0:(mdd_object.c:309:mdd_object_init()) Process leaving (rc=0 : 0 : 0) 00000004:00000001:0.0:1626701081.815703:0:24497:0:(lod_object.c:6967:lod_object_init()) Process entered 00000004:00000001:0.0:1626701081.815703:0:24497:0:(lod_dev.c:128:lod_fld_lookup()) Process entered 80000000:00000001:0.0:1626701081.815704:0:24497:0:(fld_handler.c:248:fld_server_lookup()) Process entered 80000000:00000001:0.0:1626701081.815704:0:24497:0:(fld_handler.c:212:fld_local_lookup()) Process entered 80000000:00000001:0.0:1626701081.815705:0:24497:0:(fld_cache.c:517:fld_cache_lookup()) Process entered 80000000:00000001:0.0:1626701081.815705:0:24497:0:(fld_cache.c:536:fld_cache_lookup()) Process leaving (rc=0 : 0 : 0) 80000000:00000001:0.0:1626701081.815706:0:24497:0:(fld_handler.c:229:fld_local_lookup()) Process leaving (rc=0 : 0 : 0) 80000000:00000001:0.0:1626701081.815706:0:24497:0:(fld_handler.c:252:fld_server_lookup()) Process leaving (rc=0 : 0 : 0) 00000004:00000040:0.0:1626701081.815707:0:24497:0:(lod_dev.c:169:lod_fld_lookup()) lustrefs-MDT0000-mdtlov: got tgt 0 for sequence: 0x84 00000004:00000001:0.0:1626701081.815708:0:24497:0:(lod_dev.c:171:lod_fld_lookup()) Process leaving (rc=0 : 0 : 0) 00080000:00000010:0.0:1626701081.815709:0:24497:0:(osd_handler.c:420:osd_object_alloc()) kmalloced '(mo)': 200 at ffff9c86d56fc400. 00000004:00000001:0.0:1626701081.815710:0:24497:0:(lod_object.c:7010:lod_object_init()) Process leaving (rc=0 : 0 : 0) 00080000:00000001:0.0:1626701081.815711:0:24497:0:(osd_handler.c:1142:osd_fid_lookup()) Process entered 00080000:00000001:0.0:1626701081.815712:0:24497:0:(osd_oi.c:576:fid_is_on_ost()) Process entered 00080000:00000001:0.0:1626701081.815712:0:24497:0:(osd_oi.c:584:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0) 00080000:00000001:0.0:1626701081.815713:0:24497:0:(osd_handler.c:628:osd_iget_check()) Process entered 00080000:00000001:0.0:1626701081.815715:0:24497:0:(osd_handler.c:770:osd_iget_check()) Process leaving via put (rc=0 : 0 : 0x0) 00080000:00000001:0.0:1626701081.815715:0:24497:0:(osd_handler.c:854:osd_check_lma()) Process entered 00080000:00000001:0.0:1626701081.815722:0:24497:0:(osd_handler.c:929:osd_check_lma()) Process leaving (rc=18446744073709551538 : -78 : ffffffffffffffb2) 00080000:00000001:0.0:1626701081.815723:0:24497:0:(osd_oi.c:576:fid_is_on_ost()) Process entered 00080000:00000001:0.0:1626701081.815723:0:24497:0:(osd_oi.c:584:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0) 00080000:00000001:0.0:1626701081.815723:0:24497:0:(osd_oi.c:576:fid_is_on_ost()) Process entered 00080000:00000001:0.0:1626701081.815723:0:24497:0:(osd_oi.c:584:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0) 00080000:00000001:0.0:1626701081.815724:0:24497:0:(osd_handler.c:628:osd_iget_check()) Process entered 00080000:00000001:0.0:1626701081.815724:0:24497:0:(osd_handler.c:770:osd_iget_check()) Process leaving via put (rc=0 : 0 : 0x0) 00080000:00000001:0.0:1626701081.815725:0:24497:0:(osd_handler.c:854:osd_check_lma()) Process entered 00080000:00000001:0.0:1626701081.815725:0:24497:0:(osd_handler.c:929:osd_check_lma()) Process leaving (rc=18446744073709551538 : -78 : ffffffffffffffb2) 00080000:00040000:0.0:1626701081.815726:0:24497:0:(osd_handler.c:1368:osd_fid_lookup()) ASSERTION( !updated ) failed: 00080000:00040000:0.0:1626701081.816969:0:24497:0:(osd_handler.c:1368:osd_fid_lookup()) LBUG
It appears that the flag SIF_NO_HANDLE_OLD_FID was not set in older versions of Lustre. In our case the FS was formatted in Lustre 2.7. But I checked some other filesystems with a Lustre's clusterstore version and they seem to be affected too.
This cause od_igif_inoi flag to not be set in osd_device.
And then the following code get the inode directly from the FID's sequence (as in Lustre 1.8 for IGIF FID):
int osd_oi_lookup(.... .... if (!osd->od_igif_inoi && fid_is_igif(fid)) { osd_id_gen(id, lu_igif_ino(fid), lu_igif_gen(fid)); return 0; }
Then osd_check_lma returns an error because FID in LMA and the current FID (IGIF: [0x84:0x0:0x0]) does not match:
static int osd_check_lma(const struct lu_env *env, struct osd_object *obj) ...... if (fid != NULL && unlikely(!lu_fid_eq(rfid, fid))) { if (fid_is_idif(rfid) && fid_is_idif(fid)) { ................... } rc = -EREMCHG; }