[LU-16246] NULL pointer at lod_lookup+0x24/0x38 Created: 18/Oct/22 Updated: 15/Nov/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jason Feng | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
lustre servers: lustre clients: IO500 tag:io500-sc21 |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
[32261.214407] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[32261.223858] Mem abort info:
[32261.227340] ESR = 0x96000004
[32261.231077] EC = 0x25: DABT (current EL), IL = 32 bits
[32261.237060] SET = 0, FnV = 0
[32261.240797] EA = 0, S1PTW = 0
[32261.244621] Data abort info:
[32261.248185] ISV = 0, ISS = 0x00000004
[32261.252702] CM = 0, WnR = 0
[32261.256354] user pgtable: 4k pages, 48-bit VAs, pgdp=0000202681405000
[32261.263462] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[32261.270918] Internal error: Oops: 96000004 [#1] SMP
[32261.276466] Modules linked in: ofd(OE) ost(OE) osd_zfs(POE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) mbcache jbd2 lustre(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) crc32_generic libcfs(OE) dm_flakey dm_mod vfio_pci vfio_virqfd vfio_iommu_type1 vfio cuse rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) rfkill sunrpc nls_cp437 vfat fat zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) aes_ce_blk zcommon(POE) znvpair(POE) crypto_simd zavl(POE) ipmi_ssif cryptd icp(POE) aes_ce_cipher ghash_ce spl(OE) sha1_ce acpi_ipmi sbsa_gwdt ipmi_si ipmi_devintf ipmi_msghandler hisi_uncore_hha_pmu hisi_uncore_ddrc_pmu hisi_uncore_l3c_pmu hisi_uncore_pmu sch_fq_codel binfmt_misc knem(OE) xfs libcrc32c sd_mod sg hclge mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) mlx5_core(OE) mlxfw(OE) hisi_sas_v3_hw tls hisi_sas_main psample sha2_ce libsas nvme ahci
[32261.276555] hibmc_drm mlxdevm(OE) sha256_arm64 nvme_core hns3 libahci scsi_transport_sas drm_vram_helper auxiliary(OE) t10_pi mlx_compat(OE) drm_ttm_helper libata hnae3 ttm megaraid_sas host_edma_drv i2c_designware_platform i2c_designware_core xpmem(OE) fuse
[32261.386429] CPU: 49 PID: 52372 Comm: mdt02_000 Kdump: loaded Tainted: P OE 5.10.0-60.18.0.50.aarch64 #1
[32261.397678] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDDA, BIOS 1.35 04/30/2020
[32261.406595] pstate: 60400009 (nZCv daif +PAN -UAO -TCO BTYPE=--)
[32261.413307] pc : lod_lookup+0x24/0x38 [lod]
[32261.418192] lr : __mdd_lookup.isra.3+0x314/0x5b8 [mdd]
[32261.423997] sp : ffff8000650ab4d0
[32261.427987] x29: ffff8000650ab4d0 x28: ffff2042c84c8820
[32261.433966] x27: ffff80000912d000 x26: 00000000000034e0
[32261.439945] x25: ffff8000650ab6e0 x24: ffff2023d2d16c50
[32261.445924] x23: ffff2023d1ae0080 x22: ffff0045aaf12e60
[32261.451904] x21: ffff2023d1ae0080 x20: ffff80000912d000
[32261.457882] x19: 0000000000000000 x18: 0000000000000001
[32261.463861] x17: 0000000000000000 x16: ffff80000a7df920
[32261.469841] x15: ffffffffffffffff x14: ffffffffffffffff
[32261.475819] x13: 0000000000000018 x12: ffffffffffffffff
[32261.481798] x11: 0000000000000040 x10: 7f7f7f7f7f7f7f7f
[32261.487777] x9 : ffff80000ac39fc4 x8 : 0000000000000001
[32261.493757] x7 : 0000000000000b20 x6 : 0000000000004000
[32261.499737] x5 : ffff80000912d000 x4 : 0000000000000000
[32261.505716] x3 : ffff2023d2d16c50 x2 : ffff8000650ab6e0
[32261.511695] x1 : ffff2042d17dff00 x0 : ffff2023d1ae0080
[32261.517675] Call trace:
[32261.520818] lod_lookup+0x24/0x38 [lod]
[32261.525337] __mdd_lookup.isra.3+0x314/0x5b8 [mdd]
[32261.530806] mdd_lookup+0x108/0x208 [mdd]
[32261.535524] mdt_reint_open+0xffc/0x3810 [mdt]
[32261.540656] mdt_reint_rec+0x170/0x390 [mdt]
[32261.545614] mdt_reint_internal+0x6fc/0xf98 [mdt]
[32261.551004] mdt_intent_open+0x17c/0x470 [mdt]
[32261.556134] mdt_intent_opc+0x194/0x1040 [mdt]
[32261.561265] mdt_intent_policy+0x23c/0x438 [mdt]
[32261.566662] ldlm_lock_enqueue+0x5f0/0xbc0 [ptlrpc]
[32261.572276] ldlm_handle_enqueue0+0x6ec/0x23e0 [ptlrpc]
[32261.578230] tgt_enqueue+0xd4/0x2f0 [ptlrpc]
[32261.583232] tgt_handle_request0+0xd4/0x9b0 [ptlrpc]
[32261.588922] tgt_request_handle+0x7cc/0x1a30 [ptlrpc]
[32261.594701] ptlrpc_server_handle_request+0x3bc/0x1218 [ptlrpc]
[32261.601342] ptlrpc_main+0xdfc/0x16c8 [ptlrpc]
[32261.606462] kthread+0x130/0x138
[32261.610369] ret_from_fork+0x10/0x18
[32261.614621] Code: f9400c24 d1006084 aa0403e1 f9401c84 (f9400084)
[32261.621429] SMP: stopping secondary CPUs
[32261.628375] Starting crashdump kernel...
[32261.632977] Bye! |
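The parenthesised word on the Code: line of the oops is the instruction that faulted. As a rough aid for reading it, the sketch below decodes the 64-bit A64 "LDR (immediate, unsigned offset)" form. The helper name decode_ldr64 and the struct are hypothetical, written only for this illustration; they are not part of Lustre or the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded fields of an A64 "LDR Xt, [Xn, #offset]" instruction. */
struct ldr64 {
    unsigned rt;      /* destination register number */
    unsigned rn;      /* base register number */
    unsigned offset;  /* byte offset added to the base register */
};

/*
 * Decode the 64-bit LDR (immediate, unsigned offset) form.
 * Returns false if the word is some other instruction.
 * (Hypothetical helper, for illustration only.)
 */
static bool decode_ldr64(uint32_t insn, struct ldr64 *out)
{
    /* Bits 31:22 are fixed to 0b1111100101 for this form. */
    if ((insn & 0xFFC00000u) != 0xF9400000u)
        return false;

    out->rt = insn & 0x1Fu;                      /* bits 4:0 */
    out->rn = (insn >> 5) & 0x1Fu;               /* bits 9:5 */
    out->offset = ((insn >> 10) & 0xFFFu) * 8u;  /* imm12, scaled by 8 */
    return true;
}
```

Applied to f9400084 this gives rt = 4, rn = 4, offset = 0, that is "ldr x4, [x4]". The register dump shows x4 as 0000000000000000, which matches the reported fault address. The preceding word f9401c84 decodes the same way to "ldr x4, [x4, #56]"; mapping those offsets to structure fields depends on the exact build and is not attempted here.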
| Comments |
| Comment by Andreas Dilger [ 18/Oct/22 ] |
|
I may not be able to help much here, since I suspect this issue relates somehow to ARM server (what is PAGE_SIZE and endianness?), but some things of note:
|
| Comment by Jason Feng [ 19/Oct/22 ] |
|
Thanks for the comment. I will try b2_15 and the new IO500 sc22.

static int lod_lookup(const struct lu_env *env, struct dt_object *dt,
                      struct dt_rec *rec, const struct dt_key *key)
{
        struct dt_object *next = dt_object_child(dt);

        return next->do_index_ops->dio_lookup(env, next, rec, key);
}

It shows that next is NULL here. If next == NULL, would it be OK to return -1 to avoid the NULL pointer dereference?
|
| Comment by Andreas Dilger [ 19/Oct/22 ] |
That might be OK for debugging (I would suggest returning something like -ENOENT or -EINVAL), but I suspect it will still not work properly because there is likely a problem elsewhere in the code. The "dt" object is a directory, and the mdd_lookup() caller should have initialized the object correctly before calling lod_lookup(). I suspect some larger problem here, such as broken locking. |
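As a minimal sketch of the debugging guard discussed above, the following standalone C code returns -EINVAL instead of dereferencing a NULL pointer. The types mock_object and mock_index_ops and the functions guarded_lookup and always_found are hypothetical stand-ins invented for this example; they only mirror the shape of the next->do_index_ops->dio_lookup chain and are not the real Lustre definitions.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real Lustre types, for illustration only. */
struct mock_object;

struct mock_index_ops {
    int (*lookup)(struct mock_object *obj, const char *key);
};

struct mock_object {
    const struct mock_index_ops *index_ops;
};

/* Example callback that reports success for every key. */
static int always_found(struct mock_object *obj, const char *key)
{
    (void)obj;
    (void)key;
    return 0;
}

/*
 * Debug-only guard: return -EINVAL rather than dereferencing a NULL
 * pointer anywhere along next->index_ops->lookup.
 */
static int guarded_lookup(struct mock_object *next, const char *key)
{
    if (next == NULL || next->index_ops == NULL ||
        next->index_ops->lookup == NULL)
        return -EINVAL;

    return next->index_ops->lookup(next, key);
}
```

Checking every link in the chain, not only the first pointer, keeps the guard useful whichever pointer turns out to be NULL. As noted in the comment above, this only hides the symptom and is unlikely to address the underlying problem.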
| Comment by Jason Feng [ 19/Oct/22 ] |
|
The directory is not deleted during the test, so the crash may be caused by a memory problem. I will try to reproduce the problem and capture a complete vmcore file for further analysis. |
| Comment by Aurelien Degremont (Inactive) [ 09/Nov/22 ] |
|
For the record, AWS reproduced a very similar crash in `lod_lookup()` on AWS-specific Graviton ARM processors (4K pages), but in that case it was 'do_index_ops', not 'next', that was NULL. The crash dump shows that the dt_object memory structure is correct and do_index_ops has the correct value in memory, but the register was NULL and the system crashed. This happened several times. This is running 2.12.9 + backports. |
| Comment by Jason Feng [ 09/Nov/22 ] |
|
Because kdump cannot identify a timing (ordering) problem, can we add logging to narrow down where the problem occurs?
|
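One possible shape for such logging, as a sketch only: record which pointer in the lookup chain is NULL before it is dereferenced, since the two reports in this ticket differ on that point. The types and the function describe_null_link below are hypothetical, and snprintf stands in for whatever logging facility the real code would use.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the real Lustre types, for illustration only. */
struct mock_index_ops {
    int (*lookup)(void *obj, const char *key);
};

struct mock_object {
    const struct mock_index_ops *index_ops;
};

/*
 * Write a one-line diagnostic naming the first NULL pointer in the
 * next->index_ops->lookup chain. Returns 1 if a NULL was found and a
 * message was written to buf, 0 if the chain is intact.
 */
static int describe_null_link(const struct mock_object *next,
                              char *buf, size_t len)
{
    const char *which = NULL;

    if (next == NULL)
        which = "next";
    else if (next->index_ops == NULL)
        which = "next->index_ops";
    else if (next->index_ops->lookup == NULL)
        which = "next->index_ops->lookup";

    if (which == NULL)
        return 0;

    snprintf(buf, len, "lookup: %s is NULL (object %p)", which,
             (const void *)next);
    return 1;
}
```

Logging the object address alongside the NULL field makes it possible to compare the message against the same object in a later crash dump.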