Details
Bug
Resolution: Unresolved
Minor
Description
The client crashes when gdsio runs with a 32K I/O size, which is smaller than the 64K PAGE_SIZE of the ARM CPU, with unaligned_dio enabled:
# getconf PAGE_SIZE
65536
# /usr/local/cuda/gds/tools/gdsio -f /lustre/file -d 0 -n 0 -w 1 -s 1m -i 32k -x 0 -I 1
[66108.386817] Unable to handle kernel access to user memory outside uaccess routines at virtual address 0000fffd787a1000
[66108.397771] Mem abort info:
[66108.400627] ESR = 0x000000009600000f
[66108.404455] EC = 0x25: DABT (current EL), IL = 32 bits
[66108.409886] SET = 0, FnV = 0
[66108.413002] EA = 0, S1PTW = 0
[66108.416206] FSC = 0x0f: level 3 permission fault
[66108.421104] Data abort info:
[66108.424041] ISV = 0, ISS = 0x0000000f, ISS2 = 0x00000000
[66108.429649] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[66108.434809] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[66108.440239] user pgtable: 64k pages, 48-bit VAs, pgdp=00000001528da000
[66108.446911] [0000fffd787a1000] pgd=080000048cb30003, p4d=080000048cb30003, pud=080000048cb30003, pmd=08000004953f0003, pte=00e8000642910f43
[66108.459722] Internal error: Oops: 000000009600000f [#1] SMP
[66108.465419] Modules linked in: mgc(OE) lustre(OE) mdc(OE) fid(OE) lov(OE) osc(OE) lmv(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxdevm(OE) ib_uverbs(OE) ib_core(OE) mlx_compat(OE) psample mlxfw(OE) macsec tls pci_hyperv_intf knem(OE) mst_pciconf(OE) crc32_generic rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs nvidia_uvm(OE) nvidia_drm(OE) nvidia_modeset(OE) nvidia(OE) video nouveau drm_exec gpu_sched drm_display_helper cec drm_ttm_helper ttm vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables libcrc32c nfnetlink qrtr sunrpc vfat fat acpi_ipmi spi_nor ipmi_ssif i2c_smbus arm_cspmu_module arm_spe_pmu mtd ipmi_devintf ipmi_msghandler
[66108.465468] coresight_stm coresight_tmc stm_core coresight_funnel cppc_cpufreq coresight ext4 mbcache jbd2 ast drm_shmem_helper i2c_algo_bit drm_kms_helper syscopyarea crct10dif_ce sysfillrect ghash_ce sysimgblt sha2_ce fb_sys_fops nvme sha256_arm64 ixgbe drm sha1_ce nvme_core sbsa_gwdt mdio nvme_common spi_tegra210_quad acpi_power_meter fuse [last unloaded: libcfs]
[66108.587367] CPU: 21 PID: 937378 Comm: gdsio Kdump: loaded Tainted: G OE ------- --- 5.14.0-427.18.1.el9_4.aarch64+64k #1
[66108.599907] Hardware name: Giga Computing H223-V10-AAW1-000/MV13-HD0-000, BIOS F07 05/13/2024
[66108.608625] pstate: 83400009 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[66108.615741] pc : cl_sub_dio_alloc+0x13c/0x300 [obdclass]
[66108.621202] lr : cl_sub_dio_alloc+0x128/0x300 [obdclass]
[66108.626648] sp : ffff8000bb4af650
[66108.630030] x29: ffff8000bb4af650 x28: ffff0002a1389be8 x27: ffff00040ce5c698
[66108.637325] x26: ffffa02065880000 x25: 0000000000000000 x24: 0000000000000001
[66108.644619] x23: ffffa020672958c0 x22: ffff8000bb4af880 x21: 0000000000000006
[66108.651914] x20: ffff0004364ed1a0 x19: ffff00040ab154a8 x18: 0000000000000000
[66108.659209] x17: 0000000000000000 x16: ffffa020c0d4a100 x15: 0000000000000000
[66108.666503] x14: 0000000000000fd4 x13: 0000000000000000 x12: 0000000000000fd3
[66108.673798] x11: 0000000000000040 x10: 000000000002dcd5 x9 : ffffa020c0af9e64
[66108.681093] x8 : ffff0000de8642e0 x7 : 0000000000000000 x6 : 0000000001704015
[66108.688388] x5 : ffff6056f2bd0000 x4 : ffff0004538c3f00 x3 : ffff0000de8642d0
[66108.695683] x2 : ffff6056f2bd0000 x1 : ffff0004538c3f00 x0 : 0000fffd787a1000
[66108.702977] Call trace:
[66108.705472] cl_sub_dio_alloc+0x13c/0x300 [obdclass]
[66108.710562] ll_direct_IO_impl+0x328/0xa60 [lustre]
[66108.715568] ll_direct_IO+0x18/0x20 [lustre]
[66108.719940] generic_file_direct_write+0xd0/0x1dc
[66108.724759] __generic_file_write_iter+0x98/0x1b0
[66108.729565] vvp_io_write_start+0x32c/0xae0 [lustre]
[66108.734648] cl_io_start+0x78/0x140 [obdclass]
[66108.739220] cl_io_loop+0xac/0x210 [obdclass]
[66108.743688] ll_file_io_generic+0x428/0xc60 [lustre]
[66108.748784] do_file_write_iter+0x444/0x680 [lustre]
[66108.753866] ll_file_write_iter+0x58/0x120 [lustre]
[66108.758858] vfs_write+0x250/0x300
[66108.762334] ksys_pwrite64+0x78/0xc0
[66108.765983] __arm64_sys_pwrite64+0x24/0x30
[66108.770255] invoke_syscall.constprop.0+0x7c/0xd0
[66108.775067] do_el0_svc+0xb4/0xd0
[66108.778449] el0_svc+0xe8/0x1f4
[66108.781657] el0t_64_sync_handler+0x134/0x150
[66108.786107] el0t_64_sync+0x17c/0x180
[66108.789850] Code: 37200560 f9404a63 b4000de3 f9400ec0 (a9400400)
[66108.796080] SMP: stopping secondary CPUs
[66108.802025] Starting crashdump kernel...
[66108.806032] Bye!
Without unaligned_dio, gdsio fails because the 32K I/O size is not aligned to PAGE_SIZE. This is expected and fine:
# lctl set_param llite.*.unaligned_dio=0
# /usr/local/cuda/gds/tools/gdsio -f /lustre/file -d 0 -n 0 -w 1 -s 1m -i 32k -x 0 -I 1
write io failed of type 1 size: 32768 , ret: 0
failed to submit io of type 1 ret: -5
Error: IO failed stopping traffic, fd :35 ret:-5 errno :5
io failed :ret :-5 errno :5, file offset :0, block size :32768
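For reference, the alignment condition behind this failure can be checked by hand. The snippet below is only a sketch: is_page_aligned is a hypothetical helper written for this illustration, not part of gdsio or Lustre, and it tests nothing beyond whether an I/O size is a whole multiple of the page size.

```shell
#!/bin/sh
# Hypothetical helper (not part of gdsio or Lustre): reports whether an
# I/O size in bytes is a whole multiple of the given page size in bytes.
is_page_aligned() {
    io_size=$1
    page_size=$2
    if [ $((io_size % page_size)) -eq 0 ]; then
        echo aligned
    else
        echo unaligned
    fi
}

# Page size of the running kernel (65536 on the client in this report).
page_size=$(getconf PAGE_SIZE)

# The 32k I/O size (-i 32k) and the 1m file size (-s 1m) from the gdsio command.
is_page_aligned $((32 * 1024)) "$page_size"
is_page_aligned $((1024 * 1024)) "$page_size"
```

On a 64K-page kernel the 32k size is reported as unaligned, while on a 4K-page kernel the same size is aligned, which would explain why the problem only shows up on this ARM configuration.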
Tested commit: "ede8d928d6 LU-17871 ldlm: FLOCK ownlocks may be not set" in master branch.