Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.15.4
-
None
-
Rocky 8.9 client:
- Lustre 2.15.4
- Kernel 4.18.0-513.11.1.el8_9.x86_64
vs. Lustre 2.12.6 server
-
3
Description
Hi,
We have a Rocky 8.9 / Lustre 2.15.4 client that has trouble running a particular large single-node MPI application when its input/output files are stored on a Lustre 2.12.6 filesystem. We didn't see this when the client was running Rocky 8.8 / Lustre 2.12.9.
The application hangs for a while shortly after startup and eventually terminates with an error. The application messages imply a failure during a Fortran OPEN or READ statement.
I see multiple messages such as the following in the client syslog:
Feb 7 14:10:18 xxxx kernel: watchdog: BUG: soft lockup - CPU#93 stuck for 22s! [vasp_std:1029118]
Feb 7 14:10:18 xxxx kernel: Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) lov(OE) fld(OE) osc(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) libcfs(OE) 8021q garp mrp stp llc rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad intel_rapl_msr intel_rapl_common rdma_cm ib_ipoib iw_cm xfs amd64_edac_mod ib_cm edac_mce_amd amd_energy libcrc32c ipmi_ssif kvm dell_smbios wmi_bmof dell_wmi_descriptor irqbypass crct10dif_pclmul crc32_pclmul dcdbas ghash_clmulni_intel mlx5_ib rapl pcspkr ib_uverbs ib_core ccp sp5100_tco k10temp acpi_ipmi i2c_piix4 ipmi_si ptdma ipmi_devintf wmi ipmi_msghandler acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sd_mod t10_pi sg mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_shmem_helper mlx5_core ahci drm crc32c_intel libahci libata mlxfw pci_hyperv_intf tls tg3 psample dm_mirror dm_region_hash
Feb 7 14:10:18 xxxx kernel: dm_log dm_mod fuse
Feb 7 14:10:18 xxxx kernel: CPU: 93 PID: 1029118 Comm: vasp_std Kdump: loaded Tainted: G OEL --------- - - 4.18.0-513.11.1.el8_9.x86_64 #1
Feb 7 14:10:18 xxxx kernel: Hardware name: Dell Inc. PowerEdge C6525/0978PJ, BIOS 2.12.4 07/26/2023
Feb 7 14:10:18 xxxx kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
Feb 7 14:10:18 xxxx kernel: Code: c0 e9 33 09 00 00 b8 01 00 00 00 e9 29 09 00 00 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 c6 07 00 0f 1f 40 00 48 89 f7 57 9d <0f> 1f 44 00 00 e9 05 09 00 00 0f 1f 44 00 00 0f 1f 44 00 00 8b 07
Feb 7 14:10:18 xxxx kernel: RSP: 0018:ffffb26cec28fa70 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
Feb 7 14:10:18 xxxx kernel: RAX: 00000000feec4859 RBX: ffffa0efcc96fa60 RCX: dead000000000200
Feb 7 14:10:18 xxxx kernel: RDX: ffffb26cee9d37f8 RSI: 0000000000000202 RDI: 0000000000000202
Feb 7 14:10:18 xxxx kernel: RBP: 00000000feec4859 R08: ffffb26cee9d37f8 R09: 0000000000032940
Feb 7 14:10:18 xxxx kernel: R10: 000013cd01a8e0f8 R11: 0000000000000002 R12: 0000000000000202
Feb 7 14:10:18 xxxx kernel: R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
Feb 7 14:10:18 xxxx kernel: FS: 00007f3be4341940(0000) GS:ffffa106dfd40000(0000) knlGS:0000000000000000
Feb 7 14:10:18 xxxx kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 7 14:10:18 xxxx kernel: CR2: 00000000005c2183 CR3: 00000028eddec000 CR4: 0000000000350ee0
Feb 7 14:10:18 xxxx kernel: Call Trace:
Feb 7 14:10:18 xxxx kernel: <IRQ>
Feb 7 14:10:18 xxxx kernel: ? watchdog_timer_fn.cold.10+0x46/0x9e
Feb 7 14:10:18 xxxx kernel: ? watchdog+0x30/0x30
Feb 7 14:10:18 xxxx kernel: ? __hrtimer_run_queues+0x101/0x280
Feb 7 14:10:18 xxxx kernel: ? hrtimer_interrupt+0x100/0x220
Feb 7 14:10:18 xxxx kernel: ? sched_clock+0x5/0x10
Feb 7 14:10:18 xxxx kernel: ? smp_apic_timer_interrupt+0x6a/0x130
Feb 7 14:10:18 xxxx kernel: ? apic_timer_interrupt+0xf/0x20
Feb 7 14:10:18 xxxx kernel: </IRQ>
Feb 7 14:10:18 xxxx kernel: ? _raw_spin_unlock_irqrestore+0x11/0x20
Feb 7 14:10:18 xxxx kernel: __wake_up_common_lock+0x89/0xc0
Feb 7 14:10:18 xxxx kernel: mdc_close+0x2ba/0x970 [mdc]
Feb 7 14:10:18 xxxx kernel: lmv_close+0x11d/0x2c0 [lmv]
Feb 7 14:10:18 xxxx kernel: ll_close_inode_openhandle+0x361/0xe20 [lustre]
Feb 7 14:10:18 xxxx kernel: ll_release_openhandle+0x2f8/0x400 [lustre]
Feb 7 14:10:18 xxxx kernel: ll_file_open+0x6c0/0xd40 [lustre]
Feb 7 14:10:18 xxxx kernel: ? ll_intent_file_open+0x960/0x960 [lustre]
Feb 7 14:10:18 xxxx kernel: do_dentry_open+0x143/0x3a0
Feb 7 14:10:18 xxxx kernel: path_openat+0x55b/0x1580
Feb 7 14:10:18 xxxx kernel: ? filemap_map_pages+0x271/0x410
Feb 7 14:10:18 xxxx kernel: ? alloc_set_pte+0xb8/0x3e0
Feb 7 14:10:18 xxxx kernel: do_filp_open+0x93/0x100
Feb 7 14:10:18 xxxx kernel: ? getname_flags+0x4a/0x1e0
Feb 7 14:10:18 xxxx kernel: ? __check_object_size+0xac/0x173
Feb 7 14:10:18 xxxx kernel: ? __alloc_fd+0x44/0x150
Feb 7 14:10:18 xxxx kernel: do_sys_openat2+0x211/0x2b0
Feb 7 14:10:18 xxxx kernel: do_sys_open+0x4b/0x80
Feb 7 14:10:18 xxxx kernel: do_syscall_64+0x5b/0x1b0
Feb 7 14:10:18 xxxx kernel: entry_SYSCALL_64_after_hwframe+0x61/0xc6
Feb 7 14:10:18 xxxx kernel: RIP: 0033:0x7f3be10e72a6
Feb 7 14:10:18 xxxx kernel: Code: 89 54 24 08 e8 9b f4 ff ff 8b 74 24 0c 48 8b 3c 24 41 89 c0 44 8b 54 24 08 b8 01 01 00 00 89 f2 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 30 44 89 c7 89 44 24 08 e8 c6 f4 ff ff 8b 44
Feb 7 14:10:18 xxxx kernel: RSP: 002b:00007ffdae754ef0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
Feb 7 14:10:18 xxxx kernel: RAX: ffffffffffffffda RBX: 0000000000080002 RCX: 00007f3be10e72a6
Feb 7 14:10:18 xxxx kernel: RDX: 0000000000080002 RSI: 0000000009dc3b50 RDI: 00000000ffffff9c
Feb 7 14:10:18 xxxx kernel: RBP: 0000000009dc3b50 R08: 0000000000000000 R09: 000000000942206c
Feb 7 14:10:18 xxxx kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffdae755120
Feb 7 14:10:18 xxxx kernel: R13: 00007ffdae755650 R14: 0000000000080000 R15: 0000000000000000
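For anyone reading the trace, here is a rough Python sketch that summarizes it: it keeps only the frames that belong to a kernel module and decodes the user-space syscall registers from the second register dump. The helper names (`module_frames`, `decode_openat`) are made up for illustration, the frame list is a subset copied from the log above with the syslog prefix stripped, and the constants are the x86_64 Linux values, so treat it as a reading aid rather than a diagnosis.

```python
import re

# Subset of frames copied from the soft-lockup trace (syslog prefix stripped).
TRACE = """\
? _raw_spin_unlock_irqrestore+0x11/0x20
__wake_up_common_lock+0x89/0xc0
mdc_close+0x2ba/0x970 [mdc]
lmv_close+0x11d/0x2c0 [lmv]
ll_close_inode_openhandle+0x361/0xe20 [lustre]
ll_release_openhandle+0x2f8/0x400 [lustre]
ll_file_open+0x6c0/0xd40 [lustre]
? ll_intent_file_open+0x960/0x960 [lustre]
do_dentry_open+0x143/0x3a0
do_sys_open+0x4b/0x80
"""

# "<? >function+0xoff/0xsize [module]"; the module tag is optional.
FRAME_RE = re.compile(r"^(\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?$")


def module_frames(trace):
    """Return (function, module) pairs, skipping '?' (unreliable) frames."""
    frames = []
    for line in trace.splitlines():
        m = FRAME_RE.match(line.strip())
        if m and m.group(1) is None and m.group(3):
            frames.append((m.group(2), m.group(3)))
    return frames


# x86_64 Linux values (assumed to match the client in this report).
X86_64_OPENAT = 257        # syscall number of openat
AT_FDCWD = -100
O_ACCMODE = 0o3
O_CLOEXEC = 0o2000000
ACCESS = {0: "O_RDONLY", 1: "O_WRONLY", 2: "O_RDWR"}


def decode_openat(orig_rax, rdi, rdx):
    """Decode dirfd and flags of an openat() call from the register dump."""
    if orig_rax != X86_64_OPENAT:
        raise ValueError("not an openat syscall")
    # dirfd is a 32-bit signed int; the dump shows it zero-extended.
    dirfd = rdi - (1 << 32) if rdi & 0x80000000 else rdi
    flags = [ACCESS.get(rdx & O_ACCMODE, "O_ACCMODE?")]
    if rdx & O_CLOEXEC:
        flags.append("O_CLOEXEC")
    rest = rdx & ~(O_ACCMODE | O_CLOEXEC)
    if rest:
        flags.append(hex(rest))  # flags this sketch does not name
    return ("AT_FDCWD" if dirfd == AT_FDCWD else dirfd), "|".join(flags)


if __name__ == "__main__":
    for func, mod in module_frames(TRACE):
        print(f"{mod:8s} {func}")
    # ORIG_RAX, RDI and RDX from the user-space register dump in the log.
    print(decode_openat(0x101, 0xFFFFFF9C, 0x80002))
```

With the values from the log, the module frames run `ll_file_open` -> `ll_release_openhandle` -> `ll_close_inode_openhandle` -> `lmv_close` -> `mdc_close`, and the syscall decodes as `openat(AT_FDCWD, ..., O_RDWR|O_CLOEXEC)`. So the CPU appears to be stuck while closing an open handle from inside the open path, which would be consistent with the application reporting a failure around a Fortran OPEN.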
Any ideas, please?
Apologies, I misread that, as I'm still getting to grips with Gerrit. In fact, cherry-picking either Shaun's or Neil's patchset on top of 2.15.61 fixes the problem, allowing the application to start and run normally.