Lustre / LU-17510

Client hung on ll_file_open

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.15.4
    • Environment: Rocky 8.9 client:
      - Lustre 2.15.4
      - Kernel 4.18.0-513.11.1.el8_9.x86_64

      vs. Lustre 2.12.6 server
    • Severity: 3

    Description

      Hi,

      We have a Rocky 8.9 / Lustre 2.15.4 client which has trouble running a particular large single-node MPI application, when its input/output files are stored on a Lustre 2.12.6 filesystem. We didn't see this when the client was running Rocky 8.8 / Lustre 2.12.9.

      The application hangs for a while shortly after startup and eventually terminates with an error. The application's messages imply a failure during a Fortran OPEN or READ statement.

      I see multiple messages such as the following in the client syslog:-

      Feb  7 14:10:18 xxxx kernel: watchdog: BUG: soft lockup - CPU#93 stuck for 22s! [vasp_std:1029118]
      Feb  7 14:10:18 xxxx kernel: Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) lov(OE) fld(OE) osc(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) libcfs(OE) 8021q garp mrp stp llc rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad intel_rapl_msr intel_rapl_common rdma_cm ib_ipoib iw_cm xfs amd64_edac_mod ib_cm edac_mce_amd amd_energy libcrc32c ipmi_ssif kvm dell_smbios wmi_bmof dell_wmi_descriptor irqbypass crct10dif_pclmul crc32_pclmul dcdbas ghash_clmulni_intel mlx5_ib rapl pcspkr ib_uverbs ib_core ccp sp5100_tco k10temp acpi_ipmi i2c_piix4 ipmi_si ptdma ipmi_devintf wmi ipmi_msghandler acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sd_mod t10_pi sg mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_shmem_helper mlx5_core ahci drm crc32c_intel libahci libata mlxfw pci_hyperv_intf tls tg3 psample dm_mirror dm_region_hash
      Feb  7 14:10:18 xxxx kernel: dm_log dm_mod fuse
      Feb  7 14:10:18 xxxx kernel: CPU: 93 PID: 1029118 Comm: vasp_std Kdump: loaded Tainted: G           OEL   --------- -  - 4.18.0-513.11.1.el8_9.x86_64 #1
      Feb  7 14:10:18 xxxx kernel: Hardware name: Dell Inc. PowerEdge C6525/0978PJ, BIOS 2.12.4 07/26/2023
      Feb  7 14:10:18 xxxx kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
      Feb  7 14:10:18 xxxx kernel: Code: c0 e9 33 09 00 00 b8 01 00 00 00 e9 29 09 00 00 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 c6 07 00 0f 1f 40 00 48 89 f7 57 9d <0f> 1f 44 00 00 e9 05 09 00 00 0f 1f 44 00 00 0f 1f 44 00 00 8b 07
      Feb  7 14:10:18 xxxx kernel: RSP: 0018:ffffb26cec28fa70 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
      Feb  7 14:10:18 xxxx kernel: RAX: 00000000feec4859 RBX: ffffa0efcc96fa60 RCX: dead000000000200
      Feb  7 14:10:18 xxxx kernel: RDX: ffffb26cee9d37f8 RSI: 0000000000000202 RDI: 0000000000000202
      Feb  7 14:10:18 xxxx kernel: RBP: 00000000feec4859 R08: ffffb26cee9d37f8 R09: 0000000000032940
      Feb  7 14:10:18 xxxx kernel: R10: 000013cd01a8e0f8 R11: 0000000000000002 R12: 0000000000000202
      Feb  7 14:10:18 xxxx kernel: R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
      Feb  7 14:10:18 xxxx kernel: FS:  00007f3be4341940(0000) GS:ffffa106dfd40000(0000) knlGS:0000000000000000
      Feb  7 14:10:18 xxxx kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Feb  7 14:10:18 xxxx kernel: CR2: 00000000005c2183 CR3: 00000028eddec000 CR4: 0000000000350ee0
      Feb  7 14:10:18 xxxx kernel: Call Trace:
      Feb  7 14:10:18 xxxx kernel: <IRQ>
      Feb  7 14:10:18 xxxx kernel: ? watchdog_timer_fn.cold.10+0x46/0x9e
      Feb  7 14:10:18 xxxx kernel: ? watchdog+0x30/0x30
      Feb  7 14:10:18 xxxx kernel: ? __hrtimer_run_queues+0x101/0x280
      Feb  7 14:10:18 xxxx kernel: ? hrtimer_interrupt+0x100/0x220
      Feb  7 14:10:18 xxxx kernel: ? sched_clock+0x5/0x10
      Feb  7 14:10:18 xxxx kernel: ? smp_apic_timer_interrupt+0x6a/0x130
      Feb  7 14:10:18 xxxx kernel: ? apic_timer_interrupt+0xf/0x20
      Feb  7 14:10:18 xxxx kernel: </IRQ>
      Feb  7 14:10:18 xxxx kernel: ? _raw_spin_unlock_irqrestore+0x11/0x20
      Feb  7 14:10:18 xxxx kernel: __wake_up_common_lock+0x89/0xc0
      Feb  7 14:10:18 xxxx kernel: mdc_close+0x2ba/0x970 [mdc]
      Feb  7 14:10:18 xxxx kernel: lmv_close+0x11d/0x2c0 [lmv]
      Feb  7 14:10:18 xxxx kernel: ll_close_inode_openhandle+0x361/0xe20 [lustre]
      Feb  7 14:10:18 xxxx kernel: ll_release_openhandle+0x2f8/0x400 [lustre]
      Feb  7 14:10:18 xxxx kernel: ll_file_open+0x6c0/0xd40 [lustre]
      Feb  7 14:10:18 xxxx kernel: ? ll_intent_file_open+0x960/0x960 [lustre]
      Feb  7 14:10:18 xxxx kernel: do_dentry_open+0x143/0x3a0
      Feb  7 14:10:18 xxxx kernel: path_openat+0x55b/0x1580
      Feb  7 14:10:18 xxxx kernel: ? filemap_map_pages+0x271/0x410
      Feb  7 14:10:18 xxxx kernel: ? alloc_set_pte+0xb8/0x3e0
      Feb  7 14:10:18 xxxx kernel: do_filp_open+0x93/0x100
      Feb  7 14:10:18 xxxx kernel: ? getname_flags+0x4a/0x1e0
      Feb  7 14:10:18 xxxx kernel: ? __check_object_size+0xac/0x173
      Feb  7 14:10:18 xxxx kernel: ? __alloc_fd+0x44/0x150
      Feb  7 14:10:18 xxxx kernel: do_sys_openat2+0x211/0x2b0
      Feb  7 14:10:18 xxxx kernel: do_sys_open+0x4b/0x80
      Feb  7 14:10:18 xxxx kernel: do_syscall_64+0x5b/0x1b0
      Feb  7 14:10:18 xxxx kernel: entry_SYSCALL_64_after_hwframe+0x61/0xc6
      Feb  7 14:10:18 xxxx kernel: RIP: 0033:0x7f3be10e72a6
      Feb  7 14:10:18 xxxx kernel: Code: 89 54 24 08 e8 9b f4 ff ff 8b 74 24 0c 48 8b 3c 24 41 89 c0 44 8b 54 24 08 b8 01 01 00 00 89 f2 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 30 44 89 c7 89 44 24 08 e8 c6 f4 ff ff 8b 44
      Feb  7 14:10:18 xxxx kernel: RSP: 002b:00007ffdae754ef0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
      Feb  7 14:10:18 xxxx kernel: RAX: ffffffffffffffda RBX: 0000000000080002 RCX: 00007f3be10e72a6
      Feb  7 14:10:18 xxxx kernel: RDX: 0000000000080002 RSI: 0000000009dc3b50 RDI: 00000000ffffff9c
      Feb  7 14:10:18 xxxx kernel: RBP: 0000000009dc3b50 R08: 0000000000000000 R09: 000000000942206c
      Feb  7 14:10:18 xxxx kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffdae755120
      Feb  7 14:10:18 xxxx kernel: R13: 00007ffdae755650 R14: 0000000000080000 R15: 0000000000000000
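
As a reading aid for the user-space half of the trace above, the following sketch decodes the syscall number and arguments from the register dump. It assumes the standard x86_64 Linux syscall convention (number in ORIG_RAX, arguments in RDI, RSI, RDX) and hard-codes the flag values from the generic Linux fcntl headers. The helper names are illustrative only and are not part of Lustre or the kernel.

```python
# Values copied from the user-space register dump in the trace above.
ORIG_RAX = 0x101        # syscall number
RDI = 0xFFFFFF9C        # first argument (dirfd), as logged
RDX = 0x80002           # third argument (open flags)

# Assumed constants for x86_64 Linux (not taken from the report itself).
SYS_OPENAT = 257
AT_FDCWD = -100
ACCESS_MODES = {0o0: "O_RDONLY", 0o1: "O_WRONLY", 0o2: "O_RDWR"}
FLAG_BITS = {
    0o100: "O_CREAT",
    0o200: "O_EXCL",
    0o1000: "O_TRUNC",
    0o2000: "O_APPEND",
    0o4000: "O_NONBLOCK",
    0o200000: "O_DIRECTORY",
    0o2000000: "O_CLOEXEC",
}


def to_signed32(value):
    """Interpret the low 32 bits of a register value as a signed int."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def decode_open_flags(flags):
    """Return the symbolic names of the open(2) flags set in `flags`."""
    names = [ACCESS_MODES.get(flags & 0o3, "unknown access mode")]
    for bit, name in sorted(FLAG_BITS.items()):
        if flags & bit:
            names.append(name)
    return names


if __name__ == "__main__":
    print("syscall is openat:", ORIG_RAX == SYS_OPENAT)
    print("dirfd is AT_FDCWD:", to_signed32(RDI) == AT_FDCWD)
    print("flags:", "|".join(decode_open_flags(RDX)))
```

Under these assumptions the dump corresponds to an openat() call relative to the current working directory with O_RDWR|O_CLOEXEC, which is consistent with the do_sys_open and ll_file_open frames in the kernel stack. RSI holds the pathname pointer, so the file name itself cannot be recovered from the log.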
      

      Any ideas, please?

            People

              Assignee: Shaun Tancheff (stancheff)
              Reporter: Mark Dixon (bodgerer)