Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0, Lustre 2.15.5
    • Affects Version/s: Lustre 2.15.4
    • Labels: None
    • Environment: Rocky 8.9 client:
      - Lustre 2.15.4
      - Kernel 4.18.0-513.11.1.el8_9.x86_64

      vs. Lustre 2.12.6 server
    • Severity: 3

    Description

      Hi,

      We have a Rocky 8.9 / Lustre 2.15.4 client which has trouble running a particular large single-node MPI application, when its input/output files are stored on a Lustre 2.12.6 filesystem. We didn't see this when the client was running Rocky 8.8 / Lustre 2.12.9.

      The application hangs shortly after startup for a while, eventually terminating with an error. The application messages imply a failure during a Fortran OPEN or READ statement.

      I see multiple messages such as the following in the client syslog:-

      Feb  7 14:10:18 xxxx kernel: watchdog: BUG: soft lockup - CPU#93 stuck for 22s! [vasp_std:1029118]
      Feb  7 14:10:18 xxxx kernel: Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) lov(OE) fld(OE) osc(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) libcfs(OE) 8021q garp mrp stp llc rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad intel_rapl_msr intel_rapl_common rdma_cm ib_ipoib iw_cm xfs amd64_edac_mod ib_cm edac_mce_amd amd_energy libcrc32c ipmi_ssif kvm dell_smbios wmi_bmof dell_wmi_descriptor irqbypass crct10dif_pclmul crc32_pclmul dcdbas ghash_clmulni_intel mlx5_ib rapl pcspkr ib_uverbs ib_core ccp sp5100_tco k10temp acpi_ipmi i2c_piix4 ipmi_si ptdma ipmi_devintf wmi ipmi_msghandler acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sd_mod t10_pi sg mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_shmem_helper mlx5_core ahci drm crc32c_intel libahci libata mlxfw pci_hyperv_intf tls tg3 psample dm_mirror dm_region_hash
      Feb  7 14:10:18 xxxx kernel: dm_log dm_mod fuse
      Feb  7 14:10:18 xxxx kernel: CPU: 93 PID: 1029118 Comm: vasp_std Kdump: loaded Tainted: G           OEL   --------- -  - 4.18.0-513.11.1.el8_9.x86_64 #1
      Feb  7 14:10:18 xxxx kernel: Hardware name: Dell Inc. PowerEdge C6525/0978PJ, BIOS 2.12.4 07/26/2023
      Feb  7 14:10:18 xxxx kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
      Feb  7 14:10:18 xxxx kernel: Code: c0 e9 33 09 00 00 b8 01 00 00 00 e9 29 09 00 00 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 c6 07 00 0f 1f 40 00 48 89 f7 57 9d <0f> 1f 44 00 00 e9 05 09 00 00 0f 1f 44 00 00 0f 1f 44 00 00 8b 07
      Feb  7 14:10:18 xxxx kernel: RSP: 0018:ffffb26cec28fa70 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
      Feb  7 14:10:18 xxxx kernel: RAX: 00000000feec4859 RBX: ffffa0efcc96fa60 RCX: dead000000000200
      Feb  7 14:10:18 xxxx kernel: RDX: ffffb26cee9d37f8 RSI: 0000000000000202 RDI: 0000000000000202
      Feb  7 14:10:18 xxxx kernel: RBP: 00000000feec4859 R08: ffffb26cee9d37f8 R09: 0000000000032940
      Feb  7 14:10:18 xxxx kernel: R10: 000013cd01a8e0f8 R11: 0000000000000002 R12: 0000000000000202
      Feb  7 14:10:18 xxxx kernel: R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
      Feb  7 14:10:18 xxxx kernel: FS:  00007f3be4341940(0000) GS:ffffa106dfd40000(0000) knlGS:0000000000000000
      Feb  7 14:10:18 xxxx kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Feb  7 14:10:18 xxxx kernel: CR2: 00000000005c2183 CR3: 00000028eddec000 CR4: 0000000000350ee0
      Feb  7 14:10:18 xxxx kernel: Call Trace:
      Feb  7 14:10:18 xxxx kernel: <IRQ>
      Feb  7 14:10:18 xxxx kernel: ? watchdog_timer_fn.cold.10+0x46/0x9e
      Feb  7 14:10:18 xxxx kernel: ? watchdog+0x30/0x30
      Feb  7 14:10:18 xxxx kernel: ? __hrtimer_run_queues+0x101/0x280
      Feb  7 14:10:18 xxxx kernel: ? hrtimer_interrupt+0x100/0x220
      Feb  7 14:10:18 xxxx kernel: ? sched_clock+0x5/0x10
      Feb  7 14:10:18 xxxx kernel: ? smp_apic_timer_interrupt+0x6a/0x130
      Feb  7 14:10:18 xxxx kernel: ? apic_timer_interrupt+0xf/0x20
      Feb  7 14:10:18 xxxx kernel: </IRQ>
      Feb  7 14:10:18 xxxx kernel: ? _raw_spin_unlock_irqrestore+0x11/0x20
      Feb  7 14:10:18 xxxx kernel: __wake_up_common_lock+0x89/0xc0
      Feb  7 14:10:18 xxxx kernel: mdc_close+0x2ba/0x970 [mdc]
      Feb  7 14:10:18 xxxx kernel: lmv_close+0x11d/0x2c0 [lmv]
      Feb  7 14:10:18 xxxx kernel: ll_close_inode_openhandle+0x361/0xe20 [lustre]
      Feb  7 14:10:18 xxxx kernel: ll_release_openhandle+0x2f8/0x400 [lustre]
      Feb  7 14:10:18 xxxx kernel: ll_file_open+0x6c0/0xd40 [lustre]
      Feb  7 14:10:18 xxxx kernel: ? ll_intent_file_open+0x960/0x960 [lustre]
      Feb  7 14:10:18 xxxx kernel: do_dentry_open+0x143/0x3a0
      Feb  7 14:10:18 xxxx kernel: path_openat+0x55b/0x1580
      Feb  7 14:10:18 xxxx kernel: ? filemap_map_pages+0x271/0x410
      Feb  7 14:10:18 xxxx kernel: ? alloc_set_pte+0xb8/0x3e0
      Feb  7 14:10:18 xxxx kernel: do_filp_open+0x93/0x100
      Feb  7 14:10:18 xxxx kernel: ? getname_flags+0x4a/0x1e0
      Feb  7 14:10:18 xxxx kernel: ? __check_object_size+0xac/0x173
      Feb  7 14:10:18 xxxx kernel: ? __alloc_fd+0x44/0x150
      Feb  7 14:10:18 xxxx kernel: do_sys_openat2+0x211/0x2b0
      Feb  7 14:10:18 xxxx kernel: do_sys_open+0x4b/0x80
      Feb  7 14:10:18 xxxx kernel: do_syscall_64+0x5b/0x1b0
      Feb  7 14:10:18 xxxx kernel: entry_SYSCALL_64_after_hwframe+0x61/0xc6
      Feb  7 14:10:18 xxxx kernel: RIP: 0033:0x7f3be10e72a6
      Feb  7 14:10:18 xxxx kernel: Code: 89 54 24 08 e8 9b f4 ff ff 8b 74 24 0c 48 8b 3c 24 41 89 c0 44 8b 54 24 08 b8 01 01 00 00 89 f2 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 30 44 89 c7 89 44 24 08 e8 c6 f4 ff ff 8b 44
      Feb  7 14:10:18 xxxx kernel: RSP: 002b:00007ffdae754ef0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
      Feb  7 14:10:18 xxxx kernel: RAX: ffffffffffffffda RBX: 0000000000080002 RCX: 00007f3be10e72a6
      Feb  7 14:10:18 xxxx kernel: RDX: 0000000000080002 RSI: 0000000009dc3b50 RDI: 00000000ffffff9c
      Feb  7 14:10:18 xxxx kernel: RBP: 0000000009dc3b50 R08: 0000000000000000 R09: 000000000942206c
      Feb  7 14:10:18 xxxx kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffdae755120
      Feb  7 14:10:18 xxxx kernel: R13: 00007ffdae755650 R14: 0000000000080000 R15: 0000000000000000
      

      Any ideas, please?
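For triaging reports like the trace above, a minimal Python sketch (not part of the original report) can pull out each soft-lockup event and the stack frames that name a kernel module. The function name `summarize_lockups` and both regular expressions are illustrative assumptions based only on the line format shown in this log. Frames prefixed with `?` are skipped because the kernel marks them as unreliable.

```python
import re

# Matches the watchdog line, e.g.
# "watchdog: BUG: soft lockup - CPU#93 stuck for 22s! [vasp_std:1029118]"
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)
# Matches a reliable stack frame that names a kernel module, e.g.
# "kernel: mdc_close+0x2ba/0x970 [mdc]". Frames starting with "? " do not match.
FRAME_RE = re.compile(
    r"kernel: (?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(?P<module>\w+)\]"
)


def summarize_lockups(lines):
    """Return one dict per soft-lockup report, with its (module, function) frames."""
    reports = []
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            reports.append({
                "cpu": int(m["cpu"]),
                "secs": int(m["secs"]),
                "comm": m["comm"],
                "pid": int(m["pid"]),
                "frames": [],
            })
            continue
        f = FRAME_RE.search(line)
        if f and reports:
            # Attach the frame to the most recent lockup report.
            reports[-1]["frames"].append((f["module"], f["func"]))
    return reports


# A few lines copied from the trace in this report.
sample = [
    "Feb  7 14:10:18 xxxx kernel: watchdog: BUG: soft lockup - CPU#93 stuck for 22s! [vasp_std:1029118]",
    "Feb  7 14:10:18 xxxx kernel: __wake_up_common_lock+0x89/0xc0",
    "Feb  7 14:10:18 xxxx kernel: mdc_close+0x2ba/0x970 [mdc]",
    "Feb  7 14:10:18 xxxx kernel: lmv_close+0x11d/0x2c0 [lmv]",
    "Feb  7 14:10:18 xxxx kernel: ll_file_open+0x6c0/0xd40 [lustre]",
    "Feb  7 14:10:18 xxxx kernel: ? ll_intent_file_open+0x960/0x960 [lustre]",
]
reports = summarize_lockups(sample)
```

Run over the full trace, this would list `mdc_close`, `lmv_close`, `ll_close_inode_openhandle`, `ll_release_openhandle` and `ll_file_open` as the module frames of the stuck `vasp_std` task.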

          Activity

            [LU-17510] Client hung on ll_file_open

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/54689/
            Subject: LU-17510 obdclass: fix wake up when queuing close request.
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 1732704711488d2d233f0b8e5bc9814f443405c6


            "Xing Huang <hxing@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54689
            Subject: LU-17510 obdclass: fix wake up when queuing close request.
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 60b4d29af760549b2b543bee1f4f54538757d388

            pjones Peter Jones added a comment -

            Landed for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/54259/
            Subject: LU-17510 obdclass: fix wake up when queuing close request.
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 7a2296a397381a5f6f9473b297f0062e8ff15948

            bodgerer Mark Dixon added a comment -

            Thanks Andreas, much appreciated.


            adilger Andreas Dilger added a comment -

            These patches do not affect data correctness per se, so at worst you could get hung client threads (which you are already seeing), and I don't think there is a high risk in applying them to your tree.
            bodgerer Mark Dixon added a comment -

            Thanks all!

            I'm trying to figure out what to do now and was hoping someone could offer advice. Am I playing with fire if I cherry pick onto 2.15.4? Or do I need to wait for a release?

            This set appears to apply cleanly, apart from a tweak to LU-15947 because LU-16231 had already landed:

            5243630b09 LU-15947 obdclass: improve precision of wakeups for mod_rpcs
            91a3726f31 LU-16633 obdclass: fix rpc slot leakage
            b5fde4d6c0 LU-17197 obdclass: preserve fairness when waiting for rpc slot
            (plus Neil's patch for LU-17510)
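
The cherry-pick sequence described above could be planned with a short Python sketch like the following. It only builds the git commands and does not run them. The base ref `2.15.4`, the branch name, and the use of the LU-17510 commit merged to master (7a2296a397381a5f6f9473b297f0062e8ff15948) as the final pick are assumptions for illustration; `cherry_pick_plan` is a hypothetical helper, not part of any Lustre tooling.

```python
def cherry_pick_plan(base_ref, branch, commits):
    """Build the git commands that create `branch` at `base_ref` and apply `commits` in order."""
    plan = [["git", "checkout", "-b", branch, base_ref]]
    # -x appends the original commit hash to each new commit message,
    # which keeps the backport traceable to the upstream change.
    plan += [["git", "cherry-pick", "-x", commit] for commit in commits]
    return plan


# Commits in the order listed in the comment, followed by the LU-17510 fix.
commits = [
    "5243630b09",  # LU-15947 obdclass: improve precision of wakeups for mod_rpcs
    "91a3726f31",  # LU-16633 obdclass: fix rpc slot leakage
    "b5fde4d6c0",  # LU-17197 obdclass: preserve fairness when waiting for rpc slot
    "7a2296a397381a5f6f9473b297f0062e8ff15948",  # LU-17510 (assumed: master commit)
]

# "2.15.4" and "lu17510-backport" are placeholder names; adjust to the local tree.
plan = cherry_pick_plan("2.15.4", "lu17510-backport", commits)
for cmd in plan:
    print(" ".join(cmd))
```

Since the LU-15947 pick needed a manual tweak, `git cherry-pick` would be expected to stop at that commit for conflict resolution before the remaining picks can continue.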
            

            stancheff Shaun Tancheff added a comment -

            That is good news. I think the preference is to continue with Neil's patch.
            bodgerer Mark Dixon added a comment -

            Apologies, I misread as I'm still getting to grips with Gerrit. In fact, cherry-picking either Shaun's or Neil's patchset on top of 2.15.61 fixes things, allowing the application to start and run normally.


            adilger Andreas Dilger added a comment -

            "I can confirm that adding the above patch ... fixes the problem"

            Could you please clarify which patch you are referring to, now that there are two involved?

            "the client show a couple of other issues on our system"

            They should be filed as separate issues against 2.16.0.

            People

              Assignee: stancheff Shaun Tancheff
              Reporter: bodgerer Mark Dixon