<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:36:01 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-17510] Client hung on ll_file_open</title>
                <link>https://jira.whamcloud.com/browse/LU-17510</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;We have a Rocky 8.9 / Lustre 2.15.4 client which has trouble running a particular large single-node MPI application, when its input/output files are stored on a Lustre 2.12.6 filesystem. We didn&apos;t see this when the client was running Rocky 8.8 / Lustre 2.12.9.&lt;/p&gt;

&lt;p&gt;The application hangs for a while shortly after startup, eventually terminating with an error. The application messages imply a failure during a Fortran OPEN or READ statement.&lt;/p&gt;

&lt;p&gt;I see multiple messages such as the following in the client syslog:-&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Feb &#160;7 14:10:18 xxxx kernel: watchdog: BUG: soft lockup - CPU#93 stuck for 22s! [vasp_std:1029118]
Feb &#160;7 14:10:18 xxxx kernel: Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) lov(OE) fld(OE) osc(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) libcfs(OE) 8021q garp mrp stp llc rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad intel_rapl_msr intel_rapl_common rdma_cm ib_ipoib iw_cm xfs amd64_edac_mod ib_cm edac_mce_amd amd_energy libcrc32c ipmi_ssif kvm dell_smbios wmi_bmof dell_wmi_descriptor irqbypass crct10dif_pclmul crc32_pclmul dcdbas ghash_clmulni_intel mlx5_ib rapl pcspkr ib_uverbs ib_core ccp sp5100_tco k10temp acpi_ipmi i2c_piix4 ipmi_si ptdma ipmi_devintf wmi ipmi_msghandler acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sd_mod t10_pi sg mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_shmem_helper mlx5_core ahci drm crc32c_intel libahci libata mlxfw pci_hyperv_intf tls tg3 psample dm_mirror dm_region_hash
Feb &#160;7 14:10:18 xxxx kernel: dm_log dm_mod fuse
Feb &#160;7 14:10:18 xxxx kernel: CPU: 93 PID: 1029118 Comm: vasp_std Kdump: loaded Tainted: G &#160; &#160; &#160; &#160; &#160; OEL &#160; --------- - &#160;- 4.18.0-513.11.1.el8_9.x86_64 #1
Feb &#160;7 14:10:18 xxxx kernel: Hardware name: Dell Inc. PowerEdge C6525/0978PJ, BIOS 2.12.4 07/26/2023
Feb &#160;7 14:10:18 xxxx kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
Feb &#160;7 14:10:18 xxxx kernel: Code: c0 e9 33 09 00 00 b8 01 00 00 00 e9 29 09 00 00 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 c6 07 00 0f 1f 40 00 48 89 f7 57 9d &amp;lt;0f&amp;gt; 1f 44 00 00 e9 05 09 00 00 0f 1f 44 00 00 0f 1f 44 00 00 8b 07
Feb &#160;7 14:10:18 xxxx kernel: RSP: 0018:ffffb26cec28fa70 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
Feb &#160;7 14:10:18 xxxx kernel: RAX: 00000000feec4859 RBX: ffffa0efcc96fa60 RCX: dead000000000200
Feb &#160;7 14:10:18 xxxx kernel: RDX: ffffb26cee9d37f8 RSI: 0000000000000202 RDI: 0000000000000202
Feb &#160;7 14:10:18 xxxx kernel: RBP: 00000000feec4859 R08: ffffb26cee9d37f8 R09: 0000000000032940
Feb &#160;7 14:10:18 xxxx kernel: R10: 000013cd01a8e0f8 R11: 0000000000000002 R12: 0000000000000202
Feb &#160;7 14:10:18 xxxx kernel: R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
Feb &#160;7 14:10:18 xxxx kernel: FS: &#160;00007f3be4341940(0000) GS:ffffa106dfd40000(0000) knlGS:0000000000000000
Feb &#160;7 14:10:18 xxxx kernel: CS: &#160;0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb &#160;7 14:10:18 xxxx kernel: CR2: 00000000005c2183 CR3: 00000028eddec000 CR4: 0000000000350ee0
Feb &#160;7 14:10:18 xxxx kernel: Call Trace:
Feb &#160;7 14:10:18 xxxx kernel: &amp;lt;IRQ&amp;gt;
Feb &#160;7 14:10:18 xxxx kernel: ? watchdog_timer_fn.cold.10+0x46/0x9e
Feb &#160;7 14:10:18 xxxx kernel: ? watchdog+0x30/0x30
Feb &#160;7 14:10:18 xxxx kernel: ? __hrtimer_run_queues+0x101/0x280
Feb &#160;7 14:10:18 xxxx kernel: ? hrtimer_interrupt+0x100/0x220
Feb &#160;7 14:10:18 xxxx kernel: ? sched_clock+0x5/0x10
Feb &#160;7 14:10:18 xxxx kernel: ? smp_apic_timer_interrupt+0x6a/0x130
Feb &#160;7 14:10:18 xxxx kernel: ? apic_timer_interrupt+0xf/0x20
Feb &#160;7 14:10:18 xxxx kernel: &amp;lt;/IRQ&amp;gt;
Feb &#160;7 14:10:18 xxxx kernel: ? _raw_spin_unlock_irqrestore+0x11/0x20
Feb &#160;7 14:10:18 xxxx kernel: __wake_up_common_lock+0x89/0xc0
Feb &#160;7 14:10:18 xxxx kernel: mdc_close+0x2ba/0x970 [mdc]
Feb &#160;7 14:10:18 xxxx kernel: lmv_close+0x11d/0x2c0 [lmv]
Feb &#160;7 14:10:18 xxxx kernel: ll_close_inode_openhandle+0x361/0xe20 [lustre]
Feb &#160;7 14:10:18 xxxx kernel: ll_release_openhandle+0x2f8/0x400 [lustre]
Feb &#160;7 14:10:18 xxxx kernel: ll_file_open+0x6c0/0xd40 [lustre]
Feb &#160;7 14:10:18 xxxx kernel: ? ll_intent_file_open+0x960/0x960 [lustre]
Feb &#160;7 14:10:18 xxxx kernel: do_dentry_open+0x143/0x3a0
Feb &#160;7 14:10:18 xxxx kernel: path_openat+0x55b/0x1580
Feb &#160;7 14:10:18 xxxx kernel: ? filemap_map_pages+0x271/0x410
Feb &#160;7 14:10:18 xxxx kernel: ? alloc_set_pte+0xb8/0x3e0
Feb &#160;7 14:10:18 xxxx kernel: do_filp_open+0x93/0x100
Feb &#160;7 14:10:18 xxxx kernel: ? getname_flags+0x4a/0x1e0
Feb &#160;7 14:10:18 xxxx kernel: ? __check_object_size+0xac/0x173
Feb &#160;7 14:10:18 xxxx kernel: ? __alloc_fd+0x44/0x150
Feb &#160;7 14:10:18 xxxx kernel: do_sys_openat2+0x211/0x2b0
Feb &#160;7 14:10:18 xxxx kernel: do_sys_open+0x4b/0x80
Feb &#160;7 14:10:18 xxxx kernel: do_syscall_64+0x5b/0x1b0
Feb &#160;7 14:10:18 xxxx kernel: entry_SYSCALL_64_after_hwframe+0x61/0xc6
Feb &#160;7 14:10:18 xxxx kernel: RIP: 0033:0x7f3be10e72a6
Feb &#160;7 14:10:18 xxxx kernel: Code: 89 54 24 08 e8 9b f4 ff ff 8b 74 24 0c 48 8b 3c 24 41 89 c0 44 8b 54 24 08 b8 01 01 00 00 89 f2 48 89 fe bf 9c ff ff ff 0f 05 &amp;lt;48&amp;gt; 3d 00 f0 ff ff 77 30 44 89 c7 89 44 24 08 e8 c6 f4 ff ff 8b 44
Feb &#160;7 14:10:18 xxxx kernel: RSP: 002b:00007ffdae754ef0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
Feb &#160;7 14:10:18 xxxx kernel: RAX: ffffffffffffffda RBX: 0000000000080002 RCX: 00007f3be10e72a6
Feb &#160;7 14:10:18 xxxx kernel: RDX: 0000000000080002 RSI: 0000000009dc3b50 RDI: 00000000ffffff9c
Feb &#160;7 14:10:18 xxxx kernel: RBP: 0000000009dc3b50 R08: 0000000000000000 R09: 000000000942206c
Feb &#160;7 14:10:18 xxxx kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffdae755120
Feb &#160;7 14:10:18 xxxx kernel: R13: 00007ffdae755650 R14: 0000000000080000 R15: 0000000000000000
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Any ideas, please?&lt;/p&gt;</description>
                <environment>Rocky 8.9 client:&lt;br/&gt;
- Lustre 2.15.4&lt;br/&gt;
- Kernel 4.18.0-513.11.1.el8_9.x86_64&lt;br/&gt;
&lt;br/&gt;
vs. Lustre 2.12.6 server&lt;br/&gt;
</environment>
        <key id="80696">LU-17510</key>
            <summary>Client hung on ll_file_open</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="bodgerer">Mark Dixon</reporter>
                        <labels>
                    </labels>
                <created>Wed, 7 Feb 2024 14:52:56 +0000</created>
                <updated>Fri, 9 Feb 2024 09:22:19 +0000</updated>
                                            <version>Lustre 2.15.4</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="403071" author="adilger" created="Wed, 7 Feb 2024 19:53:02 +0000"  >&lt;p&gt;The stack shows the client trying to do a file close &lt;b&gt;while it is still doing the open&lt;/b&gt;, so presumably something is going wrong with opening the file. It looks stuck on a spinlock, or possibly on the MDS_CLOSE RPC; it is hard to tell for sure.&lt;/p&gt;

&lt;p&gt;Two suggestions for debugging this (if you can reproduce it easily). First, try a few different Lustre versions on the client:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;2.15.0 (in case the problematic change was backported recently)&lt;/li&gt;
	&lt;li&gt;current master development 2.15.60 (in case it was already fixed)&lt;/li&gt;
	&lt;li&gt;2.14.0 (as an intermediate between 2.12 and 2.15)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Second, once you have identified a good/bad version boundary, start bisecting between patches on the client to identify which patch introduced the problem. Sorry, but I can&apos;t really be of more help right now.&lt;/p&gt;
</comment>
                            <comment id="403355" author="bodgerer" created="Fri, 9 Feb 2024 09:22:19 +0000"  >&lt;p&gt;Thanks for the advice Andreas, I&apos;ll take a look&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                                        </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i04amv:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>