<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:07:27 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-14171] Lock timed out &amp; hung clients</title>
                <link>https://jira.whamcloud.com/browse/LU-14171</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hi folks,&lt;/p&gt;

&lt;p&gt;We seem to be hitting a lock timeout issue on parts of our 2.12.5 filesystems that results in some clients being hung or evicted and requiring a reboot.&lt;/p&gt;

&lt;p&gt;What we&apos;re seeing are entries like this:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Nov 30 10:53:51 warble2 kernel: LustreError: 42898:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1606693731, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-dagg-MDT0000_UUID lock: ffff8ec1cc05a400/0xe4be9cdd1627e166 lrc: 3/1,0 mode: --/PR res: [0x200054b1e:0xfc06:0x0].0x0 bits 0x13/0x48 rrc: 72 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 42898 timeout: 0 lvb_type: 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;At the time of first investigating, the FID resolved to a file that was indeed not accessible:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@farnarkle1 ~]# lfs fid2path /fred 0x200054b1e:0xfc06:0x0
/fred/oz002/bgoncharov/ppta_data_analysis/Datasets/j0437_pdfb234_caspsr_20200928/chains_i6_g10/B_40CM/J0437-4715/chains/B_40CM.properties.ini
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;Running ls on this file hung and resulted in:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Nov 30 11:32:47 farnarkle1 kernel: Lustre: 94436:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1606695766/real 1606695766]  req@ffff88ba05c35100 x1684505381509824/t0(0) o101-&amp;gt;dagg-MDT0000-mdc-ffff88b8f27e7000@192.168.33.22@o2ib33:12/10 lens 3584/960 e 23 to 1 dl 1606696367 ref 2 fl Rpc:IX/0/ffffffff rc 0/-1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;



&lt;p&gt;This file did not show up as being open on any export; the following grep returned nothing:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[warble2]root: grep 0x200054b1e:0xfc06:0x0 /proc/fs/lustre/mdt/*/exports/*/open_files
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;So far there is one particular workflow that seems to trigger this. Subsequent investigation shows that unmounting the MDTs and remounting makes the file/dir accessible again.&lt;/p&gt;
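
&lt;p&gt;For reference, the remount workaround is roughly the following (dataset and mountpoint names here are illustrative, not our exact ones):&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# illustrative recovery sequence; substitute the real MDT dataset and mountpoint
umount /lustre/dagg-mdt0                            # unmount the stuck MDT
lustre_rmmod                                        # unload Lustre modules to drop stale lock state
mount -t lustre mdt0pool/mdt0 /lustre/dagg-mdt0     # remount; clients reconnect after recovery
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;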

&lt;p&gt;What steps would you like us to perform to provide additional information to you?&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br/&gt;
Simon&lt;/p&gt;</description>
        <environment>CentOS 7.8, ZFS 0.8.5, Lustre 2.12.5</environment>
        <key id="61833">LU-14171</key>
        <summary>Lock timed out &amp; hung clients</summary>
        <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
        <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
        <statusCategory id="2" key="new" colorName="default"/>
        <resolution id="-1">Unresolved</resolution>
        <assignee username="pjones">Peter Jones</assignee>
        <reporter username="scadmin">SC Admin</reporter>
        <labels></labels>
        <created>Wed, 2 Dec 2020 01:24:58 +0000</created>
        <updated>Fri, 4 Dec 2020 18:44:23 +0000</updated>
        <version>Lustre 2.12.5</version>
        <due></due>
        <votes>0</votes>
        <watches>6</watches>
        <comments>
                            <comment id="286468" author="eaujames" created="Wed, 2 Dec 2020 12:17:23 +0000"  >&lt;p&gt;Hello,&lt;/p&gt;

&lt;p&gt;Maybe &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14031&quot; title=&quot;long time between reconnects&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14031&quot;&gt;&lt;del&gt;LU-14031&lt;/del&gt;&lt;/a&gt; could help you?&lt;/p&gt;</comment>
                            <comment id="286563" author="scadmin" created="Thu, 3 Dec 2020 03:32:27 +0000"  >&lt;p&gt;Hi Etienne,&lt;/p&gt;

&lt;p&gt;thanks for looking at the ticket.&lt;/p&gt;

&lt;p&gt;I&apos;ve attached the ~2 days of ldlm_request looping that we saw on the hung FID.&lt;/p&gt;

&lt;p&gt;After that the MDS stopped responding to all requests and the fs was hung; we were forced to umount, lustre_rmmod, and mount the MDT to recover. There weren&apos;t any ldlm_lockd.c::expired_lock_main messages in those 2 days.&lt;/p&gt;

&lt;p&gt;On the client side, our main symptom was D state processes on several clients, all associated with that single FID. When we rebooted those clients, it didn&apos;t affect the hung processes on other nodes.&lt;/p&gt;

&lt;p&gt;The MDT totally failed at approx Dec 1 14:50:49;&lt;br/&gt;
MDTs were &quot;umount complete&quot; at Dec 1 15:01:19.&lt;/p&gt;

&lt;p&gt;After the total failure we did see a couple of ldlm_lockd.c:256:expired_lock_main messages, but not during the preceding 2 days.&lt;br/&gt;
Greps for &quot;ldlm_request.c|ldlm_lockd&quot; at the end of the event are below:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;...
/var/log/messages-20201202.gz:Dec  1 14:19:32 warble2 kernel: LustreError: 42988:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1606792472, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-dagg-MDT0000_UUID lock: ffff8ec603688b40/0xe4be9cdddb855e96 lrc: 3/1,0 mode: --/PR res: [0x200054b1e:0xfc06:0x0].0x0 bits 0x13/0x48 rrc: 224 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 42988 timeout: 0 lvb_type: 0
/var/log/messages-20201202.gz:Dec  1 14:43:41 warble2 kernel: LustreError: 43125:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1606793921, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-dagg-MDT0000_UUID lock: ffff8f1e6b402d00/0xe4be9cdde0306879 lrc: 3/1,0 mode: --/PR res: [0x200054b1e:0xfc06:0x0].0x0 bits 0x13/0x48 rrc: 228 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 43125 timeout: 0 lvb_type: 0
/var/log/messages-20201202.gz:Dec  1 14:43:41 warble2 kernel: LustreError: 43125:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) Skipped 1 previous similar message
/var/log/messages-20201202.gz:Dec  1 14:45:48 warble2 kernel: LustreError: 43123:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1606794048, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-dagg-MDT0000_UUID lock: ffff8f1b19948900/0xe4be9cdde086763d lrc: 3/1,0 mode: --/PR res: [0x200054b1e:0xfc06:0x0].0x0 bits 0x13/0x48 rrc: 228 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 43123 timeout: 0 lvb_type: 0
/var/log/messages-20201202.gz:Dec  1 14:52:30 arkle8 kernel: LustreError: 54274:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 192.168.44.110@o2ib44  ns: filter-dagg-OST000e_UUID lock: ffff8f1d76277bc0/0xc9ea3a0ff004f28e lrc: 3/0,0 mode: PW/PW res: [0xc7844cc:0x0:0x0].0x0 rrc: 4 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x60000400020020 nid: 192.168.44.110@o2ib44 remote: 0x6745004c01094c2f expref: 269 pid: 24386 timeout: 226556 lvb_type: 0
/var/log/messages-20201202.gz:Dec  1 14:52:33 arkle8 kernel: LustreError: 163439:0:(ldlm_lockd.c:2324:ldlm_cancel_handler()) ldlm_cancel from 192.168.44.110@o2ib44 arrived at 1606794753 with bad export cookie 14549505386225834195
/var/log/messages-20201202.gz:Dec  1 15:01:18 warble2 kernel: LustreError: 43038:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8ec9ee5d6800 ns: mdt-dagg-MDT0000_UUID lock: ffff8ec41c8e1440/0xe4be9cdcf8e07aae lrc: 3/0,0 mode: PR/PR res: [0x200054b1e:0xfc06:0x0].0x0 bits 0x1b/0x0 rrc: 227 type: IBT flags: 0x50200400000020 nid: 192.168.44.137@o2ib44 remote: 0xfc2c8df7a2d04940 expref: 7 pid: 43038 timeout: 0 lvb_type: 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
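
&lt;p&gt;(For reference, the grep itself was along these lines; exact invocation paraphrased:)&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# paraphrased; zgrep reads the rotated .gz logs directly and prefixes
# each match with its source file, as in the output above
zgrep -E &apos;ldlm_request.c|ldlm_lockd&apos; /var/log/messages-*.gz
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;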

&lt;p&gt;Does &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14031&quot; title=&quot;long time between reconnects&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14031&quot;&gt;&lt;del&gt;LU-14031&lt;/del&gt;&lt;/a&gt; match those symptoms? TBH it doesn&apos;t really look like it to me.&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="286652" author="green" created="Thu, 3 Dec 2020 18:38:55 +0000"  >&lt;p&gt;I don&apos;t think &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14031&quot; title=&quot;long time between reconnects&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14031&quot;&gt;&lt;del&gt;LU-14031&lt;/del&gt;&lt;/a&gt; matches.&lt;/p&gt;

&lt;p&gt;This needs a collection of debug logs (with elevated debug level) from Lustre on both a client(s) and servers affected.&lt;/p&gt;

&lt;p&gt;Something like this: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13500?focusedCommentId=269199&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-269199&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-13500?focusedCommentId=269199&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-269199&lt;/a&gt;&lt;/p&gt;
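
&lt;p&gt;In outline (an illustrative sketch only; the exact debug masks and procedure are in the comment linked above):&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# illustrative debug-capture sketch, run on the affected client(s) and servers
lctl set_param debug=-1          # enable all debug flags (very verbose)
lctl set_param debug_mb=1024     # enlarge the kernel debug buffer
lctl clear                       # drop stale entries from the buffer
# ... reproduce the hang ...
lctl dk &amp;gt; /tmp/lustre-debug.$(hostname).log   # dump the debug buffer to a file
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;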

&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13500&quot; title=&quot;Client gets evicted - nfsd non-standard errorno -108&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13500&quot;&gt;LU-13500&lt;/a&gt; is a good example ticket to read through to see how the debug info is gathered and what it needs to have.&lt;/p&gt;</comment>
        </comments>
        <attachments>
            <attachment id="36825" name="ldlm_request.log.txt" size="94790" author="scadmin" created="Thu, 3 Dec 2020 03:09:51 +0000"/>
        </attachments>
        <subtasks>
        </subtasks>
        <customfields>
            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                <customfieldname>Development</customfieldname>
                <customfieldvalues>
                </customfieldvalues>
            </customfield>
            <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                <customfieldname>Rank</customfieldname>
                <customfieldvalues>
                    <customfieldvalue>1|i01g6n:</customfieldvalue>
                </customfieldvalues>
            </customfield>
            <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                <customfieldname>Rank (Obsolete)</customfieldname>
                <customfieldvalues>
                    <customfieldvalue>9223372036854775807</customfieldvalue>
                </customfieldvalues>
            </customfield>
            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                <customfieldname>Severity</customfieldname>
                <customfieldvalues>
                    <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>
                </customfieldvalues>
            </customfield>
        </customfields>
    </item>
</channel>
</rss>