<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:44:16 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
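<!--
For example, a hypothetical request for only the key and summary of this issue,
assuming JIRA's standard XML issue-view URL pattern:

    curl "https://jira.whamcloud.com/si/jira.issueviews:issue-xml/LU-4607/LU-4607.xml?field=key&field=summary"
-->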
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4607] OSS servers crashing with error: (ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired</title>
                <link>https://jira.whamcloud.com/browse/LU-4607</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;These messages appear every few hours on the OSS nodes:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;oss6 kernel: : LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 117s: evicting client at 192.168.224.14@o2ib  ns: filter-scratch-OST000b_UUID lock: ffff8804a321f000/0xaa2e9b983dbd2233 lrc: 3/0,0 mode: PW/PW res: [0x4a3a76:0x0:0x0].0 rrc: 2 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;4095) flags: 0x20 nid: 192.168.224.14@o2ib remote: 0x55c758a593d4fc6 expref: 24 pid: 12551 timeout: 5161946610 lvb_type: 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On the client:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;pod24b14 kernel: : LustreError: 11-0: scratch-OST000b-osc-ffff880312dff800: Communicating with 192.168.254.36@o2ib, operation obd_ping failed with -107.
pod24b14 kernel: : Lustre: scratch-OST000b-osc-ffff880312dff800: Connection to scratch-OST000b (at 192.168.254.36@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Feb  3 12:21:45 pod24b14 kernel: : Lustre: Skipped 1 previous
pod24b14 kernel: : LustreError: 167-0: scratch-OST000b-osc-ffff880312dff800: This client was evicted by scratch-OST000b; in progress operations using this service will fail.
pod24b14 kernel: : Lustre: 6039:0:(llite_lib.c:2506:ll_dirty_page_discard_warn()) scratch: dirty page discard: 192.168.254.41@o2ib:192.168.254.42@o2ib:/scratch/fid: [0x2000020a6:0x130dc:0x0]/ may get corrupted (rc -108)
pod24b14 kernel: : LustreError: 16480:0:(vvp_io.c:1088:vvp_io_commit_write()) Write page 0 of inode ffff880476e1a638 failed -108
pod24b14 kernel: : LustreError: 16516:0:(osc_lock.c:817:osc_ldlm_completion_ast()) lock@ffff8806063297b8[2 3 0 1 1 00000000] W(2):[0, 18446744073709551615]@[0x1000b0000:0x4a3a76:0x0] {
pod24b14 kernel: : LustreError: 16516:0:(osc_lock.c:817:osc_ldlm_completion_ast())   lovsub@ffff8804c673e620: [0 ffff880471ca03a0 W(2):[0, 18446744073709551615]@[0x2000020a6:0x14f4a:0x0]]
pod24b14 kernel: : LustreError: 16516:0:(osc_lock.c:817:osc_ldlm_completion_ast())   osc@ffff8804901eaf00: ffff8805596246c0    0x20040000001 0x55c758a593d4fc6 3 ffff88044b08cc70 size: 0 mtime: 0 atime: 0 ctime: 0 blocks: 0
pod24b14 kernel: : LustreError: 16516:0:(osc_lock.c:817:osc_ldlm_completion_ast()) } lock@ffff8806063297b8
pod24b14 kernel: : LustreError: 16516:0:(osc_lock.c:817:osc_ldlm_completion_ast()) dlmlock returned -5
pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1420:cl_unuse_try()) result = -5, this is unlikely!
pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1435:cl_unuse_locked()) lock@ffff880606329978[1 0 0 1 0 00000000] W(2):[0, 18446744073709551615]@[0x2000020a6:0x14f4a:0x0] {
pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1435:cl_unuse_locked())   vvp@ffff8805f0daa678:
pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1435:cl_unuse_locked())   lov@ffff880471ca03a0: 1
pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1435:cl_unuse_locked())   0 0: ---
pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1435:cl_unuse_locked())
pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1435:cl_unuse_locked()) } lock@ffff880606329978
pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1435:cl_unuse_locked()) unuse return -5
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment>mds1 (MGS + MDS for home)&lt;br/&gt;
mds2 (MDS for scratch)&lt;br/&gt;
6 OSS servers serving 6 volumes for home and 24 volumes for scratch&lt;br/&gt;
&lt;br/&gt;
mds1.ibb@o2ib0:mds2.ibb@o2ib0:/home      86T   42T   43T  50% /global/home&lt;br/&gt;
mds1.ibb@o2ib0:mds2.ibb@o2ib0:/scratch  342T   89T  250T  27% /global/scratch&lt;br/&gt;
&lt;br/&gt;
All client mounts:&lt;br/&gt;
&lt;br/&gt;
# Lustre home and scratch FS from ahab1:ahab2&lt;br/&gt;
mds1.ibb@o2ib0:mds2.ibb@o2ib0:/home     /global/home    lustre rw,user_xattr,localflock,_netdev 0 0&lt;br/&gt;
&lt;br/&gt;
mds1.ibb@o2ib0:mds2.ibb@o2ib0:/scratch  /global/scratch lustre rw,user_xattr,localflock,_netdev 0 0</environment>
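        <!--
        A minimal sketch of mounting one of these filesystems by hand, derived
        from the fstab lines above (failover MGS/MDS NIDs separated by ':',
        options as listed; paths are the site's own):

            mount -t lustre \
                mds1.ibb@o2ib0:mds2.ibb@o2ib0:/scratch /global/scratch \
                -o rw,user_xattr,localflock,_netdev
        -->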
        <key id="23086">LU-4607</key>
            <summary>OSS servers crashing with error: (ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="6">Not a Bug</resolution>
                                        <assignee username="cliffw">Cliff White</assignee>
                                    <reporter username="orentas">Oz Rentas</reporter>
                        <labels>
                    </labels>
                <created>Mon, 10 Feb 2014 22:49:33 +0000</created>
                <updated>Thu, 13 Feb 2014 20:24:29 +0000</updated>
                            <resolved>Thu, 13 Feb 2014 20:24:29 +0000</resolved>
                                    <version>Lustre 2.4.1</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="76674" author="cliffw" created="Mon, 10 Feb 2014 23:12:01 +0000"  >&lt;p&gt;These evictions are usually a sign of either a busy server, or a busy/bad network. The clients should recover, perhaps with an error. What does the load look like on the server side? Is there any indication of network error/issue? &lt;/p&gt;</comment>
                            <comment id="76675" author="cliffw" created="Mon, 10 Feb 2014 23:13:56 +0000"  >&lt;p&gt;If this is only happening on a few clients every few hours, you may be able to isolate it to a specific task, again best to monitor server load over time (vmstat, top, etc).&lt;/p&gt;</comment>
                            <comment id="76698" author="adilger" created="Tue, 11 Feb 2014 07:30:19 +0000"  >&lt;p&gt;For future reference, the binary debug logs that are dumped by the kernel need to be post-processed before we can use them.  You need to run:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lctl debug_file /tmp/lustre-log &amp;gt; /tmp/lustre-log.txt
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;for them to be very useful.  It would also be useful to include a bit more of the console logs from the OSS, since this error message is itself not a sign of a crash, but a normal message indicating that the client was not responsive to the server&apos;s request to cancel the lock.&lt;/p&gt;</comment>
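                            <!--
                            A sketch of the collection sequence described above (file names
                            are placeholders; both lctl sub-commands are standard):

                                # convert a kernel-dumped binary debug log to text
                                lctl debug_file /tmp/lustre-log > /tmp/lustre-log.txt

                                # or dump and convert the in-memory debug buffer in one step
                                lctl dk /tmp/lustre-log.txt
                            -->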
                            <comment id="76835" author="orentas" created="Wed, 12 Feb 2014 15:30:41 +0000"  >&lt;p&gt;Noted, thank you Andreas.&lt;/p&gt;

&lt;p&gt;The processed logs are too large to attach to this ticket but can be downloaded from: &lt;a href=&quot;http://ddntsr.com/ftp/2014-02-11-R30000_ddn_lustre_processed_logs.tgz&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://ddntsr.com/ftp/2014-02-11-R30000_ddn_lustre_processed_logs.tgz&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, what can this mean?&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Feb 12 02:25:28 oss5 kernel: : LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 1620s: evicting client at 192.168.224.16@o2ib ns: filter-scratch-OST0004_UUID lock: ffff8804f4fbc240/0xde6e74a6631bb013 lrc: 3/0,0 mode: PW/PW res: [0x5b9afe:0x0:0x0].0 rrc: 2 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;4095) flags: 0x10020 nid: 192.168.224.16@o2ib remote: 0x3ef50ad1d4f06
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now on the client 192.168.224.16, i.e. pod24b16:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@pod24b16 ~]# date ; lfs fid2path /global/scratch [0x5b9afe:0x0:0x0] ; date
Wed Feb 12 03:09:16 PST 2014
ioctl err -22: Invalid argument (22)
fid2path error: Invalid argument
Wed Feb 12 03:09:16 PST 2014
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And the corresponding message on mds2 (MDS node for scratch):&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Feb 12 03:09:16 mds2 kernel: : Lustre: 24536:0:(mdt_handler.c:5739:mdt_fid2path()) scratch-MDT0000: [0x5b9afe:0x0:0x0] is invalid, sequence should be &amp;gt;= 0x200000400
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
</comment>
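                            <!--
                            The fid2path failure above is expected for that argument:
                            [0x5b9afe:0x0:0x0] is the OST object ID from the lock resource,
                            not an MDT FID, and fid2path only resolves FIDs whose sequence is
                            >= 0x200000400, as the mds2 message says. An illustrative
                            invocation with a FID from this ticket's client logs that does
                            qualify:

                                lfs fid2path /global/scratch [0x2000020a6:0x130dc:0x0]
                            -->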
                            <comment id="76854" author="jfc" created="Wed, 12 Feb 2014 17:12:50 +0000"  >&lt;p&gt;Thanks Cliff.&lt;/p&gt;</comment>
                            <comment id="76915" author="cliffw" created="Wed, 12 Feb 2014 23:29:28 +0000"  >&lt;p&gt;What that means is quite simple: There is a client holding a lock on a resource. The client has stopped talking to the server. The server waits, in this case 1620 seconds, and then times out, the timeout allows the server to reclaim the lock. This is done so that a dead client does not halt other work on the cluster. This normally has several causes:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;A client may be dead or rebooted (client state is gone after a reboot).&lt;/li&gt;
	&lt;li&gt;A client may be non-responsive (very busy).&lt;/li&gt;
	&lt;li&gt;The network may be dead, or degraded.&lt;/li&gt;
	&lt;li&gt;The server itself may be busy/wedged in some fashion.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;First, you should determine if there are corresponding dead clients, or clients having application errors matching the server error timestamps. &lt;br/&gt;
Then, examine load on network switches/routers. &lt;br/&gt;
Then, examine load on servers. A server wedge is typically indicated by a very high load factor, dead clients/bad networks usually result in idle servers. &lt;br/&gt;
A callback error only affects a single client - the rest of the cluster should be happy. If multiple clients are experiencing these errors at the same time, that is a good indication that the issue is network or server related. &lt;/p&gt;</comment>
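                            <!--
                            A hypothetical first pass over the steps above (log paths and
                            client names are assumptions for the sketch):

                                # 1) pull eviction events and client NIDs from the OSS syslog
                                grep "lock callback timer expired" /var/log/messages

                                # 2) check whether the named clients were alive/busy then
                                for c in pod24b14 pod24b16; do ssh $c "uptime; dmesg | tail -20"; done

                                # 3) watch server load: high load suggests a wedged server,
                                #    idle servers point at clients or the network
                                vmstat 5 12
                            -->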
                            <comment id="76918" author="cliffw" created="Wed, 12 Feb 2014 23:37:26 +0000"  >&lt;p&gt;To further clarify, this bit:&lt;br/&gt;
 evicting client at 192.168.224.16@o2ib &lt;br/&gt;
means that the client (at 192.168.224.16) will get an error when it next tries to talk to the server. The client will then reconnect and continue. The thread holding this request will report an error to the user application that made the IO request. Depending on how this error is handled in the application, the eviction/reconnection may go unnoticed by the user, but there will be errors in the client logs.&lt;/p&gt;</comment>
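                            <!--
                            A quick way to spot these events on a client, using message
                            strings that appear in this ticket's logs (syslog path assumed):

                                grep -E "was evicted by|Connection to .* was lost" /var/log/messages
                            -->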
                            <comment id="76920" author="orentas" created="Wed, 12 Feb 2014 23:51:52 +0000"  >&lt;p&gt;Thanks Cliff. Totally understand.  In this case, there has been some push back from the customer to collect network / client / OSS stats.  This feedback will hopefully help justify the need to collect the data that has already been requested (multiple times).  Thanks again.&lt;/p&gt;</comment>
                            <comment id="76983" author="jfc" created="Thu, 13 Feb 2014 16:48:13 +0000"  >&lt;p&gt;Oz &amp;#8211; do you want us to keep this ticket open/unresolved, while you try to get your customer data? Or shall we mark this ticket as resolved and wait for you to open a new ticket if the problem reoccurs? Please advise.&lt;/p&gt;

&lt;p&gt;Thanks.&lt;/p&gt;</comment>
                            <comment id="76998" author="orentas" created="Thu, 13 Feb 2014 17:57:52 +0000"  >&lt;p&gt;There are 2 different issues:&lt;/p&gt;

&lt;p&gt;1) Mellanox: OSS HCAs losing access to UFM, and thus losing access to the storage entirely. The sequence of errors is attached as do_IRQ-errors.txt.&lt;/p&gt;

&lt;p&gt;irqbalance has been disabled on all servers, and the system is now being monitored.&lt;/p&gt;

&lt;p&gt;2) Quota issue on a single OST (ost_scratch_11)&lt;/p&gt;

&lt;p&gt;The first issue is not a Lustre problem, so we can go ahead and close &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4607&quot; title=&quot;OSS servers crashing with error: (ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4607&quot;&gt;&lt;del&gt;LU-4607&lt;/del&gt;&lt;/a&gt;.  A separate ticket will be opened for the quota problem.&lt;/p&gt;

&lt;p&gt;Thanks all.&lt;/p&gt;</comment>
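                            <!--
                            For reference, disabling irqbalance as described above on
                            EL5/EL6-era servers (SysV service names assumed for the
                            distribution in use):

                                service irqbalance stop
                                chkconfig irqbalance off
                            -->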
                            <comment id="77013" author="pjones" created="Thu, 13 Feb 2014 20:24:29 +0000"  >&lt;p&gt;ok thanks Oz&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="14113" name="do_IRQ-errors.txt" size="3505" author="orentas" created="Thu, 13 Feb 2014 17:57:52 +0000"/>
                            <attachment id="14107" name="kern.log.1" size="142302" author="orentas" created="Wed, 12 Feb 2014 17:22:07 +0000"/>
                            <attachment id="14083" name="lustre-log.partaa" size="236" author="orentas" created="Mon, 10 Feb 2014 22:49:33 +0000"/>
                            <attachment id="14084" name="lustre-log.partab" size="236" author="orentas" created="Mon, 10 Feb 2014 22:49:33 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwepz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>12604</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>