<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:07:53 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-7321] LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST</title>
                <link>https://jira.whamcloud.com/browse/LU-7321</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;On one of the file systems that have BlueGene/Q clients, I am seeing these log messages:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;vestai:/g/g0/marc# tail /var/log/lustre.log
Sep 16 09:25:32 vesta56 kernel: LustreError: 137188:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 4 previous similar messages
Sep 16 09:25:49 vesta56 kernel: LustreError: 137212:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.11@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0037_UUID lock: ffff8807adc3eb40/0xef9b0828fb0cdca lrc: 5/0,0 mode: PW/PW res: [0x4b17112:0x0:0x0].0 rrc: 3 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x60000080010020 nid: 172.20.17.11@o2ib500 remote: 0x4ceffd8a2ed5c33a expref: 829 pid: 87731 timeout: 9654300922 lvb_type: 0
Sep 16 09:25:49 vesta56 kernel: LustreError: 137212:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 3 previous similar messages
Sep 16 09:26:37 vesta56 kernel: LustreError: 87731:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.12@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0037_UUID lock: ffff88064a4bc880/0xef9b0828fb0dd59 lrc: 6/0,0 mode: PW/PW res: [0x4b17112:0x0:0x0].0 rrc: 4 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x60000080010020 nid: 172.20.17.12@o2ib500 remote: 0xdec93cb7eb4eca85 expref: 790 pid: 137212 timeout: 9654349575 lvb_type: 0
Sep 16 09:26:37 vesta56 kernel: LustreError: 87731:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 8 previous similar messages
Sep 16 09:29:19 vesta2 kernel: LustreError: 82827:0:(ldlm_lockd.c:463:ldlm_add_waiting_lock()) ### not waiting on destroyed lock (bug 5653) ns: filter-fsv-OST0001_UUID lock: ffff880e14fefd40/0xca78e45d33428446 lrc: 1/0,0 mode: --/PW res: [0x4d27886:0x0:0x0].0 rrc: 6 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x74801080010020 nid: 172.20.17.2@o2ib500 remote: 0xa0f4c9e1f12378c4 expref: 1893 pid: 10293 timeout: 0 lvb_type: 0
Sep 16 09:29:19 vesta2 kernel: LustreError: 82827:0:(ldlm_lockd.c:463:ldlm_add_waiting_lock()) Skipped 2 previous similar messages
Sep 16 09:30:59 vesta20 kernel: LustreError: 61820:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0013_UUID lock: ffff88047e3127c0/0x1475582a881a7576 lrc: 5/0,0 mode: PW/PW res: [0x4a8bd2d:0x0:0x0].0 rrc: 5 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x60000080010020 nid: 172.20.17.9@o2ib500 remote: 0xa29fd336c2734cc9 expref: 1031 pid: 61818 timeout: 9654614287 lvb_type: 0
Sep 16 09:31:15 vesta20 kernel: LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0013_UUID lock: ffff8806b70d6380/0x1475582a881a7f4e lrc: 5/0,0 mode: PW/PW res: [0x4a8bd2d:0x0:0x0].0 rrc: 4 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x60001080010020 nid: 172.20.17.9@o2ib500 remote: 0x79abd336c2734cc9 expref: 1031 pid: 61818 timeout: 9654630757 lvb_type: 0
Sep 16 09:31:15 vesta20 kernel: LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 27 previous similar messages
vestai:/g/g0/marc# 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There were over 3000 messages printed to the logs that day, not counting the suppressed messages:&lt;br/&gt;
vestai:/g/g0/marc# zgrep -c &quot;returned -5 from glimpse AST&quot; /var/log/lustre.log-20150915.gz &lt;br/&gt;
3232&lt;br/&gt;
vestai:/g/g0/marc# &lt;/p&gt;

&lt;p&gt;These messages only show up with NIDs related to BlueGene/Q clients, and I don&apos;t see them on file systems that do not have BlueGene/Q clients.  I don&apos;t see anything in the console or syslog logs on the clients.&lt;/p&gt;

&lt;p&gt;On the BG/Q clients we are running:&lt;br/&gt;
 vulcanio14-ib0@root: cat /proc/fs/lustre/version&lt;br/&gt;
lustre: 2.5.4&lt;br/&gt;
kernel: patchless_client&lt;br/&gt;
build: 2.5.4-4chaos-4chaos--PRISTINE-2.6.32-504.8.2.bgq.3blueos.V1R2M3.bl2.2_1.ppc64&lt;/p&gt;

&lt;p&gt;We saw spurious messages years ago, but after doing a Splunk query of our logs it looks like it has been ramping up since late 2014.&lt;br/&gt;
Number of messages per month (not counting &quot;x similar messages&quot;):&lt;br/&gt;
10/14: 172&lt;br/&gt;
11/14: 8043&lt;br/&gt;
12/14: 15,002&lt;br/&gt;
1/15: 17,191&lt;br/&gt;
2/15: 36,258&lt;br/&gt;
3/15: 53,145&lt;br/&gt;
4/15: 69,215&lt;br/&gt;
5/15: 96,518&lt;br/&gt;
6/15: 54,504&lt;br/&gt;
7/15: 91,304&lt;br/&gt;
8/15: 144,691&lt;br/&gt;
9/15: 236,491&lt;/p&gt;
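&lt;p&gt;A tally like the one above can be produced mechanically from the syslog lines. The following is a hypothetical sketch over inline sample lines (real logs are rotated and compressed, so each archive would be fed through zcat first):&lt;/p&gt;

```shell
# Hypothetical sketch: count glimpse AST errors per month, using the month
# abbreviation in field 1 of each syslog line. The sample lines below are
# stand-ins for real /var/log/lustre.log content.
printf '%s\n' \
  'Sep 16 09:25:49 vesta56 kernel: LustreError: returned -5 from glimpse AST' \
  'Sep 16 09:26:37 vesta56 kernel: LustreError: returned -5 from glimpse AST' \
  'Oct 20 10:53:33 vesta46 kernel: LustreError: returned -5 from glimpse AST' |
grep 'returned -5 from glimpse AST' |
awk '{count[$1]++} END {for (m in count) print m, count[m]}' | sort
# prints "Oct 1" then "Sep 2"
```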
</description>
                <environment>LLNL lustre-2.5.4-4chaos_2.6.32_504.16.2.1chaos.ch5.3.x86_64.x86_64&lt;br/&gt;
ZFS Backend &lt;br/&gt;
BG/Q clients</environment>
        <key id="32739">LU-7321</key>
            <summary>LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="green">Oleg Drokin</assignee>
                                    <reporter username="marc@llnl.gov">D. Marc Stearman</reporter>
                        <labels>
                            <label>llnl</label>
                            <label>ppc</label>
                    </labels>
                <created>Tue, 20 Oct 2015 15:58:27 +0000</created>
                <updated>Wed, 21 Mar 2018 19:42:27 +0000</updated>
                            <resolved>Wed, 21 Mar 2018 19:42:27 +0000</resolved>
                                    <version>Lustre 2.5.3</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>12</watches>
                                                                            <comments>
                            <comment id="130899" author="marc@llnl.gov" created="Tue, 20 Oct 2015 16:07:11 +0000"  >&lt;p&gt;Some more information:&lt;/p&gt;

&lt;p&gt;In Oct 2014, we were running 2.4.2 on the BlueGene clusters and the file system servers.  In Dec 2014, we updated to 2.5.3 on the servers and 2.5.4 on the BlueGene clients.&lt;/p&gt;</comment>
                            <comment id="130918" author="jgmitter" created="Tue, 20 Oct 2015 17:23:37 +0000"  >&lt;p&gt;Hi Nathaniel,&lt;br/&gt;
Can you have a look at this issue?&lt;br/&gt;
Thanks.&lt;br/&gt;
Joe&lt;/p&gt;</comment>
                            <comment id="130919" author="adilger" created="Tue, 20 Oct 2015 17:28:13 +0000"  >&lt;p&gt;Marc, could you please check a couple of the OST objects that are reported in these error messages to see if they still exist, e.g. &lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;0x4b17112:0x0:0x0&amp;#93;&lt;/span&gt;&lt;/tt&gt; and &lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;0x4a8bd2d:0x0:0x0&amp;#93;&lt;/span&gt;&lt;/tt&gt; (presumably with zdb to avoid unmounting the OSTs).  I&apos;m trying to figure out if these errors are appearing because MDT layouts are referencing missing OST objects, or if the objects exist but there is some error accessing or locking them from the PPC clients.  The OST FIDs (object IDs) themselves look reasonable, 78737682 and 78167341 respectively, so it doesn&apos;t appear to be a FID swabbing issue at least.&lt;/p&gt;</comment>
                            <comment id="130931" author="marc@llnl.gov" created="Tue, 20 Oct 2015 18:12:11 +0000"  >&lt;p&gt;Just to make sure I did the right thing:&lt;/p&gt;

&lt;p&gt;Using this current message:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Oct 20 10:53:33 vesta46 kernel: LustreError: 40317:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.14@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST002d_UUID lock: ffff8801c0a3bbc0/0x44146a450a733fa0 lrc: 5/0,0 mode: PW/PW res: [0x4f831ef:0x0:0x0].0 rrc: 3 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;1048575) flags: 0x50801000000000 nid: 172.20.17.14@o2ib500 remote: 0x7266945e442341a2 expref: 5 pid: 40321 timeout: 0 lvb_type: 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I want to look for object &quot;0x4f831ef:0x0:0x0&quot; on vesta46.  I created a snapshot of the OST and mounted it.  &quot;0x4f831ef&quot; translates to 83374575 in decimal.&lt;/p&gt;
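&lt;p&gt;(As a sanity check on that conversion, the first component of the resource FID from the log line can be converted directly in the shell; a minimal sketch:)&lt;/p&gt;

```shell
# Convert the OST object ID (first component of [0x4f831ef:0x0:0x0])
# from hex to decimal, as used for the O/0/*/OBJID lookup.
printf '%d\n' 0x4f831ef   # prints 83374575
```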

&lt;p&gt;In the snapshot, I did this:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@vesta46:snap]# ls
CONFIGS       oi.10   oi.105  oi.110  oi.116  oi.121  oi.127  oi.18  oi.23  oi.29  oi.34  oi.4   oi.45  oi.50  oi.56  oi.61  oi.67  oi.72  oi.78  oi.83  oi.89  oi.94  quota_slave
O             oi.100  oi.106  oi.111  oi.117  oi.122  oi.13   oi.19  oi.24  oi.3   oi.35  oi.40  oi.46  oi.51  oi.57  oi.62  oi.68  oi.73  oi.79  oi.84  oi.9   oi.95  seq-200000003-lastid
health_check  oi.101  oi.107  oi.112  oi.118  oi.123  oi.14   oi.2   oi.25  oi.30  oi.36  oi.41  oi.47  oi.52  oi.58  oi.63  oi.69  oi.74  oi.8   oi.85  oi.90  oi.96  seq_srv
last_rcvd     oi.102  oi.108  oi.113  oi.119  oi.124  oi.15   oi.20  oi.26  oi.31  oi.37  oi.42  oi.48  oi.53  oi.59  oi.64  oi.7   oi.75  oi.80  oi.86  oi.91  oi.97
oi.0          oi.103  oi.109  oi.114  oi.12   oi.125  oi.16   oi.21  oi.27  oi.32  oi.38  oi.43  oi.49  oi.54  oi.6   oi.65  oi.70  oi.76  oi.81  oi.87  oi.92  oi.98
oi.1          oi.104  oi.11   oi.115  oi.120  oi.126  oi.17   oi.22  oi.28  oi.33  oi.39  oi.44  oi.5   oi.55  oi.60  oi.66  oi.71  oi.77  oi.82  oi.88  oi.93  oi.99
[root@vesta46:snap]# ls O/0
d0  d1  d10  d11  d12  d13  d14  d15  d16  d17  d18  d19  d2  d20  d21  d22  d23  d24  d25  d26  d27  d28  d29  d3  d30  d31  d4  d5  d6  d7  d8  d9
[root@vesta46:snap]# ls O/0/*/83374575
ls: cannot access O/0/*/83374575: No such file or directory
[root@vesta46:snap]# 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This tells me the object does not exist.  The next sequence number does, so it seems to be a valid range of objects:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@vesta46:snap]# ls O/0/*/83374576
O/0/d16/83374576
[root@vesta46:snap]# 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If I did not look in the right place, let me know and I can check again.&lt;/p&gt;</comment>
                            <comment id="130956" author="marc@llnl.gov" created="Tue, 20 Oct 2015 21:05:30 +0000"  >&lt;p&gt;Just for kicks I picked a couple of other messages on that OST, and found that the files DO exist:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Oct 20 07:52:13 vesta46 kernel: LustreError: 40313:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.185@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST002d_UUID lock: ffff880430e4bb00/0x44146a450a4fb185 lrc: 5/0,0 mode: PW/PW res: [0x4f814c7:0x0:0x0].0 rrc: 17 type: EXT [191496192-&amp;gt;18446744073709551615] (req 191496192-&amp;gt;191561727) flags: 0x40000000000000 nid: 172.20.17.185@o2ib500 remote: 0x82ceff91b720bbf5 expref: 18 pid: 40313 timeout: 0 lvb_type: 0
Oct 20 07:52:41 vesta46 kernel: LustreError: 40316:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.159@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST002d_UUID lock: ffff880aee7881c0/0x44146a450a4fb903 lrc: 5/0,0 mode: PW/PW res: [0x4f81442:0x0:0x0].0 rrc: 224 type: EXT [1861222400-&amp;gt;18446744073709551615] (req 1861222400-&amp;gt;18446744073709551615) flags: 0x40000000000000 nid: 172.20.17.159@o2ib500 remote: 0x7363d7a3a2c7e459 expref: 134 pid: 40316 timeout: 0 lvb_type: 0
Oct 20 08:00:33 vesta46 kernel: LustreError: 42292:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.177@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST002d_UUID lock: ffff88068ffb36c0/0x44146a450a50a9cd lrc: 4/0,0 mode: PW/PW res: [0x4f81468:0x0:0x0].0 rrc: 558 type: EXT [1877999616-&amp;gt;18446744073709551615] (req 1877999616-&amp;gt;18446744073709551615) flags: 0x40000000000000 nid: 172.20.17.177@o2ib500 remote: 0xd4e4f78838543fa8 expref: 291 pid: 42292 timeout: 0 lvb_type: 0
Oct 20 10:53:33 vesta46 kernel: LustreError: 40317:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.14@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST002d_UUID lock: ffff8801c0a3bbc0/0x44146a450a733fa0 lrc: 5/0,0 mode: PW/PW res: [0x4f831ef:0x0:0x0].0 rrc: 3 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;1048575) flags: 0x50801000000000 nid: 172.20.17.14@o2ib500 remote: 0x7266945e442341a2 expref: 5 pid: 40321 timeout: 0 lvb_type: 0

[root@vesta46:snap]# ls O/0/*/83367016
O/0/d8/83367016
[root@vesta46:snap]# ls O/0/*/83366978
O/0/d2/83366978
[root@vesta46:snap]# ls O/0/*/83374575
ls: cannot access O/0/*/83374575: No such file or directory
[root@vesta46:snap]# 
[root@vesta46:snap]# ls O/0/*/83367111
O/0/d7/83367111
[root@vesta46:snap]# 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So, out of the four messages on that OST this morning, three of the files exist, but the one at 10:53:33 does not exist.&lt;/p&gt;</comment>
                            <comment id="133713" author="utopiabound" created="Tue, 17 Nov 2015 17:12:21 +0000"  >&lt;p&gt;Can you provide debug logs from the clients, OSS and MDS when this occurs?&lt;/p&gt;</comment>
                            <comment id="135125" author="jfc" created="Thu, 3 Dec 2015 18:07:26 +0000"  >&lt;p&gt;Hello Marc,&lt;/p&gt;

&lt;p&gt;Are you able to provide the information that Nathaniel has requested?&lt;/p&gt;

&lt;p&gt;Any other action you would like us to take on this one?&lt;br/&gt;
Thanks,&lt;br/&gt;
~ jfc.&lt;/p&gt;</comment>
                            <comment id="135130" author="marc@llnl.gov" created="Thu, 3 Dec 2015 18:23:21 +0000"  >&lt;p&gt;I can probably trigger some dumps on the OSTs and MDS node.  Let me tweak some scripts and wait for it to happen.&lt;/p&gt;</comment>
                            <comment id="136481" author="marc@llnl.gov" created="Wed, 16 Dec 2015 00:47:25 +0000"  >&lt;p&gt;Here are the logs from vesta29, and also from the MDS at the time this log message appeared:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Dec 15 14:56:21 vesta29 kernel: LustreError: 9813:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.17@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST001c_UUID lock: ffff8803ad803980/0x9749e8c363b75b7f lrc: 5/0,0 mode: PW/PW res: [0x542af58:0x0:0x0].0 rrc: 31 type: EXT [719323136-&amp;gt;18446744073709551615] (req 719323136-&amp;gt;720371711) flags: 0x60000000000020 nid: 172.20.17.17@o2ib500 remote: 0x1fa09e9911a44c7c expref: 349 pid: 15752 timeout: 7159851604 lvb_type: 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="136500" author="morrone" created="Wed, 16 Dec 2015 02:08:21 +0000"  >&lt;p&gt;Nathaniel, could you please take a step back for a moment and explain the meaning of the error message?  I&apos;ll start the conversation by giving my best guess at the situation:&lt;/p&gt;

&lt;p&gt;Some server (OST) needs to satisfy a request to look up a file&apos;s size, but a client currently holds a range lock on the file, which means the server doesn&apos;t know the size (because there may be outstanding dirty data on the client).  The server issues a glimpse AST to the client to get that information.  The client returns EIO.&lt;/p&gt;

&lt;p&gt;So if I&apos;m in the ballpark, the first step would be to look at the client code and figure out under which situations the response will be &quot;EIO&quot;.  I mean, heck, maybe the glimpse AST code just hasn&apos;t been correctly swabbed and the client can&apos;t make sense of the RPC.&lt;/p&gt;

&lt;p&gt;I could be completely off base.  If so, please enlighten us.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I&apos;m trying to figure out if these errors are appearing because MDT layouts are referencing missing OST objects,&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I am not sure that I understand that rationale.  Why would an OST ask a client about an object of which it itself had no knowledge?  That would be a bug right there, would it not?  The fact that the OST looked up the locks on the object and decided that a client had a lock on the end of the object pretty much tells us that the object existed on the OST at the time it issued the glimpse AST to the client, does it not?&lt;/p&gt;</comment>
                            <comment id="136626" author="utopiabound" created="Wed, 16 Dec 2015 22:05:20 +0000"  >&lt;p&gt;It looks like there are a lot of IB errors in the attached debug logs (these messages appear in both of the attached logs).&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;16471:0:(o2iblnd_cb.c:1011:kiblnd_tx_complete()) Tx -&amp;gt; 172.20.17.17@o2ib500 cookie 0xbca6a70ba sending 1 waiting 0: failed 12
16471:0:(o2iblnd_cb.c:1895:kiblnd_close_conn_locked()) Closing conn to 172.20.17.17@o2ib500: error -5(waiting)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Could these be the source of the errors?&lt;/p&gt;

&lt;p&gt;The error that is printed on the client is coming up from the network stack.&lt;/p&gt;</comment>
                            <comment id="136637" author="marc@llnl.gov" created="Wed, 16 Dec 2015 23:15:59 +0000"  >&lt;p&gt;I&apos;m not seeing any errors on the IB fabric that would account for this.  Could there be some swabbing issues in the IB LND code?  We only see these errors on the file systems connected to PPC clients (the BlueGene BG/Q I/O nodes and login nodes).  172.20.17.17@o2ib500 is vulcanio17 (an I/O node on BG/Q).&lt;/p&gt;</comment>
                            <comment id="136639" author="marc@llnl.gov" created="Wed, 16 Dec 2015 23:25:58 +0000"  >&lt;p&gt;I&apos;m not sure if the logs have cycled around, but here is the log from vulcanio17, which I just dumped.  I&apos;m not sure which field is the timestamp, but it seems like there are no messages since it last mounted.&lt;/p&gt;</comment>
                            <comment id="136640" author="marc@llnl.gov" created="Wed, 16 Dec 2015 23:27:54 +0000"  >&lt;p&gt;And just to add, in case it helps, the I/O node has been up for 6 days:&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@vulcansn1:~&amp;#93;&lt;/span&gt;# mrsh vulcanio17 uptime&lt;br/&gt;
 15:20:58 up 6 days,  6:02,  1 user,  load average: 3.39, 5.00, 5.64&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@vulcansn1:~&amp;#93;&lt;/span&gt;# mrsh vulcanio17 cat /proc/sys/lnet/debug&lt;br/&gt;
ioctl neterror warning error emerg config console&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@vulcansn1:~&amp;#93;&lt;/span&gt;# &lt;/p&gt;</comment>
                            <comment id="136641" author="marc@llnl.gov" created="Wed, 16 Dec 2015 23:42:40 +0000"  >&lt;p&gt;I was looking at the debug log files.  You need to pay attention to the timestamps.  We do system maintenance on Thursdays on Vulcan (the BG/Q cluster), so you should expect network errors on Dec 03 and Dec 10 due to the I/O nodes being rebooted.  The logs are dumped when syslog sees the error message come by, so these dumps may have days or weeks of information in them.  The messages are happening quite frequently, and I don&apos;t believe we are tearing down network connections on each of these instances.&lt;/p&gt;</comment>
                            <comment id="136751" author="utopiabound" created="Thu, 17 Dec 2015 19:34:04 +0000"  >&lt;p&gt;Let me see if I can pull in someone more familiar with this code who might have an idea where the issue would be.&lt;/p&gt;</comment>
                            <comment id="136850" author="gerrit" created="Fri, 18 Dec 2015 14:24:02 +0000"  >&lt;p&gt;Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/17672&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/17672&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7321&quot; title=&quot;LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7321&quot;&gt;&lt;del&gt;LU-7321&lt;/del&gt;&lt;/a&gt; test: FOR TEST ONLY&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 5d8700e425773b3738844b477b2fe74d6de85b08&lt;/p&gt;</comment>
                            <comment id="139087" author="utopiabound" created="Fri, 15 Jan 2016 22:01:29 +0000"  >&lt;p&gt;Can you try to get some debug logs with the following levels enabled: rpctrace ha&lt;/p&gt;

&lt;p&gt;This will help nail down whether the error is coming back from the client or is due to a connectivity error.&lt;/p&gt;</comment>
                            <comment id="139112" author="green" created="Sat, 16 Jan 2016 02:22:47 +0000"  >&lt;p&gt;These levels should be enough from a server that has a history of hitting the error message in question, since guessing which client would hit it next is hard, if possible at all.&lt;/p&gt;

&lt;p&gt;I assume you might be wary of the added performance impact of these increased debug levels, hence only one server should be enough, even though I personally think rpctrace and ha are fairly low-key levels without too much chatter.&lt;/p&gt;</comment>
                            <comment id="152851" author="utopiabound" created="Thu, 19 May 2016 16:56:42 +0000"  >&lt;p&gt;We extended PPC testing in autotest (back in Feb 2016), but there have been no failures since then.&lt;/p&gt;</comment>
                            <comment id="155963" author="pjones" created="Thu, 16 Jun 2016 17:47:21 +0000"  >&lt;p&gt;Hongchao&lt;/p&gt;

&lt;p&gt;Could you please assist with this issue?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="155966" author="pjones" created="Thu, 16 Jun 2016 17:49:21 +0000"  >&lt;p&gt;Sorry - wrong ticket&lt;/p&gt;</comment>
                            <comment id="163316" author="simmonsja" created="Fri, 26 Aug 2016 21:46:44 +0000"  >&lt;p&gt;The most recent version of Lustre (2.8.56) also displays this problem with ppc8. I can easily reproduce it with an IOR ssf operation on a single node.&lt;/p&gt;

&lt;p&gt;Do you need debugging info from both the clients and the server?&lt;/p&gt;</comment>
                            <comment id="163434" author="simmonsja" created="Mon, 29 Aug 2016 17:32:25 +0000"  >&lt;p&gt;I fixed up my LNet DLC yaml configuration file and applied &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8507&quot; title=&quot;DLC: Enable peer_credit management&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8507&quot;&gt;&lt;del&gt;LU-8507&lt;/del&gt;&lt;/a&gt; so it configures properly with my pre-2.9 version, and the problem went away. The issue is a symptom of a less-than-optimal LNet configuration. Are you still on the Lustre 2.5 version? Is this using the Mellanox stack? Also, are you connecting to x86 server nodes?&lt;/p&gt;</comment>
                            <comment id="163473" author="morrone" created="Mon, 29 Aug 2016 20:02:41 +0000"  >&lt;p&gt;Yes, it is Lustre 2.5.5 + many patches.  I believe it is Mellanox.  We don&apos;t use DLC, yaml, or change the peer credits.&lt;/p&gt;</comment>
                            <comment id="163634" author="simmonsja" created="Tue, 30 Aug 2016 18:42:43 +0000"  >&lt;p&gt;Okay, I was wrong. I do still get this error. It&apos;s just that before I saw it on both the clients and the servers; now I only see it on the server side. Do you have any jobs fail due to this?&lt;/p&gt;</comment>
                            <comment id="163645" author="simmonsja" created="Tue, 30 Aug 2016 20:18:38 +0000"  >&lt;p&gt;Here is a lctl dump from both my client and an OSS server.&lt;/p&gt;</comment>
                            <comment id="164516" author="utopiabound" created="Wed, 31 Aug 2016 21:32:52 +0000"  >&lt;p&gt;From OSS dump:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00010000:00010000:19.0:1472585788.794238:0:19880:0:(ldlm_lock.c:662:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: filter-sultan-OST001b_UUID lock: ffff8808c87bb300/0xca29cea60124a3ed lrc: 2/0,0 mode: PW/PW res: [0x1931e3:0x0:0x0].0x0 rrc: 2 type: EXT [0-&amp;gt;18446744073709551615] (req 2738880512-&amp;gt;2739929087) flags: 0x40000000000000 nid: 10.37.202.11@o2ib1 remote: 0x48e7a088193305c expref: 6 pid: 11322 timeout: 0 lvb_type: 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="164657" author="simmonsja" created="Thu, 1 Sep 2016 17:22:31 +0000"  >&lt;p&gt;So this is causing my applications to fail. I have no idea how to fix that. Any ideas?&lt;/p&gt;</comment>
                            <comment id="222651" author="simmonsja" created="Tue, 6 Mar 2018 23:13:01 +0000"  >&lt;p&gt;Do people still see this? I haven&apos;t done large scaling testing yet but I was wondering if it is still observed.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="29170">LU-6387</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="19944" name="debug_log.vesta-mds1.LU-7321.gz" size="181015" author="marc@llnl.gov" created="Wed, 16 Dec 2015 00:47:25 +0000"/>
                            <attachment id="19943" name="debug_log.vesta29.LU-7321.gz" size="2572654" author="marc@llnl.gov" created="Wed, 16 Dec 2015 00:47:25 +0000"/>
                            <attachment id="19949" name="debug_log.vulcanio17" size="570029" author="marc@llnl.gov" created="Wed, 16 Dec 2015 23:25:58 +0000"/>
                            <attachment id="22813" name="dump-client.log" size="190335" author="simmonsja" created="Tue, 30 Aug 2016 20:18:38 +0000"/>
                            <attachment id="22812" name="dump-oss.log.gz" size="12982111" author="simmonsja" created="Tue, 30 Aug 2016 20:18:37 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10490" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>End date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Thu, 19 May 2016 15:58:27 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxqy7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10493" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>Start date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Tue, 20 Oct 2015 15:58:27 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    </customfields>
    </item>
</channel>
</rss>