<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:42:01 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4359] 1.8.9 clients have endless bulk IO timeouts with 2.4.1 servers</title>
                <link>https://jira.whamcloud.com/browse/LU-4359</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We have 2 separate 1.8.9 clients that have processes hung in the D state with the clients endlessly looping from FULL to DISCONN and then reestablishing connectivity. One appears to be looping on a Bulk IO write to a stale file handle (10.36.202.142@o2ib) and the other appears to be a BULK IO read from a kiblnd failure (10.36.202.138@o2ib). The timeouts affect filesystem availability, but other activities proceed in between these disconnections.&lt;/p&gt;

&lt;p&gt;Just yesterday we identified 2 bad IB cables with high symbol error rates in our fabric that have since been disconnected. They were likely the cause for at least one of the issues.&lt;/p&gt;

&lt;p&gt;server logs relevant to 10.36.202.138@o2ib issue:&lt;br/&gt;
Dec  8 12:11:52 atlas-oss3b4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1032387.863173&amp;#93;&lt;/span&gt; Lustre: atlas2-OST009b: Client 3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2 (at 10.36.202.138@o2ib) refused reconnection, still busy with 1 active RPCs&lt;br/&gt;
Dec  8 12:11:52 atlas-oss3b4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1032387.919690&amp;#93;&lt;/span&gt; LustreError: 11590:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk PUT  req@ffff880761cd2800 x1453425689933409/t0(0) o3-&amp;gt;3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2@10.36.202.138@o2ib:0/0 lens 448/432 e 0 to 0 dl 1386523292 ref 1 fl Interpret:/2/0 rc 0/0&lt;br/&gt;
Dec  8 12:11:52 atlas-oss3b4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1032388.030434&amp;#93;&lt;/span&gt; Lustre: atlas2-OST009b: Bulk IO read error with 3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2 (at 10.36.202.138@o2ib), client will retry: rc -110&lt;/p&gt;

&lt;p&gt;client log on 10.36.202.138@o2ib:&lt;br/&gt;
Dec  8 12:11:51 dtn04.ccs.ornl.gov kernel: LustreError: 24615:0:(events.c:199:client_bulk_callback()) event type 1, status -103, desc ffff880499288000&lt;br/&gt;
Dec  8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: 24622:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1453425689933409 sent from atlas2-OST009b-osc-ffff880c392f9400 to NID 10.36.225.185@o2ib 19s ago has failed due to network error (567s prior to deadline).&lt;br/&gt;
Dec  8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: 24622:0:(client.c:1529:ptlrpc_expire_one_request()) Skipped 7 previous similar messages&lt;br/&gt;
Dec  8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: atlas2-OST009b-osc-ffff880c392f9400: Connection to service atlas2-OST009b via nid 10.36.225.185@o2ib was lost; in progress operations using this service will wait for recovery to complete.&lt;br/&gt;
Dec  8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: Skipped 7 previous similar messages&lt;br/&gt;
Dec  8 12:11:52 dtn04.ccs.ornl.gov kernel: LustreError: 11-0: an error occurred while communicating with 10.36.225.185@o2ib. The ost_connect operation failed with -16&lt;/p&gt;

&lt;p&gt;lctl dk output:&lt;br/&gt;
00000100:00080000:9:1386522391.428708:0:24622:0:(client.c:1392:ptlrpc_check_set()) resend bulk old x1453425689698153 new x1453425689814678&lt;br/&gt;
00000100:02000400:4:1386522391.428708:0:24623:0:(import.c:1016:ptlrpc_connect_interpret()) Server atlas2-OST009b_UUID version (2.4.1.0) is much newer than client version (1.8.9)&lt;br/&gt;
00000800:00000100:8:1386522411.140346:0:1644:0:(o2iblnd_cb.c:1813:kiblnd_close_conn_locked()) Closing conn to 10.36.225.185@o2ib: error 0(waiting)&lt;br/&gt;
00000100:00020000:6:1386522411.140744:0:24615:0:(events.c:199:client_bulk_callback()) event type 1, status -103, desc ffff880499288000&lt;br/&gt;
00000100:00000400:3:1386522411.151303:0:24622:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1453425689814678 sent from atlas2-OST009b-osc-ffff880c392f9400 to NID 10.36.225.185@o2ib 20s ago has failed due to network error (567s prior to deadline).&lt;/p&gt;

&lt;p&gt;server logs relevant to 10.36.202.142@o2ib issue:&lt;br/&gt;
Dec  8 09:27:15 atlas-oss4h1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1022525.123761&amp;#93;&lt;/span&gt; LustreError: 113676:0:(ldlm_lib.c:2722:target_bulk_io()) @@@ network error on bulk GET 0(1048576)  req@ffff880ff253d000 x1453430541439222/t0(0) o4-&amp;gt;10c57078-ce10-72c6-b97d-e7c9a32a240c@10.36.202.142@o2ib:0/0 lens 448/448 e 0 to 0 dl 1386513410 ref 1 fl Interpret:/2/0 rc 0/0&lt;br/&gt;
Dec  8 09:27:15 atlas-oss4h1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1022525.123900&amp;#93;&lt;/span&gt; Lustre: atlas2-OST03e0: Bulk IO write error with 10c57078-ce10-72c6-b97d-e7c9a32a240c (at 10.36.202.142@o2ib), client will retry: rc -110&lt;/p&gt;

&lt;p&gt;client logs from 10.36.202.142@o2ib:&lt;br/&gt;
Dec  8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: atlas2-OST03e0-osc-ffff881039813c00: Connection to service atlas2-OST03e0 via nid 10.36.226.48@o2ib was lost; in progress operations using this service will wait for recovery to complete.&lt;br/&gt;
Dec  8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: Skipped 39 previous similar messages&lt;br/&gt;
Dec  8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: atlas2-OST03e0-osc-ffff881039813c00: Connection restored to service atlas2-OST03e0 using nid 10.36.226.48@o2ib.&lt;br/&gt;
Dec  8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: Skipped 41 previous similar messages&lt;/p&gt;

&lt;p&gt;The lctl dk output is less revealing for this case, but I will attach it. What flags are desirable for getting more relevant information?&lt;/p&gt;

&lt;p&gt;From the OS, if I try to kill the process or stat the inode, I see:&lt;br/&gt;
00000080:00020000:3:1386519720.291103:0:16284:0:(file.c:3348:ll_inode_revalidate_fini()) failure -116 inode 144117425485470173&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@dtn-sch01 ~&amp;#93;&lt;/span&gt;# fuser -k -m /lustre/atlas2&lt;br/&gt;
Cannot stat file /proc/975/fd/25: Stale file handle&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@dtn-sch01 ~&amp;#93;&lt;/span&gt;# ls -l /proc/975/fd/25&lt;br/&gt;
l-wx------ 1 cfuson ccsstaff 64 Dec  7 22:46 /proc/975/fd/25 -&amp;gt; /lustre/atlas1/stf007/scratch/cfuson/TestDir/SubDir2/13G-3.tar&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@dtn-sch01 ~&amp;#93;&lt;/span&gt;# &lt;/p&gt;</description>
                <environment>RHEL6.4/distro OFED/1.8.9 clients/2.4.1 servers</environment>
        <key id="22377">LU-4359</key>
            <summary>1.8.9 clients have endless bulk IO timeouts with 2.4.1 servers</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="niu">Niu Yawei</assignee>
                                    <reporter username="blakecaldwell">Blake Caldwell</reporter>
                        <labels>
                    </labels>
                <created>Sun, 8 Dec 2013 18:16:54 +0000</created>
                <updated>Wed, 22 Jan 2014 13:36:36 +0000</updated>
                            <resolved>Wed, 22 Jan 2014 13:36:35 +0000</resolved>
                                    <version>Lustre 2.4.1</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="73047" author="pjones" created="Sun, 8 Dec 2013 18:48:17 +0000"  >&lt;p&gt;Niu&lt;/p&gt;

&lt;p&gt;Could you please comment on this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="73111" author="blakecaldwell" created="Mon, 9 Dec 2013 18:25:15 +0000"  >&lt;p&gt;We found messages that refer to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-617&quot; title=&quot;LBUG: (mdt_recovery.c:787:mdt_last_rcvd_update()) ASSERTION(req_is_replay(req)) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-617&quot;&gt;&lt;del&gt;LU-617&lt;/del&gt;&lt;/a&gt;, occurring during recovery when one of the misbehaving clients reconnects. Is there any relation here? What is the cause of the smaller transaction getting written to disk, or why is the client trying to replace a larger transaction?&lt;/p&gt;

&lt;p&gt;(mdt_recovery.c:418:mdt_last_rcvd_update()) Trying to overwrite bigger transno:on-disk: 47302528652, new: 47302528651 replay: 0. see &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-617&quot; title=&quot;LBUG: (mdt_recovery.c:787:mdt_last_rcvd_update()) ASSERTION(req_is_replay(req)) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-617&quot;&gt;&lt;del&gt;LU-617&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In context below. Note that there is not one of the &quot;Trying to overwrite bigger transno&quot; messages for every reconnection.&lt;br/&gt;
Dec  9 06:09:57 atlas-oss3b4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1096933.917564&amp;#93;&lt;/span&gt; Lustre: atlas2-OST009b: Client 3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2 (at 10.36.202.138@o2ib) reconnecting&lt;br/&gt;
Dec  9 06:09:57 atlas-oss3b4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1096933.942120&amp;#93;&lt;/span&gt; Lustre: Skipped 9 previous similar messages&lt;br/&gt;
Dec  9 06:10:17 atlas-oss3b4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1096953.629963&amp;#93;&lt;/span&gt; LustreError: 25257:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff88072b512000&lt;br/&gt;
Dec  9 06:10:38 atlas-oss4g2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1096977.515250&amp;#93;&lt;/span&gt; LustreError: 137-5: atlas2-OST007d_UUID: not available for connect from 10.36.202.142@o2ib (no target)&lt;br/&gt;
Dec  9 06:10:38 atlas-oss4g4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1096978.820263&amp;#93;&lt;/span&gt; LustreError: 137-5: atlas2-OST007f_UUID: not available for connect from 10.36.202.142@o2ib (no target)&lt;br/&gt;
Dec  9 06:12:35 atlas-mds3 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1096356.518178&amp;#93;&lt;/span&gt; LustreError: 32369:0:(mdt_recovery.c:418:mdt_last_rcvd_update()) Trying to overwrite bigger transno:on-disk: 47302528652, new: 47302528651 replay: 0. see &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-617&quot; title=&quot;LBUG: (mdt_recovery.c:787:mdt_last_rcvd_update()) ASSERTION(req_is_replay(req)) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-617&quot;&gt;&lt;del&gt;LU-617&lt;/del&gt;&lt;/a&gt;.&lt;br/&gt;
Dec  9 06:12:42 atlas-oss4h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;510636.716395&amp;#93;&lt;/span&gt; LustreError: 16128:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff880fa8d40000&lt;br/&gt;
Dec  9 06:12:42 atlas-oss4h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;510636.745560&amp;#93;&lt;/span&gt; LustreError: 16127:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff880fc2d3a000&lt;br/&gt;
Dec  9 06:12:43 atlas-oss4h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;510636.784482&amp;#93;&lt;/span&gt; LustreError: 16128:0:(events.c:450:server_bulk_callback()) event type 3, status -5, desc ffff880fa8d40000&lt;br/&gt;
Dec  9 06:12:43 atlas-oss4h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;510636.815353&amp;#93;&lt;/span&gt; LustreError: 16127:0:(events.c:450:server_bulk_callback()) event type 3, status -5, desc ffff880fc2d3a000&lt;br/&gt;
Dec  9 06:12:43 atlas-oss4h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;510636.815385&amp;#93;&lt;/span&gt; LustreError: 61456:0:(ldlm_lib.c:2722:target_bulk_io()) @@@ network error on bulk GET 0(1048576)  req@ffff880fe2a0e400 x1453430544373389/t0(0) o4-&amp;gt;10c57078-ce10-72c6-b97d-e7c9a32a240c@10.36.202.142@o2ib:0/0 lens 448/448 e 0 to 0 dl 1386588141 ref 1 fl Interpret:/2/0 rc 0/0&lt;/p&gt;</comment>
                            <comment id="73139" author="blakecaldwell" created="Mon, 9 Dec 2013 21:59:06 +0000"  >&lt;p&gt;This seems very repeatable. If I try reading the file that the user process (rsync) is stuck on, I can see the same errors. It appears to be on OSS 10.36.225.185@o2ib, specifically OST009b or obdindex 155. I found this from the client dmesg. I&apos;m attaching a debug log from command 3 below, which hung, with the flags&lt;br/&gt;
ioctl neterror warning dlmtrace error emerg ha rpctrace config console&lt;/p&gt;


&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@dtn04 eq&amp;#93;&lt;/span&gt;# dmesg|tail&lt;br/&gt;
LustreError: 2398:0:(events.c:199:client_bulk_callback()) event type 1, status -103, desc ffff880838928000&lt;br/&gt;
Lustre: 2405:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1453958660133786 sent from atlas2-OST009b-osc-ffff880c37da6800 to NID 10.36.225.185@o2ib 20s ago has failed due to network error (567s prior to deadline).&lt;br/&gt;
  req@ffff880821b5a000 x1453958660133786/t0 o3-&amp;gt;atlas2-OST009b_UUID@10.36.225.185@o2ib:6/4 lens 448/592 e 0 to 1 dl 1386624465 ref 2 fl Rpc:/2/0 rc 0/0&lt;br/&gt;
Lustre: 2405:0:(client.c:1529:ptlrpc_expire_one_request()) Skipped 7 previous similar messages&lt;br/&gt;
Lustre: atlas2-OST009b-osc-ffff880c37da6800: Connection to service atlas2-OST009b via nid 10.36.225.185@o2ib was lost; in progress operations using this service will wait for recovery to complete.&lt;br/&gt;
Lustre: Skipped 7 previous similar messages&lt;br/&gt;
LustreError: 11-0: an error occurred while communicating with 10.36.225.185@o2ib. The ost_connect operation failed with -16&lt;br/&gt;
LustreError: Skipped 1 previous similar message&lt;/p&gt;


&lt;p&gt;1)&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@dtn04 ~&amp;#93;&lt;/span&gt;# lfs getstripe /lustre/atlas2/chm045/scratch/dermaid/eq370_50pct/eq/cer1_cer2_bena_chol_4x.eq370.0001.coor&lt;br/&gt;
/lustre/atlas2/chm045/scratch/dermaid/eq370_50pct/eq/cer1_cer2_bena_chol_4x.eq370.0001.coor&lt;br/&gt;
lmm_stripe_count:   4&lt;br/&gt;
lmm_stripe_size:    1048576&lt;br/&gt;
lmm_stripe_offset:  155&lt;br/&gt;
        obdidx           objid          objid            group&lt;br/&gt;
           155           16095         0x3edf                0&lt;br/&gt;
           156           16094         0x3ede                0&lt;br/&gt;
           157           16165         0x3f25                0&lt;br/&gt;
           158           16102         0x3ee6                0&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@dtn04 ~&amp;#93;&lt;/span&gt;# stat /lustre/atlas2/chm045/scratch/dermaid/eq370_50pct/eq/cer1_cer2_bena_chol_4x.eq370.0001.coor&lt;br/&gt;
  File: `/lustre/atlas2/chm045/scratch/dermaid/eq370_50pct/eq/cer1_cer2_bena_chol_4x.eq370.0001.coor&apos;&lt;br/&gt;
  Size: 20807164        Blocks: 40640      IO Block: 2097152 regular file&lt;br/&gt;
Device: 2586730ah/629568266d    Inode: 144117763932180829  Links: 1&lt;br/&gt;
Access: (0600/&lt;del&gt;rw&lt;/del&gt;------)  Uid: ( 9348/ dermaid)   Gid: (18645/ dermaid)&lt;br/&gt;
Access: 2013-12-09 13:12:46.000000000 -0500&lt;br/&gt;
Modify: 2013-11-30 13:45:37.000000000 -0500&lt;br/&gt;
Change: 2013-11-30 13:45:37.000000000 -0500&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@dtn04 ~&amp;#93;&lt;/span&gt;# ls -l /lustre/atlas2/chm045/scratch/dermaid/eq370_50pct/eq/cer1_cer2_bena_chol_4x.eq370.0001.coor&lt;br/&gt;
&lt;del&gt;rw&lt;/del&gt;------ 1 dermaid dermaid 20807164 Nov 30 13:45 /lustre/atlas2/chm045/scratch/dermaid/eq370_50pct/eq/cer1_cer2_bena_chol_4x.eq370.0001.coor&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@dtn04 ~&amp;#93;&lt;/span&gt;# file /lustre/atlas2/chm045/scratch/dermaid/eq370_50pct/eq/cer1_cer2_bena_chol_4x.eq370.0001.coor&lt;br/&gt;
^C^C&lt;/p&gt;

&lt;p&gt;2)&lt;br/&gt;
Now with a different file that works (no objects on obdindex 155):&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@dtn04 eq&amp;#93;&lt;/span&gt;# stat cer1_cer2_bena_chol_4x.eq370.0000.coor&lt;br/&gt;
  File: `cer1_cer2_bena_chol_4x.eq370.0000.coor&apos;&lt;br/&gt;
  Size: 20807164        Blocks: 40640      IO Block: 2097152 regular file&lt;br/&gt;
Device: 2586730ah/629568266d    Inode: 144117777421110925  Links: 1&lt;br/&gt;
Access: (0640/&lt;del&gt;rw-r&lt;/del&gt;----)  Uid: ( 9348/ dermaid)   Gid: (18645/ dermaid)&lt;br/&gt;
Access: 2013-12-09 15:49:01.000000000 -0500&lt;br/&gt;
Modify: 2013-11-28 14:15:42.000000000 -0500&lt;br/&gt;
Change: 2013-11-28 15:26:26.000000000 -0500&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@dtn04 eq&amp;#93;&lt;/span&gt;# file cer1_cer2_bena_chol_4x.eq370.0000.coor&lt;br/&gt;
cer1_cer2_bena_chol_4x.eq370.0000.coor: data&lt;/p&gt;

&lt;p&gt;3)&lt;br/&gt;
Now with another file that fails (object on obdindex 155)&lt;br/&gt;
cer1_cer2_bena_chol_4x.eq370.0001.xst&lt;br/&gt;
lmm_stripe_count:   4&lt;br/&gt;
lmm_stripe_size:    1048576&lt;br/&gt;
lmm_stripe_offset:  154&lt;br/&gt;
        obdidx           objid          objid            group&lt;br/&gt;
           154           16163         0x3f23                0&lt;br/&gt;
           155           16094         0x3ede                0&lt;br/&gt;
           156           16093         0x3edd                0&lt;br/&gt;
           157           16164         0x3f24                0&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@dtn04 eq&amp;#93;&lt;/span&gt;# stat cer1_cer2_bena_chol_4x.eq370.0001.xst&lt;br/&gt;
  File: `cer1_cer2_bena_chol_4x.eq370.0001.xst&apos;&lt;br/&gt;
  Size: 15610           Blocks: 32         IO Block: 2097152 regular file&lt;br/&gt;
Device: 2586730ah/629568266d    Inode: 144117763932180577  Links: 1&lt;br/&gt;
Access: (0600/&lt;del&gt;rw&lt;/del&gt;------)  Uid: ( 9348/ dermaid)   Gid: (18645/ dermaid)&lt;br/&gt;
Access: 2013-12-09 15:52:45.000000000 -0500&lt;br/&gt;
Modify: 2013-11-30 13:45:37.000000000 -0500&lt;br/&gt;
Change: 2013-11-30 13:45:37.000000000 -0500&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@dtn04 eq&amp;#93;&lt;/span&gt;# file cer1_cer2_bena_chol_4x.eq370.0001.xst&lt;/p&gt;</comment>
                            <comment id="73224" author="jamesanunez" created="Tue, 10 Dec 2013 20:17:44 +0000"  >&lt;p&gt;Increased priority to Major due to customer request/comments:&lt;/p&gt;

&lt;p&gt;&quot;There are files on the filesystem that are inaccessible and I would like to rule out lingering issues from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4309&quot; title=&quot;mds_intent_policy ASSERTION(new_lock != NULL) failed: op 0x8 lockh 0x0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4309&quot;&gt;&lt;del&gt;LU-4309&lt;/del&gt;&lt;/a&gt; as a cause. At this point, our first priority is to recover the files that seem to be causing the hang.&quot;&lt;/p&gt;
</comment>
                            <comment id="73437" author="bobijam" created="Fri, 13 Dec 2013 02:41:18 +0000"  >&lt;p&gt;Can we have logs from atlas-oss3b4, where atlas2-OST009b is located? It is still busy servicing a client BULK read that fails, and it is rejecting the same client&apos;s reconnection.&lt;/p&gt;</comment>
                            <comment id="73512" author="blakecaldwell" created="Fri, 13 Dec 2013 21:00:38 +0000"  >&lt;p&gt;Still servicing client BULK reads:&lt;br/&gt;
Dec 13 15:44:50 atlas-oss3b4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1476205.395677&amp;#93;&lt;/span&gt; LustreError: 11576:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 1 previous similar message&lt;br/&gt;
Dec 13 15:44:50 atlas-oss3b4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1476205.425212&amp;#93;&lt;/span&gt; Lustre: atlas2-OST009b: Bulk IO read error with 3064e3d3-408f-7c76-94af-ce52cde93078 (at 10.36.202.138@o2ib), client will retry: rc -110&lt;br/&gt;
Dec 13 15:44:50 atlas-oss3b4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1476205.466052&amp;#93;&lt;/span&gt; Lustre: Skipped 1 previous similar message&lt;br/&gt;
Dec 13 15:49:51 atlas-oss3b4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1476505.278112&amp;#93;&lt;/span&gt; LustreError: 25256:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff88070f632000&lt;br/&gt;
Dec 13 15:54:32 atlas-oss3b4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1476785.683451&amp;#93;&lt;/span&gt; Lustre: atlas2-OST009b: Client 3064e3d3-408f-7c76-94af-ce52cde93078 (at 10.36.202.138@o2ib) reconnecting&lt;br/&gt;
Dec 13 15:54:32 atlas-oss3b4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1476785.709484&amp;#93;&lt;/span&gt; Lustre: Skipped 9 previous similar messages&lt;br/&gt;
Dec 13 15:54:52 atlas-oss3b4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1476805.464059&amp;#93;&lt;/span&gt; LustreError: 25256:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff880717a48000&lt;br/&gt;
Dec 13 15:54:52 atlas-oss3b4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1476805.585964&amp;#93;&lt;/span&gt; Lustre: atlas2-OST009b: Client 3064e3d3-408f-7c76-94af-ce52cde93078 (at 10.36.202.138@o2ib) refused reconnection, still busy with 1 active RPCs&lt;br/&gt;
Dec 13 15:54:52 atlas-oss3b4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1476805.626704&amp;#93;&lt;/span&gt; Lustre: Skipped 1 previous similar message&lt;br/&gt;
Dec 13 15:54:52 atlas-oss3b4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1476805.704087&amp;#93;&lt;/span&gt; LustreError: 11607:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk PUT  req@ffff8807613a7c00 x1453958693735860/t0(0) o3-&amp;gt;3064e3d3-408f-7c76-94af-ce52cde93078@10.36.202.138@o2ib:0/0 lens 448/432 e 0 to 0 dl 1386968389 ref 1 fl Interpret:/2/0 rc 0/0&lt;br/&gt;
Dec 13 15:54:52 atlas-oss3b4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1476805.777371&amp;#93;&lt;/span&gt; LustreError: 11607:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 1 previous similar message&lt;br/&gt;
Dec 13 15:54:52 atlas-oss3b4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1476805.815244&amp;#93;&lt;/span&gt; Lustre: atlas2-OST009b: Bulk IO read error with 3064e3d3-408f-7c76-94af-ce52cde93078 (at 10.36.202.138@o2ib), client will retry: rc -110&lt;/p&gt;</comment>
                            <comment id="73645" author="niu" created="Tue, 17 Dec 2013 03:21:44 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Just yesterday we identified 2 bad IB cables with high symbol error rates in our fabric that have since been disconnected. They were likely the cause for at least one of the issues.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;It looks like these two bad cables caused the following series of problems? Were the problems not resolved even after you replaced the cables?&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Dec 9 06:12:43 atlas-oss4h4 kernel: [510636.815385] LustreError: 61456:0:(ldlm_lib.c:2722:target_bulk_io()) @@@ network error on bulk GET 0(1048576) req@ffff880fe2a0e400 x1453430544373389/t0(0) o4-&amp;gt;10c57078-ce10-72c6-b97d-e7c9a32a240c@10.36.202.142@o2ib:0/0 lens 448/448 e 0 to 0 dl 1386588141 ref 1 fl Interpret:/2/0 rc 0/0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Bulk reads always fail for some reason; I think that is why the client fell into the loop of: bulk read timeout -&amp;gt; reconnect -&amp;gt; bulk read timeout -&amp;gt; reconnect ... Could you check whether the connection between the client and OST009b is healthy?&lt;/p&gt;</comment>
                            <comment id="73669" author="blakecaldwell" created="Tue, 17 Dec 2013 13:23:09 +0000"  >&lt;p&gt;We tried many different 1.8.9/RHEL6 clients after the bad links were removed. They all fall into the same loop when writing to various OSTs. It does not appear to always be OST0009b, but it happens frequently enough that the hangs are very common. When sweeping through all of the OSTs, another client hung on OST0024, for example.&lt;/p&gt;

&lt;p&gt;The example shown here was repeated after the cables were replaced.  After the case had been opened, the system was rebooted and a particular user tried again to run the same rsync command that triggered the OST0009b issues. The most recent logs from the lustre server are from the recurrence of the issue.&lt;/p&gt;

&lt;p&gt;Strangely, this happens only on our 1.8.9 systems. 2.4 clients can access these files without issue, and a test writing to files striped on each OST in the filesystem runs to completion. The same test on a 1.8.9 client will hang on OST0024, as noted above.&lt;/p&gt;</comment>
                            <comment id="73676" author="niu" created="Tue, 17 Dec 2013 14:31:11 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Strangely this is only on our 1.8.9 systems. 2.4 clients can access these files without issue and a test writing to files striped on each OST in the filesystem run to completion. The same test on a 1.8.9 client will hang on OST0024 as noted above.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Do you use different routers for the 1.8 &amp;amp; 2.4 clients?&lt;/p&gt;

&lt;p&gt;Liang, is there any simple way to check if the network between client and server is sane? Could you advise?&lt;/p&gt;</comment>
                            <comment id="73695" author="blakecaldwell" created="Tue, 17 Dec 2013 16:28:56 +0000"  >&lt;p&gt;The test cases in this LU are all with clients on the same NI as the servers &amp;#8211; no routers.&lt;/p&gt;</comment>
                            <comment id="73734" author="niu" created="Wed, 18 Dec 2013 07:26:50 +0000"  >&lt;p&gt;Could you run lnet-selftest to check the network between the problematic OST and the 1.8 clients?&lt;/p&gt;

&lt;p&gt;I have posted the instructions for how to use lnet-selftest below, and the wrapper script is attached:&lt;/p&gt;

&lt;p&gt;= Preparation =&lt;/p&gt;

&lt;p&gt;The LNET Selftest kernel module must be installed and loaded on all targets in the test before the application is started. Identify the set of all systems that will participate in a session and ensure that the kernel module has been loaded. To load the kernel module:&lt;/p&gt;

&lt;p&gt;modprobe lnet_selftest&lt;/p&gt;

&lt;p&gt;Dependencies are automatically resolved and loaded by modprobe. This will make sure all the necessary modules are loaded: libcfs, lnet, lnet_selftest and any one of the klnds (kernel Lustre network drivers, e.g. ksocklnd, ko2iblnd, etc.).&lt;/p&gt;

&lt;p&gt;Identify a &quot;console&quot; node from which to conduct the tests. This is the single system from which all LNET selftest commands will be executed. The console node owns the LNET selftest session and there should be only one active session on the network at any given time (strictly speaking one can run several LNET selftest sessions in parallel across a network but this is generally discouraged unless the sessions are carefully isolated).&lt;/p&gt;

&lt;p&gt;It is strongly recommended that a survey and analysis of raw network performance between the target systems is carried out prior to running the LNET selftest benchmark. This will help to identify and measure any performance overhead introduced by LNET. The HPDD SE team has recently been evaluating Netperf for this purpose on TCP/IP-based networks with good results. Refer to the HPDD SE Netperf page for details on how to manage this exercise.&lt;/p&gt;

&lt;p&gt;= Using the Wrapper Script =&lt;/p&gt;

&lt;p&gt;Use the LNET Selftest wrapper to execute the test cases referenced in this document. The header of the script has some variables that need to be set in accordance with the target environment. Without changes, the script is very unlikely to operate correctly, if at all. Here is a listing of the header:&lt;/p&gt;

&lt;p&gt;Single Client Throughput &amp;#8211; LNET Selftest Read (2 Nodes, 1:1)&lt;br/&gt;
Used to establish point-to-point unidirectional read performance between two nodes.&lt;br/&gt;
Set the wrapper up as follows:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;#Output file
ST=lst-output-$(date +%Y-%m-%d-%H:%M:%S)
 
# Concurrency
CN=32
#Size
SZ=1M
# Length of time to run test (secs)
TM=60
# Which BRW test to run (read or write)
BRW=read
# Checksum calculation (simple or full)
CKSUM=simple
# The LST &lt;span class=&quot;code-quote&quot;&gt;&quot;from&quot;&lt;/span&gt; list -- e.g. Lustre clients. Space separated list of NIDs.
LFROM=&lt;span class=&quot;code-quote&quot;&gt;&quot;10.73.2.21@tcp&quot;&lt;/span&gt;
# The LST &lt;span class=&quot;code-quote&quot;&gt;&quot;to&quot;&lt;/span&gt; list -- e.g. Lustre servers. Space separated list of NIDs.
LTO=&lt;span class=&quot;code-quote&quot;&gt;&quot;10.73.2.22@tcp&quot;&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;CN: the concurrency setting simulates the number of threads performing communication. The LNET Selftest default is 1, which is not enough to properly exercise the connection. Set to at least 16, but experiment with higher values (32 or 64 being reasonable choices).&lt;br/&gt;
SZ: the size setting determines the size of the IO transaction. For bandwidth (throughput) measurements, use 1M.&lt;br/&gt;
TM: test time in seconds &amp;#8211; how long to run the benchmark for. Set to a reasonable number to ensure collection of sufficient data to extrapolate a meaningful average (at least 60 seconds).&lt;br/&gt;
BRW: the Bulk Read/Write test to use. There are only two choices: &quot;read&quot; or &quot;write&quot;.&lt;br/&gt;
CKSUM: The checksum checking method. Choose either &quot;simple&quot; or &quot;full&quot;.&lt;br/&gt;
LFROM: a space-separated list of NIDs that represent the &quot;from&quot; list (or source) in LNET Selftest. This is often a set of clients.&lt;br/&gt;
LTO: a space-separated list of NIDs that represent the &quot;to&quot; list (or destination) in LNET Selftest. This is often a set of servers.&lt;/p&gt;

&lt;p&gt;Change the LFROM and LTO lists as required.&lt;/p&gt;

&lt;p&gt;Run the script several times, changing the concurrency setting at the start of every new run. Use the sequence 1, 2, 4, 8, 16, 32, 64, 128. Modify the output filename for each run so that it is clear what results have been captured into each file.&lt;/p&gt;</comment>
                            <comment id="73743" author="blakecaldwell" created="Wed, 18 Dec 2013 12:56:51 +0000"  >&lt;p&gt;Hi Niu,&lt;/p&gt;

&lt;p&gt;Thanks for the suggestion. We are open to trying LNET selftest, but while the system is in production, we would expect other client activity to the servers mentioned here to interfere with the LNET selftest and vice versa. Do you expect that we&apos;d see an issue at a low concurrency setting, such that we wouldn&apos;t be subject to contention from other clients? If so, then the results will be meaningful. If not, we will need to wait 2 weeks before we have an interval where we can perform isolated testing. Please advise.&lt;/p&gt;

&lt;p&gt;Another thought to address whether there is an issue with the network:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;verify issue is still present with selected client running 1.8.9 client&lt;/li&gt;
	&lt;li&gt;run an rsync across a large directory such that objects from every OST are read/written&lt;/li&gt;
	&lt;li&gt;upgrade selected client to 2.4.1&lt;/li&gt;
	&lt;li&gt;repeat rsync test and look for any signs of network loss&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;What to look for:&lt;br/&gt;
When the issue with the 1.8.9 clients occurs, it is very obvious from the number of Lustre error messages in our server logs. I&apos;d consider the absence of these messages to be a sign of a successful test.&lt;/p&gt;

&lt;p&gt;Would this be useful or is it too high level?&lt;/p&gt;</comment>
                            <comment id="73745" author="niu" created="Wed, 18 Dec 2013 13:42:31 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Thanks for the suggestion. We are open to trying LNET selftest, but while the system is in production, we would expect other client activity to the servers mentioned here to interfere with the LNET selftest and vice versa. Do you expect that we&apos;d see an issue at a low concurrency setting, such that we wouldn&apos;t be subject to contention from other clients? If so, then the results will be meaningful. If not, we will need to wait 2 weeks before we have an interval where we can perform isolated testing. Please advise.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Was there heavy load on the servers when the bulk IO timeout was seen? If the problem always happens regardless of server load, I would think low-concurrency testing is meaningful.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Another thought to address whether there is an issue with the network:&lt;br/&gt;
verify issue is still present with selected client running 1.8.9 client&lt;br/&gt;
run rsync application across a large directory such that objects from every OST are read/written.&lt;br/&gt;
upgrade selected client to 2.4.1&lt;br/&gt;
repeat rsync test and look for any signs of network loss&lt;/p&gt;

&lt;p&gt;What to look for:&lt;br/&gt;
When the issue with the 1.8.9 clients occurs, it is very obvious from the number of Lustre error messages in our server logs. I&apos;d consider the absence of these messages to be a sign of a successful test.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Yes, I think this approach can help us determine whether it&apos;s a network problem or a Lustre version problem.&lt;/p&gt;</comment>
                            <comment id="73746" author="liang" created="Wed, 18 Dec 2013 14:10:48 +0000"  >&lt;p&gt;I&apos;m afraid it might not be very helpful to try selftest if there is other activity on the client; it&apos;s better to have a &quot;neat&quot; run to diagnose a network issue.&lt;br/&gt;
Niu, could it relate to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-793&quot; title=&quot;Reconnections should not be refused when there is a request in progress from this client.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-793&quot;&gt;&lt;del&gt;LU-793&lt;/del&gt;&lt;/a&gt;? The patch is already in master but not in the 2.4 releases.&lt;/p&gt;</comment>
                            <comment id="73763" author="simmonsja" created="Wed, 18 Dec 2013 16:55:09 +0000"  >&lt;p&gt;This is happening with 1.8 clients. I looked at the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-793&quot; title=&quot;Reconnections should not be refused when there is a request in progress from this client.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-793&quot;&gt;&lt;del&gt;LU-793&lt;/del&gt;&lt;/a&gt; patch to see if I could backport it and give it a try. The problem is that target_bulk_io doesn&apos;t even exist in 1.8! I&apos;m not familiar enough with 1.8 internals to do a proper backport of this magnitude.&lt;/p&gt;</comment>
                            <comment id="73786" author="blakecaldwell" created="Wed, 18 Dec 2013 19:38:01 +0000"  >&lt;p&gt;As an update to this case, we tried running the rsync test with a lustre 1.8.9 client, but on an FDR ConnectX-3 card and could not reproduce the issue. Right now it appears to be an issue with the IB card or associated driver. It does not appear to be a lustre issue.&lt;/p&gt;</comment>
                            <comment id="74008" author="blakecaldwell" created="Mon, 23 Dec 2013 03:46:42 +0000"  >&lt;p&gt;Unfortunately, the issue is still present independent of the IB hardware.&lt;/p&gt;

&lt;p&gt;I&apos;m unsure what contributed to the string of tests succeeding, but we are now back in the same pattern of reconnects on many different 1.8.9 clients. When this happens, processes that access the filesystem remain in the D state and the log messages given above repeat.&lt;/p&gt;

&lt;p&gt;I believe the best way to pick this up is to try the 2.4 client on identical hardware for comparison, and also to run an LNET selftest on a dedicated client. We would have to deal with other activity on the server, but if I read Liang&apos;s comment correctly, the worry is client activity.&lt;/p&gt;</comment>
                            <comment id="74009" author="blakecaldwell" created="Mon, 23 Dec 2013 04:19:35 +0000"  >&lt;p&gt;Evicting the client from the OST it is connect-cycling with causes an LBUG:&lt;/p&gt;

&lt;p&gt;&amp;lt;3&amp;gt;LustreError: 167-0: This client was evicted by atlas2-OST009b; in progress operations using this service will fail.&lt;br/&gt;
&amp;lt;0&amp;gt;LustreError: 2405:0:(osc_request.c:2357:brw_interpret()) ASSERTION(!(aa-&amp;gt;aa_oa-&amp;gt;o_valid &amp;amp; OBD_MD_FLHANDLE)) failed&lt;br/&gt;
&amp;lt;3&amp;gt;LustreError: 17528:0:(rw.c:1339:ll_issue_page_read()) page ffffea001ba4b0d0 map ffff880839a9aa30 index 0 flags c0000000000821 count 5 priv ffff880822484d80: read queue failed: rc -5&lt;br/&gt;
&amp;lt;0&amp;gt;LustreError: 2405:0:(osc_request.c:2357:brw_interpret()) LBUG&lt;br/&gt;
&amp;lt;4&amp;gt;Pid: 2405, comm: ptlrpcd&lt;br/&gt;
&amp;lt;4&amp;gt;&lt;br/&gt;
&amp;lt;4&amp;gt;Call Trace:&lt;br/&gt;
&amp;lt;1&amp;gt;BUG: unable to handle kernel NULL pointer dereference at (null)&lt;br/&gt;
&amp;lt;1&amp;gt;IP: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;(null)&amp;gt;&amp;#93;&lt;/span&gt; (null)&lt;br/&gt;
&amp;lt;4&amp;gt;PGD 81984f067 PUD 8392fc067 PMD 0&lt;br/&gt;
&amp;lt;4&amp;gt;Oops: 0010 &lt;a href=&quot;#1&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;1&lt;/a&gt; SMP&lt;br/&gt;
&amp;lt;4&amp;gt;last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power1_average_interval&lt;br/&gt;
&amp;lt;4&amp;gt;CPU 10&lt;br/&gt;
&amp;lt;4&amp;gt;Modules linked in: iptable_nat nf_nat mptctl mptbase nfs lockd fscache auth_rpcgss nfs_acl mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ptl&lt;br/&gt;
rpc(U) obdclass(U) lvfs(U) ko2iblnd(U) lnet(U) libcfs(U) autofs4 sunrpc nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_REJECT xt_comment nf_c&lt;br/&gt;
onntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_add&lt;br/&gt;
r ipv6 tcp_bic power_meter sg mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core hpilo hpwdt bnx2 myri10ge(U) dca microcode serio_raw k10temp amd64_e&lt;br/&gt;
dac_mod edac_core edac_mce_amd i2c_piix4 shpchp ext4 jbd2 mbcache sd_mod crc_t10dif hpsa pata_acpi ata_generic pata_atiixp ahci radeon ttm drm_km&lt;br/&gt;
s_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod &lt;span class=&quot;error&quot;&gt;&amp;#91;last unloaded: scsi_wait_scan&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt;&lt;br/&gt;
&amp;lt;4&amp;gt;Pid: 2405, comm: ptlrpcd Not tainted 2.6.32-358.23.2.el6.x86_64 #1 HP ProLiant DL385 G7&lt;br/&gt;
&amp;lt;4&amp;gt;RIP: 0010:&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;0000000000000000&amp;gt;&amp;#93;&lt;/span&gt;  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;(null)&amp;gt;&amp;#93;&lt;/span&gt; (null)&lt;br/&gt;
&amp;lt;4&amp;gt;RSP: 0018:ffff88041b9ddb48  EFLAGS: 00010246&lt;br/&gt;
&amp;lt;4&amp;gt;RAX: ffff88041b9ddbac RBX: ffff88041b9ddba0 RCX: ffffffffa0558260&lt;br/&gt;
&amp;lt;4&amp;gt;RDX: ffff88041b9ddbe0 RSI: ffff88041b9ddba0 RDI: ffff88041b9dc000&lt;br/&gt;
&amp;lt;4&amp;gt;RBP: ffff88041b9ddbe0 R08: 0000000000000000 R09: 0000000000000000&lt;br/&gt;
&amp;lt;4&amp;gt;R10: 0000000000000003 R11: 0000000000000000 R12: 000000000000cbe0&lt;br/&gt;
&amp;lt;4&amp;gt;R13: ffffffffa0558260 R14: 0000000000000000 R15: ffff88044e483fc0&lt;br/&gt;
&amp;lt;4&amp;gt;FS:  00007f0e92ef9700(0000) GS:ffff88044e480000(0000) knlGS:00000000f77026c0&lt;br/&gt;
&amp;lt;4&amp;gt;CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b&lt;br/&gt;
&amp;lt;4&amp;gt;CR2: 0000000000000000 CR3: 0000000815d93000 CR4: 00000000000007e0&lt;br/&gt;
&amp;lt;4&amp;gt;DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000&lt;br/&gt;
&amp;lt;4&amp;gt;DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400&lt;br/&gt;
&amp;lt;4&amp;gt;Process ptlrpcd (pid: 2405, threadinfo ffff88041b9dc000, task ffff880438ed3500)&lt;br/&gt;
&amp;lt;4&amp;gt;Stack:&lt;br/&gt;
&amp;lt;4&amp;gt; ffffffff8100e4a0 ffff88041b9ddbac ffff880438ed3500 ffffffffa07cdf78&lt;br/&gt;
&amp;lt;4&amp;gt;&amp;lt;d&amp;gt; 00000000a07ce9a8 ffff88041b9dc000 ffff88041b9ddfd8 ffff88041b9dc000&lt;br/&gt;
&amp;lt;4&amp;gt;&amp;lt;d&amp;gt; 000000000000000a ffff88044e480000 ffff88041b9ddbe0 ffff88041b9ddbb0&lt;br/&gt;
&amp;lt;4&amp;gt;Call Trace:&lt;br/&gt;
&amp;lt;4&amp;gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100e4a0&amp;gt;&amp;#93;&lt;/span&gt; ? dump_trace+0x190/0x3b0&lt;br/&gt;
&amp;lt;4&amp;gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa054c835&amp;gt;&amp;#93;&lt;/span&gt; libcfs_debug_dumpstack+0x55/0x80 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa054ce65&amp;gt;&amp;#93;&lt;/span&gt; lbug_with_loc+0x75/0xe0 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa05555d6&amp;gt;&amp;#93;&lt;/span&gt; libcfs_assertion_failed+0x66/0x70 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa07c43ff&amp;gt;&amp;#93;&lt;/span&gt; brw_interpret+0xcff/0xe90 &lt;span class=&quot;error&quot;&gt;&amp;#91;osc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810097cc&amp;gt;&amp;#93;&lt;/span&gt; ? __switch_to+0x1ac/0x320&lt;br/&gt;
&amp;lt;4&amp;gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8105ea69&amp;gt;&amp;#93;&lt;/span&gt; ? find_busiest_queue+0x69/0x150&lt;br/&gt;
&amp;lt;4&amp;gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0702a9a&amp;gt;&amp;#93;&lt;/span&gt; ptlrpc_check_set+0x24a/0x16b0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81081b5b&amp;gt;&amp;#93;&lt;/span&gt; ? try_to_del_timer_sync+0x7b/0xe0&lt;br/&gt;
&amp;lt;4&amp;gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81081be2&amp;gt;&amp;#93;&lt;/span&gt; ? del_timer_sync+0x22/0x30&lt;br/&gt;
&amp;lt;4&amp;gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa07397ad&amp;gt;&amp;#93;&lt;/span&gt; ptlrpcd_check+0x18d/0x270 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0739a50&amp;gt;&amp;#93;&lt;/span&gt; ptlrpcd+0x160/0x270 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81063990&amp;gt;&amp;#93;&lt;/span&gt; ? default_wake_function+0x0/0x20&lt;br/&gt;
&amp;lt;4&amp;gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100c0ca&amp;gt;&amp;#93;&lt;/span&gt; child_rip+0xa/0x20&lt;br/&gt;
&amp;lt;4&amp;gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa07398f0&amp;gt;&amp;#93;&lt;/span&gt; ? ptlrpcd+0x0/0x270 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100c0c0&amp;gt;&amp;#93;&lt;/span&gt; ? child_rip+0x0/0x20&lt;br/&gt;
&amp;lt;4&amp;gt;Code:  Bad RIP value.&lt;br/&gt;
&amp;lt;1&amp;gt;RIP  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;(null)&amp;gt;&amp;#93;&lt;/span&gt; (null)&lt;br/&gt;
&amp;lt;4&amp;gt; RSP &amp;lt;ffff88041b9ddb48&amp;gt;&lt;br/&gt;
&amp;lt;4&amp;gt;CR2: 0000000000000000&lt;/p&gt;</comment>
                            <comment id="74011" author="niu" created="Mon, 23 Dec 2013 06:51:15 +0000"  >&lt;p&gt;The ASSERTION(!(aa-&amp;gt;aa_oa-&amp;gt;o_valid &amp;amp; OBD_MD_FLHANDLE)) in brw_interpret was introduced by &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2703&quot; title=&quot;racer: BUG: soft lockup - CPU#0 stuck for 67s! [dd:1404]&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2703&quot;&gt;&lt;del&gt;LU-2703&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Hongchao, could you take a look? What&apos;s the intention of this assert?&lt;/p&gt;</comment>
                            <comment id="74152" author="hongchao.zhang" created="Mon, 30 Dec 2013 02:19:12 +0000"  >&lt;p&gt;Hi Niu, this issue is tracked in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3067&quot; title=&quot;ASSERTION(!(aa-&amp;gt;aa_oa-&amp;gt;o_valid &amp;amp; OBD_MD_FLHANDLE))&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3067&quot;&gt;&lt;del&gt;LU-3067&lt;/del&gt;&lt;/a&gt;, and there is a patch there. Thanks!&lt;/p&gt;</comment>
                            <comment id="74557" author="blakecaldwell" created="Wed, 8 Jan 2014 13:20:42 +0000"  >&lt;p&gt;I wanted to update this case: we are still experiencing the timeouts; however, the patch for LU-2703 took care of the LBUG. That has significantly sped up the process of reproducing the error under different software stacks and on different systems. We noticed that 1.8.6 clients on SLES do not have this issue, so we are trying to roll back the RHEL clients to earlier Lustre versions.&lt;/p&gt;</comment>
                            <comment id="74568" author="blakecaldwell" created="Wed, 8 Jan 2014 16:42:40 +0000"  >&lt;p&gt;The failure was present with 1.8.8. Below are the logs from starting the Lustre 1.8.8 client, mounting the filesystems, and then running file * in the directory that has been causing issues. The symptoms were the same: the application would hang. Recognizing the problem, I went to unmount in order to try the next case (this has never worked in the past; before &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3067&quot; title=&quot;ASSERTION(!(aa-&amp;gt;aa_oa-&amp;gt;o_valid &amp;amp; OBD_MD_FLHANDLE))&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3067&quot;&gt;&lt;del&gt;LU-3067&lt;/del&gt;&lt;/a&gt;, we had to reboot the client). As usual, the unmount failed, but the logs have some different messages with 1.8.8.&lt;/p&gt;

&lt;p&gt;### data mismatch with ino 144115188193296385/0&lt;/p&gt;


&lt;p&gt;Does that tell us anything new?&lt;/p&gt;

&lt;p&gt;Jan  8 08:33:17 client kernel: Lustre: Listener bound to ib0:10.36.202.142:987:mlx4_0&lt;br/&gt;
Jan  8 08:33:17 client kernel: Lustre: Added LNI 10.36.202.142@o2ib &lt;span class=&quot;error&quot;&gt;&amp;#91;63/2560/0/180&amp;#93;&lt;/span&gt;&lt;br/&gt;
Jan  8 08:33:17 client kernel: Lustre: Build Version: lustre/scripts--PRISTINE-2.6.32-358.23.2.el6.x86_64&lt;br/&gt;
Jan  8 08:33:18 client kernel: Lustre: Lustre Client File System; &lt;a href=&quot;http://www.lustre.org/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://www.lustre.org/&lt;/a&gt;&lt;br/&gt;
Jan  8 08:33:18 client kernel: Lustre: MGC10.36.226.77@o2ib: Reactivating import&lt;br/&gt;
Jan  8 08:33:18 client kernel: Lustre: Server MGS version (2.4.1.0) is much newer than client version (1.8.8)&lt;br/&gt;
Jan  8 08:33:18 client kernel: Lustre: 8437:0:(obd_config.c:1127:class_config_llog_handler()) skipping &apos;lmv&apos; config: cmd=cf001,clilmv:lmv&lt;br/&gt;
Jan  8 08:33:18 client kernel: Lustre: 8437:0:(obd_config.c:1127:class_config_llog_handler()) skipping &apos;lmv&apos; config: cmd=cf014,clilmv:linkfarm-MDT0000_UUID&lt;br/&gt;
Jan  8 08:33:18 client kernel: Lustre: Server linkfarm-MDT0000_UUID version (2.4.1.0) is much newer than client version (1.8.8)&lt;br/&gt;
Jan  8 08:33:18 client kernel: Lustre: client supports 64-bits dir hash/offset!&lt;br/&gt;
Jan  8 08:33:18 client kernel: Lustre: Client linkfarm-client(ffff8803f4fd1000) mount complete&lt;br/&gt;
Jan  8 08:33:19 client kernel: Lustre: Server atlas1-MDT0000_UUID version (2.4.1.0) is much newer than client version (1.8.8)&lt;br/&gt;
Jan  8 08:33:19 client kernel: Lustre: client supports 64-bits dir hash/offset!&lt;br/&gt;
Jan  8 08:33:19 client kernel: Lustre: Client atlas1-client(ffff8803f4fd7400) mount complete&lt;br/&gt;
Jan  8 08:33:19 client kernel: Lustre: 8445:0:(obd_config.c:1127:class_config_llog_handler()) skipping &apos;lmv&apos; config: cmd=cf001,clilmv:lmv&lt;br/&gt;
Jan  8 08:33:21 client kernel: Lustre: Server atlas2-MDT0000_UUID version (2.4.1.0) is much newer than client version (1.8.8)&lt;br/&gt;
Jan  8 08:33:21 client kernel: Lustre: client supports 64-bits dir hash/offset!&lt;br/&gt;
Jan  8 08:33:21 client kernel: Lustre: Client atlas2-client(ffff880c397d6400) mount complete&lt;br/&gt;
Jan  8 08:33:22 client kernel: Lustre: MGC10.36.227.200@o2ib: Reactivating import&lt;br/&gt;
Jan  8 08:33:22 client kernel: Lustre: Client widow1-client(ffff880fd01dec00) mount complete&lt;br/&gt;
Jan  8 08:35:14 client kernel: LustreError: 8410:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff88080e1c4000&lt;br/&gt;
Jan  8 08:35:14 client kernel: Lustre: setting import linkfarm-MDT0000_UUID INACTIVE by administrator request&lt;br/&gt;
Jan  8 08:35:14 client kernel: LustreError: 9877:0:(namei.c:256:ll_mdc_blocking_ast()) ### data mismatch with ino 144115188193296385/0 (ffff88039adf6cd0) ns: linkfarm-MDT0000-mdc-ffff8803f4fd1000 lock: ffff88082165bc00/0x&lt;br/&gt;
f06137b3b58e545f lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 2 type: IBT flags: 0x2002c90 remote: 0xdca99b6a04cb5ba3 expref: -99 pid: 8466 timeout: 0&lt;br/&gt;
Jan  8 08:35:14 client kernel: Lustre: setting import linkfarm-OST0000_UUID INACTIVE by administrator request&lt;br/&gt;
Jan  8 08:35:14 client kernel: Lustre: client linkfarm-client(ffff8803f4fd1000) umount complete&lt;br/&gt;
Jan  8 08:35:14 client kernel: LustreError: 9879:0:(namei.c:256:ll_mdc_blocking_ast()) ### data mismatch with ino 144115188193296385/0 (ffff8804249bad10) ns: atlas1-MDT0000-mdc-ffff8803f4fd7400 lock: ffff88082165ba00/0xf06137b3b58e5466 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 2 type: IBT flags: 0x2002c90 remote: 0xdff80ba35f064331 expref: -99 pid: 8466 timeout: 0&lt;br/&gt;
Jan  8 08:35:16 client kernel: Lustre: client atlas1-client(ffff8803f4fd7400) umount complete&lt;br/&gt;
Jan  8 08:35:16 client kernel: Lustre: setting import atlas2-MDT0000_UUID INACTIVE by administrator request&lt;br/&gt;
Jan  8 08:35:16 client kernel: LustreError: 9881:0:(namei.c:256:ll_mdc_blocking_ast()) ### data mismatch with ino 144115188193296385/0 (ffff880c39246cd0) ns: atlas2-MDT0000-mdc-ffff880c397d6400 lock: ffff88082165b800/0xf06137b3b58e546d lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 2 type: IBT flags: 0x2002c90 remote: 0x7f4ba8a4cefab31f expref: -99 pid: 8466 timeout: 0&lt;br/&gt;
Jan  8 08:35:16 client kernel: LustreError: 2866:0:(mdc_locks.c:653:mdc_enqueue()) ldlm_cli_enqueue error: -108&lt;br/&gt;
Jan  8 08:35:16 client kernel: LustreError: 2866:0:(file.c:3331:ll_inode_revalidate_fini()) failure -108 inode 144115188193296385&lt;br/&gt;
Jan  8 08:35:17 client kernel: Lustre: client widow-client(ffff88081da07c00) umount complete&lt;br/&gt;
Jan  8 08:35:17 client kernel: Lustre: setting import widow2-MDT0000_UUID INACTIVE by administrator request&lt;br/&gt;
Jan  8 08:35:18 client kernel: Lustre: client widow2-client(ffff88038e15b400) umount complete&lt;br/&gt;
Jan  8 08:35:18 client kernel: LustreError: 10316:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway&lt;br/&gt;
Jan  8 08:35:18 client kernel: LustreError: 10316:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108&lt;br/&gt;
Jan  8 08:35:18 client kernel: LustreError: 10316:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway&lt;br/&gt;
Jan  8 08:35:18 client kernel: LustreError: 10316:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108&lt;br/&gt;
Jan  8 08:35:18 client kernel: LustreError: 10318:0:(mdc_locks.c:653:mdc_enqueue()) ldlm_cli_enqueue error: -108&lt;br/&gt;
Jan  8 08:35:18 client kernel: LustreError: 10318:0:(file.c:3331:ll_inode_revalidate_fini()) failure -108 inode 144115188193296385&lt;br/&gt;
Jan  8 08:35:32 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108&lt;br/&gt;
Jan  8 08:35:48 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108&lt;br/&gt;
Jan  8 08:36:04 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108&lt;br/&gt;
Jan  8 08:36:19 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108&lt;br/&gt;
Jan  8 08:36:35 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108&lt;br/&gt;
Jan  8 08:36:50 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108&lt;br/&gt;
Jan  8 08:37:06 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108&lt;br/&gt;
Jan  8 08:37:37 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108&lt;br/&gt;
Jan  8 08:38:24 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108&lt;br/&gt;
Jan  8 08:39:42 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108&lt;br/&gt;
Jan  8 08:42:02 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108&lt;/p&gt;</comment>
                            <comment id="74627" author="niu" created="Thu, 9 Jan 2014 02:25:59 +0000"  >&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Jan 8 08:35:14 client kernel: LustreError: 9877:0:(namei.c:256:ll_mdc_blocking_ast()) ### data mismatch with ino 144115188193296385/0 (ffff88039adf6cd0) ns: linkfarm-MDT0000-mdc-ffff8803f4fd1000 lock: ffff88082165bc00/0x
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1488&quot; title=&quot;2.1.2 servers, 1.8.8 clients _mdc_blocking_ast()) ### data mismatch with ino&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1488&quot;&gt;&lt;del&gt;LU-1488&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="75396" author="blakecaldwell" created="Tue, 21 Jan 2014 23:01:32 +0000"  >&lt;p&gt;The problem has been solved. We found the issue within the IB fabric and fixed it by adjusting the weights of certain links. Routes between certain OSSes and clients were taking an unexpected path through an older DDR switch with an MTU of 2048. Since the hosts with the RHEL6 OFED, along with the newer switches, supported an MTU of 4096, this worked fine most of the time. Setting the ko2iblnd parameter ib_mtu=2048 provided instant relief before the routes could be adjusted.&lt;/p&gt;

&lt;p&gt;Thanks for the help narrowing the issue down on this case.&lt;/p&gt;</comment>
                            <comment id="75413" author="jamesanunez" created="Wed, 22 Jan 2014 06:22:46 +0000"  >&lt;p&gt;Blake, &lt;/p&gt;

&lt;p&gt;Thank you for the update. We&apos;re glad the root cause of the problem was found.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="12244">LU-793</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="13892" name="kiblnd_timeout_client_dtn04.log" size="46257" author="blakecaldwell" created="Sun, 8 Dec 2013 18:16:54 +0000"/>
                            <attachment id="13932" name="lst-wrapper.sh" size="1215" author="niu" created="Wed, 18 Dec 2013 07:27:36 +0000"/>
                            <attachment id="13896" name="lustre_dtn-sch01_0.log.gz" size="3138710" author="blakecaldwell" created="Sun, 8 Dec 2013 18:17:26 +0000"/>
                            <attachment id="13895" name="lustre_dtn04_0.log.gz" size="1526145" author="blakecaldwell" created="Sun, 8 Dec 2013 18:17:26 +0000"/>
                            <attachment id="13900" name="lustre_dtn04_20131209_fail.log.gz" size="557246" author="blakecaldwell" created="Mon, 9 Dec 2013 22:00:08 +0000"/>
                            <attachment id="13919" name="lustre_logs_atlas-oss3b4_flush.log.gz" size="269" author="blakecaldwell" created="Fri, 13 Dec 2013 20:59:12 +0000"/>
                            <attachment id="13893" name="stale_fh_loop_client_dtn-sch01.log" size="25390" author="blakecaldwell" created="Sun, 8 Dec 2013 18:16:54 +0000"/>
                            <attachment id="13894" name="stale_fh_loop_server_logs" size="156941" author="blakecaldwell" created="Sun, 8 Dec 2013 18:16:54 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwavz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>11939</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>