<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:44:20 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4616] ls or client eviction issues (ldlm_lockd.c:642:ldlm_handle_ast_error()) ### client (nid 979@gni) returned -22 from blocking AST </title>
                <link>https://jira.whamcloud.com/browse/LU-4616</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We are seeing a few issues at IU, which is running Lustre v2.1.6 with the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2943&quot; title=&quot;LBUG mdt_reconstruct_open()) ASSERTION( (!(rc &amp;lt; 0) || (lustre_msg_get_transno(req-&amp;gt;rq_repmsg) == 0)) )&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2943&quot;&gt;&lt;del&gt;LU-2943&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3645&quot; title=&quot;Interop 2.1.5 &amp;lt;--&amp;gt; 2.4 Write operations during failover errors out instead of stalling&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3645&quot;&gt;&lt;del&gt;LU-3645&lt;/del&gt;&lt;/a&gt; patches applied. The issues are that clients sometimes get evicted, and sometimes a client fails to list directory metadata: when running an &apos;ls&apos; on a directory it fails to get the metadata from the MDS node and appears to lock up.&lt;/p&gt;

&lt;p&gt;Here is an example from one client.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;?--------- ? ?      ?            ?            ? testtemp_2014_02_08_01_04_21
?--------- ? ?      ?            ?            ? testtemp_2014_02_08_01_04_212014_02_08_01_04_21_test.log
?--------- ? ?      ?            ?            ? testtemp_2014_02_08_01_06_22
?--------- ? ?      ?            ?            ? testtemp_2014_02_08_01_06_222014_02_08_01_06_22_test.log
?--------- ? ?      ?            ?            ? testtemp_2014_02_08_08_57_04
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Server logs&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Feb 10 17:41:10 dc2mds01 kernel: : Lustre: dc2-MDT0000: haven&apos;t heard from client 6d16eeb8-e52c-8156-1c35-3be057a8b0ff (at 149.165.235.51@tcp) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff8824273af800, cur 1392072070 expire 1392071920 last 1392071843
Feb 10 17:41:10 dc2oss08 kernel: : Lustre: dc2-OST004e: haven&apos;t heard from client 6d16eeb8-e52c-8156-1c35-3be057a8b0ff (at 149.165.235.51@tcp) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff8815eeb28c00, cur 1392072070 expire 1392071920 last 1392071843
Feb 10 20:22:46 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 20:42:08 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) refused reconnection, still busy with 1 active RPCs
Feb 10 21:02:47 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 21:22:09 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) refused reconnection, still busy with 1 active RPCs
Feb 10 21:42:49 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 22:02:11 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) refused reconnection, still busy with 1 active RPCs
Feb 10 22:22:50 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 22:42:12 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) refused reconnection, still busy with 1 active RPCs
Feb 10 22:52:16 dc2oss12 kernel: : Lustre: dc2-OST0074: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 22:52:16 dc2oss12 kernel: : Lustre: 15986:0:(ldlm_lib.c:952:target_handle_connect()) dc2-OST0074: connection from 6aff6348-d240-0525-462c-22e1589f343f@149.165.235.51@tcp t0 exp ffff881158ad4c00 cur 1392090736 last 1392090736
Feb 10 23:02:51 dc2mds01 kernel: : Lustre: dc2-MDT0000: Client 6aff6348-d240-0525-462c-22e1589f343f (at 149.165.235.51@tcp) reconnecting
Feb 10 23:12:34 dc2mds01 kernel: : LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 20028s: evicting client at 149.165.235.51@tcp  ns: mdt-ffff881843d1a000 lock: ffff882d78e8f480/0xcf86cac7791be105 lrc: 3/0,0 mode: PR/PR res: 8589937616/1 bits 0x3 rrc: 211 type: IBT flags: 0x4000020 remote: 0xf3ffedf65e028084 expref: 55 pid: 28615 timeout: 7109149431
Feb 10 23:12:39 dc2mds01 kernel: : Lustre: 13955:0:(ldlm_lib.c:952:target_handle_connect()) dc2-MDT0000: connection from 6aff6348-d240-0525-462c-22e1589f343f@149.165.235.51@tcp t225898436598 exp (null) cur 1392091959 last 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Client logs&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Feb 10 22:39:17 xfer01 kernel: LustreError: Skipped 85 previous similar messages
Feb 10 22:49:19 xfer01 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.0.171@o2ib. The mds_connect operation failed with -16
Feb 10 22:49:19 xfer01 kernel: LustreError: Skipped 85 previous similar messages
Feb 10 22:52:16 xfer01 kernel: Lustre: 13130:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1459693143899338 sent from dc2-OST0074-osc-ffff810063082c00 to NID 10.10.0.12@o2ib 17s ago has timed out (17s prior to deadline).
Feb 10 22:52:16 xfer01 kernel: Lustre: dc2-OST0074-osc-ffff810063082c00: Connection to service dc2-OST0074 via nid 10.10.0.12@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Feb 10 22:52:16 xfer01 kernel: Lustre: 13130:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1459693143899342 sent from dc2-OST0078-osc-ffff810063082c00 to NID 10.10.0.12@o2ib 17s ago has timed out (17s prior to deadline).
Feb 10 22:52:16 xfer01 kernel: Lustre: dc2-OST0074-osc-ffff810063082c00: Connection restored to service dc2-OST0074 using nid 10.10.0.12@o2ib.
Feb 10 22:52:16 xfer01 kernel: Lustre: Server dc2-OST0074_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Feb 10 22:52:16 xfer01 kernel: Lustre: Skipped 3 previous similar messages
Feb 10 22:52:16 xfer01 kernel: Lustre: dc2-OST0078-osc-ffff810063082c00: Connection to service dc2-OST0078 via nid 10.10.0.12@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Feb 10 22:52:16 xfer01 kernel: Lustre: dc2-OST0079-osc-ffff810063082c00: Connection restored to service dc2-OST0079 using nid 10.10.0.12@o2ib.
Feb 10 22:59:21 xfer01 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.0.171@o2ib. The mds_connect operation failed with -16
Feb 10 22:59:21 xfer01 kernel: LustreError: Skipped 85 previous similar messages
Feb 10 23:09:23 xfer01 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.0.171@o2ib. The mds_connect operation failed with -16
Feb 10 23:09:23 xfer01 kernel: LustreError: Skipped 85 previous similar messages
Feb 10 23:10:53 xfer01 kernel: LustreError: 13174:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Feb 10 23:10:53 xfer01 kernel: LustreError: 13174:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Feb 10 23:12:39 xfer01 kernel: LustreError: 167-0: This client was evicted by dc2-MDT0000; in progress operations using this service will fail.
Feb 10 23:12:39 xfer01 kernel: Lustre: Server dc2-MDT0000_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Feb 10 23:12:39 xfer01 kernel: Lustre: Skipped 3 previous similar messages
Feb 10 23:12:39 xfer01 kernel: LustreError: 13191:0:(mdc_locks.c:672:mdc_enqueue()) ldlm_cli_enqueue error: -4
Feb 10 23:12:39 xfer01 kernel: LustreError: 13191:0:(client.c:859:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81012e144400 x1459693143907855/t0 o101-&amp;gt;dc2-MDT0000_UUID@10.10.0.171@o2ib:12/10 lens 568/4880 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Feb 10 23:12:39 xfer01 kernel: LustreError: 13191:0:(file.c:119:ll_close_inode_openhandle()) inode 144593035466996022 mdc close failed: rc = -108
Feb 10 23:12:39 xfer01 kernel: Lustre: dc2-MDT0000-mdc-ffff810063082c00: Connection restored to service dc2-MDT0000 using nid 10.10.0.171@o2ib.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We have collected stack traces from the MDS node and a Lustre client. On further investigation, the stack traces appear to match several known bugs: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1775&quot; title=&quot;b1_8&amp;lt;-&amp;gt;2.3 Test failure on test suite parallel-scale, subtest test_compilebench&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1775&quot;&gt;&lt;del&gt;LU-1775&lt;/del&gt;&lt;/a&gt;, &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2827&quot; title=&quot;mdt_intent_fixup_resent() cannot find the proper lock in hash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2827&quot;&gt;&lt;del&gt;LU-2827&lt;/del&gt;&lt;/a&gt;, &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3546&quot; title=&quot;2.4.0&amp;lt;-&amp;gt;2.3 interop: sanity test 56w: MDS unknown opcode 61&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3546&quot;&gt;&lt;del&gt;LU-3546&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4112&quot; title=&quot;Random eviction of clients on lock callback timeout&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4112&quot;&gt;&lt;del&gt;LU-4112&lt;/del&gt;&lt;/a&gt;. It seems we are hitting all of these issues at IU.&lt;/p&gt;

&lt;p&gt;Here are the stack trace logs reporting the IBT lock flags &quot;rrc: 22 type: IBT flags: 0x4000020 remote: 0x9533dbd496bca06b expref: 225 pid: 28934 timeout: 7083106737&quot;:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00010000:00010000:1.0:1392065810.879717:0:28934:0:(ldlm_lockd.c:809:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: mdt-ffff881843d1a000 lock: ffff881bc9e066c0/0xcf86cac754afad6c lrc: 3/0,0 mode: PR/PR res: 8612640447/14904 bits 0x3 rrc: 22 type: IBT flags: 0x4000020 remote: 0x9533dbd496bca06b expref: 225 pid: 28934 timeout: 0
00010000:00010000:1.0:1392065810.879720:0:28934:0:(ldlm_lockd.c:468:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 100, AT: on) ns: mdt-ffff881843d1a000 lock: ffff881bc9e066c0/0xcf86cac754afad6c lrc: 4/0,0 mode: PR/PR res: 8612640447/14904 bits 0x3 rrc: 22 type: IBT flags: 0x4000020 remote: 0x9533dbd496bca06b expref: 225 pid: 28934 timeout: 7083106737
00000100:00100000:1.0:1392065810.879728:0:28934:0:(client.c:1434:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mdt_77:dc2-MDT0000_UUID:28934:1456766457101078:922@gni:104
00000100:00100000:1.0:1392065810.879744:0:28934:0:(client.c:1434:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mdt_77:dc2-MDT0000_UUID:28934:1456766457101079:923@gni:104
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here are the stack trace logs reporting the blocking AST error &quot;(ldlm_lockd.c:642:ldlm_handle_ast_error()) ### client (nid 979@gni) returned -22 from blocking AST &quot;:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00010000:00010000:5.0:1392065759.472527:0:10020:0:(ldlm_lockd.c:642:ldlm_handle_ast_error()) ### client (nid 979@gni) returned -22 from blocking AST - normal race ns: mdt-ffff881843d1a000 lock: ffff881bf2728000/0xcf86cac758e2f8c0 lrc: 1/0,0 mode: --/PR res: 8628747995/115206 bits 0x3 rrc: 2 type: IBT flags: 0x240000a0 remote: 0x4e285e5bd84c2533 expref: 1102 pid: 4259 timeout: 7083106330
00010000:00010000:3.0:1392065759.685218:0:20661:0:(ldlm_lockd.c:642:ldlm_handle_ast_error()) ### client (nid 979@gni) returned -22 from blocking AST - normal race ns: mdt-ffff881843d1a000 lock: ffff881b3e213900/0xcf86cac758e34cdc lrc: 1/0,0 mode: --/PR res: 8628747995/115215 bits 0x3 rrc: 2 type: IBT flags: 0x240000a0 remote: 0x4e285e5bd84c25e9 expref: 1106 pid: 4259 timeout: 7083055542
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here are the stack trace logs reporting the flock lock &quot;(ldlm_flock.c:183:ldlm_process_flock_lock()) flags 0x40000&quot;:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00010000:00010000:1.0:1392065865.202616:0:4297:0:(ldlm_flock.c:183:ldlm_process_flock_lock()) flags 0x40000 owner 837 pid 837 mode 32 start 0 end 9223372036854775807
00000100:00100000:0.0:1392065865.202617:0:11853:0:(events.c:300:request_in_callback()) peer: 12345-1032@gni
00010000:00010000:1.0:1392065865.202621:0:4297:0:(ldlm_flock.c:108:ldlm_flock_destroy()) ### ldlm_flock_destroy(mode: 2, flags: 0x40000) ns: mdt-ffff881843d1a000 lock: ffff882878911240/0xcf86cac759740b5c lrc: 2/0,0 mode: PW/PW res: 8624191356/125099 rrc: 3 type: FLK pid: 837 [0-&amp;gt;9223372036854775807] flags: 0x0 remote: 0xc0abc3a2c7d67cb8 expref: 807 pid: 4259 timeout: 0

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Logs are attached here, and I can also provide the complete stack trace logs taken from the MDS node and the Lustre client node during that time period, or while running &apos;ls&apos; on the directory structure, if required. &lt;/p&gt;

&lt;p&gt;Thank You,&lt;br/&gt;
          Manish&lt;/p&gt;</description>
                <environment>Lustre Server: 2.1.6 with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2943&quot; title=&quot;LBUG mdt_reconstruct_open()) ASSERTION( (!(rc &amp;lt; 0) || (lustre_msg_get_transno(req-&amp;gt;rq_repmsg) == 0)) )&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2943&quot;&gt;&lt;strike&gt;LU-2943&lt;/strike&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3645&quot; title=&quot;Interop 2.1.5 &amp;lt;--&amp;gt; 2.4 Write operations during failover errors out instead of stalling&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3645&quot;&gt;&lt;strike&gt;LU-3645&lt;/strike&gt;&lt;/a&gt; patch &lt;br/&gt;
Lustre Client: 1.8.7, 1.8.9 and 2.x</environment>
        <key id="23114">LU-4616</key>
            <summary>ls or client eviction issues (ldlm_lockd.c:642:ldlm_handle_ast_error()) ### client (nid 979@gni) returned -22 from blocking AST </summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bobijam">Zhenyu Xu</assignee>
                                    <reporter username="manish">Manish Patel</reporter>
                        <labels>
                    </labels>
                <created>Wed, 12 Feb 2014 16:00:06 +0000</created>
                <updated>Thu, 22 May 2014 22:56:26 +0000</updated>
                            <resolved>Fri, 2 May 2014 20:11:21 +0000</resolved>
                                    <version>Lustre 2.1.6</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>16</watches>
                                                                            <comments>
                            <comment id="76922" author="pjones" created="Wed, 12 Feb 2014 23:53:35 +0000"  >&lt;p&gt;Bobijam&lt;/p&gt;

&lt;p&gt;What do you advise here?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="77168" author="bobijam" created="Mon, 17 Feb 2014 08:42:53 +0000"  >&lt;p&gt;Is it reliably reproducible? What are the reproduction steps, if possible? I tried to reproduce it with a 2.1.6 server/1.8 client and could not.&lt;/p&gt;</comment>
                            <comment id="77261" author="manish" created="Tue, 18 Feb 2014 17:05:02 +0000"  >&lt;p&gt;Hi Zhenyu,&lt;/p&gt;

&lt;p&gt;Yes, this is reproducible just by doing a long &apos;ls&apos; listing of a directory. The problem starts, or becomes reproducible, roughly 15-20 days of filesystem operation after a downtime. That is, we took down the MDS and OSS to clear this same problem; after rebooting the MDS and OSS the filesystem came back online, and at that point we were not able to reproduce the problem. But after about 12-15 days the same problem reappeared, and we could reproduce it again just by doing an &apos;ls&apos; on the directory to get the directory stats. &lt;/p&gt;


&lt;p&gt;Here is a brief explanation from IU of what they are seeing and experiencing. &lt;/p&gt;

&lt;p&gt;We&apos;re actually experiencing two issues, or at least different manifestations of a single problem. First, we have clients that hang when stat&apos;ing certain directories and eventually get evicted by the MDS, and only the MDS. These clients log &quot;The mds_connect operation failed with -16&quot; until they detect that they&apos;ve been evicted by the MDS.&lt;/p&gt;

&lt;p&gt;Second, we&apos;re seeing clients evicted from OSSes with &quot;&lt;br/&gt;
lock blocking callback time out: rc -107&quot;, which are usually not accompanied by an eviction from the MDS. Often 2-3 clients will get evicted in the same second; see the samples below:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Feb  7 15:42:32 dc2oss16 kernel: : LustreError: 138-a: dc2-OST009f: A client on nid 10.10.0.46@tcp was evicted due to a lock completion callback time out: rc -107
Feb  7 15:43:52 dc2oss16 kernel: : LustreError: 138-a: dc2-OST00a5: A client on nid 10.10.0.70@tcp was evicted due to a lock blocking callback time out: rc -107

Feb  7 16:38:46 dc2oss13 kernel: : LustreError: 138-a: dc2-OST007e: A client on nid 10.10.0.192@tcp was evicted due to a lock completion callback time out: rc -107
Feb  7 16:38:46 dc2oss13 kernel: : LustreError: 138-a: dc2-OST0082: A client on nid 10.10.0.250@tcp was evicted due to a lock completion callback time out: rc -107

Feb  7 17:25:40 dc2oss03 kernel: : LustreError: 138-a: dc2-OST001b: A client on nid 10.10.0.247@tcp was evicted due to a lock blocking callback time out: rc -107
Feb  7 17:25:40 dc2oss03 kernel: : LustreError: 138-a: dc2-OST0019: A client on nid 10.10.0.94@tcp was evicted due to a lock blocking callback time out: rc -107
Feb  7 17:25:40 dc2oss13 kernel: : LustreError: 138-a: dc2-OST007e: A client on nid 149.165.226.203@tcp was evicted due to a lock blocking callback time out: rc -107
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Thank You,&lt;br/&gt;
          Manish&lt;/p&gt;</comment>
                            <comment id="77908" author="bobijam" created="Wed, 26 Feb 2014 15:40:48 +0000"  >&lt;p&gt;Would you mind trying this: when the issue happens, mount a new client, check whether this new client also has the issue, and collect the -1 logs (client, MDS and OSS) during the procedure?&lt;/p&gt;</comment>
                            <comment id="77912" author="bobijam" created="Wed, 26 Feb 2014 15:48:45 +0000"  >&lt;p&gt;The currently provided log does not reveal why the client returns -22 to the OSS server.&lt;/p&gt;</comment>
                            <comment id="77977" author="manish" created="Thu, 27 Feb 2014 02:00:32 +0000"  >&lt;p&gt;Hi Zhenyu,&lt;/p&gt;

&lt;p&gt;In reference to the comment &quot;try to mount a new client&quot;, do you mean a freshly built/created Lustre client with v1.8.9 that has never connected before?  &lt;/p&gt;

&lt;p&gt;I have two complete sets of MDS and Lustre-client stack traces from those times, which you can download here.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://ddntsr.com/ftp/2014-02-26-LU-4616.rar&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://ddntsr.com/ftp/2014-02-26-LU-4616.rar&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hope that helps shed some light; if not, let me know what logs are required from the next test.  &lt;/p&gt;

&lt;p&gt;Thank You,&lt;br/&gt;
          Manish&lt;/p&gt;</comment>
                            <comment id="77979" author="bobijam" created="Thu, 27 Feb 2014 02:18:42 +0000"  >&lt;p&gt;Yes, I mean a freshly connected new Lustre client, so that you can collect the smallest relevant set of client logs from it.&lt;/p&gt;</comment>
                            <comment id="77982" author="bobijam" created="Thu, 27 Feb 2014 03:39:19 +0000"  >&lt;p&gt;The 2014-02-26-&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4616&quot; title=&quot;ls or client eviction issues (ldlm_lockd.c:642:ldlm_handle_ast_error()) ### client (nid 979@gni) returned -22 from blocking AST &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4616&quot;&gt;&lt;del&gt;LU-4616&lt;/del&gt;&lt;/a&gt;.rar does not contain enough information to reveal the root cause. If possible, I still need your help to get -1 debug logs (newly connected client, MDS and OSS) during the re-hit. If the newly created client can re-hit the issue, the -1 log covering that time span should contain the relevant information. Thank you in advance.&lt;/p&gt;</comment>
                            <comment id="77983" author="manish" created="Thu, 27 Feb 2014 03:54:03 +0000"  >&lt;p&gt;Hi Zhenyu,&lt;/p&gt;

&lt;p&gt;Thank you for your prompt response. OK, I will check with IU and see if we can get this setup done to obtain a new set of logs. In reference to &quot;get -1 debug logs&quot;, do you mean enabling it on the client using the command below?&lt;/p&gt;

&lt;p&gt;sysctl -w lnet.printk=-1&lt;/p&gt;

&lt;p&gt;And does that need to be done on just that newly connected client, or does it have to be enabled on the MDS and OSS as well? &lt;/p&gt;

&lt;p&gt;Other than that, do you also need stack trace, dlmtrace or rpctrace? If yes, from which nodes, with what buffer size, and what is the right way to enable those debug options and collect the logs?&lt;/p&gt;

&lt;p&gt;Thank You,&lt;br/&gt;
          Manish&lt;/p&gt;</comment>
                            <comment id="77985" author="bobijam" created="Thu, 27 Feb 2014 04:18:27 +0000"  >&lt;p&gt;Suppose you&apos;ve built a new client: set the debug level to -1 on all nodes involved in the re-hit (the new client, MDS nodes, OSS nodes) with &quot;lctl set_param debug=-1&quot;, then mount the client and try the re-hit procedure. If the re-hit occurs, run &quot;lctl dk &amp;lt;log file&amp;gt;&quot; to dump the -1 logs on all involved nodes; the -1 level log contains all levels of logs.&lt;/p&gt;</comment>
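[Editor's note: the collection procedure described in the comment above can be sketched as a shell session; the log-file path is a placeholder, and the optional `debug_mb` buffer enlargement is an extra not mentioned in the comment.]

```shell
# Run on every node involved in the re-hit (new client, MDS, OSS):
lctl set_param debug=-1        # enable all debug log levels
lctl set_param debug_mb=1024   # optional: enlarge the in-memory debug buffer

# Mount the client and run the reproducer, e.g. `ls -al` on the
# problem directory. Once the issue re-hits, dump the buffer on each node:
lctl dk /tmp/lustre-debug.$(hostname).log
```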
                            <comment id="78276" author="manish" created="Mon, 3 Mar 2014 22:28:23 +0000"  >&lt;p&gt;Hi Zhenyu,&lt;/p&gt;


&lt;p&gt;We have set up a new client and collected the logs as you requested. Here is the &apos;ls&apos; behavior on the directory:&lt;/p&gt;

&lt;p&gt;The `ls -al` job timestamps and output are included below:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;=====
Begin 2014-03-03_11:53:52
ls start 2014-03-03_11:53:52
total 5366452
d????????? ? ?      ?            ?            ? .
drwxr-xr-x 7 917155 210       4096 Feb  8 22:01 ..
-rw-r--r-- 1 917155 210     188103 Feb  7 23:37 673.maf
drwxr-xr-x 2 917155 210       4096 Jan 29 00:32 backup
d????????? ? ?      ?            ?            ? blasttemp
-rw-r--r-- 1 917155 210     377500 Feb  7 22:48
blasttemp2014_02_07_22_25_19_blast.log
-rw-r--r-- 1 917155 210      54362 Feb  7 22:43
blasttemp2014_02_07_22_40_05_blast.log
-rw-r--r-- 1 917155 210      34941 Feb  7 22:48
blasttemp2014_02_07_22_46_08_blast.log
drwxr-xr-x 2 917155 210       4096 Feb  8 05:32
blasttemp_2014_02_08_00_47_34
-rw-r--r-- 1 917155 210    9747929 Feb  8 05:32
blasttemp_2014_02_08_00_47_342014_02_08_00_47_34_blast.log
drwxr-xr-x 2 917155 210       4096 Feb  8 06:48
blasttemp_2014_02_08_00_49_37
-rw-r--r-- 1 917155 210   15579008 Feb  8 06:48
blasttemp_2014_02_08_00_49_372014_02_08_00_49_37_blast.log
d????????? ? ?      ?            ?            ?
blasttemp_2014_02_08_01_04_21
-????????? ? ?      ?            ?            ?
blasttemp_2014_02_08_01_04_212014_02_08_01_04_21_blast.log
d????????? ? ?      ?            ?            ?
blasttemp_2014_02_08_01_06_22
-????????? ? ?      ?            ?            ?
blasttemp_2014_02_08_01_06_222014_02_08_01_06_22_blast.log
d????????? ? ?      ?            ?            ?
blasttemp_2014_02_08_08_57_04
-rw-r--r-- 1 917155 210   20242620 Feb  8 22:03
blasttemp_2014_02_08_08_57_042014_02_08_08_57_04_blast.log
-rw-r--r-- 1 917155 210  326104853 Feb  7 18:32 lpr_comb_c95_0_18000.maf
-rw-r--r-- 1 917155 210     887278 Feb  8 05:32
lpr_comb_c95_0_18000.maf2014_02_08_00_47_34.mergelog.log
-rw-r--r-- 1 917155 210  433456072 Feb  8 05:32 lpr_comb_c95_0_18000_merge6.maf
-rw-r--r-- 1 917155 210   46735521 Feb  8 05:32
lpr_comb_c95_0_18000_merge6.maf_merged
-rw-r--r-- 1 917155 210  460095829 Feb  9 11:54 lpr_comb_c95_0_30000.maf
-????????? ? ?      ?            ?            ? lpr_comb_c95_0_50k.maf
-rw-r--r-- 1 917155 210  215968052 Feb  7 18:30 lpr_comb_c95_100001_end.maf
-????????? ? ?      ?            ?            ?
lpr_comb_c95_100001_end.maf2014_02_08_01_06_22.mergelog.log
-rw-r--r-- 1 917155 210    2089023 Feb  8 16:13
lpr_comb_c95_100001_end.maf2014_02_08_08_57_04.mergelog.log
-????????? ? ?      ?            ?            ?
lpr_comb_c95_100001_end_merge6_2.maf
-rw-r--r-- 1 917155 210   58659088 Feb  8 16:13
lpr_comb_c95_100001_end_merge6_2.maf_merged
-rw-r--r-- 1 917155 210   49456106 Feb  8 08:02
lpr_comb_c95_100001_end_merge6.maf
-rw-r--r-- 1 917155 210   59065234 Feb  8 08:02
lpr_comb_c95_100001_end_merge6.maf_merged
-rw-r--r-- 1 917155 210  226895678 Feb  5 20:05 lpr_comb_c95_10k_20k.maf
-????????? ? ?      ?            ?            ? lpr_comb_c95_18001_50000.maf
-rw-r--r-- 1 917155 210    1662375 Feb  8 06:48
lpr_comb_c95_18001_50000.maf2014_02_08_00_49_37.mergelog.log
-rw-r--r-- 1 917155 210  256010134 Feb  8 06:48 lpr_comb_c95_18001_50000_merge6.maf
-????????? ? ?      ?            ?            ?
lpr_comb_c95_18001_50000_merge6.maf_merged
-rw-r--r-- 1 917155 210  254099687 Feb  7 18:29 lpr_comb_c95_50001_100000.maf
-rw-r--r-- 1 917155 210    2298305 Feb  8 07:38
lpr_comb_c95_50001_100000.maf2014_02_08_01_04_21.mergelog.log
-rw-r--r-- 1 917155 210  149209527 Feb  8 07:38 lpr_comb_c95_50001_100000_merge6.maf
-rw-r--r-- 1 917155 210   75692995 Feb  8 07:38



ls end 2014-03-03_14:31:14
End 2014-03-03_14:31:31

=====
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt; 

&lt;p&gt;Here are the URLs of those logs, and I have attached one file containing the kernel logs from the MDS and OSS nodes, named &quot;Server_System_logs_from_MDS_OSS_03-032014.txt&quot;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://ddntsr.com/ftp/2014-03-03-29545_run9.tar.gz&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://ddntsr.com/ftp/2014-03-03-29545_run9.tar.gz&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://ddntsr.com/ftp/2014-03-03-dc2ossXX_run9_debug_logs.tar.gz&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://ddntsr.com/ftp/2014-03-03-dc2ossXX_run9_debug_logs.tar.gz&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let me know if you need anything else from our end. &lt;/p&gt;

&lt;p&gt;Thank You,&lt;br/&gt;
          Manish&lt;/p&gt;</comment>
                            <comment id="78317" author="bobijam" created="Tue, 4 Mar 2014 07:37:59 +0000"  >&lt;p&gt;From 29545_run9/mds_run9_debug.log, I can see the client 149.165.235.42@tcp was evicted at &quot;Tue Mar  4 03:31:13 CST 2014&quot;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;00010000:00020000:8.1:1393875073.142106:0:0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 9437s: evicting client at 149.165.235.42@tcp  ns: mdt-ffff881843d1a000 lock: ffff882afe438000/0xcf86cae155f4b253 lrc: 3/0,0 mode: PR/PR res: 35127297/2218887223 bits 0x3 rrc: 382 type: IBT flags: 0x4000020 remote: 0x8a1cc956fcd9690 expref: 53 pid: 12292 timeout: 8892268113&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;At &quot;Tue Mar  4 03:31:14 CST 2014&quot;, the client thought it was still connected to the MDS and tried to reconnect.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;00000100:00100000:1:1393875074.464664:0:1977:0:(import.c:720:ptlrpc_connect_import()) @@@ reconnect request 1219  req@ffff8800307ed000 x1461574032228883/t0 o38-&amp;gt;dc2-MDT0000_UUID@10.10.0.171@o2ib:12/10 lens 368/584 e 0 to 1 dl 0 ref 1 fl New:N/0/0 rc 0/0&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;But the MDS had already evicted it; when the MDS told the client the situation, the client realized the eviction.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;00010000:00000400:3.0:1393875078.627576:0:12472:0:(ldlm_lib.c:952:target_handle_connect()) dc2-MDT0000: connection from a69ef170-baaf-a6ea-059d-2e445759c567@149.165.235.42@tcp t230235724090 exp (null) cur 1393875078 last 0&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Note: exp is NULL, which means the MDS does not recognize this client and treats this as a new client connection request.&lt;/p&gt;

&lt;p&gt;The logs do not show why the client failed to answer the MDS lock request, though. &lt;/p&gt;</comment>
                            <comment id="78777" author="manish" created="Fri, 7 Mar 2014 23:40:04 +0000"  >&lt;p&gt;Hi Zhenyu,&lt;/p&gt;

&lt;p&gt;We took a maintenance window and captured a vmcore crash dump via a manual SysRq trigger. Here is the backtrace that was captured, in case it sheds some light. There are also a few Intel tickets with similar traces, such as &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2807&quot; title=&quot;lockup in server completion ast -&amp;gt; lu_object_find_at&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2807&quot;&gt;&lt;del&gt;LU-2807&lt;/del&gt;&lt;/a&gt;, &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4733&quot; title=&quot;All mdt thread stuck in cfs_waitq_wait&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4733&quot;&gt;&lt;del&gt;LU-4733&lt;/del&gt;&lt;/a&gt;, &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4335&quot; title=&quot;MDS hangs due to mdt thread hung/inactive&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4335&quot;&gt;&lt;del&gt;LU-4335&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4195&quot; title=&quot;MDT Slow with ptlrpcd using 100% cpu.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4195&quot;&gt;&lt;del&gt;LU-4195&lt;/del&gt;&lt;/a&gt;, which I thought might be helpful in investigating these issues.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;PID: 20659  TASK: ffff88180eb38ae0  CPU: 11  COMMAND: &quot;mdt_37&quot;
 #0 [ffff88141d639c60] schedule at ffffffff8150de92
 #1 [ffff88141d639d28] cfs_waitq_wait at ffffffffa04ab60e [libcfs]
 #2 [ffff88141d639d38] ptlrpc_wait_event at ffffffffa07e19e9 [ptlrpc]
 #3 [ffff88141d639dc8] ptlrpc_main at ffffffffa07e956d [ptlrpc]
 #4 [ffff88141d639f48] kernel_thread at ffffffff8100c0ca

PID: 20660  TASK: ffff881850f5e040  CPU: 11  COMMAND: &quot;mdt_38&quot;
 #0 [ffff88141d669690] schedule at ffffffff8150de92
 #1 [ffff88141d669758] cfs_waitq_wait at ffffffffa04ab60e [libcfs]
 #2 [ffff88141d669768] lov_create at ffffffffa0a8250a [lov]
 #3 [ffff88141d669868] mdd_lov_create at ffffffffa0c7290e [mdd]
 #4 [ffff88141d669908] mdd_create at ffffffffa0c84871 [mdd]
 #5 [ffff88141d669a48] cml_create at ffffffffa0dae3f7 [cmm]
 #6 [ffff88141d669a98] mdt_reint_open at ffffffffa0d0d19d [mdt]
 #7 [ffff88141d669b78] mdt_reint_rec at ffffffffa0cf5c81 [mdt]
 #8 [ffff88141d669b98] mdt_reint_internal at ffffffffa0ceced4 [mdt]
 #9 [ffff88141d669be8] mdt_intent_reint at ffffffffa0ced53d [mdt]
#10 [ffff88141d669c38] mdt_intent_policy at ffffffffa0cebc09 [mdt]
#11 [ffff88141d669c78] ldlm_lock_enqueue at ffffffffa0795391 [ptlrpc]
#12 [ffff88141d669cd8] ldlm_handle_enqueue0 at ffffffffa07bb1dd [ptlrpc]
#13 [ffff88141d669d48] mdt_enqueue at ffffffffa0cec586 [mdt]
#14 [ffff88141d669d68] mdt_handle_common at ffffffffa0ce1772 [mdt]
#15 [ffff88141d669db8] mdt_regular_handle at ffffffffa0ce2665 [mdt]
#16 [ffff88141d669dc8] ptlrpc_main at ffffffffa07e9b9e [ptlrpc]
#17 [ffff88141d669f48] kernel_thread at ffffffff8100c0ca

PID: 20661  TASK: ffff882cfdadf540  CPU: 2   COMMAND: &quot;mdt_39&quot;
 #0 [ffff882745d7d870] schedule at ffffffff8150de92
 #1 [ffff882745d7d938] cfs_waitq_wait at ffffffffa04ab60e [libcfs]
 #2 [ffff882745d7d948] ldlm_completion_ast at ffffffffa07b2dca [ptlrpc]
 #3 [ffff882745d7d9e8] ldlm_cli_enqueue_local at ffffffffa07b2536 [ptlrpc]
 #4 [ffff882745d7da78] mdt_object_lock at ffffffffa0cde2a0 [mdt]
 #5 [ffff882745d7db18] mdt_getattr_name_lock at ffffffffa0ceec62 [mdt]
 #6 [ffff882745d7dbd8] mdt_intent_getattr at ffffffffa0cefc1d [mdt]
 #7 [ffff882745d7dc38] mdt_intent_policy at ffffffffa0cebc09 [mdt]
 #8 [ffff882745d7dc78] ldlm_lock_enqueue at ffffffffa0795391 [ptlrpc]
 #9 [ffff882745d7dcd8] ldlm_handle_enqueue0 at ffffffffa07bb1dd [ptlrpc]
#10 [ffff882745d7dd48] mdt_enqueue at ffffffffa0cec586 [mdt]
#11 [ffff882745d7dd68] mdt_handle_common at ffffffffa0ce1772 [mdt]
#12 [ffff882745d7ddb8] mdt_regular_handle at ffffffffa0ce2665 [mdt]
#13 [ffff882745d7ddc8] ptlrpc_main at ffffffffa07e9b9e [ptlrpc]
#14 [ffff882745d7df48] kernel_thread at ffffffff8100c0ca
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;One other thing: after the maintenance window ended and the customer started mounting the Lustre client nodes, one of the nodes failed right after connecting to the MDS. The only difference was that it used a different mount option: originally that client node was mounted with the &quot;localflock&quot; option, and this time it was mounted with &quot;flock&quot;.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Mar  5 13:18:29 c6 kernel: LustreError: 11984:0:(ldlm_resource.c:526:ldlm_namespace_cleanup()) Resource: ffff881589b2f240 (22024075/0/0/0) (rc: 2)
Mar  5 13:18:29 c6 kernel: Lustre: dc2-OST002c-osc-ffff8880133b3400: Connection restored to service dc2-OST002c using nid 10.10.0.5@o2ib.
Mar  5 13:18:29 c6 kernel: LustreError: 11137:0:(lov_request.c:211:lov_update_enqueue_set()) enqueue objid 0x1843a subobj 0x1500f8b on OST idx 44: rc -5
Mar  5 13:18:29 c6 kernel: LustreError: 11356:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
Mar  5 13:39:26 c6 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.0.11@o2ib. The ldlm_enqueue operation failed with -107
Mar  5 13:39:26 c6 kernel: Lustre: dc2-OST0073-osc-ffff8880133b3400: Connection to service dc2-OST0073 via nid 10.10.0.11@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Mar  5 13:39:28 c6 kernel: LustreError: 11265:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Mar  5 13:39:28 c6 kernel: LustreError: 11265:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Mar  5 13:39:35 c6 kernel: LustreError: 167-0: This client was evicted by dc2-OST0073; in progress operations using this service will fail.
Mar  5 13:39:37 c6 kernel: Lustre: Server dc2-OST0073_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Mar  5 13:39:37 c6 kernel: LustreError: 13543:0:(file.c:1001:ll_glimpse_size()) obd_enqueue returned rc -4, returning -EIO
Mar  5 13:39:37 c6 kernel: LustreError: 3378:0:(client.c:859:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff881fd195bc00 x1461691364654913/t0 o4-&amp;gt;dc2-OST0073_UUID@10.10.0.11@o2ib:6/4 lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
Mar  5 13:39:37 c6 kernel: LustreError: 3378:0:(client.c:859:ptlrpc_import_delay_req()) Skipped 1 previous similar message
Mar  5 13:39:38 c6 kernel: LustreError: 13616:0:(ldlm_resource.c:521:ldlm_namespace_cleanup()) Namespace dc2-OST0073-osc-ffff8880133b3400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Mar  5 13:39:38 c6 kernel: LustreError: 13462:0:(llite_mmap.c:210:ll_tree_unlock()) couldn&apos;t unlock -5
Mar  5 13:39:38 c6 kernel: LustreError: 13616:0:(ldlm_resource.c:526:ldlm_namespace_cleanup()) Resource: ffff8815c0741b00 (21885599/0/0/0) (rc: 1)
Mar  5 13:39:39 c6 kernel: Lustre: dc2-OST0073-osc-ffff8880133b3400: Connection restored to service dc2-OST0073 using nid 10.10.0.11@o2ib.
Mar  5 13:44:23 c6 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.0.8@o2ib. The ldlm_enqueue operation failed with -107
Mar  5 13:44:24 c6 kernel: Lustre: dc2-OST0052-osc-ffff8880133b3400: Connection to service dc2-OST0052 via nid 10.10.0.8@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Mar  5 13:44:31 c6 kernel: LustreError: 167-0: This client was evicted by dc2-OST0052; in progress operations using this service will fail.
Mar  5 13:44:31 c6 kernel: Lustre: Server dc2-OST0052_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Mar  5 13:44:31 c6 kernel: LustreError: 15981:0:(file.c:1001:ll_glimpse_size()) obd_enqueue returned rc -4, returning -EIO
Mar  5 13:44:31 c6 kernel: LustreError: 15948:0:(file.c:1001:ll_glimpse_size()) obd_enqueue returned rc -4, returning -EIO
Mar  5 13:44:31 c6 kernel: LustreError: 3378:0:(osc_request.c:2357:brw_interpret()) ASSERTION(!(aa-&amp;gt;aa_oa-&amp;gt;o_valid &amp;amp; OBD_MD_FLHANDLE)) failed
Mar  5 13:44:31 c6 kernel: LustreError: 3378:0:(osc_request.c:2357:brw_interpret()) LBUG
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I have attached the backtrace logs and the server and client logs for this client eviction incident.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://jira.hpdd.intel.com/secure/attachment/14246/server_logs_03_05.log&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://jira.hpdd.intel.com/secure/attachment/14246/server_logs_03_05.log&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;https://jira.hpdd.intel.com/secure/attachment/14245/rt_119855.c6.lustre_logs&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://jira.hpdd.intel.com/secure/attachment/14245/rt_119855.c6.lustre_logs&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;https://jira.hpdd.intel.com/secure/attachment/14244/backtrace.tar.gz&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://jira.hpdd.intel.com/secure/attachment/14244/backtrace.tar.gz&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thank You,&lt;br/&gt;
          Manish&lt;/p&gt;</comment>
                            <comment id="78838" author="bobijam" created="Mon, 10 Mar 2014 03:24:47 +0000"  >&lt;p&gt;The client LBUG at the end of that log is addressed by patch &lt;a href=&quot;http://review.whamcloud.com/#/c/5971&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/5971&lt;/a&gt; (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3067&quot; title=&quot;ASSERTION(!(aa-&amp;gt;aa_oa-&amp;gt;o_valid &amp;amp; OBD_MD_FLHANDLE))&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3067&quot;&gt;&lt;del&gt;LU-3067&lt;/del&gt;&lt;/a&gt;)&lt;/p&gt;</comment>
                            <comment id="78875" author="manish" created="Mon, 10 Mar 2014 14:42:40 +0000"  >&lt;p&gt;Hi Zhenyu,&lt;/p&gt;

&lt;p&gt;Thank you for the update. What about the backtrace from the MDS server that I provided? Can you check whether it helps explain why the directory listing stats were not showing, and why we see the lock completion callback timeouts and blocking timeouts?&lt;/p&gt;

&lt;p&gt;Thank You,&lt;br/&gt;
          Manish&lt;/p&gt;</comment>
                            <comment id="78975" author="bobijam" created="Tue, 11 Mar 2014 07:50:13 +0000"  >&lt;p&gt;Some info in server_logs_03_05.log reminds me of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4359&quot; title=&quot;1.8.9 clients has endless bulk IO timeouts with 2.4.1 servers&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4359&quot;&gt;&lt;del&gt;LU-4359&lt;/del&gt;&lt;/a&gt;; the two look somewhat alike.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Mar  5 13:18:29 dc2oss03 kernel: : Lustre: dc2-OST001e: Client 86045972-78e0-90f8-6a5f-e2580eb516f1 (at 149.165.229.66@tcp) refused reconnection, still busy with 5 active RPCs
Mar  5 13:18:29 dc2oss03 kernel: : LustreError: 20387:0:(ldlm_lib.c:2674:target_bulk_io()) @@@ Reconnect on bulk PUT  req@ffff880767c28c00 x1461679860288976/t0(0) o3-&amp;gt;86045972-78e0-90f8-6a5f-e2580eb516f1@149.165.229.66@tcp:0/0 lens 448/400 e 1 to 0 dl 1394043551 ref 1 fl Interpret:/0/0 rc 0/0
Mar  5 13:18:29 dc2oss03 kernel: : LustreError: 20387:0:(ldlm_lib.c:2674:target_bulk_io()) Skipped 2 previous similar messages
Mar  5 13:18:29 dc2oss03 kernel: : Lustre: dc2-OST001e: Bulk IO read error with 86045972-78e0-90f8-6a5f-e2580eb516f1 (at 149.165.229.66@tcp), client will retry: rc -110
Mar  5 13:18:29 dc2oss03 kernel: : Lustre: Skipped 2 previous similar messages
Mar  5 13:18:29 dc2oss03 kernel: : LustreError: 20923:0:(ldlm_lib.c:2674:target_bulk_io()) @@@ Reconnect on bulk PUT  req@ffff880883a49000 x1461679860288966/t0(0) o3-&amp;gt;86045972-78e0-90f8-6a5f-e2580eb516f1@149.165.229.66@tcp:0/0 lens 448/400 e 1 to 0 dl 1394043548 ref 1 fl Interpret:/0/0 rc 0/0
Mar  5 13:18:29 dc2oss03 kernel: : LustreError: 20923:0:(ldlm_lib.c:2674:target_bulk_io()) Skipped 1 previous similar message
Mar  5 13:18:29 dc2oss03 kernel: : Lustre: dc2-OST001e: Bulk IO read error with 86045972-78e0-90f8-6a5f-e2580eb516f1 (at 149.165.229.66@tcp), client will retry: rc -110
Mar  5 13:18:29 dc2oss03 kernel: : Lustre: Skipped 1 previous similar message
Mar  5 13:18:16 dc2oss05 kernel: : Lustre: 15852:0:(client.c:1817:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1394043489/real 1394043489]  req@ffff88181a7a8800 x1456766146495206/t0(0) o104-&amp;gt;dc2-OST002c@149.165.226.206@tcp:15/16 lens 296/192 e 0 to 1 dl 1394043496 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Mar  5 13:18:16 dc2oss05 kernel: : Lustre: 15852:0:(client.c:1817:ptlrpc_expire_one_request()) Skipped 17 previous similar messages
Mar  5 13:18:16 dc2oss05 kernel: : LustreError: 138-a: dc2-OST002c: A client on nid 149.165.226.206@tcp was evicted due to a lock blocking callback time out: rc -107
Mar  5 13:18:26 dc2oss08 kernel: : Lustre: dc2-OST004c: Client 86045972-78e0-90f8-6a5f-e2580eb516f1 (at 149.165.229.66@tcp) refused reconnection, still busy with 4 active RPCs
Mar  5 13:18:26 dc2oss08 kernel: : LustreError: 27762:0:(ldlm_lib.c:2674:target_bulk_io()) @@@ Reconnect on bulk PUT  req@ffff880bd5827400 x1461679860288975/t0(0) o3-&amp;gt;86045972-78e0-90f8-6a5f-e2580eb516f1@149.165.229.66@tcp:0/0 lens 448/400 e 1 to 0 dl 1394043551 ref 1 fl Interpret:/0/0 rc 0/0
Mar  5 13:18:26 dc2oss08 kernel: : LustreError: 27762:0:(ldlm_lib.c:2674:target_bulk_io()) Skipped 3 previous similar messages
Mar  5 13:18:26 dc2oss08 kernel: : Lustre: dc2-OST004c: Bulk IO read error with 86045972-78e0-90f8-6a5f-e2580eb516f1 (at 149.165.229.66@tcp), client will retry: rc -110
Mar  5 13:18:26 dc2oss08 kernel: : Lustre: dc2-OST004c: Bulk IO read error with 86045972-78e0-90f8-6a5f-e2580eb516f1 (at 149.165.229.66@tcp), client will retry: rc -110
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In that case it turned out to be a problem with their network setup. Does this system use a hybrid network, and would you mind considering the possibility of a network issue?&lt;/p&gt;</comment>
                            <comment id="79250" author="manish" created="Thu, 13 Mar 2014 16:44:25 +0000"  >&lt;p&gt;Hi Zhenyu,&lt;/p&gt;

&lt;p&gt;Yes, this is a hybrid network with both IB and TCP across multiple subnets. During the last maintenance window we made some network changes suggested by Intel in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4531&quot; title=&quot;frequent evictions and timeouts on routed lnet&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4531&quot;&gt;&lt;del&gt;LU-4531&lt;/del&gt;&lt;/a&gt;, but we are still seeing the MDS lockups and the directory stat listing issues, so I would like you to take another look at those old backtraces I sent last time.&lt;/p&gt;

&lt;p&gt;Yesterday we had one more downtime due to an MDS lockup, and we had to reboot the MDS to fix the directory stat listing issues. We have collected the stack traces and the logs from the server and the clients for that time period. The logs are below:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://ddntsr.com/ftp/2014-03-12-29545_stacktraces.tgz&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://ddntsr.com/ftp/2014-03-12-29545_stacktraces.tgz&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;https://jira.hpdd.intel.com/secure/attachment/14289/march_12_svr.log.gz&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://jira.hpdd.intel.com/secure/attachment/14289/march_12_svr.log.gz&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;https://jira.hpdd.intel.com/secure/attachment/14290/aprun-DebugLogs.tar.gz&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://jira.hpdd.intel.com/secure/attachment/14290/aprun-DebugLogs.tar.gz&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://ddntsr.com/ftp/2014-03-13-29545_es_lustre_showall_2014-03-13_082243.tar.bz2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://ddntsr.com/ftp/2014-03-13-29545_es_lustre_showall_2014-03-13_082243.tar.bz2&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Let me know if you need anything else from our end.&lt;/p&gt;

&lt;p&gt;Thank You,&lt;br/&gt;
          Manish&lt;/p&gt;</comment>
                            <comment id="79321" author="green" created="Fri, 14 Mar 2014 07:04:59 +0000"  >&lt;p&gt;From the logs it looks like some OST became very slow for some reason (overloaded?); there are signs of MDT requests taking a long time (in the OST logs).&lt;br/&gt;
The &quot;all logs&quot; file sadly does not contain any logs for dc2mds01 from what I can see, so it is a bit harder to fully verify this theory, but it is quite plausible.&lt;br/&gt;
We can see the same picture in your earlier log excerpt, too.&lt;br/&gt;
The code that waits for the objects to be allocated does so while holding an ldlm lock on the file and its parent, so conflicting accesses also block, in a cascading manner.&lt;/p&gt;

&lt;p&gt;We can also see a bunch of clients dropping off (no pings happening); I am not sure why. Some network problems?&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Mar 12 09:26:17 dc2oss01 kernel: : LustreError: 20182:0:(filter_io_26.c:781:filter_commitrw_write()) dc2-OST000a: slow direct_io 82s
Mar 12 09:26:18 dc2oss01 kernel: : LustreError: 22824:0:(filter_io_26.c:781:filter_commitrw_write()) dc2-OST000a: slow direct_io 82s
Mar 12 09:26:18 dc2oss01 kernel: : LustreError: 22824:0:(filter_io_26.c:781:filter_commitrw_write()) Skipped 4 previous similar messages
Mar 12 09:26:20 dc2oss01 kernel: : LustreError: 22816:0:(filter_io_26.c:781:filter_commitrw_write()) dc2-OST000a: slow direct_io 73s
Mar 12 09:26:20 dc2oss01 kernel: : LustreError: 22816:0:(filter_io_26.c:781:filter_commitrw_write()) Skipped 2 previous similar messages
Mar 12 09:26:24 dc2oss01 kernel: : LustreError: 19592:0:(filter_io_26.c:781:filter_commitrw_write()) dc2-OST000a: slow direct_io 79s
Mar 12 09:26:24 dc2oss01 kernel: : LustreError: 19592:0:(filter_io_26.c:781:filter_commitrw_write()) Skipped 4 previous similar messages
Mar 12 09:26:28 dc2oss01 kernel: : LustreError: 20421:0:(filter_io_26.c:781:filter_commitrw_write()) dc2-OST000a: slow direct_io 92s
Mar 12 09:26:28 dc2oss01 kernel: : LustreError: 20421:0:(filter_io_26.c:781:filter_commitrw_write()) Skipped 8 previous similar messages
Mar 12 09:26:28 dc2oss01 kernel: : LustreError: 15568:0:(filter.c:3459:filter_setattr()) dc2-OST000a: slow i_alloc_sem 89s
Mar 12 09:26:28 dc2oss01 kernel: : LustreError: 20225:0:(filter_io.c:752:filter_preprw_write()) dc2-OST000a: slow i_alloc_sem 63s
Mar 12 09:26:34 dc2oss01 kernel: : Lustre: dc2-OST000a: Export ffff88043011b000 already connecting from 149.165.226.218@tcp
Mar 12 09:26:56 dc2oss01 kernel: : LustreError: 22815:0:(filter_io_26.c:781:filter_commitrw_write()) dc2-OST000a: slow direct_io 111s
Mar 12 09:26:56 dc2oss01 kernel: : LustreError: 22815:0:(filter_io_26.c:781:filter_commitrw_write()) Skipped 11 previous similar messages
Mar 12 09:26:57 dc2oss01 kernel: : Lustre: dc2-OST000a: Export ffff88043011b000 already connecting from 149.165.226.218@tcp
Mar 12 09:27:11 dc2oss09 kernel: : Lustre: dc2-OST0056: Client adc9bfa5-dcf7-5426-7ed1-a1ddfa428cf5 (at 129.79.10.14@tcp) reconnecting
Mar 12 09:27:11 dc2oss09 kernel: : Lustre: Skipped 120 previous similar messages
Mar 12 09:27:11 dc2oss09 kernel: : Lustre: 16243:0:(ldlm_lib.c:952:target_handle_connect()) dc2-OST0056: connection from adc9bfa5-dcf7-5426-7ed1-a1ddfa428cf5@129.79.10.14@tcp t0 exp ffff881829c07800 cur 1394630831 last 1394630831
Mar 12 09:27:11 dc2oss09 kernel: : Lustre: 16243:0:(ldlm_lib.c:952:target_handle_connect()) Skipped 130 previous similar messages
Mar 12 09:27:15 dc2oss01 kernel: : LustreError: 22911:0:(filter_io_26.c:781:filter_commitrw_write()) dc2-OST000a: slow direct_io 130s
Mar 12 09:27:15 dc2oss01 kernel: : LustreError: 22911:0:(filter_io_26.c:781:filter_commitrw_write()) Skipped 7 previous similar messages
Mar 12 09:27:15 dc2oss01 kernel: : LustreError: 15219:0:(filter_io.c:871:filter_preprw_write()) dc2-OST000a: slow start_page_write 129s
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="79329" author="liang" created="Fri, 14 Mar 2014 13:50:52 +0000"  >&lt;p&gt;Hi, I have a few questions; it would be helpful if you could provide information on them:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;when client(s) got evicted, I assume other clients in the same network are still functional, right?&lt;/li&gt;
	&lt;li&gt;if they are still functional, could you run LNet stress tests (LNet selftest) between another client and the servers?&lt;/li&gt;
	&lt;li&gt;when this issue appears, could you check that all the routers (I assume there are four?) are still working:
	&lt;ul&gt;
		&lt;li&gt;check router status on the evicted clients&lt;/li&gt;
		&lt;li&gt;check router status on other clients in the same network.&lt;/li&gt;
		&lt;li&gt;check router status on server&lt;/li&gt;
		&lt;li&gt;login routers to see if they are able to connect to other nodes.&lt;br/&gt;
Thanks!&lt;/li&gt;
	&lt;/ul&gt;
	&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="79359" author="isaac" created="Fri, 14 Mar 2014 18:46:49 +0000"  >&lt;p&gt;When running LNet selftest between client and servers, it makes sense to do it in 3 steps:&lt;br/&gt;
1. test between clients and routers&lt;br/&gt;
2. test between routers and servers&lt;br/&gt;
3. test between clients and servers&lt;/p&gt;
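&lt;p&gt;For reference, here is a minimal sketch of how any one of these steps could be driven directly with the lst utility. The NIDs are the example client and server NIDs from this ticket, and the exact test options (batch name, size, read vs. write) should be adapted to the site:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;export LST_SESSION=$$                  # selftest session id
lst new_session rw_test
lst add_group clients 396@gni          # the &quot;from&quot; side
lst add_group servers 10.10.0.1@o2ib   # the &quot;to&quot; side
lst add_batch bulk
lst add_test --batch bulk --from clients --to servers brw read size=1M
lst run bulk
lst stat clients servers               # watch bandwidth and RPC rates
lst end_session
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;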

&lt;p&gt;If either step 1 or 2 fails, there is no need to go on to step 3. If step 3 fails even though steps 1 and 2 passed, we know something is wrong with LNet routing, since both segments of the path have already been verified individually.&lt;/p&gt;</comment>
                            <comment id="79366" author="jupmille" created="Fri, 14 Mar 2014 19:45:44 +0000"  >&lt;p&gt;IB OSS - 10.10.0.1@o2ib&lt;br/&gt;
GNI/IB Router - 30@gni&lt;br/&gt;
GNI/IB Router - 10.10.0.51@o2ib&lt;br/&gt;
GNI Client - 396@gni&lt;/p&gt;

&lt;p&gt;1. test between client and router&lt;/p&gt;

&lt;p&gt;/opt/ddn/bin/lnet_selftest.sh -c &quot;396@gni&quot; -s &quot;30@gni&quot; -d 1:1 &amp;gt; /tmp/4616.client-router.lnet&lt;/p&gt;


&lt;p&gt;2. test between routers and servers&lt;/p&gt;

&lt;p&gt;/opt/ddn/bin/lnet_selftest.sh -c &quot;30@gni&quot; -s &quot;10.10.0.1@o2ib&quot; -d 1:1 &amp;gt; /tmp/4616.router-server.lnet&lt;/p&gt;


&lt;p&gt;3. test between clients and servers&lt;/p&gt;

&lt;p&gt;/opt/ddn/bin/lnet_selftest.sh -c &quot;396@gni&quot; -s &quot;10.10.0.1@o2ib&quot; -d 1:1 &amp;gt;&lt;br/&gt;
/tmp/4616.client-server.lnet&lt;/p&gt;</comment>
                            <comment id="79368" author="jupmille" created="Fri, 14 Mar 2014 19:57:43 +0000"  >&lt;p&gt;&amp;gt; when client(s) got evicted, I assume other clients in the same network are still functional, right?&lt;/p&gt;

&lt;p&gt;Yes, other clients remain functional&lt;/p&gt;

&lt;p&gt;&amp;gt; if they are still functional, could you run LNet stress tests (LNet selftest) between another client and servers&lt;/p&gt;

&lt;p&gt;Yes, an LNet selftest has been run from another client on that network, and the results are attached to this ticket&lt;/p&gt;

&lt;p&gt;&amp;gt; when this issue appeared, could you check all routers (I assume there are four?) are still working.&lt;/p&gt;

&lt;p&gt;There are four TCP&amp;lt;-&amp;gt;IB routers for the clusters that mount over TCP.&lt;br/&gt;
There are twenty-two GNI&amp;lt;-&amp;gt;IB routers for the Cray clients on GNI.&lt;/p&gt;

&lt;p&gt;The lnet selftests attached are for the GNI&amp;lt;-&amp;gt;IB router, IB server, GNI client.&lt;/p&gt;

&lt;p&gt;&amp;gt; check router status on the evicted clients&lt;/p&gt;

&lt;p&gt;All routers are in UP state&lt;/p&gt;

&lt;p&gt;&amp;gt; check router status on other clients in the same network.&lt;/p&gt;

&lt;p&gt;All routers are in UP state&lt;/p&gt;

&lt;p&gt;&amp;gt; check router status on server &lt;/p&gt;

&lt;p&gt;All routers are in UP state&lt;/p&gt;

&lt;p&gt;&amp;gt; login routers to see if they are able to connect to other nodes.&lt;/p&gt;

&lt;p&gt;GNI&amp;lt;-&amp;gt;IB routers are able to ping all GNI clients, all IB Lustre servers, including nodes in the bad state&lt;/p&gt;</comment>
                            <comment id="79378" author="isaac" created="Fri, 14 Mar 2014 21:14:27 +0000"  >&lt;p&gt;In 4616.client-server.lnet and 4616.router-server.lnet, there is a lot of output like:&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;R&amp;#93;&lt;/span&gt; Avg: inf      MB/s  Min: inf      MB/s  Max: inf      MB/s&lt;/p&gt;

&lt;p&gt;These don&apos;t look right, but I don&apos;t know why yet.&lt;/p&gt;</comment>
                            <comment id="79385" author="bfaccini" created="Fri, 14 Mar 2014 23:59:38 +0000"  >&lt;p&gt;Just an update to explain the debug session I ran earlier today with the IU team:&lt;/p&gt;

&lt;p&gt;     _ they first showed me &quot;live&quot; how easily they can reproduce the issue. It is interesting to note that only certain directories seem to reliably trigger it, while all others do not. When we did a simple &quot;ls&quot; in a directory known to trigger it, the command simply hung, and the client was later evicted and could no longer reconnect.&lt;/p&gt;

&lt;p&gt;     _ after that, I asked them to redo it from a fresh client. Before reproducing, I enabled full debug logs, grew the debug buffer size to 2M, and cleared the debug log on both the MDS and the client. A few seconds after we reproduced the &quot;ls&quot; hang, I dumped the debug log to /tmp/lustre-debug.bfi on both the MDS and the client, then force-crashed the client and took a full stack trace on the MDS (to be found in /var/log/messages).&lt;/p&gt;

&lt;p&gt;Here is the console output of the commands I ran on the client, which shows the &quot;ls&quot; PID:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@c4 ~]# lctl clear
[root@c4 ~]# date ; ls /N/dc2/scratch/smccaula &amp;amp;
Fri Mar 14 12:51:52 EDT 2014
[1] 70546
[root@c4 ~]# lctl dk /tmp/lustre-log.bfi
Debug log: 115742 lines, 115742 kept, 0 dropped, 0 bad.
[root@c4 ~]# sync
[root@c4 ~]# echo 1 &amp;gt; /proc/sys/kernel/sysrq
[root@c4 ~]# echo c &amp;gt; /proc/sys
sys/           sysrq-trigger  sysvipc/
[root@c4 ~]# echo c &amp;gt; /proc/sysrq-trigger
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After that, I requested via email that the IU team provide us with the following information:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;/tmp/lustre-log.bfi for both Client/c4 and MDS
/var/log/messages for both Client/c4 and MDS
crash-dump of Client + lustre-modules/kernel-debuginfo* rpms
LNET NIDs of Client+MDS
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="79396" author="bfaccini" created="Sat, 15 Mar 2014 01:16:24 +0000"  >&lt;p&gt;I checked the uploaded material, and it seems that the /tmp/lustre-log.bfi (full debug log) files for both Client/c4 and the MDS are missing, along with the kernel-debuginfo* rpms. They are not available on the IU dropbox either.&lt;/p&gt;

&lt;p&gt;It seems that other debug logs have been provided, but their dates are more than two hours later than the debug session I ran.&lt;/p&gt;

&lt;p&gt;Also, does Client/c4 run a stock 2.6.32-358.11.1.el6.x86_64 kernel?&lt;/p&gt;
</comment>
                            <comment id="79399" author="pjones" created="Sat, 15 Mar 2014 03:19:26 +0000"  >&lt;p&gt;Just to keep all the info in the ticket for those monitoring&lt;/p&gt;

&lt;p&gt;&quot;We uploaded the missing /tmp/lustre-log.bfi to the ftp server as well as&lt;br/&gt;
the box site.&lt;/p&gt;

&lt;p&gt;I am not sure why the timestamps are different. Are they different inside&lt;br/&gt;
the file (log entry timestamps) or is the file&apos;s timestamp off?&lt;/p&gt;

&lt;p&gt;I believe the kernel-debuginfo* files you are looking for are inside the&lt;br/&gt;
tar files we uploaded. If they are not, we can re-upload.&lt;/p&gt;

&lt;p&gt;I will check with our Cray admin to get the answer on the c4 client&lt;br/&gt;
version.&lt;/p&gt;

&lt;p&gt;Please let me know if you need anything else.&quot;&lt;/p&gt;

&lt;p&gt;then&lt;/p&gt;

&lt;p&gt;&quot;Concerning the time stamps being off, we had to perform the debug/crash&lt;br/&gt;
procedure multiple times before we properly captured all the debug info.&lt;br/&gt;
So it makes sense that the time stamps are later, as we performed the&lt;br/&gt;
actions later.&lt;/p&gt;

&lt;p&gt;I will let you know about the client version when we get an update.&quot;&lt;/p&gt;

&lt;p&gt;then &lt;/p&gt;

&lt;p&gt;&quot;c4 is running 1.8.9 with the ll_getname patch.&quot;&lt;/p&gt;
</comment>
                            <comment id="79400" author="jupmille" created="Sat, 15 Mar 2014 04:01:28 +0000"  >&lt;p&gt;the lustre-log.bfi from the MDS has been uploaded to the ftp site and to box.iu.edu as dc2mds01.lustre-log.bfi.gz&lt;/p&gt;

&lt;p&gt;the kernel-debuginfo* rpms are in box.iu.edu as well, and also in the client and mds tgz files on the ftp site.&lt;/p&gt;

&lt;p&gt;Concerning the time stamps being off, we had to perform the debug/crash procedure multiple times before we properly captured all the debug info. So it makes sense that the time stamps are later, as we performed the&lt;br/&gt;
actions later. &lt;/p&gt;

&lt;p&gt;The full debug information provided is from a host named oss30, which is TCP client that connects through the dc2xfer&lt;span class=&quot;error&quot;&gt;&amp;#91;01-04&amp;#93;&lt;/span&gt; routers.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@oss30 ~&amp;#93;&lt;/span&gt;# rpm -qa | grep lustre&lt;br/&gt;
lustre-client-2.1.6-2.6.32_358.11.1.el6.x86_64.x86_64&lt;br/&gt;
lustre-client-modules-2.1.6-2.6.32_358.11.1.el6.x86_64.x86_64&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@oss30 ~&amp;#93;&lt;/span&gt;# uname -a&lt;br/&gt;
Linux oss30 2.6.32-358.11.1.el6.x86_64 #1 SMP Wed May 15 10:48:38 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux&lt;/p&gt;

&lt;p&gt;there is some confusion about the c4 client that needs to be cleared up. c4 was the client we were using for interactive debugging with Bruno, but its debug dumps got lost. We followed the same procedure for crashing and collecting on another client, oss30.&lt;/p&gt;</comment>
                            <comment id="79401" author="di.wang" created="Sat, 15 Mar 2014 06:00:54 +0000"  >&lt;p&gt;A few pieces of information from a discussion with Justin:&lt;/p&gt;

&lt;p&gt;1. When the &quot;ls&quot; failure happens, it hits only certain directories, i.e. the client only gets evicted when listing those particular directories. But which directories are affected might change after each restart.&lt;/p&gt;

&lt;p&gt;2. The FS was created with Lustre 2.1.3; right now it is 2.1.6 plus some patches. But the client is 1.8.6, and it mounts at least 2 filesystems: one is 2.1.6, the other is 1.8. (Justin, please confirm?)&lt;/p&gt;

&lt;p&gt;3. The &quot;ls&quot; failure happens on all clients, both 1.8 and 2.1.3 clients.&lt;/p&gt;

&lt;p&gt;4. Before the failure happens on these directories, no correlated applications have been identified that touch them. But this needs confirmation.&lt;/p&gt;
</comment>
                            <comment id="79402" author="pjones" created="Sat, 15 Mar 2014 06:09:39 +0000"  >&lt;p&gt;Thanks Di. For #2 the patches are &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2943&quot; title=&quot;LBUG mdt_reconstruct_open()) ASSERTION( (!(rc &amp;lt; 0) || (lustre_msg_get_transno(req-&amp;gt;rq_repmsg) == 0)) )&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2943&quot;&gt;&lt;del&gt;LU-2943&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3645&quot; title=&quot;Interop 2.1.5 &amp;lt;--&amp;gt; 2.4 Write operations during failover errors out instead of stalling&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3645&quot;&gt;&lt;del&gt;LU-3645&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="79403" author="bfaccini" created="Sat, 15 Mar 2014 09:09:39 +0000"  >&lt;p&gt;Justin: Thanks for the clarification. But the &quot;date/time&quot; confusion I seemed to introduce comes from the fact that I was expecting the files/logs/crash dump from the session on Client/c4 and did not find any clear indication that you re-ran the session on another node!&lt;/p&gt;</comment>
                            <comment id="79405" author="bfaccini" created="Sat, 15 Mar 2014 14:35:08 +0000"  >&lt;p&gt;Andreas, Di, feel free to correct/add!&lt;br/&gt;
The MDS full-stacks output in the syslog/messages shows a lot of threads stuck doing renames. Actually, we wonder whether this could be related to the problem.&lt;br/&gt;
To confirm this, it would be nice if you could also provide the syslog/messages file from the MDS (dc2mds01 ?) that was in use while I was connected at IU and we reproduced with Client/c4.&lt;/p&gt;

&lt;p&gt;Also, thanks for providing the dc2mds01.lustre-log.bfi from my debug session, but I still don&apos;t find the Client/c4 one in what is on ftp or dropbox. In fact, I may be wrong, but I think you did not actually provide anything from my debug session while I was connected at IU, only what you got from your own re-run. Can you help there?&lt;/p&gt;</comment>
                            <comment id="79406" author="jupmille" created="Sat, 15 Mar 2014 16:33:20 +0000"  >&lt;p&gt;Bruno: You are right, we should have been clearer about the logs that were uploaded. We unfortunately were unable to save the full stack output from c4. We recreated the full stack output from another very similar client following the same procedures, and that is what we shared with everyone. Those logs are from oss30, which I described previously.&lt;/p&gt;

&lt;p&gt;We do have the dc2mds01 syslog/messages from the time of our interactive debug session while using c4. I have uploaded that log to box as 4616.mds01syslog.20130314.txt. In fact, that file is the entire March 14 messages file for dc2mds01, so it will have the c4 client debug as well as our later oss30 client debug.&lt;/p&gt;
</comment>
                            <comment id="79407" author="adilger" created="Sat, 15 Mar 2014 23:54:29 +0000"  >&lt;p&gt;Looking through the mds_messages.log file, and filtering out the many other messages which are being printed there (e.g. spurious llog -ENOSPC error that would be fixed by &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2336&quot; title=&quot;mds llog_write_rec  &amp;#39;No space left on device&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2336&quot;&gt;&lt;del&gt;LU-2336&lt;/del&gt;&lt;/a&gt; patch &lt;a href=&quot;http://review.whamcloud.com/5146&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/5146&lt;/a&gt;, patch needed for b1_8 clients to fix &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-532&quot; title=&quot;2.1 does not filter extended attributes list based on permissions, showing entire list at all times&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-532&quot;&gt;&lt;del&gt;LU-532&lt;/del&gt;&lt;/a&gt; &lt;a href=&quot;http://review.whamcloud.com/1161&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/1161&lt;/a&gt; warning, etc), it seems the first real problem in that log is Wednesday morning (2014-03-12 09:31):&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Mar 12 09:31:29 dc2mds01 kernel: : Lustre: Service thread pid 11131 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Mar 12 09:31:29 dc2mds01 kernel: : Pid: 11131, comm: mdt_40
Mar 12 09:31:29 dc2mds01 kernel: :
Mar 12 09:31:29 dc2mds01 kernel: : Call Trace:
Mar 12 09:31:29 dc2mds01 kernel: : [&amp;lt;ffffffffa04b260e&amp;gt;] cfs_waitq_wait+0xe/0x10 [libcfs]
Mar 12 09:31:29 dc2mds01 kernel: : [&amp;lt;ffffffffa0a2c50a&amp;gt;] lov_create+0xbaa/0x1400 [lov]
Mar 12 09:31:29 dc2mds01 kernel: : [&amp;lt;ffffffffa0c0f90e&amp;gt;] mdd_lov_create+0x9ee/0x1ba0 [mdd]
Mar 12 09:31:29 dc2mds01 kernel: : [&amp;lt;ffffffffa0c21871&amp;gt;] mdd_create+0xf81/0x1a90 [mdd]
Mar 12 09:31:29 dc2mds01 kernel: : [&amp;lt;ffffffffa0d4b3f7&amp;gt;] cml_create+0x97/0x250 [cmm]
Mar 12 09:31:29 dc2mds01 kernel: : [&amp;lt;ffffffffa0caa19d&amp;gt;] mdt_reint_open+0x1abd/0x28f0 [mdt]
Mar 12 09:31:29 dc2mds01 kernel: : [&amp;lt;ffffffffa0c92c81&amp;gt;] mdt_reint_rec+0x41/0xe0 [mdt]
Mar 12 09:31:29 dc2mds01 kernel: : [&amp;lt;ffffffffa0c89ed4&amp;gt;] mdt_reint_internal+0x544/0x8e0 [mdt]
Mar 12 09:31:29 dc2mds01 kernel: : [&amp;lt;ffffffffa0c8a53d&amp;gt;] mdt_intent_reint+0x1ed/0x530 [mdt]
Mar 12 09:31:29 dc2mds01 kernel: : [&amp;lt;ffffffffa0c88c09&amp;gt;] mdt_intent_policy+0x379/0x690 [mdt]
Mar 12 09:31:29 dc2mds01 kernel: : [&amp;lt;ffffffffa06ff391&amp;gt;] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
Mar 12 09:31:29 dc2mds01 kernel: : [&amp;lt;ffffffffa07251dd&amp;gt;] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
Mar 12 09:31:29 dc2mds01 kernel: : [&amp;lt;ffffffffa0c89586&amp;gt;] mdt_enqueue+0x46/0x130 [mdt]
Mar 12 09:31:29 dc2mds01 kernel: : [&amp;lt;ffffffffa0c7e772&amp;gt;] mdt_handle_common+0x932/0x1750 [mdt]
Mar 12 09:31:29 dc2mds01 kernel: : [&amp;lt;ffffffffa0c7f665&amp;gt;] mdt_regular_handle+0x15/0x20 [mdt]
Mar 12 09:31:29 dc2mds01 kernel: : [&amp;lt;ffffffffa0753b9e&amp;gt;] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
Mar 12 09:31:29 dc2mds01 kernel: : LustreError: dumping log to /tmp/lustre-log.1394631089.11131
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After this quite a number of other MDS threads also reported waiting and dumping stacks in lov_create().  Later on Wednesday the MDS was rebooted (17:04 and again at 18:02).  The same problem with create threads waiting in lov_create() looks to have happened again a few hours later that evening (2014-03-12 23:01).  An increasing number of threads appeared to become permanently blocked on the 13th and 14th, until the MDS was taken offline on the 14th, but I suspect those are just a byproduct of the other threads holding locks on the affected directories.&lt;/p&gt;

&lt;p&gt;It isn&apos;t totally clear, but this may be caused by the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3645&quot; title=&quot;Interop 2.1.5 &amp;lt;--&amp;gt; 2.4 Write operations during failover errors out instead of stalling&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3645&quot;&gt;&lt;del&gt;LU-3645&lt;/del&gt;&lt;/a&gt; patch &lt;a href=&quot;http://review.whamcloud.com/7559&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7559&lt;/a&gt; that is (AFAIK) installed at IU, but not landed on b2_1 yet.  Fan Yong&apos;s comments on that patch are:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;osc_create_with_fakereq() is a blocking operation; calling it within the interpret environment may cause the ptlrpcd to be blocked. Such behaviour should be avoided, otherwise the whole system may hang.&lt;/p&gt;

&lt;p&gt;If you really need to perform some blocking operation, you can have it processed by some non-ptlrpcd environment. For example, a dedicated thread that just handles the async_create callback.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I can&apos;t say for sure that this is the problem since the stacks of the other threads on the MDS are not available, but it does seem possible that this patch is causing problems.  It might also just be that the OSTs are overloaded, but I never saw any messages in the rest of the console logs for any of the MDS threads that reported being blocked, and if the OSTs were just busy the threads would eventually have become unstuck.  Given the number of other messages being printed, I would have expected &lt;em&gt;some&lt;/em&gt; messages to be printed by those threads (there were other unrelated messages occasionally being printed before the threads dumped stack).&lt;/p&gt;

&lt;p&gt;I would recommend that this patch be reverted from the MDS for now, to see if it is the cause of the problem.  Should the problem happen again, in particular either of the messages:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Lustre: Service thread pid NNNNN was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later.
LustreError: ### lock timed out (enqueued at NNNNNNNNNN 200s ago); not entering recovery in server code, just going back to sleep
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;indicating that there are hung threads on the MDS, please dump all of the process stacks on the MDS via &quot;echo t &amp;gt; /proc/sysrq-trigger&quot; and attach the stack traces from the console here.  It may be that only a partial dump of the stack traces is shown in /var/log/messages because of the large volume of messages, so preferably they can be captured directly from a serial console.&lt;/p&gt;
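&lt;p&gt;A minimal sketch of that capture procedure (the output path here is only an example):&lt;/p&gt;

```shell
# On the MDS, as root: ask the kernel to dump the stack of every
# task into the kernel ring buffer
echo t > /proc/sysrq-trigger
# Save the ring buffer right away, before the (often very large)
# stack dump scrolls out of it
dmesg > /tmp/mds-stacks.txt
```

&lt;p&gt;If the dump saved this way is truncated, the kernel log buffer is too small for the full set of stacks; capturing the output over a serial console, as suggested above, avoids that limit.&lt;/p&gt;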

&lt;p&gt;At this point, I don&apos;t &lt;em&gt;think&lt;/em&gt; that the clients have anything to do with the current problem, they are just suffering from the hung MDS threads not being able to service requests, or blocking on locks held by the hung MDS service threads.  I don&apos;t think the current problem relates to the one reported earlier in this ticket. &lt;/p&gt;

&lt;p&gt;The original problem may potentially be caused by the MDS being busy and not processing RPC requests in a timely manner and evicting clients too soon.  I saw a number of cases in the logs where a client was evicted due to a timed-out lock callback, and then a few seconds later the MDS rejected a cancel request from the same client.  That kind of problem might be fixed by increasing the AT max timeout, or more recent changes to high-priority request handling.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Mar 10 17:52:11 dc2mds01 kernel: : LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 149.165.238.1@tcp  ns: mdt-ffff881837d4e000 lock: ffff881d6af0c000/0x78c70fdb970d7e0 lrc: 3/0,0 mode: PR/PR res: 8589943942/77226 bits 0x3 rrc: 13 type: IBT flags: 0x4000020 remote: 0xe213f1ffc604946c expref: 54330 pid: 11173 timeout: 4825691675
Mar 10 17:52:12 dc2mds01 kernel: : LustreError: 20011:0:(ldlm_lockd.c:2074:ldlm_cancel_handler()) ldlm_cancel from 149.165.238.1@tcp arrived at 1394488331 with bad export cookie 543933852487261410
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
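&lt;p&gt;As a sketch, raising the AT max timeout mentioned above could look like the following on the servers (the value 600 is only an example, not a recommendation tuned for this system):&lt;/p&gt;

```shell
# Show the current adaptive-timeout maximum (in seconds)
lctl get_param at_max
# Raise it so a busy MDS gets more time before lock callbacks
# expire and clients are evicted
lctl set_param at_max=600
```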

&lt;p&gt;It would probably be good to split these issues into two separate bugs.&lt;/p&gt;</comment>
                            <comment id="79408" author="yong.fan" created="Sun, 16 Mar 2014 03:03:10 +0000"  >&lt;p&gt;mdt_reint (including the create case) and mdt_getattr are handled via the same request portal on the MDS, so as long as the create operations are blocked because of the hung interpret, the &quot;ls -l&quot; on the client will still be affected even though other rdpg/seq service threads are idle. This matches what you have seen: the file name list is available (via the rdpg service), but you cannot get the owner/mode information (via mdt_getattr/mdt_getattr_name).&lt;/p&gt;

&lt;p&gt;So I agree with Andreas to drop the patch &lt;a href=&quot;http://review.whamcloud.com/7559&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7559&lt;/a&gt; as a temporary solution.&lt;/p&gt;</comment>
                            <comment id="83099" author="pjones" created="Fri, 2 May 2014 20:11:21 +0000"  >&lt;p&gt;As per DDN, it is ok to close this ticket.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="14298" name="4616.client-router.lnet" size="15531" author="jupmille" created="Fri, 14 Mar 2014 19:45:44 +0000"/>
                            <attachment id="14297" name="4616.client-server.lnet" size="4209" author="jupmille" created="Fri, 14 Mar 2014 19:45:44 +0000"/>
                            <attachment id="14299" name="4616.router-server.lnet" size="4209" author="jupmille" created="Fri, 14 Mar 2014 19:45:44 +0000"/>
                            <attachment id="14099" name="Flock_lock_logs.txt" size="4834" author="manish" created="Wed, 12 Feb 2014 16:00:06 +0000"/>
                            <attachment id="14097" name="IBT flags_logs.txt" size="24185" author="manish" created="Wed, 12 Feb 2014 16:00:06 +0000"/>
                            <attachment id="14207" name="Server_System_logs_from_MDS_OSS_03-032014.txt" size="219384" author="manish" created="Mon, 3 Mar 2014 22:25:18 +0000"/>
                            <attachment id="14290" name="aprun-DebugLogs.tar.gz" size="384722" author="manish" created="Thu, 13 Mar 2014 16:29:11 +0000"/>
                            <attachment id="14244" name="backtrace.tar.gz" size="226506" author="manish" created="Fri, 7 Mar 2014 23:28:38 +0000"/>
                            <attachment id="14289" name="march_12_svr.log.gz" size="250476" author="manish" created="Thu, 13 Mar 2014 16:29:11 +0000"/>
                            <attachment id="14098" name="returned -22 from blocking AST_logs.txt" size="15794" author="manish" created="Wed, 12 Feb 2014 16:00:06 +0000"/>
                            <attachment id="14245" name="rt_119855.c6.lustre_logs" size="55924" author="manish" created="Fri, 7 Mar 2014 23:28:38 +0000"/>
                            <attachment id="14095" name="server_log_2_10_2014.log" size="10866" author="manish" created="Wed, 12 Feb 2014 16:00:06 +0000"/>
                            <attachment id="14246" name="server_logs_03_05.log" size="1540112" author="manish" created="Fri, 7 Mar 2014 23:28:38 +0000"/>
                            <attachment id="14096" name="xfer_client_2_10_2014.log" size="13171" author="manish" created="Wed, 12 Feb 2014 16:00:06 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwewv:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>12632</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10021"><![CDATA[2]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>