<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:44:07 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4591] Related cl_lock failures on master/2.5</title>
                <link>https://jira.whamcloud.com/browse/LU-4591</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We&apos;ve seen a number of different cl_lock bugs in master/2.4.1/2.5 that we believe are related.&lt;/p&gt;

&lt;p&gt;We have seen these in master as recently as two weeks ago (current master cannot run mmstress at all due to a problem in the paging code).  &lt;/p&gt;

&lt;p&gt;These bugs are not present in Intel released 2.4.0, but we&apos;ve seen them in Cray 2.4.1 and 2.5 (which do not track precisely with the Intel versions of 2.4.1 and 2.5).&lt;/p&gt;

&lt;p&gt;We&apos;ve seen all of these bugs during our general-purpose testing. We believe they&apos;re related because all of them are easily reproduced by running multiple copies of mmstress from the Linux Test Project (mtest05; I will attach the source) on multiple nodes, and none of them seems to be present in 2.4.0. (At least, none of them reproduces in that context.)&lt;/p&gt;

&lt;p&gt;Not all of the stack traces below are from runs on master (sorry - it&apos;s not what I&apos;ve got handy), but all of the listed bugs have been reproduced on master:&lt;/p&gt;

&lt;p&gt;General protection fault in osc_lock_detach (this one seems to be the most common):&lt;br/&gt;
&amp;#8212;&lt;br/&gt;
&amp;gt; 20:00:29 Pid: 3120, comm: ldlm_bl_04 Tainted: P            3.0.82-0.7.9_1.0000.7690-cray_gem_c #1  &lt;br/&gt;
&amp;gt; 20:00:30 RIP: 0010:&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0635586&amp;gt;&amp;#93;&lt;/span&gt;  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0635586&amp;gt;&amp;#93;&lt;/span&gt; osc_lock_detach+0x46/0x180 &lt;span class=&quot;error&quot;&gt;&amp;#91;osc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 20:00:30 RSP: 0000:ffff880204065cc0  EFLAGS: 00010206&lt;br/&gt;
&amp;gt; 20:00:30 RAX: 0000000000005050 RBX: 5a5a5a5a5a5a5a5a RCX: ffff880204065cc0&lt;br/&gt;
&amp;gt; 20:00:30 RDX: ffff880204065cc0 RSI: ffff880203c38b50 RDI: ffffffffa066ba00&lt;br/&gt;
&amp;gt; 20:00:30 RBP: ffff880204065cf0 R08: 0000000000000020 R09: ffffffff8136fd38&lt;br/&gt;
&amp;gt; 20:00:30 R10: 0000000000000400 R11: 0000000000000009 R12: ffff880203c38b50&lt;br/&gt;
&amp;gt; 20:00:30 R13: 0000000000000000 R14: ffff88020e33ab40 R15: ffff880201fd23e0&lt;br/&gt;
&amp;gt; 20:00:30 FS:  00002aaaaec54700(0000) GS:ffff88021fcc0000(0000) knlGS:0000000000000000&lt;br/&gt;
&amp;gt; 20:00:30 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b&lt;br/&gt;
&amp;gt; 20:00:30 CR2: 00002aaaad101000 CR3: 000000020439d000 CR4: 00000000000407e0&lt;br/&gt;
&amp;gt; 20:00:30 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000&lt;br/&gt;
&amp;gt; 20:00:30 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400&lt;br/&gt;
&amp;gt; 20:00:30 Process ldlm_bl_04 (pid: 3120, threadinfo ffff880204064000, task ffff88020e73d7e0)&lt;br/&gt;
&amp;gt; 20:00:30 Stack:&lt;br/&gt;
&amp;gt; 20:00:30 ffff880204065cc0 0000000000000000 ffff880203c38b50 0000000000000000&lt;br/&gt;
&amp;gt; 20:00:30 ffff88020e33ab40 ffff880201fd23e0 ffff880204065d60 ffffffffa0635a7a&lt;br/&gt;
&amp;gt; 20:00:30 ffff880201fd23e0 ffff880150f57ed0 0000000001fd23e0 ffff880150f57ed0&lt;br/&gt;
&amp;gt; 20:00:30 Call Trace:&lt;br/&gt;
&amp;gt; 20:00:30 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0635a7a&amp;gt;&amp;#93;&lt;/span&gt; osc_lock_cancel+0xca/0x410 &lt;span class=&quot;error&quot;&gt;&amp;#91;osc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 20:00:30 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02b874d&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_cancel0+0x6d/0x150 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 20:00:30 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02b948b&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_cancel+0x13b/0x140 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 20:00:30 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0636e6c&amp;gt;&amp;#93;&lt;/span&gt; osc_ldlm_blocking_ast+0x20c/0x330 &lt;span class=&quot;error&quot;&gt;&amp;#91;osc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 20:00:30 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa03c5334&amp;gt;&amp;#93;&lt;/span&gt; ldlm_handle_bl_callback+0xd4/0x430 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 20:00:30 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa03c588c&amp;gt;&amp;#93;&lt;/span&gt; ldlm_bl_thread_main+0x1fc/0x420 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 20:00:30 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8106610e&amp;gt;&amp;#93;&lt;/span&gt; kthread+0x9e/0xb0&lt;br/&gt;
&amp;gt; 20:00:30 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81363ff4&amp;gt;&amp;#93;&lt;/span&gt; kernel_thread_helper+0x4/0x10&lt;br/&gt;
&amp;gt; 20:00:30 Code: f8 66 66 66 66 90 49 89 f4 49 89 ff 48 c7 c7 00 ba 66 a0 e8 0d cb d2 e0 49 8b 5c 24 28 48 85 db 74 7b 49 c7 44 24 28 00 00 00 00 &lt;br/&gt;
&amp;gt; 20:00:30 c7 83 60 01 00 00 00 00 00 00 49 c7 44 24 70 00 00 00 00 fe &lt;br/&gt;
&amp;gt; 20:00:30 RIP  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0635586&amp;gt;&amp;#93;&lt;/span&gt; osc_lock_detach+0x46/0x180 &lt;span class=&quot;error&quot;&gt;&amp;#91;osc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 20:00:30 RSP &amp;lt;ffff880204065cc0&amp;gt;&lt;br/&gt;
&amp;gt; 20:00:30 ---[ end trace 317fa078a5344509 ]---&lt;/p&gt;

&lt;p&gt;&amp;#8212;&lt;/p&gt;

&lt;p&gt;(osc_lock.c:1134:osc_lock_enqueue()) ASSERTION( ols-&amp;gt;ols_state == OLS_NEW ) failed: Impossible state: 6&lt;br/&gt;
&amp;#8212;&lt;br/&gt;
&amp;gt; 23:39:52 LustreError: 6456:0:(osc_lock.c:1134:osc_lock_enqueue()) ASSERTION( ols-&amp;gt;ols_state == OLS_NEW ) failed: Impossible state: 6&lt;br/&gt;
&amp;gt; 23:39:52 LustreError: 6456:0:(osc_lock.c:1134:osc_lock_enqueue()) LBUG&lt;br/&gt;
&amp;gt; 23:39:52 Pid: 6456, comm: mmstress&lt;br/&gt;
&amp;gt; 23:39:52 Call Trace:&lt;br/&gt;
&amp;gt; 23:39:52 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810065b1&amp;gt;&amp;#93;&lt;/span&gt; try_stack_unwind+0x161/0x1a0&lt;br/&gt;
&amp;gt; 23:39:52 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81004dd9&amp;gt;&amp;#93;&lt;/span&gt; dump_trace+0x89/0x440&lt;br/&gt;
&amp;gt; 23:39:52 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa016d897&amp;gt;&amp;#93;&lt;/span&gt; libcfs_debug_dumpstack+0x57/0x80 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 23:39:52 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa016dde7&amp;gt;&amp;#93;&lt;/span&gt; lbug_with_loc+0x47/0xc0 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 23:39:52 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa063b8ef&amp;gt;&amp;#93;&lt;/span&gt; osc_lock_enqueue+0x74f/0x8d0 &lt;span class=&quot;error&quot;&gt;&amp;#91;osc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 23:39:52 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c0dab&amp;gt;&amp;#93;&lt;/span&gt; cl_enqueue_try+0xfb/0x320 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 23:39:52 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06d13dd&amp;gt;&amp;#93;&lt;/span&gt; lov_lock_enqueue+0x1fd/0x880 &lt;span class=&quot;error&quot;&gt;&amp;#91;lov&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 23:39:52 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c0dab&amp;gt;&amp;#93;&lt;/span&gt; cl_enqueue_try+0xfb/0x320 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 23:39:52 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c1c8f&amp;gt;&amp;#93;&lt;/span&gt; cl_enqueue_locked+0x7f/0x1f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 23:39:52 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c287e&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_request+0x7e/0x270 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 23:39:52 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c7db4&amp;gt;&amp;#93;&lt;/span&gt; cl_io_lock+0x394/0x5c0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 23:39:52 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c807a&amp;gt;&amp;#93;&lt;/span&gt; cl_io_loop+0x9a/0x1a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 23:39:52 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa078bf78&amp;gt;&amp;#93;&lt;/span&gt; ll_fault+0x308/0x4e0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 23:39:52 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81111ce6&amp;gt;&amp;#93;&lt;/span&gt; __do_fault+0x76/0x570&lt;br/&gt;
&amp;gt; 23:39:52 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81112284&amp;gt;&amp;#93;&lt;/span&gt; handle_pte_fault+0xa4/0xcc0&lt;br/&gt;
&amp;gt; 23:39:52 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8111304e&amp;gt;&amp;#93;&lt;/span&gt; handle_mm_fault+0x1ae/0x240&lt;br/&gt;
&amp;gt; 23:39:52 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81026c6f&amp;gt;&amp;#93;&lt;/span&gt; do_page_fault+0x18f/0x420&lt;br/&gt;
&amp;gt; 23:39:52 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff813628cf&amp;gt;&amp;#93;&lt;/span&gt; page_fault+0x1f/0x30&lt;br/&gt;
&amp;gt; 23:39:52 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;00000000200007ea&amp;gt;&amp;#93;&lt;/span&gt; 0x200007ea&lt;br/&gt;
&amp;gt; 23:39:52 Kernel panic - not syncing: LBUG&lt;br/&gt;
&amp;#8212;&lt;/p&gt;


&lt;p&gt;(lov_lock.c:1092:lov_lock_link_find()) ASSERTION( cl_lock_is_mutexed(sub-&amp;gt;lss_cl.cls_lock) ) failed:&lt;br/&gt;
&amp;#8212;&lt;br/&gt;
2013-12-06T21:36:44.086298-06:00 c0-0c0s11n2 LustreError: 9051:0:(lov_lock.c:1092:lov_lock_link_find()) ASSERTION( cl_lock_is_mutexed(sub-&amp;gt;lss_cl.cls_lock) ) failed:&lt;br/&gt;
2013-12-06T21:36:44.111468-06:00 c0-0c0s11n2 LustreError: 9051:0:(lov_lock.c:1092:lov_lock_link_find()) LBUG&lt;br/&gt;
2013-12-06T21:36:44.111497-06:00 c0-0c0s11n2 Pid: 9051, comm: mmstress&lt;br/&gt;
2013-12-06T21:36:44.111536-06:00 c0-0c0s11n2 Call Trace:&lt;br/&gt;
2013-12-06T21:36:44.136717-06:00 c0-0c0s11n2 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81005db9&amp;gt;&amp;#93;&lt;/span&gt; try_stack_unwind+0x169/0x1b0&lt;br/&gt;
2013-12-06T21:36:44.136757-06:00 c0-0c0s11n2 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81004849&amp;gt;&amp;#93;&lt;/span&gt; dump_trace+0x89/0x450&lt;br/&gt;
2013-12-06T21:36:44.136787-06:00 c0-0c0s11n2 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02128d7&amp;gt;&amp;#93;&lt;/span&gt; libcfs_debug_dumpstack+0x57/0x80 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
2013-12-06T21:36:44.169446-06:00 c0-0c0s11n2 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0212e37&amp;gt;&amp;#93;&lt;/span&gt; lbug_with_loc+0x47/0xc0 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
2013-12-06T21:36:44.169474-06:00 c0-0c0s11n2 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0778f7a&amp;gt;&amp;#93;&lt;/span&gt; lov_lock_link_find+0x16a/0x170 &lt;span class=&quot;error&quot;&gt;&amp;#91;lov&amp;#93;&lt;/span&gt;&lt;br/&gt;
2013-12-06T21:36:44.169502-06:00 c0-0c0s11n2 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa07799cf&amp;gt;&amp;#93;&lt;/span&gt; lov_sublock_adopt+0x8f/0x370 &lt;span class=&quot;error&quot;&gt;&amp;#91;lov&amp;#93;&lt;/span&gt;&lt;br/&gt;
2013-12-06T21:36:44.194648-06:00 c0-0c0s11n2 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa077cba7&amp;gt;&amp;#93;&lt;/span&gt; lov_lock_init_raid0+0x637/0xd50 &lt;span class=&quot;error&quot;&gt;&amp;#91;lov&amp;#93;&lt;/span&gt;&lt;br/&gt;
2013-12-06T21:36:44.194687-06:00 c0-0c0s11n2 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0773c0e&amp;gt;&amp;#93;&lt;/span&gt; lov_lock_init+0x1e/0x60 &lt;span class=&quot;error&quot;&gt;&amp;#91;lov&amp;#93;&lt;/span&gt;&lt;br/&gt;
2013-12-06T21:36:44.194704-06:00 c0-0c0s11n2 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa036961a&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_hold_mutex+0x32a/0x640 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2013-12-06T21:36:44.219871-06:00 c0-0c0s11n2 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0369ab2&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_request+0x62/0x270 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2013-12-06T21:36:44.219901-06:00 c0-0c0s11n2 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa036f0ce&amp;gt;&amp;#93;&lt;/span&gt; cl_io_lock+0x39e/0x5d0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2013-12-06T21:36:44.245175-06:00 c0-0c0s11n2 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa036f3a2&amp;gt;&amp;#93;&lt;/span&gt; cl_io_loop+0xa2/0x1b0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2013-12-06T21:36:44.245216-06:00 c0-0c0s11n2 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0838760&amp;gt;&amp;#93;&lt;/span&gt; ll_page_mkwrite+0x280/0x680 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
2013-12-06T21:36:44.245228-06:00 c0-0c0s11n2 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81118019&amp;gt;&amp;#93;&lt;/span&gt; __do_fault+0xf9/0x5b0&lt;br/&gt;
2013-12-06T21:36:44.270380-06:00 c0-0c0s11n2 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81118574&amp;gt;&amp;#93;&lt;/span&gt; handle_pte_fault+0xa4/0xcd0&lt;br/&gt;
2013-12-06T21:36:44.270420-06:00 c0-0c0s11n2 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8111934e&amp;gt;&amp;#93;&lt;/span&gt; handle_mm_fault+0x1ae/0x240&lt;br/&gt;
2013-12-06T21:36:44.270431-06:00 c0-0c0s11n2 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8138a915&amp;gt;&amp;#93;&lt;/span&gt; do_page_fault+0x1e5/0x4a0&lt;br/&gt;
2013-12-06T21:36:44.295606-06:00 c0-0c0s11n2 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8138778f&amp;gt;&amp;#93;&lt;/span&gt; page_fault+0x1f/0x30&lt;br/&gt;
2013-12-06T21:36:44.295648-06:00 c0-0c0s11n2 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;00000000200007fa&amp;gt;&amp;#93;&lt;/span&gt; 0x200007fa&lt;br/&gt;
&amp;#8212;&lt;/p&gt;


&lt;p&gt;General protection fault in cl_lock_put (possibly the same issue as &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4558&quot; title=&quot;Crash in cl_lock_put on racer&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4558&quot;&gt;&lt;del&gt;LU-4558&lt;/del&gt;&lt;/a&gt;?):&lt;br/&gt;
&amp;#8212;&lt;br/&gt;
&amp;gt; 19:55:11 general protection fault: 0000 &amp;#91;#1&amp;#93; SMP &lt;br/&gt;
&amp;gt; 19:55:11 CPU 19 &lt;br/&gt;
&amp;gt; 19:55:11 Modules linked in: mic xpmem dvspn(P) dvsof(P) dvsutil(P) dvsipc(P) dvsipc_lnet(P) dvsproc(P) bpmcdmod nic_compat cmsr lmv mgc lustre lov osc mdc fid fld kgnilnd ptlrpc obdclass lnet lvfs sha1_generic md5 libcfs ib_core pcie_link_bw_monitor kdreg gpcd_ari ipogif_ari kgni_ari hwerr(P) rca hss_os(P) heartbeat simplex(P) ghal_ari craytrace&lt;br/&gt;
&amp;gt; 19:55:11 Pid: 8720, comm: ldlm_bl_02 Tainted: P            3.0.80-0.5.1_1.0501.7664-cray_ari_c #1 Cray Inc. Cascade/Cascade&lt;br/&gt;
&amp;gt; 19:55:11 RIP: 0010:&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa036b553&amp;gt;&amp;#93;&lt;/span&gt;  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa036b553&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_put+0x103/0x410 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 19:55:11 RSP: 0018:ffff8807ef0fdc70  EFLAGS: 00010246&lt;br/&gt;
&amp;gt; 19:55:11 RAX: 0000000000000001 RBX: 5a5a5a5a5a5a5a5a RCX: ffff8807f30a7af8&lt;br/&gt;
&amp;gt; 19:55:11 RDX: ffffffffa038da5b RSI: 5a5a5a5a5a5a5a5a RDI: ffff8807f1d169c8&lt;br/&gt;
&amp;gt; 19:55:11 RBP: ffff8807ef0fdc90 R08: ffffffffa037ebc0 R09: 00000000000002f8&lt;br/&gt;
&amp;gt; 19:55:11 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000&lt;br/&gt;
&amp;gt; 19:55:11 R13: ffff8807f1d169c8 R14: ffff8807f30a7af8 R15: 0000000000000001&lt;br/&gt;
&amp;gt; 19:55:11 FS:  00007ffff7ff6700(0000) GS:ffff88087f6c0000(0000) knlGS:0000000000000000&lt;br/&gt;
&amp;gt; 19:55:11 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b&lt;br/&gt;
&amp;gt; 19:55:11 CR2: 00005555557af328 CR3: 0000000001661000 CR4: 00000000001406e0&lt;br/&gt;
&amp;gt; 19:55:11 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000&lt;br/&gt;
&amp;gt; 19:55:11 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400&lt;br/&gt;
&amp;gt; 19:55:11 Process ldlm_bl_02 (pid: 8720, threadinfo ffff8807ef0fc000, task ffff88083c2ae0c0)&lt;br/&gt;
&amp;gt; 19:55:11 Stack:&lt;br/&gt;
&amp;gt; 19:55:11 ffff88083bf7e840 0000000000000000 ffff8807f1d169c8 ffff8807f30a7af8&lt;br/&gt;
&amp;gt; 19:55:11 ffff8807ef0fdcf0 ffffffffa07194ca ffff88083bf7e840 ffff88050caec000&lt;br/&gt;
&amp;gt; 19:55:11 0000000000000001 0000000000000000 ffff8807ef0fdcf0 ffff88050caec000&lt;br/&gt;
&amp;gt; 19:55:11 Call Trace:&lt;br/&gt;
&amp;gt; 19:55:11 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa07194ca&amp;gt;&amp;#93;&lt;/span&gt; osc_ldlm_blocking_ast+0xaa/0x330 &lt;span class=&quot;error&quot;&gt;&amp;#91;osc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 19:55:11 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0460aeb&amp;gt;&amp;#93;&lt;/span&gt; ldlm_cancel_callback+0x6b/0x190 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 19:55:11 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa046f41a&amp;gt;&amp;#93;&lt;/span&gt; ldlm_cli_cancel_local+0x8a/0x470 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 19:55:11 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0473ec0&amp;gt;&amp;#93;&lt;/span&gt; ldlm_cli_cancel+0x60/0x370 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 19:55:11 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa071819e&amp;gt;&amp;#93;&lt;/span&gt; osc_lock_cancel+0xfe/0x1c0 &lt;span class=&quot;error&quot;&gt;&amp;#91;osc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 19:55:11 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0368145&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_cancel0+0x75/0x160 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 19:55:11 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0368e8b&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_cancel+0x13b/0x140 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 19:55:11 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa071962c&amp;gt;&amp;#93;&lt;/span&gt; osc_ldlm_blocking_ast+0x20c/0x330 &lt;span class=&quot;error&quot;&gt;&amp;#91;osc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 19:55:11 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa04777d4&amp;gt;&amp;#93;&lt;/span&gt; ldlm_handle_bl_callback+0xd4/0x430 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 19:55:11 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0477d5c&amp;gt;&amp;#93;&lt;/span&gt; ldlm_bl_thread_main+0x22c/0x450 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 19:55:11 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8138e894&amp;gt;&amp;#93;&lt;/span&gt; kernel_thread_helper+0x4/0x10&lt;br/&gt;
&amp;gt; 19:55:11 Code: 00 00 00 c7 05 23 e3 06 00 01 00 00 00 4c 8b 45 08 8b 13 48 89 d9 48 c7 c6 60 82 38 a0 48 c7 c7 40 98 3d a0 31 c0 e8 5d 6a eb ff &amp;lt;f0&amp;gt; ff 0b 0f 94 c0 84 c0 74 06 83 7b 50 06 74 7d f6 05 5a b4 ed &lt;br/&gt;
&amp;gt; 19:55:11 RIP  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa036b553&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_put+0x103/0x410 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 19:55:11 RSP &amp;lt;ffff8807ef0fdc70&amp;gt;&lt;br/&gt;
&amp;#8212;&lt;/p&gt;


&lt;p&gt;(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice-&amp;gt;cls_lock) ) failed:&lt;br/&gt;
&amp;#8212;&lt;br/&gt;
&amp;gt; 04:59:03 LustreError: 19083:0:(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice-&amp;gt;cls_lock) ) failed:&lt;br/&gt;
&amp;gt; 04:59:03 LustreError: 19083:0:(lovsub_lock.c:103:lovsub_lock_state()) LBUG&lt;br/&gt;
&amp;gt; 04:59:03 Pid: 19083, comm: mmstress&lt;br/&gt;
&amp;gt; 04:59:03 Call Trace:&lt;br/&gt;
&amp;gt; 04:59:03 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810065b1&amp;gt;&amp;#93;&lt;/span&gt; try_stack_unwind+0x161/0x1a0&lt;br/&gt;
&amp;gt; 04:59:03 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81004dd9&amp;gt;&amp;#93;&lt;/span&gt; dump_trace+0x89/0x440&lt;br/&gt;
&amp;gt; 04:59:03 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0168897&amp;gt;&amp;#93;&lt;/span&gt; libcfs_debug_dumpstack+0x57/0x80 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 04:59:03 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0168de7&amp;gt;&amp;#93;&lt;/span&gt; lbug_with_loc+0x47/0xc0 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 04:59:03 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06d458e&amp;gt;&amp;#93;&lt;/span&gt; lovsub_lock_state+0x19e/0x1a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lov&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 04:59:03 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02b7b80&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_state_signal+0x60/0x160 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 04:59:03 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02b7d4d&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_state_set+0xcd/0x1a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 04:59:03 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02bbe0b&amp;gt;&amp;#93;&lt;/span&gt; cl_enqueue_try+0x14b/0x320 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 04:59:03 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06cc3dd&amp;gt;&amp;#93;&lt;/span&gt; lov_lock_enqueue+0x1fd/0x880 &lt;span class=&quot;error&quot;&gt;&amp;#91;lov&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 04:59:03 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02bbdbb&amp;gt;&amp;#93;&lt;/span&gt; cl_enqueue_try+0xfb/0x320 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 04:59:03 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02bcc9f&amp;gt;&amp;#93;&lt;/span&gt; cl_enqueue_locked+0x7f/0x1f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 04:59:03 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02bd88e&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_request+0x7e/0x270 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 04:59:03 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c2dc4&amp;gt;&amp;#93;&lt;/span&gt; cl_io_lock+0x394/0x5c0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 04:59:03 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c308a&amp;gt;&amp;#93;&lt;/span&gt; cl_io_loop+0x9a/0x1a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 04:59:03 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0786ee8&amp;gt;&amp;#93;&lt;/span&gt; ll_fault+0x308/0x4e0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 04:59:03 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81111d16&amp;gt;&amp;#93;&lt;/span&gt; __do_fault+0x76/0x570&lt;br/&gt;
&amp;gt; 04:59:03 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff811122b4&amp;gt;&amp;#93;&lt;/span&gt; handle_pte_fault+0xa4/0xcc0&lt;br/&gt;
&amp;gt; 04:59:03 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8136290f&amp;gt;&amp;#93;&lt;/span&gt; page_fault+0x1f/0x30&lt;br/&gt;
&amp;gt; 04:59:03 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;00000000200007ea&amp;gt;&amp;#93;&lt;/span&gt; 0x200007ea&lt;br/&gt;
&amp;#8212;&lt;/p&gt;

&lt;p&gt;General protection fault in cl_lock_delete0 (called from cl_lock_delete):&lt;br/&gt;
&amp;#8212;&lt;br/&gt;
&amp;gt; 22:37:45 Pid: 6333, comm: ldlm_bl_13 Tainted: P            3.0.82-0.7.9_1.0502.7742-cray_gem_c #1  &lt;br/&gt;
&amp;gt; 22:37:45 RIP: 0010:&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02b8eb0&amp;gt;&amp;#93;&lt;/span&gt;  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02b8eb0&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_delete0+0x190/0x1f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 22:37:45 RSP: 0018:ffff8803d0403bc0  EFLAGS: 00010296&lt;br/&gt;
&amp;gt; 22:37:45 RAX: 5a5a5a5a5a5a5a5a RBX: 5a5a5a5a5a5a5a42 RCX: ffff8803d7509ca8&lt;br/&gt;
&amp;gt; 22:37:45 RDX: ffff8803d54b6800 RSI: ffff8803fbfdceb8 RDI: ffff8803d54b67c0&lt;br/&gt;
&amp;gt; 22:37:45 RBP: ffff8803d0403be0 R08: ffff8803d55a8858 R09: 0000000000000000&lt;br/&gt;
&amp;gt; 22:37:45 R10: 0000000000000023 R11: 0000000000000000 R12: ffff8803d55a8818&lt;br/&gt;
&amp;gt; 22:37:45 R13: ffff8803d55a8810 R14: ffff880803ddcdb8 R15: 0000000000000001&lt;br/&gt;
&amp;gt; 22:37:45 FS:  000000004013d880(0000) GS:ffff88081fcc0000(0000) knlGS:0000000000000000&lt;br/&gt;
&amp;gt; 22:37:46 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b&lt;br/&gt;
&amp;gt; 22:37:46 CR2: 00002aaaaab38000 CR3: 0000000803db3000 CR4: 00000000000407e0&lt;br/&gt;
&amp;gt; 22:37:46 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000&lt;br/&gt;
&amp;gt; 22:37:46 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400&lt;br/&gt;
&amp;gt; 22:37:46 Process ldlm_bl_13 (pid: 6333, threadinfo ffff8803d0402000, task ffff880403064040)&lt;br/&gt;
&amp;gt; 22:37:46 Stack:&lt;br/&gt;
&amp;gt; 22:37:46 ffff8803d55a8810 ffff880803ddcdb8 ffff880803ddcdb8 ffff8803d55a8810&lt;br/&gt;
&amp;gt; 22:37:46 ffff8803d0403c00 ffffffffa02b9063 ffff8801e05fbb50 0000000000000000&lt;br/&gt;
&amp;gt; 22:37:46 ffff8803d0403c60 ffffffffa0636e7a ffff8801e05fbb50 ffff8803d54b67c0&lt;br/&gt;
&amp;gt; 22:37:46 Call Trace:&lt;br/&gt;
&amp;gt; 22:37:46 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02b9063&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_delete+0x153/0x1a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 22:37:46 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0636e7a&amp;gt;&amp;#93;&lt;/span&gt; osc_ldlm_blocking_ast+0x21a/0x330 &lt;span class=&quot;error&quot;&gt;&amp;#91;osc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 22:37:46 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa03adabb&amp;gt;&amp;#93;&lt;/span&gt; ldlm_cancel_callback+0x6b/0x190 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 22:37:46 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa03bc7ca&amp;gt;&amp;#93;&lt;/span&gt; ldlm_cli_cancel_local+0x8a/0x470 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 22:37:46 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa03c11bb&amp;gt;&amp;#93;&lt;/span&gt; ldlm_cli_cancel+0x6b/0x380 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 22:37:46 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0635c47&amp;gt;&amp;#93;&lt;/span&gt; osc_lock_cancel+0x297/0x410 &lt;span class=&quot;error&quot;&gt;&amp;#91;osc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 22:37:46 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02b770d&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_cancel0+0x6d/0x150 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 22:37:46 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02b844b&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_cancel+0x13b/0x140 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 22:37:46 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0636e6c&amp;gt;&amp;#93;&lt;/span&gt; osc_ldlm_blocking_ast+0x20c/0x330 &lt;span class=&quot;error&quot;&gt;&amp;#91;osc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 22:37:46 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa03c4944&amp;gt;&amp;#93;&lt;/span&gt; ldlm_handle_bl_callback+0xd4/0x430 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 22:37:46 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa03c4e9c&amp;gt;&amp;#93;&lt;/span&gt; ldlm_bl_thread_main+0x1fc/0x420 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 22:37:46 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8106610e&amp;gt;&amp;#93;&lt;/span&gt; kthread+0x9e/0xb0&lt;br/&gt;
&amp;gt; 22:37:46 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81364034&amp;gt;&amp;#93;&lt;/span&gt; kernel_thread_helper+0x4/0x10&lt;br/&gt;
&amp;gt; 22:37:46 Code: 08 48 89 10 49 89 4d 18 49 89 4d 20 41 fe 44 24 5c 49 8b 45 10 4d 8d 65 08 49 39 c4 48 8d 58 e8 0f 84 be fe ff ff 0f 1f 44 00 00 &lt;br/&gt;
&amp;gt; 22:37:46 8b 43 10 48 8b 40 50 48 85 c0 74 08 48 89 de 4c 89 f7 ff d0 &lt;br/&gt;
&amp;gt; 22:37:46 RIP  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02b8eb0&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_delete0+0x190/0x1f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 22:37:46 RSP &amp;lt;ffff8803d0403bc0&amp;gt;&lt;br/&gt;
&amp;gt; 22:37:46 ---[ end trace cfc07b184a378ec7 ]---&lt;br/&gt;
&amp;#8212;&lt;/p&gt;

&lt;p&gt;One more issue is slightly different, but it is triggered by the same tests and is also not reproduced in 2.4.0.&lt;/p&gt;

&lt;p&gt;Lustre is getting stuck looping in cl_locks_prune. We have many cases of applications failing to exit, with processes stuck somewhere under cl_locks_prune. Here are two examples:&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81005eb9&amp;gt;&amp;#93;&lt;/span&gt; try_stack_unwind+0x169/0x1b0&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81004919&amp;gt;&amp;#93;&lt;/span&gt; dump_trace+0x89/0x450&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100591c&amp;gt;&amp;#93;&lt;/span&gt; show_trace_log_lvl+0x5c/0x80&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81005955&amp;gt;&amp;#93;&lt;/span&gt; show_trace+0x15/0x20&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff813bd54b&amp;gt;&amp;#93;&lt;/span&gt; dump_stack+0x79/0x84&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810c9bf9&amp;gt;&amp;#93;&lt;/span&gt; __rcu_pending+0x199/0x410&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810c9eda&amp;gt;&amp;#93;&lt;/span&gt; rcu_check_callbacks+0x6a/0x120&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8105cb96&amp;gt;&amp;#93;&lt;/span&gt; update_process_times+0x46/0x90&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81080a16&amp;gt;&amp;#93;&lt;/span&gt; tick_sched_timer+0x66/0xc0&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8107386f&amp;gt;&amp;#93;&lt;/span&gt; __run_hrtimer+0xcf/0x1d0&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81073bb7&amp;gt;&amp;#93;&lt;/span&gt; hrtimer_interrupt+0xe7/0x220&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff813c9dd9&amp;gt;&amp;#93;&lt;/span&gt; smp_apic_timer_interrupt+0x69/0x99&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff813c8d13&amp;gt;&amp;#93;&lt;/span&gt; apic_timer_interrupt+0x13/0x20&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0373a79&amp;gt;&amp;#93;&lt;/span&gt; cl_env_info+0x9/0x20 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa037dfc6&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_mutex_get+0x36/0xd0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0380319&amp;gt;&amp;#93;&lt;/span&gt; cl_locks_prune+0xd9/0x330 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa078dc91&amp;gt;&amp;#93;&lt;/span&gt; lov_delete_raid0+0xe1/0x3f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lov&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa078d059&amp;gt;&amp;#93;&lt;/span&gt; lov_object_delete+0x69/0x180 &lt;span class=&quot;error&quot;&gt;&amp;#91;lov&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa036bda5&amp;gt;&amp;#93;&lt;/span&gt; lu_object_free+0x85/0x1a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa036c2de&amp;gt;&amp;#93;&lt;/span&gt; lu_object_put+0xbe/0x370 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0374d1e&amp;gt;&amp;#93;&lt;/span&gt; cl_object_put+0xe/0x10 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa086803d&amp;gt;&amp;#93;&lt;/span&gt; cl_inode_fini+0xbd/0x2a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa082f160&amp;gt;&amp;#93;&lt;/span&gt; ll_clear_inode+0x2d0/0x930 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa082f84d&amp;gt;&amp;#93;&lt;/span&gt; ll_delete_inode+0x8d/0x1f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff811662f1&amp;gt;&amp;#93;&lt;/span&gt; evict+0x91/0x170&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81166762&amp;gt;&amp;#93;&lt;/span&gt; iput+0xc2/0x180&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa07ffaf6&amp;gt;&amp;#93;&lt;/span&gt; ll_d_iput+0x2e6/0x830 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8116106b&amp;gt;&amp;#93;&lt;/span&gt; d_kill+0xcb/0x130&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81164401&amp;gt;&amp;#93;&lt;/span&gt; dput+0xc1/0x190&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8114ec27&amp;gt;&amp;#93;&lt;/span&gt; fput+0x167/0x200&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8115475f&amp;gt;&amp;#93;&lt;/span&gt; do_execve_common+0x16f/0x300&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8115497f&amp;gt;&amp;#93;&lt;/span&gt; do_execve+0x3f/0x50&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100a89e&amp;gt;&amp;#93;&lt;/span&gt; sys_execve+0x4e/0x80&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff813c87bc&amp;gt;&amp;#93;&lt;/span&gt; stub_execve+0x6c/0xc0&lt;br/&gt;
&amp;gt; 05:09:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;00002aaaabe582c7&amp;gt;&amp;#93;&lt;/span&gt; 0x2aaaabe582c6&lt;/p&gt;

&lt;p&gt;&amp;gt; 05:18:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81005eb9&amp;gt;&amp;#93;&lt;/span&gt; try_stack_unwind+0x169/0x1b0&lt;br/&gt;
&amp;gt; 05:18:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81004919&amp;gt;&amp;#93;&lt;/span&gt; dump_trace+0x89/0x450&lt;br/&gt;
&amp;gt; 05:18:21 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100591c&amp;gt;&amp;#93;&lt;/span&gt; show_trace_log_lvl+0x5c/0x80&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81005955&amp;gt;&amp;#93;&lt;/span&gt; show_trace+0x15/0x20&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff813bd54b&amp;gt;&amp;#93;&lt;/span&gt; dump_stack+0x79/0x84&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810c9bf9&amp;gt;&amp;#93;&lt;/span&gt; __rcu_pending+0x199/0x410&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810c9eda&amp;gt;&amp;#93;&lt;/span&gt; rcu_check_callbacks+0x6a/0x120&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8105cb96&amp;gt;&amp;#93;&lt;/span&gt; update_process_times+0x46/0x90&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81080a16&amp;gt;&amp;#93;&lt;/span&gt; tick_sched_timer+0x66/0xc0&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8107386f&amp;gt;&amp;#93;&lt;/span&gt; __run_hrtimer+0xcf/0x1d0&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81073bb7&amp;gt;&amp;#93;&lt;/span&gt; hrtimer_interrupt+0xe7/0x220&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff813c9dd9&amp;gt;&amp;#93;&lt;/span&gt; smp_apic_timer_interrupt+0x69/0x99&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff813c8d13&amp;gt;&amp;#93;&lt;/span&gt; apic_timer_interrupt+0x13/0x20&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa037e419&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_get_trust+0x89/0x90 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0380307&amp;gt;&amp;#93;&lt;/span&gt; cl_locks_prune+0xc7/0x330 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa078dc91&amp;gt;&amp;#93;&lt;/span&gt; lov_delete_raid0+0xe1/0x3f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lov&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa078d059&amp;gt;&amp;#93;&lt;/span&gt; lov_object_delete+0x69/0x180 &lt;span class=&quot;error&quot;&gt;&amp;#91;lov&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa036bda5&amp;gt;&amp;#93;&lt;/span&gt; lu_object_free+0x85/0x1a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa036c2de&amp;gt;&amp;#93;&lt;/span&gt; lu_object_put+0xbe/0x370 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0374d1e&amp;gt;&amp;#93;&lt;/span&gt; cl_object_put+0xe/0x10 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa086803d&amp;gt;&amp;#93;&lt;/span&gt; cl_inode_fini+0xbd/0x2a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa082f160&amp;gt;&amp;#93;&lt;/span&gt; ll_clear_inode+0x2d0/0x930 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa082f84d&amp;gt;&amp;#93;&lt;/span&gt; ll_delete_inode+0x8d/0x1f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff811662f1&amp;gt;&amp;#93;&lt;/span&gt; evict+0x91/0x170&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81166762&amp;gt;&amp;#93;&lt;/span&gt; iput+0xc2/0x180&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa07ffaf6&amp;gt;&amp;#93;&lt;/span&gt; ll_d_iput+0x2e6/0x830 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8116106b&amp;gt;&amp;#93;&lt;/span&gt; d_kill+0xcb/0x130&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81164401&amp;gt;&amp;#93;&lt;/span&gt; dput+0xc1/0x190&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8114ec27&amp;gt;&amp;#93;&lt;/span&gt; fput+0x167/0x200&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8115475f&amp;gt;&amp;#93;&lt;/span&gt; do_execve_common+0x16f/0x300&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8115497f&amp;gt;&amp;#93;&lt;/span&gt; do_execve+0x3f/0x50&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100a89e&amp;gt;&amp;#93;&lt;/span&gt; sys_execve+0x4e/0x80&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff813c87bc&amp;gt;&amp;#93;&lt;/span&gt; stub_execve+0x6c/0xc0&lt;br/&gt;
&amp;gt; 05:18:22 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;00002aaaabe582c7&amp;gt;&amp;#93;&lt;/span&gt; 0x2aaaabe582c6&lt;br/&gt;
&amp;#8212;&lt;/p&gt;

&lt;p&gt;Sorry for the massive dump of information in one bug, but we strongly suspect these bugs have a single cause or several tightly related causes.&lt;/p&gt;

&lt;p&gt;With assistance from Xyratex, we&apos;ve singled out these patches, which landed between 2.4.0 and master, as possibly relevant:&lt;/p&gt;

&lt;p&gt;most suspicious:&lt;br/&gt;
18834a5 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3321&quot; title=&quot;2.x single thread/process throughput degraded from 1.8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3321&quot;&gt;&lt;del&gt;LU-3321&lt;/del&gt;&lt;/a&gt; clio: remove stackable cl_page completely&lt;br/&gt;
0a259bd &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3321&quot; title=&quot;2.x single thread/process throughput degraded from 1.8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3321&quot;&gt;&lt;del&gt;LU-3321&lt;/del&gt;&lt;/a&gt; clio: collapse layer of cl_page&lt;br/&gt;
less suspicious:&lt;br/&gt;
13079de &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3889&quot; title=&quot; LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock-&amp;gt;cll_state &amp;gt;= CLS_QUEUING ) &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3889&quot;&gt;&lt;del&gt;LU-3889&lt;/del&gt;&lt;/a&gt; osc: Allow lock to be canceled at ENQ time&lt;br/&gt;
7168ea8 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; clio: Do not shrink sublock at cancel&lt;br/&gt;
521335c &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3433&quot; title=&quot;Encountered a assertion for the ols_state being set to a impossible state&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3433&quot;&gt;&lt;del&gt;LU-3433&lt;/del&gt;&lt;/a&gt; clio: wrong cl_lock usage&lt;/p&gt;

&lt;p&gt;On my to-do list is testing master with some of these patches removed to see what effect, if any, this has on the bugs listed above.&lt;/p&gt;</description>
                <environment>Master clients on SLES11SP3, server version irrelevant (tested against 2.1,2.4,2.4.1,2.5).</environment>
        <key id="23022">LU-4591</key>
            <summary>Related cl_lock failures on master/2.5</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="jay">Jinshan Xiong</assignee>
                                    <reporter username="paf">Patrick Farrell</reporter>
                        <labels>
                            <label>MB</label>
                            <label>mn4</label>
                    </labels>
                <created>Wed, 5 Feb 2014 22:53:18 +0000</created>
                <updated>Wed, 12 Nov 2014 18:07:17 +0000</updated>
                            <resolved>Fri, 4 Apr 2014 16:24:09 +0000</resolved>
                                    <version>Lustre 2.5.0</version>
                    <version>Lustre 2.6.0</version>
                    <version>Lustre 2.4.2</version>
                                    <fixVersion>Lustre 2.6.0</fixVersion>
                    <fixVersion>Lustre 2.5.2</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>21</watches>
                                                                            <comments>
                            <comment id="76316" author="paf" created="Wed, 5 Feb 2014 23:17:22 +0000"  >&lt;p&gt;mmstress&lt;/p&gt;</comment>
                            <comment id="76319" author="jay" created="Wed, 5 Feb 2014 23:25:57 +0000"  >&lt;p&gt;Thanks for the work. I&apos;d suggest you try removing these patches:&lt;/p&gt;

&lt;p&gt;13079de &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3889&quot; title=&quot; LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock-&amp;gt;cll_state &amp;gt;= CLS_QUEUING ) &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3889&quot;&gt;&lt;del&gt;LU-3889&lt;/del&gt;&lt;/a&gt; osc: Allow lock to be canceled at ENQ time&lt;br/&gt;
7168ea8 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; clio: Do not shrink sublock at cancel&lt;br/&gt;
521335c &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3433&quot; title=&quot;Encountered a assertion for the ols_state being set to a impossible state&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3433&quot;&gt;&lt;del&gt;LU-3433&lt;/del&gt;&lt;/a&gt; clio: wrong cl_lock usage&lt;/p&gt;


&lt;p&gt;and see what will happen.&lt;/p&gt;

&lt;p&gt;Yes, I agree with you that this really looks like a single cause. Could you please collect some logs so that our guys can take a look?&lt;/p&gt;</comment>
                            <comment id="76320" author="jay" created="Wed, 5 Feb 2014 23:26:43 +0000"  >&lt;p&gt;Can you please tell me how you ran the test program? &lt;/p&gt;</comment>
                            <comment id="76321" author="paf" created="Wed, 5 Feb 2014 23:37:46 +0000"  >&lt;p&gt;Jinshan,&lt;/p&gt;

&lt;p&gt;Thanks for the quick response.&lt;/p&gt;

&lt;p&gt;About logs: Unfortunately, these problems don&apos;t happen with dlmtrace (or any of the other large debug flags - such as trace or rpctrace) enabled.  I created a special debug patch with all calls to cl_lock_trace under a special debug flag and was able to hit it with only that enabled.&lt;/p&gt;

&lt;p&gt;I should be able to get those logs for you tomorrow morning.  (Sorry I don&apos;t have them on hand, I had to clean out my old dumps/logs.)&lt;/p&gt;

&lt;p&gt;Just a heads up, Vitaly Fertman of Xyratex has been looking into this with us.&lt;/p&gt;

&lt;p&gt;About mmstress - It&apos;s executed with no arguments, but we started multiple copies with our workload manager:&lt;br/&gt;
e.g.&lt;br/&gt;
aprun -n 100 ./mmstress&lt;/p&gt;

&lt;p&gt;That runs it on 100 cores, which is enough that we see the problems pretty quickly (with debug at default).  The cores are allocated by placing NUM_CPUs jobs on each node, so if nodes had 8 cores, with 100 jobs we&apos;d get 12 nodes with 8 jobs each and one node with 4 jobs.&lt;/p&gt;</comment>
                            <comment id="76322" author="paf" created="Wed, 5 Feb 2014 23:38:57 +0000"  >&lt;p&gt;Also, I&apos;ll be testing with those patches removed tomorrow as well, system problems permitting.&lt;/p&gt;</comment>
                            <comment id="76332" author="aboyko" created="Thu, 6 Feb 2014 08:34:09 +0000"  >&lt;p&gt;I have added info from crash for one case.&lt;/p&gt;</comment>
                            <comment id="76411" author="paf" created="Thu, 6 Feb 2014 23:24:01 +0000"  >&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Edit&amp;#93;&lt;/span&gt; Sorry, I forgot a bit of background info.&lt;/p&gt;

&lt;p&gt;There are actually two &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; patches, rather than one (and there is also the earlier patch which was removed).  I decided to look at them both.&lt;/p&gt;

&lt;p&gt;Here&apos;s the list of patches I explored today:&lt;br/&gt;
13079de &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3889&quot; title=&quot; LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock-&amp;gt;cll_state &amp;gt;= CLS_QUEUING ) &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3889&quot;&gt;&lt;del&gt;LU-3889&lt;/del&gt;&lt;/a&gt; osc: Allow lock to be canceled at ENQ time (&lt;a href=&quot;http://review.whamcloud.com/#/c/8405/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/8405/&lt;/a&gt;)&lt;br/&gt;
7168ea8 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; clio: Do not shrink sublock at cancel (&lt;a href=&quot;http://review.whamcloud.com/#/c/7569/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/7569/&lt;/a&gt;)&lt;br/&gt;
521335c &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3433&quot; title=&quot;Encountered a assertion for the ols_state being set to a impossible state&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3433&quot;&gt;&lt;del&gt;LU-3433&lt;/del&gt;&lt;/a&gt; clio: wrong cl_lock usage (&lt;a href=&quot;http://review.whamcloud.com/#/c/6709/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/6709/&lt;/a&gt;)&lt;br/&gt;
I1ea629 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; lov: to not modify lov lock when sublock is canceled (&lt;a href=&quot;http://review.whamcloud.com/#/c/7841/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/7841/&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;I think the problem is a bad interaction between:&lt;br/&gt;
I1ea629 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; lov: to not modify lov lock when sublock is canceled (&lt;a href=&quot;http://review.whamcloud.com/#/c/7841/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/7841/&lt;/a&gt;)&lt;br/&gt;
and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3433&quot; title=&quot;Encountered a assertion for the ols_state being set to a impossible state&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3433&quot;&gt;&lt;del&gt;LU-3433&lt;/del&gt;&lt;/a&gt;/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3889&quot; title=&quot; LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock-&amp;gt;cll_state &amp;gt;= CLS_QUEUING ) &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3889&quot;&gt;&lt;del&gt;LU-3889&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I started by removing all four of the patches listed above from Cray 2.5, and confirmed that the problems do not occur.&lt;/p&gt;

&lt;p&gt;I tested the two &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; patches together, no problem.&lt;/p&gt;

&lt;p&gt;I tested &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3433&quot; title=&quot;Encountered a assertion for the ols_state being set to a impossible state&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3433&quot;&gt;&lt;del&gt;LU-3433&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3889&quot; title=&quot; LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock-&amp;gt;cll_state &amp;gt;= CLS_QUEUING ) &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3889&quot;&gt;&lt;del&gt;LU-3889&lt;/del&gt;&lt;/a&gt; together, no problem.&lt;/p&gt;

&lt;p&gt;I tested &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3433&quot; title=&quot;Encountered a assertion for the ols_state being set to a impossible state&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3433&quot;&gt;&lt;del&gt;LU-3433&lt;/del&gt;&lt;/a&gt; and both &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; patches, I saw LELUS-203 and several of the related bugs.&lt;/p&gt;

&lt;p&gt;I tested &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3889&quot; title=&quot; LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock-&amp;gt;cll_state &amp;gt;= CLS_QUEUING ) &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3889&quot;&gt;&lt;del&gt;LU-3889&lt;/del&gt;&lt;/a&gt; and both &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; patches, and again I saw LELUS-203 and several of the related bugs.&lt;/p&gt;

&lt;p&gt;I tested &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3889&quot; title=&quot; LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock-&amp;gt;cll_state &amp;gt;= CLS_QUEUING ) &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3889&quot;&gt;&lt;del&gt;LU-3889&lt;/del&gt;&lt;/a&gt;, &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3433&quot; title=&quot;Encountered a assertion for the ols_state being set to a impossible state&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3433&quot;&gt;&lt;del&gt;LU-3433&lt;/del&gt;&lt;/a&gt;, and 7168ea8 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; clio: Do not shrink sublock at cancel (&lt;a href=&quot;http://review.whamcloud.com/#/c/7569/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/7569/&lt;/a&gt;), no problem.&lt;/p&gt;

&lt;p&gt;I tested &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3889&quot; title=&quot; LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock-&amp;gt;cll_state &amp;gt;= CLS_QUEUING ) &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3889&quot;&gt;&lt;del&gt;LU-3889&lt;/del&gt;&lt;/a&gt;, &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3433&quot; title=&quot;Encountered a assertion for the ols_state being set to a impossible state&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3433&quot;&gt;&lt;del&gt;LU-3433&lt;/del&gt;&lt;/a&gt;, and I1ea629 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; lov: to not modify lov lock when sublock is canceled (&lt;a href=&quot;http://review.whamcloud.com/#/c/7841/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/7841/&lt;/a&gt;), and I hit (in 10 minutes or so - my other runs here were 30 minutes each):&lt;br/&gt;
GPF in osc_lock_detach&lt;br/&gt;
GPF in cl_lock_put&lt;br/&gt;
GPF in cl_lock_delete&lt;br/&gt;
osc_lock_enqueue()) ASSERTION( ols-&amp;gt;ols_state == OLS_NEW ) failed: Impossible state: 6&lt;br/&gt;
cl_locks_prune (no exit)&lt;/p&gt;

&lt;p&gt;I&apos;m going to test the two &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; patches together again, as the lack of failures with the lov patch in place surprises me.&lt;br/&gt;
I&apos;m also going to do some general stress testing with the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; lov patch removed and the other three patches in place.&lt;/p&gt;</comment>
                            <comment id="76416" author="paf" created="Thu, 6 Feb 2014 23:50:55 +0000"  >&lt;p&gt;With further testing of the two &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; patches together, I hit the GPF in osc_lock_detach.&lt;/p&gt;</comment>
                            <comment id="76430" author="jay" created="Fri, 7 Feb 2014 04:14:30 +0000"  >&lt;p&gt;Hi Patrick,&lt;/p&gt;

&lt;p&gt;Thank you for the information, I will take a look. Just to confirm, you still can&apos;t reproduce this problem with dlmtrace enabled, is that right?&lt;/p&gt;</comment>
                            <comment id="76432" author="paf" created="Fri, 7 Feb 2014 04:31:10 +0000"  >&lt;p&gt;Jinshan - Correct.&lt;/p&gt;

&lt;p&gt;The logs Alex Boyko provided are from a special debug patch with the calls to cl_lock_trace moved to their own debug flags (actually, two different ones).  Both of those flags were on, but nothing else.  So all calls to cl_lock_trace should be logged there.&lt;/p&gt;

&lt;p&gt;Also, in further stress testing (note that this is mmstress again) with the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; lov patch removed, I did hit one of the assertions we&apos;ve tentatively flagged as related, but so far only this one, and only after much longer testing than above (where I hit several bugs with that patch installed):&lt;br/&gt;
2014-02-06T20:33:06.582316-06:00 c0-0c2s0n0 LustreError: 4302:0:(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice-&amp;gt;cls_lock) ) failed:&lt;br/&gt;
2014-02-06T20:33:06.582369-06:00 c0-0c2s0n0 LustreError: 4302:0:(lovsub_lock.c:103:lovsub_lock_state()) LBUG&lt;br/&gt;
2014-02-06T20:33:06.582377-06:00 c0-0c2s0n0 Pid: 4302, comm: mmstress&lt;br/&gt;
2014-02-06T20:33:06.582383-06:00 c0-0c2s0n0 Call Trace:&lt;br/&gt;
2014-02-06T20:33:06.582398-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810065b1&amp;gt;&amp;#93;&lt;/span&gt; try_stack_unwind+0x161/0x1a0&lt;br/&gt;
2014-02-06T20:33:06.582414-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81004dd9&amp;gt;&amp;#93;&lt;/span&gt; dump_trace+0x89/0x440&lt;br/&gt;
2014-02-06T20:33:06.611611-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa016b897&amp;gt;&amp;#93;&lt;/span&gt; libcfs_debug_dumpstack+0x57/0x80 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.611627-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa016bde7&amp;gt;&amp;#93;&lt;/span&gt; lbug_with_loc+0x47/0xc0 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.611636-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06dc26e&amp;gt;&amp;#93;&lt;/span&gt; lovsub_lock_state+0x19e/0x1a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lov&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.611649-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02bbb80&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_state_signal+0x60/0x160 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.611674-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02bbd4d&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_state_set+0xcd/0x1a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.611685-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02bfe0b&amp;gt;&amp;#93;&lt;/span&gt; cl_enqueue_try+0x14b/0x320 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.641543-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06d46cb&amp;gt;&amp;#93;&lt;/span&gt; lov_lock_enqueue+0x1fb/0xf80 &lt;span class=&quot;error&quot;&gt;&amp;#91;lov&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.641565-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02bfdbb&amp;gt;&amp;#93;&lt;/span&gt; cl_enqueue_try+0xfb/0x320 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.641580-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c0c9f&amp;gt;&amp;#93;&lt;/span&gt; cl_enqueue_locked+0x7f/0x1f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.641587-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c188e&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_request+0x7e/0x270 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.641598-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c6dc4&amp;gt;&amp;#93;&lt;/span&gt; cl_io_lock+0x394/0x5c0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.641605-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c708a&amp;gt;&amp;#93;&lt;/span&gt; cl_io_loop+0x9a/0x1a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.670989-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa078f350&amp;gt;&amp;#93;&lt;/span&gt; ll_page_mkwrite+0x280/0x680 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.671014-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81112267&amp;gt;&amp;#93;&lt;/span&gt; __do_fault+0xe7/0x570&lt;br/&gt;
2014-02-06T20:33:06.671030-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81112794&amp;gt;&amp;#93;&lt;/span&gt; handle_pte_fault+0xa4/0xcc0&lt;br/&gt;
2014-02-06T20:33:06.671049-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8111355e&amp;gt;&amp;#93;&lt;/span&gt; handle_mm_fault+0x1ae/0x240&lt;br/&gt;
2014-02-06T20:33:06.671069-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81026caf&amp;gt;&amp;#93;&lt;/span&gt; do_page_fault+0x18f/0x420&lt;br/&gt;
2014-02-06T20:33:06.671080-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81367acf&amp;gt;&amp;#93;&lt;/span&gt; page_fault+0x1f/0x30&lt;br/&gt;
2014-02-06T20:33:06.671087-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;00000000200007ea&amp;gt;&amp;#93;&lt;/span&gt; 0x200007ea&lt;br/&gt;
2014-02-06T20:33:06.701413-06:00 c0-0c2s0n0 Kernel panic - not syncing: LBUG&lt;br/&gt;
2014-02-06T20:33:06.701431-06:00 c0-0c2s0n0 Pid: 4302, comm: mmstress Tainted: P            3.0.93-0.8.2_1.0000.7755-cray_gem_c #1&lt;br/&gt;
2014-02-06T20:33:06.701445-06:00 c0-0c2s0n0 Call Trace:&lt;br/&gt;
2014-02-06T20:33:06.701453-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810065b1&amp;gt;&amp;#93;&lt;/span&gt; try_stack_unwind+0x161/0x1a0&lt;br/&gt;
2014-02-06T20:33:06.701463-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81004dd9&amp;gt;&amp;#93;&lt;/span&gt; dump_trace+0x89/0x440&lt;br/&gt;
2014-02-06T20:33:06.701481-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100601c&amp;gt;&amp;#93;&lt;/span&gt; show_trace_log_lvl+0x5c/0x80&lt;br/&gt;
2014-02-06T20:33:06.701494-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81006055&amp;gt;&amp;#93;&lt;/span&gt; show_trace+0x15/0x20&lt;br/&gt;
2014-02-06T20:33:06.733566-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81364432&amp;gt;&amp;#93;&lt;/span&gt; dump_stack+0x79/0x84&lt;br/&gt;
2014-02-06T20:33:06.733584-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff813644d1&amp;gt;&amp;#93;&lt;/span&gt; panic+0x94/0x1da&lt;br/&gt;
2014-02-06T20:33:06.733592-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa016be4b&amp;gt;&amp;#93;&lt;/span&gt; lbug_with_loc+0xab/0xc0 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.733605-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06dc26e&amp;gt;&amp;#93;&lt;/span&gt; lovsub_lock_state+0x19e/0x1a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lov&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.733642-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02bbb80&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_state_signal+0x60/0x160 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.733655-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02bbd4d&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_state_set+0xcd/0x1a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.733663-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02bfe0b&amp;gt;&amp;#93;&lt;/span&gt; cl_enqueue_try+0x14b/0x320 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.761443-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06d46cb&amp;gt;&amp;#93;&lt;/span&gt; lov_lock_enqueue+0x1fb/0xf80 &lt;span class=&quot;error&quot;&gt;&amp;#91;lov&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.787133-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02bfdbb&amp;gt;&amp;#93;&lt;/span&gt; cl_enqueue_try+0xfb/0x320 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.787151-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c0c9f&amp;gt;&amp;#93;&lt;/span&gt; cl_enqueue_locked+0x7f/0x1f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.787181-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c188e&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_request+0x7e/0x270 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.787197-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c6dc4&amp;gt;&amp;#93;&lt;/span&gt; cl_io_lock+0x394/0x5c0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.812740-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c708a&amp;gt;&amp;#93;&lt;/span&gt; cl_io_loop+0x9a/0x1a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.812757-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa078f350&amp;gt;&amp;#93;&lt;/span&gt; ll_page_mkwrite+0x280/0x680 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-02-06T20:33:06.812786-06:00 c0-0c2s0n0 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81112267&amp;gt;&amp;#93;&lt;/span&gt; __do_fault+0xe7/0x570&lt;/p&gt;</comment>
                            <comment id="76436" author="paf" created="Fri, 7 Feb 2014 05:14:29 +0000"  >&lt;p&gt;Jinshan - Something that came up in a Cray discussion of the history of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3889&quot; title=&quot; LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock-&amp;gt;cll_state &amp;gt;= CLS_QUEUING ) &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3889&quot;&gt;&lt;del&gt;LU-3889&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When we opened &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3889&quot; title=&quot; LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock-&amp;gt;cll_state &amp;gt;= CLS_QUEUING ) &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3889&quot;&gt;&lt;del&gt;LU-3889&lt;/del&gt;&lt;/a&gt;, Oleg pointed us at the first of the two current patches for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; (ignoring the one which was removed for causing a regression).  That fixed our original reproducer that I opened &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3889&quot; title=&quot; LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock-&amp;gt;cll_state &amp;gt;= CLS_QUEUING ) &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3889&quot;&gt;&lt;del&gt;LU-3889&lt;/del&gt;&lt;/a&gt; with.&lt;/p&gt;

&lt;p&gt;Then Oleg reported Intel was still seeing the assertion from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3889&quot; title=&quot; LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock-&amp;gt;cll_state &amp;gt;= CLS_QUEUING ) &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3889&quot;&gt;&lt;del&gt;LU-3889&lt;/del&gt;&lt;/a&gt;, and noted there was a second patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt;.  We (and you) picked that one up, but then both Cray and Intel continued to see the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3889&quot; title=&quot; LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock-&amp;gt;cll_state &amp;gt;= CLS_QUEUING ) &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3889&quot;&gt;&lt;del&gt;LU-3889&lt;/del&gt;&lt;/a&gt; assertion.&lt;/p&gt;

&lt;p&gt;Then we found racer.sh could reproduce the problem, and you and several of the Xyratex guys worked out a patch for it, which was labeled with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3889&quot; title=&quot; LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock-&amp;gt;cll_state &amp;gt;= CLS_QUEUING ) &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3889&quot;&gt;&lt;del&gt;LU-3889&lt;/del&gt;&lt;/a&gt;.  After getting that patch, we have not seen the assertion from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3889&quot; title=&quot; LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock-&amp;gt;cll_state &amp;gt;= CLS_QUEUING ) &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3889&quot;&gt;&lt;del&gt;LU-3889&lt;/del&gt;&lt;/a&gt; again.&lt;/p&gt;

&lt;p&gt;So I don&apos;t think we have any hard evidence that the second &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; patch fixed any of the bugs we were looking at at the time.&lt;/p&gt;</comment>
                            <comment id="76467" author="paf" created="Fri, 7 Feb 2014 15:43:56 +0000"  >&lt;p&gt;One further thought...&lt;br/&gt;
As these bugs don&apos;t usually happen with any of the tracing options turned on, they&apos;re clearly timing-sensitive.  That brings up the possibility that reverting the second &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; patch is just changing timing, rather than the problem being in the patch itself.  Just something else to chew on.&lt;/p&gt;</comment>
                            <comment id="76616" author="jay" created="Mon, 10 Feb 2014 17:24:05 +0000"  >&lt;p&gt;working on this ...&lt;/p&gt;</comment>
                            <comment id="76630" author="paf" created="Mon, 10 Feb 2014 18:51:56 +0000"  >&lt;p&gt;Jinshan - I know from experience these bugs can be hard to replicate without a larger system.  If you&apos;ve got something you&apos;d like tested (including a debug patch), I can make time to test it on one of our in-house systems here where we can replicate the problems.&lt;/p&gt;</comment>
                            <comment id="76635" author="jay" created="Mon, 10 Feb 2014 19:32:39 +0000"  >&lt;p&gt;Patrick - will you please attach osc.ko here?&lt;/p&gt;</comment>
                            <comment id="76639" author="paf" created="Mon, 10 Feb 2014 19:49:03 +0000"  >&lt;p&gt;Jinshan - Can you be more specific?  osc.ko that goes with the logs that Alex Boyko provided?  If that one, he&apos;ll have to provide that, unfortunately.&lt;/p&gt;</comment>
                            <comment id="76643" author="jay" created="Mon, 10 Feb 2014 20:16:25 +0000"  >&lt;p&gt;I just want to know the source of osc_lock_detach+0x46 so that I will know which freed data it was trying to access.&lt;/p&gt;</comment>
                            <comment id="76644" author="jay" created="Mon, 10 Feb 2014 20:17:30 +0000"  >&lt;p&gt;Most likely it means the dlmlock was already freed; I just want to make sure.&lt;/p&gt;</comment>
                            <comment id="76645" author="paf" created="Mon, 10 Feb 2014 20:23:13 +0000"  >&lt;p&gt;Ah, OK.&lt;/p&gt;

&lt;p&gt;I&apos;ll attach another osc.ko in a moment where we get the crash at the same line...&lt;/p&gt;

&lt;p&gt;In case it&apos;s enough, here&apos;s a disassembly of osc_lock_detach and a line number from the ko I&apos;m going to attach:&lt;br/&gt;
crash&amp;gt; gdb list *(osc_lock_detach+46)&lt;br/&gt;
0xffffffffa063e56e is in osc_lock_detach (/usr/src/linux-3.0.93-0.8.2_1.0000.7747/include/linux/spinlock.h:285).&lt;br/&gt;
280     /usr/src/linux-3.0.93-0.8.2_1.0000.7747/include/linux/spinlock.h: No such file or directory.&lt;br/&gt;
        in /usr/src/linux-3.0.93-0.8.2_1.0000.7747/include/linux/spinlock.h&lt;/p&gt;


&lt;p&gt;crash&amp;gt; disassemble osc_lock_detach&lt;br/&gt;
Dump of assembler code for function osc_lock_detach:&lt;br/&gt;
   0xffffffffa063e540 &amp;lt;+0&amp;gt;:     push   %rbp&lt;br/&gt;
   0xffffffffa063e541 &amp;lt;+1&amp;gt;:     mov    %rsp,%rbp&lt;br/&gt;
   0xffffffffa063e544 &amp;lt;+4&amp;gt;:     sub    $0x30,%rsp&lt;br/&gt;
   0xffffffffa063e548 &amp;lt;+8&amp;gt;:     mov    %rbx,-0x28(%rbp)&lt;br/&gt;
   0xffffffffa063e54c &amp;lt;+12&amp;gt;:    mov    %r12,-0x20(%rbp)&lt;br/&gt;
   0xffffffffa063e550 &amp;lt;+16&amp;gt;:    mov    %r13,-0x18(%rbp)&lt;br/&gt;
   0xffffffffa063e554 &amp;lt;+20&amp;gt;:    mov    %r14,-0x10(%rbp)&lt;br/&gt;
   0xffffffffa063e558 &amp;lt;+24&amp;gt;:    mov    %r15,-0x8(%rbp)&lt;br/&gt;
   0xffffffffa063e55c &amp;lt;+28&amp;gt;:    data32 data32 data32 xchg %ax,%ax&lt;br/&gt;
   0xffffffffa063e561 &amp;lt;+33&amp;gt;:    mov    %rsi,%r12&lt;br/&gt;
   0xffffffffa063e564 &amp;lt;+36&amp;gt;:    mov    %rdi,%r15&lt;br/&gt;
   0xffffffffa063e567 &amp;lt;+39&amp;gt;:    mov    $0xffffffffa0674e00,%rdi&lt;br/&gt;
   0xffffffffa063e56e &amp;lt;+46&amp;gt;:    callq  0xffffffff81367280 &amp;lt;_raw_spin_lock&amp;gt;&lt;br/&gt;
   0xffffffffa063e573 &amp;lt;+51&amp;gt;:    mov    0x28(%r12),%rbx&lt;br/&gt;
   0xffffffffa063e578 &amp;lt;+56&amp;gt;:    test   %rbx,%rbx&lt;br/&gt;
   0xffffffffa063e57b &amp;lt;+59&amp;gt;:    je     0xffffffffa063e5f8 &amp;lt;osc_lock_detach+184&amp;gt;&lt;br/&gt;
   0xffffffffa063e57d &amp;lt;+61&amp;gt;:    movq   $0x0,0x28(%r12)&lt;br/&gt;
   0xffffffffa063e586 &amp;lt;+70&amp;gt;:    movq   $0x0,0x160(%rbx)&lt;br/&gt;
   0xffffffffa063e591 &amp;lt;+81&amp;gt;:    movq   $0x0,0x70(%r12)&lt;br/&gt;
   0xffffffffa063e59a &amp;lt;+90&amp;gt;:    incb   0x36860(%rip)        # 0xffffffffa0674e00 &amp;lt;osc_ast_guard&amp;gt;&lt;br/&gt;
   0xffffffffa063e5a0 &amp;lt;+96&amp;gt;:    mov    %rbx,%rdi&lt;br/&gt;
   0xffffffffa063e5a3 &amp;lt;+99&amp;gt;:    callq  0xffffffffa03ae030 &amp;lt;lock_res_and_lock&amp;gt;&lt;br/&gt;
   0xffffffffa063e5a8 &amp;lt;+104&amp;gt;:   mov    0x9c(%rbx),%eax&lt;br/&gt;
   0xffffffffa063e5ae &amp;lt;+110&amp;gt;:   cmp    0x98(%rbx),%eax&lt;br/&gt;
   0xffffffffa063e5b4 &amp;lt;+116&amp;gt;:   je     0xffffffffa063e600 &amp;lt;osc_lock_detach+192&amp;gt;&lt;br/&gt;
   0xffffffffa063e5b6 &amp;lt;+118&amp;gt;:   mov    %rbx,%rdi&lt;br/&gt;
   0xffffffffa063e5b9 &amp;lt;+121&amp;gt;:   callq  0xffffffffa03ae000 &amp;lt;unlock_res_and_lock&amp;gt;&lt;br/&gt;
   0xffffffffa063e5be &amp;lt;+126&amp;gt;:   testb  $0x2,0xa8(%r12)&lt;br/&gt;
   0xffffffffa063e5c7 &amp;lt;+135&amp;gt;:   je     0xffffffffa063e65c &amp;lt;osc_lock_detach+284&amp;gt;&lt;br/&gt;
   0xffffffffa063e5cd &amp;lt;+141&amp;gt;:   mov    %rbx,%rdi&lt;br/&gt;
   0xffffffffa063e5d0 &amp;lt;+144&amp;gt;:   callq  0xffffffffa03b0450 &amp;lt;ldlm_lock_put&amp;gt;&lt;br/&gt;
   0xffffffffa063e5d5 &amp;lt;+149&amp;gt;:   andb   $0xfd,0xa8(%r12)&lt;br/&gt;
   0xffffffffa063e5de &amp;lt;+158&amp;gt;:   mov    -0x28(%rbp),%rbx&lt;br/&gt;
   0xffffffffa063e5e2 &amp;lt;+162&amp;gt;:   mov    -0x20(%rbp),%r12&lt;br/&gt;
   0xffffffffa063e5e6 &amp;lt;+166&amp;gt;:   mov    -0x18(%rbp),%r13&lt;br/&gt;
   0xffffffffa063e5ea &amp;lt;+170&amp;gt;:   mov    -0x10(%rbp),%r14&lt;br/&gt;
   0xffffffffa063e5ee &amp;lt;+174&amp;gt;:   mov    -0x8(%rbp),%r15&lt;br/&gt;
   0xffffffffa063e5f2 &amp;lt;+178&amp;gt;:   leaveq &lt;br/&gt;
   0xffffffffa063e5f3 &amp;lt;+179&amp;gt;:   retq   &lt;br/&gt;
   0xffffffffa063e5f4 &amp;lt;+180&amp;gt;:   nopl   0x0(%rax)&lt;br/&gt;
   0xffffffffa063e5f8 &amp;lt;+184&amp;gt;:   incb   0x36802(%rip)        # 0xffffffffa0674e00 &amp;lt;osc_ast_guard&amp;gt;&lt;br/&gt;
   0xffffffffa063e5fe &amp;lt;+190&amp;gt;:   jmp    0xffffffffa063e5de &amp;lt;osc_lock_detach+158&amp;gt;&lt;br/&gt;
   0xffffffffa063e600 &amp;lt;+192&amp;gt;:   mov    $0xffffffffa066b440,%rsi&lt;br/&gt;
   0xffffffffa063e607 &amp;lt;+199&amp;gt;:   mov    %r15,%rdi&lt;br/&gt;
   0xffffffffa063e60a &amp;lt;+202&amp;gt;:   mov    0x8(%r12),%r14&lt;br/&gt;
   0xffffffffa063e60f &amp;lt;+207&amp;gt;:   callq  0xffffffffa02a8a50 &amp;lt;lu_context_key_get&amp;gt;&lt;br/&gt;
   0xffffffffa063e614 &amp;lt;+212&amp;gt;:   test   %rax,%rax&lt;br/&gt;
   0xffffffffa063e617 &amp;lt;+215&amp;gt;:   mov    %rax,%r13&lt;br/&gt;
   0xffffffffa063e61a &amp;lt;+218&amp;gt;:   je     0xffffffffa063e68e &amp;lt;osc_lock_detach+334&amp;gt;&lt;br/&gt;
   0xffffffffa063e61c &amp;lt;+220&amp;gt;:   mov    %r14,%rdi&lt;br/&gt;
   0xffffffffa063e61f &amp;lt;+223&amp;gt;:   callq  0xffffffffa02b3f50 &amp;lt;cl_object_attr_lock&amp;gt;&lt;br/&gt;
   0xffffffffa063e624 &amp;lt;+228&amp;gt;:   mov    0x38(%r14),%rax&lt;br/&gt;
   0xffffffffa063e628 &amp;lt;+232&amp;gt;:   mov    %rbx,%rdi&lt;br/&gt;
   0xffffffffa063e62b &amp;lt;+235&amp;gt;:   mov    0x20(%rax),%rsi&lt;br/&gt;
   0xffffffffa063e62f &amp;lt;+239&amp;gt;:   callq  0xffffffffa03c0370 &amp;lt;ldlm_extent_shift_kms&amp;gt;&lt;br/&gt;
   0xffffffffa063e634 &amp;lt;+244&amp;gt;:   lea    0x78(%r13),%rdx&lt;br/&gt;
   0xffffffffa063e638 &amp;lt;+248&amp;gt;:   mov    %r15,%rdi&lt;br/&gt;
   0xffffffffa063e63b &amp;lt;+251&amp;gt;:   mov    %rax,0x80(%r13)&lt;br/&gt;
   0xffffffffa063e642 &amp;lt;+258&amp;gt;:   mov    $0x2,%ecx&lt;br/&gt;
   0xffffffffa063e647 &amp;lt;+263&amp;gt;:   mov    %r14,%rsi&lt;br/&gt;
   0xffffffffa063e64a &amp;lt;+266&amp;gt;:   callq  0xffffffffa02b3f70 &amp;lt;cl_object_attr_set&amp;gt;&lt;br/&gt;
   0xffffffffa063e64f &amp;lt;+271&amp;gt;:   mov    %r14,%rdi&lt;br/&gt;
   0xffffffffa063e652 &amp;lt;+274&amp;gt;:   callq  0xffffffffa02b3f30 &amp;lt;cl_object_attr_unlock&amp;gt;&lt;br/&gt;
   0xffffffffa063e657 &amp;lt;+279&amp;gt;:   jmpq   0xffffffffa063e5b6 &amp;lt;osc_lock_detach+118&amp;gt;&lt;br/&gt;
   0xffffffffa063e65c &amp;lt;+284&amp;gt;:   mov    $0xffffffffa066c660,%rdi&lt;br/&gt;
   0xffffffffa063e663 &amp;lt;+291&amp;gt;:   mov    $0xffffffffa0661d4d,%rdx&lt;br/&gt;
   0xffffffffa063e66a &amp;lt;+298&amp;gt;:   mov    $0xffffffffa0661ce4,%rsi&lt;br/&gt;
   0xffffffffa063e671 &amp;lt;+305&amp;gt;:   xor    %eax,%eax&lt;br/&gt;
   0xffffffffa063e673 &amp;lt;+307&amp;gt;:   movl   $0x40000,0x2dffb(%rip)        # 0xffffffffa066c678 &amp;lt;__msg_data.72090+24&amp;gt;&lt;br/&gt;
   0xffffffffa063e67d &amp;lt;+317&amp;gt;:   callq  0xffffffffa017a280 &amp;lt;libcfs_debug_msg&amp;gt;&lt;br/&gt;
   0xffffffffa063e682 &amp;lt;+322&amp;gt;:   mov    $0xffffffffa066c660,%rdi&lt;br/&gt;
   0xffffffffa063e689 &amp;lt;+329&amp;gt;:   callq  0xffffffffa016ada0 &amp;lt;lbug_with_loc&amp;gt;&lt;br/&gt;
   0xffffffffa063e68e &amp;lt;+334&amp;gt;:   mov    $0xffffffffa066c6a0,%rdi&lt;br/&gt;
   0xffffffffa063e695 &amp;lt;+341&amp;gt;:   mov    $0xffffffffa0661d39,%rdx&lt;br/&gt;
   0xffffffffa063e69c &amp;lt;+348&amp;gt;:   mov    $0xffffffffa0661ce4,%rsi&lt;br/&gt;
   0xffffffffa063e6a3 &amp;lt;+355&amp;gt;:   xor    %eax,%eax&lt;br/&gt;
   0xffffffffa063e6a5 &amp;lt;+357&amp;gt;:   movl   $0x40000,0x2e009(%rip)        # 0xffffffffa066c6b8 &amp;lt;__msg_data.71857+24&amp;gt;&lt;br/&gt;
   0xffffffffa063e6af &amp;lt;+367&amp;gt;:   callq  0xffffffffa017a280 &amp;lt;libcfs_debug_msg&amp;gt;&lt;br/&gt;
   0xffffffffa063e6b4 &amp;lt;+372&amp;gt;:   mov    $0xffffffffa066c6a0,%rdi&lt;br/&gt;
   0xffffffffa063e6bb &amp;lt;+379&amp;gt;:   callq  0xffffffffa016ada0 &amp;lt;lbug_with_loc&amp;gt;&lt;br/&gt;
End of assembler dump.&lt;/p&gt;</comment>
                            <comment id="76647" author="paf" created="Mon, 10 Feb 2014 20:24:50 +0000"  >&lt;p&gt;KO that goes with the disassembly in Paf&apos;s comment &lt;a href=&quot;https://jira.hpdd.intel.com/browse/LU-4591?focusedCommentId=76645&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-76645&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://jira.hpdd.intel.com/browse/LU-4591?focusedCommentId=76645&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-76645&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="76654" author="jay" created="Mon, 10 Feb 2014 21:13:13 +0000"  >&lt;p&gt;Hi Patrick - do you have a crash dump in hand? If so, can you please show me the state of the corresponding cl_lock and osc_lock?&lt;/p&gt;</comment>
                            <comment id="76676" author="paf" created="Mon, 10 Feb 2014 23:17:07 +0000"  >&lt;p&gt;Jinshan - Not right this second, but I&apos;ll try to get one uploaded for you so you can take a look.  Sorry about not having one in hand.&lt;/p&gt;</comment>
                            <comment id="76678" author="paf" created="Mon, 10 Feb 2014 23:33:40 +0000"  >&lt;p&gt;Dump is here, with KOs and console and messages log:&lt;br/&gt;
ftp.whamcloud.com&lt;/p&gt;

&lt;p&gt;uploads/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4591&quot; title=&quot;Related cl_lock failures on master/2.5&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4591&quot;&gt;&lt;del&gt;LU-4591&lt;/del&gt;&lt;/a&gt;/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4591&quot; title=&quot;Related cl_lock failures on master/2.5&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4591&quot;&gt;&lt;del&gt;LU-4591&lt;/del&gt;&lt;/a&gt;.1402101702.tar.gz&lt;/p&gt;

&lt;p&gt;The node which went down with osc_lock_detach is named c2-0c0s7n3.&lt;/p&gt;

&lt;p&gt;This system was running Cray 2.5.&lt;/p&gt;</comment>
                            <comment id="76681" author="jay" created="Tue, 11 Feb 2014 00:28:03 +0000"  >&lt;p&gt;Hi Patrick - what&apos;s the tip of your branch?&lt;/p&gt;</comment>
                            <comment id="76724" author="paf" created="Tue, 11 Feb 2014 15:16:10 +0000"  >&lt;p&gt;Jinshan - Sadly, we don&apos;t use git, so there&apos;s no answer to that question. &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;

&lt;p&gt;Our 2.5 is 2.5 as released by Intel plus a number of patches we&apos;ve pulled in, but I can replicate the same problems in the same way on master or Intel&apos;s released 2.5.  If it would help, I could do it with one of those code bases - That&apos;s just the dump I had handy.&lt;/p&gt;</comment>
                            <comment id="76756" author="jay" created="Tue, 11 Feb 2014 18:34:44 +0000"  >&lt;p&gt;No worries, Patrick. I will take a look at the dump; sorry, I was interrupted by something else yesterday.&lt;/p&gt;</comment>
                            <comment id="76805" author="jay" created="Wed, 12 Feb 2014 05:48:16 +0000"  >&lt;p&gt;I&apos;ve taken a look at the dump. I suspect this issue is related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3889&quot; title=&quot; LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock-&amp;gt;cll_state &amp;gt;= CLS_QUEUING ) &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3889&quot;&gt;&lt;del&gt;LU-3889&lt;/del&gt;&lt;/a&gt;, the patch # is 8405.&lt;/p&gt;

&lt;p&gt;Patrick - can you please revert that patch and see what&apos;ll happen?&lt;/p&gt;</comment>
                            <comment id="76851" author="paf" created="Wed, 12 Feb 2014 17:07:24 +0000"  >&lt;p&gt;Sure.  I just started testing master with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3889&quot; title=&quot; LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock-&amp;gt;cll_state &amp;gt;= CLS_QUEUING ) &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3889&quot;&gt;&lt;del&gt;LU-3889&lt;/del&gt;&lt;/a&gt; reverted and hit the GPF in cl_lock_delete I noted in my original message.  I&apos;m going to keep testing, but I suspect we&apos;ll see some of the other bugs.&lt;/p&gt;

&lt;p&gt;Would you like the dump from that cl_lock_delete GPF?  (No debugging enabled.)&lt;/p&gt;

&lt;p&gt;Further update:&lt;/p&gt;

&lt;p&gt;I&apos;ve hit these three other bugs from the list above:&lt;br/&gt;
2014-02-12T12:32:04.309770-06:00 c0-0c0s5n2 LustreError: 6228:0:(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice-&amp;gt;cls_lock) ) failed: &lt;br/&gt;
2014-02-12T12:32:04.309813-06:00 c0-0c0s5n2 LustreError: 6228:0:(lovsub_lock.c:103:lovsub_lock_state()) LBUG&lt;/p&gt;

&lt;p&gt;GPF at osc_lock_detach+0x46&lt;/p&gt;

&lt;p&gt;2014-02-12T12:19:01.577674-06:00 c0-0c2s2n3 LustreError: 3031:0:(osc_lock.c:1204:osc_lock_enqueue()) ASSERTION( ols-&amp;gt;ols_state == OLS_NEW ) failed: Impossible state: 6&lt;br/&gt;
2014-02-12T12:19:01.577718-06:00 c0-0c2s2n3 LustreError: 3031:0:(osc_lock.c:1204:osc_lock_enqueue()) LBUG&lt;/p&gt;

&lt;p&gt;I also hit &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4269&quot; title=&quot;ldlm_lock_put()) ASSERTION( (((( lock))-&amp;gt;l_flags &amp;amp; (1ULL &amp;lt;&amp;lt; 50)) != 0) ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4269&quot;&gt;&lt;del&gt;LU-4269&lt;/del&gt;&lt;/a&gt;:&lt;br/&gt;
2014-02-12T11:46:50.233799-06:00 c0-0c2s6n0 LustreError: 16995:0:(ldlm_lock.c:222:ldlm_lock_put()) ASSERTION( (((( lock))-&amp;gt;l_flags &amp;amp; (1ULL &amp;lt;&amp;lt; 50)) != 0) ) failed: &lt;/p&gt;</comment>
                            <comment id="77321" author="lixi" created="Wed, 19 Feb 2014 04:06:52 +0000"  >&lt;p&gt;Hi Patrick,&lt;/p&gt;

&lt;p&gt;Would you please share how you reproduce these bugs? I&apos;ve tried running multiple LTP mmstress processes on Lustre to reproduce them, without success. I ran them with the command &quot;./mmstress -t 1&quot;. Is there anything I am missing?&lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;</comment>
                            <comment id="77364" author="paf" created="Wed, 19 Feb 2014 15:57:29 +0000"  >&lt;p&gt;Li - Sure.  I have never successfully reproduced these on a small system.  My usual system has 70 nodes on it, though I expect something smaller could do it as well.  But when I tried with two and three nodes, I wasn&apos;t able to reproduce the problem either.&lt;/p&gt;

&lt;p&gt;I run - with no command line options - 4 copies of mmstress per node, on ~70 nodes.  All copies of mmstress are executed in the same directory on the Lustre file system.  Within a half hour, on master or 2.5 or 2.4.1, I&apos;ve hit about 15-20 of these problems.&lt;/p&gt;</comment>
                            <comment id="78083" author="bobijam" created="Fri, 28 Feb 2014 10:18:55 +0000"  >&lt;p&gt;Hi Patrick,&lt;/p&gt;

&lt;p&gt;Would you mind giving &lt;a href=&quot;http://review.whamcloud.com/9433&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/9433&lt;/a&gt; a try? It&apos;s a rewrite of the &quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3889&quot; title=&quot; LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock-&amp;gt;cll_state &amp;gt;= CLS_QUEUING ) &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3889&quot;&gt;&lt;del&gt;LU-3889&lt;/del&gt;&lt;/a&gt; osc: Allow lock to be canceled at ENQ time&quot; patch.&lt;/p&gt;</comment>
                            <comment id="78127" author="paf" created="Fri, 28 Feb 2014 20:22:13 +0000"  >&lt;p&gt;Zhenyu - I tried this patch on master just now (master from today + your patch) with the mmstress reproducer.&lt;/p&gt;

&lt;p&gt;I hit essentially all of the bugs from above, and I suspect if I kept running, I would see the others.&lt;/p&gt;

&lt;p&gt;Here&apos;s the list of those I hit for sure:&lt;br/&gt;
LustreError: 4971:0:(lov_lock.c:216:lov_sublock_lock()) ASSERTION( cl_lock_is_mutexed(child) ) failed: &lt;/p&gt;

&lt;p&gt;GPF in osc_lock_detach&lt;/p&gt;

&lt;p&gt;No exit stuck in cl_locks_prune&lt;/p&gt;

&lt;p&gt;LustreError: 12688:0:(osc_lock.c:1208:osc_lock_enqueue()) ASSERTION( ols-&amp;gt;ols_state == OLS_NEW ) failed: Impossible state: 6&lt;/p&gt;

&lt;p&gt;GPF in cl_lock_delete&lt;/p&gt;</comment>
                            <comment id="78128" author="paf" created="Fri, 28 Feb 2014 20:24:57 +0000"  >&lt;p&gt;cl_lock debugging patch&lt;/p&gt;</comment>
                            <comment id="78129" author="paf" created="Fri, 28 Feb 2014 20:27:48 +0000"  >&lt;p&gt;Zhenyu -&lt;/p&gt;

&lt;p&gt;I just attached a debug patch which breaks out the cl_lock_trace calls under other debug flags.  In the past, I&apos;ve been able to hit some of these bugs with the cllock and clfree debug flags that patch adds enabled.  (I can&apos;t hit them with any of the heavier debug flags, like dlmtrace or rpctrace, enabled.)&lt;/p&gt;

&lt;p&gt;Would you be interested in a dump and logs of one of these crashes with your patch and that debug patch?&lt;/p&gt;</comment>
                            <comment id="78147" author="bobijam" created="Sat, 1 Mar 2014 02:01:10 +0000"  >&lt;p&gt;Yes, please. Do you want me to combine these two patches into a single patch so that you can get a built image easily?&lt;/p&gt;</comment>
                            <comment id="78149" author="bobijam" created="Sat, 1 Mar 2014 03:03:36 +0000"  >&lt;p&gt;FYI, I&apos;ve pushed &lt;a href=&quot;http://review.whamcloud.com/9441&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/9441&lt;/a&gt;, which is the debug patch you provided, based on my patch.&lt;/p&gt;</comment>
                            <comment id="78221" author="bobijam" created="Mon, 3 Mar 2014 14:59:29 +0000"  >&lt;p&gt;Patrick,&lt;/p&gt;

&lt;p&gt;FYI, the errors you reported on Mar 1st are exactly the issues reported in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4692&quot; title=&quot;(osc_lock.c:1204:osc_lock_enqueue()) ASSERTION( ols-&amp;gt;ols_state == OLS_NEW ) failed: Impossible state: 6&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4692&quot;&gt;&lt;del&gt;LU-4692&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4693&quot; title=&quot;(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice-&amp;gt;cls_lock) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4693&quot;&gt;&lt;del&gt;LU-4693&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="78224" author="paf" created="Mon, 3 Mar 2014 15:17:49 +0000"  >&lt;p&gt;Thanks for pointing those out - Good that someone else has seen them.&lt;/p&gt;

&lt;p&gt;I should be able to get you a dump with debugging (I hope) later today.&lt;/p&gt;</comment>
                            <comment id="78258" author="paf" created="Mon, 3 Mar 2014 19:10:43 +0000"  >&lt;p&gt;Unfortunately, I&apos;ve been unable to hit the bugs with debugging enabled.  I&apos;m trying with only the clfree debugging option on.&lt;/p&gt;

&lt;p&gt;I did see something that may not be related...  I have a number of threads that are failing to exit, stuck here:&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff811f8ad4&amp;gt;&amp;#93;&lt;/span&gt; call_rwsem_down_read_failed+0x14/0x30&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810468db&amp;gt;&amp;#93;&lt;/span&gt; exit_mm+0x3b/0x160&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8104853e&amp;gt;&amp;#93;&lt;/span&gt; do_exit+0x18e/0x980&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81048d78&amp;gt;&amp;#93;&lt;/span&gt; do_group_exit+0x48/0xc0&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81059d93&amp;gt;&amp;#93;&lt;/span&gt; get_signal_to_deliver+0x243/0x480&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81002330&amp;gt;&amp;#93;&lt;/span&gt; do_notify_resume+0xe0/0x7f0&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81367d53&amp;gt;&amp;#93;&lt;/span&gt; retint_signal+0x46/0x83&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;00000000200007d1&amp;gt;&amp;#93;&lt;/span&gt; 0x200007d1&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffffffffff&amp;gt;&amp;#93;&lt;/span&gt; 0xffffffffffffffff&lt;/p&gt;


&lt;p&gt;I&apos;ve seen this before, but only, I think, on master when running these tests.  It seems to happen when the tests are run for a long time.  (Normally they aren&apos;t run very long because nodes are dropping.)&lt;/p&gt;</comment>
                            <comment id="78261" author="paf" created="Mon, 3 Mar 2014 19:29:35 +0000"  >&lt;p&gt;With debugging reduced to just the clfree flag (cl_lock_tracing only in cl_free), I started hitting the various bugs.&lt;/p&gt;

&lt;p&gt;I grabbed three dumps.&lt;br/&gt;
Node c0-0c1s2n0: GPF in osc_lock_detach&lt;br/&gt;
Node c0-0c1s1n0: No exit stuck in cl_locks_prune&lt;br/&gt;
Node c0-0c0s5n2: ASSERTION( ols-&amp;gt;ols_state == OLS_NEW ) failed: Impossible state: 6&lt;/p&gt;

&lt;p&gt;Dumps are uploading, will be here in about 5 minutes:&lt;br/&gt;
ftp.whamcloud.com:/uploads/LU-4591/LU-4591.1403031327.tar.gz&lt;/p&gt;

&lt;p&gt;I&apos;ll go back to testing with the cllock and clfree flags on to see if I can hit the bug.&lt;/p&gt;</comment>
                            <comment id="78262" author="paf" created="Mon, 3 Mar 2014 19:34:16 +0000"  >&lt;p&gt;I suspect if you need better logs, we&apos;ll have to adjust the debug further.&lt;/p&gt;

&lt;p&gt;Before we discovered that removing one of the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; patches hides the problem for us, I did some analysis of which calls to cl_lock_trace were most common when running this test, hoping to identify which ones we could remove or make lighter.&lt;/p&gt;

&lt;p&gt;Here&apos;s that data.  The first number is the # of calls to that particular cl_lock_trace in the sample I gathered:&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;Note: Data was gathered on Cray 2.5, rather than master.  Line numbers are probably a bit different.&amp;#93;&lt;/span&gt;&lt;br/&gt;
162054 (cl_lock.c:150:cl_lock_trace0()) releasing ref: at cl_lock_put():308&lt;br/&gt;
160954 (cl_lock.c:150:cl_lock_trace0()) put mutex: at cl_lock_mutex_put():754&lt;br/&gt;
160913 (cl_lock.c:150:cl_lock_trace0()) got mutex: at cl_lock_mutex_tail():661&lt;br/&gt;
67146 (cl_lock.c:150:cl_lock_trace0()) acquiring trusted at cl_lock_get_trust():348&lt;br/&gt;
65822 (cl_lock.c:150:cl_lock_trace0()) acquiring ref: at cl_lock_get():332&lt;br/&gt;
37243 (cl_lock.c:150:cl_lock_trace0()) changing state: at cl_lock_state_set():1059&lt;br/&gt;
37243 (cl_lock.c:1058:cl_lock_state_set()&lt;br/&gt;
34207 (cl_lock.c:150:cl_lock_trace0()) delete lock: at cl_lock_delete():1795&lt;br/&gt;
34195 (cl_lock.c:150:cl_lock_trace0()) cancel lock: at cl_lock_cancel():1853&lt;br/&gt;
29408 (cl_lock.c:150:cl_lock_trace0()) changing holds: at cl_lock_hold_mod():868&lt;br/&gt;
29407 (cl_lock.c:867:cl_lock_hold_mod()&lt;br/&gt;
21250 (cl_lock.c:826:cl_lock_delete0()&lt;br/&gt;
17038 (cl_lock.c:802:cl_lock_cancel0()&lt;br/&gt;
12549 (cl_lock.c:150:cl_lock_trace0()) enclosure lock: at cl_lock_enclosure():1701&lt;br/&gt;
7672 (cl_lock.c:150:cl_lock_trace0()) free lock: at cl_lock_free():269&lt;br/&gt;
7491 (cl_lock.c:150:cl_lock_trace0()) enqueue lock: at cl_enqueue_try():1201&lt;br/&gt;
7272 (cl_lock.c:150:cl_lock_trace0()) disclosure lock: at cl_lock_disclosure():1744&lt;br/&gt;
6172 (cl_lock.c:888:cl_lock_used_mod()&lt;br/&gt;
6172 (cl_lock.c:150:cl_lock_trace0()) changing users: at cl_lock_used_mod():889&lt;br/&gt;
3022 (cl_lock.c:150:cl_lock_trace0()) unuse lock: at cl_unuse_try():1372&lt;br/&gt;
1882 (cl_lock.c:150:cl_lock_trace0()) alloc lock: at cl_lock_alloc():410&lt;br/&gt;
589 (cl_lock.c:150:cl_lock_trace0()) use lock: at cl_use_try():1108&lt;br/&gt;
386 (cl_lock.c:150:cl_lock_trace0()) enqueue failed: at cl_lock_request():2180&lt;br/&gt;
4 (cl_lock.c:1800:cl_lock_delete()&lt;br/&gt;
3 (cl_lock.c:1858:cl_lock_cancel()&lt;/p&gt;

&lt;p&gt;Looking at this list, are there any of the top 10 or so we could do without, or could reduce significantly? I&apos;m concerned that reducing the amount of data printed by cl_lock_trace won&apos;t really change how heavy it is - I would think most of the cost is in printing the message (though I could be wrong).&lt;/p&gt;
&lt;hr /&gt;

&lt;p&gt;I was also considering trying a modified version of cl_lock_trace which prints less information:&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;New version of cl_lock_trace I&apos;m going to try:&lt;/p&gt;

&lt;p&gt;static void cl_lock_trace_reduced0(int level, const struct lu_env *env,&lt;br/&gt;
const char *prefix, const struct cl_lock *lock,&lt;br/&gt;
const char *func, const int line)&lt;br/&gt;
{&lt;br/&gt;
CDEBUG(level, &quot;%s: %p@(%d %d %d %d %lx)&quot;&lt;br/&gt;
&quot; at %s():%d\n&quot;,&lt;br/&gt;
prefix, lock, cfs_atomic_read(&amp;amp;lock-&amp;gt;cll_ref),&lt;br/&gt;
lock-&amp;gt;cll_state, lock-&amp;gt;cll_holds,&lt;br/&gt;
lock-&amp;gt;cll_users, lock-&amp;gt;cll_flags,&lt;br/&gt;
func, line);&lt;br/&gt;
}&lt;br/&gt;
#define cl_lock_trace_reduced(level, env, prefix, lock) \&lt;br/&gt;
cl_lock_trace_reduced0(level, env, prefix, lock, __FUNCTION__, __LINE__)&lt;/p&gt;
&lt;hr /&gt;

&lt;p&gt;So, are there any of the most common cl_lock_trace calls you don&apos;t think we need?&lt;br/&gt;
Does that modified version of cl_lock_trace have enough info in it?  Is there more I could take out?&lt;/p&gt;

&lt;p&gt;Once we&apos;ve figured out how best to reduce the debug levels, I can test accordingly.&lt;/p&gt;</comment>
                            <comment id="78566" author="jay" created="Thu, 6 Mar 2014 06:50:57 +0000"  >&lt;p&gt;Hi Patrick,&lt;/p&gt;

&lt;p&gt;Will you please try this patch: &lt;a href=&quot;http://review.whamcloud.com/9524&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/9524&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and see if it will help?&lt;/p&gt;

&lt;p&gt;Jinshan&lt;/p&gt;</comment>
                            <comment id="78628" author="paf" created="Thu, 6 Mar 2014 20:00:59 +0000"  >&lt;p&gt;Sure, I&apos;ll test as soon as I can.  Due to some poor planning on my part, that may not be until next week.  I&apos;ll get results sooner if I can.&lt;/p&gt;</comment>
                            <comment id="79143" author="simmonsja" created="Wed, 12 Mar 2014 16:14:52 +0000"  >&lt;p&gt;Please cherry pick this to b2_5&lt;/p&gt;</comment>
                            <comment id="79158" author="pjones" created="Wed, 12 Mar 2014 17:55:09 +0000"  >&lt;p&gt;James&lt;/p&gt;

&lt;p&gt;This will certainly be a candidate to back port once we have confirmation that the fix works&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="79163" author="paf" created="Wed, 12 Mar 2014 18:35:26 +0000"  >&lt;p&gt;Jinshan - I&apos;m sorry for the delay here (vacation, then system problems), but this patch doesn&apos;t fix the problem.&lt;/p&gt;

&lt;p&gt;I ran with this patch + master from last week.&lt;/p&gt;

&lt;p&gt;I wasn&apos;t able to hit the bugs with my cl_lock debugging enabled, unfortunately.  I can provide node dumps from one of these nodes if desired.&lt;/p&gt;

&lt;p&gt;The bug set observed has changed somewhat...&lt;/p&gt;

&lt;p&gt;Old bugs we&apos;re still seeing:&lt;br/&gt;
2014-03-12T11:45:47.984843-05:00 c0-0c1s1n1 LustreError: 8494:0:(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice-&amp;gt;cls_lock) ) failed&lt;/p&gt;

&lt;p&gt;GPF in osc_lock_detach:&lt;br/&gt;
2014-03-12T11:41:56.661772-05:00 c0-0c0s5n2 RIP: 0010:&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0860866&amp;gt;&amp;#93;&lt;/span&gt;  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0860866&amp;gt;&amp;#93;&lt;/span&gt; osc_lock_detach+0x46/0x180 &lt;span class=&quot;error&quot;&gt;&amp;#91;osc&amp;#93;&lt;/span&gt;&lt;/p&gt;


&lt;p&gt;2014-03-12T11:37:21.627399-05:00 c0-0c2s1n2 LustreError: 20950:0:(osc_lock.c:1208:osc_lock_enqueue()) ASSERTION( ols-&amp;gt;ols_state == OLS_NEW ) failed: Impossible state: 6&lt;/p&gt;

&lt;p&gt;GPF in cl_lock_delete:&lt;br/&gt;
2014-03-12T11:13:41.270193-05:00 c0-0c2s2n1 RIP: 0010:&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c1550&amp;gt;&amp;#93;&lt;/span&gt;  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c1550&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_delete0+0x190/0x1f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;/p&gt;


&lt;p&gt;No exit stuck in cl_locks_prune.&lt;/p&gt;

&lt;p&gt;----------------------&lt;/p&gt;

&lt;p&gt;Now for new things:&lt;/p&gt;


&lt;p&gt;I&apos;m seeing this error and related messages fairly often in the logs:&lt;br/&gt;
2014-03-12T11:44:04.990085-05:00 c0-0c1s2n0 LustreError: 8529:0:(osc_lock.c:830:osc_ldlm_completion_ast()) } lock@ffff8801e5189078&lt;br/&gt;
2014-03-12T11:44:04.990101-05:00 c0-0c1s2n0 LustreError: 8529:0:(osc_lock.c:830:osc_ldlm_completion_ast()) dlmlock returned -5&lt;/p&gt;


&lt;p&gt;This one is new, and observed several times:&lt;br/&gt;
2014-03-12T11:47:12.039431-05:00 c0-0c2s3n1 BUG: unable to handle kernel paging request at fffffffffffffff8&lt;br/&gt;
2014-03-12T11:47:12.039481-05:00 c0-0c2s3n1 IP: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02bfda8&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_cancel0+0x58/0x150 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-03-12T11:47:12.039499-05:00 c0-0c2s3n1 PGD 15ff067 PUD 1600067 PMD 0 &lt;br/&gt;
2014-03-12T11:47:12.039507-05:00 c0-0c2s3n1 Oops: 0000 &amp;#91;#1&amp;#93; SMP &lt;br/&gt;
2014-03-12T11:47:12.039514-05:00 c0-0c2s3n1 CPU 5 &lt;br/&gt;
2014-03-12T11:47:12.079400-05:00 c0-0c2s3n1 Modules linked in: xpmem dvspn(P) dvsof(P) dvsutil(P) dvsipc(P) dvsipc_lnet(P) dvsproc(P) nic_compat cmsr osc mgc lustre lov mdc fid lmv fld kgnilnd ptlrpc obdclass lnet sha1_generic md5 libcfs ib_core ipip krsip kdreg gpcd_gem ipogif_gem kgni_gem hwerr(P) rca hss_os(P) heartbeat simplex(P) ghal_gem cgm craytrace&lt;br/&gt;
2014-03-12T11:47:12.079425-05:00 c0-0c2s3n1 Pid: 3701, comm: ldlm_bl_08 Tainted: P            3.0.93-0.8.2_1.0000.7848-cray_gem_c #1  &lt;br/&gt;
2014-03-12T11:47:12.079439-05:00 c0-0c2s3n1 RIP: 0010:&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02bfda8&amp;gt;&amp;#93;&lt;/span&gt;  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02bfda8&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_cancel0+0x58/0x150 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-03-12T11:47:12.079453-05:00 c0-0c2s3n1 RSP: 0018:ffff88040f137d70  EFLAGS: 00010286&lt;br/&gt;
2014-03-12T11:47:12.079459-05:00 c0-0c2s3n1 RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: ffff88040f137cf0&lt;br/&gt;
2014-03-12T11:47:12.112232-05:00 c0-0c2s3n1 RDX: ffff88040f137cf0 RSI: ffff88040fde9c08 RDI: ffffffffa0898f20&lt;br/&gt;
2014-03-12T11:47:12.112256-05:00 c0-0c2s3n1 RBP: ffff88040f137d90 R08: 0000000000000020 R09: ffffffff81375ed8&lt;br/&gt;
2014-03-12T11:47:12.112268-05:00 c0-0c2s3n1 R10: 0000000000000000 R11: 0000000000000009 R12: ffff8804016c5590&lt;br/&gt;
2014-03-12T11:47:12.112290-05:00 c0-0c2s3n1 R13: ffff88040db84b18 R14: ffff8804016c5588 R15: 0000000000000000&lt;br/&gt;
2014-03-12T11:47:12.112299-05:00 c0-0c2s3n1 FS:  0000000040176880(0000) GS:ffff88021fd40000(0000) knlGS:0000000000000000&lt;br/&gt;
2014-03-12T11:47:12.138736-05:00 c0-0c2s3n1 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b&lt;br/&gt;
2014-03-12T11:47:12.138763-05:00 c0-0c2s3n1 CR2: fffffffffffffff8 CR3: 00000000015fd000 CR4: 00000000000407e0&lt;br/&gt;
2014-03-12T11:47:12.138887-05:00 c0-0c2s3n1 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000&lt;br/&gt;
2014-03-12T11:47:12.138911-05:00 c0-0c2s3n1 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400&lt;br/&gt;
2014-03-12T11:47:12.138930-05:00 c0-0c2s3n1 Process ldlm_bl_08 (pid: 3701, threadinfo ffff88040f136000, task ffff88040db8d040)&lt;br/&gt;
2014-03-12T11:47:12.138962-05:00 c0-0c2s3n1 Stack:&lt;br/&gt;
2014-03-12T11:47:12.168889-05:00 c0-0c2s3n1 ffff88040db84b18 ffff8804016c5588 ffff88040db84b18 ffff88040db84b18&lt;br/&gt;
2014-03-12T11:47:12.168907-05:00 c0-0c2s3n1 ffff88040f137db0 ffffffffa02c0afb ffff88040fde9c08 ffff8804016c5588&lt;br/&gt;
2014-03-12T11:47:12.168921-05:00 c0-0c2s3n1 ffff88040f137e10 ffffffffa0862cec ffff88040fde9c08 ffff88040f099b40&lt;br/&gt;
2014-03-12T11:47:12.168930-05:00 c0-0c2s3n1 Call Trace:&lt;br/&gt;
2014-03-12T11:47:12.168976-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c0afb&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_cancel+0x13b/0x140 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-03-12T11:47:12.168988-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0862cec&amp;gt;&amp;#93;&lt;/span&gt; osc_ldlm_blocking_ast+0x20c/0x330 &lt;span class=&quot;error&quot;&gt;&amp;#91;osc&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-03-12T11:47:12.198730-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa03cd354&amp;gt;&amp;#93;&lt;/span&gt; ldlm_handle_bl_callback+0xd4/0x430 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-03-12T11:47:12.198752-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa03cd8ac&amp;gt;&amp;#93;&lt;/span&gt; ldlm_bl_thread_main+0x1fc/0x420 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-03-12T11:47:12.198779-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8106637e&amp;gt;&amp;#93;&lt;/span&gt; kthread+0x9e/0xb0&lt;br/&gt;
2014-03-12T11:47:12.198791-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81369ff4&amp;gt;&amp;#93;&lt;/span&gt; kernel_thread_helper+0x4/0x10&lt;br/&gt;
2014-03-12T11:47:12.198811-05:00 c0-0c2s3n1 Code: 00 49 8b 84 24 b8 00 00 00 a8 01 75 40 48 83 c8 01 49 89 84 24 b8 00 00 00 49 8b 44 24 10 49 83 c4 08 49 39 c4 48 8d 58 e8 74 22 &lt;br/&gt;
2014-03-12T11:47:12.228894-05:00 c0-0c2s3n1 8b 43 10 48 8b 40 30 48 85 c0 74 08 48 89 de 4c 89 ef ff d0 &lt;br/&gt;
2014-03-12T11:47:12.254426-05:00 c0-0c2s3n1 RIP  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02bfda8&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_cancel0+0x58/0x150 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-03-12T11:47:12.254442-05:00 c0-0c2s3n1 RSP &amp;lt;ffff88040f137d70&amp;gt;&lt;br/&gt;
2014-03-12T11:47:12.254450-05:00 c0-0c2s3n1 CR2: fffffffffffffff8&lt;br/&gt;
2014-03-12T11:47:12.254457-05:00 c0-0c2s3n1 ---[ end trace ac1164d8e2c40df9 ]---&lt;br/&gt;
2014-03-12T11:47:12.254509-05:00 c0-0c2s3n1 Kernel panic - not syncing: Fatal exception&lt;br/&gt;
2014-03-12T11:47:12.280053-05:00 c0-0c2s3n1 Pid: 3701, comm: ldlm_bl_08 Tainted: P      D     3.0.93-0.8.2_1.0000.7848-cray_gem_c #1&lt;br/&gt;
2014-03-12T11:47:12.280088-05:00 c0-0c2s3n1 Call Trace:&lt;br/&gt;
2014-03-12T11:47:12.280099-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810065b1&amp;gt;&amp;#93;&lt;/span&gt; try_stack_unwind+0x161/0x1a0&lt;br/&gt;
2014-03-12T11:47:12.280109-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81004dd9&amp;gt;&amp;#93;&lt;/span&gt; dump_trace+0x89/0x440&lt;br/&gt;
2014-03-12T11:47:12.280144-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100601c&amp;gt;&amp;#93;&lt;/span&gt; show_trace_log_lvl+0x5c/0x80&lt;br/&gt;
2014-03-12T11:47:12.280151-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81006055&amp;gt;&amp;#93;&lt;/span&gt; show_trace+0x15/0x20&lt;br/&gt;
2014-03-12T11:47:12.280170-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81365182&amp;gt;&amp;#93;&lt;/span&gt; dump_stack+0x79/0x84&lt;br/&gt;
2014-03-12T11:47:12.280186-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81365221&amp;gt;&amp;#93;&lt;/span&gt; panic+0x94/0x1da&lt;br/&gt;
2014-03-12T11:47:12.280206-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81005e38&amp;gt;&amp;#93;&lt;/span&gt; oops_end+0xa8/0xe0&lt;br/&gt;
2014-03-12T11:47:12.305555-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81026509&amp;gt;&amp;#93;&lt;/span&gt; no_context+0xf9/0x260&lt;br/&gt;
2014-03-12T11:47:12.305579-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810267d5&amp;gt;&amp;#93;&lt;/span&gt; __bad_area_nosemaphore+0x165/0x1f0&lt;br/&gt;
2014-03-12T11:47:12.305613-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81026873&amp;gt;&amp;#93;&lt;/span&gt; bad_area_nosemaphore+0x13/0x20&lt;br/&gt;
2014-03-12T11:47:12.305626-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81026e1e&amp;gt;&amp;#93;&lt;/span&gt; do_page_fault+0x2fe/0x420&lt;br/&gt;
2014-03-12T11:47:12.331149-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8136884f&amp;gt;&amp;#93;&lt;/span&gt; page_fault+0x1f/0x30&lt;br/&gt;
2014-03-12T11:47:12.331174-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02bfda8&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_cancel0+0x58/0x150 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-03-12T11:47:12.331244-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c0afb&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_cancel+0x13b/0x140 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-03-12T11:47:12.331262-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0862cec&amp;gt;&amp;#93;&lt;/span&gt; osc_ldlm_blocking_ast+0x20c/0x330 &lt;span class=&quot;error&quot;&gt;&amp;#91;osc&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-03-12T11:47:12.331290-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa03cd354&amp;gt;&amp;#93;&lt;/span&gt; ldlm_handle_bl_callback+0xd4/0x430 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-03-12T11:47:12.331301-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa03cd8ac&amp;gt;&amp;#93;&lt;/span&gt; ldlm_bl_thread_main+0x1fc/0x420 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
2014-03-12T11:47:12.356710-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8106637e&amp;gt;&amp;#93;&lt;/span&gt; kthread+0x9e/0xb0&lt;br/&gt;
2014-03-12T11:47:12.356726-05:00 c0-0c2s3n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81369ff4&amp;gt;&amp;#93;&lt;/span&gt; kernel_thread_helper+0x4/0x10&lt;/p&gt;


&lt;p&gt;This is also (sort of) new - I also saw it with one of Bobijam&apos;s patches.&lt;br/&gt;
It&apos;s a failure to exit, stuck here:&lt;br/&gt;
2014-03-12T13:10:59.311520-05:00 c0-0c1s5n3 &amp;lt;node_health:5.1&amp;gt; APID:4805011 (Application_Exited_Check) STACK: call_rwsem_down_read_failed+0x14/0x30; exit_mm+0x3b/0x160; do_exit+0x18e/0x980; do_group_exit+0x48/0xc0; get_signal_to_deliver+0x243/0x480; do_notify_resume+0xe0/0x7f0; retint_signal+0x46/0x83; 0x200007d1; 0xffffffffffffffff;&lt;/p&gt;

&lt;p&gt;It&apos;s possible this isn&apos;t related to the patches, but I haven&apos;t seen it except in testing with fairly recent master and one of these patches.  (I haven&apos;t carefully tested recent master by itself.)&lt;/p&gt;

&lt;p&gt;I also saw several dropped connections to some of our OSTs:&lt;br/&gt;
2014-03-12T11:02:57.205561-05:00 c0-0c1s4n0 Lustre: snxb1-OST0004-osc-ffff88020e481c00: Connection to snxb1-OST0004 (at 10.10.100.2@o2ib) was lost; in progress operations using this service will wait for recovery to complete&lt;/p&gt;

&lt;p&gt;That seems likely to be a problem with our system rather than the Lustre client, but I haven&apos;t seen it before on this system, so I thought I&apos;d mention it.&lt;/p&gt;</comment>
                            <comment id="79166" author="paf" created="Wed, 12 Mar 2014 18:50:19 +0000"  >&lt;p&gt;A quick note - the version of master I used does have the patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4300&quot; title=&quot;ptlrpcd threads deadlocked in cl_lock_mutex_get&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4300&quot;&gt;&lt;del&gt;LU-4300&lt;/del&gt;&lt;/a&gt;, so that stuck in cl_locks_prune issue above is NOT &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4300&quot; title=&quot;ptlrpcd threads deadlocked in cl_lock_mutex_get&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4300&quot;&gt;&lt;del&gt;LU-4300&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="79168" author="paf" created="Wed, 12 Mar 2014 18:55:37 +0000"  >&lt;p&gt;One further note: I haven&apos;t examined all of the patches offered by Jinshan and Bobijam well enough to be sure, but are they all conflicting?  These bugs haven&apos;t been fixed by a number of different patches, and I&apos;m starting to wonder if there isn&apos;t more than one fix needed - I know that&apos;s much less likely in general, but I thought I&apos;d suggest it as something to consider.&lt;/p&gt;

&lt;p&gt;There&apos;s also a pair of patches that were suggested at one point by Vitaly F. @ Xyratex.  They were not successful, but I&apos;ll attach them for reference.&lt;/p&gt;

&lt;p&gt;One is a patch to avoid recursive disclosures, the other is a tweak to usage of hold/get.  Again, these did not resolve the issue, I&apos;m just attaching them for reference.&lt;/p&gt;</comment>
                            <comment id="79169" author="paf" created="Wed, 12 Mar 2014 18:56:24 +0000"  >&lt;p&gt;Attempted patch from Vitaly&lt;/p&gt;</comment>
                            <comment id="79170" author="jay" created="Wed, 12 Mar 2014 18:56:40 +0000"  >&lt;blockquote&gt;
&lt;p&gt;2014-03-12T11:44:04.990085-05:00 c0-0c1s2n0 LustreError: 8529:0:(osc_lock.c:830:osc_ldlm_completion_ast()) } lock@ffff8801e5189078&lt;br/&gt;
2014-03-12T11:44:04.990101-05:00 c0-0c1s2n0 LustreError: 8529:0:(osc_lock.c:830:osc_ldlm_completion_ast()) dlmlock returned -5&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I saw this. Was the client being evicted at that time?&lt;/p&gt;</comment>
                            <comment id="79171" author="paf" created="Wed, 12 Mar 2014 18:56:47 +0000"  >&lt;p&gt;Other attempted patch from Xyratex&lt;/p&gt;</comment>
                            <comment id="79172" author="jay" created="Wed, 12 Mar 2014 18:57:02 +0000"  >&lt;p&gt;Please share the core dump with us. Thanks.&lt;/p&gt;</comment>
                            <comment id="79173" author="paf" created="Wed, 12 Mar 2014 18:59:08 +0000"  >&lt;p&gt;Ah, yes Jinshan - It was.  Sorry, the messages were a bit garbled and I missed that.&lt;/p&gt;

&lt;p&gt;So that and the lost connection to the OST were a client eviction by the OST:&lt;br/&gt;
2014-03-12T11:44:04.989967-05:00 c0-0c1s2n0 LustreError: 167-0: snxb1-OST0004-osc-ffff8801ec7ae400: This client was evicted by snxb1-OST0004; in progress operations using this service will fail.&lt;/p&gt;</comment>
                            <comment id="79176" author="paf" created="Wed, 12 Mar 2014 19:01:52 +0000"  >&lt;p&gt;Jinshan - Is there a particular failure you&apos;d like a dump for?&lt;/p&gt;</comment>
                            <comment id="79179" author="jay" created="Wed, 12 Mar 2014 19:05:11 +0000"  >&lt;p&gt;Just provide me the latest failure with my patch applied, please. I will take a look.&lt;/p&gt;

&lt;p&gt;BTW, have you ever seen the issue on your production system?&lt;/p&gt;</comment>
                            <comment id="79180" author="paf" created="Wed, 12 Mar 2014 19:15:42 +0000"  >&lt;p&gt;When you say the latest failure, do you just mean this one?&lt;br/&gt;
2014-03-12T11:47:12.039431-05:00 c0-0c2s3n1 BUG: unable to handle kernel paging request at fffffffffffffff8&lt;/p&gt;

&lt;p&gt;All of the failures I listed were encountered during testing with your patch applied.&lt;/p&gt;


&lt;p&gt;And yes, we&apos;ve seen several of these on our production systems.  I can&apos;t/shouldn&apos;t share the details, but until we found the workaround of removing the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; lov patch, this was a serious problem for us.&lt;/p&gt;

&lt;p&gt;Note that &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4692&quot; title=&quot;(osc_lock.c:1204:osc_lock_enqueue()) ASSERTION( ols-&amp;gt;ols_state == OLS_NEW ) failed: Impossible state: 6&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4692&quot;&gt;&lt;del&gt;LU-4692&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4693&quot; title=&quot;(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice-&amp;gt;cls_lock) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4693&quot;&gt;&lt;del&gt;LU-4693&lt;/del&gt;&lt;/a&gt; are non-Cray reports of two of these same bugs, and then there&apos;s also the older &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4558&quot; title=&quot;Crash in cl_lock_put on racer&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4558&quot;&gt;&lt;del&gt;LU-4558&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="79182" author="jay" created="Wed, 12 Mar 2014 19:35:31 +0000"  >&lt;blockquote&gt;
&lt;p&gt;2014-03-12T11:47:12.039431-05:00 c0-0c2s3n1 BUG: unable to handle kernel paging request at fffffffffffffff8&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Yes, let&apos;s start with this one for now&lt;/p&gt;</comment>
                            <comment id="79197" author="paf" created="Wed, 12 Mar 2014 22:38:35 +0000"  >&lt;p&gt;Jinshan,&lt;/p&gt;

&lt;p&gt;Unfortunately, I had to give up the test system before I could get those dumps...&lt;br/&gt;
I&apos;ve got it again, and testing hasn&apos;t turned up that paging request panic again.  (I&apos;m sure it&apos;s still there, but I haven&apos;t seen it on this run...)&lt;/p&gt;

&lt;p&gt;So, instead, I&apos;ve got these six dumps for you, five of them are previously seen bugs and the sixth is a new one:&lt;br/&gt;
c0-0c2s5n3: GPF in cl_lock_delete (Also had a thread stuck in cl_locks_prune)&lt;br/&gt;
c0-0c2s7n1: lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice-&amp;gt;cls_lock) ) failed&lt;br/&gt;
c0-0c2s3n0: GPF in cl_lock_put&lt;br/&gt;
c0-0c1s6n2: GPF in osc_lock_detach&lt;/p&gt;

&lt;p&gt;The first four are kernel panics. This one is a dump of a node with a thread stuck in cl_locks_prune (which was NMI&apos;ed while running, rather than having a kernel panic):&lt;br/&gt;
c0-0c1s3n2&lt;br/&gt;
(Pid: 14569, mmstress)&lt;/p&gt;

&lt;p&gt;And this is the new bug:&lt;br/&gt;
c0-0c0s4n2 LustreError: 19165:0:(cl_lock.c:313:cl_lock_put()) ASSERTION( list_empty(&amp;amp;lock-&amp;gt;cll_linkage) ) failed:&lt;/p&gt;

&lt;p&gt;Dumps will be in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4591&quot; title=&quot;Related cl_lock failures on master/2.5&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4591&quot;&gt;&lt;del&gt;LU-4591&lt;/del&gt;&lt;/a&gt;-140312.tar.gz at ftp.whamcloud.com/uploads/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4591&quot; title=&quot;Related cl_lock failures on master/2.5&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4591&quot;&gt;&lt;del&gt;LU-4591&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The console log is also in there.  There are many other nodes (10+) which went down that I didn&apos;t give dumps for because they were duplicates of the ones I picked. (You&apos;ll see their stack traces in the console log.)&lt;/p&gt;

&lt;p&gt;Upload of the dumps should be done in ~10-15 minutes.&lt;/p&gt;

&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;Patrick&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="79208" author="jay" created="Thu, 13 Mar 2014 02:58:59 +0000"  >&lt;p&gt;Hi Patrick,&lt;/p&gt;

&lt;p&gt;I will take a look at the dump, please don&apos;t forget to copy lustre modules &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/wink.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;

&lt;p&gt;Anyway, you have provided an important clue about the error in completion AST. If this error happened in every failing case, we can definitely learn something from it.&lt;/p&gt;

&lt;p&gt;Jinshan&lt;/p&gt;</comment>
                            <comment id="79265" author="paf" created="Thu, 13 Mar 2014 18:01:47 +0000"  >&lt;p&gt;Jinshan, Bobijam,&lt;/p&gt;

&lt;p&gt;Just a general question.  Do you think breaking out the cl_lock_trace calls into their own debug flag (rather than being part of dlmtrace) is a good thing in general?  My patch to do it is just a quick hack, but I&apos;m wondering if a cleaned up version of it - without special treatment for cl_free - is something we might want to land to master?&lt;/p&gt;

&lt;p&gt;It&apos;s been useful for me having it separated, but only because enabling full dlmtrace always prevents me from seeing these cl_lock bugs (and some of the earlier ones as well).&lt;/p&gt;

&lt;p&gt;If you think so, I&apos;ll submit a patch for it.&lt;/p&gt;</comment>
                            <comment id="79296" author="jay" created="Thu, 13 Mar 2014 23:55:13 +0000"  >&lt;p&gt;Hi Patrick,&lt;/p&gt;

&lt;p&gt;cl_lock is dying, so please don&apos;t waste any time on it. A simplified version of cl_lock will be introduced in the CLIO simplification project.&lt;/p&gt;

&lt;p&gt;Jinshan&lt;/p&gt;</comment>
                            <comment id="79312" author="paf" created="Fri, 14 Mar 2014 02:45:13 +0000"  >&lt;p&gt;OK.  I like that answer. &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;</comment>
                            <comment id="80550" author="ihara" created="Sat, 29 Mar 2014 05:36:55 +0000"  >&lt;p&gt;We need this fix in b2_5, so it has been backported: &lt;a href=&quot;http://review.whamcloud.com/9851&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/9851&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="80572" author="bfaccini" created="Sun, 30 Mar 2014 17:25:13 +0000"  >&lt;p&gt;After cross-checking against the problems encountered at the CEA/Tera-100 site since they upgraded to Lustre 2.4.2, it appears that they are also hitting almost all of the &lt;span class=&quot;error&quot;&gt;&amp;#91;L&amp;#93;&lt;/span&gt;BUGs described in this ticket. Here is the list:&lt;br/&gt;
           _ &quot;ASSERTION( ols-&amp;gt;ols_state == OLS_NEW ) failed: Impossible state: 6&quot;, also reported in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4692&quot; title=&quot;(osc_lock.c:1204:osc_lock_enqueue()) ASSERTION( ols-&amp;gt;ols_state == OLS_NEW ) failed: Impossible state: 6&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4692&quot;&gt;&lt;del&gt;LU-4692&lt;/del&gt;&lt;/a&gt;.&lt;br/&gt;
           _ &quot;ASSERTION( cl_lock_is_mutexed(slice-&amp;gt;cls_lock)&quot; also reported in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4693&quot; title=&quot;(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice-&amp;gt;cls_lock) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4693&quot;&gt;&lt;del&gt;LU-4693&lt;/del&gt;&lt;/a&gt;/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4797&quot; title=&quot;ASSERTION( cl_lock_is_mutexed(slice-&amp;gt;cls_lock) ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4797&quot;&gt;&lt;del&gt;LU-4797&lt;/del&gt;&lt;/a&gt;.&lt;br/&gt;
           _ GPF in cl_lock_delete0().&lt;br/&gt;
           _ GPF in cl_lock_put().&lt;br/&gt;
           _ GPF in osc_lock_detach(), also reported in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3614&quot; title=&quot;Kernel Panic &amp;quot;osc_lock_detach&amp;quot;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3614&quot;&gt;&lt;del&gt;LU-3614&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I have encouraged them to try the debug-trace setting (rpctrace+dlmtrace) to see if it helps to avoid or reduce the frequency of the crashes, and they enabled it last Friday night. We will see on Monday whether their very bad stats (about 8 of the different crashes listed per day) have improved.&lt;/p&gt;

&lt;p&gt;I have added Lustre 2.4.2 to the list of affected versions for this ticket.&lt;/p&gt;

&lt;p&gt;What is unclear to me (and to the CEA people) about this ticket is:&lt;br/&gt;
           _ does the patch &lt;a href=&quot;http://review.whamcloud.com/9524&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/9524&lt;/a&gt; really fix these bugs?&lt;br/&gt;
           _ if not, can we consider that reverting patches of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3889&quot; title=&quot; LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock-&amp;gt;cll_state &amp;gt;= CLS_QUEUING ) &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3889&quot;&gt;&lt;del&gt;LU-3889&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; (and more ?) is the fix ?&lt;/p&gt;</comment>
                            <comment id="80607" author="paf" created="Mon, 31 Mar 2014 14:26:17 +0000"  >&lt;p&gt;Bruno - Here&apos;s a breakdown from the Cray perspective, where we/I&apos;ve been looking at these for a while.&lt;/p&gt;

&lt;p&gt;9524 does not fix any single specific assertion/GPF.  In my own testing, with mmstress on a 70-ish node system, it did not significantly reduce incidence of the bugs listed in this ticket.  (I wouldn&apos;t have noticed anything less than probably a 50% reduction, however.  So it may improve things a bit.)&lt;/p&gt;

&lt;p&gt;According to review comments on 9524, it improves the success rate with racer.  I haven&apos;t checked that, as racer isn&apos;t part of our usual test suite.  (We don&apos;t pass it often enough.)&lt;/p&gt;

&lt;p&gt;Cray has found a set of patches that seems to avoid the problems, though I don&apos;t believe it really fixes them.  As I described in this comment:&lt;br/&gt;
&lt;a href=&quot;https://jira.hpdd.intel.com/browse/LU-4591?focusedCommentId=76411&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-76411&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://jira.hpdd.intel.com/browse/LU-4591?focusedCommentId=76411&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-76411&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We identified these patches as relevant:&lt;/p&gt;

&lt;p&gt;13079de &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3889&quot; title=&quot; LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock-&amp;gt;cll_state &amp;gt;= CLS_QUEUING ) &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3889&quot;&gt;&lt;del&gt;LU-3889&lt;/del&gt;&lt;/a&gt; osc: Allow lock to be canceled at ENQ time (&lt;a href=&quot;http://review.whamcloud.com/#/c/8405/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/8405/&lt;/a&gt;)&lt;br/&gt;
7168ea8 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; clio: Do not shrink sublock at cancel (&lt;a href=&quot;http://review.whamcloud.com/#/c/7569/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/7569/&lt;/a&gt;)&lt;br/&gt;
521335c &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3433&quot; title=&quot;Encountered a assertion for the ols_state being set to a impossible state&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3433&quot;&gt;&lt;del&gt;LU-3433&lt;/del&gt;&lt;/a&gt; clio: wrong cl_lock usage (&lt;a href=&quot;http://review.whamcloud.com/#/c/6709/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/6709/&lt;/a&gt;)&lt;br/&gt;
I1ea629 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; lov: to not modify lov lock when sublock is canceled (&lt;a href=&quot;http://review.whamcloud.com/#/c/7841/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/7841/&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;I give the (lengthy) details in my original comment, but in essence, we removed:&lt;br/&gt;
I1ea629 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; lov: to not modify lov lock when sublock is canceled (&lt;a href=&quot;http://review.whamcloud.com/#/c/7841/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/7841/&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;From our 2.4 and 2.5, and have not seen any of the assertions/GPFs you listed in general testing since.  In specific, focused testing with debug disabled, I&apos;ve been able to hit one or two of them.  So I don&apos;t believe pulling the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; lov patch has fixed any specific code flaws (also, no one has identified any problems with it); rather, I think it&apos;s probably changed timing.&lt;/p&gt;

&lt;p&gt;Still, the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; lov patch was landed because it&apos;s believed to fix a flaw in the code, not because there was a specific crash it fixed.  The &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; clio patch fixed the original reproducer for the assertion reported in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So, we decided it was safe to pull the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; lov patch.  We&apos;ve been running that way for some time, with excellent results.  (And as far as we can tell, no new bugs introduced.)&lt;/p&gt;</comment>
                            <comment id="80837" author="bfaccini" created="Wed, 2 Apr 2014 13:08:47 +0000"  >&lt;p&gt;Patrick, thanks for all these clarifications that are very helpful!!&lt;/p&gt;</comment>
                            <comment id="80893" author="jay" created="Wed, 2 Apr 2014 19:26:10 +0000"  >&lt;p&gt;I&apos;m working on this issue.&lt;/p&gt;</comment>
                            <comment id="80897" author="bfaccini" created="Wed, 2 Apr 2014 22:32:00 +0000"  >&lt;p&gt;Just a small comment to indicate that CEA has not had any crashes since they enabled only dlmtrace last Friday!&lt;br/&gt;
Also, they have a full system downtime for maintenance planned next Wednesday, April 9th, so this will be a good time for them either to run a Lustre version where the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; lov patch has been pulled, as suggested by Patrick, or to apply any other fix we may have identified or developed to avoid the crashes ...&lt;/p&gt;</comment>
                            <comment id="80900" author="jay" created="Wed, 2 Apr 2014 23:14:11 +0000"  >&lt;p&gt;Yes, it&apos;s recommended to revert the 2nd patch of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt;. I believe that patch is not the root cause of this problem, but reverting it can significantly reduce the chance of hitting it.&lt;/p&gt;</comment>
                            <comment id="80903" author="jay" created="Thu, 3 Apr 2014 00:24:47 +0000"  >&lt;p&gt;I think all occurrences of the problems point to the same root cause - the sub lock has already been freed. I think this is a race of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4558&quot; title=&quot;Crash in cl_lock_put on racer&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4558&quot;&gt;&lt;del&gt;LU-4558&lt;/del&gt;&lt;/a&gt;, I will create a patch.&lt;/p&gt;</comment>
                            <comment id="80905" author="jay" created="Thu, 3 Apr 2014 00:42:35 +0000"  >&lt;p&gt;Patch is at &lt;a href=&quot;http://review.whamcloud.com/9876&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/9876&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Patrick, can you please give it a try since you can consistently reproduce it?&lt;/p&gt;

&lt;p&gt;Jinshan&lt;/p&gt;</comment>
                            <comment id="80971" author="paf" created="Thu, 3 Apr 2014 18:56:58 +0000"  >&lt;p&gt;Jinshan - With this patch applied on top of master, I&apos;m getting a GPF in cl_lock_delete0, on line 841, which is this:&lt;br/&gt;
&amp;#8212;&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;                /*
                 * From now on, no &lt;span class=&quot;code-keyword&quot;&gt;new&lt;/span&gt; references to &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; lock can be acquired
                 * by cl_lock_lookup().
                 */
                cfs_list_for_each_entry_reverse(slice, &amp;amp;lock-&amp;gt;cll_layers,
                                                cls_linkage) {
                        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (slice-&amp;gt;cls_ops-&amp;gt;clo_delete != NULL) &amp;lt;---- This line here.
                                slice-&amp;gt;cls_ops-&amp;gt;clo_delete(env, slice);
                }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&amp;#8212;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;Edit: I had previously said this was happening in cl_lock_delete at the call to cl_lock_delete0; I was misreading the stack trace - Sorry about that.&amp;#93;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;This happens swiftly when running mmstress, even with full debug enabled.&lt;/p&gt;

&lt;p&gt;I&apos;m going to take a quick look to see if I can understand why, but I&apos;ll probably upload a dump (with full dk logs enabled) shortly...&lt;/p&gt;</comment>
                            <comment id="80972" author="jay" created="Thu, 3 Apr 2014 19:07:48 +0000"  >&lt;p&gt;Can you please give me stack trace?&lt;/p&gt;</comment>
                            <comment id="80973" author="paf" created="Thu, 3 Apr 2014 19:11:10 +0000"  >&lt;p&gt;Oh, duh.  Sorry Jinshan - I forgot:&lt;/p&gt;

&lt;p&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81006591&amp;gt;&amp;#93;&lt;/span&gt; try_stack_unwind+0x161/0x1a0&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81004de9&amp;gt;&amp;#93;&lt;/span&gt; dump_trace+0x89/0x440&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81005ffc&amp;gt;&amp;#93;&lt;/span&gt; show_trace_log_lvl+0x5c/0x80&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81006035&amp;gt;&amp;#93;&lt;/span&gt; show_trace+0x15/0x20&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81367a12&amp;gt;&amp;#93;&lt;/span&gt; dump_stack+0x79/0x84&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81367ab1&amp;gt;&amp;#93;&lt;/span&gt; panic+0x94/0x1da&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81005e18&amp;gt;&amp;#93;&lt;/span&gt; oops_end+0xa8/0xe0&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81005f4b&amp;gt;&amp;#93;&lt;/span&gt; die+0x5b/0x90&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100362a&amp;gt;&amp;#93;&lt;/span&gt; do_general_protection+0x15a/0x160&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8136b11f&amp;gt;&amp;#93;&lt;/span&gt; general_protection+0x1f/0x30&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c2588&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_delete0+0x198/0x200 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c273b&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_delete+0x14b/0x190 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c2bd7&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_finish+0x37/0x60 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c508a&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_hold_mutex+0x3ba/0x620 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c5346&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_hold+0x56/0x120 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa069f133&amp;gt;&amp;#93;&lt;/span&gt; lov_lock_enqueue+0x8e3/0xf80 &lt;span class=&quot;error&quot;&gt;&amp;#91;lov&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c3e9b&amp;gt;&amp;#93;&lt;/span&gt; cl_enqueue_try+0xfb/0x320 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c489f&amp;gt;&amp;#93;&lt;/span&gt; cl_enqueue_locked+0x7f/0x1f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02c548e&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_request+0x7e/0x270 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02ca434&amp;gt;&amp;#93;&lt;/span&gt; cl_io_lock+0x394/0x5c0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02ca6fa&amp;gt;&amp;#93;&lt;/span&gt; cl_io_loop+0x9a/0x1a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa07634f8&amp;gt;&amp;#93;&lt;/span&gt; ll_fault+0x308/0x4e0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff811135e6&amp;gt;&amp;#93;&lt;/span&gt; __do_fault+0x76/0x570&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81113b84&amp;gt;&amp;#93;&lt;/span&gt; handle_pte_fault+0xa4/0xcc0&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8111494e&amp;gt;&amp;#93;&lt;/span&gt; handle_mm_fault+0x1ae/0x240&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81026caf&amp;gt;&amp;#93;&lt;/span&gt; do_page_fault+0x18f/0x420&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8136b14f&amp;gt;&amp;#93;&lt;/span&gt; page_fault+0x1f/0x30&lt;/p&gt;



&lt;p&gt;Dump is here:&lt;/p&gt;

&lt;p&gt;ftp.whamcloud.com&lt;br/&gt;
uploads/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4591&quot; title=&quot;Related cl_lock failures on master/2.5&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4591&quot;&gt;&lt;del&gt;LU-4591&lt;/del&gt;&lt;/a&gt;/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4591&quot; title=&quot;Related cl_lock failures on master/2.5&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4591&quot;&gt;&lt;del&gt;LU-4591&lt;/del&gt;&lt;/a&gt;_cl_lock_delete0_gpf_140403.tar.gz&lt;/p&gt;</comment>
                            <comment id="80974" author="jay" created="Thu, 3 Apr 2014 19:17:12 +0000"  >&lt;p&gt;Please use patch version 2 of &lt;a href=&quot;http://review.whamcloud.com/9881&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/9881&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="80976" author="amk" created="Thu, 3 Apr 2014 19:28:04 +0000"  >&lt;p&gt;Jinshan, you might want to take a look at &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4861&quot; title=&quot;App hung - deadlock in cl_lock_mutex_get along cl_glimpse_lock path&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4861&quot;&gt;&lt;del&gt;LU-4861&lt;/del&gt;&lt;/a&gt;. Cray is also seeing a deadlock in cl_lock_mutex_get along the cl_glimpse_lock path. This is independent of Patrick&apos;s testing. Seems like this different behavior could shed light on or confirm your suspicions about the root cause of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4591&quot; title=&quot;Related cl_lock failures on master/2.5&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4591&quot;&gt;&lt;del&gt;LU-4591&lt;/del&gt;&lt;/a&gt;. I didn&apos;t want to add the deadlock info here in case it&apos;s an unrelated problem.&lt;/p&gt;</comment>
                            <comment id="80988" author="paf" created="Thu, 3 Apr 2014 20:53:36 +0000"  >&lt;p&gt;Jinshan - Wow.  Finally some good news on these bugs - early testing results on master are perfect.  I would expect to have seen 10-20 instances of these various bugs by now in my testing, and I have not seen any yet.&lt;/p&gt;

&lt;p&gt;I&apos;m adding this patch and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; lov to Cray 2.5, and will be doing some lengthy testing tonight, both with the mmstress reproducer and general Cray IO stress testing.  I may also try racer.sh for a while as well.&lt;/p&gt;</comment>
                            <comment id="80999" author="jay" created="Thu, 3 Apr 2014 22:19:52 +0000"  >&lt;p&gt;That&apos;s really good, Patrick. Thank you for your effort on this bug.&lt;/p&gt;

&lt;p&gt;Hi Ann, I will take a look at it soon.&lt;/p&gt;

&lt;p&gt;Jinshan&lt;/p&gt;</comment>
                            <comment id="81046" author="paf" created="Fri, 4 Apr 2014 15:50:32 +0000"  >&lt;p&gt;Jinshan - Testing last night completed without any problems.  Thank you very much for this - It looks like we&apos;ve probably finally fixed a bug we&apos;ve been working on for a long time.  (Much of the work on this bug on our side happened before we opened the bug with you.)&lt;/p&gt;

&lt;p&gt;I&apos;ve given a positive review to the mod as well.  I think this ticket could probably be closed as a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4558&quot; title=&quot;Crash in cl_lock_put on racer&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4558&quot;&gt;&lt;del&gt;LU-4558&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Again, thank you very much.  I&apos;ve been working on this in various forms since about October of last year.&lt;/p&gt;</comment>
                            <comment id="81053" author="jay" created="Fri, 4 Apr 2014 16:24:09 +0000"  >&lt;p&gt;duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4558&quot; title=&quot;Crash in cl_lock_put on racer&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4558&quot;&gt;&lt;del&gt;LU-4558&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="81064" author="patrick.valentin" created="Fri, 4 Apr 2014 17:45:18 +0000"  >&lt;p&gt;Patrick,&lt;/p&gt;

&lt;p&gt;in your comment on 03/Apr/14 8:53 PM, you wrote: &lt;br/&gt;
        &amp;gt; I&apos;m adding this patch and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; lov to Cray 2.5, and ...&lt;br/&gt;
Does it mean you apply &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4558&quot; title=&quot;Crash in cl_lock_put on racer&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4558&quot;&gt;&lt;del&gt;LU-4558&lt;/del&gt;&lt;/a&gt; patch AND you continue to revert &quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; lov&quot;, or does it mean that you reapply &quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; lov&quot; you were previously reverting ?&lt;/p&gt;

&lt;p&gt;Today we built Lustre 2.4.2 with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4558&quot; title=&quot;Crash in cl_lock_put on racer&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4558&quot;&gt;&lt;del&gt;LU-4558&lt;/del&gt;&lt;/a&gt; patch set 2 without reverting &quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; lov&quot; for tests at CEA, and it seems they still have the issue. I should have confirmation of the test results tonight or next Monday.&lt;/p&gt;

&lt;p&gt;Thanks in advance&lt;br/&gt;
Patrick&lt;/p&gt;
</comment>
                            <comment id="81067" author="paf" created="Fri, 4 Apr 2014 17:56:33 +0000"  >&lt;p&gt;Patrick -&lt;/p&gt;

&lt;p&gt;I ran Cray&apos;s 2.5 (which is very similar to Intel&apos;s 2.5.1) with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4558&quot; title=&quot;Crash in cl_lock_put on racer&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4558&quot;&gt;&lt;del&gt;LU-4558&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; lov.  So I reapplied that patch, which I was previously reverting.&lt;/p&gt;

&lt;p&gt;It worries me (obviously) that CEA continued to see problems in 2.4.2.  Can you share which specific assertions/GPFs they continued to hit?  And if possible, what codes were causing the issue?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;Patrick&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="81079" author="jay" created="Fri, 4 Apr 2014 19:31:52 +0000"  >&lt;p&gt;Hi Patrick Valentin,&lt;/p&gt;

&lt;p&gt;Please apply patch &lt;a href=&quot;http://review.whamcloud.com/9881&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/9881&lt;/a&gt; to your branch and don&apos;t revert anything.&lt;/p&gt;

&lt;p&gt;I was wondering why you talked to yourself and it took me a while to figure out that you guys have the same first name :-D&lt;/p&gt;

&lt;p&gt;Jinshan&lt;/p&gt;</comment>
                            <comment id="81282" author="patrick.valentin" created="Wed, 9 Apr 2014 14:36:00 +0000"  >&lt;p&gt;Hi Jinshan and Patrick,&lt;br/&gt;
In my comment last Friday, I said that tests were run at CEA and that they still had the issue. In fact, they only ran a subset of the tests, and they did not have crashes. They had client evictions, but this is probably another problem (perhaps &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4495&quot; title=&quot;client evicted on parallel append write to the shared file.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4495&quot;&gt;&lt;del&gt;LU-4495&lt;/del&gt;&lt;/a&gt;).&lt;br/&gt;
Monday and Tuesday were the cluster maintenance window. Tests will restart today with Lustre 2.4.3 plus &lt;a href=&quot;http://review.whamcloud.com/9881&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/9881&lt;/a&gt; and without reverting &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt;. I hope to have news soon.&lt;/p&gt;

&lt;p&gt;Patrick&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="23408">LU-4692</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="23822">LU-4797</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="27557">LU-5910</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="14281" name="0002-LELUS-203-clio-no-recursive-closures.patch" size="7569" author="paf" created="Wed, 12 Mar 2014 18:56:24 +0000"/>
                            <attachment id="14191" name="cl_lock_debug_patch.diff" size="9056" author="paf" created="Fri, 28 Feb 2014 20:24:57 +0000"/>
                            <attachment id="14282" name="lelus-203-lock-hold-v1.patch" size="562" author="paf" created="Wed, 12 Mar 2014 18:56:47 +0000"/>
                            <attachment id="14059" name="locks.log" size="10697" author="aboyko" created="Thu, 6 Feb 2014 08:34:09 +0000"/>
                            <attachment id="14048" name="mmstress.tar.gz" size="40442" author="paf" created="Wed, 5 Feb 2014 23:17:22 +0000"/>
                            <attachment id="14080" name="osc.ko" size="213" author="paf" created="Mon, 10 Feb 2014 20:24:50 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwee7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>12543</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>