<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:50:52 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5366] BUG 6063: lock collide during recovery</title>
                <link>https://jira.whamcloud.com/browse/LU-5366</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt; At 16:34 today, one of our mds nodes hit an LBUG that appears to be &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5294&quot; title=&quot;mdd_unlink() returning -7&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5294&quot;&gt;&lt;del&gt;LU-5294&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Jul 17 16:34:09 atlas-mds3.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;794253.417021&amp;#93;&lt;/span&gt; LustreError: 15235:0:(lu_object.h:867:lu_object_attr()) ASSERTION( ((o)-&amp;gt;lo_header-&amp;gt;loh_attr &amp;amp; LOHA_EXISTS) != 0 ) failed: &lt;br/&gt;
Jul 17 16:34:09 atlas-mds3.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;794253.430991&amp;#93;&lt;/span&gt; LustreError: 15235:0:(lu_object.h:867:lu_object_attr()) LBUG&lt;/p&gt;

&lt;p&gt;We performed a crash dump and the MDS rebooted. We entered recovery at 17:54, and at 19:24 the time remaining reached 0, but the MDS was still in Recovering status. We have been getting these messages from the MDS.&lt;/p&gt;

&lt;p&gt;[ 8689.325886] LustreError: 19309:0:(ldlm_lockd.c:878:ldlm_server_blocking_ast()) ### BUG 6063: lock collide during recovery ns: mdt-atlas2-MDT0000_UUID lock: ffff881d06e30900/0xf35a0587ba982321 lrc: 3/0,0 m0&lt;br/&gt;
[ 8689.364338] LustreError: 19309:0:(ldlm_lockd.c:878:ldlm_server_blocking_ast()) Skipped 2 previous similar messages&lt;br/&gt;
[ 8739.604058] Lustre: atlas2-MDT0000: Denying connection for new client 7c9cecb7-6c21-5cab-c1b6-5ab153ff6158 (at 8173@gni100), waiting for all 20156 known clients (19103 recovered, 1032 in progress, and 21 6&lt;br/&gt;
[ 8739.627327] Lustre: Skipped 18 previous similar messages&lt;br/&gt;
[ 8973.664327] Lustre: atlas2-MDT0000: Client 1d186399-bf63-fee6-2b63-8fddd9e7fba3 (at 83@gni2) reconnecting, waiting for 20156 clients in recovery for 0:32&lt;br/&gt;
[ 8973.679915] Lustre: Skipped 374 previous similar messages&lt;br/&gt;
[ 8973.685115] Lustre: atlas2-MDT0000: Client f27002de-ab6c-80a4-91a7-0d2824607322 (at 80@gni2) refused reconnection, still busy with 1 active RPCs&lt;br/&gt;
[ 8973.685117] Lustre: Skipped 374 previous similar messages&lt;br/&gt;
[ 9065.583029] Lustre: atlas2-MDT0000: recovery is timed out, evict stale exports&lt;br/&gt;
[ 9413.230945] Lustre: atlas2-MDT0000: Denying connection for new client 1cbe0fe6-3a30-b20e-2504-2e3665d3e188 (at 11386@gni100), waiting for all 20156 known clients (19126 recovered, 1008 in progress, and 220&lt;br/&gt;
[ 9413.254413] Lustre: Skipped 24 previous similar messages&lt;br/&gt;
[ 9441.943231] LustreError: 0:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 376s: evicting client at 11681@gni100  ns: mdt-atlas2-MDT0000_UUID lock: ffff881d07c7f240/0xf0&lt;br/&gt;
[ 9441.985270] LustreError: 0:0:(ldlm_lockd.c:402:waiting_locks_callback()) Skipped 2 previous similar messages&lt;br/&gt;
[ 9442.001611] Lustre: atlas2-MDT0000: recovery is timed out, evict stale exports&lt;br/&gt;
[ 9442.024632] LustreError: 19309:0:(ldlm_lockd.c:878:ldlm_server_blocking_ast()) ### BUG 6063: lock collide during recovery ns: mdt-atlas2-MDT0000_UUID lock: ffff883fc7b21240/0xf35a0587ba988598 lrc: 3/0,0 m0&lt;br/&gt;
[ 9442.062793] LustreError: 19309:0:(ldlm_lockd.c:878:ldlm_server_blocking_ast()) Skipped 1 previous similar message&lt;br/&gt;
[ 9574.246608] Lustre: atlas2-MDT0000: Client f27002de-ab6c-80a4-91a7-0d2824607322 (at 80@gni2) reconnecting, waiting for 20156 clients in recovery for 3:04&lt;br/&gt;
[ 9574.262195] Lustre: Skipped 368 previous similar messages&lt;br/&gt;
[ 9574.268355] Lustre: atlas2-MDT0000: Client f27002de-ab6c-80a4-91a7-0d2824607322 (at 80@gni2) refused reconnection, still busy with 1 active RPCs&lt;br/&gt;
[ 9574.283062] Lustre: Skipped 368 previous similar messages&lt;br/&gt;
[ 9818.311480] Lustre: atlas2-MDT0000: recovery is timed out, evict stale exports&lt;/p&gt;</description>
                <environment>RHEL 6.5</environment>
        <key id="25640">LU-5366</key>
            <summary>BUG 6063: lock collide during recovery</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="bobijam">Zhenyu Xu</assignee>
                                    <reporter username="curtispb">Philip B Curtis</reporter>
                        <labels>
                    </labels>
                <created>Fri, 18 Jul 2014 00:42:04 +0000</created>
                <updated>Thu, 11 Dec 2014 18:26:58 +0000</updated>
                            <resolved>Thu, 11 Dec 2014 18:26:58 +0000</resolved>
                                    <version>Lustre 2.4.3</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>10</watches>
                                                                            <comments>
                            <comment id="89444" author="curtispb" created="Fri, 18 Jul 2014 00:47:58 +0000"  >&lt;p&gt;Sorry, the first &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5294&quot; title=&quot;mdd_unlink() returning -7&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5294&quot;&gt;&lt;del&gt;LU-5294&lt;/del&gt;&lt;/a&gt; should have been &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5285&quot; title=&quot;mdt_reconstruct_setattr() calls mdt_attr_get_complex() without checking that object exists&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5285&quot;&gt;&lt;del&gt;LU-5285&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="89449" author="pjones" created="Fri, 18 Jul 2014 01:40:36 +0000"  >&lt;p&gt;Bobijam is looking into this issue&lt;/p&gt;</comment>
                            <comment id="89451" author="curtispb" created="Fri, 18 Jul 2014 02:04:54 +0000"  >&lt;p&gt;The mds just reported that it finished recovery at 21:58. We will wait to hear back to make sure it is safe to release to users. We wouldn&apos;t want it to hit that LBUG again in 10 minutes.&lt;/p&gt;</comment>
                            <comment id="89452" author="bobijam" created="Fri, 18 Jul 2014 02:25:53 +0000"  >&lt;p&gt;So your MDS does not contain &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5285&quot; title=&quot;mdt_reconstruct_setattr() calls mdt_attr_get_complex() without checking that object exists&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5285&quot;&gt;&lt;del&gt;LU-5285&lt;/del&gt;&lt;/a&gt; patch, is it?&lt;/p&gt;</comment>
                            <comment id="89453" author="curtispb" created="Fri, 18 Jul 2014 02:32:46 +0000"  >&lt;p&gt;That is correct. We do not have the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5285&quot; title=&quot;mdt_reconstruct_setattr() calls mdt_attr_get_complex() without checking that object exists&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5285&quot;&gt;&lt;del&gt;LU-5285&lt;/del&gt;&lt;/a&gt; patch.&lt;/p&gt;</comment>
                            <comment id="89454" author="blakecaldwell" created="Fri, 18 Jul 2014 02:39:35 +0000"  >&lt;p&gt;We are on 2.4.3 with patches and are stuck there for a base in the near term future. Is a backport to 2.4.3 possible?&lt;/p&gt;</comment>
                            <comment id="89456" author="bobijam" created="Fri, 18 Jul 2014 02:48:40 +0000"  >&lt;p&gt;yes, I&apos;m doing the back port for b2_4&lt;/p&gt;</comment>
                            <comment id="89457" author="curtispb" created="Fri, 18 Jul 2014 02:55:39 +0000"  >&lt;p&gt;We have been seeing these errors that started shortly after recovery reported that it had finished.&lt;/p&gt;</comment>
                            <comment id="89458" author="bobijam" created="Fri, 18 Jul 2014 03:11:32 +0000"  >&lt;p&gt;this error messages reporting that the MDS is doing the object pre-create reservation on OSTs&lt;/p&gt;</comment>
                            <comment id="89459" author="curtispb" created="Fri, 18 Jul 2014 03:12:00 +0000"  >&lt;p&gt;I should also note that the fs is unresponsive at this time due to the lock timeouts, but it has not LBUG&apos;d yet.&lt;/p&gt;</comment>
                            <comment id="89460" author="bobijam" created="Fri, 18 Jul 2014 03:26:45 +0000"  >&lt;p&gt;I suspect it&apos;s MDS&apos;s osp hasn&apos;t successfully connected to OSTs yet. Can you check the MDS connection to OSTs messages?&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5285&quot; title=&quot;mdt_reconstruct_setattr() calls mdt_attr_get_complex() without checking that object exists&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5285&quot;&gt;&lt;del&gt;LU-5285&lt;/del&gt;&lt;/a&gt; back port for b2_4 &lt;a href=&quot;http://review.whamcloud.com/11136&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/11136&lt;/a&gt; , for b2_5 &lt;a href=&quot;http://review.whamcloud.com/11137&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/11137&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="89461" author="curtispb" created="Fri, 18 Jul 2014 03:56:41 +0000"  >&lt;p&gt;I&apos;ll include the lustrekernel log of what has been happening since the LBUG initially happened in case that is helpful.&lt;/p&gt;</comment>
                            <comment id="89462" author="blakecaldwell" created="Fri, 18 Jul 2014 03:59:03 +0000"  >&lt;p&gt;It looks like osps are good All 1008 are showing as UP! The syslog messages are attached too.&lt;br/&gt;
1007 UP osp atlas2-OST03e8-osc-MDT0000 atlas2-MDT0000-mdtlov_UUID 5&lt;br/&gt;
1008 UP osp atlas2-OST03e9-osc-MDT0000 atlas2-MDT0000-mdtlov_UUID 5&lt;br/&gt;
1009 UP osp atlas2-OST03ea-osc-MDT0000 atlas2-MDT0000-mdtlov_UUID 5&lt;br/&gt;
1010 UP osp atlas2-OST03eb-osc-MDT0000 atlas2-MDT0000-mdtlov_UUID 5&lt;br/&gt;
1011 UP osp atlas2-OST03ec-osc-MDT0000 atlas2-MDT0000-mdtlov_UUID 5&lt;br/&gt;
1012 UP osp atlas2-OST03ed-osc-MDT0000 atlas2-MDT0000-mdtlov_UUID 5&lt;br/&gt;
1013 UP osp atlas2-OST03ee-osc-MDT0000 atlas2-MDT0000-mdtlov_UUID 5&lt;br/&gt;
1014 UP osp atlas2-OST03ef-osc-MDT0000 atlas2-MDT0000-mdtlov_UUID 5&lt;/p&gt;</comment>
                            <comment id="89463" author="bobijam" created="Fri, 18 Jul 2014 04:39:16 +0000"  >&lt;p&gt;in the log, I saw a deadlock stack trace like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2173&quot; title=&quot;Some sort of deadlock in lod_qos code&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2173&quot;&gt;&lt;del&gt;LU-2173&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.381954] INFO: task mdt00_000:15203 blocked for more than 120 seconds.
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.389665] &quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.398633] mdt00_000     D 0000000000000002     0 15203      2 0x00000000
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.406483]  ffff881e85227560 0000000000000046 ffff880000001000 00000000e028a510
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.415032]  ffff884051f26440 04c88840d1000000 ffff881e6a06a980 10000000000004c8
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.423606]  ffff88205015fab8 ffff881e85227fd8 000000000000fb88 ffff88205015fab8
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.432161] Call Trace:
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.435008]  [&amp;lt;ffffffff81277817&amp;gt;] ? kobject_put+0x27/0x60
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.441161]  [&amp;lt;ffffffff8150eb75&amp;gt;] rwsem_down_failed_common+0x95/0x1d0
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.448474]  [&amp;lt;ffffffff811b5b2e&amp;gt;] ? bh_lru_install+0x16e/0x1a0
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.455112]  [&amp;lt;ffffffff8150ecd3&amp;gt;] rwsem_down_write_failed+0x23/0x30
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.462252]  [&amp;lt;ffffffff81281d13&amp;gt;] call_rwsem_down_write_failed+0x13/0x20
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.469886]  [&amp;lt;ffffffff8150e1d2&amp;gt;] ? down_write+0x32/0x40
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.475981]  [&amp;lt;ffffffffa0e0bca5&amp;gt;] lod_alloc_qos.clone.0+0x175/0x1180 [lod]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.483819]  [&amp;lt;ffffffffa0bc10af&amp;gt;] ? qsd_op_begin+0x5f/0xb40 [lquota]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.491046]  [&amp;lt;ffffffffa0e0e70a&amp;gt;] lod_qos_prep_create+0x74a/0x1b14 [lod]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.498675]  [&amp;lt;ffffffffa0882ad2&amp;gt;] ? fld_server_lookup+0x72/0x3d0 [fld]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.506117]  [&amp;lt;ffffffffa0e090db&amp;gt;] lod_declare_striped_object+0x14b/0x880 [lod]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.514461]  [&amp;lt;ffffffffa0c9cccb&amp;gt;] ? osd_xattr_get+0x21b/0x2d0 [osd_ldiskfs]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.522374]  [&amp;lt;ffffffffa0e09d21&amp;gt;] lod_declare_object_create+0x511/0x7a0 [lod]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.530511]  [&amp;lt;ffffffffa0b1a8cf&amp;gt;] mdd_declare_object_create_internal+0xbf/0x1f0 [mdd]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.539511]  [&amp;lt;ffffffffa0b29ffe&amp;gt;] mdd_declare_create+0x4e/0x870 [mdd]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.546842]  [&amp;lt;ffffffffa0b287ff&amp;gt;] ? mdd_linkea_prepare+0x23f/0x430 [mdd]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.554487]  [&amp;lt;ffffffffa0b2afe5&amp;gt;] mdd_create+0x7c5/0x1790 [mdd]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.561269]  [&amp;lt;ffffffffa0c9cb47&amp;gt;] ? osd_xattr_get+0x97/0x2d0 [osd_ldiskfs]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.569122]  [&amp;lt;ffffffffa0d6863e&amp;gt;] mdt_reint_open+0x13ae/0x21c0 [mdt]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.576366]  [&amp;lt;ffffffffa03d583e&amp;gt;] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.584811]  [&amp;lt;ffffffffa06e8f6c&amp;gt;] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.592927]  [&amp;lt;ffffffffa0d52cd1&amp;gt;] mdt_reint_rec+0x41/0xe0 [mdt]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.599699]  [&amp;lt;ffffffffa0d37af3&amp;gt;] mdt_reint_internal+0x4c3/0x780 [mdt]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.607141]  [&amp;lt;ffffffffa0d38080&amp;gt;] mdt_intent_reint+0x1f0/0x530 [mdt]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.614413]  [&amp;lt;ffffffffa0d35f2e&amp;gt;] mdt_intent_policy+0x39e/0x720 [mdt]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.621774]  [&amp;lt;ffffffffa06a0841&amp;gt;] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.629424]  [&amp;lt;ffffffffa06c72cf&amp;gt;] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.637467]  [&amp;lt;ffffffffa0d363b6&amp;gt;] mdt_enqueue+0x46/0xe0 [mdt]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.644012]  [&amp;lt;ffffffffa0d3ab57&amp;gt;] mdt_handle_common+0x647/0x16d0 [mdt]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.651447]  [&amp;lt;ffffffffa06e9d4c&amp;gt;] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.659756]  [&amp;lt;ffffffffa0d76a55&amp;gt;] mds_regular_handle+0x15/0x20 [mdt]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.667020]  [&amp;lt;ffffffffa06f9568&amp;gt;] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.675847]  [&amp;lt;ffffffffa03b95de&amp;gt;] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.682999]  [&amp;lt;ffffffffa03cad9f&amp;gt;] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.690743]  [&amp;lt;ffffffffa06f08c9&amp;gt;] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.698478]  [&amp;lt;ffffffff81063b80&amp;gt;] ? default_wake_function+0x0/0x20
Jul 17 22:01:57 atlas-mds3.ccs.ornl.gov kernel: [15498.705548]  [&amp;lt;ffffffffa06fa8fe&amp;gt;] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jul 17 22:01:57 atlas-mds3.ccs.ornl.gov kernel: [15498.712718]  [&amp;lt;ffffffffa06f9e30&amp;gt;] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jul 17 22:01:57 atlas-mds3.ccs.ornl.gov kernel: [15498.719882]  [&amp;lt;ffffffff8100c0ca&amp;gt;] child_rip+0xa/0x20
Jul 17 22:01:57 atlas-mds3.ccs.ornl.gov kernel: [15498.725563]  [&amp;lt;ffffffffa06f9e30&amp;gt;] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jul 17 22:01:57 atlas-mds3.ccs.ornl.gov kernel: [15498.732711]  [&amp;lt;ffffffffa06f9e30&amp;gt;] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jul 17 22:01:57 atlas-mds3.ccs.ornl.gov kernel: [15498.739844]  [&amp;lt;ffffffff8100c0c0&amp;gt;] ? child_rip+0x0/0x20
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Bzzz, can you take a look?&lt;/p&gt;</comment>
                            <comment id="89464" author="blakecaldwell" created="Fri, 18 Jul 2014 05:03:43 +0000"  >&lt;p&gt;We&apos;ve built a lustre build with 2.4 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5285&quot; title=&quot;mdt_reconstruct_setattr() calls mdt_attr_get_complex() without checking that object exists&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5285&quot;&gt;&lt;del&gt;LU-5285&lt;/del&gt;&lt;/a&gt; backport, but are waiting now pending investigation of deadlock causing the MDT unresponsiveness. Thanks.&lt;/p&gt;</comment>
                            <comment id="89465" author="jfc" created="Fri, 18 Jul 2014 05:04:34 +0000"  >&lt;p&gt;Bobijam,&lt;br/&gt;
Can you please say again who you want to take a look at this?&lt;br/&gt;
I do not know who you mean by &apos;Bzzz&apos;&lt;br/&gt;
Thank you,&lt;br/&gt;
~ jfc.&lt;/p&gt;</comment>
                            <comment id="89466" author="bobijam" created="Fri, 18 Jul 2014 05:11:26 +0000"  >&lt;p&gt;bzzz refers to Alex Zhuravlev&lt;/p&gt;</comment>
                            <comment id="89467" author="curtispb" created="Fri, 18 Jul 2014 06:32:00 +0000"  >&lt;p&gt;Any updates on this line of investigation?&lt;/p&gt;</comment>
                            <comment id="89468" author="bobijam" created="Fri, 18 Jul 2014 06:52:32 +0000"  >&lt;p&gt;I haven&apos;t found the root cause yet.&lt;/p&gt;</comment>
                            <comment id="89489" author="bzzz" created="Fri, 18 Jul 2014 16:11:17 +0000"  >&lt;p&gt;those traces were dumped after the recovery. we aren&apos;t allocating new objects via normal precreation during recovery, we trust the clients to provide striping.&lt;/p&gt;</comment>
                            <comment id="89500" author="bzzz" created="Fri, 18 Jul 2014 16:37:06 +0000"  >&lt;p&gt;can we have a look at the logs from atlas2-OST024c, please?&lt;br/&gt;
those traces might be a sign of very slow precreation process when a single file need many stripes. this could explain missing process with the semaphore held.&lt;/p&gt;</comment>
                            <comment id="89523" author="blakecaldwell" created="Fri, 18 Jul 2014 18:22:47 +0000"  >&lt;p&gt;Uploaded logs from oss with atlas2-OST024c. Start time is before the LBUG until present 7-18 14:22 EST.&lt;/p&gt;</comment>
                            <comment id="89860" author="hilljjornl" created="Wed, 23 Jul 2014 17:35:05 +0000"  >&lt;p&gt;With the upload last week has there been any progress towards a root cause?&lt;/p&gt;

&lt;p&gt;&amp;#8211;&lt;br/&gt;
-Jason&lt;/p&gt;</comment>
                            <comment id="89927" author="green" created="Thu, 24 Jul 2014 06:29:39 +0000"  >&lt;p&gt;So the only way I can think of for the &quot;bug 6063&quot; message to print would be if some of the code somewhere ignored the &quot;replay&quot; flag in the RPC and attempted to return a new lock to the client as part of replay RPC.&lt;/p&gt;

&lt;p&gt;Interestingly, I do not see any asserts for this in the code, so if this is what really happened, we&apos;d never know unless it manifests itself in this way.&lt;/p&gt;</comment>
                            <comment id="101340" author="jamesanunez" created="Thu, 11 Dec 2014 18:26:58 +0000"  >&lt;p&gt;Please reopen this ticket if you experience this issue again.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="15397" name="lustre_logs_atlas-oss3b5" size="21883" author="blakecaldwell" created="Fri, 18 Jul 2014 18:22:47 +0000"/>
                            <attachment id="15394" name="lustrekernel.txt" size="2329297" author="curtispb" created="Fri, 18 Jul 2014 03:56:41 +0000"/>
                            <attachment id="15393" name="mds_errors_after-recovery.out.rtf" size="2241" author="curtispb" created="Fri, 18 Jul 2014 02:55:39 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwrqf:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>14965</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10021"><![CDATA[2]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>