<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:22:26 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-2109] __llog_process_thread() GPF</title>
                <link>https://jira.whamcloud.com/browse/LU-2109</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Observed after rebooting grove-mds2 onto our latest kernel and lustre-orion tag.  The server was rebooted while under a fairly heavy test load.  No detailed investigation has been done yet; the MDS restarted successfully after the node was rebooted again, and claims to have completed recovery this time.&lt;/p&gt;

&lt;p&gt;&amp;#8212; First boot &amp;#8212;&lt;/p&gt;

&lt;p&gt;Mounting grove-mds2/mgs on /mnt/lustre/local/lstest-MGS0000&lt;br/&gt;
Lustre: Lustre: Build Version: orion-2_2_49_57_3-49chaos-49chaos--PRISTINE-2.6.32-220.17.1.3chaos.ch5.x86_64&lt;br/&gt;
Lustre: MGS: Mounted grove-mds2/mgs&lt;br/&gt;
Mounting grove-mds2/mdt0 on /mnt/lustre/local/lstest-MDT0000&lt;br/&gt;
LustreError: 11-0: MGC172.20.5.2@o2ib500: Communicating with 0@lo, operation llog_origin_handle_create failed with -2&lt;br/&gt;
LustreError: 4503:0:(mgc_request.c:250:do_config_log_add()) failed processing sptlrpc log: -2&lt;br/&gt;
Lustre: 4508:0:(fld_index.c:356:fld_index_init()) srv-lstest-MDT0000: File &quot;fld&quot; doesn&apos;t support range lookup, using stub. DNE and FIDs on OST will not work with this backend&lt;br/&gt;
Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.48@o2ib500 &lt;br/&gt;
Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.2.199@o2ib500&lt;br/&gt;
LNet: 26244:0:(o2iblnd_cb.c:2340:kiblnd_passive_connect()) Conn race 172.20.4.51@o2ib500 &lt;br/&gt;
LNet: 26244:0:(o2iblnd_cb.c:2340:kiblnd_passive_connect()) Conn race 172.20.4.44@o2ib500 &lt;br/&gt;
Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.53@o2ib500 &lt;br/&gt;
Lustre: Skipped 186 previous similar messages&lt;br/&gt;
LustreError: 4567:0:(llog_cat.c:186:llog_cat_id2handle()) error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc -2&lt;br/&gt;
LustreError: 4567:0:(llog_cat.c:503:llog_cat_cancel_records()) Cannot find log 0x5a5a5a5a5a5a5a5a&lt;br/&gt;
LustreError: 4567:0:(llog_cat.c:533:llog_cat_cancel_records()) Cancel 0 of 1 llog-records failed: -2&lt;br/&gt;
LustreError: 4567:0:(osp_sync.c:705:osp_sync_process_committed()) lstest-OST0281-osc-MDT0000: can&apos;t cancel record: -2&lt;br/&gt;
LustreError: 4567:0:(llog_cat.c:186:llog_cat_id2handle()) error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc -2&lt;br/&gt;
LustreError: 4567:0:(llog_cat.c:503:llog_cat_cancel_records()) Cannot find log 0x5a5a5a5a5a5a5a5a&lt;br/&gt;
LustreError: 4567:0:(llog_cat.c:533:llog_cat_cancel_records()) Cancel 0 of 1 llog-records failed: -2&lt;br/&gt;
LustreError: 4567:0:(osp_sync.c:705:osp_sync_process_committed()) lstest-OST0281-osc-MDT0000: can&apos;t cancel record: -2&lt;br/&gt;
general protection fault: 0000 &amp;#91;#1&amp;#93; SMP &lt;br/&gt;
last sysfs file: /sys/module/sg/initstate&lt;br/&gt;
CPU 7 &lt;/p&gt;

&lt;p&gt;Pid: 4567, comm: osp-syn-641&lt;br/&gt;
 Tainted: P        W  ----------------   2.6.32-220.17.1.3chaos.ch5.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH&lt;br/&gt;
RIP: 0010:&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0694252&amp;gt;&amp;#93;&lt;/span&gt;  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0694252&amp;gt;&amp;#93;&lt;/span&gt; __llog_process_thread+0x2a2/0xc80 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
RSP: 0018:ffff882f66e9db60  EFLAGS: 00010206&lt;br/&gt;
RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000008701 RCX: 0000000000000000&lt;br/&gt;
RDX: 000000000021e000 RSI: 0000000000000001 RDI: ffff882f63cfe000&lt;br/&gt;
RBP: ffff882f66e9dc10 R08: ffff88179d8f1900 R09: 0000000000000000&lt;br/&gt;
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88179caf4058&lt;br/&gt;
R13: 000000000000fcff R14: ffff882f63cfc000 R15: ffff882f63cfe000&lt;br/&gt;
FS:  00007ffff7fdc700(0000) GS:ffff881894820000(0000) knlGS:0000000000000000&lt;br/&gt;
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b&lt;br/&gt;
CR2: 00007ffff7ff9000 CR3: 0000000001a85000 CR4: 00000000000006e0&lt;br/&gt;
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000&lt;br/&gt;
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400&lt;br/&gt;
Process osp-syn-641&lt;br/&gt;
 (pid: 4567, threadinfo ffff882f66e9c000, task ffff882ff9988ae0)&lt;br/&gt;
Stack:&lt;br/&gt;
 0000000000002000 0000000000000050 0000000000000010 0000000000000000&lt;br/&gt;
&amp;lt;0&amp;gt; ffff882f63cfc001 000086fe00000000 0000000000000000 ffff882f640124c0&lt;br/&gt;
&amp;lt;0&amp;gt; ffff882f66e9de80 0000000000000000 000000000021e000 0000fd009aee4b80&lt;br/&gt;
Call Trace:&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0dd4570&amp;gt;&amp;#93;&lt;/span&gt; ? osp_sync_process_queues+0x0/0xf60 &lt;span class=&quot;error&quot;&gt;&amp;#91;osp&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0694d33&amp;gt;&amp;#93;&lt;/span&gt; __llog_process+0x103/0x4d0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06961bb&amp;gt;&amp;#93;&lt;/span&gt; llog_cat_process_cb+0x21b/0x290 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06946fe&amp;gt;&amp;#93;&lt;/span&gt; __llog_process_thread+0x74e/0xc80 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810618d4&amp;gt;&amp;#93;&lt;/span&gt; ? enqueue_task_fair+0x64/0x100&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0695fa0&amp;gt;&amp;#93;&lt;/span&gt; ? llog_cat_process_cb+0x0/0x290 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0694d33&amp;gt;&amp;#93;&lt;/span&gt; __llog_process+0x103/0x4d0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06953d8&amp;gt;&amp;#93;&lt;/span&gt; __llog_cat_process+0x98/0x260 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0dd4570&amp;gt;&amp;#93;&lt;/span&gt; ? osp_sync_process_queues+0x0/0xf60 &lt;span class=&quot;error&quot;&gt;&amp;#91;osp&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81051ba3&amp;gt;&amp;#93;&lt;/span&gt; ? __wake_up+0x53/0x70 &lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0dd64f2&amp;gt;&amp;#93;&lt;/span&gt; osp_sync_thread+0x1c2/0x620 &lt;span class=&quot;error&quot;&gt;&amp;#91;osp&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0dd6330&amp;gt;&amp;#93;&lt;/span&gt; ? osp_sync_thread+0x0/0x620 &lt;span class=&quot;error&quot;&gt;&amp;#91;osp&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100c14a&amp;gt;&amp;#93;&lt;/span&gt; child_rip+0xa/0x20&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0dd6330&amp;gt;&amp;#93;&lt;/span&gt; ? osp_sync_thread+0x0/0x620 &lt;span class=&quot;error&quot;&gt;&amp;#91;osp&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0dd6330&amp;gt;&amp;#93;&lt;/span&gt; ? osp_sync_thread+0x0/0x620 &lt;span class=&quot;error&quot;&gt;&amp;#91;osp&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100c140&amp;gt;&amp;#93;&lt;/span&gt; ? child_rip+0x0/0x20&lt;/p&gt;


&lt;p&gt;&amp;#8212; Second boot &amp;#8212;&lt;/p&gt;

&lt;p&gt;Mounting grove-mds2/mgs on /mnt/lustre/local/lstest-MGS0000&lt;br/&gt;
Lustre: Lustre: Build Version: orion-2_2_49_57_3-49chaos-49chaos--PRISTINE-2.6.32-220.17.1.3chaos.ch5.x86_64&lt;br/&gt;
Lustre: MGS: Mounted grove-mds2/mgs&lt;br/&gt;
Mounting grove-mds2/mdt0 on /mnt/lustre/local/lstest-MDT0000&lt;br/&gt;
LustreError: 11-0: MGC172.20.5.2@o2ib500: Communicating with 0@lo, operation llog_origin_handle_create failed with -2&lt;br/&gt;
LustreError: 4525:0:(mgc_request.c:250:do_config_log_add()) failed processing sptlrpc log: -2&lt;br/&gt;
Lustre: 4530:0:(fld_index.c:356:fld_index_init()) srv-lstest-MDT0000: File &quot;fld&quot; doesn&apos;t support range lookup, using stub. DNE and FIDs on OST will not work with this backend&lt;br/&gt;
Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.183@o2ib500&lt;br/&gt;
Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.22@o2ib500 &lt;br/&gt;
Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.24@o2ib500 &lt;br/&gt;
Lustre: Skipped 3 previous similar messages&lt;br/&gt;
Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.36@o2ib500 &lt;br/&gt;
Lustre: Skipped 2 previous similar messages&lt;br/&gt;
Lustre: lstest-MDT0000: Mounted grove-mds2/mdt0&lt;br/&gt;
Lustre: lstest-MDT0000: Will be in recovery for at least 5:00, or until 255 clients reconnect.&lt;br/&gt;
LustreError: 11-0: lstest-OST0282-osc-MDT0000: Communicating with 172.20.4.42@o2ib500, operation ost_connect failed with -16&lt;br/&gt;
Lustre: lstest-MDT0000: Recovery over after 1:10, of 255 clients 255 recovered and 0 were evicted. &lt;br/&gt;
LustreError: 11-0: lstest-OST0282-osc-MDT0000: Communicating with 172.20.4.42@o2ib500, operation ost_connect failed with -16&lt;/p&gt;</description>
                <environment></environment>
        <key id="14687">LU-2109</key>
            <summary>__llog_process_thread() GPF</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="tappro">Mikhail Pershin</assignee>
                                    <reporter username="behlendorf">Brian Behlendorf</reporter>
                        <labels>
                            <label>HB</label>
                            <label>sequoia</label>
                    </labels>
                <created>Fri, 1 Jun 2012 13:26:36 +0000</created>
                <updated>Fri, 19 Apr 2013 16:24:20 +0000</updated>
                            <resolved>Fri, 19 Apr 2013 16:24:20 +0000</resolved>
                                    <version>Lustre 2.4.0</version>
                                    <fixVersion>Lustre 2.4.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="42120" author="liwei" created="Mon, 23 Jul 2012 07:45:56 +0000"  >&lt;p&gt;Brian,&lt;/p&gt;

&lt;p&gt;Given that the request buffers would not be retrievable after being freed, I suspect that the plain log handles were freed somehow while osp_sync_thread() was processing them.  If this sounds possible, I wonder if you could apply &lt;a href=&quot;http://review.whamcloud.com/3443&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/3443&lt;/a&gt; and try reproducing the failure?  The libcfs module parameter &quot;libcfs_panic_on_lbug&quot; should be set to &quot;0&quot; on the MDS in order to get the debug log.  I was not able to reproduce it in the lab with a &quot;mdsrate --unlink&quot; workload.&lt;/p&gt;</comment>
                            <comment id="42448" author="behlendorf" created="Mon, 30 Jul 2012 12:39:26 +0000"  >&lt;p&gt;We&apos;ll pull the debug patch into our branch tag and see if we&apos;re able to reproduce the issue.  We still appear to hit it sporadically when restarting the servers.&lt;/p&gt;</comment>
                            <comment id="42452" author="behlendorf" created="Mon, 30 Jul 2012 12:52:54 +0000"  >&lt;p&gt;Haven&apos;t gotten the debug patch in place just yet, although we did just see a slight variation on this.  This time we didn&apos;t get poison.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2012-07-30 09:44:03 grove-mds2 login: Lustre: 21215:0:(fld_index.c:354:fld_index_init()) srv-lstest-MDT0000: File &quot;fld&quot; doesn&apos;t support range lookup, using stub. DNE and FIDs on OST will not work with this backend
2012-07-30 09:44:40 LustreError: 21529:0:(llog_cat.c:184:llog_cat_id2handle()) error opening log id 0xffff8817ff250000:fbf1f800: rc -2
2012-07-30 09:44:40 LustreError: 21529:0:(llog_cat.c:501:llog_cat_cancel_records()) Cannot find log 0xffff8817ff250000
2012-07-30 09:44:40 LustreError: 21529:0:(llog_cat.c:531:llog_cat_cancel_records()) Cancel 0 of 1 llog-records failed: -2
2012-07-30 09:44:40 LustreError: 21529:0:(osp_sync.c:705:osp_sync_process_committed()) lstest-OST02d9-osc-MDT0000: can&apos;t cancel record: -2
2012-07-30 09:44:40 LustreError: 21529:0:(llog_cat.c:184:llog_cat_id2handle()) error opening log id 0xffff8817ff250000:fbf1f800: rc -2
2012-07-30 09:44:40 LustreError: 21529:0:(llog_cat.c:501:llog_cat_cancel_records()) Cannot find log 0xffff8817ff250000
2012-07-30 09:44:40 LustreError: 21529:0:(llog_cat.c:531:llog_cat_cancel_records()) Cancel 0 of 1 llog-records failed: -2
2012-07-30 09:44:40 LustreError: 21529:0:(osp_sync.c:705:osp_sync_process_committed()) lstest-OST02d9-osc-MDT0000: can&apos;t cancel record: -2
2012-07-30 09:44:40 general protection fault: 0000 [#1] SMP
2012-07-30 09:44:40 last sysfs file: /sys/module/ptlrpc/initstate
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Also note that the crash isn&apos;t being caused by an LBUG; it&apos;s a GPF, so setting &quot;libcfs_panic_on_lbug&quot; to 0 isn&apos;t going to help.  Resolving the address in the earlier posted stack suggests that &apos;lop-&amp;gt;lop_next_block&apos; is probably a garbage address we&apos;re attempting to jump to.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;(gdb) list *( __llog_process_thread+0x2a2)
0x60c2 is in __llog_process_thread (/home/behlendo/src/git/lustre/lustre/include/lustre_log.h:648).
warning: Source file is more recent than executable.
643             ENTRY;
644
645             rc = llog_handle2ops(loghandle, &amp;amp;lop);
646             if (rc)
647                     RETURN(rc);
648             if (lop-&amp;gt;lop_next_block == NULL)
649                     RETURN(-EOPNOTSUPP);
650
651             rc = lop-&amp;gt;lop_next_block(env, loghandle, cur_idx, next_idx,
652                                      cur_offset, buf, len);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="42475" author="liwei" created="Mon, 30 Jul 2012 22:35:21 +0000"  >&lt;p&gt;The value of &quot;lop&quot; was 0x5a5a5a5a5a5a5a5a (i.e., RAX), according to&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;llog_next_block():
/root/lustre-dev/lustre/include/lustre_log.h:648
    60c2:       48 8b 40 08             mov    0x8(%rax),%rax
    60c6:       48 85 c0                test   %rax,%rax
    60c9:       0f 84 81 01 00 00       je     6250 &amp;lt;__llog_process_thread+0x430&amp;gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The log handles do not have their own slab cache.  In the latest &quot;variation&quot;, the handle in question might have been allocated to some other structure, which could explain why neither a valid log ID nor the poison was seen.&lt;/p&gt;

&lt;p&gt;I&apos;ll think harder about how the handles could be freed.&lt;/p&gt;</comment>
                            <comment id="42636" author="liwei" created="Fri, 3 Aug 2012 00:46:23 +0000"  >&lt;p&gt;I have updated the diagnostic patch to narrow down the problematic zone.&lt;/p&gt;</comment>
                            <comment id="42753" author="behlendorf" created="Mon, 6 Aug 2012 15:02:22 +0000"  >&lt;p&gt;Observed again, but unfortunately without the updated diagnostic patch.  I&apos;ll pull the updated version into our next tag.&lt;/p&gt;</comment>
                            <comment id="42754" author="behlendorf" created="Mon, 6 Aug 2012 15:09:34 +0000"  >&lt;p&gt;Actually, I misspoke.  Revision 2 of the diagnostic patch was applied, but not revision 3.  I also tweaked revision 2 to change the CDEBUG() in osp_sync_new_job() to a CERROR() to ensure we saw the error.  There was a good chance we wouldn&apos;t get a debug log, and in fact we didn&apos;t, but we do have the following console output.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2012-08-06 11:35:50 LNet: 13171:0:(o2iblnd_cb.c:2337:kiblnd_passive_connect()) Conn race 172.20.4.42@o2ib500
2012-08-06 11:35:50 LNet: 13171:0:(o2iblnd_cb.c:2337:kiblnd_passive_connect()) Skipped 654 previous similar messages
2012-08-06 11:35:50 LustreError: 21152:0:(llog_cat.c:184:llog_cat_id2handle()) error opening log id 0xffff881801d0e000:be556800: rc -2
2012-08-06 11:35:50 LustreError: 21152:0:(llog_cat.c:501:llog_cat_cancel_records()) Cannot find log 0xffff881801d0e000
2012-08-06 11:35:50 LustreError: 21152:0:(llog_cat.c:531:llog_cat_cancel_records()) Cancel 0 of 1 llog-records failed: -2
2012-08-06 11:35:50 LustreError: 21152:0:(osp_sync.c:709:osp_sync_process_committed()) lstest-OST0293-osc-MDT0000: can&apos;t cancel record: -2
2012-08-06 11:35:50 LustreError: 11-0: lstest-OST0282-osc-MDT0000: Communicating with 172.20.4.42@o2ib500, operation ost_connect failed with -19
2012-08-06 11:35:50 LustreError: Skipped 4 previous similar messages
2012-08-06 11:35:50 LustreError: 21152:0:(osp_sync.c:465:osp_sync_new_job()) Poisoned log ID from handle ffff882ffb055900
2012-08-06 11:35:50 LustreError: 21152:0:(osp_sync.c:465:osp_sync_new_job()) Poisoned log ID from handle ffff882ffb055900
2012-08-06 11:35:50 LustreError: 21152:0:(llog_cat.c:184:llog_cat_id2handle()) error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc -2
2012-08-06 11:35:50 LustreError: 21152:0:(llog_cat.c:501:llog_cat_cancel_records()) Cannot find log 0x5a5a5a5a5a5a5a5a
2012-08-06 11:35:50 LustreError: 21152:0:(llog_cat.c:531:llog_cat_cancel_records()) Cancel 0 of 1 llog-records failed: -2
2012-08-06 11:35:50 LustreError: 21152:0:(osp_sync.c:709:osp_sync_process_committed()) lstest-OST0293-osc-MDT0000: can&apos;t cancel record: -2
2012-08-06 11:35:50 LustreError: 21152:0:(osp_sync.c:859:osp_sync_thread()) ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 0 changes, 7 in progress, 7 in flight: -22
2012-08-06 11:35:50 LustreError: 21152:0:(osp_sync.c:859:osp_sync_thread()) LBUG
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="42813" author="liwei" created="Tue, 7 Aug 2012 11:22:17 +0000"  >&lt;p&gt;Brian,&lt;/p&gt;

&lt;p&gt;Thanks.  The console log further proves that the junk log ID came from freed or reallocated log handles.  I&apos;m looking forward to seeing how revision 3 of the diagnostic patch can help us determine why the handle was freed while being used.&lt;/p&gt;</comment>
                            <comment id="46122" author="ian" created="Mon, 8 Oct 2012 01:57:56 +0000"  >&lt;p&gt;Brian - we haven&apos;t seen an update on this ticket for a couple of months; is it still an issue?&lt;/p&gt;</comment>
                            <comment id="46170" author="prakash" created="Mon, 8 Oct 2012 11:14:01 +0000"  >&lt;p&gt;IIRC, this would only occur sporadically after upgrading, and then subside for an unknown reason. So we can&apos;t really say for sure whether it&apos;s still an issue, because we never really understood the root cause to begin with and didn&apos;t have a solid reproducer. Personally, I&apos;d be OK with moving this to an LU ticket and then resolving it as &quot;won&apos;t fix&quot; or something similar. That way it&apos;s out in the public, and can be re-opened if it is seen again with the 2.3 code.&lt;/p&gt;</comment>
                            <comment id="46178" author="ian" created="Mon, 8 Oct 2012 12:22:04 +0000"  >&lt;p&gt;Moved to Lustre project as this code has landed to master. Marking as Won&apos;t Fix at Prakash&apos;s recommendation until/unless it reoccurs.&lt;/p&gt;</comment>
                            <comment id="46638" author="adilger" created="Tue, 16 Oct 2012 18:27:23 +0000"  >&lt;p&gt;I hit this in my single-node test setup:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;dual-core x86_64, 2GB RAM&lt;/li&gt;
	&lt;li&gt;2.3.53-38-g54d7c3f base
	&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
		&lt;li&gt;added extra patch to do unmount/mount between all tests in acceptance-small.sh&lt;/li&gt;
		&lt;li&gt;&lt;a href=&quot;http://review.whamcloud.com/4276&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/4276&lt;/a&gt; touches the OFD code&lt;/li&gt;
		&lt;li&gt;&lt;a href=&quot;http://review.whamcloud.com/4282&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/4282&lt;/a&gt; touches the OSD code&lt;/li&gt;
		&lt;li&gt;several other patches that are unrelated to this code (test scripts, etc)&lt;/li&gt;
	&lt;/ul&gt;
	&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Failure hit when unmounting after mmp.sh was finished and filesystem was remounting:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Lustre: DEBUG MARKER: == mmp test complete, duration 394 sec 07:09:31 (1350392971)
Lustre: MGC192.168.20.154@tcp: Reactivating import
LustreError: 12112:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context
Lustre: testfs-MDT0000: Temporarily refusing client connection from 0@lo
LustreError: 11-0: an error occurred while communicating with 0@lo. The mds_connect operation failed with -11
Lustre: Found index 2 for testfs-OST0002, updating log
Lustre: 9041:0:(client.c:1909:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1350392977/real 1350392977]  req@ffff8800b1b6bc00 x1415989288763649/t0(0) o8-&amp;gt;testfs-OST0002-osc-MDT0000@0@lo:28/4 lens 400/544 e 0 to 1 dl 1350392982 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 11-0: an error occurred while communicating with 0@lo. The mds_connect operation failed with -11
Lustre: 12244:0:(ofd_obd.c:1069:ofd_orphans_destroy()) testfs-OST0002: deleting orphan objects from 80455 to 85737
Lustre: 12248:0:(ofd_obd.c:1069:ofd_orphans_destroy()) testfs-OST0001: deleting orphan objects from 80346 to 80495
Lustre: Mounted testfs-client
LustreError: 12145:0:(llog_cat.c:187:llog_cat_id2handle()) testfs-OST0001-osc-MDT0000: error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc = -5
LustreError: 12145:0:(llog_cat.c:513:llog_cat_cancel_records()) Cannot find log 0x5a5a5a5a5a5a5a5a
LustreError: 12145:0:(llog_cat.c:544:llog_cat_cancel_records()) testfs-OST0001-osc-MDT0000: fail to cancel 0 of 1 llog-records: rc = -5
LustreError: 12145:0:(osp_sync.c:708:osp_sync_process_committed()) testfs-OST0001-osc-MDT0000: can&apos;t cancel record: -5
LustreError: 12145:0:(llog_cat.c:187:llog_cat_id2handle()) testfs-OST0001-osc-MDT0000: error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc = -5
LustreError: 12145:0:(llog_cat.c:513:llog_cat_cancel_records()) Cannot find log 0x5a5a5a5a5a5a5a5a
LustreError: 12145:0:(llog_cat.c:544:llog_cat_cancel_records()) testfs-OST0001-osc-MDT0000: fail to cancel 0 of 1 llog-records: rc = -5
LustreError: 12145:0:(osp_sync.c:708:osp_sync_process_committed()) testfs-OST0001-osc-MDT0000: can&apos;t cancel record: -5
general protection fault: 0000 [#1] SMP
 Tainted: P    B      ---------------    2.6.32-279.5.1.el6_lustre.g7f15218.x86_64 #1 Dell Inc.
RIP: 0010:[&amp;lt;ffffffffa131856c&amp;gt;]  [&amp;lt;ffffffffa131856c&amp;gt;] llog_process_thread+0x2cc/0xe10 [obdclass]
Process osp-syn-1
Call Trace:
 [&amp;lt;ffffffffa0e4fe00&amp;gt;] ? osp_sync_process_queues+0x0/0x11c0 [osp]
 [&amp;lt;ffffffffa131a8dd&amp;gt;] llog_process_or_fork+0x12d/0x660 [obdclass]
 [&amp;lt;ffffffffa131c8d3&amp;gt;] llog_cat_process_cb+0x2c3/0x370 [obdclass]
 [&amp;lt;ffffffffa1318b9b&amp;gt;] llog_process_thread+0x8fb/0xe10 [obdclass]
 [&amp;lt;ffffffffa131a8dd&amp;gt;] llog_process_or_fork+0x12d/0x660 [obdclass]
 [&amp;lt;ffffffffa131b419&amp;gt;] llog_cat_process_or_fork+0x89/0x290 [obdclass]
 [&amp;lt;ffffffffa0e4fe00&amp;gt;] ? osp_sync_process_queues+0x0/0x11c0 [osp]
 [&amp;lt;ffffffffa131b639&amp;gt;] llog_cat_process+0x19/0x20 [obdclass]
 [&amp;lt;ffffffffa0e51f20&amp;gt;] osp_sync_thread+0x1d0/0x700 [osp]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="46911" author="ian" created="Thu, 25 Oct 2012 11:53:43 +0000"  >&lt;p&gt;Mike - is this related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2129&quot; title=&quot;ASSERTION( last_rec-&amp;gt;lrh_index == tail-&amp;gt;lrt_index )&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2129&quot;&gt;&lt;del&gt;LU-2129&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;http://review.whamcloud.com/4303&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/4303&lt;/a&gt; ?&lt;/p&gt;</comment>
                            <comment id="47287" author="liwei" created="Fri, 2 Nov 2012 07:48:50 +0000"  >&lt;p&gt;Abandoned the diagnostic patch for Orion; new ones are&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/4445&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/4445&lt;/a&gt;  (Not for landing.)&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/4433&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/4433&lt;/a&gt;  (Aimed for landing.)&lt;/p&gt;</comment>
                            <comment id="47295" author="ian" created="Fri, 2 Nov 2012 10:54:15 +0000"  >&lt;p&gt;Brian, can you guys please pull in Li Wei&apos;s new patches above for testing?&lt;/p&gt;</comment>
                            <comment id="47486" author="prakash" created="Tue, 6 Nov 2012 20:12:39 +0000"  >&lt;p&gt;I just pulled those patches into our branch.&lt;/p&gt;</comment>
                            <comment id="47488" author="liwei" created="Tue, 6 Nov 2012 20:18:11 +0000"  >&lt;p&gt;Prakash, thanks.&lt;/p&gt;</comment>
                            <comment id="47532" author="prakash" created="Wed, 7 Nov 2012 13:05:42 +0000"  >&lt;p&gt;Just installed those patches on the MDS and got a hit:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2012-11-07 09:53:31 Lustre: Lustre: Build Version: 2.3.54-3chaos-3chaos--PRISTINE-2.6.32-220.23.1.2chaos.ch5.x86_64
2012-11-07 09:53:32 Mounting grove-mds2/mdt0 on /mnt/lustre/local/lstest-MDT0000
2012-11-07 09:53:33 Lustre: Found index 0 for lstest-MDT0000, updating log
2012-11-07 09:53:33 LustreError: 32702:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
2012-11-07 09:53:33 LustreError: 32705:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context
2012-11-07 09:53:33 LustreError: 137-5: lstest-MDT0000: Not available for connect from 172.20.3.185@o2ib500 (not set up)
2012-11-07 09:53:33 LustreError: 137-5: lstest-MDT0000: Not available for connect from 172.20.17.95@o2ib500 (not set up)
2012-11-07 09:53:34 LustreError: 137-5: lstest-MDT0000: Not available for connect from 172.20.4.157@o2ib500 (not set up)
2012-11-07 09:53:34 LustreError: Skipped 6 previous similar messages
2012-11-07 09:53:35 LustreError: 137-5: lstest-MDT0000: Not available for connect from 172.20.3.104@o2ib500 (not set up)
2012-11-07 09:53:35 LustreError: Skipped 15 previous similar messages
2012-11-07 09:53:35 Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.172@o2ib500
2012-11-07 09:53:35 Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.4.107@o2ib500
2012-11-07 09:53:36 LustreError: 32788:0:(osp_sync.c:584:osp_sync_process_record()) processed all old entries: 0x4378:1
2012-11-07 09:53:36 LustreError: 32799:0:(osp_sync.c:584:osp_sync_process_record()) processed all old entries: 0x437b:1
2012-11-07 09:53:36 Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.4.128@o2ib500
2012-11-07 09:53:36 Lustre: Skipped 39 previous similar messages
2012-11-07 09:53:36 LustreError: 32845:0:(osp_sync.c:584:osp_sync_process_record()) processed all old entries: 0x438a:1
2012-11-07 09:53:36 LustreError: 32845:0:(osp_sync.c:584:osp_sync_process_record()) Skipped 17 previous similar messages
2012-11-07 09:53:37 LustreError: 11-0: lstest-OST01a4-osc-MDT0000: Communicating with 172.20.3.20@o2ib500, operation ost_connect failed with -16
2012-11-07 09:53:37 LNet: 21212:0:(o2iblnd_cb.c:2357:kiblnd_passive_connect()) Conn race 172.20.3.24@o2ib500
2012-11-07 09:53:37 Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.165@o2ib500
2012-11-07 09:53:37 Lustre: Skipped 15 previous similar messages
2012-11-07 09:53:37 LustreError: 32923:0:(osp_sync.c:584:osp_sync_process_record()) processed all old entries: 0x43a4:1
2012-11-07 09:53:37 LustreError: 32923:0:(osp_sync.c:584:osp_sync_process_record()) Skipped 24 previous similar messages
2012-11-07 09:53:38 Lustre: 33016:0:(llog.c:92:llog_free_handle()) Still busy: 2: 0x3efe:0x1:0: 64767 36702 2111296 1
2012-11-07 09:53:38 Pid: 33016, comm: osp-syn-460
2012-11-07 09:53:38 
2012-11-07 09:53:38 
2012-11-07 09:53:38 Call Trace:
2012-11-07 09:53:38  [&amp;lt;ffffffffa05bc965&amp;gt;] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2012-11-07 09:53:38  [&amp;lt;ffffffffa075ae81&amp;gt;] llog_free_handle+0xb1/0x430 [obdclass]
2012-11-07 09:53:38  [&amp;lt;ffffffffa075b25d&amp;gt;] llog_close+0x5d/0x190 [obdclass]
2012-11-07 09:53:38  [&amp;lt;ffffffffa0760ea9&amp;gt;] llog_cat_cancel_records+0x179/0x490 [obdclass]
2012-11-07 09:53:38  [&amp;lt;ffffffffa0946f10&amp;gt;] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
2012-11-07 09:53:38  [&amp;lt;ffffffffa0ff59b8&amp;gt;] osp_sync_process_committed+0x238/0x760 [osp]
2012-11-07 09:53:38  [&amp;lt;ffffffffa09699a7&amp;gt;] ? ptlrpcd_add_req+0x187/0x2e0 [ptlrpc]
2012-11-07 09:53:38  [&amp;lt;ffffffffa0ff5f74&amp;gt;] osp_sync_process_queues+0x94/0x11c0 [osp]
2012-11-07 09:53:38  [&amp;lt;ffffffff8105ea30&amp;gt;] ? default_wake_function+0x0/0x20
2012-11-07 09:53:38  [&amp;lt;ffffffffa075cf3b&amp;gt;] llog_process_thread+0x8fb/0xe10 [obdclass]
2012-11-07 09:53:38  [&amp;lt;ffffffffa0ff5ee0&amp;gt;] ? osp_sync_process_queues+0x0/0x11c0 [osp]
2012-11-07 09:53:38  [&amp;lt;ffffffffa075ec7d&amp;gt;] llog_process_or_fork+0x12d/0x660 [obdclass]
2012-11-07 09:53:38  [&amp;lt;ffffffffa0760c83&amp;gt;] llog_cat_process_cb+0x2d3/0x380 [obdclass]
2012-11-07 09:53:38  [&amp;lt;ffffffffa075cf3b&amp;gt;] llog_process_thread+0x8fb/0xe10 [obdclass]
2012-11-07 09:53:38  [&amp;lt;ffffffffa07609b0&amp;gt;] ? llog_cat_process_cb+0x0/0x380 [obdclass]
2012-11-07 09:53:38  [&amp;lt;ffffffffa075ec7d&amp;gt;] llog_process_or_fork+0x12d/0x660 [obdclass]
2012-11-07 09:53:38  [&amp;lt;ffffffffa075f7b9&amp;gt;] llog_cat_process_or_fork+0x89/0x290 [obdclass]
2012-11-07 09:53:38  [&amp;lt;ffffffff8104cab9&amp;gt;] ? __wake_up_common+0x59/0x90
2012-11-07 09:53:38  [&amp;lt;ffffffffa0ff5ee0&amp;gt;] ? osp_sync_process_queues+0x0/0x11c0 [osp]
2012-11-07 09:53:38  [&amp;lt;ffffffffa075f9d9&amp;gt;] llog_cat_process+0x19/0x20 [obdclass]
2012-11-07 09:53:38  [&amp;lt;ffffffffa05bd83a&amp;gt;] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
2012-11-07 09:53:38  [&amp;lt;ffffffffa0ff8000&amp;gt;] osp_sync_thread+0x1d0/0x700 [osp]
2012-11-07 09:53:38  [&amp;lt;ffffffffa0ff7e30&amp;gt;] ? osp_sync_thread+0x0/0x700 [osp]
2012-11-07 09:53:38  [&amp;lt;ffffffff8100c14a&amp;gt;] child_rip+0xa/0x20
2012-11-07 09:53:38  [&amp;lt;ffffffffa0ff7e30&amp;gt;] ? osp_sync_thread+0x0/0x700 [osp]
2012-11-07 09:53:38  [&amp;lt;ffffffffa0ff7e30&amp;gt;] ? osp_sync_thread+0x0/0x700 [osp]
2012-11-07 09:53:38  [&amp;lt;ffffffff8100c140&amp;gt;] ? child_rip+0x0/0x20
2012-11-07 09:53:38 
2012-11-07 09:53:38 LustreError: 33016:0:(llog_cat.c:187:llog_cat_id2handle()) lstest-OST01cc-osc-MDT0000: error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc = -2
2012-11-07 09:53:38 LustreError: 33016:0:(llog_cat.c:513:llog_cat_cancel_records()) Cannot find log 0x5a5a5a5a5a5a5a5a
2012-11-07 09:53:38 LustreError: 33016:0:(llog_cat.c:552:llog_cat_cancel_records()) lstest-OST01cc-osc-MDT0000: fail to cancel 0 of 1 llog-records: rc = -2
2012-11-07 09:53:38 LustreError: 33016:0:(osp_sync.c:721:osp_sync_process_committed()) @@@ lstest-OST01cc-osc-MDT0000: can&apos;t cancel record 0x5a5a5a5a5a5a5a5a:0x5a5a5a5a5a5a5a5a:1515870810:2:36705: -2
2012-11-07 09:53:38   req@ffff880f75ddf400 x1418000660995471/t0(0) o6-&amp;gt;lstest-OST01cc-osc-MDT0000@172.20.3.60@o2ib500:28/4 lens 664/400 e 0 to 0 dl 1352310924 ref 1 fl Complete:R/0/0 rc 0/-2
2012-11-07 09:53:38 LustreError: 33016:0:(llog_cat.c:187:llog_cat_id2handle()) lstest-OST01cc-osc-MDT0000: error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc = -2
2012-11-07 09:53:38 LustreError: 33016:0:(llog_cat.c:513:llog_cat_cancel_records()) Cannot find log 0x5a5a5a5a5a5a5a5a
2012-11-07 09:53:38 LustreError: 33016:0:(llog_cat.c:552:llog_cat_cancel_records()) lstest-OST01cc-osc-MDT0000: fail to cancel 0 of 1 llog-records: rc = -2
2012-11-07 09:53:38 LustreError: 33016:0:(osp_sync.c:721:osp_sync_process_committed()) @@@ lstest-OST01cc-osc-MDT0000: can&apos;t cancel record 0x5a5a5a5a5a5a5a5a:0x5a5a5a5a5a5a5a5a:1515870810:2:36707: -2
2012-11-07 09:53:38   req@ffff880f73c0bc00 x1418000660995472/t0(0) o6-&amp;gt;lstest-OST01cc-osc-MDT0000@172.20.3.60@o2ib500:28/4 lens 664/400 e 0 to 0 dl 1352310924 ref 1 fl Complete:R/0/0 rc 0/-2
2012-11-07 09:53:38 general protection fault: 0000 [#1] SMP 
2012-11-07 09:53:38 last sysfs file: /sys/devices/system/cpu/cpu31/cache/index2/shared_cpu_map
2012-11-07 09:53:38 CPU 0 
2012-11-07 09:53:38 Modules linked in: osp(U) mdt(U) mdd(U) lod(U) mgs(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) acpi_cpufreq freq_table mperf ksocklnd(U) ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath dm_mod vhost_net macvtap macvlan tun kvm zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate sg ses enclosure sd_mod crc_t10dif isci libsas wmi mpt2sas scsi_transport_sas raid_class sb_edac edac_core ahci i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma shpchp ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core igb dca [last unloaded: cpufreq_ondemand]
2012-11-07 09:53:39 
2012-11-07 09:53:39 Pid: 33016, comm: osp-syn-460
2012-11-07 09:53:39  Tainted: P        W  ----------------   2.6.32-220.23.1.2chaos.ch5.x86_64 #1 appro 2620x-in/S2600GZ
2012-11-07 09:53:39 RIP: 0010:[&amp;lt;ffffffffa075c90c&amp;gt;]  [&amp;lt;ffffffffa075c90c&amp;gt;] llog_process_thread+0x2cc/0xe10 [obdclass]
2012-11-07 09:53:39 RSP: 0018:ffff880f720bdb60  EFLAGS: 00010206
2012-11-07 09:53:39 RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000008f81 RCX: 0000000000000000
2012-11-07 09:53:39 RDX: ffff880f763761c0 RSI: ffff880f79788000 RDI: ffff880f72d3e000
2012-11-07 09:53:39 RBP: ffff880f720bdc00 R08: ffff881f9b6b0500 R09: 0000000000000000
2012-11-07 09:53:39 R10: 0000000000000000 R11: 0000000000000000 R12: ffff881f9a728058
2012-11-07 09:53:39 R13: 000000000000fcff R14: ffff880f72d3c000 R15: ffff880f720bde80
2012-11-07 09:53:39 FS:  00007ffff7fdc700(0000) GS:ffff880060600000(0000) knlGS:0000000000000000
2012-11-07 09:53:39 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
2012-11-07 09:53:39 CR2: 00007ffff7ff9000 CR3: 000000200eb4b000 CR4: 00000000000406f0
2012-11-07 09:53:39 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2012-11-07 09:53:39 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2012-11-07 09:53:39 Process osp-syn-460
2012-11-07 09:53:39  (pid: 33016, threadinfo ffff880f720bc000, task ffff880fd518cae0)
2012-11-07 09:53:39 Stack:
2012-11-07 09:53:39  ffff881f00002000 0000000100000000 ffff881f00000000 ffff880f72d3c001
2012-11-07 09:53:39 &amp;lt;0&amp;gt; 00008f8000000000 0000000000000000 0000000000000000 0000000000240000
2012-11-07 09:53:39 &amp;lt;0&amp;gt; 0000fd00720bdbe0 ffff880f72d3e000 ffff880f763761c0 ffff881f9b6b0500
2012-11-07 09:53:39 Call Trace:
2012-11-07 09:53:39  [&amp;lt;ffffffffa0ff5ee0&amp;gt;] ? osp_sync_process_queues+0x0/0x11c0 [osp]
2012-11-07 09:53:39  [&amp;lt;ffffffffa075ec7d&amp;gt;] llog_process_or_fork+0x12d/0x660 [obdclass]
2012-11-07 09:53:39  [&amp;lt;ffffffffa0760c83&amp;gt;] llog_cat_process_cb+0x2d3/0x380 [obdclass]
2012-11-07 09:53:39  [&amp;lt;ffffffffa075cf3b&amp;gt;] llog_process_thread+0x8fb/0xe10 [obdclass]
2012-11-07 09:53:39  [&amp;lt;ffffffffa07609b0&amp;gt;] ? llog_cat_process_cb+0x0/0x380 [obdclass]
2012-11-07 09:53:39  [&amp;lt;ffffffffa075ec7d&amp;gt;] llog_process_or_fork+0x12d/0x660 [obdclass]
2012-11-07 09:53:39  [&amp;lt;ffffffffa075f7b9&amp;gt;] llog_cat_process_or_fork+0x89/0x290 [obdclass]
2012-11-07 09:53:39  [&amp;lt;ffffffff8104cab9&amp;gt;] ? __wake_up_common+0x59/0x90
2012-11-07 09:53:39  [&amp;lt;ffffffffa0ff5ee0&amp;gt;] ? osp_sync_process_queues+0x0/0x11c0 [osp]
2012-11-07 09:53:39  [&amp;lt;ffffffffa075f9d9&amp;gt;] llog_cat_process+0x19/0x20 [obdclass]
2012-11-07 09:53:39  [&amp;lt;ffffffffa05bd83a&amp;gt;] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
2012-11-07 09:53:39  [&amp;lt;ffffffffa0ff8000&amp;gt;] osp_sync_thread+0x1d0/0x700 [osp]
2012-11-07 09:53:39  [&amp;lt;ffffffffa0ff7e30&amp;gt;] ? osp_sync_thread+0x0/0x700 [osp]
2012-11-07 09:53:39  [&amp;lt;ffffffff8100c14a&amp;gt;] child_rip+0xa/0x20
2012-11-07 09:53:39  [&amp;lt;ffffffffa0ff7e30&amp;gt;] ? osp_sync_thread+0x0/0x700 [osp]
2012-11-07 09:53:39  [&amp;lt;ffffffffa0ff7e30&amp;gt;] ? osp_sync_thread+0x0/0x700 [osp]
2012-11-07 09:53:39  [&amp;lt;ffffffff8100c140&amp;gt;] ? child_rip+0x0/0x20
2012-11-07 09:53:39 Code: 74 0c 00 01 00 00 00 e8 c3 0c e7 ff 48 83 7d b8 00 0f 84 f8 03 00 00 4c 8b 45 b8 49 8b 80 b0 00 00 00 48 85 c0 0f 84 e4 03 00 00 &amp;lt;48&amp;gt; 8b 40 08 48 85 c0 0f 84 67 02 00 00 4d 89 f1 c7 04 24 00 20 
2012-11-07 09:53:39 RIP  [&amp;lt;ffffffffa075c90c&amp;gt;] llog_process_thread+0x2cc/0xe10 [obdclass]
2012-11-07 09:53:39  RSP &amp;lt;ffff880f720bdb60&amp;gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="47567" author="liwei" created="Thu, 8 Nov 2012 02:24:43 +0000"  >&lt;p&gt;The latest report is very helpful.  Here&apos;s my reconstruction:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2012-11-07 09:53:38 Lustre: 33016:0:(llog.c:92:llog_free_handle()) Still busy: 2: 0x3efe:0x1:0: 64767 36702 2111296 1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;An OSP sync thread canceled the last record in a log and was freeing the log handle.  However, the same thread was still processing this log and was at index 36702.  Although both the handle and header structures were poisoned, the thread could continue processing the log because a) index 36702 was less than (LLOG_BITMAP_BYTES * 8 - 1), b) llh_bitmap actually contained non-zero bits after being poisoned, and c) the on-disk canceled records were intact.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2012-11-07 09:53:38 LustreError: 33016:0:(osp_sync.c:721:osp_sync_process_committed()) @@@ lstest-OST01cc-osc-MDT0000: can&apos;t cancel record 0x5a5a5a5a5a5a5a5a:0x5a5a5a5a5a5a5a5a:1515870810:2:36705: -2
2012-11-07 09:53:38   req@ffff880f75ddf400 x1418000660995471/t0(0) o6-&amp;gt;lstest-OST01cc-osc-MDT0000@172.20.3.60@o2ib500:28/4 lens 664/400 e 0 to 0 dl 1352310924 ref 1 fl Complete:R/0/0 rc 0/-2
[...]
2012-11-07 09:53:38 LustreError: 33016:0:(osp_sync.c:721:osp_sync_process_committed()) @@@ lstest-OST01cc-osc-MDT0000: can&apos;t cancel record 0x5a5a5a5a5a5a5a5a:0x5a5a5a5a5a5a5a5a:1515870810:2:36707: -2
2012-11-07 09:53:38   req@ffff880f73c0bc00 x1418000660995472/t0(0) o6-&amp;gt;lstest-OST01cc-osc-MDT0000@172.20.3.60@o2ib500:28/4 lens 664/400 e 0 to 0 dl 1352310924 ref 1 fl Complete:R/0/0 rc 0/-2
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;These were from two records processed after freeing the log handle.  The indices were 36705 and 36707.  If I calculated correctly, 36702, 36705, and 36707 correspond to bits 30, 33, and 35 in the 573rd bitmap word:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; 0 1 0 1  1 0 1 0  0 1 0 1  1 0 1 0  0 1 0 1  1 0 1 0  0 1 0 1  1 0 1 0
63                                  47                               32

 0 1 0 1  1 0 1 0  0 1 0 1  1 0 1 0  0 1 0 1  1 0 1 0  0 1 0 1  1 0 1 0
31                                  15                                0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This was why 36703, 36704, and 36706 did not appear in the log.&lt;/p&gt;</comment>
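The bit arithmetic above can be double-checked with a short sketch. This is illustrative Python, not Lustre code; the 0x5a poison pattern and the record indices are taken from the console log quoted in this comment.

```python
# Hypothetical verification sketch (plain Python, not Lustre code) of the
# bitmap arithmetic above.  After the llog header is poisoned, every
# 64-bit bitmap word holds the byte 0x5a repeated, i.e. 0x5a5a5a5a5a5a5a5a.
# A record index maps to word idx // 64 and bit idx % 64; only indices
# whose bit happens to be set in the poison pattern still look "alive"
# to llog_process_thread().
POISON_WORD = 0x5A5A5A5A5A5A5A5A

def poisoned_bit_set(idx):
    """Return True if record index idx appears set in a poisoned bitmap."""
    bit = idx % 64
    return (POISON_WORD >> bit) % 2 == 1

# Word index for 36702, plus the indices seen (and not seen) in the log.
assert 36702 // 64 == 573                 # the 573rd bitmap word
for idx in (36702, 36705, 36707):         # appeared in the console log
    assert poisoned_bit_set(idx)
for idx in (36703, 36704, 36706):         # did not appear in the log
    assert not poisoned_bit_set(idx)
print("poison pattern explains the observed indices")
```

The poison byte 0x5a is 01011010 in binary, so within each byte only bits 1, 3, 4, and 6 are set, which matches the alternating pattern drawn above.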
                            <comment id="47802" author="prakash" created="Wed, 14 Nov 2012 14:26:59 +0000"  >&lt;p&gt;Thanks for the detailed explanation! I need a better understanding of llogs to fully understand, but it sounds plausible at a high level. Do you have an idea as to how to fix it?&lt;/p&gt;

&lt;p&gt;I&apos;ve also seen the &quot;Still busy&quot; message crop up without the other &quot;can&apos;t cancel record&quot; messages:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;&amp;lt;ConMan&amp;gt; Console [grove-mds2] log at 2012-11-13 07:00:00 PST.
2012-11-13 07:56:45 Lustre: 32930:0:(llog.c:92:llog_free_handle()) Still busy: 2: 0x6541:0x1:0: 64767 64767 4161152 1
2012-11-13 07:56:45 Pid: 32930, comm: osp-syn-431
2012-11-13 07:56:45 
2012-11-13 07:56:45 
2012-11-13 07:56:45 Call Trace:
2012-11-13 07:56:45  [&amp;lt;ffffffffa05aa965&amp;gt;] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2012-11-13 07:56:45  [&amp;lt;ffffffffa0703e81&amp;gt;] llog_free_handle+0xb1/0x430 [obdclass]
2012-11-13 07:56:45  [&amp;lt;ffffffffa070425d&amp;gt;] llog_close+0x5d/0x190 [obdclass]
2012-11-13 07:56:45  [&amp;lt;ffffffffa0709ea9&amp;gt;] llog_cat_cancel_records+0x179/0x490 [obdclass]
2012-11-13 07:56:45  [&amp;lt;ffffffffa08eff20&amp;gt;] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
2012-11-13 07:56:45  [&amp;lt;ffffffffa0fcd9b8&amp;gt;] osp_sync_process_committed+0x238/0x760 [osp]
2012-11-13 07:56:45  [&amp;lt;ffffffffa09129b7&amp;gt;] ? ptlrpcd_add_req+0x187/0x2e0 [ptlrpc]
2012-11-13 07:56:45  [&amp;lt;ffffffffa0fcdf74&amp;gt;] osp_sync_process_queues+0x94/0x11c0 [osp]
2012-11-13 07:56:45  [&amp;lt;ffffffff8105ea30&amp;gt;] ? default_wake_function+0x0/0x20
2012-11-13 07:56:45  [&amp;lt;ffffffffa0705f3b&amp;gt;] llog_process_thread+0x8fb/0xe10 [obdclass]
2012-11-13 07:56:45  [&amp;lt;ffffffffa0fcdee0&amp;gt;] ? osp_sync_process_queues+0x0/0x11c0 [osp]
2012-11-13 07:56:45  [&amp;lt;ffffffffa0707c7d&amp;gt;] llog_process_or_fork+0x12d/0x660 [obdclass]
2012-11-13 07:56:45  [&amp;lt;ffffffffa0709c83&amp;gt;] llog_cat_process_cb+0x2d3/0x380 [obdclass]
2012-11-13 07:56:45  [&amp;lt;ffffffffa0705f3b&amp;gt;] llog_process_thread+0x8fb/0xe10 [obdclass]
2012-11-13 07:56:45  [&amp;lt;ffffffffa07099b0&amp;gt;] ? llog_cat_process_cb+0x0/0x380 [obdclass]
2012-11-13 07:56:45  [&amp;lt;ffffffffa0707c7d&amp;gt;] llog_process_or_fork+0x12d/0x660 [obdclass]
2012-11-13 07:56:45  [&amp;lt;ffffffffa07087b9&amp;gt;] llog_cat_process_or_fork+0x89/0x290 [obdclass]
2012-11-13 07:56:45  [&amp;lt;ffffffff8104cab9&amp;gt;] ? __wake_up_common+0x59/0x90
2012-11-13 07:56:45  [&amp;lt;ffffffffa0fcdee0&amp;gt;] ? osp_sync_process_queues+0x0/0x11c0 [osp]
2012-11-13 07:56:45  [&amp;lt;ffffffffa07089d9&amp;gt;] llog_cat_process+0x19/0x20 [obdclass]
2012-11-13 07:56:45  [&amp;lt;ffffffffa05ab83a&amp;gt;] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
2012-11-13 07:56:45  [&amp;lt;ffffffffa0fd0000&amp;gt;] osp_sync_thread+0x1d0/0x700 [osp]
2012-11-13 07:56:45  [&amp;lt;ffffffffa0fcfe30&amp;gt;] ? osp_sync_thread+0x0/0x700 [osp]
2012-11-13 07:56:45  [&amp;lt;ffffffff8100c14a&amp;gt;] child_rip+0xa/0x20
2012-11-13 07:56:45  [&amp;lt;ffffffffa0fcfe30&amp;gt;] ? osp_sync_thread+0x0/0x700 [osp]
2012-11-13 07:56:45  [&amp;lt;ffffffffa0fcfe30&amp;gt;] ? osp_sync_thread+0x0/0x700 [osp]
2012-11-13 07:56:45  [&amp;lt;ffffffff8100c140&amp;gt;] ? child_rip+0x0/0x20
2012-11-13 07:56:45 

&amp;lt;ConMan&amp;gt; Console [grove-mds2] log at 2012-11-13 08:00:00 PST.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="47899" author="liwei" created="Thu, 15 Nov 2012 22:43:03 +0000"  >&lt;p&gt;Prakash,&lt;/p&gt;

&lt;p&gt;In your last comment, according to&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2012-11-13 07:56:45 Lustre: 32930:0:(llog.c:92:llog_free_handle()) Still busy: 2: 0x6541:0x1:0: 64767 64767 4161152 1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;both lgh_cur_idx and lgh_last_idx were equal to the maximum index allowed in a log, meaning that the thread had processed all the records and would not use the handle anymore.  I think that is why the &quot;can&apos;t cancel record&quot; message and the GPF did not occur.&lt;/p&gt;

&lt;p&gt;In the case described by my last comment, lgh_cur_idx was far smaller than lgh_last_idx.  What puzzles me is how llh_count could become one before the records from lgh_cur_idx to lgh_last_idx were processed.  Following Andreas and Alex&apos;s advice, I applied &lt;a href=&quot;http://review.whamcloud.com/4508&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/4508&lt;/a&gt;, which verifies the consistency between llh_count and llh_bitmap at log open time, and failed an OST under a racer workload every ten seconds for a few hours.  But no inconsistency was found.&lt;/p&gt;

&lt;p&gt;I&apos;ll think again...&lt;/p&gt;</comment>
                            <comment id="47914" author="bzzz" created="Fri, 16 Nov 2012 04:37:42 +0000"  >&lt;p&gt;Notice that during cancellation the llog cookie is taken from a request, not directly from the llog. So if the request is freed, the corresponding memory can be re-used (or just filled with 0x5a).&lt;/p&gt;</comment>
                            <comment id="48289" author="tappro" created="Thu, 22 Nov 2012 13:04:45 +0000"  >&lt;p&gt;I agree with Alex; it is not about a freed loghandle but about a freed logid from the request:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;2012-11-07 09:53:38 LustreError: 33016:0:(llog_cat.c:187:llog_cat_id2handle()) lstest-OST01cc-osc-MDT0000: error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc = -2
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That means cat_id2handle() tried to use a logid that was being freed. And the protection fault in osp_sync_process_queues() is caused simply by calling ptlrpc_req_finished() on a freed request.&lt;/p&gt;</comment>
                            <comment id="48298" author="liwei" created="Fri, 23 Nov 2012 00:17:40 +0000"  >&lt;p&gt;There was evidence early on of poisoned log handles when filling cookies into request buffers.  E.g., from Brian&apos;s comment on Aug 7:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2012-08-06 11:35:50 LustreError: 21152:0:(osp_sync.c:465:osp_sync_new_job()) Poisoned log ID from handle ffff882ffb055900
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="48301" author="tappro" created="Fri, 23 Nov 2012 04:42:01 +0000"  >&lt;p&gt;So, this debug is not in osp_sync_new_job() now? Can we add it again?&lt;/p&gt;</comment>
                            <comment id="48303" author="bzzz" created="Fri, 23 Nov 2012 05:20:31 +0000"  >&lt;p&gt;Yes.  Li Wei, could you combine all the debug you&apos;ve developed (checks for poisoned IDs, llh_count vs. bitmap verification, etc.) into a single patch and land it on master, please?&lt;/p&gt;</comment>
                            <comment id="48310" author="tappro" created="Fri, 23 Nov 2012 10:26:06 +0000"  >&lt;blockquote&gt;
&lt;p&gt;In the case described by my last comment, lgh_cur_idx was far smaller than lgh_last_idx. What puzzles me is how llh_count could become one before the records from lgh_cur_idx to lgh_last_idx were processed.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;But you&apos;ve already answered that.  There were no more records in the llog and the bitmap was empty, but after poisoning it became filled with the pattern, and llog_process_thread() found a set bit, which was 36702.&lt;/p&gt;

&lt;p&gt;I think we have to add a refcount to llog_handle and free it only when the last reference is dropped.&lt;/p&gt;</comment>
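The refcounting idea suggested above can be sketched roughly as follows. This is an illustrative Python model, not the actual llog code; the class and method names are hypothetical stand-ins for llog_handle, handle get/put, and llog_free_handle().

```python
# Hypothetical model of the refcount idea (not Lustre code): the handle
# is freed only when the last reference is dropped, so a processing
# thread that holds its own reference can never see poisoned memory,
# even if the cancel path "closes" the handle first.
class LlogHandle:
    def __init__(self):
        self.refcount = 1        # the creator holds the first reference
        self.freed = False

    def get(self):
        assert not self.freed    # must not resurrect a freed handle
        self.refcount += 1
        return self

    def put(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True    # stands in for llog_free_handle()

h = LlogHandle()
processor = h.get()   # the processing thread takes its own reference
h.put()               # llog_close() on the cancel path: not freed yet
assert not h.freed
processor.put()       # processing finishes: now the handle is freed
assert h.freed
```

In the crash above, by contrast, the handle was freed (and poisoned) while the processing thread was still at index 36702, which is exactly what a per-reference put would prevent.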
                            <comment id="48335" author="liwei" created="Fri, 23 Nov 2012 22:23:51 +0000"  >&lt;p&gt;Mike, I had that thought too.  But considering that&lt;/p&gt;

&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;a log is destroyed only if all its records have been cancelled, and&lt;/li&gt;
	&lt;li&gt;a record is cancelled &lt;em&gt;after&lt;/em&gt; it has been processed,&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;a log should be destroyed at least &lt;em&gt;after&lt;/em&gt; its OSP has processed all its records and likely moved to another log.  In addition, because&lt;/p&gt;

&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;a log is destroyed only if lgh_last_idx equals 64767, and&lt;/li&gt;
	&lt;li&gt;an OSP processes records sequentially towards the end,&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;lgh_cur_idx should be 64767 (or 64766, if a padding record exists at the end) when the OSP has processed all the records in the log.  Thus, the log handle should not reach a state like&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2012-11-07 09:53:38 Lustre: 33016:0:(llog.c:92:llog_free_handle()) Still busy: 2: 0x3efe:0x1:0: 64767 36702 2111296 1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;where lgh_last_idx was 64767, lgh_cur_idx was 36702, and llh_count was 1.&lt;/p&gt;</comment>
                            <comment id="48348" author="bzzz" created="Mon, 26 Nov 2012 01:58:28 +0000"  >&lt;p&gt;&amp;gt; where lgh_last_idx was 64767, lgh_cur_idx was 36702, and llh_count was 1.&lt;/p&gt;

&lt;p&gt;IIRC, in all the reported cases the problem happened right after startup, so 64767 means that this is a llog left over from the previous boot and we&apos;re reprocessing it. Now, given there is no strong ordering among RPCs going to the OST, it&apos;s possible that some RPC did not reach the OST (or its reply was lost), so most of the bits can be cleared (llog records cancelled) with a few exceptions.&lt;/p&gt;

&lt;p&gt;I think it should be possible to reproduce this locally by setting the timeout pretty large (to work around slow performance compared to the customer&apos;s hardware) and dropping one RPC.&lt;/p&gt;

&lt;p&gt;Another debug patch could print a warning when the number of processed records is &amp;gt; llh_count (if we started with a full llog).&lt;/p&gt;
</comment>
                            <comment id="48353" author="tappro" created="Mon, 26 Nov 2012 07:05:55 +0000"  >&lt;p&gt;Li Wei, about &quot;Still busy: 2: 0x3efe:0x1:0: 64767 36702 2111296 1&quot;. Why can&apos;t it be that 36702 was just the latest cancelled record? I noticed that requests are added to the head of the opd_syn_committed_there list, so later requests are processed first; technically it is possible that records with a lower index are cancelled later than those with a larger index.&lt;/p&gt;
</comment>
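The ordering observation above can be sketched in a few lines. This is illustrative Python, not the actual osp_sync code; the list name opd_syn_committed_there is the one mentioned in the comment, and the indices are hypothetical.

```python
# Hypothetical sketch of the LIFO ordering described above (not Lustre
# code): committed requests are added to the head of the list, so the
# cancel path walks them newest-first, and a record with a lower index
# can therefore be cancelled after one with a larger index.
from collections import deque

committed = deque()                    # stands in for opd_syn_committed_there
for record_idx in (36700, 36701, 36702):
    committed.appendleft(record_idx)   # each request goes to the list head

cancel_order = list(committed)         # processing walks from the head
assert cancel_order == [36702, 36701, 36700]
```

So 36702 being the last index mentioned in the "Still busy" line does not by itself prove it was the last record processed.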
                            <comment id="48355" author="tappro" created="Mon, 26 Nov 2012 07:20:16 +0000"  >&lt;p&gt;&lt;del&gt;Looking at osp_sync_process_queues(), I&apos;ve also noticed how we can get into this situation. First, osp_sync_process_committed() was executed and cancelled all the records in the current plain log because they were, say, committed. Then osp_sync_can_process_new() checks whether there is more work to do and can return a positive answer, and osp_sync_process_record() is called, adding the freed logid to the request. This request is processed and gets -ENOENT from the OST. In that case it is also added to the commit list and we get our situation.&lt;/del&gt;&lt;/p&gt;

&lt;p&gt;It is not quite so; the llh is set to NULL when the request is sent. In fact, llog_process_thread() is not safe if a callback deletes the plain llog. We need a mechanism to detect that the llog was deleted and stop processing.&lt;/p&gt;

&lt;p&gt;&lt;del&gt;So I think first of all we must exit from osp_sync_process_queues() if the llog was deleted by the osp_sync_process_committed() call.&lt;/del&gt;&lt;br/&gt;
It would be better to fix llog_process_thread() as mentioned above.&lt;/p&gt;</comment>
                            <comment id="48358" author="bzzz" created="Mon, 26 Nov 2012 12:02:05 +0000"  >&lt;p&gt;I think it makes sense to introduce a local copy of llh_count just before we start processing the llog, decrement it on every live record, and break once it reaches 1. But this logic should be used only when the llog is not getting new records.&lt;/p&gt;</comment>
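The local-copy idea above can be sketched like this. Again, this is illustrative Python, not the actual llog_process_thread() code; the function name and inputs are hypothetical.

```python
# Hypothetical sketch of the suggestion above (not Lustre code): take a
# local copy of llh_count before processing, decrement it for every live
# record, and stop once it reaches 1 (the header itself accounts for one
# count), so a poisoned bitmap cannot keep the loop running past the
# records that really existed.
def process_records(live_indices, llh_count):
    remaining = llh_count          # local copy taken up front
    processed = []
    for idx in live_indices:
        if remaining == 1:         # only the header count is left: stop
            break
        processed.append(idx)
        remaining -= 1
    return processed

# llh_count of 3 means the header plus two real records; any further
# "set" bits are assumed to be poison and must not be processed.
assert process_records([10, 20, 30, 40], 3) == [10, 20]
```

As the comment notes, this only works while the llog is not receiving new records, since new appends would invalidate the local copy.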
                            <comment id="48374" author="prakash" created="Mon, 26 Nov 2012 15:44:39 +0000"  >&lt;p&gt;FWIW, I hit this again. This time, I had all of the OSTs powered down, then rebooted the MDS (which came up fine), then powered up the OSTs. The MDS appeared to be OK when it came up, but once the OSTs were brought back online, the MDS crashed.&lt;/p&gt;

&lt;p&gt;And now as the MDS tries to come back online, it repeatedly hits this (twice so far):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2012-11-26 12:40:40 LustreError: 33226:0:(llog_cat.c:187:llog_cat_id2handle()) lstest-OST01fc-osc-MDT0000: error opening log id 0xaf75:0: rc = -2
2012-11-26 12:40:40 BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
2012-11-26 12:40:40 IP: [&amp;lt;ffffffffa071d3e4&amp;gt;] cat_cancel_cb+0x2e4/0x5e0 [obdclass]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="48384" author="liwei" created="Mon, 26 Nov 2012 22:34:31 +0000"  >&lt;p&gt;Ah, right, it must be the records that were processed but not cancelled before the MDTs were brought down.&lt;/p&gt;</comment>
                            <comment id="48385" author="bzzz" created="Mon, 26 Nov 2012 23:15:49 +0000"  >&lt;p&gt;As for cat_cancel_cb(), I found that llog_cat_init_and_process() forks, so it then runs in parallel with the main OSP&apos;s llog_process(). I think llog_cat_init_and_process() should not fork.&lt;/p&gt;</comment>
                            <comment id="48402" author="tappro" created="Tue, 27 Nov 2012 04:36:39 +0000"  >&lt;p&gt;Prakash, that is a different issue; could you open a new ticket with the data you have?&lt;/p&gt;</comment>
                            <comment id="48416" author="prakash" created="Tue, 27 Nov 2012 12:34:34 +0000"  >&lt;p&gt;I opened &lt;a href=&quot;http://jira.whamcloud.com/browse/LU-2394&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;LU-2394&lt;/a&gt; for the cat_cancel_cb issue, with a link to the patch that addresses it.&lt;/p&gt;</comment>
                            <comment id="48488" author="liwei" created="Wed, 28 Nov 2012 11:09:08 +0000"  >&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/4696&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/4696&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A fix.  For some reason, I wasn&apos;t able to get my new regression test working correctly today.  Instead, I tested this patch with this hack:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;--- a/lustre/obdclass/llog.c
+++ b/lustre/obdclass/llog.c
@@ -137,7 +137,8 @@ &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; llog_cancel_rec(&lt;span class=&quot;code-keyword&quot;&gt;const&lt;/span&gt; struct lu_env *env, struct llog_ha
 
         &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; ((llh-&amp;gt;llh_flags &amp;amp; LLOG_F_ZAP_WHEN_EMPTY) &amp;amp;&amp;amp;
             (llh-&amp;gt;llh_count == 1) &amp;amp;&amp;amp;
-            (loghandle-&amp;gt;lgh_last_idx == (LLOG_BITMAP_BYTES * 8) - 1)) {
+            (loghandle-&amp;gt;lgh_last_idx &amp;gt;= 1)) {
+               loghandle-&amp;gt;lgh_last_idx = (LLOG_BITMAP_BYTES * 8) - 1;
                cfs_spin_unlock(&amp;amp;loghandle-&amp;gt;lgh_hdr_lock);
                rc = llog_destroy(env, loghandle);
                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (rc &amp;lt; 0) {
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="48656" author="adilger" created="Mon, 3 Dec 2012 02:15:23 +0000"  >&lt;p&gt;Li Wei, this is exactly what OBD_FAIL_CHECK() is for: the ability to trigger specific errors in the code at runtime instead of having to change it at compile time.&lt;/p&gt;</comment>
                            <comment id="48796" author="tappro" created="Wed, 5 Dec 2012 07:44:08 +0000"  >&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/#change,4745&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,4745&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An attempt to introduce a refcounter for the llog handle. Please verify that it is correct and works.&lt;/p&gt;</comment>
                            <comment id="49205" author="prakash" created="Thu, 13 Dec 2012 16:09:30 +0000"  >&lt;p&gt;Mike, I pulled in that patch. I&apos;ve also done some initial testing and it looks good so far. I&apos;ve rebooted the MDS a few times while under light load, and haven&apos;t hit the issue yet (with the fix). Without the fix, I was able to hit it on the first try under the same load.&lt;/p&gt;</comment>
                            <comment id="50338" author="tappro" created="Fri, 11 Jan 2013 11:13:01 +0000"  >&lt;p&gt;The patch has landed. Prakash, are you OK with closing this ticket?&lt;/p&gt;</comment>
                            <comment id="50384" author="tappro" created="Sun, 13 Jan 2013 01:33:06 +0000"  >&lt;p&gt;Closing for now; reopen if it appears again.&lt;/p&gt;</comment>
                            <comment id="50413" author="prakash" created="Mon, 14 Jan 2013 10:28:44 +0000"  >&lt;p&gt;Mike, I can&apos;t recall hitting this since pulling in the patch. Thanks!&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="16730">LU-2362</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10040" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic</customfieldname>
                        <customfieldvalues>
                                        <label>server</label>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    <customfield id="customfield_10070" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Project</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10031"><![CDATA[Orion]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzux6f:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>3037</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>