<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:24:51 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-2397] Assertion triggered in check_for_next_transno</title>
                <link>https://jira.whamcloud.com/browse/LU-2397</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I did a quick search of the existing bugs, but didn&apos;t find this one.&lt;/p&gt;

&lt;p&gt;Running the &lt;tt&gt;replay-single&lt;/tt&gt; test in a single-node VM setup, I triggered the following:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
LustreError: 14716:0:(ldlm_lib.c:1749:check_for_next_transno()) ASSERTION( req_transno &amp;gt;= next_transno ) failed: req_transno: 0, next_transno: 51539608157
LustreError: 14716:0:(ldlm_lib.c:1749:check_for_next_transno()) LBUG
Kernel panic - not syncing: LBUG
Pid: 14716, comm: tgt_recov Tainted: P           ---------------    2.6.32-279.9.1.1chaos.ch5.1.x86_64 #1
Call Trace:
 [&amp;lt;ffffffff814fdceb&amp;gt;] ? panic+0xa0/0x168
 [&amp;lt;ffffffffa0705f6b&amp;gt;] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [&amp;lt;ffffffffa0991e75&amp;gt;] ? check_for_next_transno+0x585/0x590 [ptlrpc]
 [&amp;lt;ffffffffa09918f0&amp;gt;] ? check_for_next_transno+0x0/0x590 [ptlrpc]
 [&amp;lt;ffffffffa098be0e&amp;gt;] ? target_recovery_overseer+0x5e/0x250 [ptlrpc]
 [&amp;lt;ffffffffa098a270&amp;gt;] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]
 [&amp;lt;ffffffffa0716591&amp;gt;] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [&amp;lt;ffffffffa0992f55&amp;gt;] ? target_recovery_thread+0x7b5/0x19d0 [ptlrpc]
 [&amp;lt;ffffffffa09927a0&amp;gt;] ? target_recovery_thread+0x0/0x19d0 [ptlrpc]
 [&amp;lt;ffffffff8100c14a&amp;gt;] ? child_rip+0xa/0x20
 [&amp;lt;ffffffffa09927a0&amp;gt;] ? target_recovery_thread+0x0/0x19d0 [ptlrpc]
 [&amp;lt;ffffffffa09927a0&amp;gt;] ? target_recovery_thread+0x0/0x19d0 [ptlrpc]
 [&amp;lt;ffffffff8100c140&amp;gt;] ? child_rip+0x0/0x20
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The test case was simply:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;$ sudo ONLY=&quot;59 60&quot; FSTYPE=zfs ./lustre/tests/replay-single.sh
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
        <environment></environment>
        <key id="16787">LU-2397</key>
        <summary>Assertion triggered in check_for_next_transno</summary>
        <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
        <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
        <statusCategory id="3" key="done" colorName="success"/>
        <resolution id="1">Fixed</resolution>
        <assignee username="tappro">Mikhail Pershin</assignee>
        <reporter username="prakash">Prakash Surya</reporter>
        <labels>
            <label>LB</label>
        </labels>
        <created>Tue, 27 Nov 2012 17:11:45 +0000</created>
        <updated>Fri, 19 Apr 2013 22:30:41 +0000</updated>
        <resolved>Fri, 18 Jan 2013 14:20:52 +0000</resolved>
        <version>Lustre 2.4.0</version>
        <fixVersion>Lustre 2.4.0</fixVersion>
        <due></due>
        <votes>0</votes>
        <watches>6</watches>
        <comments>
                            <comment id="48447" author="pjones" created="Tue, 27 Nov 2012 17:22:40 +0000"  >&lt;p&gt;Could you please comment on this one? Thanks, Peter&lt;/p&gt;</comment>
                            <comment id="48533" author="tappro" created="Thu, 29 Nov 2012 05:46:49 +0000"  >&lt;p&gt;A replay with transno 0? That looks wrong. Is it easy to reproduce with the command you&apos;ve mentioned?&lt;/p&gt;</comment>
                            <comment id="48544" author="prakash" created="Thu, 29 Nov 2012 14:32:20 +0000"  >&lt;p&gt;I only hit it once while running that test in a loop for about six hours, so I wouldn&apos;t call it easy to reproduce. Honestly, there are probably more pressing issues to spend time on, but I just wanted to document it here since I did hit it once.&lt;/p&gt;</comment>
                            <comment id="49817" author="green" created="Mon, 31 Dec 2012 11:37:52 +0000"  >&lt;p&gt;I hit this too, in recovery-ost-single, test 2&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[107343.839932] Lustre: 26202:0:(ldlm_lib.c:2184:target_recovery_init()) RECOVERY: service lustre-OST0000, 2 recoverable clients, last_transno 12884901891
[107348.833481] Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
[107348.836301] LustreError: 26224:0:(ldlm_lib.c:1748:check_for_next_transno()) ASSERTION( req_transno &amp;gt;= next_transno ) failed: req_transno: 0, next_transno: 12884901892
[107348.839179] LustreError: 26224:0:(ldlm_lib.c:1748:check_for_next_transno()) LBUG
[107348.840790] Pid: 26224, comm: tgt_recov
[107348.841717] 
[107348.841720] Call Trace:
[107348.842588]  [&amp;lt;ffffffffa0ae8915&amp;gt;] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[107348.842866]  [&amp;lt;ffffffffa0ae8f27&amp;gt;] lbug_with_loc+0x47/0xb0 [libcfs]
[107348.843157]  [&amp;lt;ffffffffa1331b36&amp;gt;] check_for_next_transno+0x596/0x5a0 [ptlrpc]
[107348.843598]  [&amp;lt;ffffffffa13315a0&amp;gt;] ? check_for_next_transno+0x0/0x5a0 [ptlrpc]
[107348.844039]  [&amp;lt;ffffffffa132bb46&amp;gt;] target_recovery_overseer+0x66/0x230 [ptlrpc]
[107348.844532]  [&amp;lt;ffffffffa1329fb0&amp;gt;] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]
[107348.844827]  [&amp;lt;ffffffffa0af9641&amp;gt;] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[107348.845169]  [&amp;lt;ffffffffa1332ae3&amp;gt;] target_recovery_thread+0x683/0x1660 [ptlrpc]
[107348.845686]  [&amp;lt;ffffffff814faef5&amp;gt;] ? _spin_unlock_irq+0x15/0x20
[107348.846050]  [&amp;lt;ffffffffa1332460&amp;gt;] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
[107348.846522]  [&amp;lt;ffffffff8100c14a&amp;gt;] child_rip+0xa/0x20
[107348.846795]  [&amp;lt;ffffffffa1332460&amp;gt;] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
[107348.847313]  [&amp;lt;ffffffffa1332460&amp;gt;] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
[107348.847798]  [&amp;lt;ffffffff8100c140&amp;gt;] ? child_rip+0x0/0x20
[107348.848056] 
[107348.848285] BUG: spinlock cpu recursion on CPU#3, ll_ost01_003/25839 (Not tainted)
[107348.848750]  lock: ffff88004d00cc38, .magic: dead4ead, .owner: tgt_recov/26224, .owner_cpu: 3
[107348.849257] Pid: 25839, comm: ll_ost01_003 Not tainted 2.6.32-debug #6
[107348.849558] Call Trace:
[107348.849764]  [&amp;lt;ffffffff8128098a&amp;gt;] ? spin_bug+0xaa/0x100
[107348.850006]  [&amp;lt;ffffffff81280ba1&amp;gt;] ? _raw_spin_lock+0x121/0x180
[107348.850290]  [&amp;lt;ffffffff81051f73&amp;gt;] ? __wake_up+0x53/0x70
[107348.850588]  [&amp;lt;ffffffff814fafde&amp;gt;] ? _spin_lock+0xe/0x10
[107348.850872]  [&amp;lt;ffffffffa1330d6c&amp;gt;] ? target_queue_recovery_request+0x41c/0xc50 [ptlrpc]
[107348.851438]  [&amp;lt;ffffffff814faf3e&amp;gt;] ? _spin_unlock+0xe/0x10
[107348.851617] Kernel panic - not syncing: LBUG
[107348.851621] Pid: 26224, comm: tgt_recov Not tainted 2.6.32-debug #6
[107348.851622] Call Trace:
[107348.851629]  [&amp;lt;ffffffff814f75e4&amp;gt;] ? panic+0xa0/0x168
[107348.851642]  [&amp;lt;ffffffffa0ae8f7b&amp;gt;] ? lbug_with_loc+0x9b/0xb0 [libcfs]
[107348.851673]  [&amp;lt;ffffffffa1331b36&amp;gt;] ? check_for_next_transno+0x596/0x5a0 [ptlrpc]
[107348.851701]  [&amp;lt;ffffffffa13315a0&amp;gt;] ? check_for_next_transno+0x0/0x5a0 [ptlrpc]
[107348.851728]  [&amp;lt;ffffffffa132bb46&amp;gt;] ? target_recovery_overseer+0x66/0x230 [ptlrpc]
[107348.851755]  [&amp;lt;ffffffffa1329fb0&amp;gt;] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]
[107348.851768]  [&amp;lt;ffffffffa0af9641&amp;gt;] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[107348.851796]  [&amp;lt;ffffffffa1332ae3&amp;gt;] ? target_recovery_thread+0x683/0x1660 [ptlrpc]
[107348.851799]  [&amp;lt;ffffffff814faef5&amp;gt;] ? _spin_unlock_irq+0x15/0x20
[107348.851826]  [&amp;lt;ffffffffa1332460&amp;gt;] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
[107348.851829]  [&amp;lt;ffffffff8100c14a&amp;gt;] ? child_rip+0xa/0x20
[107348.851856]  [&amp;lt;ffffffffa1332460&amp;gt;] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
[107348.851883]  [&amp;lt;ffffffffa1332460&amp;gt;] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
[107348.851886]  [&amp;lt;ffffffff8100c140&amp;gt;] ? child_rip+0x0/0x20
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The crashdump and modules are in /exports/crashdumps/192.168.10.219-2012-12-30-10\:28\:47&lt;/p&gt;</comment>
                            <comment id="50281" author="tappro" created="Thu, 10 Jan 2013 14:39:30 +0000"  >&lt;p&gt;The code below is not safe:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;        } &lt;span class=&quot;code-keyword&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (cfs_atomic_read(&amp;amp;obd-&amp;gt;obd_req_replay_clients) == 0) {
                CDEBUG(D_HA, &lt;span class=&quot;code-quote&quot;&gt;&quot;waking &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; completed recovery\n&quot;&lt;/span&gt;);
                wake_up = 1;
        } &lt;span class=&quot;code-keyword&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (req_transno == next_transno) {
                CDEBUG(D_HA, &lt;span class=&quot;code-quote&quot;&gt;&quot;waking &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; next (&quot;&lt;/span&gt;LPD64&lt;span class=&quot;code-quote&quot;&gt;&quot;)\n&quot;&lt;/span&gt;, next_transno);
                wake_up = 1;
        } &lt;span class=&quot;code-keyword&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (queue_len == cfs_atomic_read(&amp;amp;obd-&amp;gt;obd_req_replay_clients)) {
                &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; d_lvl = D_HA;
                &lt;span class=&quot;code-comment&quot;&gt;/** handle gaps occured due to lost reply or VBR */&lt;/span&gt;
                LASSERTF(req_transno &amp;gt;= next_transno,
                         &lt;span class=&quot;code-quote&quot;&gt;&quot;req_transno: &quot;&lt;/span&gt;LPU64&lt;span class=&quot;code-quote&quot;&gt;&quot;, next_transno: &quot;&lt;/span&gt;LPU64&lt;span class=&quot;code-quote&quot;&gt;&quot;\n&quot;&lt;/span&gt;,
                         req_transno, next_transno);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The obd_req_replay_clients counter is checked for zero, and the subsequent check against queue_len assumes it is still non-zero. That assumption held in the past, when all counters were modified under a single lock, but they have since become atomic values and many code paths went lockless as part of the SMP improvements work, so we can no longer rely on the counter not changing between the two reads. The recent reports in this ticket show exactly that case: queue_len is 0 and the counter drops to 0 precisely at the second check, so we enter a path that was never meant to run with an empty queue.&lt;/p&gt;

&lt;p&gt;The fix is to check queue_len &amp;gt; 0 in addition to the existing check, and to move the zero check of obd_req_replay_clients after the comparison with queue_len.&lt;/p&gt;
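A minimal Python sketch of the race described above (the function names, the read_clients callback, and the return labels are all hypothetical illustrations; this models the check ordering, it is not the Lustre code):

```python
def unsafe_check(req_transno, next_transno, queue_len, read_clients):
    # Models the original ordering: read_clients() is called twice,
    # so another thread can drop the counter to zero between reads.
    if read_clients() == 0:
        return "completed"
    if req_transno == next_transno:
        return "next"
    if queue_len == read_clients():
        # With queue_len == 0 and a racing decrement we reach this
        # branch, and the LASSERT-like check fires.
        assert req_transno >= next_transno
        return "gap"
    return "wait"

def safe_check(req_transno, next_transno, queue_len, read_clients):
    # Models the proposed fix: take a single snapshot of the counter,
    # require queue_len to be positive in the gap branch, and test
    # for zero clients only after the queue_len comparison.
    clients = read_clients()
    if req_transno == next_transno:
        return "next"
    if queue_len > 0 and queue_len == clients:
        assert req_transno >= next_transno
        return "gap"
    if clients == 0:
        return "completed"
    return "wait"

# Simulate the race: the first read sees one replay client, the
# second read sees zero (the client finished in between).
reads = iter([1, 0])
try:
    unsafe_check(0, 51539608157, 0, lambda: next(reads))
    hit_assertion = False
except AssertionError:
    hit_assertion = True
print(hit_assertion)                               # True: the LBUG path
print(safe_check(0, 51539608157, 0, lambda: 0))    # completed
```

Taking one snapshot of the atomic counter makes all three branches consistent with each other, which is the ordering the comment above argues for.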
</comment>
                            <comment id="50327" author="tappro" created="Fri, 11 Jan 2013 02:23:19 +0000"  >&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/#change,4998&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,4998&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="50829" author="tappro" created="Fri, 18 Jan 2013 14:20:52 +0000"  >&lt;p&gt;Landed.&lt;/p&gt;</comment>
        </comments>
        <attachments>
        </attachments>
        <subtasks>
        </subtasks>
        <customfields>
            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                <customfieldname>Development</customfieldname>
                <customfieldvalues>
                </customfieldvalues>
            </customfield>
            <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                <customfieldname>Rank</customfieldname>
                <customfieldvalues>
                    <customfieldvalue>1|hzvcxj:</customfieldvalue>
                </customfieldvalues>
            </customfield>
            <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                <customfieldname>Rank (Obsolete)</customfieldname>
                <customfieldvalues>
                    <customfieldvalue>5688</customfieldvalue>
                </customfieldvalues>
            </customfield>
            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                <customfieldname>Severity</customfieldname>
                <customfieldvalues>
                    <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>
                </customfieldvalues>
            </customfield>
        </customfields>
    </item>
</channel>
</rss>