<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:13:32 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-7975] &quot;(lod_object.c:700:lod_ah_init()) ASSERTION( lc-&gt;ldo_stripenr == 0 )&quot; LBUG/Assert on MDS</title>
                <link>https://jira.whamcloud.com/browse/LU-7975</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;A site has encountered multiple crashes with the same signature (stack and messages), shown below:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LustreError: 89879:0:(osp_precreate.c:1222:osp_object_truncate()) can&apos;t punch object: -11
Lustre: composit-OST0009-osc-MDT0000: Connection to composit-OST0009 (at 10.0.14.31@o2ib) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 89879:0:(lod_object.c:700:lod_ah_init()) ASSERTION( lc-&amp;gt;ldo_stripenr == 0 ) failed: 
LustreError: 89879:0:(lod_object.c:700:lod_ah_init()) LBUG
Pid: 89879, comm: mdt01_006

Call Trace:
 [&amp;lt;ffffffffa057e895&amp;gt;] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [&amp;lt;ffffffffa057ee97&amp;gt;] lbug_with_loc+0x47/0xb0 [libcfs]
 [&amp;lt;ffffffffa266c0af&amp;gt;] lod_ah_init+0x58f/0x5d0 [lod]
 [&amp;lt;ffffffffa26c7ad3&amp;gt;] mdd_object_make_hint+0x83/0xa0 [mdd]
 [&amp;lt;ffffffffa26d4502&amp;gt;] mdd_create_data+0x332/0x7d0 [mdd]
 [&amp;lt;ffffffffa25a93f0&amp;gt;] mdt_finish_open+0x1350/0x19a0 [mdt]
 [&amp;lt;ffffffffa257e5f4&amp;gt;] ? mdt_object_lock+0x14/0x20 [mdt]
 [&amp;lt;ffffffffa25a9fbd&amp;gt;] mdt_open_by_fid_lock+0x57d/0x910 [mdt]
 [&amp;lt;ffffffffa25aabac&amp;gt;] mdt_reint_open+0x56c/0x21a0 [mdt]
 [&amp;lt;ffffffffa059b14c&amp;gt;] ? upcall_cache_get_entry+0x29c/0x890 [libcfs]
 [&amp;lt;ffffffffa0983930&amp;gt;] ? lu_ucred+0x20/0x30 [obdclass]
 [&amp;lt;ffffffffa2572945&amp;gt;] ? mdt_ucred+0x15/0x20 [mdt]
 [&amp;lt;ffffffffa258f8ec&amp;gt;] ? mdt_root_squash+0x2c/0x410 [mdt]
 [&amp;lt;ffffffffa123bad6&amp;gt;] ? __req_capsule_get+0x166/0x710 [ptlrpc]
 [&amp;lt;ffffffffa2593ab1&amp;gt;] mdt_reint_rec+0x41/0xe0 [mdt]
 [&amp;lt;ffffffffa2578f83&amp;gt;] mdt_reint_internal+0x4c3/0x780 [mdt]
 [&amp;lt;ffffffffa257950e&amp;gt;] mdt_intent_reint+0x1ee/0x520 [mdt]
 [&amp;lt;ffffffffa2576cee&amp;gt;] mdt_intent_policy+0x3ae/0x770 [mdt]
 [&amp;lt;ffffffffa11ca2f5&amp;gt;] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
 [&amp;lt;ffffffffa11f43fb&amp;gt;] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
 [&amp;lt;ffffffffa25771b6&amp;gt;] mdt_enqueue+0x46/0xe0 [mdt]
 [&amp;lt;ffffffffa257c84a&amp;gt;] mdt_handle_common+0x52a/0x1470 [mdt]
 [&amp;lt;ffffffffa25b98f5&amp;gt;] mds_regular_handle+0x15/0x20 [mdt]
 [&amp;lt;ffffffffa12238d5&amp;gt;] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
 [&amp;lt;ffffffffa05904fa&amp;gt;] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
 [&amp;lt;ffffffffa121c289&amp;gt;] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
 [&amp;lt;ffffffff81057849&amp;gt;] ? __wake_up_common+0x59/0x90
 [&amp;lt;ffffffffa122605d&amp;gt;] ptlrpc_main+0xaed/0x1780 [ptlrpc]
 [&amp;lt;ffffffffa1225570&amp;gt;] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
 [&amp;lt;ffffffff8109e78e&amp;gt;] kthread+0x9e/0xc0
 [&amp;lt;ffffffff8100c28a&amp;gt;] child_rip+0xa/0x20
 [&amp;lt;ffffffff8109e6f0&amp;gt;] ? kthread+0x0/0xc0
 [&amp;lt;ffffffff8100c280&amp;gt;] ? child_rip+0x0/0x20
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Looking at existing tickets, I found that this kind of problem has already been (partially?) addressed in LU&amp;#45;4260, LU&amp;#45;4791 and LU&amp;#45;5346.&lt;br/&gt;
Since the fixes for both LU&amp;#45;4260 and LU&amp;#45;4791 are already integrated, we are hitting a new situation during OST object pre-creation. It is likely caused by a specific file metadata pattern, which I have identified as use of the &quot;deferred layout&quot; feature, i.e. open(, ...|O_LOV_DELAY_CREATE|...,) followed by a non-0 truncate() that triggers object preallocation. This leads to a case similar to the one described in LU&amp;#45;5346, on an error return path that is still not fixed.&lt;/p&gt;

&lt;p&gt;BTW, I have also determined that these MDT asserts always occur just after an OSS crash, hence the &amp;#45;EAGAIN/EWOULDBLOCK error in the &quot;(osp_precreate.c:1222:osp_object_truncate()) can&apos;t punch object: -11&quot; message just preceding the assert.&lt;/p&gt;</description>
                <environment></environment>
        <key id="35779">LU-7975</key>
            <summary>&quot;(lod_object.c:700:lod_ah_init()) ASSERTION( lc-&gt;ldo_stripenr == 0 )&quot; LBUG/Assert on MDS</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="bfaccini">Bruno Faccini</reporter>
                        <labels>
                            <label>cea</label>
                    </labels>
                <created>Fri, 1 Apr 2016 15:35:17 +0000</created>
                <updated>Thu, 14 Jun 2018 21:41:17 +0000</updated>
                            <resolved>Tue, 31 May 2016 12:50:05 +0000</resolved>
                                                    <fixVersion>Lustre 2.9.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>10</watches>
                                                                            <comments>
                            <comment id="147589" author="bzzz" created="Fri, 1 Apr 2016 15:42:36 +0000"  >&lt;p&gt;the good thing is that we&apos;ve got a new infrastructure with DNE2 that allows the striping procedures (declaration and creation) to be per-thandle instead of putting intermediate state into an object. another improvement we can make with this infrastructure is to release OSP objects after creation rather than leaving them in the cache consuming memory.&lt;/p&gt;</comment>
                            <comment id="147599" author="bfaccini" created="Fri, 1 Apr 2016 16:28:47 +0000"  >&lt;p&gt;After more analysis of crash dumps and Lustre debug logs, I was able to find a reproducer for the situation seen at the customer site. It first triggers the MDS CERROR message &quot;(osp_precreate.c:1222:osp_object_truncate()) can&apos;t punch object: &amp;#45;11&quot; and, shortly afterwards, the &quot;(lod_object.c:700:lod_ah_init()) ASSERTION( lc-&amp;gt;ldo_stripenr == 0 ) failed&quot; assertion/LBUG():&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# !!! with all OSTs down !!!
mkdir /mnt/lustre/foo_dir/
lfs setstripe -c 2 /mnt/lustre/foo_dir/
# open file with O_LOV_DELAY_CREATE, then non-0 truncate
multiop /mnt/lustre/foo_dir/foo5 oO_RDWR:O_CREAT:O_LOV_DELAY_CREATE:T1050000c
# normal open
cat /etc/hosts &amp;gt; /mnt/lustre/foo_dir/foo5 ---&amp;gt;&amp;gt;&amp;gt; &quot;-bash: /mnt/lustre/foo_dir/foo5: Resource temporarily unavailable&quot; and &quot;(osp_precreate.c:1222:osp_object_truncate()) can&apos;t punch object: -11&quot;
# normal open again
cat /etc/hosts &amp;gt; /mnt/lustre/foo_dir/foo5 ---&amp;gt;&amp;gt;&amp;gt; &quot;LustreError: 11915:0:(lod_object.c:700:lod_ah_init()) ASSERTION( lc-&amp;gt;ldo_stripenr == 0 ) failed:&quot;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Then, digging further into both the full MDS Lustre debug log and the source code, I found the reason for the -EAGAIN return from the osp_object_create() call: the error is not sent over the wire, but is set locally as -EWOULDBLOCK in ptlrpc_import_delay_req():&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;/**
 * Based on the current state of the import, determine if the request
 * can be sent, is an error, or should be delayed.
 *
 * Returns true if this request should be delayed. If false, and
 * *status is set, then the request can not be sent and *status is the
 * error code.  If false and status is 0, then request can be sent.
 *
 * The imp-&amp;gt;imp_lock must be held.
 */
static int ptlrpc_import_delay_req(struct obd_import *imp,
                                   struct ptlrpc_request *req, int *status)
{
        int delay = 0;
        ENTRY;
..............
                 } else if (req-&amp;gt;rq_send_state != imp-&amp;gt;imp_state) {
                /* invalidate in progress - any requests should be drop */
                if (cfs_atomic_read(&amp;amp;imp-&amp;gt;imp_inval_count) != 0) {
                        DEBUG_REQ(D_ERROR, req, &quot;invalidate in flight&quot;);
                        *status = -EIO;
                } else if (imp-&amp;gt;imp_dlm_fake || req-&amp;gt;rq_no_delay) {
                        *status = -EWOULDBLOCK; &amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;
.........................
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This branch is taken when the request has rq_no_delay set, as osp_object_truncate() does:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;/*
 *
 */
int osp_object_truncate(const struct lu_env *env, struct dt_object *dt,
                        __u64 size)
{
...............
        /*
         * XXX: decide how do we do here with resend
         * if we don&apos;t resend, then client may see wrong file size
         * if we do resend, then MDS thread can get stuck for quite long
         */
        req-&amp;gt;rq_no_resend = req-&amp;gt;rq_no_delay = 1;
................
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
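&lt;p&gt;The interaction between the two excerpts above can be modelled with a minimal, self-contained C sketch. It is not the Lustre code: the reduced &quot;struct request&quot;, the &quot;enum imp_state&quot; values and the import_delay_req() helper are hypothetical stand-ins, the imp_dlm_fake condition is left out, and the final &quot;delay&quot; fallback is assumed from the function&apos;s header comment because the excerpt elides it.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;#include &amp;lt;errno.h&amp;gt;
#include &amp;lt;stdbool.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

/* Hypothetical, reduced stand-ins for the import and request state. */
enum imp_state { IMP_FULL, IMP_DISCON };

struct request {
        enum imp_state rq_send_state;   /* state the request expects */
        bool rq_no_delay;               /* fail fast instead of waiting */
};

/*
 * Simplified model of the quoted branch only: returns true when the
 * request should be delayed; otherwise *status holds 0 or an error.
 */
static bool import_delay_req(enum imp_state imp_state, int inval_count,
                             const struct request *req, int *status)
{
        *status = 0;
        if (req-&amp;gt;rq_send_state == imp_state)
                return false;           /* states match: send now */
        if (inval_count != 0)
                *status = -EIO;         /* invalidate in flight */
        else if (req-&amp;gt;rq_no_delay)
                *status = -EWOULDBLOCK; /* no wait for recovery */
        else
                return true;            /* assumed fallback: delay */
        return false;
}

int main(void)
{
        /* osp_object_truncate() sets rq_no_delay = 1 on its request. */
        struct request punch = { IMP_FULL, true };
        int status;
        bool delay = import_delay_req(IMP_DISCON, 0, &amp;amp;punch, &amp;amp;status);

        printf(&quot;delay=%d status=%d\n&quot;, delay, status);
        return 0;
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;On Linux, where EWOULDBLOCK and EAGAIN share the value 11, the reported status is -11, which matches the &quot;can&apos;t punch object: -11&quot; message.&lt;/p&gt;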
&lt;p&gt;This has further implications. I think a resend (or some other retry mechanism) may be unavoidable at the moment if we want to avoid data corruption (the effect of the non-0 truncate() going missing) when this particular scenario occurs. Even though the cleanup of allocated striping state on the error return path must be fixed in the MDS code, the error should not be returned to the user; otherwise further open() calls will fail and the effect of the non-0 truncate() will be lost.&lt;/p&gt;</comment>
                            <comment id="147600" author="bzzz" created="Fri, 1 Apr 2016 16:52:43 +0000"  >&lt;p&gt;as for truncate, there is a ticket to move this functionality to the client side. could be a good time, probably..&lt;/p&gt;</comment>
                            <comment id="147656" author="gerrit" created="Fri, 1 Apr 2016 22:27:22 +0000"  >&lt;p&gt;Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/19301&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/19301&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7975&quot; title=&quot;&amp;quot;(lod_object.c:700:lod_ah_init()) ASSERTION( lc-&amp;gt;ldo_stripenr == 0 )&amp;quot; LBUG/Assert on MDS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7975&quot;&gt;&lt;del&gt;LU-7975&lt;/del&gt;&lt;/a&gt; lod: fix delayed stripe error path &amp;amp; MDS resend&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 43a47da70a47d027f9ac199ed4ff6fdc4fe91614&lt;/p&gt;</comment>
                            <comment id="147657" author="gerrit" created="Fri, 1 Apr 2016 22:46:25 +0000"  >&lt;p&gt;Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/19302&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/19302&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7975&quot; title=&quot;&amp;quot;(lod_object.c:700:lod_ah_init()) ASSERTION( lc-&amp;gt;ldo_stripenr == 0 )&amp;quot; LBUG/Assert on MDS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7975&quot;&gt;&lt;del&gt;LU-7975&lt;/del&gt;&lt;/a&gt; lod: fix delayed stripe error path &amp;amp; Client resend&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: bb8a376656f6b3626a47c32cbc1855045a47d929&lt;/p&gt;</comment>
                            <comment id="147658" author="bfaccini" created="Fri, 1 Apr 2016 22:49:43 +0000"  >&lt;p&gt;The patch at &lt;a href=&quot;http://review.whamcloud.com/19301&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/19301&lt;/a&gt; fixes the cleanup in the delayed stripe error path and also implements a resend mechanism on the MDS side. Doing so may keep an MDS thread busy for some time.&lt;/p&gt;

&lt;p&gt;The patch at &lt;a href=&quot;http://review.whamcloud.com/19302&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/19302&lt;/a&gt; is another way to fix this: it also does the necessary cleanup in the delayed stripe error path, but offloads the resend mechanism to the client side, which may be less intrusive.&lt;/p&gt;</comment>
                            <comment id="153302" author="bfaccini" created="Tue, 24 May 2016 07:39:40 +0000"  >&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/19301&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/19301&lt;/a&gt; has been abandoned in favor of &lt;a href=&quot;http://review.whamcloud.com/19302&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/19302&lt;/a&gt;, following the reviewers&apos; comments and their choice between the two solutions.&lt;/p&gt;</comment>
                            <comment id="154020" author="gerrit" created="Tue, 31 May 2016 04:56:11 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/19302/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/19302/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7975&quot; title=&quot;&amp;quot;(lod_object.c:700:lod_ah_init()) ASSERTION( lc-&amp;gt;ldo_stripenr == 0 )&amp;quot; LBUG/Assert on MDS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7975&quot;&gt;&lt;del&gt;LU-7975&lt;/del&gt;&lt;/a&gt; lod: fix delayed stripe error path &amp;amp; Client resend&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 047dfe489966c8816cbead1a3abbbb1564fdb7db&lt;/p&gt;</comment>
                            <comment id="154063" author="pjones" created="Tue, 31 May 2016 12:50:05 +0000"  >&lt;p&gt;Landed for 2.9&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                                        </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="24438">LU-4971</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzy6n3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>