<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:22:11 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-2079] Error reading changelog_users file preventing successful changelog setup during init</title>
                <link>https://jira.whamcloud.com/browse/LU-2079</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I&apos;m having issues mounting our 2.3.51-2chaos based FS after rebooting the clients. I see the following messages on the console:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LustreError: 5872:0:(client.c:1116:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff881009598400 x1414745503563778/t0(0) o101-&amp;gt;MGC172.20.5.2@o2ib500@172.20.5.2@o2ib500:26/25 lens 328/384 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
ib0: no IPv6 routers present
LustreError: 5900:0:(client.c:1116:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880810225800 x1414745503563780/t0(0) o101-&amp;gt;MGC172.20.5.2@o2ib500@172.20.5.2@o2ib500:26/25 lens 328/384 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
LustreError: Skipped 1 previous similar message
LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
LustreError: Skipped 2 previous similar messages
LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
LustreError: Skipped 5 previous similar messages
LustreError: 5954:0:(lmv_obd.c:1190:lmv_statfs()) can&apos;t stat MDS #0 (lstest-MDT0000-mdc-ffff88080ba12000), error -11
LustreError: 4827:0:(lov_obd.c:937:lov_cleanup()) lov tgt 385 not cleaned! deathrow=0, lovrc=1
LustreError: 4827:0:(lov_obd.c:937:lov_cleanup()) lov tgt 386 not cleaned! deathrow=1, lovrc=1
Lustre: Unmounted lstest-client
LustreError: 5954:0:(obd_mount.c:2332:lustre_fill_super()) Unable to mount  (-11)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I haven&apos;t had time to look into the cause, but thought it might be useful to open an issue about it.&lt;/p&gt;</description>
                <environment></environment>
        <key id="16221">LU-2079</key>
            <summary>Error reading changelog_users file preventing successful changelog setup during init</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="prakash">Prakash Surya</assignee>
                                    <reporter username="prakash">Prakash Surya</reporter>
                        <labels>
                            <label>topsequoia</label>
                    </labels>
                <created>Tue, 2 Oct 2012 15:48:15 +0000</created>
                <updated>Thu, 29 Nov 2012 21:02:12 +0000</updated>
                            <resolved>Thu, 29 Nov 2012 21:02:12 +0000</resolved>
                                    <version>Lustre 2.4.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="45896" author="pjones" created="Tue, 2 Oct 2012 16:19:11 +0000"  >&lt;p&gt;Alex&lt;/p&gt;

&lt;p&gt;Who should look into this one?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="45897" author="bzzz" created="Tue, 2 Oct 2012 16:21:59 +0000"  >&lt;p&gt;me&lt;/p&gt;</comment>
                            <comment id="45937" author="prakash" created="Wed, 3 Oct 2012 13:40:49 +0000"  >&lt;p&gt;I see many of the following messages on the MDS:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.4.103@o2ib500
Lustre: Skipped 6094 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So it looks like it&apos;s failing this check in target_handle_connect:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;        if (target-&amp;gt;obd_no_conn) {
                cfs_spin_unlock(&amp;amp;target-&amp;gt;obd_dev_lock);

                LCONSOLE_WARN(&quot;%s: Temporarily refusing client connection &quot;
                              &quot;from %s\n&quot;, target-&amp;gt;obd_name,
                              libcfs_nid2str(req-&amp;gt;rq_peer.nid));
                GOTO(out, rc = -EAGAIN);
        }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="45940" author="prakash" created="Wed, 3 Oct 2012 14:03:24 +0000"  >&lt;p&gt;Digging through all the noise on the console, I think this might be the cause of the issue:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LustreError: 32893:0:(llog_osd.c:227:llog_osd_read_header()) lstest-MDT0000-osd: error reading log header from [0x1:0x3:0x0]: rc = -14
LustreError: 32893:0:(mdd_device.c:411:mdd_changelog_init()) lstest-MDD0000: changelog setup during init failed: rc = -14
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;My theory is: mdd_changelog_init fails -&amp;gt; mdd_prepare fails -&amp;gt; mdt_prepare fails -&amp;gt; obd_no_conn never gets cleared in mdt_prepare, so the MDT keeps refusing client connections.&lt;/p&gt;</comment>
                            <comment id="45941" author="bzzz" created="Wed, 3 Oct 2012 14:10:08 +0000"  >&lt;p&gt;the theory is correct.&lt;/p&gt;</comment>
                            <comment id="45942" author="bzzz" created="Wed, 3 Oct 2012 14:15:14 +0000"  >&lt;p&gt;Prakash, would you mind resetting rc to 0 in mdd_changelog_init() and setting mdd-&amp;gt;mdd_cl.mc_flags = 0; for now? I&apos;m still looking into this.&lt;/p&gt;</comment>
                            <comment id="45944" author="prakash" created="Wed, 3 Oct 2012 14:21:40 +0000"  >&lt;p&gt;So.. Is this what you&apos;re thinking:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
diff --git i/lustre/mdd/mdd_device.c w/lustre/mdd/mdd_device.c
index af403b0..7b71958 100644
--- i/lustre/mdd/mdd_device.c
+++ w/lustre/mdd/mdd_device.c
@@ -412,6 +412,11 @@ static int mdd_changelog_init(const struct lu_env *env, struct mdd_device *mdd)
                mdd-&amp;gt;mdd_cl.mc_flags |= CLM_ERR;
        }
 
+       if (rc == -EFAULT) { /* LU-2079 */
+               rc = 0;
+               mdd-&amp;gt;mdd_cl.mc_flags = 0;
+       }
+
        return rc;
 }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="45946" author="bzzz" created="Wed, 3 Oct 2012 14:24:09 +0000"  >&lt;p&gt;yes.. for some reason the record llog gets from dmu is shorter than expected.. trying to reproduce locally.&lt;/p&gt;

&lt;p&gt;are you using changelogs ?&lt;/p&gt;</comment>
                            <comment id="45953" author="prakash" created="Wed, 3 Oct 2012 15:39:12 +0000"  >&lt;blockquote&gt;
&lt;p&gt;are you using changelogs ?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;No, not on this filesystem.&lt;/p&gt;</comment>
                            <comment id="45954" author="prakash" created="Wed, 3 Oct 2012 15:53:35 +0000"  >&lt;p&gt;The clients are now able to connect after applying the following patch and rebooting the MDS: &lt;a href=&quot;http://review.whamcloud.com/4169&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/4169&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="45955" author="bzzz" created="Wed, 3 Oct 2012 15:58:43 +0000"  >&lt;p&gt;thanks... I was trying to reproduce this mounting orion&apos;s filesystem with master&apos;s code, but it worked fine... need to think more.&lt;br/&gt;
sorry for all this.&lt;/p&gt;</comment>
                            <comment id="45967" author="prakash" created="Wed, 3 Oct 2012 20:42:31 +0000"  >&lt;p&gt;I mentioned this issue to Brian, and he said that EFAULT is often returned if a buffer that is too small is passed in. I haven&apos;t verified if that is happening, but it&apos;s something to keep in mind. Even if that is the case, it doesn&apos;t answer &lt;b&gt;why&lt;/b&gt; that happened.&lt;/p&gt;</comment>
                            <comment id="45971" author="bzzz" created="Thu, 4 Oct 2012 00:16:11 +0000"  >&lt;p&gt;yes, EFAULT is returned when read is short.. can you try to mount with zfs directly (or find by zdb) and dump changelog_catalog file and attach it here, please?&lt;/p&gt;</comment>
                            <comment id="46075" author="morrone" created="Fri, 5 Oct 2012 18:41:22 +0000"  >&lt;p&gt;Alex, changelog_catalog is 0 length.&lt;/p&gt;</comment>
                            <comment id="46120" author="bzzz" created="Mon, 8 Oct 2012 01:26:29 +0000"  >&lt;p&gt;Chris, to be honest I don&apos;t know how it could have become corrupted (it is supposed to be at least 8K from the very beginning).&lt;br/&gt;
can you remove the file manually (mount -t zfs ..) ?&lt;/p&gt;</comment>
                            <comment id="46203" author="morrone" created="Mon, 8 Oct 2012 16:14:34 +0000"  >&lt;p&gt;Since we don&apos;t know the cause, and it clearly can happen, I think we&apos;ll need a code change to handle this situation.&lt;/p&gt;</comment>
                            <comment id="46204" author="bzzz" created="Mon, 8 Oct 2012 16:16:16 +0000"  >&lt;p&gt;no objections&lt;/p&gt;</comment>
                            <comment id="46855" author="bzzz" created="Tue, 23 Oct 2012 03:33:53 +0000"  >&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/4376&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/4376&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;this is a debug patch; we&apos;d like to see the attributes of the object. the idea is that for some reason (literally a bug in the past) the object was created with the wrong type, directory, and now this wrong type leads to a wrong size calculation.&lt;/p&gt;

&lt;p&gt;please, try to mount with the patch and attach kernel messages from MDS.&lt;/p&gt;</comment>
                            <comment id="46913" author="prakash" created="Thu, 25 Oct 2012 13:04:28 +0000"  >&lt;p&gt;Alex, Thanks for the explanation. The patch is applied to our branch and I&apos;ll get it installed later today.&lt;/p&gt;</comment>
                            <comment id="46918" author="prakash" created="Thu, 25 Oct 2012 14:11:26 +0000"  >&lt;p&gt;Here you go:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2012-10-25 11:08:00 LustreError: 32687:0:(llog_osd.c:227:llog_osd_read_header()) lstest-MDT0000-osd: error reading log header from [0x1:0x3:0x0]: rc = -14
2012-10-25 11:08:00 LustreError: 32687:0:(llog_osd.c:230:llog_osd_read_header()) attrs: valid 17ff, mode 100644, size 24, block 257
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="47440" author="bzzz" created="Tue, 6 Nov 2012 09:46:43 +0000"  >&lt;p&gt;thanks... the theory was wrong, sorry &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/sad.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt; could you dump that file and attach it here, please? anything like hexdump is good enough. I&apos;m still scratching my head over what happened to the file. at the moment another theory is that the header was only partially written for some reason.&lt;/p&gt;</comment>
                            <comment id="47463" author="morrone" created="Tue, 6 Nov 2012 14:21:57 +0000"  >&lt;p&gt;How are you mapping &lt;span class=&quot;error&quot;&gt;&amp;#91;0x1:0x3:0x0&amp;#93;&lt;/span&gt; to changelog_catalog??  This is about as clear as mud in the code.  I think I see that the first 0x1 identifies the file as an LLOG, but I don&apos;t see the next level of mapping.&lt;/p&gt;

&lt;p&gt;And the error message that Prakash shared says &quot;size 24&quot;.  Is that the size of the file?  Because there are files that are size 24, but changelog_catalog is not one of them.  As I said back on Oct 5, the changelog_catalog appears to be empty when I mount the filesystem through the posix layer.&lt;/p&gt;</comment>
                            <comment id="47467" author="bzzz" created="Tue, 6 Nov 2012 14:38:33 +0000"  >&lt;p&gt;in the directory entry we store dnode (to maintain compatibility with zfs) and fid (which seems to be &lt;span class=&quot;error&quot;&gt;&amp;#91;0x1:0x3:0x0&amp;#93;&lt;/span&gt;), then we look up the fid in OI.&lt;/p&gt;

&lt;p&gt;as for the reverse - mdd_changelog_init() uses &quot;changelog_catalog&quot; to name the object.&lt;/p&gt;</comment>
                            <comment id="47470" author="bzzz" created="Tue, 6 Nov 2012 15:09:09 +0000"  >&lt;p&gt;it seems a few fids in sequence 1 (llog) were re-used (due to step-by-step landing and changes during inspections) and now changelog_catalog shares the fid with seq_ctl or seq_srv (iirc, there was a problem with duplicated re-used sequences which is a sign of wrong seq_{ctl|srv}). seq_{srv|ctl} are 24 bytes ...&lt;/p&gt;</comment>
                            <comment id="47514" author="bzzz" created="Wed, 7 Nov 2012 02:33:09 +0000"  >&lt;p&gt;I&apos;m developing a patch to verify dnode/fid in direntry against dnode/fid in OI.&lt;/p&gt;</comment>
                            <comment id="47568" author="bzzz" created="Thu, 8 Nov 2012 03:07:18 +0000"  >&lt;p&gt;Guys, could you try with &lt;a href=&quot;http://review.whamcloud.com/#change,4169&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,4169&lt;/a&gt; ? this patch replaces previous one. except for the line to ignore errors in mdd_changelog_init() the patch should be OK to land on master branch, I think. if the last theory is confirmed with the patch, then I&apos;ll develop one-time fix.&lt;/p&gt;</comment>
                            <comment id="47613" author="morrone" created="Thu, 8 Nov 2012 19:21:18 +0000"  >&lt;p&gt;Yes, I&apos;ll swap in that change.&lt;/p&gt;</comment>
                            <comment id="47657" author="morrone" created="Fri, 9 Nov 2012 18:58:04 +0000"  >&lt;p&gt;Alex,&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2012-11-09 14:37:50 LustreError: 32846:0:(llog_osd.c:227:llog_osd_read_header()) lstest-MDT0000-osd: error reading log header from [0x1:0x3:0x0]: rc = -14
2012-11-09 14:37:50 LustreError: 32846:0:(llog_osd.c:230:llog_osd_read_header()) attrs: valid 17ff, mode 100644, size 24, block 257
2012-11-09 14:37:50 LustreError: 32846:0:(mdd_device.c:410:mdd_changelog_init()) lstest-MDD0000: changelog setup during init failed: rc = -14
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I don&apos;t see any of the new error messages.&lt;/p&gt;</comment>
                            <comment id="47745" author="prakash" created="Tue, 13 Nov 2012 15:57:36 +0000"  >&lt;p&gt;Alex: Chris and I have speculated that the file in question is actually the &lt;tt&gt;CHANGELOG_USERS&lt;/tt&gt; file and not the &lt;tt&gt;CHANGELOG_CATALOG&lt;/tt&gt; file. I&apos;ve been looking into the issue some more this morning, and have more evidence this is the case.&lt;/p&gt;

&lt;p&gt;Using systemtap, I can see that it&apos;s the second call to llog_cat_init_and_process from within mdd_prepare that is failing. So what I think is happening is this call:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;                                                                      
        rc = llog_open_create(env, ctxt, &amp;amp;ctxt-&amp;gt;loc_handle, NULL,
                              CHANGELOG_CATALOG);
        if (rc)
                GOTO(out_cleanup, rc);

        ctxt-&amp;gt;loc_handle-&amp;gt;lgh_logops-&amp;gt;lop_add = llog_cat_add_rec;
        ctxt-&amp;gt;loc_handle-&amp;gt;lgh_logops-&amp;gt;lop_declare_add =
                                        llog_cat_declare_add_rec;

        rc = llog_cat_init_and_process(env, ctxt-&amp;gt;loc_handle);
        if (rc)
                GOTO(out_close, rc);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;                                                                      

&lt;p&gt;is succeeding, as can be seen by the systemtap output:                          &lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;                                                                      
     0 mount.lustre(51462):-&amp;gt;llog_open_create env=0xffff880f5d263bb8 ctxt=0xffff880f80491b00 res=0xffff880f80491b40 logid=0x0 name=0xffffffffa0ee3292
...                                                                             
   548 mount.lustre(51462):&amp;lt;-llog_open_create return=0x0                        
     0 mount.lustre(51462):-&amp;gt;llog_cat_init_and_process env=0xffff880f5d263bb8 llh=0xffff880f7dc843c0
...                                                                             
   367 mount.lustre(51462):&amp;lt;-llog_cat_init_and_process return=0x0               
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;                                                                      

&lt;p&gt;But then later, this is failing:                                                &lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;                                                                      
        rc = llog_open_create(env, uctxt, &amp;amp;uctxt-&amp;gt;loc_handle, NULL,
                              CHANGELOG_USERS);
        if (rc)
                GOTO(out_ucleanup, rc);

        uctxt-&amp;gt;loc_handle-&amp;gt;lgh_logops-&amp;gt;lop_add = llog_cat_add_rec;
        uctxt-&amp;gt;loc_handle-&amp;gt;lgh_logops-&amp;gt;lop_declare_add = llog_cat_declare_add_rec;

        rc = llog_cat_init_and_process(env, uctxt-&amp;gt;loc_handle);
        if (rc)
                GOTO(out_uclose, rc);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;                                                                      

&lt;p&gt;as you can see here:                                                            &lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;                                                                      
     0 mount.lustre(51462):-&amp;gt;llog_open_create env=0xffff880f5d263bb8 ctxt=0xffff880facc13740 res=0xffff880facc13780 logid=0x0 name=0xffffffffa0ee32ba
...                                                                             
   498 mount.lustre(51462):&amp;lt;-llog_open_create return=0x0                        
     0 mount.lustre(51462):-&amp;gt;llog_cat_init_and_process env=0xffff880f5d263bb8 llh=0xffff880e283cb900
...                                                                             
 25647 mount.lustre(51462):&amp;lt;-llog_cat_init_and_process return=0xfffffffffffffff2
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;                                                                      

&lt;p&gt;I &lt;em&gt;think&lt;/em&gt; the &lt;tt&gt;dmu_read&lt;/tt&gt; call is successful, but since the &lt;tt&gt;CHANGELOG_USERS&lt;/tt&gt; file is only 24 bytes in length, &lt;tt&gt;dt_read&lt;/tt&gt; reports an error (osd_read returns 24 which != &lt;tt&gt;LLOG_CHUNK_SIZE&lt;/tt&gt;).&lt;/p&gt;

&lt;p&gt;So, why is the &lt;tt&gt;CHANGELOG_USERS&lt;/tt&gt; file 24 bytes in length when &lt;tt&gt;dt_read&lt;/tt&gt; is expecting it to be &lt;tt&gt;LLOG_CHUNK_SIZE&lt;/tt&gt; bytes?&lt;/p&gt;

&lt;p&gt;Here is the hexdump of the &lt;tt&gt;CHANGELOG_USERS&lt;/tt&gt; file, in case it&apos;s useful:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;                                                                      
# grove-mds2 /mnt/grove-mds2/mdt0 &amp;gt; hexdump changelog_catalog                   
# grove-mds2 /mnt/grove-mds2/mdt0 &amp;gt; hexdump changelog_users                     
0000000 0bd0 0000 0002 0000 ffff ffff ffff ffff                                 
0000010 0000 0000 0000 0000                                                     
0000018                                                                         
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;     </comment>
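<!--
A minimal sketch of the short-read condition discussed in the comment above, assuming a header read that returns anything other than a full chunk is reported as -EFAULT (as stated earlier in this thread). The names check_header_read and stap_return_to_errno are hypothetical helpers, not Lustre functions, and LLOG_CHUNK_SIZE is assumed to be 8192 based on the 8K header size mentioned in this thread. The second helper shows how the 64-bit return value printed by systemtap maps back to a negative errno.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed value: the thread describes the llog header as 8K. */
#define LLOG_CHUNK_SIZE 8192

/* Hypothetical stand-in for the header read check: anything other than
 * a full chunk is treated as a short read and reported as -EFAULT. */
static int check_header_read(long bytes_read)
{
	if (bytes_read < 0)
		return (int)bytes_read;	/* pass a real I/O error through */
	if (bytes_read != LLOG_CHUNK_SIZE)
		return -EFAULT;		/* short read, e.g. a 24-byte file */
	return 0;
}

/* Reinterpret the unsigned 64-bit value shown by systemtap as a signed
 * return code (relies on two's complement, as on Linux/gcc). */
static long stap_return_to_errno(uint64_t raw)
{
	return (long)(int64_t)raw;
}
```

With these assumptions, a 24-byte read yields -EFAULT, and the systemtap value 0xfffffffffffffff2 corresponds to -14, matching the "rc = -14" console messages.
-->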
                            <comment id="47751" author="prakash" created="Tue, 13 Nov 2012 16:18:16 +0000"  >&lt;p&gt;Here&apos;s the full systemtap log I gathered by running these on the MDS:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;stap -DSTP_NO_OVERLOAD /usr/share/doc/systemtap-1.6/examples/general/para-callgraph.stp &apos;module(&quot;obdclass&quot;).function(&quot;*&quot;)&apos; &apos;module(&quot;mdd&quot;).function(&quot;mdd_prepare&quot;)&apos;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And this in another shell:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;/etc/init.d/lustre start
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="47753" author="prakash" created="Tue, 13 Nov 2012 16:51:14 +0000"  >&lt;p&gt;And just to be completely sure, here&apos;s the string passed to &lt;tt&gt;llog_open_create&lt;/tt&gt; just prior to the failed &lt;tt&gt;llog_cat_init_and_process&lt;/tt&gt; call:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; p (char *)0xffffffffa0ee32ba
$3 = 0xffffffffa0ee32ba &quot;changelog_users&quot;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="47758" author="prakash" created="Tue, 13 Nov 2012 20:05:35 +0000"  >&lt;p&gt;Buried in all the console noise, I managed to find this message:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2012-11-13 14:05:03 LustreError: 53596:0:(osd_object.c:410:osd_object_init()) lstest-MDT0000: can&apos;t get LMA on [0x200000bd0:0x4f:0x0]: rc = -2
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="47781" author="bzzz" created="Wed, 14 Nov 2012 05:22:07 +0000"  >&lt;p&gt;thanks for the help guys. unfortunately it doesn&apos;t help very much that it&apos;s not changelog, but changelog_users - still the file is corrupted and I can&apos;t prove the root cause, sorry.&lt;/p&gt;

&lt;p&gt;this is what the first bytes of a llog should look like:&lt;/p&gt;

&lt;p&gt;0000000 2000 0000 0000 0000 5539 1064 0000 0000&lt;br/&gt;
0000010 9f4d 50a0 0000 0000 0013 0000 0058 0000&lt;br/&gt;
0000020 0000 0000 0004 0000 0000 0000 6f63 666e&lt;/p&gt;

&lt;p&gt;struct llog_rec_hdr {&lt;br/&gt;
	__u32	lrh_len;&lt;br/&gt;
	__u32	lrh_index;&lt;br/&gt;
	__u32	lrh_type;&lt;br/&gt;
	__u32	lrh_id;&lt;br/&gt;
};&lt;/p&gt;

&lt;p&gt;notice lrh_len=0x2000 (the first 8K header of any llog)&lt;br/&gt;
lrh_type=0x10645539 (LLOG_HDR_MAGIC = LLOG_OP_MAGIC | 0x45539)&lt;/p&gt;

&lt;p&gt;the good news is that the filesystem itself seems to be consistent (given no bad messages from the latest patch).&lt;br/&gt;
at least OI is not broken, there are no duplicate fids, etc.&lt;/p&gt;

&lt;p&gt;so given this I&apos;d suggest removing changelog_users manually (or I can make a patch to do so at mount time).&lt;/p&gt;
</comment>
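<!--
A small sketch of the header check implied by the comment above: decode the first 16 bytes of a file as a little-endian llog_rec_hdr and compare lrh_len and lrh_type with the values quoted there. The struct layout and the magic relation come from that comment; LLOG_OP_MAGIC = 0x10600000 is inferred from lrh_type = 0x10645539, and the helper names (le32_at, decode_rec_hdr, looks_like_llog_header) are hypothetical, not part of Lustre.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Inferred from the thread: 0x10645539 = LLOG_OP_MAGIC | 0x45539. */
#define LLOG_OP_MAGIC  0x10600000u
#define LLOG_HDR_MAGIC (LLOG_OP_MAGIC | 0x45539u)
#define LLOG_HDR_LEN   0x2000u	/* 8K header, per the thread */

struct llog_rec_hdr {
	uint32_t lrh_len;
	uint32_t lrh_index;
	uint32_t lrh_type;
	uint32_t lrh_id;
};

/* Read a little-endian 32-bit value independent of host byte order. */
static uint32_t le32_at(const unsigned char *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the four header fields from the first 16 bytes of buf. */
static struct llog_rec_hdr decode_rec_hdr(const unsigned char *buf, size_t len)
{
	struct llog_rec_hdr h;

	assert(len >= 16);
	h.lrh_len = le32_at(buf);
	h.lrh_index = le32_at(buf + 4);
	h.lrh_type = le32_at(buf + 8);
	h.lrh_id = le32_at(buf + 12);
	return h;
}

/* Return 1 if the buffer starts with what the thread describes as a
 * valid llog header, 0 otherwise. */
static int looks_like_llog_header(const unsigned char *buf, size_t len)
{
	struct llog_rec_hdr h;

	if (len < 16)
		return 0;
	h = decode_rec_hdr(buf, len);
	return h.lrh_len == LLOG_HDR_LEN && h.lrh_type == LLOG_HDR_MAGIC;
}
```

Fed the first 16 bytes of the good dump above, the check passes; fed the first 16 bytes of the 24-byte changelog_users dump, it fails because lrh_len decodes to 0xbd0 and lrh_type to 0xffffffff.
-->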
                            <comment id="48015" author="prakash" created="Mon, 19 Nov 2012 16:01:11 +0000"  >&lt;p&gt;Alex, what is considered a &quot;bad&quot; message? I see some of the &quot;can&apos;t get LMA&quot; messages, are those &quot;bad&quot;?&lt;/p&gt;</comment>
                            <comment id="48021" author="prakash" created="Mon, 19 Nov 2012 16:57:04 +0000"  >&lt;p&gt;Alex, a couple more questions when you have some time:                          &lt;/p&gt;

&lt;p&gt;You mentioned above that there was previously a bug which would cause the &lt;tt&gt;changelog_&amp;#42;&lt;/tt&gt; files and &lt;tt&gt;seq_&amp;#42;&lt;/tt&gt; files to share the same FID.. Is there a chance this happened with the &lt;tt&gt;changelog_users&lt;/tt&gt; and &lt;tt&gt;seq_ctl&lt;/tt&gt; files? I ask because those two look very similar:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;                                                                         
# grovei /tftpboot/dumps/surya1/mdt0 &amp;gt; hexdump seq_ctl                             
0000000 0400 4000 0002 0000 ffff ffff ffff ffff                                    
0000010 0000 0000 0000 0000                                                        
0000018                                                                            
# grovei /tftpboot/dumps/surya1/mdt0 &amp;gt; hexdump changelog_users                     
0000000 0bd0 0000 0002 0000 ffff ffff ffff ffff                                    
0000010 0000 0000 0000 0000                                                        
0000018                                                                            
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;                                                                         

&lt;p&gt;It seems like too much of a coincidence that &lt;tt&gt;changelog_users&lt;/tt&gt; is exactly the size of &lt;tt&gt;struct lu_seq_range&lt;/tt&gt; (which I believe &lt;tt&gt;seq_ctl&lt;/tt&gt; contains) and has &lt;b&gt;very&lt;/b&gt; similar contents.&lt;/p&gt;

&lt;p&gt;Also, I created a new file system in a VM to use for testing. What should &quot;normal&quot; &lt;tt&gt;changelog_catalog&lt;/tt&gt; and &lt;tt&gt;changelog_users&lt;/tt&gt; files look like? I expected to see something like you posted earlier, but instead the files on my test MDS are empty:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;                                                                      
$ hexdump changelog_users                                                       
$ hexdump changelog_catalog                                                     
                                                                                
$ stat changelog_catalog                                                        
  File: `changelog_catalog&apos;                                                     
  Size: 0               Blocks: 1          IO Block: 131072 regular empty file  
Device: 1fh/31d Inode: 191         Links: 2                                     
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)        
Access: 2012-11-19 13:33:36.328821000 -0800                                     
Modify: 1969-12-31 16:00:00.846817000 -0800                                     
Change: 1969-12-31 16:00:00.846817000 -0800                                     
                                                                                
$ stat changelog_users                                                          
  File: `changelog_users&apos;                                                       
  Size: 0               Blocks: 1          IO Block: 131072 regular empty file  
Device: 1fh/31d Inode: 192         Links: 2                                     
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)        
Access: 2012-11-19 13:33:32.722166000 -0800                                     
Modify: 1969-12-31 16:00:00.846817000 -0800                                     
Change: 1969-12-31 16:00:00.846817000 -0800                                     
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;                                                                      

&lt;p&gt;Is this normal? &lt;/p&gt;</comment>
                            <comment id="48124" author="bzzz" created="Tue, 20 Nov 2012 15:05:34 +0000"  >&lt;p&gt;&amp;gt; what is considered a &quot;bad&quot; message? I see some of the &quot;can&apos;t get LMA&quot; messages, are those &quot;bad&quot;?&lt;/p&gt;

&lt;p&gt;Given that your filesystem has been in use for quite a long time and LMA was not always set in Orion, I think it&apos;s OK to see this message on some objects.&lt;br/&gt;
Though this also means we can&apos;t verify OI in this case.&lt;/p&gt;

&lt;p&gt;&amp;gt; You mentioned above that there was previously a bug which would cause the changelog_* files and seq_* files to share the same FID.. Is there a chance this happened with the changelog_users and seq_ctl files?&lt;/p&gt;

&lt;p&gt;yes, I think that was possible.&lt;/p&gt;

&lt;p&gt;&amp;gt; Also, I created a new file system in a VM to use for testing. What should &quot;normal&quot; changelog_catalog and changelog_users files look like? I expected to see something like you posted earlier, but instead the files on my test MDS are empty:&lt;/p&gt;

&lt;p&gt;This is because changelog was not used. Both files are supposed to be empty in this case, but any record written should grow them to 8K+.&lt;/p&gt;

&lt;p&gt;how often do you see &quot;can&apos;t get LMA&quot; ?&lt;/p&gt;</comment>
                            <comment id="48128" author="prakash" created="Tue, 20 Nov 2012 15:41:27 +0000"  >&lt;blockquote&gt;
&lt;p&gt;how often do you see &quot;can&apos;t get LMA&quot; ?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Looking back at the logs, it looks like we&apos;ve seen it about 15 times on the test MDS, for 3 distinct FIDs:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LustreError: 33431:0:(osd_object.c:410:osd_object_init()) lstest-MDT0000: can&apos;t get LMA on [0x200000bd0:0x4f:0x0]: rc = -2
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LustreError: 33231:0:(osd_object.c:410:osd_object_init()) lstest-MDT0000: can&apos;t get LMA on [0x200000bda:0x3:0x0]: rc = -2
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LustreError: 33244:0:(osd_object.c:410:osd_object_init()) lstest-MDT0000: can&apos;t get LMA on [0x200000bda:0x4:0x0]: rc = -2
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I see it on the production OSTs frequently:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;&amp;lt;ConMan&amp;gt; Console [grove214] log at 2012-11-16 04:00:00 PST.
2012-11-16 04:18:37 LustreError: 5777:0:(osd_object.c:410:osd_object_init()) ls1-OST00d6: can&apos;t get LMA on [0x100000000:0x10ac1:0x0]: rc = -2
2012-11-16 04:19:10 LustreError: 7522:0:(osd_object.c:410:osd_object_init()) ls1-OST00d6: can&apos;t get LMA on [0x100000000:0x10ac4:0x0]: rc = -2
2012-11-16 04:20:16 LustreError: 7362:0:(osd_object.c:410:osd_object_init()) ls1-OST00d6: can&apos;t get LMA on [0x100000000:0x10aca:0x0]: rc = -2
2012-11-16 04:20:16 LustreError: 7362:0:(osd_object.c:410:osd_object_init()) Skipped 3 previous similar messages
2012-11-16 04:22:37 LustreError: 5770:0:(osd_object.c:410:osd_object_init()) ls1-OST00d6: can&apos;t get LMA on [0x100000000:0x10ad7:0x0]: rc = -2
2012-11-16 04:22:37 LustreError: 5770:0:(osd_object.c:410:osd_object_init()) Skipped 5 previous similar messages

&amp;lt;ConMan&amp;gt; Console [grove214] log at 2012-11-16 05:00:00 PST.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;When was the &quot;shared FID&quot; bug fixed? I tried to grep through the master git logs, but nothing was immediately apparent to me. I am OK with removing the file and moving on, just as long as this issue won&apos;t come up in the future, which I&apos;d like to verify.&lt;/p&gt;</comment>
                            <comment id="48130" author="bzzz" created="Tue, 20 Nov 2012 15:53:54 +0000"  >&lt;p&gt;&amp;gt;&amp;gt; When was the &quot;shared FID&quot; bug fixed? I tried to grep through the master git logs, but nothing was immediately apparent to me. I am OK with removing the file and moving on, just as long as this issue won&apos;t come up in the future, which I&apos;d like to verify.&lt;/p&gt;

&lt;p&gt;commit 155e4b6cf45cc0ab21f72d94e5cccbd7a0939c58&lt;br/&gt;
Author: Alex Zhuravlev &amp;lt;alexey.zhuravlev@intel.com&amp;gt;&lt;br/&gt;
Date:   Tue Oct 2 23:52:42 2012 +0400&lt;/p&gt;

&lt;p&gt;    &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2075&quot; title=&quot;Assertion triggered in lod_declare_object_create&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2075&quot;&gt;&lt;del&gt;LU-2075&lt;/del&gt;&lt;/a&gt; fld: use predefined FIDs&lt;/p&gt;

&lt;p&gt;    and let OSD do mapping to the names internally.&lt;/p&gt;

&lt;p&gt;So, during the landing process we went back to the scheme where LMA is set by the OSD itself (otherwise we would have to set it in many places, in contrast with just osd-zfs and osd-ldiskfs now). So now every object created through the OSD API is supposed to have LMA (which can later be used by LFSCK, for example).&lt;/p&gt;</comment>
                            <comment id="48140" author="prakash" created="Tue, 20 Nov 2012 17:08:14 +0000"  >&lt;p&gt;OK. That landed between 2.3.51 and 2.3.52. We started seeing the message when we upgraded the test system to 2.3.51-Xchaos (from orion-2_3_49_92_1-72chaos). We haven&apos;t seen it on our production Grove FS, but we were much more conservative with its upgrade process, jumping from orion-2_3_49_54_2-68chaos to 2.3.54-6chaos.&lt;/p&gt;

&lt;p&gt;I think I&apos;m going to chalk this up to the FIDs being shared unless we have evidence to the contrary. I&apos;ll plan to remove or truncate the file to zero length (does it matter?), and if that goes fine, we can close this ticket as &quot;cannot reproduce&quot;.&lt;/p&gt;

&lt;p&gt;Also, what does LMA stand for and/or what&apos;s its purpose? Just curious.&lt;/p&gt;</comment>
                            <comment id="48159" author="bzzz" created="Wed, 21 Nov 2012 00:19:14 +0000"  >&lt;p&gt;&amp;gt; I&apos;ll plan to remove or truncate the file to zero length (does it matter?), and if that goes fine, we can close this ticket as &quot;cannot reproduce&quot;&lt;br/&gt;
it should be OK to just truncate it&lt;/p&gt;

&lt;p&gt;&amp;gt; Also, what does LMA stand for and/or what&apos;s its purpose? Just curious.&lt;br/&gt;
it stands for lustre metadata attributes (struct lustre_mdt_attrs)&lt;/p&gt;
</comment>
                            <comment id="48561" author="prakash" created="Thu, 29 Nov 2012 20:55:29 +0000"  >&lt;p&gt;Alex, I reverted &lt;a href=&quot;http://review.whamcloud.com/4376&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;4376&lt;/a&gt;, reverted &lt;a href=&quot;http://review.whamcloud.com/4169&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;4169&lt;/a&gt;, and truncated the &lt;tt&gt;changelog_users&lt;/tt&gt; file to be zero length. Things are back online and look healthy, so I&apos;ll go ahead and resolve this issue. Thanks for the help!&lt;/p&gt;</comment>
                            <comment id="48562" author="prakash" created="Thu, 29 Nov 2012 21:02:12 +0000"  >&lt;p&gt;I believe the issue was a bug in previous versions of Lustre which has been detailed in the comments of this issue. It has been resolved and deemed fixed since v2.3.52.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="12043" name="systemtap-LU-2079.txt" size="216922" author="prakash" created="Tue, 13 Nov 2012 16:18:16 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzv4y7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>4335</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>