<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:06:19 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-7138] LBUG: (osd_handler.c:1017:osd_trans_start()) ASSERTION( get_current()-&gt;journal_info == ((void *)0) ) failed:</title>
                <link>https://jira.whamcloud.com/browse/LU-7138</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;This evening we have hit this LBUG on the MDT in our production file system. The file system is currently down, as we hit the same bug every time we attempt to bring the MDT back, as soon as recovery finishes.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;&amp;lt;0&amp;gt;LustreError: 722:0:(osd_handler.c:1017:osd_trans_start()) ASSERTION( get_current()-&amp;gt;journal_info == ((void *)0) ) failed:
&amp;lt;0&amp;gt;LustreError: 722:0:(osd_handler.c:1017:osd_trans_start()) LBUG
&amp;lt;4&amp;gt;Pid: 722, comm: mdt01_017
&amp;lt;4&amp;gt;
&amp;lt;4&amp;gt;Call Trace:
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa065f895&amp;gt;] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa065fe97&amp;gt;] lbug_with_loc+0x47/0xb0 [libcfs]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa17df24d&amp;gt;] osd_trans_start+0x25d/0x660 [osd_ldiskfs]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa09b9b4a&amp;gt;] llog_osd_destroy+0x42a/0xd40 [obdclass]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa09b2edc&amp;gt;] llog_cat_new_log+0x1ec/0x710 [obdclass]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa09b350a&amp;gt;] llog_cat_add_rec+0x10a/0x450 [obdclass]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa09ab1e9&amp;gt;] llog_add+0x89/0x1c0 [obdclass]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa17f1976&amp;gt;] ? osd_attr_set+0x166/0x460 [osd_ldiskfs]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0d914e2&amp;gt;] mdd_changelog_store+0x122/0x290 [mdd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0da4d0c&amp;gt;] mdd_changelog_data_store+0x16c/0x320 [mdd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0dad9b3&amp;gt;] mdd_attr_set+0x12f3/0x1730 [mdd]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa088a551&amp;gt;] mdt_reint_setattr+0xf81/0x13a0 [mdt]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa087be1c&amp;gt;] ? mdt_root_squash+0x2c/0x3f0 [mdt]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa08801dd&amp;gt;] mdt_reint_rec+0x5d/0x200 [mdt]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa086423b&amp;gt;] mdt_reint_internal+0x4cb/0x7a0 [mdt]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa08649ab&amp;gt;] mdt_reint+0x6b/0x120 [mdt]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0c6f56e&amp;gt;] tgt_request_handle+0x8be/0x1000 [ptlrpc]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0c1f5a1&amp;gt;] ptlrpc_main+0xe41/0x1960 [ptlrpc]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8106c4f0&amp;gt;] ? pick_next_task_fair+0xd0/0x130
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0c1e760&amp;gt;] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8109e66e&amp;gt;] kthread+0x9e/0xc0
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8100c20a&amp;gt;] child_rip+0xa/0x20
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8109e5d0&amp;gt;] ? kthread+0x0/0xc0
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8100c200&amp;gt;] ? child_rip+0x0/0x20
&amp;lt;4&amp;gt;
&amp;lt;0&amp;gt;Kernel panic - not syncing: LBUG
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The stack trace doesn&apos;t quite seem to be the same as for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6634&quot; title=&quot;(osd_handler.c:901:osd_trans_start()) ASSERTION( get_current()-&amp;gt;journal_info == ((void *)0) ) failed: when reaching Catalog full condition&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6634&quot;&gt;&lt;del&gt;LU-6634&lt;/del&gt;&lt;/a&gt; (which in any case doesn&apos;t have any fix suggested).&lt;/p&gt;</description>
                <environment></environment>
        <key id="32056">LU-7138</key>
            <summary>LBUG: (osd_handler.c:1017:osd_trans_start()) ASSERTION( get_current()-&gt;journal_info == ((void *)0) ) failed:</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="green">Oleg Drokin</assignee>
                                    <reporter username="ferner">Frederik Ferner</reporter>
                        <labels>
                    </labels>
                <created>Thu, 10 Sep 2015 23:23:50 +0000</created>
                <updated>Thu, 1 Mar 2018 16:38:54 +0000</updated>
                            <resolved>Sun, 4 Oct 2015 19:34:03 +0000</resolved>
                                    <version>Lustre 2.7.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="127009" author="pjones" created="Thu, 10 Sep 2015 23:37:38 +0000"  >&lt;p&gt;Oleg is looking into this&lt;/p&gt;</comment>
                            <comment id="127013" author="green" created="Fri, 11 Sep 2015 00:08:40 +0000"  >&lt;p&gt;So, looking at llog_cat_new_log, we can see that on error it cleans up with:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;out_destroy:
        llog_destroy(env, loghandle);
        RETURN(rc);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This gets translated into llog_osd_destroy, which insists on creating its own transaction, but we are already in a transaction, which causes this assertion failure.&lt;/p&gt;

&lt;p&gt;It&apos;s trivial to reproduce with this patch (causing a 100% crash on startup):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;diff --git a/lustre/obdclass/llog_cat.c b/lustre/obdclass/llog_cat.c
index 981af6f..3617e3b 100644
--- a/lustre/obdclass/llog_cat.c
+++ b/lustre/obdclass/llog_cat.c
@@ -105,8 +105,8 @@ static int llog_cat_new_log(const struct lu_env *env,
         * assigned to the record and updated in rec header */
        rc = llog_write_rec(env, cathandle, &amp;amp;rec-&amp;gt;lid_hdr,
                            &amp;amp;loghandle-&amp;gt;u.phd.phd_cookie, LLOG_NEXT_IDX, th);
-       if (rc &amp;lt; 0)
-               GOTO(out_destroy, rc);
+//     if (rc &amp;lt; 0)
+               GOTO(out_destroy, rc = -5);
 
        CDEBUG(D_OTHER, &quot;new recovery log &quot;DOSTID&quot;:%x for index %u of catalog&quot;
               DOSTID&quot;\n&quot;, POSTID(&amp;amp;loghandle-&amp;gt;lgh_id.lgl_oi),
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If we think about it some more, it&apos;s actually totally logical that we do not need to clean up in a different transaction and must clean up in the same one, so that it&apos;s atomic in the end.&lt;br/&gt;
So I imagine we need to use a different destroy method that would reuse the same transaction.&lt;/p&gt;</comment>
                            <comment id="127014" author="green" created="Fri, 11 Sep 2015 00:25:47 +0000"  >&lt;p&gt;So the next puzzle is why the llog init or llog write failed. You are not out of space or anything like that, are you? No messages in the logs pointing at other sources of failure?&lt;/p&gt;

&lt;p&gt;You can comment out the llog_destroy call like this as a temporary workaround, though if writes keep failing, I am sure something else would hit too:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;diff --git a/lustre/obdclass/llog_cat.c b/lustre/obdclass/llog_cat.c
index 981af6f..6e51088 100644
--- a/lustre/obdclass/llog_cat.c
+++ b/lustre/obdclass/llog_cat.c
@@ -116,7 +116,7 @@ static int llog_cat_new_log(const struct lu_env *env,
        loghandle-&amp;gt;lgh_hdr-&amp;gt;llh_cat_idx = rec-&amp;gt;lid_hdr.lrh_index;
        RETURN(0);
 out_destroy:
-       llog_destroy(env, loghandle);
+//     llog_destroy(env, loghandle);
        RETURN(rc);
 }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;It&apos;s also unclear what would happen when the system encounters these half-written logs later, so it&apos;s really a &quot;try at your own risk&quot; patch.&lt;/p&gt;</comment>
                            <comment id="127018" author="ferner" created="Fri, 11 Sep 2015 00:35:27 +0000"  >&lt;p&gt;There&apos;s nothing obvious in syslog. The full contents of /var/log/messages before the crash/reboot is this:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Sep  7 08:52:49 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x2000115a3:0x662:0x0] != self_fid [0x200010dc1:0xd167:0x0]
Sep  7 11:25:06 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x200010dbe:0x9e03:0x0] != self_fid [0x200010dbe:0x9e09:0x0]
Sep  7 12:03:51 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x200011586:0x283f:0x0] != self_fid [0x200011582:0x41a3:0x0]
Sep  7 13:14:15 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x200010dab:0x198d5:0x0] != self_fid [0x200011602:0x1c22e:0x0]
Sep  7 13:30:42 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x200011580:0x9afc:0x0] != self_fid [0x200010da9:0x16251:0x0]
Sep  7 14:10:07 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x20001157b:0xf7a6:0x0] != self_fid [0x200011581:0x152fe:0x0]
Sep  7 15:47:13 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x2000115b2:0x4cab:0x0] != self_fid [0x2000115b1:0x4cab:0x0]
Sep  7 18:17:46 cs04r-sc-mds03-01 kernel: Lustre: MGS: Client d6e80b71-1eca-2fca-7854-d8ad02c40814 (at 172.23.116.33@tcp) reconnecting
Sep  7 18:17:46 cs04r-sc-mds03-01 kernel: Lustre: Skipped 1 previous similar message
Sep  7 18:18:46 cs04r-sc-mds03-01 kernel: Lustre: MGS: Client d6e80b71-1eca-2fca-7854-d8ad02c40814 (at 172.23.116.33@tcp) reconnecting
Sep  8 11:05:29 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000: Client a1185e16-56ed-6244-ef01-edb7c59bbde2 (at 172.23.146.38@tcp) reconnecting
Sep  8 11:05:29 cs04r-sc-mds03-01 kernel: Lustre: Skipped 2 previous similar messages
Sep  8 11:06:11 cs04r-sc-mds03-01 kernel: Lustre: MGS: Client 847ec63e-7411-2e53-d492-aca9e17e6c84 (at 172.23.146.38@tcp) reconnecting
Sep  8 11:06:11 cs04r-sc-mds03-01 kernel: Lustre: Skipped 1 previous similar message
Sep  8 11:06:25 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000: Client a1185e16-56ed-6244-ef01-edb7c59bbde2 (at 172.23.146.38@tcp) reconnecting
Sep  8 11:07:19 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000: Client a1185e16-56ed-6244-ef01-edb7c59bbde2 (at 172.23.146.38@tcp) reconnecting
Sep  8 11:07:19 cs04r-sc-mds03-01 kernel: Lustre: Skipped 1 previous similar message
Sep  8 11:11:41 cs04r-sc-mds03-01 kernel: Lustre: MGS: Client 847ec63e-7411-2e53-d492-aca9e17e6c84 (at 172.23.146.38@tcp) reconnecting
Sep  8 11:17:00 cs04r-sc-mds03-01 kernel: Lustre: MGS: Client 847ec63e-7411-2e53-d492-aca9e17e6c84 (at 172.23.146.38@tcp) reconnecting
Sep  8 16:15:16 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x2000115a4:0xa9e9:0x0] != self_fid [0x200011626:0x1d353:0x0]
Sep  8 17:23:12 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000: Client a1185e16-56ed-6244-ef01-edb7c59bbde2 (at 172.23.146.38@tcp) reconnecting
Sep  8 17:23:27 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000: Client a1185e16-56ed-6244-ef01-edb7c59bbde2 (at 172.23.146.38@tcp) reconnecting
Sep  8 18:33:55 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x200010d9e:0x196cf:0x0] != self_fid [0x20001157c:0x10445:0x0]
Sep  8 20:37:36 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x20001163d:0x29e3:0x0] != self_fid [0x200010db6:0x1c6f6:0x0]
Sep  9 00:32:12 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x200011640:0x407d:0x0] != self_fid [0x20001157f:0x1411a:0x0]
Sep  9 01:07:50 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000: Client a1185e16-56ed-6244-ef01-edb7c59bbde2 (at 172.23.146.38@tcp) reconnecting
Sep  9 02:36:05 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x2000115bf:0x1b19:0x0] != self_fid [0x200011570:0x1d5c:0x0]
Sep 10 10:21:18 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x200011645:0x8b5e:0x0] != self_fid [0x20001163a:0xcf49:0x0]
Sep 10 10:29:03 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x20001163d:0x1dc2d:0x0] != self_fid [0x20001163d:0x1dc37:0x0]
Sep 10 16:48:08 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x200011668:0x739b:0x0] != self_fid [0x20001167c:0x1501f:0x0]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Mounting the MDT as ldiskfs works and there is lots of space free:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[bnh65367@cs04r-sc-mds03-01 127.0.0.1-2015-09-10-22:05:04]$ df -h
/dev/mapper/vg_lustre03-mgs
                      469M  2.7M  441M   1% /lustre/mgs
/dev/mapper/vg_lustre03-mdt
                      1.5T  103G  1.3T   8% /lustre/lustre03/mdt
[bnh65367@cs04r-sc-mds03-01 127.0.0.1-2015-09-10-22:05:04]$ df -hi
Filesystem           Inodes IUsed IFree IUse% Mounted on
/dev/mapper/vg_lustre03-mdt
                       484M  128M  357M   27% /lustre/lustre03/mdt
/dev/mapper/vg_lustre03-mgs
                       125K   218  125K    1% /lustre/mgs

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Is there another option to get the file system back online without the &quot;at your own risk&quot; patch? I&apos;d rather not find out that half written logs are completely trashing our file system...&lt;/p&gt;</comment>
                            <comment id="127019" author="green" created="Fri, 11 Sep 2015 00:42:03 +0000"  >&lt;p&gt;If you apply a patch like the one below, it won&apos;t really fix anything, but at least we&apos;ll know where the error comes from and what it was (if you have a crashdump from the crash, there&apos;s a chance you can extract information from it too).&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;diff --git a/lustre/obdclass/llog.c b/lustre/obdclass/llog.c
index c34e179..680dae1 100644
--- a/lustre/obdclass/llog.c
+++ b/lustre/obdclass/llog.c
@@ -702,12 +702,16 @@ int llog_write_rec(const struct lu_env *env, struct llog_handle *handle,
 	ENTRY;
 
 	rc = llog_handle2ops(handle, &amp;amp;lop);
-	if (rc)
+	if (rc) {
+		CERROR(&quot;llog_handle2ops error %d\n&quot;, rc);
 		RETURN(rc);
+	}
 
 	LASSERT(lop);
-	if (lop-&amp;gt;lop_write_rec == NULL)
+	if (lop-&amp;gt;lop_write_rec == NULL) {
+		CERROR(&quot;lop-&amp;gt;lop_write_rec == NULL&quot;);
 		RETURN(-EOPNOTSUPP);
+	}
 
 	buflen = rec-&amp;gt;lrh_len;
 	LASSERT(cfs_size_round(buflen) == buflen);
@@ -718,6 +722,8 @@ int llog_write_rec(const struct lu_env *env, struct llog_handle *handle,
 	rc = lop-&amp;gt;lop_write_rec(env, handle, rec, logcookies, idx, th);
 	if (!raised)
 		cfs_cap_lower(CFS_CAP_SYS_RESOURCE);
+	if (rc)
+		CERROR(&quot;lop_write_rec error %d\n&quot;, rc);
 	RETURN(rc);
 }
 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;the patch is only needed on the MDS, btw&lt;/p&gt;</comment>
                            <comment id="127020" author="ferner" created="Fri, 11 Sep 2015 01:03:30 +0000"  >&lt;p&gt;Ok, I&apos;ve started the build for a Lustre version with the debug patch.&lt;/p&gt;

&lt;p&gt;I have a vmcore from the initial LBUG kernel panic, is this potentially useful? (at the moment I&apos;m not sure how I would go about extracting information from this, though.)&lt;/p&gt;</comment>
                            <comment id="127021" author="green" created="Fri, 11 Sep 2015 01:04:48 +0000"  >&lt;p&gt;You can use the crash tool to load the vmcore, then use mod -S to load Lustre module symbol information, and also the xbt crash module to see local variable contents.&lt;/p&gt;</comment>
                            <comment id="127025" author="ferner" created="Fri, 11 Sep 2015 01:32:31 +0000"  >&lt;p&gt;After updating Lustre and mounting the MDT again, I got this on the console:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Lustre: lustre03-MDT0000: Denying connection for new client lustre03-MDT0000-lwp-OST0001_UUID (at 10.144.144.31@o2ib), waiting for all 354 known clients (139 recovered, 170 in progress, and 0 evicted) to recove0
Lustre: Skipped 19 previous similar messages
Lustre: Skipped 3 previous similar messages
Lustre: lustre03-MDT0000: Client 211c78af-7674-5b7e-5f34-3d38b509f421 (at 10.144.148.31@o2ib) reconnecting, waiting for 354 clients in recovery for 2:44
Lustre: Skipped 1895 previous similar messages
LustreError: 10692:0:(ldlm_lib.c:1748:check_for_next_transno()) lustre03-MDT0000: waking for gap in transno, VBR is OFF (skip: 370152455116, ql: 37, comp: 317, conn: 354, next: 370152455496, last_committed: 370)
LustreError: 10692:0:(llog.c:726:llog_write_rec()) lop_write_rec error -28
LustreError: 10692:0:(llog.c:726:llog_write_rec()) Skipped 13 previous similar messages
LustreError: 10692:0:(osd_handler.c:1017:osd_trans_start()) ASSERTION( get_current()-&amp;gt;journal_info == ((void *)0) ) failed: 
LustreError: 10692:0:(osd_handler.c:1017:osd_trans_start()) LBUG
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I&apos;m also not having much luck with the vmcore; it appears I can&apos;t find the right &quot;NAMELIST&quot; (vmlinux? Do I need the kernel-debuginfo rpm for the kernel we&apos;ve been running?), so I&apos;m thinking of uploading this somewhere in the hope that you could look at it.&lt;/p&gt;

&lt;p&gt;vmcore file now at &lt;a href=&quot;ftp://ftpanon.diamond.ac.uk/LU-7138/vmcore-LU_7138&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;ftp://ftpanon.diamond.ac.uk/LU-7138/vmcore-LU_7138&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I need to get some sleep now, but would appreciate if this could still be looked at. I&apos;ll be back in approximately 5h and users will start getting very impatient shortly after this, so an idea how to get this file system back up (sanely) would be really good at that point.&lt;/p&gt;</comment>
                            <comment id="127026" author="green" created="Fri, 11 Sep 2015 01:53:25 +0000"  >&lt;p&gt;&quot;lop_write_rec error -28&quot; means you are out of free space.&lt;br/&gt;
Can you check that you have enough inodes too? (df -i) Is there a quota set for the root user by any chance?&lt;/p&gt;</comment>
                            <comment id="127028" author="ferner" created="Fri, 11 Sep 2015 01:59:45 +0000"  >&lt;p&gt;I have checked inodes as well:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[bnh65367@cs04r-sc-mds03-01 127.0.0.1-2015-09-10-22:05:04]$ df -hi
Filesystem           Inodes IUsed IFree IUse% Mounted on
/dev/mapper/vg_lustre03-mdt
                       484M  128M  357M   27% /lustre/lustre03/mdt
/dev/mapper/vg_lustre03-mgs
                       125K   218  125K    1% /lustre/mgs
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And I don&apos;t think we have any quota limits set at all. How would I check this without having the file system up? (For example on the ldiskfs mounted MDT?)&lt;/p&gt;</comment>
                            <comment id="127029" author="green" created="Fri, 11 Sep 2015 02:04:11 +0000"  >&lt;p&gt;Aha, reading into llog_osd_write_rec I see this nice piece:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;        /* if it&apos;s the last idx in log file, then return -ENOSPC */
        if (loghandle-&amp;gt;lgh_last_idx &amp;gt;= LLOG_BITMAP_SIZE(llh) - 1)
                RETURN(-ENOSPC);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;I wonder if this is what you are hitting.... Hmmm....&lt;br/&gt;
In fact looking at the backtrace, we are coming from llog_cat_add_rec where we have:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;        rc = llog_write_rec(env, loghandle, rec, reccookie, LLOG_NEXT_IDX, th);
        if (rc &amp;lt; 0)
                CDEBUG_LIMIT(rc == -ENOSPC ? D_HA : D_ERROR,
                             &quot;llog_write_rec %d: lh=%p\n&quot;, rc, loghandle);
        up_write(&amp;amp;loghandle-&amp;gt;lgh_lock);
        if (rc == -ENOSPC) {
                /* try to use next log */
                loghandle = llog_cat_current_log(cathandle, th);
                LASSERT(!IS_ERR(loghandle));
                /* new llog can be created concurrently */
                if (!llog_exist(loghandle)) {
                        rc = llog_cat_new_log(env, cathandle, loghandle, th);
                        if (rc &amp;lt; 0) {
                                up_write(&amp;amp;loghandle-&amp;gt;lgh_lock);
                                RETURN(rc);
                        }
                }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;So we caught ENOSPC and got into the creation of a new llog, but got another ENOSPC there as well, where the llog should already be the new one, so it could not be the end of the previous llog; the writes appear to be genuinely failing.&lt;/p&gt;</comment>
                            <comment id="127030" author="green" created="Fri, 11 Sep 2015 02:07:32 +0000"  >&lt;p&gt;If you mount as ldiskfs, you should see quota info with the quota tools, I imagine, e.g. the quota command.&lt;/p&gt;

&lt;p&gt;Also can you write files in there? e.g. can you do dd if=/dev/zero of=/mnt/mdt-mountpoint/LOGS/aaa bs=1024k count=1 ?&lt;br/&gt;
(remove the /mnt/mdt-mountpoint/LOGS/aaa afterwards).&lt;/p&gt;</comment>
                            <comment id="127031" author="ferner" created="Fri, 11 Sep 2015 02:15:41 +0000"  >&lt;p&gt;quota shows no quota for any user on the MDT when mounted as ldiskfs, and yes, I can still write a file:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[bnh65367@cs04r-sc-mds03-01 ~]$ sudo dd if=/dev/zero of=/lustre/lustre03/mdt/LOGS/aaa bs=1024k count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.000997444 s, 1.1 GB/s
[bnh65367@cs04r-sc-mds03-01 ~]$
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;(just to be sure, I&apos;ve also checked the output of df -hl and df -hil for all OSS nodes and all OSTs have space as well.)&lt;/p&gt;</comment>
                            <comment id="127033" author="green" created="Fri, 11 Sep 2015 02:19:37 +0000"  >&lt;p&gt;Can you also check that all the files in the LOGS/ dir are root-owned? What are the sizes of the 10 most recently written (based on mtime) ones?&lt;/p&gt;</comment>
                            <comment id="127034" author="ferner" created="Fri, 11 Sep 2015 02:21:57 +0000"  >&lt;p&gt;LOGS/ dir on the mdt is empty, no files at all, the directory is owned by root:root and world writeable.&lt;/p&gt;</comment>
                            <comment id="127041" author="green" created="Fri, 11 Sep 2015 04:29:32 +0000"  >&lt;p&gt;Ok, apparently the LOGS/ dir is for old-style llogs.&lt;/p&gt;

&lt;p&gt;Anyway, the current theory is that your changelog catalog has overflowed. If you check the size of the &quot;changelog_catalog&quot; file in the MDT fs mounted as ldiskfs, it is quite big, I imagine?&lt;/p&gt;

&lt;p&gt;What do you use changelogs for? a robinhood install or some other eager user that immediately consumes all changelogs as they are generated? Is it still alive and well? (As in - was consuming records while MDS was up, and did not happen to wedge itself ages ago).&lt;br/&gt;
This also should mean that the number of files (not dirs) in the &quot;O&quot; dir should be small.&lt;/p&gt;

&lt;p&gt;If you don&apos;t care all that much about these changelogs because they have been consumed already - you can just remove &quot;changelog_catalog&quot; and &quot;changelog_users&quot; files and then MDS should be able to start. You will then need to reenable your changelogs and reregister all consumers.&lt;/p&gt;</comment>
                            <comment id="127043" author="ferner" created="Fri, 11 Sep 2015 05:37:45 +0000"  >&lt;p&gt;Changelogs are indeed used for a robinhood instance. This appears to be still alive and had managed to keep up with the rate of changes as far as I can see.&lt;/p&gt;

&lt;p&gt;changelog_catalog isn&apos;t that big as far as I can see (4MB), and there are indeed not many files below O:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[bnh65367@cs04r-sc-mds03-01 mdt]$ sudo find O -type f | wc -l
35
[bnh65367@cs04r-sc-mds03-01 mdt]$ ls -l changelog_*
-rw-r--r-- 1 root root 4153280 Aug 19  2014 changelog_catalog
-rw-r--r-- 1 root root    8384 Aug 19  2014 changelog_users
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Removing changelog_* did indeed allow us to bring the file system back successfully (so far).&lt;/p&gt;

&lt;p&gt;Immediate crisis over, but we&apos;d be very interested in the root cause and how to avoid this in the future (and on our other file systems).&lt;/p&gt;</comment>
                            <comment id="127050" author="bfaccini" created="Fri, 11 Sep 2015 07:55:31 +0000"  >&lt;p&gt;Oleg, based on the size of changelog_catalog, which indicates it has reached its max size, I believe this ticket is the combination of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6556&quot; title=&quot;changelog catalog corruption if all possible records is define &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6556&quot;&gt;&lt;del&gt;LU-6556&lt;/del&gt;&lt;/a&gt; (Catalog no longer able to wrap-around) and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6634&quot; title=&quot;(osd_handler.c:901:osd_trans_start()) ASSERTION( get_current()-&amp;gt;journal_info == ((void *)0) ) failed: when reaching Catalog full condition&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6634&quot;&gt;&lt;del&gt;LU-6634&lt;/del&gt;&lt;/a&gt; (llog_destroy() wrong call to destroy plain LLOG upon Catalog full condition because a journal transaction already started).&lt;br/&gt;
My master patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6556&quot; title=&quot;changelog catalog corruption if all possible records is define &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6556&quot;&gt;&lt;del&gt;LU-6556&lt;/del&gt;&lt;/a&gt; has still not landed, and I think Mike has already started working on a solution for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6634&quot; title=&quot;(osd_handler.c:901:osd_trans_start()) ASSERTION( get_current()-&amp;gt;journal_info == ((void *)0) ) failed: when reaching Catalog full condition&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6634&quot;&gt;&lt;del&gt;LU-6634&lt;/del&gt;&lt;/a&gt;.&lt;br/&gt;
So at the moment, I think there was no other solution than to remove the changelog_catalog file to allow the filesystem to be restarted.&lt;/p&gt;</comment>
                            <comment id="127073" author="bzzz" created="Fri, 11 Sep 2015 13:22:51 +0000"  >&lt;p&gt;ah, even better - there is llog_trans_destroy() in the master branch already. I&apos;ll cook the patch.&lt;/p&gt;</comment>
                            <comment id="129262" author="bzzz" created="Sun, 4 Oct 2015 19:34:03 +0000"  >&lt;p&gt;a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6634&quot; title=&quot;(osd_handler.c:901:osd_trans_start()) ASSERTION( get_current()-&amp;gt;journal_info == ((void *)0) ) failed: when reaching Catalog full condition&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6634&quot;&gt;&lt;del&gt;LU-6634&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="129319" author="ferner" created="Mon, 5 Oct 2015 15:19:04 +0000"  >&lt;p&gt;Alex,&lt;/p&gt;

&lt;p&gt;could you double check the bug number that this is a duplicate of? &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6636&quot; title=&quot;cfs_hash_for_each_relax() doesn&amp;#39;t break iteration as expected&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6636&quot;&gt;&lt;del&gt;LU-6636&lt;/del&gt;&lt;/a&gt; doesn&apos;t look right, did you mean &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6634&quot; title=&quot;(osd_handler.c:901:osd_trans_start()) ASSERTION( get_current()-&amp;gt;journal_info == ((void *)0) ) failed: when reaching Catalog full condition&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6634&quot;&gt;&lt;del&gt;LU-6634&lt;/del&gt;&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;Also, what would be the best way we can get a fix/patch for lustre 2.7?&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br/&gt;
Frederik&lt;/p&gt;</comment>
                            <comment id="129423" author="bzzz" created="Tue, 6 Oct 2015 06:21:45 +0000"  >&lt;p&gt;yes, you&apos;re right, that&apos;s a typo - I meant &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6634&quot; title=&quot;(osd_handler.c:901:osd_trans_start()) ASSERTION( get_current()-&amp;gt;journal_info == ((void *)0) ) failed: when reaching Catalog full condition&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6634&quot;&gt;&lt;del&gt;LU-6634&lt;/del&gt;&lt;/a&gt;. the port for 2.7 would need llog_trans_destroy() and llog_destroy() from the master branch.&lt;/p&gt;</comment>
                            <comment id="129437" author="pjones" created="Tue, 6 Oct 2015 13:10:54 +0000"  >&lt;p&gt;Frederik&lt;/p&gt;

&lt;p&gt;As soon as the fixes are finalized for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6634&quot; title=&quot;(osd_handler.c:901:osd_trans_start()) ASSERTION( get_current()-&amp;gt;journal_info == ((void *)0) ) failed: when reaching Catalog full condition&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6634&quot;&gt;&lt;del&gt;LU-6634&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6556&quot; title=&quot;changelog catalog corruption if all possible records is define &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6556&quot;&gt;&lt;del&gt;LU-6556&lt;/del&gt;&lt;/a&gt; we&apos;ll create 2.7.x versions.&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="131160" author="ferner" created="Thu, 22 Oct 2015 12:46:41 +0000"  >&lt;p&gt;As far as I can see, at least the patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6556&quot; title=&quot;changelog catalog corruption if all possible records is define &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6556&quot;&gt;&lt;del&gt;LU-6556&lt;/del&gt;&lt;/a&gt; has been merged into master. Unfortunately I have not managed to cleanly merge that into the 2.7 branch. Any chance someone could point me to a version for 2.7.x for that patch? (Though I don&apos;t think I&apos;ll re-enable changelogs on any of the production systems until we have both patches, I still would like to apply the first patch on test systems ASAP.)&lt;/p&gt;</comment>
                            <comment id="222069" author="gerrit" created="Thu, 1 Mar 2018 16:38:54 +0000"  >&lt;p&gt;James Simmons (uja.ornl@yahoo.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/31478&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/31478&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7138&quot; title=&quot;LBUG: (osd_handler.c:1017:osd_trans_start()) ASSERTION( get_current()-&amp;gt;journal_info == ((void *)0) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7138&quot;&gt;&lt;del&gt;LU-7138&lt;/del&gt;&lt;/a&gt; sptlrpc: make srpc_info writable&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: a2f7394ca5affe35c568ba2971c3bf3aeb2a7843&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="29826">LU-6556</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="30350">LU-6634</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxn9b:</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue key="10020"><![CDATA[1]]></customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>