<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:18:59 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-15517] MDT stuck in recovery if one other MDT is failed over to partner node</title>
                <link>https://jira.whamcloud.com/browse/LU-15517</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;An MDT fails to enter recovery when one other MDT is running on its partner MDS. The recovery_status file indicates &quot;WAITING&quot; and reports the MDT running on the partner node as a non-ready MDT.&lt;/p&gt;

&lt;p&gt;The console log of the MDT unable to enter recovery repeatedly shows messages like:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[Thu Feb  3 12:15:18 2022] Lustre: 4102:0:(ldlm_lib.c:1827:extend_recovery_timer()) lquake-MDT0009: extended recovery timer reached hard limit: 900, extend: 1
[Thu Feb  3 12:15:18 2022] Lustre: 4102:0:(ldlm_lib.c:1827:extend_recovery_timer()) Skipped 29 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;My steps to reproduce (a rough command sketch follows the list):&lt;br/&gt;
1. Start all MDTs on their primary MDS (in my case, that&apos;s jet1...jet16 =&amp;gt; MDT0000...MDT000f). Allow them to complete recovery.&lt;br/&gt;
2. umount MDT000e on jet15, and mount it on jet16. Allow it to complete recovery.&lt;br/&gt;
3. umount MDT0009 on jet10, and then mount it again (on the same node, jet10, where it was happily running moments ago).&lt;/p&gt;
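&lt;p&gt;Roughly, steps 2 and 3 correspond to commands like the following (device names and mount points are illustrative placeholders, not the exact ones used):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# step 2: fail MDT000e over from jet15 to its partner jet16
[root@jet15]# umount /mnt/lquake/mdt000e
[root@jet16]# mount -t lustre &amp;lt;mdt000e-device&amp;gt; /mnt/lquake/mdt000e

# step 3: restart MDT0009 on the same node it was already running on
[root@jet10]# umount /mnt/lquake/mdt0009
[root@jet10]# mount -t lustre &amp;lt;mdt0009-device&amp;gt; /mnt/lquake/mdt0009

# MDT0009 then sits in WAITING:
[root@jet10]# lctl get_param mdt.lquake-MDT0009.recovery_status
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;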

&lt;p&gt;The debug log shows that the update log for MDT000e was not received. There are no messages in the console log regarding MDT000e after MDT0009 starts up.&lt;/p&gt;

&lt;p&gt;I determined the PID of the thread lod_sub_recovery_thread() for MDT000e. The debug log shows the thread starts up, follows roughly this sequence of calls, and never returns from ptlrpc_set_wait().&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lod_sub_prep_llog-&amp;gt;llog_osd_get_cat_list-&amp;gt;dt_locate_at-&amp;gt;lu_object_find_at-&amp;gt; ?-&amp;gt;
osp_attr_get-&amp;gt;osp_remote_sync-&amp;gt;ptlrpc_queue_wait-&amp;gt;ptlrpc_set_wait
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
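&lt;p&gt;(As a rough cross-check, the stack of that stuck thread can be dumped directly on the node running the recovering MDT; &amp;lt;pid&amp;gt; below is a placeholder for the lod_sub_recovery_thread PID found above.)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# dump the kernel stack of the stuck lod_sub_recovery_thread
[root@jet10]# cat /proc/&amp;lt;pid&amp;gt;/stack
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;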
&lt;p&gt;So apparently the RPC never times out: the upper layers never get a chance to retry, no error messages report the problem, and the MDT never enters recovery.&lt;/p&gt;

&lt;p&gt;Our patch stack is:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;82ea54e (tag: 2.12.8_6.llnl) LU-13356 client: don&apos;t use OBD_CONNECT_MNE_SWAB
d06f5b2 LU-15357 mdd: fix changelog context leak
543b60b LU-9964 llite: prevent mulitple group locks
d776b67 LU-15234 lnet: Race on discovery queue
77040da Revert &quot;LU-15234 lnet: Race on discovery queue&quot;
bba827c LU-15234 lnet: Race on discovery queue
7faa872 LU-14865 utils: llog_reader.c printf type mismatch
5dc104e LU-13946 build: OpenZFS 2.0 compatibility
0fef268 TOSS-4917 grant: chatty warning in tgt_grant_incoming
5b70822 TOSS-4917 grant: improve debug for grant calcs
8af02e8 log lfs setstripe paths to syslog
391be81 Don&apos;t install lustre init script on systemd systems
c613926 LLNL build customizations
a4f71cd TOSS-4431 build: build ldiskfs only for x86_64
067cb55 (tag: v2_12_8, tag: 2.12.8) New release 2.12.8&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;See &lt;a href=&quot;https://github.com/LLNL/lustre/tree/2.12.8-llnl&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/LLNL/lustre/tree/2.12.8-llnl&lt;/a&gt; for the details.&lt;/p&gt;</description>
                <environment>3.10.0-1160.53.1.1chaos.ch6.x86_64&lt;br/&gt;
zfs-0.7.11-9.8llnl.ch6.x86_64&lt;br/&gt;
lustre-2.12.8_6.llnl-1.ch6.x86_64</environment>
        <key id="68484">LU-15517</key>
            <summary>MDT stuck in recovery if one other MDT is failed over to partner node</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="tappro">Mikhail Pershin</assignee>
                                    <reporter username="ofaaland">Olaf Faaland</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Thu, 3 Feb 2022 20:29:40 +0000</created>
                <updated>Fri, 22 Jul 2022 22:43:18 +0000</updated>
                                            <version>Lustre 2.12.8</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="325151" author="ofaaland" created="Thu, 3 Feb 2022 20:34:47 +0000"  >&lt;p&gt;This appears to be new to us, which makes me a bit suspicious of the top commit, see &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15453&quot; title=&quot;MDT shutdown hangs on  mutex_lock, possibly cld_lock&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15453&quot;&gt;LU-15453&lt;/a&gt; / &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13356&quot; title=&quot;lctl conf_param hung on the MGS node&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13356&quot;&gt;&lt;del&gt;LU-13356&lt;/del&gt;&lt;/a&gt;.&#160; But I haven&apos;t yet proven it by reverting that patch and re-testing.&lt;/p&gt;</comment>
                            <comment id="325152" author="ofaaland" created="Thu, 3 Feb 2022 20:41:01 +0000"  >&lt;p&gt;Debug log and console log attached.&lt;/p&gt;</comment>
                            <comment id="325182" author="ofaaland" created="Thu, 3 Feb 2022 22:32:05 +0000"  >&lt;p&gt;Note there is no network or lnet issue within the cluster.&#160; As I mentioned, all the targets mount fine on their primary MDS, and (one at a time), on their secondary MDS.&#160; lnetctl ping works correctly in both directions.&lt;/p&gt;</comment>
                            <comment id="325183" author="ofaaland" created="Thu, 3 Feb 2022 22:34:55 +0000"  >&lt;p&gt;I&apos;ve also attached the debug log and dmesg output from jet16, where both MDT000e and MDT000f are running.&lt;/p&gt;</comment>
                            <comment id="325185" author="ofaaland" created="Thu, 3 Feb 2022 22:38:47 +0000"  >&lt;p&gt;For my reference, my local ticket is TOSS5536&lt;/p&gt;</comment>
                            <comment id="325193" author="ofaaland" created="Thu, 3 Feb 2022 22:51:58 +0000"  >&lt;p&gt;Once it&apos;s entered ptlrpc_set_wait(), the debug log reports this over and over again:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00100000:13.0:1643913970.009932:0:4100:0:(client.c:2390:ptlrpc_set_wait()) set ffffa1897c264e00 going to sleep for 0 seconds
00000100:00000001:13.0:1643913970.009933:0:4100:0:(client.c:1726:ptlrpc_check_set()) Process entered
00000100:00000001:13.0:1643913970.009933:0:4100:0:(client.c:2647:ptlrpc_unregister_reply()) Process leaving (rc=1 : 1 : 1)
00000100:00000001:13.0:1643913970.009934:0:4100:0:(client.c:1195:ptlrpc_import_delay_req()) Process entered
00000100:00000001:13.0:1643913970.009934:0:4100:0:(client.c:1250:ptlrpc_import_delay_req()) Process leaving (rc=1 : 1 : 1)
00000100:00000001:13.0:1643913970.009935:0:4100:0:(client.c:2140:ptlrpc_check_set()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:13.0:1643913970.009936:0:4100:0:(client.c:1726:ptlrpc_check_set()) Process entered
00000100:00000001:13.0:1643913970.009936:0:4100:0:(client.c:2647:ptlrpc_unregister_reply()) Process leaving (rc=1 : 1 : 1)
00000100:00000001:13.0:1643913970.009936:0:4100:0:(client.c:1195:ptlrpc_import_delay_req()) Process entered
00000100:00000001:13.0:1643913970.009937:0:4100:0:(client.c:1250:ptlrpc_import_delay_req()) Process leaving (rc=1 : 1 : 1)
00000100:00000001:13.0:1643913970.009937:0:4100:0:(client.c:2140:ptlrpc_check_set()) Process leaving (rc=0 : 0 : 0)  &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="325329" author="pjones" created="Fri, 4 Feb 2022 18:37:06 +0000"  >&lt;p&gt;Mike&lt;/p&gt;

&lt;p&gt;Could you please advise?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="325348" author="ofaaland" created="Fri, 4 Feb 2022 20:30:16 +0000"  >&lt;p&gt;I re-tested with&#160;lustre-2.12.8_5.llnl (without the OBD_CONNECT_MNE_SWAB patch) and see the same issue.&#160; &#160;I&apos;ll have to do a lot more testing to determine how far back this behavior was introduced, if that&apos;s necessary.&lt;/p&gt;</comment>
                            <comment id="325412" author="ofaaland" created="Mon, 7 Feb 2022 07:04:07 +0000"  >&lt;p&gt;I tested with&#160;lustre-2.12.7_2.llnl-2.ch6.x86_64 and reproduced the issue, so it&apos;s not new.&#160; See &lt;a href=&quot;https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl&lt;/a&gt; for the code.&#160; Patch stack is:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;* f372cfb (tag: 2.12.7_2.llnl) LU-14865 utils: llog_reader.c printf type mismatch
* 5b7709c LU-14733 o2iblnd: Avoid double posting invalidate
* cdde64a LU-14733 o2iblnd: Move racy NULL assignment
* 38fade1 LU-12836 osd-zfs: Catch all ZFS pool change events
* c7d6ecf LU-13946 build: OpenZFS 2.0 compatibility
* ec7dd98 TOSS-4917 grant: chatty warning in tgt_grant_incoming
* eb6d90f TOSS-4917 grant: improve debug for grant calcs
* 4f202a3 log lfs setstripe paths to syslog
* 939ba8a Don&apos;t install lustre init script on systemd systems
* 3cfca06 LLNL build customizations
* ce7af9a TOSS-4431 build: build ldiskfs only for x86_64
* 6030d0c (tag: v2_12_7, tag: 2.12.7) New release 2.12.7&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The amount of time I waited in this test was 20 minutes.  I don&apos;t recall how long I waited in my earlier experiments.&lt;/p&gt;</comment>
                            <comment id="325495" author="ofaaland" created="Mon, 7 Feb 2022 20:15:22 +0000"  >&lt;p&gt;I reproduced with lustre-2.12.5_10.llnl&lt;/p&gt;</comment>
                            <comment id="329792" author="ofaaland" created="Mon, 21 Mar 2022 20:06:41 +0000"  >&lt;p&gt;Ping&lt;/p&gt;</comment>
                            <comment id="329900" author="tappro" created="Tue, 22 Mar 2022 18:20:59 +0000"  >&lt;p&gt;Olaf, are you able to reproduce that on clean system or it is reproduced on the working cluster? In the last case, could you collect update llog info from the MDT with stuck recovery? (jet16 in your last example). You can use &lt;tt&gt;debugfs&lt;/tt&gt; for that:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
# debugfs -c &amp;lt;device&amp;gt; -R &lt;span class=&quot;code-quote&quot;&gt;&quot;ls -l update_log_dir/&quot;&lt;/span&gt; &amp;gt; jet16_update_log_dir.txt&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;</comment>
                            <comment id="331271" author="ofaaland" created="Thu, 7 Apr 2022 03:34:21 +0000"  >&lt;p&gt;Hi Mikhail,&lt;br/&gt;
Sorry for the delay. Good question - I was not able to reproduce on a clean system (I thought I&apos;d tried that, but apparently not).&lt;/p&gt;

&lt;p&gt;I reproduced on jet again. In this instance, lquake-MDT000e is failed over and running on jet16, when it normally would be on jet15. lquake-MDT0009 was umounted and remounted on jet10, and is stuck in WAITING.&lt;/p&gt;

&lt;p&gt;NIDs are as follows, in case it helps interpret what you see:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;ejet10: 172.19.1.120@o2ib100
ejet15: 172.19.1.125@o2ib100
ejet16: 172.19.1.126@o2ib100
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The file containing the contents of update_log_dir/ is 2.1GB compressed. Do you still want me to send it to you? If not, do you want me to run llog_reader on each file and collect the results instead? The ones I spot-checked reported 0 records, even when they were large. Here&apos;s an example:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@jet10:update_log_dir]# stat [0x40002d2c5:0x2:0x0]
  File: &apos;[0x40002d2c5:0x2:0x0]&apos;
  Size: 938979232       Blocks: 14913761   IO Block: 512    regular file
Device: 31h/49d Inode: 28181515    Links: 1
Access: (0644/-rw-r--r--)  Uid: (56288/ UNKNOWN)   Gid: (56288/ UNKNOWN)
Access: 2018-02-23 08:14:08.000000000 -0800
Modify: 2018-02-23 08:14:08.000000000 -0800
Change: 2018-02-23 08:14:08.000000000 -0800
 Birth: -

[root@jet10:update_log_dir]# llog_reader [0x40002d2c5:0x2:0x0]
Header size : 32768      llh_size : 472
Time : Fri Feb 23 08:14:08 2018
Number of records: 0    cat_idx: 13     last_idx: 161679
Target uuid : 
-----------------------
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="331279" author="ofaaland" created="Thu, 7 Apr 2022 06:29:21 +0000"  >&lt;p&gt;I ran llog_reader on each of the files in update_log_dir and attached the output in the file named&#160;lu-15517.llog_reader.out.&lt;/p&gt;

&lt;p&gt;The files with Number of records &amp;gt; 0 are:&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;0x400000401:0x1:0x0&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;0x400000402:0x1:0x0&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;0x400000403:0x1:0x0&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;0x400000404:0x1:0x0&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;0x400000405:0x1:0x0&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;0x400000406:0x1:0x0&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;0x400000407:0x1:0x0&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;0x400000408:0x1:0x0&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;0x400000409:0x1:0x0&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;0x40000040a:0x1:0x0&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;0x40000040b:0x1:0x0&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;0x40000040c:0x1:0x0&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;0x40000040d:0x1:0x0&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;0x40000040e:0x1:0x0&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;0x40000040f:0x1:0x0&amp;#93;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Do you want me to do something with the files in the OI they refer to?&lt;/p&gt;</comment>
                            <comment id="331280" author="tappro" created="Thu, 7 Apr 2022 07:01:34 +0000"  >&lt;p&gt;Olaf, please check these files with llog_reader. I wonder if any of them contains &apos;holes&apos; in records but llog_reader in your version of Lustre might not be able to translate update records correctly and report them as &apos;Unknown&apos; type, you can check.&#160; I&apos;d ask you to upload these files listed above for analysis here, is that possible?&lt;/p&gt;</comment>
                            <comment id="331282" author="tappro" created="Thu, 7 Apr 2022 07:11:43 +0000"  >&lt;p&gt;ah, missed that you&apos;ve uploaded llog_reader output. Looks like it knows about update log, let me check it first&lt;/p&gt;</comment>
                            <comment id="331283" author="tappro" created="Thu, 7 Apr 2022 07:16:00 +0000"  >&lt;p&gt;Olaf, no need to upload all llogs right now, let me think a bit first&lt;/p&gt;</comment>
                            <comment id="331477" author="tappro" created="Sat, 9 Apr 2022 07:22:48 +0000"  >&lt;p&gt;Olaf, the reason why I&apos;ve asked about reproducing that on clean system is that there were several issues with update log processing causing MDT-MDT recovery to fail. Could you check if any of MDT reported errors from &lt;tt&gt;lod_sub_recovery_thread&lt;/tt&gt; or &lt;tt&gt;llog_process_thread()&lt;/tt&gt; or any other llog processing complains when MDT is stuck in WAITING state?&#160;&lt;/p&gt;</comment>
                            <comment id="331655" author="ofaaland" created="Mon, 11 Apr 2022 23:14:58 +0000"  >&lt;p&gt;Hi Mikhail,&lt;br/&gt;
I found several errors reported from lod_sub_process_config().&#160; They were not encountered during my experiment, but earlier today, when I stopped the targets prior to a software update.&#160; Note that they appeared on several nodes, all MDTs.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Apr 11 14:46:58 jet2 kernel: LustreError: 27555:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT0001-mdtlov: error cleaning up LOD index 3: cmd 0xcf031 : rc = -19
Apr 11 14:46:58 jet3 kernel: LustreError: 15820:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT0002-mdtlov: error cleaning up LOD index 3: cmd 0xcf031 : rc = -19
Apr 11 14:46:58 jet5 kernel: LustreError: 5267:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT0004-mdtlov: error cleaning up LOD index 3: cmd 0xcf031 : rc = -19
Apr 11 14:46:58 jet7 kernel: LustreError: 22882:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT0006-mdtlov: error cleaning up LOD index 3: cmd 0xcf031 : rc = -19
Apr 11 14:46:58 jet8 kernel: LustreError: 21185:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT0007-mdtlov: error cleaning up LOD index 3: cmd 0xcf031 : rc = -19
Apr 11 14:46:58 jet9 kernel: LustreError: 3160:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT0008-mdtlov: error cleaning up LOD index 1: cmd 0xcf031 : rc = -19
Apr 11 14:46:58 jet10 kernel: LustreError: 18778:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT0009-mdtlov: error cleaning up LOD index 1: cmd 0xcf031 : rc = -19
Apr 11 14:46:58 jet12 kernel: LustreError: 3992:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT000b-mdtlov: error cleaning up LOD index 3: cmd 0xcf031 : rc = -19
Apr 11 14:46:58 jet14 kernel: LustreError: 1060:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT000d-mdtlov: error cleaning up LOD index 1: cmd 0xcf031 : rc = -19
Apr 11 14:46:58 jet15 kernel: LustreError: 22650:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT000e-mdtlov: error cleaning up LOD index 3: cmd 0xcf031 : rc = -19
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="331656" author="ofaaland" created="Mon, 11 Apr 2022 23:17:23 +0000"  >&lt;p&gt;I also saw these errors on jet6 (which was hosting MDT0005):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2022-04-11 14:48:39 [1649713719.122533] LustreError: 30035:0:(osp_object.c:594:osp_attr_get()) lquake-MDT0000-osp-MDT0005:osp_attr_get update error [0x20000000a:0x0:0x0]: rc = -5
2022-04-11 14:48:39 [1649713719.141533] LustreError: 30035:0:(llog_cat.c:447:llog_cat_close()) lquake-MDT0000-osp-MDT0005: failure destroying log during cleanup: rc = -5
2022-04-11 14:48:39 [1649713719.467531] Lustre: server umount lquake-MDT0005 complete &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="331657" author="ofaaland" created="Mon, 11 Apr 2022 23:34:52 +0000"  >&lt;p&gt;In all cases, the lod_sub_process_config() error was preceeded by a osp_disconnect() error, like this:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2022-04-11 14:46:58 [1649713618.877752] Lustre: Failing over lquake-MDT0004
2022-04-11 14:46:58 [1649713618.896752] LustreError: 11-0: lquake-MDT0001-osp-MDT0004: operation mds_disconnect to node 172.19.1.112@o2ib100 failed: rc = -107
2022-04-11 14:46:58 [1649713618.915752] Lustre: lquake-MDT000e-osp-MDT0004: Connection to lquake-MDT000e (at 172.19.1.125@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
2022-04-11 14:46:58 [1649713618.936752] LustreError: 5267:0:(osp_dev.c:485:osp_disconnect()) lquake-MDT0003-osp-MDT0004: can&apos;t disconnect: rc = -19
2022-04-11 14:46:58 [1649713618.950752] LustreError: 5267:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT0004-mdtlov: error cleaning up LOD index 3: cmd 0xcf031 : rc = -19 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="332299" author="tappro" created="Tue, 19 Apr 2022 09:54:43 +0000"  >&lt;p&gt;Olaf, I&apos;d propose to add several patches to your stack. I think the problem in general is that update llogs became quite big at some or several MDTs. There are series of patches to manage update llogs more gracefully (top to bottom):&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;&lt;a href=&quot;https://review.whamcloud.com/47011&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/47011&lt;/a&gt; - &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15645&quot; title=&quot;gap in recovery llog should not be a fatal error&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15645&quot;&gt;&lt;del&gt;LU-15645&lt;/del&gt;&lt;/a&gt; obdclass: llog to handle gaps&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://review.whamcloud.com/47010&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/47010&lt;/a&gt; - &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13195&quot; title=&quot;replay-single test_118: dt_declare_record_write() ASSERTION( dt-&amp;gt;do_body_ops ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13195&quot;&gt;&lt;del&gt;LU-13195&lt;/del&gt;&lt;/a&gt; osp: osp_send_update_req() should check generation&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://review.whamcloud.com/45847&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/45847&lt;/a&gt; - &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12577&quot; title=&quot;chlg_load failed to process llog -2 or -5 on client&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12577&quot;&gt;&lt;del&gt;LU-12577&lt;/del&gt;&lt;/a&gt; llog: protect partial updates from readers&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://review.whamcloud.com/46863&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/46863&lt;/a&gt; - &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13195&quot; title=&quot;replay-single test_118: dt_declare_record_write() ASSERTION( dt-&amp;gt;do_body_ops ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13195&quot;&gt;&lt;del&gt;LU-13195&lt;/del&gt;&lt;/a&gt; osp: invalidate object on write error&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://review.whamcloud.com/46873&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/46873&lt;/a&gt;&#160; - mostly to make the patch above apply cleanly&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;This might help the MDT process its update logs after all, if the problem is in them as I expect. Another patch that could also be useful concerns OSP_DISCONNECT handling between MDTs:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://review.whamcloud.com/44753&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44753&lt;/a&gt; - osp: do force disconnect if import is not ready&lt;/p&gt;

&lt;p&gt;Mike&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="42181" name="dk.1.jet16.txt.gz" size="293895" author="ofaaland" created="Thu, 3 Feb 2022 22:33:55 +0000"/>
                            <attachment id="42179" name="dk.3.txt.gz" size="5436819" author="ofaaland" created="Thu, 3 Feb 2022 20:40:54 +0000"/>
                            <attachment id="42182" name="dmesg.jet16.txt.gz" size="26942" author="ofaaland" created="Thu, 3 Feb 2022 22:33:54 +0000"/>
                            <attachment id="42180" name="dmesg.txt.gz" size="27663" author="ofaaland" created="Thu, 3 Feb 2022 20:40:44 +0000"/>
                            <attachment id="43107" name="lu-15517.llog_reader.out" size="335840" author="ofaaland" created="Thu, 7 Apr 2022 06:08:56 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i02h87:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>