<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:20:15 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
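For example, the field restriction described above can be sketched as a URL (a minimal illustration; the /si/jira.issueviews:issue-xml/ path is assumed from JIRA's usual XML issue view layout):

```shell
# Hypothetical sketch: build a request for only the issue key and summary.
# The path below is assumed; adjust it to your JIRA instance.
BASE="https://jira.whamcloud.com/si/jira.issueviews:issue-xml/LU-8753/LU-8753.xml"
URL="${BASE}?field=key&field=summary"
echo "$URL"
```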
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-8753] Recovery already passed deadline with DNE</title>
                <link>https://jira.whamcloud.com/browse/LU-8753</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;MDT&lt;span class=&quot;error&quot;&gt;&amp;#91;0-1,6-16&amp;#93;&lt;/span&gt; (decimal) have timed out of recovery; appx 1473 clients recovered, 1 evicted.&lt;br/&gt;
MDT&lt;span class=&quot;error&quot;&gt;&amp;#91;2-5&amp;#93;&lt;/span&gt; reach the timeout, and report in the log that recovery has hung and should be aborted.  After lctl abort_recovery, the nodes begin emitting large numbers of errors in the console log.  The nodes are up but mrsh into them hangs, as if they are too busy to service the mrsh session.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2016-10-15 15:49:40 [ 1088.878945] Lustre: lsh-MDT0002: Recovery already passed deadline 0:32, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
2016-10-15 15:49:40 [ 1088.899333] Lustre: Skipped 157 previous similar messages
2016-10-15 15:50:12 [ 1121.013380] Lustre: lsh-MDT0002: Recovery already passed deadline 1:04, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
2016-10-15 15:50:12 [ 1121.033744] Lustre: Skipped 735 previous similar messages

&amp;lt;ConMan&amp;gt; Console [zinc3] departed by &amp;lt;root@localhost&amp;gt; on pts/0 at 10-15 15:50.
2016-10-15 15:50:52 [ 1161.329645] LustreError: 38991:0:(mdt_handler.c:5737:mdt_iocontrol()) lsh-MDT0002: Aborting recovery for device
2016-10-15 15:50:52 [ 1161.341983] LustreError: 38991:0:(ldlm_lib.c:2565:target_stop_recovery_thread()) lsh-MDT0002: Aborting recovery
2016-10-15 15:50:52 [ 1161.343686] LustreError: 18435:0:(lod_dev.c:419:lod_sub_recovery_thread()) lsh-MDT0004-osp-MDT0002 getting update log failed: rc = -108
2016-10-15 15:50:52 [ 1161.377751] Lustre: 18461:0:(ldlm_lib.c:2014:target_recovery_overseer()) recovery is aborted, evict exports in recovery
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;The earliest such messages are:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2016-10-15 15:50:52 [ 1161.390842] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295056926 batchid = 35538 flags = 0 ops = 42 params = 32
2016-10-15 15:50:52 [ 1161.408040] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295056931 batchid = 35542 flags = 0 ops = 42 params = 32
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The last few are:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2016-10-15 15:52:11 [ 1240.343780] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064355 batchid = 39987 flags = 0 ops = 42 params = 32
2016-10-15 15:52:11 [ 1240.361375] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064356 batchid = 39999 flags = 0 ops = 42 params = 32
2016-10-15 15:52:11 [ 1240.378995] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064357 batchid = 40018 flags = 0 ops = 42 params = 32
2016-10-15 15:52:11 [ 1240.396579] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064358 batchid = 40011 flags = 0 ops = 42 params = 32
2016-10-15 15:52:11 [ 1240.414180] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064360 batchid = 40005 flags = 0 ops = 42 params = 32
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We have seen this type of behavior on multiple DNE filesystems. Also, is there any way to determine if these errors have been corrected, abandoned, etc?&lt;/p&gt;</description>
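The "Recovery already passed deadline M:SS" console lines quoted above encode how far past the recovery deadline the MDT is. As a minimal sketch (the line text is copied from this report; the sed/awk extraction is an editorial illustration, not part of the ticket), the overrun can be converted to seconds like this:

```shell
# Sketch: extract the "minutes:seconds past deadline" field from a
# console line like the ones quoted in this ticket.
line='Lustre: lsh-MDT0002: Recovery already passed deadline 1:04, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.'
over=$(printf '%s\n' "$line" | sed 's/.*passed deadline \([0-9]*:[0-9]*\),.*/\1/')
secs=$(printf '%s\n' "$over" | awk -F: '{ print $1 * 60 + $2 }')
echo "${secs}s past the recovery deadline"
```

When this overrun keeps growing, the reporter's next step, lctl abort_recovery on the stuck MDT, matches the advice printed in the message itself.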
                <environment>lustre-2.8.0_3.chaos-1.ch6.x86_64&lt;br/&gt;
16 MDTs</environment>
        <key id="41010">LU-8753</key>
            <summary>Recovery already passed deadline with DNE</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="laisiyao">Lai Siyao</assignee>
                                    <reporter username="dinatale2">Giuseppe Di Natale</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Mon, 24 Oct 2016 22:57:01 +0000</created>
                <updated>Wed, 19 Sep 2018 10:27:21 +0000</updated>
                            <resolved>Mon, 30 Jan 2017 19:24:08 +0000</resolved>
                                                    <fixVersion>Lustre 2.10.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>11</watches>
                                                                            <comments>
                            <comment id="170846" author="dinatale2" created="Mon, 24 Oct 2016 22:57:49 +0000"  >&lt;p&gt;Went ahead and linked &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7675&quot; title=&quot;replay-single test_101 times out after aborting recovery on mount of the mds1&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7675&quot;&gt;&lt;del&gt;LU-7675&lt;/del&gt;&lt;/a&gt;; it sounds like it may be related.&lt;/p&gt;</comment>
                            <comment id="170848" author="pjones" created="Mon, 24 Oct 2016 23:11:06 +0000"  >&lt;p&gt;Lai&lt;/p&gt;

&lt;p&gt;Could you please advise on this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="171061" author="ofaaland" created="Wed, 26 Oct 2016 01:53:11 +0000"  >&lt;p&gt;To give some sense of scale, zinc1 (which hosts the MGS and MDT0000) has encountered this after I aborted recovery.  So far 80,000 &quot;update_records.c:72:update_records_dump&quot; rows have been dumped, and it&apos;s not done yet.&lt;/p&gt;</comment>
                            <comment id="171062" author="ofaaland" created="Wed, 26 Oct 2016 01:57:34 +0000"  >&lt;p&gt;This is appearing every time, or almost every time, we abort recovery on an MDT.&lt;/p&gt;</comment>
                            <comment id="171099" author="laisiyao" created="Wed, 26 Oct 2016 14:03:29 +0000"  >&lt;p&gt;The log doesn&apos;t help much; it just tells us that DNE recovery can&apos;t finish, but not why. Could you upload all of the MDS syslogs? In the meantime I&apos;ll see how this can be improved.&lt;/p&gt;</comment>
                            <comment id="171197" author="ofaaland" created="Wed, 26 Oct 2016 15:54:48 +0000"  >&lt;p&gt;Hello Lai,&lt;/p&gt;

&lt;p&gt;Are there two separate issues here, or do they likely have the same root cause?  That is (a) recovery is hung and (b) whatever happens after abort_recovery, with all the update_records_dump() in the logs.&lt;/p&gt;

&lt;p&gt;If you think they are two issues, then I should make a separate ticket for the hung recovery.&lt;/p&gt;</comment>
                            <comment id="171207" author="di.wang" created="Wed, 26 Oct 2016 16:50:26 +0000"  >&lt;p&gt;Indeed, it looks like they are separate issues. Please create a new ticket for (b).&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Recovery already passed deadline 1:04, It is most likely due to DNE recovery is failed or stuck,
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is usually caused by one of the recovery threads failing to retrieve the update recovery logs or to replay those update records.  As Lai said, more console logs or a stack trace on lsh-MDT0002 will definitely help us figure out the issue here. Thanks.&lt;/p&gt;</comment>
                            <comment id="171212" author="ofaaland" created="Wed, 26 Oct 2016 17:00:48 +0000"  >&lt;p&gt;Di,&lt;/p&gt;

&lt;p&gt;OK, I&apos;ll upload more logs.  Can you change the summary line of this ticket to &quot;Recovery already passed deadline with DNE&quot; or similar?  I don&apos;t have perms to modify that field.&lt;/p&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="171420" author="ofaaland" created="Thu, 27 Oct 2016 18:46:20 +0000"  >&lt;p&gt;I needed to reboot a different lustre 2.8 cluster w/ DNE today and the issue occurs there as well.&lt;/p&gt;

&lt;p&gt;Providing logs from this instance instead of the earlier one, because I have both console and lctl dk output.&lt;/p&gt;</comment>
                            <comment id="171421" author="ofaaland" created="Thu, 27 Oct 2016 18:53:43 +0000"  >&lt;p&gt;lustre-related syslog for entire jet lustre cluster including the time span when recovery became stuck on jet7.&lt;/p&gt;

&lt;p&gt;Between the stuck-recovery issue and the LU-8763 issue, it&apos;s messy.  But you can see the stuck-recovery issue from a cold boot of all the MDS nodes.&lt;/p&gt;</comment>
                            <comment id="171423" author="ofaaland" created="Thu, 27 Oct 2016 19:04:43 +0000"  >&lt;p&gt;Please label this topllnl, thanks.&lt;/p&gt;</comment>
                            <comment id="171430" author="pjones" created="Thu, 27 Oct 2016 19:28:42 +0000"  >&lt;p&gt;Olaf&lt;/p&gt;

&lt;p&gt;I have done so but I am surprised that you could not do this yourself. If you confirm that this is the case then I will dig into your JIRA permissions&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="171432" author="ofaaland" created="Thu, 27 Oct 2016 19:39:41 +0000"  >&lt;p&gt;Hi Peter,&lt;/p&gt;

&lt;p&gt;That would be great, thanks.  I can add labels at the time I create a ticket, but I cannot change them after the fact.  I confirmed that on this ticket.&lt;/p&gt;</comment>
                            <comment id="171437" author="pjones" created="Thu, 27 Oct 2016 20:55:04 +0000"  >&lt;p&gt;Olaf&lt;/p&gt;

&lt;p&gt;I made a tweak to your permissions. I&apos;m not 100% certain it will help but give it another try&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="171439" author="ofaaland" created="Thu, 27 Oct 2016 20:58:10 +0000"  >&lt;p&gt;Peter,&lt;br/&gt;
That worked.  Thank you.&lt;br/&gt;
-Olaf&lt;/p&gt;</comment>
                            <comment id="171454" author="di.wang" created="Thu, 27 Oct 2016 22:13:40 +0000"  >&lt;p&gt;According to the debug log&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:02000000:10.0:1477593165.292609:0:143210:0:(import.c:1539:ptlrpc_import_recovery_state_machine()) lquake-MDT0006: Connection restored to lquake-MDT000b-mdtlov_UUID (at 172.19.1.122@o2ib100)
00000040:00020000:11.0:1477593165.661058:0:31346:0:(llog.c:591:llog_process_thread()) lquake-MDT000b-osp-MDT0006 retry remote llog process
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It looks like MDT0006 keeps trying to retrieve the update log from MDT000b (172.19.1.122@o2ib100), but keeps getting EIO.  Do you have the log on MDT000b? Thanks.&lt;/p&gt;</comment>
                            <comment id="171459" author="ofaaland" created="Thu, 27 Oct 2016 22:32:08 +0000"  >&lt;p&gt;Di,&lt;br/&gt;
That service would have been running on either jet7 or jet8.  Both those nodes have been rebooted since the instance you&apos;re looking at, so I can&apos;t get the lctl dk output anymore.&lt;br/&gt;
However, you should have the relevant console log output in lustre.log.gz.&lt;/p&gt;</comment>
                            <comment id="171489" author="di.wang" created="Fri, 28 Oct 2016 00:30:17 +0000"  >&lt;p&gt;Ah, according to the log, recovery on MDT000b (jet12) seems to have failed because of an invalid log record (hmm, I have not seen that in a long time).&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Oct 27 10:24:35 jet12 kernel: [  356.925961] LustreError: 34856:0:(llog_osd.c:940:llog_osd_next_block()) lquake-MDT0009-osp-MDT000b: invalid llog tail at log id 0x6:1073846872/0 offset 131072 last_rec idx 0 tail idx 809132920
Oct 27 10:24:35 jet12 kernel: [  356.947238] LustreError: 34856:0:(lod_dev.c:419:lod_sub_recovery_thread()) lquake-MDT0009-osp-MDT000b getting update log failed: rc = -22
Oct 27 10:24:35 jet12 kernel: LustreError: 34856:0:(llog_osd.c:940:llog_osd_next_block()) lquake-MDT0009-osp-MDT000b: invalid llog tail at log id 0x6:1073846872/0 offset 131072 last_rec idx 0 tail idx 809132920
Oct 27 10:24:35 jet12 kernel: LustreError: 34856:0:(lod_dev.c:419:lod_sub_recovery_thread()) lquake-MDT0009-osp-MDT000b getting update log failed: rc = -22
Oct 27 10:24:36 jet12 kernel: LustreError: 34858:0:(lod_dev.c:419:lod_sub_recovery_thread()) lquake-MDT000c-osp-MDT000b getting update log failed: rc = -108
Oct 27 10:24:36 jet12 kernel: LustreError: 34858:0:(lod_dev.c:419:lod_sub_recovery_thread()) Skipped 1 previous similar message
Oct 27 10:24:36 jet12 kernel: [  358.012215] LustreError: 34858:0:(lod_dev.c:419:lod_sub_recovery_thread()) lquake-MDT000c-osp-MDT000b getting update log failed: rc = -108
Oct 27 10:24:36 jet12 kernel: [  358.028270] LustreError: 34858:0:(lod_dev.c:419:lod_sub_recovery_thread()) Skipped 1 previous similar message
Oct 27 10:24:37 jet12 kernel: [  359.735327] Lustre: 34889:0:(ldlm_lib.c:2014:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Oct 27 10:24:37 jet12 kernel: [  359.748515] Lustre: 34889:0:(ldlm_lib.c:2014:target_recovery_overseer()) Skipped 2 previous similar messages
Oct 27 10:24:37 jet12 kernel: [  359.760557] LustreError: 34889:0:(update_records.c:72:update_records_dump()) master transno = 231932247340 batchid = 210453399132 flags = 0 ops = 154 params = 80
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;which caused recovery to get stuck on jet12, and in turn caused recovery on the other MDTs to get stuck. &lt;/p&gt;</comment>
                            <comment id="171495" author="di.wang" created="Fri, 28 Oct 2016 01:11:38 +0000"  >&lt;p&gt;Olaf: &lt;br/&gt;
I think you need to delete these update logs on all of the MDTs of lquake to avoid further problems.&lt;/p&gt;

&lt;p&gt;You can mount the MDT as ZFS, then delete update_log_dir and update_log:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;rm -rf /mnt/zfs/update_log*
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note: this will not destroy any data; it only deletes the uncommitted recovery logs, and since those update logs are corrupted they are useless anyway.&lt;/p&gt;

&lt;p&gt;I am trying to figure out why these update logs are corrupted, but there is not much information here. &lt;/p&gt;</comment>
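The cleanup described in the comment above can be sketched as a small shell helper (a hedged sketch: the /mnt/zfs mountpoint and the dataset name in the comments are assumptions, not taken from the ticket; later comments in this ticket show the recreation of these objects at next mount can itself fail if the OI still references the old objects):

```shell
# Sketch: remove the DNE update llogs from an MDT backing filesystem
# that has been mounted via ZPL. The mountpoint is an assumption.
purge_update_logs() {
    mnt="$1"
    # Refuse to run against a path that does not exist.
    [ -d "$mnt" ] || { echo "no such mountpoint: $mnt" >&2; return 1; }
    # update_log is a single file; update_log_dir holds the per-MDT
    # update llog files.
    rm -rf "$mnt"/update_log*
}

# Hypothetical invocation on a ZPL-mounted MDT dataset:
#   mount -t zfs lquake-mdt0/mdt0 /mnt/zfs
#   purge_update_logs /mnt/zfs
```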
                            <comment id="171641" author="ofaaland" created="Fri, 28 Oct 2016 20:26:56 +0000"  >&lt;p&gt;Di,&lt;br/&gt;
Might it have to do with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8569&quot; title=&quot;Sharded DNE directory full of files that don&amp;#39;t exist&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8569&quot;&gt;&lt;del&gt;LU-8569&lt;/del&gt;&lt;/a&gt;?&lt;/p&gt;</comment>
                            <comment id="171642" author="ofaaland" created="Fri, 28 Oct 2016 20:28:10 +0000"  >&lt;p&gt;I see now you are working on that ticket and already know about it.&lt;/p&gt;</comment>
                            <comment id="171704" author="ofaaland" created="Sat, 29 Oct 2016 01:13:32 +0000"  >&lt;p&gt;I tried deleting the update_log and update_log_dir on all the MDTs under ZFS.  I was able to delete update_log and the contents of update_log_dir, but not able to remove the directory update_log_dir itself.&lt;/p&gt;

&lt;p&gt;Attempting to mount MDT0000 fails with EEXIST, and the debug log reports the attempt to create the FID_SEQ_UPDATE_LOG fails:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2016-10-28 17:30:14.292693 00000020:00001000:5.0::0:143276:0:(local_storage.c:379:__local_file_create()) create new object [0x200000009:0x0:0x0]
2016-10-28 17:30:14.292694 00000020:00000001:5.0::0:143276:0:(local_storage.c:260:local_object_create()) Process entered
2016-10-28 17:30:14.292694 00080000:00000001:5.0::0:143276:0:(osd_object.c:1530:osd_object_create()) Process entered
2016-10-28 17:30:14.292716 00080000:00000010:5.0::0:143276:0:(osd_object.c:1236:__osd_attr_init()) kmalloced &apos;bulk&apos;: 520 at ffff887ef276f400.
2016-10-28 17:30:14.292719 00080000:00000010:5.0::0:143276:0:(osd_object.c:1268:__osd_attr_init()) kfreed &apos;bulk&apos;: 520 at ffff887ef276f400.
2016-10-28 17:30:14.292720 00080000:00000001:5.0::0:143276:0:(osd_oi.c:241:fid_is_on_ost()) Process entered
2016-10-28 17:30:14.292720 00080000:00000001:5.0::0:143276:0:(osd_oi.c:261:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0)
2016-10-28 17:30:14.292728 00080000:00000001:5.0::0:143276:0:(osd_object.c:1571:osd_object_create()) Process leaving via out (rc=18446744073709551599 : -17 : 0xffffffffffffffef)
2016-10-28 17:30:14.292729 00080000:00000001:5.0::0:143276:0:(osd_object.c:1604:osd_object_create()) Process leaving (rc=18446744073709551599 : -17 : ffffffffffffffef)
2016-10-28 17:30:14.292730 00000020:00000001:5.0::0:143276:0:(local_storage.c:264:local_object_create()) Process leaving (rc=18446744073709551599 : -17 : ffffffffffffffef)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Any suggestions?&lt;/p&gt;</comment>
                            <comment id="171706" author="ofaaland" created="Sat, 29 Oct 2016 01:28:29 +0000"  >&lt;p&gt;Renaming the directory doesn&apos;t change the behavior.&lt;/p&gt;</comment>
                            <comment id="171707" author="di.wang" created="Sat, 29 Oct 2016 02:38:04 +0000"  >&lt;p&gt;Hmm, it looks like the log file is created in a different way on ZFS. It seems to me that &quot;rm update_log&quot; only deletes the name; the real object is still there. Unfortunately, I am not a ZFS expert. Let me ask around.  Alex, any ideas? &lt;/p&gt;</comment>
                            <comment id="171708" author="di.wang" created="Sat, 29 Oct 2016 02:41:51 +0000"  >&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Might it have to do with LU-8569?
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;it might be indeed.&lt;/p&gt;</comment>
                            <comment id="171712" author="bzzz" created="Sat, 29 Oct 2016 07:31:27 +0000"  >&lt;p&gt;please check with the actual code:&lt;br/&gt;
	rc = -zap_add(osd-&amp;gt;od_os, zapid, buf, 8, 1, zde, oh-&amp;gt;ot_tx);&lt;br/&gt;
	if (rc)&lt;br/&gt;
		GOTO(out, rc);&lt;/p&gt;

&lt;p&gt;rm (direct mount and ZPL, right?) doesn&apos;t update OI.&lt;/p&gt;</comment>
                            <comment id="171730" author="di.wang" created="Mon, 31 Oct 2016 05:21:41 +0000"  >&lt;p&gt;Sigh, I have to deal with this in lustre. Let me make a patch. &lt;/p&gt;</comment>
                            <comment id="171732" author="gerrit" created="Mon, 31 Oct 2016 06:21:49 +0000"  >&lt;p&gt;wangdi (di.wang@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/23490&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/23490&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8753&quot; title=&quot;Recovery already passed deadline with DNE&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8753&quot;&gt;&lt;del&gt;LU-8753&lt;/del&gt;&lt;/a&gt; lod: delete update_log/update_log_dir&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 2af38532d9e11d098f5c765ac8a0263a943c240c&lt;/p&gt;</comment>
                            <comment id="171733" author="di.wang" created="Mon, 31 Oct 2016 06:23:31 +0000"  >&lt;p&gt;I just pushed the patch &lt;a href=&quot;http://review.whamcloud.com/23490&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/23490&lt;/a&gt;. I do not have ZFS environment to try on my side, Olaf, could you please try to see if this is helpful for your mount failure?&lt;/p&gt;</comment>
                            <comment id="172107" author="ofaaland" created="Thu, 3 Nov 2016 04:19:44 +0000"  >&lt;p&gt;Hi Di,&lt;/p&gt;

&lt;p&gt;I get the following:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2016-11-02 20:47:31 [ 3932.246998] Lustre: MGS: Logs for fs lquake were removed by user request.  All servers must be restarted in order to regenerate the logs.
2016-11-02 20:47:31 [ 3932.449165] LustreError: 79681:0:(osd_object.c:437:osd_object_init()) lquake-MDT0000: lookup [0x200000009:0x0:0x0]/0xa4 failed: rc = -2
2016-11-02 20:47:31 [ 3932.464118] LustreError: 79681:0:(obd_mount_server.c:1798:server_fill_super()) Unable to start targets: -2
2016-11-02 20:47:31 [ 3932.475878] Lustre: Failing over lquake-MDT0000
2016-11-02 20:47:32 [ 3932.568791] Lustre: server umount lquake-MDT0000 complete
2016-11-02 20:47:32 [ 3932.575517] LustreError: 79681:0:(obd_mount.c:1426:lustre_fill_super()) Unable to mount  (-2)

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that I initially started both the mgs and mdt0000 with -o writeconf. &lt;/p&gt;</comment>
                            <comment id="172123" author="di.wang" created="Thu, 3 Nov 2016 06:07:05 +0000"  >&lt;p&gt;Hi, Olaf&lt;/p&gt;

&lt;p&gt;Could you please retry and dump a -1 debug log?  Please be sure to remove update_log* under ZFS before retrying, thanks.&lt;/p&gt;</comment>
                            <comment id="172254" author="ofaaland" created="Fri, 4 Nov 2016 01:35:41 +0000"  >&lt;p&gt;Di,&lt;br/&gt;
Attached dk.jet1.1478223101.gz for the mount failure with your patch applied.&lt;/p&gt;</comment>
                            <comment id="172425" author="di.wang" created="Sat, 5 Nov 2016 02:44:23 +0000"  >&lt;p&gt;Olaf,&lt;br/&gt;
Could you please also delete the OI entry for update_log under ZFS, should be like &quot;/mnt/zfs/oi.9/0x200000009\:0x1\:0x0&quot;?  &lt;br/&gt;
Anyway please make sure both update_log and oi.9/0x200000009\:0x1\:0x0 are deleted. And also if update_log_dir can not be removed, let&apos;s restore it back.  If it still fails, please collect -1 debug log for me. Thanks.&lt;/p&gt;</comment>
                            <comment id="172659" author="gerrit" created="Mon, 7 Nov 2016 23:37:10 +0000"  >&lt;p&gt;wangdi (di.wang@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/23635&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/23635&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8753&quot; title=&quot;Recovery already passed deadline with DNE&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8753&quot;&gt;&lt;del&gt;LU-8753&lt;/del&gt;&lt;/a&gt; llog: add some debug information&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 615f878ceafb612ccbb868ad752ecd14458068dd&lt;/p&gt;</comment>
                            <comment id="172662" author="di.wang" created="Mon, 7 Nov 2016 23:42:44 +0000"  >&lt;p&gt;Olaf&lt;/p&gt;

&lt;p&gt;Do you still have that corrupted log file? It should be a regular file under update_log_dir, &quot;&lt;span class=&quot;error&quot;&gt;&amp;#91;0x40019a58:0x6:0x0&amp;#93;&lt;/span&gt;&quot;. If not, could you please add this debug patch &lt;a href=&quot;http://review.whamcloud.com/23635&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/23635&lt;/a&gt; and retry?  If the error happens, could you please post the console message and upload the corrupted log file? Thanks. &lt;/p&gt;</comment>
                            <comment id="172681" author="ofaaland" created="Tue, 8 Nov 2016 01:12:36 +0000"  >&lt;p&gt;Di,&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Could you please also delete the OI entry for update_log under ZFS, should be like &quot;/mnt/zfs/oi.9/0x200000009\:0x1\:0x0&quot;?&lt;br/&gt;
Anyway please make sure both update_log and oi.9/0x200000009\:0x1\:0x0 are deleted. And also if update_log_dir can not be removed, let&apos;s restore it back. If it still fails, please collect -1 debug log for me. Thanks.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Joe mounted the MDT0000 dataset via ZPL and found there is no entry with a name beginning &apos;0x200000009&apos;, neither under oi.9/ nor anywhere else.  update_log is deleted.  update_log_dir could not be deleted (it fails with &apos;not empty&apos; even though it appears to be empty).  It appears to be damaged:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;update_log_dir/[root@jet1:zfs]# ll -d update_*
drw-r--r-- 2 root root 18446744073709551140 Oct 28 13:40 update_log_dir
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Joe attempted the import with debug and subsystem_debug set to -1.  I&apos;ve attached the log, dk.jet1.1478565846.gz&lt;/p&gt;</comment>
                            <comment id="172686" author="ofaaland" created="Tue, 8 Nov 2016 01:46:49 +0000"  >&lt;p&gt;Di,&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Do you still have that corrupted log file? It should be a regular file under update_log_dir, &quot;&lt;span class=&quot;error&quot;&gt;&amp;#91;0x40019a58:0x6:0x0&amp;#93;&lt;/span&gt;&quot;.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;MDT000b update_log_dir contains file named &quot;&lt;span class=&quot;error&quot;&gt;&amp;#91;0x240019a58:0x6:0x0&amp;#93;&lt;/span&gt;&quot;.  It is attached in mdt0b.0x240019a58_0x6_0x0.tgz.&lt;/p&gt;

&lt;p&gt;Is that the correct one?&lt;/p&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Olaf&lt;/p&gt;

</comment>
                            <comment id="172690" author="di.wang" created="Tue, 8 Nov 2016 02:05:18 +0000"  >&lt;p&gt;Hmm, it looks like update_log &quot;0x200000009\:0x1\:0x0&quot; still exists in the system. &lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00080000:00000001:2.0:1478565782.638227:0:14104:0:(osd_oi.c:261:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:2.0:1478565782.638236:0:14104:0:(osd_object.c:1571:osd_object_create()) Process leaving via out (rc=18446744073709551599 : -17 : 0xffffffffffffffef)
00080000:00000001:2.0:1478565782.638237:0:14104:0:(osd_object.c:1604:osd_object_create()) Process leaving (rc=18446744073709551599 : -17 : ffffffffffffffef)
00000020:00000001:2.0:1478565782.638238:0:14104:0:(local_storage.c:264:local_object_create()) Process leaving (rc=18446744073709551599 : -17 : ffffffffffffffef)
00000020:00000001:2.0:1478565782.638238:0:14104:0:(local_storage.c:382:__local_file_create()) Process leaving via unlock (rc=18446744073709551599 : -17 : 0xffffffffffffffef)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Could you please tell me which source lines these log messages map to?  They do not seem to match my b2_8 version. Thanks.&lt;/p&gt;</comment>
                            <comment id="172692" author="di.wang" created="Tue, 8 Nov 2016 02:18:20 +0000"  >&lt;p&gt;Olaf, yes, that is the one, thanks.&lt;/p&gt;</comment>
                            <comment id="172700" author="di.wang" created="Tue, 8 Nov 2016 03:57:51 +0000"  >&lt;p&gt;Ah, Olaf&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lquake-MDT0009-osp-MDT000b: invalid llog tail at log id 0x6:1073846872/0 offset 131072 last_rec idx 0 tail idx 809132920
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The corrupted log is on MDT0009, not MDT000b; could you please check there? Thanks.&lt;/p&gt;</comment>
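The device name in the quoted error is what identifies which MDT actually holds the bad llog: OSP devices are named fsname-remoteTarget-osp-localMDT, so lquake-MDT0009-osp-MDT000b is MDT000b's connection to MDT0009. A minimal editorial sketch of splitting such a name, assuming that naming convention:

```shell
# Sketch: split an OSP device name of the form
#   <fsname>-<remoteMDT>-osp-<localMDT>
dev="lquake-MDT0009-osp-MDT000b"
fsname=${dev%%-*}       # lquake
rest=${dev#*-}          # MDT0009-osp-MDT000b
remote=${rest%%-*}      # MDT0009: the MDT that holds the update llog
local_mdt=${rest##*-}   # MDT000b: the MDT performing recovery
echo "$remote holds the llog read by $local_mdt"
```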
                            <comment id="172807" author="dinatale2" created="Tue, 8 Nov 2016 18:11:02 +0000"  >&lt;p&gt;Di,&lt;/p&gt;

&lt;p&gt;Below are snippets of code to help you identify where the log messages you posted are from.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00080000:00000001:2.0:1478565782.638227:0:14104:0:(osd_oi.c:261:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0)&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In lustre/osd-zfs/osd_oi.c:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;249 
250         rc = osd_fld_lookup(env, osd, fid_seq(fid), range);
251         if (rc != 0) {
252                 if (rc != -ENOENT)
253                         CERROR(&quot;%s: &quot;DFID&quot; lookup failed: rc = %d\n&quot;,
254                                osd_name(osd), PFID(fid), rc);
255                 RETURN(0);
256         }
257 
258         if (fld_range_is_ost(range))
259                 RETURN(1);
260 
261         RETURN(0);
262 }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00080000:00000001:2.0:1478565782.638236:0:14104:0:(osd_object.c:1571:osd_object_create()) Process leaving via out (rc=18446744073709551599 : -17 : 0xffffffffffffffef)&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In lustre/osd-zfs/osd_object.c:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;1563         zde-&amp;gt;zde_pad = 0;
1564         zde-&amp;gt;zde_dnode = db-&amp;gt;db_object;
1565         zde-&amp;gt;zde_type = IFTODT(attr-&amp;gt;la_mode &amp;amp; S_IFMT);
1566 
1567         zapid = osd_get_name_n_idx(env, osd, fid, buf);
1568 
1569         rc = -zap_add(osd-&amp;gt;od_os, zapid, buf, 8, 1, zde, oh-&amp;gt;ot_tx);
1570         if (rc)
1571                 GOTO(out, rc);
1572 
1573         /* Add new object to inode accounting.
1574          * Errors are not considered as fatal */
1575         rc = -zap_increment_int(osd-&amp;gt;od_os, osd-&amp;gt;od_iusr_oid,
1576                                 (attr-&amp;gt;la_valid &amp;amp; LA_UID) ? attr-&amp;gt;la_uid : 0, 1,
1577                                 oh-&amp;gt;ot_tx);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00080000:00000001:2.0:1478565782.638237:0:14104:0:(osd_object.c:1604:osd_object_create()) Process leaving (rc=18446744073709551599 : -17 : ffffffffffffffef)&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In lustre/osd-zfs/osd_object.c:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;1594         rc = osd_init_lma(env, obj, fid, oh);
1595         if (rc) {
1596                 CERROR(&quot;%s: can not set LMA on &quot;DFID&quot;: rc = %d\n&quot;,
1597                        osd-&amp;gt;od_svname, PFID(fid), rc);
1598                 /* ignore errors during LMA initialization */
1599                 rc = 0;
1600         }
1601 
1602 out:
1603         up_write(&amp;amp;obj-&amp;gt;oo_guard);
1604         RETURN(rc);
1605 }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000020:00000001:2.0:1478565782.638238:0:14104:0:(local_storage.c:264:local_object_create()) Process leaving (rc=18446744073709551599 : -17 : ffffffffffffffef)&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In lustre/obdclass/local_storage.c:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; 251 int local_object_create(const struct lu_env *env,
 252                         struct local_oid_storage *los,
 253                         struct dt_object *o, struct lu_attr *attr,
 254                         struct dt_object_format *dof, struct thandle *th)
 255 {
 256         struct dt_thread_info   *dti = dt_info(env);
 257         u64                      lastid;
 258         int                      rc;
 259 
 260         ENTRY;
 261 
 262         rc = dt_create(env, o, attr, NULL, dof, th);
 263         if (rc)
 264                 RETURN(rc);
 265 
 266         if (los == NULL)
 267                 RETURN(rc);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000020:00000001:2.0:1478565782.638238:0:14104:0:(local_storage.c:382:__local_file_create()) Process leaving via unlock (rc=18446744073709551599 : -17 : 0xffffffffffffffef)&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In lustre/obdclass/local_storage.c:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; 378         CDEBUG(D_OTHER, &quot;create new object &quot;DFID&quot;\n&quot;,
 379                PFID(lu_object_fid(&amp;amp;dto-&amp;gt;do_lu)));
 380         rc = local_object_create(env, los, dto, attr, dof, th);
 381         if (rc)
 382                 GOTO(unlock, rc);
 383         LASSERT(dt_object_exists(dto));
 384 
 385         if (dti-&amp;gt;dti_dof.dof_type == DFT_DIR) {
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="172808" author="dinatale2" created="Tue, 8 Nov 2016 18:12:30 +0000"  >&lt;p&gt;I&apos;m also hunting for the corrupted log on MDT0009.&lt;/p&gt;</comment>
                            <comment id="172851" author="dinatale2" created="Tue, 8 Nov 2016 22:02:15 +0000"  >&lt;p&gt;I went ahead and attached the log file from mdt0009. File name is mdt09.0x240019a58_0x6_0x0.tgz&lt;/p&gt;</comment>
                            <comment id="172861" author="di.wang" created="Tue, 8 Nov 2016 23:27:46 +0000"  >&lt;p&gt;Thanks, Giuseppe. I will check.&lt;/p&gt;</comment>
                            <comment id="173238" author="di.wang" created="Fri, 11 Nov 2016 01:07:59 +0000"  >&lt;p&gt;Hello, Giuseppe, mdt09.0x240019a58_0x6_0x0.tgz and mdt0b.0x240019a58_0x6_0x0.tgz seem to be the same; could you please check? Thanks.&lt;/p&gt;</comment>
                            <comment id="173498" author="ofaaland" created="Mon, 14 Nov 2016 18:25:14 +0000"  >&lt;p&gt;Di,&lt;/p&gt;

&lt;p&gt;Attaching logs.2016-11-14.tgz which includes:&lt;/p&gt;

&lt;p&gt;mnt/lu-8753/oi.88/0x240019a58:0x6:0x0&lt;br/&gt;
mnt/lu-8753/update_log_dir/&lt;span class=&quot;error&quot;&gt;&amp;#91;0x240019a58:0x6:0x0&amp;#93;&lt;/span&gt;&lt;br/&gt;
mnt/lu-8753/oi.10/0x20000000a:0x9:0x0/&lt;span class=&quot;error&quot;&gt;&amp;#91;0x240019a58:0x6:0x0&amp;#93;&lt;/span&gt;&lt;br/&gt;
mnt/lu-8753/CONFIGS/&lt;br/&gt;
mnt/lu-8753/CONFIGS/lquake-MDT0009&lt;br/&gt;
mnt/lu-8753/CONFIGS/params&lt;br/&gt;
mnt/lu-8753/CONFIGS/lquake-client&lt;/p&gt;

&lt;p&gt;The CONFIGS directory is there to document which target&apos;s dataset these files came from.&lt;/p&gt;

&lt;p&gt;Note that the oi.88, oi.10, and update_log_dir entries all refer to the same underlying object/file.&lt;/p&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="173554" author="di.wang" created="Tue, 15 Nov 2016 00:03:35 +0000"  >&lt;p&gt;Thanks Olaf, though the file still seems to be the same. This multi-link llog file does not look right to me, especially with links under both oi.88 and oi.10. Hmm, in llog_osd_create() it should not do llog_osd_regular_fid_add_name_entry() for ZFS.&lt;/p&gt;

&lt;p&gt;So either the OSD can tell which FIDs are for llogs (remember these normal llog FIDs inside the OSD), or llog_osd can tell whether it is on ldiskfs or ZFS. Alex, any suggestion here?&lt;/p&gt;

&lt;p&gt;Hmm, it looks like we can pass the llog type to the OSD via dt_object_format, remember the llog type inside the LMA, and then move those llog_osd_regular_fid_add_name_entry() calls into osd-ldiskfs. I will cook a patch.&lt;/p&gt;</comment>
                            <comment id="173559" author="ofaaland" created="Tue, 15 Nov 2016 01:49:38 +0000"  >&lt;p&gt;Thanks Di.&lt;br/&gt;
I double checked and there&apos;s no object with 40019a58 in the name on MDT000b.  I think the original file I submitted actually was from MDT0009, and I mislabeled it; the pools for MDT0009 and MDT000b reside on drives that are in the same enclosure and visible from the same set of nodes.  Sorry for the confusion.&lt;/p&gt;

&lt;p&gt;I am sure of the provenance of the one I submitted today because of the way I created the tar file.&lt;/p&gt;</comment>
                            <comment id="173560" author="di.wang" created="Tue, 15 Nov 2016 01:53:38 +0000"  >&lt;p&gt;I checked the llog file you posted here. Unfortunately this is not the corrupted llog file. I am cooking a patch for this multiple link llog issue.&lt;/p&gt;</comment>
                            <comment id="173632" author="ofaaland" created="Tue, 15 Nov 2016 16:25:12 +0000"  >&lt;p&gt;Di,&lt;br/&gt;
1) With regards to the problem of the update llog getting corrupted, we have no more places to look to determine the root cause, right?&lt;br/&gt;
2) If I&apos;m able to get the filesystem back up again, and reproduce the problem, do you have advice on how to capture the data we need?&lt;/p&gt;</comment>
                            <comment id="173637" author="ofaaland" created="Tue, 15 Nov 2016 16:37:24 +0000"  >&lt;p&gt;Di,&lt;br/&gt;
I created ticket &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8833&quot; title=&quot;too many links to update llogs on zfs&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8833&quot;&gt;&lt;del&gt;LU-8833&lt;/del&gt;&lt;/a&gt; for the patch to correct the number of links to the update llogs under zfs.&lt;/p&gt;</comment>
                            <comment id="173676" author="di.wang" created="Tue, 15 Nov 2016 18:48:28 +0000"  >&lt;p&gt;Hello, Olaf, &lt;/p&gt;

&lt;p&gt;Yes, if we cannot get the corrupted log file, it is hard to find the root cause here. If you can reproduce the issue, please apply the patch &lt;a href=&quot;http://review.whamcloud.com/#/c/23635/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/23635/&lt;/a&gt; and also collect the corrupted log file. Thanks.&lt;/p&gt;

&lt;p&gt;Btw: were you able to delete those update_log files in the end, or will you reformat the system (sorry about that &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/sad.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;)? Thanks.&lt;/p&gt;</comment>
                            <comment id="174066" author="ofaaland" created="Thu, 17 Nov 2016 16:29:35 +0000"  >&lt;p&gt;Hi Di, I&apos;ll apply your patch and attempt to reproduce.  I wasn&apos;t able to delete those files yet; I&apos;m going to spend a little longer looking into it.  Thanks.&lt;/p&gt;</comment>
                            <comment id="175213" author="di.wang" created="Mon, 28 Nov 2016 17:32:49 +0000"  >&lt;p&gt;Olaf: any good news? &lt;/p&gt;</comment>
                            <comment id="175274" author="ofaaland" created="Mon, 28 Nov 2016 20:41:25 +0000"  >&lt;p&gt;Di,&lt;br/&gt;
No new data yet.&lt;/p&gt;</comment>
                            <comment id="175305" author="ofaaland" created="Mon, 28 Nov 2016 23:19:09 +0000"  >&lt;p&gt;Di,&lt;br/&gt;
I encountered the invalid llog tail issue after lunch today.  I&apos;ll attach logs momentarily.&lt;/p&gt;</comment>
                            <comment id="175311" author="ofaaland" created="Mon, 28 Nov 2016 23:41:10 +0000"  >&lt;p&gt;Di,&lt;/p&gt;

&lt;p&gt;Confirm this is the right llog and I&apos;ll upload it.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[faaland1@zinci syslog]$grep invalid *
console.zinc1:2016-11-28 15:09:41 [1123885.603795] LustreError: 96133:0:(llog_osd.c:954:llog_osd_next_block()) lsh-MDT000c-osp-MDT0000: invalid llog tail at log id 0x1:7024/0 offset 70320128 bytes 32768

[root@zinc13:update_log_dir]# ls ../CONFIGS/
lsh-MDT000c  lsh-client  param

[root@zinc13:update_log_dir]# pwd
/mnt/lu-8753/update_log_dir

[root@zinc13:update_log_dir]# ll *1b70*
-rw-r--r-- 1 root root 71419576 Dec 31  1969 [0x500001b70:0x1:0x0]
-rw-r--r-- 1 root root        0 Dec 31  1969 [0x500001b70:0x2:0x0]
[root@zinc13:update_log_dir]# llog_reader &apos;[0x500001b70:0x1:0x0]&apos; | less
[root@zinc13:update_log_dir]# llog_reader &apos;[0x500001b70:0x1:0x0]&apos; | grep -C 10 70320128
rec #8910 type=106a0000 len=8536 offset 70238896
Bit 8911 of 6687 not set
rec #8912 type=106a0000 len=8536 offset 70254592
rec #8913 type=106a0000 len=8536 offset 70263128
rec #8914 type=106a0000 len=8536 offset 70271664
Bit 8915 of 6687 not set
rec #8916 type=106a0000 len=8536 offset 70287360
rec #8917 type=106a0000 len=8536 offset 70295896
rec #8918 type=106a0000 len=8536 offset 70304432
off 70312968 skip 7160 to next chunk.
off 70320128 skip 32768 to next chunk.
rec #8924 type=106a0000 len=9056 offset 70352896
rec #8925 type=106a0000 len=9056 offset 70361952
rec #8926 type=106a0000 len=9048 offset 70371008
Bit 8927 of 6687 not set
rec #8928 type=106a0000 len=9056 offset 70385664
rec #8929 type=106a0000 len=9056 offset 70394720
rec #8930 type=106a0000 len=9056 offset 70403776
Bit 8931 of 6687 not set
rec #8932 type=106a0000 len=9056 offset 70418432
rec #8933 type=106a0000 len=9048 offset 70427488
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="175321" author="ofaaland" created="Mon, 28 Nov 2016 23:59:06 +0000"  >&lt;p&gt;Attached console logs and a text file listing which target is running on which node, for the most recent instance of the issue on Zinc.&lt;/p&gt;

&lt;p&gt;target_to_node_map.nov28.txt&lt;br/&gt;
console_logs.nov28.tgz&lt;/p&gt;</comment>
                            <comment id="175324" author="ofaaland" created="Tue, 29 Nov 2016 00:06:17 +0000"  >&lt;p&gt;Attached Lustre debug logs for the zinc target which is still in recovery (zinc7/lsh-MDT0006), and for the target holding the llog with the invalid tail (zinc13/lsh-MDT000c).  Also attached what I believe is the relevant llog file from MDT000c.&lt;/p&gt;

&lt;p&gt;dk.zinc7.1480375634.gz&lt;br/&gt;
dk.zinc13.1480375634.gz&lt;br/&gt;
lsh-mdt000c-1b70.nov28.tgz&lt;/p&gt;</comment>
                            <comment id="175327" author="ofaaland" created="Tue, 29 Nov 2016 00:20:43 +0000"  >&lt;p&gt;Attached lustre debug log for the node reporting the invalid tail.&lt;/p&gt;

&lt;p&gt;dk.zinc1.1480375634.gz&lt;/p&gt;</comment>
                            <comment id="175328" author="ofaaland" created="Tue, 29 Nov 2016 00:23:18 +0000"  >&lt;p&gt;I will leave the file system as-is (down for users, due to the MDT stuck in recovery) until you have all the information you need.&lt;/p&gt;</comment>
                            <comment id="175331" author="di.wang" created="Tue, 29 Nov 2016 00:39:20 +0000"  >&lt;p&gt;Thanks, Olaf. It looks correct, I will check right away.&lt;/p&gt;</comment>
                            <comment id="175353" author="di.wang" created="Tue, 29 Nov 2016 07:07:13 +0000"  >&lt;p&gt;According to the dump of this update log, it looks like there is a hole from 70312968 to 70320128 (all zeroes), which causes the &quot;corruption&quot;.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;master transno = 17237427556 batchid = 12884938203 flags = 0 ops = 172 params = 84 rec_len 8536
rec #8916 type=106a0000 len=8536 offset 70287360, total 6584
offset 70295896 index 8917 type 106a0000
master transno = 17237427557 batchid = 12884938202 flags = 0 ops = 172 params = 84 rec_len 8536
rec #8917 type=106a0000 len=8536 offset 70295896, total 6585
offset 70304432 index 8918 type 106a0000
master transno = 17237427805 batchid = 12884938204 flags = 0 ops = 172 params = 84 rec_len 8536
rec #8918 type=106a0000 len=8536 offset 70304432, total 6586


offset 70312968 index 0 type 0
master transno = 0 batchid = 0 flags = 0 ops = 0 params = 0 rec_len 0
off 70312968 skip 7160 to next chunk. test_bit yes
offset 70320128 index 0 type 0
master transno = 0 batchid = 0 flags = 0 ops = 0 params = 0 rec_len 0
off 70320128 skip 32768 to next chunk. test_bit yes

offset 70352896 index 8924 type 106a0000
master transno = 17286373092 batchid = 12884981614 flags = 0 ops = 189 params = 86 rec_len 9056
rec #8924 type=106a0000 len=9056 offset 70352896, total 6587
offset 70361952 index 8925 type 106a0000
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It looks like lgh_write_offset is not being reset during recovery, but this needs further investigation. Thanks.&lt;/p&gt;</comment>
                            <comment id="175445" author="gerrit" created="Tue, 29 Nov 2016 18:11:49 +0000"  >&lt;p&gt;wangdi (di.wang@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/24008&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/24008&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8753&quot; title=&quot;Recovery already passed deadline with DNE&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8753&quot;&gt;&lt;del&gt;LU-8753&lt;/del&gt;&lt;/a&gt; llog: remove lgh_write_offset&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 9f73a3f56778dd72feb802fc80d58c55dd3d4088&lt;/p&gt;</comment>
                            <comment id="175447" author="di.wang" created="Tue, 29 Nov 2016 18:14:17 +0000"  >&lt;p&gt;Olaf: please enable D_HA and also collect the corrupt llog file, in case this patch does not help. Thanks.&lt;/p&gt;</comment>
                            <comment id="175526" author="ofaaland" created="Tue, 29 Nov 2016 21:41:57 +0000"  >&lt;p&gt;Hi Di,&lt;/p&gt;

&lt;p&gt;Disregard my message about not having lgh_write_offset.   Not sure what I was looking at, it&apos;s there.&lt;/p&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="175534" author="di.wang" created="Tue, 29 Nov 2016 22:02:45 +0000"  >&lt;p&gt;Btw: you probably need to delete the update_logs or reformat the system, since the update log is already corrupted, which will cause recovery to get stuck anyway. Sorry about that.&lt;/p&gt;</comment>
                            <comment id="176409" author="di.wang" created="Sun, 4 Dec 2016 16:32:29 +0000"  >&lt;p&gt;Hi, Olaf, how is the testing going?&lt;/p&gt;</comment>
                            <comment id="176517" author="ofaaland" created="Mon, 5 Dec 2016 18:20:00 +0000"  >&lt;p&gt;Hi Di,&lt;br/&gt;
Nothing to report yet.  We reformatted the file system but I am still working on getting updated RPMs on there.  Thanks for checking.  Hopefully we&apos;ll get it going by tomorrow and reproduce.&lt;/p&gt;</comment>
                            <comment id="176893" author="dinatale2" created="Wed, 7 Dec 2016 18:08:58 +0000"  >&lt;p&gt;Di,&lt;/p&gt;

&lt;p&gt;We couldn&apos;t remove update_log_dir through the ZPL because of an issue with ZAP size bookkeeping. I went ahead and filed &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8916&quot; title=&quot;ods-zfs doesn&amp;#39;t manage ZAP sizes correctly&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8916&quot;&gt;&lt;del&gt;LU-8916&lt;/del&gt;&lt;/a&gt; for that issue.&lt;/p&gt;</comment>
                            <comment id="176902" author="ofaaland" created="Wed, 7 Dec 2016 18:28:09 +0000"  >&lt;p&gt;Updated Lustre on the test filesystem jet/lquake with the patch to remove lgh_write_offset.  Will attempt to reproduce today.&lt;/p&gt;</comment>
                            <comment id="177759" author="ofaaland" created="Wed, 14 Dec 2016 20:07:56 +0000"  >&lt;p&gt;Di,&lt;/p&gt;

&lt;p&gt;With the lgh_write_offset patch recovery seems to complete more reliably.  I&apos;m now seeing new symptoms, and I don&apos;t know if they are a consequence of the patch or a separate problem that was hidden by the recovery issue.  I&apos;ll summarize what I am seeing here; please tell me if you think it&apos;s a separate issue that should get a separate ticket.&lt;/p&gt;

&lt;p&gt;I&apos;d created several directories striped across all 16 MDTs.  Within each such directory, a node was running mdtest and creating directories and files and deleting them.  I&apos;m not sure which phase it was in at the time I powered off 4 of the 16 MDS nodes (MDT0008 through MDT000b).  I then powered them on again.  During recovery, the server logs show:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2016-12-13 14:48:46 [  241.954091] Lustre: lquake-MDT0009: Client 987b4351-b459-8f25-89f9-59fa31045f6b (at 192.168.128.32@o2ib18) reconnecting, waiting for 31 clients in recovery for 3:58
2016-12-13 14:48:46 [  241.972729] Lustre: Skipped 7 previous similar messages
2016-12-13 14:48:46 [  241.979849] LustreError: 49706:0:(ldlm_lib.c:2751:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff883ee4685400 x1553274421557352/t0(90194378184) o36-&amp;gt;987b4351-b459-8f25-89f9-59fa31045f6b@192.168.128.32@o2ib18:7/0 lens 624/0 e 0 to 0 dl 1481669387 ref 2 fl Interpret:/6/ffffffff rc 0/-1
2016-12-13 14:48:46 [  242.013679] LustreError: 49706:0:(ldlm_lib.c:2751:target_queue_recovery_request()) Skipped 7 previous similar messages
2016-12-13 14:48:50 [  245.274867] LustreError: 37287:0:(ldlm_lib.c:1903:check_for_next_transno()) lquake-MDT0009: waking for gap in transno, VBR is OFF (skip: 90194377997, ql: 26, comp: 5, conn: 31, next: 90194377998, next_update 90194378030 last_committed: 90194377997)

&amp;lt;&amp;lt;&amp;lt; transno messages redacted &amp;gt;&amp;gt;&amp;gt;

2016-12-13 14:48:55 [  246.868005] Lustre: lquake-MDT0009: Recovery over after 1:39, of 31 clients 31 recovered and 0 were evicted.
2016-12-13 14:48:55 [  251.145347] sched: RT throttling activated
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then there seems to be some inter-MDT issue:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2016-12-13 14:53:16 [  512.145169] Lustre: 36488:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1481669506/real 1481669506]  req@ffff883ef5309b00 x1553642790382996/t0(0) o400-&amp;gt;lquake-MDT000a-osp-MDT0009@172.19.1.121@o2ib100:24/4 lens 224/224 e 2 to 1 dl 1481669596 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
2016-12-13 14:54:46 [  602.183013] Lustre: 36488:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1481669596/real 1481669596]  req@ffff883ee54c6f00 x1553642790383484/t0(0) o400-&amp;gt;lquake-MDT000a-osp-MDT0009@172.19.1.121@o2ib100:24/4 lens 224/224 e 2 to 1 dl 1481669686 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
2016-12-13 14:56:16 [  692.220902] Lustre: 36488:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1481669686/real 1481669686]  req@ffff883edf31f800 x1553642790384036/t0(0) o400-&amp;gt;lquake-MDT000a-osp-MDT0009@172.19.1.121@o2ib100:24/4 lens 224/224 e 2 to 1 dl 1481669776 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
2016-12-13 14:56:23 [  698.815522] Lustre: lquake-MDT000a-osp-MDT0009: Connection restored to 172.19.1.121@o2ib100 (at 172.19.1.121@o2ib100)
2016-12-13 14:56:23 [  698.828569] Lustre: Skipped 23 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And clients seem to get evicted as a result of that later inter-MDT issue: &lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2016-12-13 14:53:16 [352250.742120]   req@ffff880fe4f8dd00 x1553274427352992/t68719541671(68719541671) o36-&amp;gt;lquake-MDT000a-mdc-ffff88103d075000@172.19.1.121@o2ib100:12/10 lens 624/416 e 13 to 1 dl 1481669596 ref 2 fl Interpret:EX/6/ffffffff rc -110/-1
2016-12-13 14:54:46 [352340.778326] LustreError: 8274:0:(client.c:2874:ptlrpc_replay_interpret()) @@@ request replay timed out.
2016-12-13 14:54:46 [352340.778326]   req@ffff880fe4f8dd00 x1553274427352992/t68719541671(68719541671) o36-&amp;gt;lquake-MDT000a-mdc-ffff88103d075000@172.19.1.121@o2ib100:12/10 lens 624/416 e 16 to 1 dl 1481669686 ref 2 fl Interpret:EX/6/ffffffff rc -110/-1
2016-12-13 14:56:13 [352428.151175] LustreError: 11-0: lquake-MDT000a-mdc-ffff88103d075000: operation mds_reint to node 172.19.1.121@o2ib100 failed: rc = -107
2016-12-13 14:56:42 [352456.923731] LustreError: 167-0: lquake-MDT000a-mdc-ffff88103d075000: This client was evicted by lquake-MDT000a; in progress operations using this service will fail.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After the 14:56 message, the LustreError messages stop, and listings, rmdirs, and unlinks seem to work, but ...&lt;/p&gt;

&lt;p&gt;There are damaged directories.  Long listings of certain directories produce this:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lstat(&quot;dir.mdtest.1.173&quot;, 0x61d930)     = -1 ENOENT (No such file or directory)
write(2, &quot;ls: &quot;, 4ls: )                     = 4
write(2, &quot;cannot access dir.mdtest.1.173&quot;, 30cannot access dir.mdtest.1.173) = 30
write(2, &quot;: No such file or directory&quot;, 27: No such file or directory) = 27
write(2, &quot;\n&quot;, 1
)                       = 1
lstat(&quot;dir.mdtest.4.170&quot;, 0x61d9f0)     = -1 ENOENT (No such file or directory)
write(2, &quot;ls: &quot;, 4ls: )                     = 4
write(2, &quot;cannot access dir.mdtest.4.170&quot;, 30cannot access dir.mdtest.4.170) = 30
write(2, &quot;: No such file or directory&quot;, 27: No such file or directory) = 27
write(2, &quot;\n&quot;, 1
)                       = 1
lstat(&quot;dir.mdtest.1.172&quot;, 0x61dab0)     = -1 ENOENT (No such file or directory)
write(2, &quot;ls: &quot;, 4ls: )                     = 4
write(2, &quot;cannot access dir.mdtest.1.172&quot;, 30cannot access dir.mdtest.1.172) = 30
write(2, &quot;: No such file or directory&quot;, 27: No such file or directory) = 27
write(2, &quot;\n&quot;, 1
)                       = 1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and client console log messages like this:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2016-12-13 15:42:12 [355186.720441] LustreError: 159150:0:(llite_lib.c:2309:ll_prep_inode()) new_inode -fatal: rc -2
2016-12-13 15:44:33 [355328.281847] LustreError: 159156:0:(llite_lib.c:2309:ll_prep_inode()) new_inode -fatal: rc -2
2016-12-13 15:44:33 [355328.292937] LustreError: 159156:0:(llite_lib.c:2309:ll_prep_inode()) Skipped 23 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="177772" author="di.wang" created="Wed, 14 Dec 2016 21:24:43 +0000"  >&lt;p&gt;Thanks for the update, Olaf.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2016-12-13 14:48:46 [  241.954091] Lustre: lquake-MDT0009: Client 987b4351-b459-8f25-89f9-59fa31045f6b (at 192.168.128.32@o2ib18) reconnecting, waiting for 31 clients in recovery for 3:58
2016-12-13 14:48:46 [  241.972729] Lustre: Skipped 7 previous similar messages
2016-12-13 14:48:46 [  241.979849] LustreError: 49706:0:(ldlm_lib.c:2751:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff883ee4685400 x1553274421557352/t0(90194378184) o36-&amp;gt;987b4351-b459-8f25-89f9-59fa31045f6b@192.168.128.32@o2ib18:7/0 lens 624/0 e 0 to 0 dl 1481669387 ref 2 fl Interpret:/6/ffffffff rc 0/-1
2016-12-13 14:48:46 [  242.013679] LustreError: 49706:0:(ldlm_lib.c:2751:target_queue_recovery_request()) Skipped 7 previous similar messages
2016-12-13 14:48:50 [  245.274867] LustreError: 37287:0:(ldlm_lib.c:1903:check_for_next_transno()) lquake-MDT0009: waking for gap in transno, VBR is OFF (skip: 90194377997, ql: 26, comp: 5, conn: 31, next: 90194377998, next_update 90194378030 last_committed: 90194377997)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is OK; the recovery looks good according to the following messages. Actually this &quot;...waking for gap...&quot; message was removed from the console log in 2.9; see &lt;a href=&quot;https://review.whamcloud.com/#/c/21418/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/21418/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Those inter-MDT request timeout messages seem to suggest MDT000a was stuck or dead for some reason. Do you have the console logs or a stack trace from MDT000a when this happened?&lt;/p&gt;


&lt;p&gt;The damaged directories seem to be a bigger issue, but they are unlikely to be caused by the lgh_write_offset patch; could you please create a new ticket? Also, please collect a -1 debug log when you run find on these directories; I want to know whether this is caused by damaged stripes or by dangling entries. Thanks.&lt;/p&gt;</comment>
                            <comment id="177775" author="ofaaland" created="Wed, 14 Dec 2016 21:42:32 +0000"  >&lt;p&gt;Attached console log from jet11 which was running MDT000a, called console.jet11.2016-12-13-14-47.  This is from the lustre startup on 2016-12-13 at 14:47.&lt;/p&gt;</comment>
                            <comment id="177786" author="di.wang" created="Wed, 14 Dec 2016 23:30:58 +0000"  >&lt;p&gt;It looks like recovery got stuck on MDT000a, and recovery was also aborted somehow because of this timeout.  It seems 16 clients did not send replay requests to the MDS?   Hmm, did you manually abort the recovery? Otherwise this should not happen.  Do you have a Lustre debug log for this node?&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2016-12-13 14:53:57 [  550.240713] Lustre: lquake-MDT000a: Client lquake-MDT000b-mdtlov_UUID (at 172.19.1.122@o2ib100) reconnecting, waiting for 31 clients in recovery for 0:45
2016-12-13 14:53:57 [  550.258291] Lustre: Skipped 46 previous similar messages
2016-12-13 14:53:59 [  552.241660] LustreError: 33918:0:(ldlm_lib.c:2751:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff883efd628850 x1553641484060400/t0(68719540833) o1000-&amp;gt;lquake-MDT000f-mdtlov_UUID@172.19.1.126@o2ib100:320/0 lens 344/0 e 0 to 0 dl 1481669700 ref 2 fl Interpret:/6/ffffffff rc 0/-1
2016-12-13 14:54:00 [  552.274899] LustreError: 33918:0:(ldlm_lib.c:2751:target_queue_recovery_request()) Skipped 44 previous similar messages
2016-12-13 14:54:44 [  597.107557] Lustre: lquake-MDT000a: Recovery already passed deadline 0:01, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
2016-12-13 14:54:45 [  598.108926] Lustre: lquake-MDT000a: Recovery already passed deadline 0:02, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
2016-12-13 14:54:45 [  598.129849] Lustre: Skipped 2 previous similar messages
2016-12-13 14:54:46 [  599.211411] Lustre: lquake-MDT000a: Recovery already passed deadline 0:03, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
2016-12-13 14:54:46 [  599.232239] Lustre: Skipped 4 previous similar messages
2016-12-13 14:54:48 [  601.220136] Lustre: lquake-MDT000a: Recovery already passed deadline 0:05, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And then the update replay was aborted, which might cause filesystem inconsistency. &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/sad.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;  This might explain those damaged directory entries.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2016-12-13 14:56:13 [  685.746427] Lustre: 34667:0:(ldlm_lib.c:1589:abort_req_replay_queue()) @@@ aborted:  req@ffff887f2c41d850 x1553641484047916/t0(68719540823) o1000-&amp;gt;lquake-MDT0006-mdtlov_UUID@172.19.1.117@o2ib100:410/0 lens 344/0 e 17 to 0 dl 1481669790 ref 1 fl Complete:/4/ffffffff rc 0/-1
2016-12-13 14:56:13 [  685.793338] Lustre: lquake-MDT000a: Denying connection for new client f25a5bc2-93d4-c9d5-aae8-bd315848b7ea(at 192.168.128.24@o2ib18), waiting for 31 known clients (3 recovered, 12 in progress, and 16 evicted) to recover in 21188503:49
2016-12-13 14:56:13 [  685.818693] Lustre: Skipped 15 previous similar messages
2016-12-13 14:56:13 [  686.254147] Lustre: 34667:0:(ldlm_lib.c:1589:abort_req_replay_queue()) @@@ aborted:  req@ffff887f333f0050 x1553641484047916/t0(68719540823) o1000-&amp;gt;lquake-MDT0006-mdtlov_UUID@172.19.1.117@o2ib100:454/0 lens 344/0 e 0 to 0 dl 1481669834 ref 1 fl Complete:/6/ffffffff rc 0/-1
2016-12-13 14:56:13 [  686.284145] Lustre: 34667:0:(ldlm_lib.c:1589:abort_req_replay_queue()) Skipped 971 previous similar messages
2016-12-13 14:56:22 [  695.215436] Lustre: 34667:0:(ldlm_lib.c:1589:abort_req_replay_queue()) @@@ aborted:  req@ffff887f333f2050 x1553641484047916/t0(68719540823) o1000-&amp;gt;lquake-MDT0006-mdtlov_UUID@172.19.1.117@o2ib100:455/0 lens 344/0 e 0 to 0 dl 1481669835 ref 1 fl Complete:/6/ffffffff rc 0/-1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Olaf: Could you please tell me more about how you run the test and fail the MDTs? Thanks.&lt;/p&gt;</comment>
                            <comment id="177788" author="ofaaland" created="Wed, 14 Dec 2016 23:43:20 +0000"  >&lt;p&gt;Di,&lt;br/&gt;
The mds nodes were all powered off while clients were connected and the file system was active.  This file system has 16 MDTs hosted on 16 MDS nodes, and 4 OSTs hosted on 4 OSS nodes.&lt;/p&gt;

&lt;p&gt;The first test, where the issue seems to have started, was to power off nodes jet&lt;span class=&quot;error&quot;&gt;&amp;#91;9-12&amp;#93;&lt;/span&gt; which host MDT000&lt;span class=&quot;error&quot;&gt;&amp;#91;8-b&amp;#93;&lt;/span&gt;.  They were then powered back on, pools imported, and lustre mounted.  There were about 18 mdtest jobs running at the time, one per client node.  Each was running within a separate directory that was striped across all 16 MDTs, with the default stripe count set to 16 so that subdirectories created by mdtest were also striped across all 16 MDTs.&lt;/p&gt;

&lt;p&gt;I did not issue the abort_recovery; Lustre itself did that, but I don&apos;t know what triggered it.  Sorry, I don&apos;t have the debug log.&lt;/p&gt;</comment>
                            <comment id="177799" author="ofaaland" created="Thu, 15 Dec 2016 00:56:29 +0000"  >&lt;p&gt;I notice that there are at least a few instances of this:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;(llog.c:529:llog_process_thread()) invalid length 0 in llog record for index 0/80
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Attaching console logs for all the server nodes, see console.since-dec13.tgz&lt;/p&gt;

&lt;p&gt;I accidentally included logs for jet&lt;span class=&quot;error&quot;&gt;&amp;#91;21,22&amp;#93;&lt;/span&gt; which are not part of the lustre file system.  Disregard those.&lt;/p&gt;</comment>
                            <comment id="177810" author="di.wang" created="Thu, 15 Dec 2016 02:49:26 +0000"  >&lt;p&gt;Olaf, are there a lot of update logs under update_log_dir/ on MDT000a? Could you please upload those update logs, if there are not too many? Thanks.  I suspect all of this recovery hassle is caused by this update log failure.&lt;/p&gt;</comment>
                            <comment id="177826" author="gerrit" created="Thu, 15 Dec 2016 06:14:42 +0000"  >&lt;p&gt;wangdi (di.wang@intel.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/24364&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/24364&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8753&quot; title=&quot;Recovery already passed deadline with DNE&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8753&quot;&gt;&lt;del&gt;LU-8753&lt;/del&gt;&lt;/a&gt; llog: add some debug patch&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 91b2fd79275ef0c8b0cc6d72f66f7c53593bebcf&lt;/p&gt;</comment>
                            <comment id="177827" author="di.wang" created="Thu, 15 Dec 2016 06:15:59 +0000"  >&lt;p&gt;Olaf: I made a debug patch &lt;a href=&quot;https://review.whamcloud.com/24364&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/24364&lt;/a&gt;  based on &lt;a href=&quot;http://review.whamcloud.com/24008&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/24008&lt;/a&gt;, could you please try the test with this patch? thank you!&lt;/p&gt;</comment>
                            <comment id="177938" author="ofaaland" created="Thu, 15 Dec 2016 21:48:03 +0000"  >&lt;p&gt;Hi Di,&lt;br/&gt;
I&apos;m working on building updated lustre rpms for our testing, with these patches and those from other tickets as well.  It will probably take me until Monday to get the rpms built and get control of the cluster again.&lt;/p&gt;

&lt;p&gt;There are indeed lots of update logs on mdt000a.  Raw, 218GB.  Do you know the ID of the key log(s)?  Or any way I can select a useful subset to give you?&lt;/p&gt;</comment>
                            <comment id="177957" author="di.wang" created="Thu, 15 Dec 2016 22:29:38 +0000"  >&lt;p&gt;Unfortunately, the console message only tells me which record is wrong. Hmm, are there a lot of files? Maybe you can run llog_reader on each of them, then look for the message &quot;to next chunk&quot;:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;llog_reader  xxxx/update_log_dir/XXXX | grep &quot;to next chunk&quot;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
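&lt;p&gt;A minimal shell sketch of that per-file scan (the update log directory argument and an llog_reader binary on the PATH are assumptions; adjust to the local install):&lt;/p&gt;

```shell
# Hedged sketch: print the name of every update log for which llog_reader
# reports skipping over a hole ("to next chunk"), so only the suspect
# files need to be uploaded.  The directory argument and the llog_reader
# binary being on PATH are assumptions, not taken from this ticket.
scan_llogs() {
    local dir="$1"
    local f
    for f in "$dir"/*; do
        # llog_reader prints "off ... skip ... to next chunk." on corruption
        if llog_reader "$f" 2>/dev/null | grep -q "to next chunk"; then
            echo "$f"
        fi
    done
}
```

&lt;p&gt;Running e.g. &lt;tt&gt;scan_llogs xxxx/update_log_dir&lt;/tt&gt; then prints only the logs worth examining further.&lt;/p&gt;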

&lt;p&gt;Maybe you can even tweak llog_reader a bit to print the llog index at which the corruption happened (index 80 here).&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;(llog.c:529:llog_process_thread()) invalid length 0 in llog record for index 0/80
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;diff --git a/lustre/utils/llog_reader.c b/lustre/utils/llog_reader.c
index f7dae4c..19805a2 100644
--- a/lustre/utils/llog_reader.c
+++ b/lustre/utils/llog_reader.c
@@ -293,8 +293,8 @@ int llog_pack_buffer(int fd, struct llog_log_hdr **llog,
                    cur_rec-&amp;gt;lrh_len &amp;gt; (*llog)-&amp;gt;llh_hdr.lrh_len) {
                        cur_rec-&amp;gt;lrh_len = (*llog)-&amp;gt;llh_hdr.lrh_len -
                                offset % (*llog)-&amp;gt;llh_hdr.lrh_len;
-                       printf(&quot;off %lu skip %u to next chunk.\n&quot;, offset,
-                              cur_rec-&amp;gt;lrh_len);
+                       printf(&quot;idx %u off %lu skip %u to next chunk.\n&quot;,
+                              cur_rec-&amp;gt;lrh_index, offset, cur_rec-&amp;gt;lrh_len);
                        i--;
                } else if (ext2_test_bit(idx, LLOG_HDR_BITMAP(*llog))) {
                        printf(&quot;rec #%d type=%x len=%u offset %lu\n&quot;, idx,
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="178423" author="ofaaland" created="Mon, 19 Dec 2016 19:25:55 +0000"  >&lt;p&gt;Di,&lt;br/&gt;
Friday the new lustre build was installed in the image and the servers rebooted.  I&apos;ll bounce them again today to see if they recover normally.&lt;/p&gt;

&lt;p&gt;Disregard my earlier message about llogs, my mistake.  I&apos;ll get back to you about what I find in the llogs.&lt;/p&gt;
</comment>
                            <comment id="178456" author="ofaaland" created="Mon, 19 Dec 2016 21:42:09 +0000"  >&lt;p&gt;Di,&lt;br/&gt;
Only one log contained an invalid length (0).  I&apos;ve attached it as 0x48000a04b-0x1-0x0.tgz&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Bit 77 of 3 not set
Bit 78 of 3 not set
rec #79 type=106a0000 len=3048 offset 500040
rec #0 off 503088 orig_len 0 skip 21200 to next chunk.
Bit 90 of 3 not set
Bit 91 of 3 not set
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="178477" author="di.wang" created="Tue, 20 Dec 2016 00:40:48 +0000"  >&lt;p&gt;Hi, Olaf&lt;/p&gt;

&lt;p&gt;Thanks for uploading the file. It looks like some update log record was not written, so it leaves a hole there:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;master transno = 60130033439 batchid = 55834779575 flags = 0 ops = 172 params = 84 rec_len 8520
Bit 78 of 4 not set offset 491520
offset 500040 index 79 type 106a0000
master transno = 60130041273 batchid = 55834779576 flags = 0 ops = 88 params = 21 rec_len 3048
rec #79 type=106a0000 len=3048 offset 500040, total 1
offset 503088 index 0 type 0
master transno = 0 batchid = 0 flags = 0 ops = 0 params = 0 rec_len 0
off 503088 skip 8520 to next chunk. test_bit yes      ----&amp;gt;&amp;gt;&amp;gt; skip 8520 bytes, then it is valid again.
offset 511608 index 81 type 106a0000
master transno = 60130043788 batchid = 55834779578 flags = 0 ops = 18 params = 3 rec_len 728
Bit 81 of 4 not set offset 511608
offset 512336 index 82 type 106a0000
master transno = 60130043789 batchid = 55834779579 flags = 0 ops = 18 params = 3 rec_len 728
Bit 82 of 4 not set offset 512336
offset 513064 index 83 type 106a0000
master transno = 60130043798 batchid = 55834779580 flags = 0 ops = 88 params = 21 rec_len 2968
Bit 83 of 4 not set offset 513064
offset 516032 index 84 type 106a0000
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So it looks like an update record (rec_len = 8520) was being written, but the following write was not cancelled, which caused a hole in the update log, and that caused the issue.&lt;/p&gt;</comment>
                            <comment id="178482" author="ofaaland" created="Tue, 20 Dec 2016 02:13:31 +0000"  >&lt;p&gt;Di,&lt;/p&gt;

&lt;p&gt;I bounced servers or stopped and restarted lustre several times today, in varying combinations.  MDT000a still seems to take longer to get connected than the others; several times the recovery_status procfile showed MDTs were waiting for MDT000a.  However within a few minutes it seems to successfully complete recovery.&lt;/p&gt;

&lt;p&gt;I have the debug logs from one of these attempts, and can upload if it&apos;s helpful.  Let me know.&lt;/p&gt;</comment>
                            <comment id="178490" author="di.wang" created="Tue, 20 Dec 2016 03:23:26 +0000"  >&lt;p&gt;Hmm, there is already a corrupted update log. Ah, did you reformat the FS? If not, then I would expect recovery to get stuck. Do you have the console log on MDT000a? Thanks.&lt;/p&gt;</comment>
                            <comment id="178561" author="di.wang" created="Tue, 20 Dec 2016 18:39:12 +0000"  >&lt;p&gt;Olaf: I just added another fix to this ticket: &lt;a href=&quot;https://review.whamcloud.com/24364&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/24364&lt;/a&gt;. Hopefully this can resolve the update log corruption issue. Please try it. Thanks.&lt;/p&gt;</comment>
                            <comment id="178603" author="ofaaland" created="Wed, 21 Dec 2016 00:13:40 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Hmm, there are already corrupted update log. Ah, you reformat FS? if not, then I would expect recovery would stuck. Do you have the console log on MDT000a? Thanks.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;This FS (lquake) was last reformatted November 21.  I expected recovery to get stuck also, given the presence of the corrupted update log.  I don&apos;t know why it did not.   I&apos;ve attached the console log, see console.zinc11.2016-12-19&lt;/p&gt;</comment>
                            <comment id="178672" author="di.wang" created="Wed, 21 Dec 2016 16:58:15 +0000"  >&lt;p&gt;According to the console log, it looks like recovery was stuck, but succeeded as you said. This is good but strange:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2016-12-19 18:09:35 [ 9014.403506] Lustre: lquake-MDT000a: Recovery already passed deadline 12:36. It is due to DNE recovery failed/stuck on the 1 MDT(s): 000a. Please wait until all MDTs recovered or abort the recovery by force.
2016-12-19 18:09:35 [ 9014.426106] Lustre: Skipped 60 previous similar messages
2016-12-19 18:10:40 [ 9079.948790] Lustre: lquake-MDT000a: Recovery already passed deadline 11:30. It is due to DNE recovery failed/stuck on the 1 MDT(s): 000a. Please wait until all MDTs recovered or abort the recovery by force.
2016-12-19 18:10:40 [ 9079.971290] Lustre: Skipped 63 previous similar messages
2016-12-19 18:11:05 [ 9104.433087] Lustre: lquake-MDT000a: Connection restored to 172.19.1.111@o2ib100 (at 172.19.1.111@o2ib100)
2016-12-19 18:11:05 [ 9104.444864] Lustre: Skipped 130 previous similar messages
2016-12-19 18:12:35 [ 9194.452689] Lustre: lquake-MDT000a: Recovery already passed deadline 9:35. If you do not want to wait more, please abort the recovery by force.
2016-12-19 18:12:36 [ 9195.708675] Lustre: lquake-MDT000a: Recovery already passed deadline 9:34. If you do not want to wait more, please abort the recovery by force.
2016-12-19 18:12:36 [ 9195.725009] Lustre: Skipped 7 previous similar messages
2016-12-19 18:12:37 [ 9196.061115] Lustre: lquake-MDT000a: Recovery over after 5:26, of 70 clients 70 recovered and 0 were evicted.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Please try 24364 + 24008. I hope this resolves all of these corrupt update log issues, so all of these recovery troubles will go away. Thanks.&lt;/p&gt;</comment>
                            <comment id="178719" author="di.wang" created="Wed, 21 Dec 2016 19:06:58 +0000"  >&lt;p&gt;Hi, Olaf&lt;/p&gt;

&lt;p&gt;So the current corruption happened when MDT0006 retrieved the update log from MDT000a (from console.since-dec13):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2016-12-13 13:49:38 [336387.573280] Lustre: 86734:0:(llog.c:529:llog_process_thread()) invalid length 0 in llog record for index 0/80
2016-12-13 13:49:38 [336387.585565] LustreError: 86734:0:(lod_dev.c:419:lod_sub_recovery_thread()) lquake-MDT000a-osp-MDT0006 getting update log failed: rc = -22
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Do you still have the console log on MDT0006 for the last run?  I want to check whether this corrupt log was hit in the last recovery. Thanks.&lt;/p&gt;</comment>
                            <comment id="180010" author="gerrit" created="Mon, 9 Jan 2017 05:44:21 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/24008/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/24008/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8753&quot; title=&quot;Recovery already passed deadline with DNE&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8753&quot;&gt;&lt;del&gt;LU-8753&lt;/del&gt;&lt;/a&gt; llog: remove lgh_write_offset&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: f36daac69fe6e0cd35e2369967f4bae11bd2666f&lt;/p&gt;</comment>
                            <comment id="181838" author="gerrit" created="Tue, 24 Jan 2017 05:20:01 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/24364/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/24364/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8753&quot; title=&quot;Recovery already passed deadline with DNE&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8753&quot;&gt;&lt;del&gt;LU-8753&lt;/del&gt;&lt;/a&gt; osp: add rpc generation&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 0844905a308d614c86b56df70c8f03e5d59ee286&lt;/p&gt;</comment>
                            <comment id="182621" author="pjones" created="Mon, 30 Jan 2017 19:24:08 +0000"  >&lt;p&gt;As per Di, the remaining patch was a debug-only patch. This issue is fixed for 2.10 and the patches will be backported to maintenance releases.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10120">
                    <name>Blocker</name>
                                                                <inwardlinks description="is blocked by">
                                                        </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="34139">LU-7675</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="33134">LU-7426</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="42278">LU-8916</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="41229">LU-8787</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="24592" name="0x48000a04b-0x1-0x0.tgz" size="108674" author="ofaaland" created="Mon, 19 Dec 2016 21:39:07 +0000"/>
                            <attachment id="24526" name="console.jet11.2016-12-13-14-47" size="14484" author="ofaaland" created="Wed, 14 Dec 2016 21:41:36 +0000"/>
                            <attachment id="23825" name="console.jet7.gz" size="1170564" author="ofaaland" created="Thu, 27 Oct 2016 18:46:20 +0000"/>
                            <attachment id="24537" name="console.since-dec13.tgz" size="1789593" author="ofaaland" created="Thu, 15 Dec 2016 00:56:56 +0000"/>
                            <attachment id="24611" name="console.zinc11.2016-12-19" size="173124" author="ofaaland" created="Wed, 21 Dec 2016 00:14:09 +0000"/>
                            <attachment id="24210" name="console_logs.nov28.tgz" size="18762" author="ofaaland" created="Mon, 28 Nov 2016 23:59:06 +0000"/>
                            <attachment id="23959" name="dk.jet1.1478223101.gz" size="610781" author="ofaaland" created="Fri, 4 Nov 2016 01:35:41 +0000"/>
                            <attachment id="23998" name="dk.jet1.1478565846.gz" size="697424" author="ofaaland" created="Tue, 8 Nov 2016 01:12:36 +0000"/>
                            <attachment id="23826" name="dk.recovery_stuck.jet7.1477593159.gz" size="54655" author="ofaaland" created="Thu, 27 Oct 2016 18:46:20 +0000"/>
                            <attachment id="23827" name="dk.recovery_stuck.jet7.1477593344.gz" size="6906" author="ofaaland" created="Thu, 27 Oct 2016 18:46:20 +0000"/>
                            <attachment id="24214" name="dk.zinc1.1480375634.gz" size="12707700" author="ofaaland" created="Tue, 29 Nov 2016 00:20:43 +0000"/>
                            <attachment id="24212" name="dk.zinc13.1480375634.gz" size="13964012" author="ofaaland" created="Tue, 29 Nov 2016 00:06:17 +0000"/>
                            <attachment id="24211" name="dk.zinc7.1480375634.gz" size="14177191" author="ofaaland" created="Tue, 29 Nov 2016 00:06:17 +0000"/>
                            <attachment id="24066" name="logs.2016-11-14.tgz" size="12828415" author="ofaaland" created="Mon, 14 Nov 2016 18:25:14 +0000"/>
                            <attachment id="24213" name="lsh-mdt000c-1b70.nov28.tgz" size="7139954" author="ofaaland" created="Tue, 29 Nov 2016 00:06:17 +0000"/>
                            <attachment id="23828" name="lustre.log.gz" size="4539425" author="ofaaland" created="Thu, 27 Oct 2016 18:53:43 +0000"/>
                            <attachment id="24022" name="mdt09.0x240019a58_0x6_0x0.tgz" size="12821282" author="dinatale2" created="Tue, 8 Nov 2016 22:01:14 +0000"/>
                            <attachment id="24001" name="mdt0b.0x240019a58_0x6_0x0.tgz" size="12821248" author="ofaaland" created="Tue, 8 Nov 2016 01:46:49 +0000"/>
                            <attachment id="24209" name="target_to_node_map.nov28.txt" size="327" author="ofaaland" created="Mon, 28 Nov 2016 23:59:06 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzyt8f:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>