<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:22:59 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-9068] Hardware problem resulting in bad blocks</title>
                <link>https://jira.whamcloud.com/browse/LU-9068</link>
                <project id="10000" key="LU">Lustre</project>
<description>&lt;p&gt;We encountered a hardware problem on the MDT storage device (DDN 7700) that resulted in bad blocks. The file system continued to operate but yesterday went read-only when it stumbled over a bad sector.&lt;/p&gt;

&lt;p&gt;We ran fsck against the file system with the most current e2fsprogs, which repaired the file system but dumped 90 objects/files into lost+found. All but 2 belonged to one user, but one of the files/objects belongs to root and has a low inode number (#5749) that appears to be a data file.&lt;/p&gt;

&lt;p&gt;We are very concerned that this particular file may be Lustre-relevant and would like your guidance on what we should do. (Obviously we are able to mount the file system as ldiskfs.)&lt;/p&gt;</description>
                <environment></environment>
        <key id="43447">LU-9068</key>
            <summary>Hardware problem resulting in bad blocks</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="pjones">Peter Jones</assignee>
                                    <reporter username="jamervi">Joe Mervini</reporter>
                        <labels>
                    </labels>
                <created>Tue, 31 Jan 2017 16:53:02 +0000</created>
                <updated>Tue, 31 Jan 2017 23:37:30 +0000</updated>
                            <resolved>Tue, 31 Jan 2017 23:37:30 +0000</resolved>
                                    <version>Lustre 2.5.5</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="182783" author="pjones" created="Tue, 31 Jan 2017 16:55:57 +0000"  >&lt;p&gt;Joe&lt;/p&gt;

&lt;p&gt;I created this ticket on your behalf. Please confirm whether you are able to post updates ok&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="182784" author="jamervi" created="Tue, 31 Jan 2017 17:06:55 +0000"  >&lt;p&gt;Thanks Peter. Yes - It&apos;s coming up. The site is really sluggish here at Sandia. Don&apos;t know if it&apos;s just us...&lt;/p&gt;</comment>
                            <comment id="182805" author="jamervi" created="Tue, 31 Jan 2017 18:31:49 +0000"  >&lt;p&gt;We examined the file with debugfs and these are the stats (I&apos;m attaching the full output):&lt;/p&gt;

&lt;p&gt;debugfs 1.42.13.wc5 (15-Apr-2016)&lt;br/&gt;
/dev/mapper/360001ff0a096f0000000000d88530000: catastrophic mode - not reading inode or group bitmaps&lt;br/&gt;
Inode: 5749   Type: regular    Mode:  0644   Flags: 0x0&lt;br/&gt;
Generation: 3182218171    Version: 0x00000000:00000000&lt;br/&gt;
User:     0   Group:     0   Size: 4153344&lt;br/&gt;
File ACL: 0    Directory ACL: 0&lt;br/&gt;
Links: 1   Blockcount: 8120&lt;br/&gt;
Fragment:  Address: 0    Number: 0    Size: 0&lt;br/&gt;
 ctime: 0x588e2a64:6eccc4e4 &amp;#8211; Sun Jan 29 10:46:12 2017&lt;br/&gt;
 atime: 0x589093a1:254e4128 &amp;#8211; Tue Jan 31 06:39:45 2017&lt;br/&gt;
 mtime: 0x588e2a64:6eccc4e4 &amp;#8211; Sun Jan 29 10:46:12 2017&lt;br/&gt;
crtime: 0x588e2a64:6eccc4e4 &amp;#8211; Sun Jan 29 10:46:12 2017&lt;br/&gt;
Size of extra inode fields: 28&lt;br/&gt;
Extended attributes stored in inode body:&lt;br/&gt;
  lma = &quot;00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 6b 9c 01 00 00 00 00 00 &quot; (24)&lt;br/&gt;
  lma: fid=&lt;span class=&quot;error&quot;&gt;&amp;#91;0x1:0x19c6b:0x0&amp;#93;&lt;/span&gt; compat=0 incompat=0&lt;/p&gt;

&lt;p&gt;Since the create time occurred after we encountered a hardware hiccup, we believe that it might be benign but would like confirmation.&lt;/p&gt;</comment>
                            <comment id="182813" author="adilger" created="Tue, 31 Jan 2017 18:57:10 +0000"  >&lt;p&gt;It looks like this is indeed a Lustre internal file, based on the FID reported in the stat output:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lma: fid=[0x1:0x19c6b:0x0]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This object is an llog file, and can be renamed to &lt;tt&gt;O/1/d11/105579&lt;/tt&gt; (the FID number in decimal).  However, it may be that the MDS has moved on once it detected this log file was missing (there may be some error messages on the MDS console to that effect). At worst this may mean a few OST objects are orphaned, but they will be cleaned up when LFSCK is run with Lustre 2.7 or later (it isn&apos;t mentioned in this ticket what version you are running).&lt;/p&gt;

&lt;p&gt;So, in summary, it is possible to repair this file, which may be helpful, but isn&apos;t critical if this would take the system offline. &lt;/p&gt;</comment>
                            <comment id="182819" author="jamervi" created="Tue, 31 Jan 2017 19:16:05 +0000"  >&lt;p&gt;We rebooted the node and started lustre. The file system mounted and was in the process of connecting clients when it panicked with an LBUG. (This is before trying to rename the file.)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Jan 31 11:51:57 gmds1 kernel: [944595.780330] LustreError: 86284:0:(ldlm_lib.c:2324:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff880bd2005800 x1557048164247136/t0(177528794042) o35-&amp;gt;0e269f32-c7c8-9037-e29a-c1f762282b43@10.1.9.69@o2ib:0/0 lens 392/0 e 0 to 0 dl 1485888778 ref 2 fl Interpret:/6/ffffffff rc 0/-1
Jan 31 11:51:57 gmds1 kernel: [944596.013130] LustreError: 86284:0:(ldlm_lib.c:2324:target_queue_recovery_request()) Skipped 202 previous similar messages
Jan 31 11:52:12 gmds1 kernel: [944610.729259] Lustre: gscratch-MDT0000: Client 5a2a05db-5f33-aecf-9cce-44f4bfe225ab (at 10.1.6.230@o2ib) reconnecting, waiting for 7326 clients in recovery for 3:28
Jan 31 11:52:12 gmds1 kernel: [944610.887017] Lustre: Skipped 431 previous similar messages
Jan 31 11:52:14 gmds1 kernel: [944612.182223] LustreError: 85847:0:(ldlm_lib.c:2324:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff880c1c85c850 x1556994744167148/t0(177528744813) o35-&amp;gt;ea447cc5-9dd6-7b57-077b-df9f767a42e4@10.1.1.156@o2ib:0/0 lens 392/0 e 0 to 0 dl 1485888795 ref 2 fl Interpret:/6/ffffffff rc 0/-1
Jan 31 11:52:14 gmds1 kernel: [944612.415138] LustreError: 85847:0:(ldlm_lib.c:2324:target_queue_recovery_request()) Skipped 127 previous similar messages
Jan 31 11:52:14 gmds1 kernel: [944612.589958] Lustre: gscratch-MDT0000: Denying connection for new client 4bb27966-87a7-7a95-6e3e-1007a28f70bd (at 10.1.7.205@o2ib), waiting for all 7326 known clients (2152 recovered, 1857 in progress, and 0 evicted) to recover in 3:26
Jan 31 11:52:14 gmds1 kernel: [944612.756886] Lustre: Skipped 12 previous similar messages
Jan 31 11:52:14 gmds1 kernel: [944612.896489] LustreError: 86016:0:(llog_cat.c:195:llog_cat_id2handle()) gscratch-OST004a-osc-MDT0000: error opening log id 0x19cee:1:0: rc = -2
Jan 31 11:52:15 gmds1 kernel: [944613.046320] LustreError: 86016:0:(llog_cat.c:586:llog_cat_process_cb()) gscratch-OST004a-osc-MDT0000: cannot find handle for llog 0x19cee:1: -2
Jan 31 11:52:24 gmds1 kernel: [944622.532283] LustreError: 86182:0:(llog_osd.c:757:llog_osd_next_block()) gscratch-MDT0000-osd: missed desired record? 45953 &amp;gt; 39475
Jan 31 11:52:24 gmds1 kernel: [944622.681106] LustreError: 86182:0:(osp_sync.c:872:osp_sync_thread()) ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 0 changes, 6 in progress, 0 in flight: -2
Jan 31 11:52:24 gmds1 kernel: [944622.833222] LustreError: 86182:0:(osp_sync.c:872:osp_sync_thread()) LBUG
Jan 31 11:52:24 gmds1 kernel: [944622.908388] Pid: 86182, comm: osp-syn-89-0
Jan 31 11:52:24 gmds1 kernel: [944622.979766]
Jan 31 11:52:24 gmds1 kernel: [944622.979768] Call Trace:
Jan 31 11:52:25 gmds1 kernel: [944623.113187]  [&amp;lt;ffffffffa05ed8f5&amp;gt;] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Jan 31 11:52:25 gmds1 kernel: [944623.185016]  [&amp;lt;ffffffffa05edef7&amp;gt;] lbug_with_loc+0x47/0xb0 [libcfs]
Jan 31 11:52:25 gmds1 kernel: [944623.255067]  [&amp;lt;ffffffffa15e6e03&amp;gt;] osp_sync_thread+0x753/0x7d0 [osp]
Jan 31 11:52:25 gmds1 kernel: [944623.325388]  [&amp;lt;ffffffff8154b79e&amp;gt;] ? schedule+0x3ee/0xb70
Jan 31 11:52:25 gmds1 kernel: [944623.395727]  [&amp;lt;ffffffffa15e66b0&amp;gt;] ? osp_sync_thread+0x0/0x7d0 [osp]
Jan 31 11:52:25 gmds1 kernel: [944623.466450]  [&amp;lt;ffffffff810a6d1e&amp;gt;] kthread+0x9e/0xc0
Jan 31 11:52:25 gmds1 kernel: [944623.534286]  [&amp;lt;ffffffff8100c2ca&amp;gt;] child_rip+0xa/0x20
Jan 31 11:52:25 gmds1 kernel: [944623.601056]  [&amp;lt;ffffffff810a6c80&amp;gt;] ? kthread+0x0/0xc0
Jan 31 11:52:25 gmds1 kernel: [944623.666502]  [&amp;lt;ffffffff8100c2c0&amp;gt;] ? child_rip+0x0/0x20
Jan 31 11:52:25 gmds1 kernel: [944623.731052]
Jan 31 11:52:26 gmds1 kernel: [944623.790477] Kernel panic - not syncing: LBUG
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="182820" author="jamervi" created="Tue, 31 Jan 2017 19:19:22 +0000"  >&lt;p&gt;Note: All the OSS have remained up with all OSTs mounted.&lt;/p&gt;</comment>
                            <comment id="182833" author="jamervi" created="Tue, 31 Jan 2017 19:53:46 +0000"  >&lt;p&gt;We brought down all the OSSs and OSTs and rebooted them. We brought up the MDS and MDT, which looked happy. When I started bringing the OSTs back online, it LBUGed against another OST.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[ 1549.368141] Lustre: gscratch-MDT0000: trigger OI scrub by RPC for [0x1:0x19d01:0x0], rc = 0 [1]
[ 1549.463828] LustreError: 5476:0:(llog_cat.c:195:llog_cat_id2handle()) gscratch-OST0036-osc-MDT0000: error opening log id 0x19d01:1:0: rc = -115
[ 1549.614586] LustreError: 5476:0:(llog_cat.c:586:llog_cat_process_cb()) gscratch-OST0036-osc-MDT0000: cannot find handle for llog 0x19d01:1: -115
[ 1549.765765] LustreError: 5476:0:(osp_sync.c:872:osp_sync_thread()) ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 19 changes, 7 in progress, 0 in flight: -115
[ 1549.919365] LustreError: 5476:0:(osp_sync.c:872:osp_sync_thread()) LBUG
[ 1549.996451] Pid: 5476, comm: osp-syn-54-0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;By the way - this is Lustre 2.5.5, not 2.7.&lt;/p&gt;</comment>
                            <comment id="182834" author="adilger" created="Tue, 31 Jan 2017 20:05:37 +0000"  >&lt;p&gt;The error being reported on the MDS is for a different llog file &lt;tt&gt;0x19cee&lt;/tt&gt; than the one in &lt;tt&gt;lost+found&lt;/tt&gt;, which is &lt;tt&gt;0x19c6b&lt;/tt&gt;.&lt;/p&gt;

&lt;p&gt;What version of Lustre are you running?  It looks like this LBUG (ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK )) is a duplicate with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6696&quot; title=&quot;ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 0 changes, 0 in progress, 0 in flight: -5&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6696&quot;&gt;&lt;del&gt;LU-6696&lt;/del&gt;&lt;/a&gt;, which was fixed in Lustre 2.9.0 (patch &lt;a href=&quot;http://review.whamcloud.com/19856&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/19856&lt;/a&gt;).&lt;/p&gt;</comment>
                            <comment id="182835" author="jamervi" created="Tue, 31 Jan 2017 20:10:31 +0000"  >&lt;p&gt;We&apos;re running the toss version of 2.5.5.&lt;/p&gt;</comment>
                            <comment id="182836" author="jamervi" created="Tue, 31 Jan 2017 20:14:43 +0000"  >&lt;p&gt;We&apos;re looking for a workaround to get the file system back up again.&lt;/p&gt;</comment>
                            <comment id="182852" author="adilger" created="Tue, 31 Jan 2017 22:15:19 +0000"  >&lt;p&gt;Sorry, I didn&apos;t see your reply until now.  Applying the patch to return the error from &lt;tt&gt;osp_sync_thread()&lt;/tt&gt; is the proper fix.  You &lt;em&gt;may&lt;/em&gt; be able to work around this by creating an empty &lt;tt&gt;O/1/105729&lt;/tt&gt; file on the MDT (using decimal object ID based on error messages), but it may be that this will also return an error message if the content is bad, instead of just a missing file.&lt;/p&gt;</comment>
                            <comment id="182857" author="ruth.klundt@gmail.com" created="Tue, 31 Jan 2017 23:32:08 +0000"  >&lt;p&gt;As Joe mentioned, we had already replaced the file and moved on to errors on some other log files. &lt;/p&gt;

&lt;p&gt;We finally removed the CATALOGS file and rebooted. This got us past the LBUG and the file system is back, minus a few items lost due to hardware.&lt;/p&gt;

&lt;p&gt;Ok to close.&lt;/p&gt;</comment>
                            <comment id="182858" author="pjones" created="Tue, 31 Jan 2017 23:37:30 +0000"  >&lt;p&gt;ok - thanks Ruth&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="30548">LU-6696</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="25131" name="stat.5749.txt" size="16978" author="jamervi" created="Tue, 31 Jan 2017 17:51:27 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzz25b:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10020"><![CDATA[1]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>