<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:25:57 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-9410] on-disk bitmap corrupted</title>
                <link>https://jira.whamcloud.com/browse/LU-9410</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We had 2 OSS and 3 different OST crash with bitmap corrupted messages.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Apr  3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs error (device dm-42): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 245659 corrupted: 32768 blocks free in bitmap, 0 - in gd
Apr  3 18:38:16 nbp1-oss6 kernel: 
Apr  3 18:38:16 nbp1-oss6 kernel: Aborting journal on device dm-3.
Apr  3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs (dm-42): Remounting filesystem read-only
Apr  3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs error (device dm-42): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 245660 corrupted: 32768 blocks free in bitmap, 0 - in gd


&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;These errors were on 2 different backend RAID devices. Noteworthy items:&lt;br/&gt;
 1. The filesystem was over 90% full and half of the data was deleted.&lt;br/&gt;
 2. OSTs are formatted with &quot;-E packed_meta_blocks=1&quot;&lt;/p&gt;</description>
                <environment></environment>
        <key id="45747">LU-9410</key>
            <summary>on-disk bitmap corrupted</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="yong.fan">nasf</assignee>
                                    <reporter username="mhanafi">Mahmoud Hanafi</reporter>
                        <labels>
                    </labels>
                <created>Thu, 27 Apr 2017 01:40:52 +0000</created>
                <updated>Thu, 22 Mar 2018 17:22:35 +0000</updated>
                            <resolved>Mon, 28 Aug 2017 07:05:03 +0000</resolved>
                                    <version>Lustre 2.7.0</version>
                                    <fixVersion>Lustre 2.10.1</fixVersion>
                    <fixVersion>Lustre 2.11.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>13</watches>
                                                                            <comments>
                            <comment id="193803" author="adilger" created="Thu, 27 Apr 2017 17:38:54 +0000"  >&lt;p&gt;There is a patch &lt;a href=&quot;https://review.whamcloud.com/16679&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/16679&lt;/a&gt; &quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7114&quot; title=&quot;ldiskfs: corrupted bitmaps handling patches&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7114&quot;&gt;&lt;del&gt;LU-7114&lt;/del&gt;&lt;/a&gt; ldiskfs: corrupted bitmaps handling patches&quot; that allows ldiskfs to handle this error more gracefully. &lt;/p&gt;</comment>
                            <comment id="196071" author="mhanafi" created="Tue, 16 May 2017 19:02:12 +0000"  >&lt;p&gt;We hit this issue again. We are trying to determine the root cause and eliminate Intel CAS as a possible source. Is fsck expected to detect and fix these types of errors?&lt;/p&gt;</comment>
                            <comment id="196254" author="mhanafi" created="Wed, 17 May 2017 21:42:24 +0000"  >&lt;p&gt;Some background. We have been running 2.7 on all our OSSes for some time and haven&apos;t seen this error. A few months ago we expanded 3 OSSes with an additional 12 OSTs each, bringing the total to 24 OSTs per OSS. These are the OSSes that have hit this issue. It has occurred on different back-end RAIDs and no errors are logged on the RAIDs. Typically the errors are seen during high load on the OSS.&lt;/p&gt;

&lt;p&gt;We need a way to debug and find the root cause of the issue. We are open to installing a debug patch. After the last crash the OST was not scanned with fsck. If these errors are real corruption on the disk would an fsck find them?&lt;/p&gt;

&lt;p&gt;Please advise.&#160;&lt;/p&gt;</comment>
                            <comment id="196375" author="pjones" created="Thu, 18 May 2017 17:52:53 +0000"  >&lt;p&gt;Fan Yong&lt;/p&gt;

&lt;p&gt;Could you please advise here?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="196399" author="mhanafi" created="Thu, 18 May 2017 21:52:58 +0000"  >&lt;p&gt;In ldiskfs_mb_init_cache, between the time the bitmap is read and then checked in ldiskfs_mb_generate_from_pa, isn&apos;t it possible for the bitmap to change?&lt;/p&gt;</comment>
                            <comment id="196817" author="yong.fan" created="Wed, 24 May 2017 01:20:44 +0000"  >&lt;p&gt;We have hit similar troubles before, with messages like the following:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group xxx corrupted: mmmm blocks free in bitmap, nnnn - in gd
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;For almost all of these cases, &apos;mmmm&apos; &amp;gt; &apos;nnnn&apos;. That means the bitmap contains more available blocks than the count in the group descriptor. One suspect is ldiskfs_ext_walk_space. Would you please try the patch &lt;a href=&quot;https://review.whamcloud.com/#/c/21603/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/21603/&lt;/a&gt; from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8410&quot; title=&quot;fiemap vs walk race&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8410&quot;&gt;&lt;del&gt;LU-8410&lt;/del&gt;&lt;/a&gt;?&lt;/p&gt;</comment>
                            <comment id="203392" author="mhanafi" created="Mon, 24 Jul 2017 18:11:33 +0000"  >&lt;p&gt;We hit this bug with the patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8410&quot; title=&quot;fiemap vs walk race&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8410&quot;&gt;&lt;del&gt;LU-8410&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

</comment>
                            <comment id="203446" author="yong.fan" created="Tue, 25 Jul 2017 03:17:07 +0000"  >&lt;p&gt;Have you ever run &lt;tt&gt;e2fsck&lt;/tt&gt; after the bitmap corruption? What does the &lt;tt&gt;e2fsck&lt;/tt&gt; report?&lt;/p&gt;</comment>
                            <comment id="203450" author="mhanafi" created="Tue, 25 Jul 2017 03:36:50 +0000"  >&lt;p&gt;We had 2 crashes and we ran e2fsck both times. It didn&apos;t find anything.&lt;/p&gt;</comment>
                            <comment id="203451" author="yong.fan" created="Tue, 25 Jul 2017 03:42:01 +0000"  >&lt;p&gt;Then what is your kernel version?&lt;/p&gt;</comment>
                            <comment id="203453" author="mhanafi" created="Tue, 25 Jul 2017 04:21:16 +0000"  >&lt;p&gt;We have hit it with&lt;/p&gt;

&lt;p&gt;2.6.32-642.15.1.el6&lt;/p&gt;

&lt;p&gt;and 2.6.32-642.13.1.el6&lt;/p&gt;

&lt;p&gt;We just recovered from a crash and fsck showed nothing.&lt;/p&gt;

&lt;p&gt;(e2fsck -fp)&lt;/p&gt;</comment>
                            <comment id="203456" author="yong.fan" created="Tue, 25 Jul 2017 05:13:13 +0000"  >&lt;p&gt;It seems to be some in-RAM data corruption. We have hit similar trouble at other customer sites. Currently, we suspect it is a kernel issue that may be fixed via the kernel patch &quot;Addresses-Google-Bug: 2828254&quot;. One of our partners is verifying whether the trouble can be fixed via that patch.&lt;/p&gt;</comment>
                            <comment id="203458" author="mhanafi" created="Tue, 25 Jul 2017 05:48:36 +0000"  >&lt;p&gt;Do we have the patch ported to 2.7?&lt;/p&gt;</comment>
                            <comment id="203459" author="yong.fan" created="Tue, 25 Jul 2017 05:51:33 +0000"  >&lt;p&gt;It is a patch to EXT4 itself; we are still verifying it.&lt;/p&gt;</comment>
                            <comment id="203483" author="mhanafi" created="Tue, 25 Jul 2017 12:42:01 +0000"  >&lt;p&gt;We had a crash where running fsck did find lots of wrong free block counts.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Free blocks count wrong for group #326001 (0, counted=32768).
Fix&amp;lt;y&amp;gt;? yes
Free blocks count wrong for group #326002 (0, counted=32768).
Fix&amp;lt;y&amp;gt;? yes
Free blocks count wrong for group #326003 (0, counted=32768).
Fix&amp;lt;y&amp;gt;? yes
Free blocks count wrong for group #326004 (0, counted=32768).
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
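The e2fsck output above and the ldiskfs_mb_check_ondisk_bitmap messages in the ticket description report the same consistency check: the number of free bits counted from the on-disk block bitmap must equal the free-block count stored in the group descriptor. A minimal Python sketch of that comparison, with hypothetical helper names (the real kernel code operates on buffer heads):

```python
# Sketch of the bitmap-vs-group-descriptor consistency check that produces
# "mmmm blocks free in bitmap, nnnn - in gd" messages. Hypothetical data
# layout; illustrative only.

BLOCKS_PER_GROUP = 32768  # one bit per block, i.e. a 4 KiB bitmap

def free_blocks_in_bitmap(bitmap):
    """Count zero bits (free blocks) in a block bitmap."""
    set_bits = sum(bin(byte).count("1") for byte in bitmap)
    return BLOCKS_PER_GROUP - set_bits

def check_ondisk_bitmap(group, bitmap, gd_free_count):
    """Return True if bitmap and group descriptor agree, else log the error."""
    in_bitmap = free_blocks_in_bitmap(bitmap)
    if in_bitmap != gd_free_count:
        print(f"on-disk bitmap for group {group} corrupted: "
              f"{in_bitmap} blocks free in bitmap, {gd_free_count} - in gd")
        return False
    return True

# A fully-free bitmap (all bits clear) paired with a descriptor claiming
# 0 free blocks reproduces the "32768 blocks free in bitmap, 0 - in gd"
# pattern seen in this ticket:
assert not check_ondisk_bitmap(245659, bytes(BLOCKS_PER_GROUP // 8), 0)
```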
                            <comment id="203500" author="yong.fan" created="Tue, 25 Jul 2017 16:05:53 +0000"  >&lt;p&gt;Then the EXT4 patch may not cover your case; you will have to fix it via &lt;tt&gt;e2fsck&lt;/tt&gt;.&lt;/p&gt;</comment>
                            <comment id="203532" author="mhanafi" created="Tue, 25 Jul 2017 21:15:12 +0000"  >&lt;p&gt;Have you ported the patch to CentOS 6.x yet? We are going to try to port it over to CentOS 6, but if you have already done so it would save us the work.&lt;/p&gt;</comment>
                            <comment id="203557" author="yong.fan" created="Wed, 26 Jul 2017 01:25:28 +0000"  >&lt;p&gt;You mean the EXT4 patch &quot;Addresses-Google-Bug: 2828254&quot;, right? We have not ported that patch yet; we are not 100% sure it is really useful for fixing our current in-RAM data corruption, and it is still being verified. But what you described in the comment &lt;a href=&quot;https://jira.hpdd.intel.com/browse/LU-9410?focusedCommentId=203483&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-203483&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://jira.hpdd.intel.com/browse/LU-9410?focusedCommentId=203483&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-203483&lt;/a&gt; means there is real on-disk data corruption. I am afraid that porting that patch alone may not be enough for your case.&lt;/p&gt;</comment>
                            <comment id="203558" author="mhanafi" created="Wed, 26 Jul 2017 01:32:56 +0000"  >&lt;p&gt;What about a debug patch? We are hitting this every few hours on our production file system.&lt;/p&gt;</comment>
                            <comment id="203560" author="yong.fan" created="Wed, 26 Jul 2017 01:35:48 +0000"  >&lt;p&gt;I just got the latest feedback from our partner minutes ago, they still hit the in-RAM bitmap corruption after applying such EXT4 patch. So we have to make new investigation. Sorry for that.&lt;/p&gt;</comment>
                            <comment id="203561" author="yong.fan" created="Wed, 26 Jul 2017 01:40:35 +0000"  >&lt;p&gt;Would you please describe what operations you did that may have triggered your trouble a few hours ago? Please upload the latest logs for the corruption. My understanding is that the corruption still occurred on the newly expanded OSTs, right?&lt;/p&gt;</comment>
                            <comment id="203566" author="mhanafi" created="Wed, 26 Jul 2017 02:04:13 +0000"  >&lt;p&gt;The filesystem has 20 OSSes and 360 OSTs. We have been seeing bitmap corruption and remounts to read-only every few hours on different OSTs and OSSes. 9 out of 10 times fsck doesn&apos;t report any errors.&lt;/p&gt;

&lt;p&gt;What type of logs would be helpful?&lt;/p&gt;

&lt;p&gt;We set our mount option to errors=panic so we can get a crash dump right away.&lt;/p&gt;</comment>
                            <comment id="203567" author="yong.fan" created="Wed, 26 Jul 2017 02:45:32 +0000"  >&lt;p&gt;Both the crash dump and /var/log/messages may be helpful. You mentioned that you did not hit the trouble before you expanded your OSSes; my understanding is that the data corruption only happened on the new OSTs, not on the old ones. Is that true?&lt;/p&gt;</comment>
                            <comment id="203568" author="mhanafi" created="Wed, 26 Jul 2017 03:47:42 +0000"  >&lt;p&gt;We have 2 filesystems that have been expanded, and we are only seeing this on them. I need to double-check to make sure we never had the corruption on the old OSTs. Since the new OSTs started out empty, they see higher utilization.&lt;/p&gt;

&lt;p&gt;I will upload the crash dump.&lt;/p&gt;</comment>
                            <comment id="203569" author="mhanafi" created="Wed, 26 Jul 2017 04:00:36 +0000"  >&lt;p&gt;It will take some time to upload the vmcore, but I have attached the backtrace and dmesg.&lt;/p&gt;</comment>
                            <comment id="203642" author="mhanafi" created="Wed, 26 Jul 2017 20:26:31 +0000"  >&lt;p&gt;Crash dumps uploaded. I have attached 2 more backtraces from 2 more crash dumps. The same OST has crashed each of the past 3 times.&lt;/p&gt;</comment>
                            <comment id="203677" author="gerrit" created="Thu, 27 Jul 2017 09:11:58 +0000"  >&lt;p&gt;Fan Yong (fan.yong@intel.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/28249&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/28249&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9410&quot; title=&quot;on-disk bitmap corrupted&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9410&quot;&gt;&lt;del&gt;LU-9410&lt;/del&gt;&lt;/a&gt; ldiskfs: read bitmap with lock&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: acbca29e7b28dcad35acdc4d59e3fafc833a572c&lt;/p&gt;</comment>
                            <comment id="203678" author="yong.fan" created="Thu, 27 Jul 2017 09:14:46 +0000"  >&lt;p&gt;Mahmoud,&lt;/p&gt;

&lt;p&gt;I made a kernel patch (&lt;a href=&quot;https://review.whamcloud.com/28249&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/28249&lt;/a&gt;) based on el6.8 (2.6.32-642.15.1.el6) to resolve some in-RAM data corruption. Would you please try this patch?&lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;</comment>
                            <comment id="203722" author="mhanafi" created="Thu, 27 Jul 2017 18:13:40 +0000"  >&lt;p&gt;We deactivated the one OST that kept crashing and have been stable for 12 hours. We will give the patch a try.&lt;/p&gt;</comment>
                            <comment id="203729" author="jaylan" created="Thu, 27 Jul 2017 22:00:36 +0000"  >&lt;p&gt;I thought nasf wrote:&lt;br/&gt;
&quot;I just got the latest feedback from our partner minutes ago, they still hit the in-RAM bitmap corruption after applying such EXT4 patch. So we have to make new investigation. Sorry for that.&quot;&lt;/p&gt;</comment>
                            <comment id="203742" author="yong.fan" created="Fri, 28 Jul 2017 00:49:48 +0000"  >&lt;p&gt;One of our partners has ported the EXT4 patch by themselves; they finally told me that it does not resolve their issue, but without showing me their patch. My patch 28249 is NOT the ported EXT4 patch; it prevents bitmap readers from accessing the bitmap without locking the buffer head. It avoids the complex logic of the EXT4 patch and may cover more corner cases. So I hope the NASA site can try the patch. The side effect of patch 28249 is that it may affect performance a bit, but very little.&lt;/p&gt;</comment>
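A toy model of the idea described in the comment above: readers take the buffer-head lock across the check-and-initialize step, so they can never observe a half-initialized bitmap. This is an illustrative Python sketch with invented names, not the kernel code (which uses lock_buffer() on the bitmap's buffer head):

```python
import threading

class BufferHead:
    """Toy model of the buffer-head locking scheme patch 28249 describes:
    readers hold the lock while checking/filling the bitmap."""
    def __init__(self, size):
        self.lock = threading.Lock()
        self.bitmap = bytearray(size)
        self.uptodate = False

    def read_bitmap_locked(self, init_from_disk):
        # The reader holds the lock across the check-and-init, instead of
        # peeking at the bitmap while another thread is still filling it in.
        with self.lock:
            if not self.uptodate:
                self.bitmap[:] = init_from_disk()
                self.uptodate = True
            return bytes(self.bitmap)

bh = BufferHead(4096)
data = bh.read_bitmap_locked(lambda: b"\xff" * 4096)
print(data == b"\xff" * 4096)  # True
```

Without the lock, a second reader could copy the bitmap after allocation but before `uptodate` is set, which is the kind of transient in-RAM inconsistency the patch aims to close off.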
                            <comment id="203833" author="jaylan" created="Fri, 28 Jul 2017 21:28:18 +0000"  >&lt;p&gt;Thanks, nasf! We will give it a try.&lt;/p&gt;</comment>
                            <comment id="204974" author="mhanafi" created="Wed, 9 Aug 2017 23:32:34 +0000"  >&lt;p&gt;The patch provided did not help with the bitmap errors! Did the crash dump provide any helpful info?&lt;/p&gt;</comment>
                            <comment id="204979" author="yong.fan" created="Thu, 10 Aug 2017 00:20:44 +0000"  >&lt;p&gt;What is the storage (hardware vendor) you are using?&lt;/p&gt;</comment>
                            <comment id="204981" author="mhanafi" created="Thu, 10 Aug 2017 00:31:31 +0000"  >&lt;p&gt;NetApp&#160;E5500&lt;/p&gt;

&lt;p&gt;When we run fsck it will sometimes fix quota issues, but no bitmaps.&lt;/p&gt;</comment>
                            <comment id="204986" author="mhanafi" created="Thu, 10 Aug 2017 02:39:41 +0000"  >&lt;p&gt;One thing I noticed: during recovery, a client that was active on the OST got this error:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;[190767.968385] LustreError: 3655:0:(import.c:1261:ptlrpc_connect_interpret()) nbp2-OST0157_UUID went back in time (transno 17179894486 was previously committed, server now claims 12899642596)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="204996" author="yong.fan" created="Thu, 10 Aug 2017 11:20:13 +0000"  >&lt;p&gt;Mahmoud,&lt;/p&gt;

&lt;p&gt;I was told that your system may crash every few hours. I assume these are similar bitmap corruptions to the one you described in the ticket summary, right?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Apr  3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs error (device dm-42): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 245659 corrupted: 32768 blocks free in bitmap, 0 - in gd&lt;br/&gt;
Apr  3 18:38:16 nbp1-oss6 kernel: &lt;br/&gt;
Apr  3 18:38:16 nbp1-oss6 kernel: Aborting journal on device dm-3.&lt;br/&gt;
Apr  3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs (dm-42): Remounting filesystem read-only&lt;br/&gt;
Apr  3 18:38:16 nbp1-oss6 kernel: LDISKFS-fs error (device dm-42): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 245660 corrupted: 32768 blocks free in bitmap, 0 - in gd&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;And when you run &lt;tt&gt;e2fsck&lt;/tt&gt; after the corruption, &lt;tt&gt;e2fsck&lt;/tt&gt; will NOT report bitmap issues; instead, sometimes no inconsistency is found, and sometimes quota items are repaired, right? Have I missed anything else about your current trouble?&lt;/p&gt;</comment>
                            <comment id="204998" author="mhanafi" created="Thu, 10 Aug 2017 11:29:16 +0000"  >&lt;p&gt;That is correct. Most of the time an fsck will report something like this:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;nbp2-OST0118: nbp2-OST0118 contains a file system with errors, check forced.
nbp2-OST0118: nbp2-OST0118: 776296/74698752 files (20.6% non-contiguous), 2193846359/19122880512 blocks
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
</comment>
                            <comment id="205000" author="mhanafi" created="Thu, 10 Aug 2017 11:38:08 +0000"  >&lt;p&gt;Uploading lustre-log.1502322177.17192.txt.gz to the FTP site /uploads/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9410&quot; title=&quot;on-disk bitmap corrupted&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9410&quot;&gt;&lt;del&gt;LU-9410&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was taken at the time OST0157 hit the bitmap error, at 16:37:49 PDT.&lt;/p&gt;</comment>
                            <comment id="205019" author="ihara" created="Thu, 10 Aug 2017 15:16:36 +0000"  >&lt;p&gt;Please make sure ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.8.series really contains rhel6.6/ext4-corrupted-inode-block-bitmaps-handling-patches.patch. I think you might be missing that patch for the RHEL6.8 kernel.&lt;/p&gt;</comment>
                            <comment id="205024" author="mhanafi" created="Thu, 10 Aug 2017 15:35:58 +0000"  >&lt;p&gt;@shuichi Ihara,&lt;/p&gt;

&lt;p&gt;We looked at that patch and decided not to apply it because these don&apos;t appear to be real on-disk corruptions.&lt;/p&gt;</comment>
                            <comment id="205046" author="ihara" created="Thu, 10 Aug 2017 16:33:38 +0000"  >&lt;p&gt;Yes, that&apos;s true, but it still detects and prints the corruption, and you can run fsck in a maintenance window instead of going read-only immediately.&lt;/p&gt;</comment>
                            <comment id="205096" author="mhanafi" created="Thu, 10 Aug 2017 23:07:36 +0000"  >&lt;p&gt;1. Most of the OSTs that hit this bug have a flex block group size of 64, vs. others set to 256. The back-end RAID is set for a 1MB stripe size (8 data disks with 128MB per-disk stripe), and we pack all metadata blocks at the front of the LUN. Could this be a factor?&lt;/p&gt;

&lt;p&gt;2. Does fsck in fact check for bitmap corruption on disk? If we don&apos;t see it fixing anything, does that confirm that these are in-memory corruptions?&lt;/p&gt;

&lt;p&gt;3. If these are in-memory corruptions, can we get a debug patch that will re-read from disk before marking the bitmap as bad?&lt;/p&gt;

&lt;p&gt;4. Can you provide any other debug patch to help narrow the root cause?&lt;/p&gt;</comment>
                            <comment id="205143" author="yong.fan" created="Fri, 11 Aug 2017 13:12:44 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=mhanafi&quot; class=&quot;user-hover&quot; rel=&quot;mhanafi&quot;&gt;mhanafi&lt;/a&gt;, would you please show me the output of:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;dumpe2fs -f $OST_device
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Thanks!&lt;/p&gt;</comment>
                            <comment id="205166" author="mhanafi" created="Fri, 11 Aug 2017 17:25:31 +0000"  >&lt;p&gt;We had an OST go read-only today and I was able to gather some very useful info.&lt;/p&gt;

&lt;p&gt;I dumped /proc/fs/ldiskfs/dm-13/mb_groups; it had I/O errors for the block groups that the OST was complaining about.&lt;/p&gt;


&lt;p&gt;I dumped the block group info for the device with dumpe2fs, and it did in fact show block groups claiming 0 free blocks while still listing free blocks, like this one that triggered the OST going read-only:&lt;/p&gt;

&lt;p&gt; Group 314421: (Blocks 10302947328-10302980095) &lt;span class=&quot;error&quot;&gt;&amp;#91;INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &#160; Checksum 0x8d80, unused inodes 128&lt;br/&gt;
 &#160; Block bitmap at 369141 (bg #11 + 8693), Inode bitmap at 970965 (bg #29 + 20693)&lt;br/&gt;
 &#160; Inode table at 3773736-3773743 (bg #115 + 5416)&lt;br/&gt;
 &#160; 0 free blocks, 128 free inodes, 0 directories, 128 unused inodes&lt;br/&gt;
 &#160; Free blocks: 10302947328-10302980095&lt;/p&gt;

&lt;p&gt; I rechecked the block groups using dumpe2fs after the fsck, and it had fixed those groups.&lt;br/&gt;
 Group 314421: (Blocks 10302947328-10302980095) &lt;span class=&quot;error&quot;&gt;&amp;#91;INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &#160; Checksum 0x5b37, unused inodes 128&lt;br/&gt;
 &#160; Block bitmap at 369141 (bg #11 + 8693), Inode bitmap at 970965 (bg #29 + 20693)&lt;br/&gt;
 &#160; Inode table at 3773736-3773743 (bg #115 + 5416)&lt;br/&gt;
 &#160; 32768 free blocks, 128 free inodes, 0 directories, 128 unused inodes&lt;br/&gt;
 &#160; Free blocks: 10302947328-10302980095&lt;/p&gt;

&lt;p&gt;I will attach the full dumpe2fs output to the case.&lt;/p&gt;</comment>
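The inconsistency shown above can be spotted mechanically in dumpe2fs output: a group whose summary line claims 0 free blocks while its "Free blocks:" line lists a whole 32768-block extent. A small sketch, assuming the dumpe2fs text format quoted in this comment (single-extent free lists only):

```python
import re

GROUP_RE = re.compile(r"\s*Group (\d+):")
SUMMARY_RE = re.compile(r"\s*(\d+) free blocks,")
RANGE_RE = re.compile(r"\s*Free blocks:\s*(\d+)-(\d+)")

def find_inconsistent_groups(dumpe2fs_text):
    """Yield (group, claimed_free, listed_free) where the 'N free blocks'
    summary disagrees with the extent on the 'Free blocks:' line."""
    group = claimed = None
    for line in dumpe2fs_text.splitlines():
        if m := GROUP_RE.match(line):
            group, claimed = int(m.group(1)), None
        elif m := SUMMARY_RE.match(line):
            claimed = int(m.group(1))
        elif (m := RANGE_RE.match(line)) and group is not None and claimed is not None:
            listed = int(m.group(2)) - int(m.group(1)) + 1
            if listed != claimed:
                yield group, claimed, listed

sample = """Group 314421: (Blocks 10302947328-10302980095) [INODE_UNINIT, BLOCK_UNINIT]
   0 free blocks, 128 free inodes, 0 directories, 128 unused inodes
   Free blocks: 10302947328-10302980095
"""
print(list(find_inconsistent_groups(sample)))  # [(314421, 0, 32768)]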
                            <comment id="205173" author="gerrit" created="Fri, 11 Aug 2017 18:13:23 +0000"  >&lt;p&gt;Fan Yong (fan.yong@intel.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/28489&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/28489&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9410&quot; title=&quot;on-disk bitmap corrupted&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9410&quot;&gt;&lt;del&gt;LU-9410&lt;/del&gt;&lt;/a&gt; ldiskfs: enable mb debug&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: d4e2f024d91f375085c24906313b7bc522464c20&lt;/p&gt;</comment>
                            <comment id="205174" author="yong.fan" created="Fri, 11 Aug 2017 18:21:46 +0000"  >&lt;blockquote&gt;
&lt;p&gt;1. Most of the OSTs that hit this bug have a flex block group size of 64, vs. others set to 256. The back-end RAID is set for a 1MB stripe size (8 data disks with 128MB per-disk stripe), and we pack all metadata blocks at the front of the LUN. Could this be a factor?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;I am not sure for this.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;2. Does fsck in fact check for bitmap corruption on disk? If we don&apos;t see it fixing anything, does that confirm that these are in-memory corruptions?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;The &lt;tt&gt;e2fsck&lt;/tt&gt; will verify the free blocks/inodes counted from the bitmap against the value recorded in the group descriptor.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;3. If these are in-memory corruptions, can we get a debug patch that will re-read from disk before marking the bitmap as bad?&lt;/p&gt;

&lt;p&gt;4. Can you provide any other debug patch to help narrow the root cause?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;I made a debug patch (&lt;a href=&quot;https://review.whamcloud.com/28489&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/28489&lt;/a&gt;) with mb debug enabled; please apply it together with the former patch (&lt;a href=&quot;https://review.whamcloud.com/#/c/28249/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/28249/&lt;/a&gt;). Please NOTE: the mb debug switch is /sys/kernel/debug/ldiskfs/mballoc-debug on the server node. It is disabled by default with the value 0. Please set it to &apos;1&apos; before you mount the Lustre device.&lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;</comment>
                            <comment id="205204" author="bob.c" created="Fri, 11 Aug 2017 20:22:13 +0000"  >&lt;p&gt;We appear to be missing &lt;a href=&quot;https://review.whamcloud.com/#/c/16312/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/16312/&lt;/a&gt; (&lt;a href=&quot;https://review.whamcloud.com/#/c/16679/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/16679/&lt;/a&gt;) from the CentOS 6.8 version. Does this cause any issue with the debug patches or the recommended action?&lt;/p&gt;

&lt;p&gt;We are planning to run with 16312 and the suggested debug patches.&lt;/p&gt;

&lt;p&gt;For some reason this patch never made it into the series/6.8 &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/sad.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;

&lt;p&gt;Are we convinced this is a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7114&quot; title=&quot;ldiskfs: corrupted bitmaps handling patches&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7114&quot;&gt;&lt;del&gt;LU-7114&lt;/del&gt;&lt;/a&gt;?&lt;/p&gt;</comment>
                            <comment id="205232" author="mhanafi" created="Sat, 12 Aug 2017 01:03:07 +0000"  >&lt;p&gt;After the kernel and lustre rebuild/install we don&apos;t see /sys/kernel/debug/ldiskfs option.&lt;/p&gt;

&lt;p&gt;&#160;nbp2-oss1 /boot # cat config-2.6.32-642.15.1.el6.20170609.x86_64.lustre273 |grep CONFIG_EXT4_DEBUG&lt;br/&gt;
CONFIG_EXT4_DEBUG=y&lt;br/&gt;
nbp2-oss1 /boot # ls -l /sys/kernel/debug/&lt;br/&gt;
total 0&lt;/p&gt;</comment>
                            <comment id="205233" author="jaylan" created="Sat, 12 Aug 2017 01:05:29 +0000"  >&lt;p&gt;Kernel was rebuilt with CONFIG_EXT4_DEBUG on.&lt;/p&gt;</comment>
                            <comment id="205234" author="yong.fan" created="Sat, 12 Aug 2017 01:08:51 +0000"  >&lt;blockquote&gt;
&lt;p&gt;After the kernel and lustre rebuild/install we don&apos;t see /sys/kernel/debug/ldiskfs option.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;What is the output of:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;find /proc /sys -name mballoc-debug
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="205235" author="yong.fan" created="Sat, 12 Aug 2017 01:13:11 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Kernel was rebuilt with CONFIG_EXT4_DEBUG on.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Sorry, in my test CONFIG_EXT4_DEBUG was disabled by default, so I thought NASA might have it disabled as well. To avoid rebuilding the kernel, I made the patch remove the conditional compilation for mb debug. I also added more debugging for the case of initializing a BG block bitmap.&lt;/p&gt;</comment>
                            <comment id="205236" author="mhanafi" created="Sat, 12 Aug 2017 01:16:00 +0000"  >&lt;p&gt;So should we get the debugging without the kernel recompile?&lt;/p&gt;</comment>
                            <comment id="205237" author="jaylan" created="Sat, 12 Aug 2017 01:20:00 +0000"  >&lt;p&gt;The kernel was originally built with CONFIG_EXT4_DEBUG disabled and lustre server rpms were built for that.&lt;br/&gt;
Mahmoud told me that he did not see /sys/kernel/debug/ldiskfs option, then I rebuilt the kernel and the lustre.&lt;/p&gt;

&lt;p&gt;So, we have both available for testing. If CONFIG_EXT4_DEBUG is not needed in the kernel, how can we enable the mb-debug?&lt;/p&gt;</comment>
                            <comment id="205238" author="yong.fan" created="Sat, 12 Aug 2017 01:26:06 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Do we should get the debugging without the kernel recompile?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;With my patch applied, you do NOT need to recompile the kernel. In fact, CONFIG_EXT4_DEBUG is almost redundant, since we can control the mb debug via the debug level.&lt;br/&gt;
Considering the log when the corruption happened (&quot;32768 blocks free in bitmap&quot;), it seems that the BG was initialized by the logic instead of being loaded from disk. To some degree, that can explain the in-RAM corruption. So I added more debug information in my patch. I think it is worth trying.&lt;/p&gt;</comment>
                            <comment id="205239" author="yong.fan" created="Sat, 12 Aug 2017 01:37:23 +0000"  >&lt;blockquote&gt;
&lt;p&gt;we appear to be missing &lt;a href=&quot;https://review.whamcloud.com/#/c/16312/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/16312/&lt;/a&gt; (&lt;a href=&quot;https://review.whamcloud.com/#/c/16679/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/16679/&lt;/a&gt;) from 6.8 cent version. Does this cause any issue with debug patches or recommended action?&lt;/p&gt;

&lt;p&gt;We are planning to run with 16312 and suggested debug patches&lt;/p&gt;

&lt;p&gt;For some reason this patch never made it into the series/6.8.&lt;/p&gt;

&lt;p&gt;Are we convinced this is a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7114&quot; title=&quot;ldiskfs: corrupted bitmaps handling patches&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7114&quot;&gt;&lt;del&gt;LU-7114&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;It is true that we missed that patch; Andreas pointed it out in the first comment:&lt;br/&gt;
&lt;a href=&quot;https://jira.hpdd.intel.com/browse/LU-9410?focusedCommentId=193803&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-193803&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://jira.hpdd.intel.com/browse/LU-9410?focusedCommentId=193803&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-193803&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But this patch is mostly used for handling the case after the bitmap corruption has happened. It allows the system to go ahead without failing right away, so that users can run &lt;tt&gt;e2fsck&lt;/tt&gt; during a maintenance window. As &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=mhanafi&quot; class=&quot;user-hover&quot; rel=&quot;mhanafi&quot;&gt;mhanafi&lt;/a&gt; commented:&lt;br/&gt;
&lt;a href=&quot;https://jira.hpdd.intel.com/browse/LU-9410?focusedCommentId=205024&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-205024&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://jira.hpdd.intel.com/browse/LU-9410?focusedCommentId=205024&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-205024&lt;/a&gt;, it may not help much for the NASA case.&lt;/p&gt;</comment>
                            <comment id="205240" author="jaylan" created="Sat, 12 Aug 2017 01:51:12 +0000"  >&lt;p&gt;OK, running a kernel without CONFIG_EXT4_DEBUG and lustre with your patches, how do we enable debugging if we do not see /sys/kernel/debug/ldiskfs/? Please elaborate. Thanks.&lt;/p&gt;</comment>
                            <comment id="205241" author="yong.fan" created="Sat, 12 Aug 2017 02:02:10 +0000"  >&lt;p&gt;What is the output once ldiskfs.ko is loaded (insmod):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;find /proc /sys -name mballoc-debug
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="205242" author="mhanafi" created="Sat, 12 Aug 2017 02:09:02 +0000"  >&lt;p&gt;&lt;tt&gt;find /proc /sys -name mballoc-debug&lt;/tt&gt; has no output.&lt;/p&gt;</comment>
                            <comment id="205243" author="yong.fan" created="Sat, 12 Aug 2017 02:18:09 +0000"  >&lt;p&gt;Please attach the source file ldiskfs/mballoc.c, you can find it in your compile directory. Thanks!&lt;/p&gt;</comment>
                            <comment id="205244" author="jaylan" created="Sat, 12 Aug 2017 02:26:05 +0000"  >&lt;p&gt;mballoc.c attached.&lt;/p&gt;</comment>
                            <comment id="205246" author="yong.fan" created="Sat, 12 Aug 2017 02:56:53 +0000"  >&lt;p&gt;&lt;a href=&quot;https://review.whamcloud.com/28489&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/28489&lt;/a&gt; is refreshed, please try again. Thanks!&lt;/p&gt;</comment>
                            <comment id="205251" author="mhanafi" created="Sun, 13 Aug 2017 00:06:25 +0000"  >&lt;p&gt;We haven&apos;t put debug patch 28489 in place yet, but we are now running with the &quot;&lt;del&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7114&quot; title=&quot;ldiskfs: corrupted bitmaps handling patches&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7114&quot;&gt;&lt;del&gt;LU-7114&lt;/del&gt;&lt;/a&gt;&lt;/del&gt;&quot; patch. It has already found bitmap errors.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Aug 12 01:05:43 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:43 nbp2-oss20 kernel: 
Aug 12 01:05:43 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 275790
Aug 12 01:05:43 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:43 nbp2-oss20 kernel: 
Aug 12 01:05:43 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 275790
Aug 12 01:05:44 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:45 nbp2-oss20 kernel: 
Aug 12 01:05:45 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 275790
Aug 12 01:05:45 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:45 nbp2-oss20 kernel: 
Aug 12 01:05:45 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 275790
Aug 12 01:05:46 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:47 nbp2-oss20 kernel: 
Aug 12 01:05:47 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 275790
Aug 12 01:05:47 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:47 nbp2-oss20 kernel: 
Aug 12 01:05:47 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 275790
Aug 12 01:05:49 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:50 nbp2-oss20 kernel: 
Aug 12 01:05:50 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 275790
Aug 12 01:05:50 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:50 nbp2-oss20 kernel: 
Aug 12 01:05:50 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 275790
Aug 12 01:05:53 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:54 nbp2-oss20 kernel: 
Aug 12 01:05:54 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 275790
Aug 12 01:05:54 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:54 nbp2-oss20 kernel: 
Aug 12 01:05:54 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 275790
Aug 12 01:05:59 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:59 nbp2-oss20 kernel: 
Aug 12 01:05:59 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 275790
Aug 12 01:05:59 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:05:59 nbp2-oss20 kernel: 
Aug 12 01:05:59 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 275790
Aug 12 01:06:05 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:06:05 nbp2-oss20 kernel: 
Aug 12 01:06:05 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 275790
Aug 12 01:06:05 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 01:06:05 nbp2-oss20 kernel: 
Aug 12 01:06:05 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 275790
Aug 12 01:06:12 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 275790 corrupted: 32768 blocks free in bitmap, 0 - in gd




&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Some time later&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Aug 12 04:05:12 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 276684 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 04:05:12 nbp2-oss20 kernel: 
Aug 12 04:05:12 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 276685 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 04:07:56 nbp2-oss20 pcp-pmie[5801]: High 1-minute load average 354load@nbp2-oss20
Aug 12 04:07:56 nbp2-oss20 - in gd
Aug 12 04:07:56 nbp2-oss20 kernel: 
Aug 12 04:07:56 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 304861 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 04:07:56 nbp2-oss20 kernel: 
Aug 12 04:07:56 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 304862 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 04:07:56 nbp2-oss20 kernel: 
Aug 12 04:07:56 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 304863 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 04:07:56 nbp2-oss20 kernel: 
Aug 12 04:07:56 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 304864 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 12 04:07:56 nbp2-oss20 kernel: 
.....
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;font color=&quot;#ff0000&quot;&gt;&lt;b&gt;It has marked 6727 unique groups as bad for dm-21 (ost319)&lt;/b&gt;&lt;/font&gt;&lt;/p&gt;</comment>
                            <comment id="205252" author="yong.fan" created="Sun, 13 Aug 2017 02:01:48 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7114&quot; title=&quot;ldiskfs: corrupted bitmaps handling patches&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7114&quot;&gt;&lt;del&gt;LU-7114&lt;/del&gt;&lt;/a&gt; will allow the system to go ahead without failing right away when it finds a corrupted bitmap, but the corruption is still there. I would suggest applying the patch &lt;a href=&quot;https://review.whamcloud.com/#/c/28489/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/28489/&lt;/a&gt;; it will give us more information about the mballoc operation trace.&lt;/p&gt;</comment>
                            <comment id="205371" author="mhanafi" created="Mon, 14 Aug 2017 20:00:35 +0000"  >&lt;p&gt;With the new build, are we supposed to have mballoc-debug in /proc or /sys?&lt;/p&gt;

&lt;p&gt;The find doesn&apos;t find anything.&lt;/p&gt;

&lt;p&gt;Never mind, I figured it out. We need to mount debugfs for it to show up.&lt;/p&gt;</comment>
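The point above (the mballoc-debug file only appears once debugfs itself is mounted) can be sketched with a small, purely illustrative helper; this is not part of Lustre, and the function name and sample data are hypothetical. It parses /proc/mounts content to find where debugfs is mounted.

```python
from typing import Optional

def debugfs_mountpoint(mounts_text: str) -> Optional[str]:
    """Return the debugfs mountpoint found in /proc/mounts content, or None.

    Each /proc/mounts line is: <source> <mountpoint> <fstype> <options> ...
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "debugfs":
            return fields[1]
    return None

# Hypothetical /proc/mounts content for demonstration:
sample = (
    "sysfs /sys sysfs rw,relatime 0 0\n"
    "debugfs /sys/kernel/debug debugfs rw,relatime 0 0\n"
)
assert debugfs_mountpoint(sample) == "/sys/kernel/debug"
assert debugfs_mountpoint("proc /proc proc rw 0 0\n") is None
```

If nothing is found, `mount -t debugfs none /sys/kernel/debug` makes the tree appear, after which the ldiskfs entries become visible (assuming a debug-enabled module is loaded).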
                            <comment id="205393" author="mhanafi" created="Tue, 15 Aug 2017 02:35:26 +0000"  >&lt;p&gt;Got block group debug logs with the corruption. The block group is #270808. I will attach the full log file (syslog.gp270808.error.gz) to the ticket.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1103): ldiskfs_mb_load_buddy: load group 270808
Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1032): ldiskfs_mb_init_group: init group 270808
Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/balloc.c, 179): ldiskfs_init_block_bitmap: #24877: init the group 270808 of total groups 583584: group_blocks 32768, free_blocks 32768, free_blocks_in_gdp 0, ret 32768
Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 927): ldiskfs_mb_init_cache: put bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 270808 in page 541616/0
Aug 14 18:37:14 nbp2-oss20 kernel: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 270808 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 14 18:37:14 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 270808
Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1103): ldiskfs_mb_load_buddy: load group 270808
Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1032): ldiskfs_mb_init_group: init group 270808
Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 927): ldiskfs_mb_init_cache: put bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 270808 in page 541616/0
Aug 14 18:37:14 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 270808 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 14 18:37:14 nbp2-oss20 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_mb_load_buddy: Error in loading buddy information &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 270808
Aug 14 18:37:15 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1103): ldiskfs_mb_load_buddy: load group 270808
Aug 14 18:37:15 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1032): ldiskfs_mb_init_group: init group 270808
Aug 14 18:37:15 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 927): ldiskfs_mb_init_cache: put bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 270808 in page 541616/0
Aug 14 18:37:15 nbp2-oss20 kernel: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 270808 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 14 18:37:15 nbp2-oss20 kernel: Error in loading buddy information &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 270808
Aug 14 18:37:15 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1103): ldiskfs_mb_load_buddy: load group 270808
Aug 14 18:37:15 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1032): ldiskfs_mb_init_group: init group 270808
Aug 14 18:37:15 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 927): ldiskfs_mb_init_cache: put bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 270808 in page 541616/0
Aug 14 18:37:15 nbp2-oss20 kernel: on-disk bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 270808 corrupted: 32768 blocks free in bitmap, 0 - in gd
Aug 14 18:37:15 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1103): ldiskfs_mb_load_buddy: Error in loading buddy information &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 270808
Aug 14 18:37:17 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1103): ldiskfs_mb_load_buddy: load group 270808
Aug 14 18:37:17 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 1032): ldiskfs_mb_init_group: init group 270808
Aug 14 18:37:17 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/mballoc.c, 927): ldiskfs_mb_init_cache: put bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 270808 in page 541616/0

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
</comment>
                            <comment id="205406" author="gerrit" created="Tue, 15 Aug 2017 08:53:12 +0000"  >&lt;p&gt;Fan Yong (fan.yong@intel.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/28550&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/28550&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9410&quot; title=&quot;on-disk bitmap corrupted&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9410&quot;&gt;&lt;del&gt;LU-9410&lt;/del&gt;&lt;/a&gt; ldiskfs: handle unmatched bitmap&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 0a4199ad21c5ac23a4a4e7e07847610ad8ec7994&lt;/p&gt;</comment>
                            <comment id="205408" author="yong.fan" created="Tue, 15 Aug 2017 09:07:38 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Aug 14 18:37:14 nbp2-oss20 kernel: (/tmp/rpmbuild-lustre-jlan-PYDDD1xV/BUILD/lustre-2.7.3/ldiskfs/balloc.c, 179): ldiskfs_init_block_bitmap: #24877: init the group 270808 of total groups 583584: group_blocks 32768, free_blocks 32768, free_blocks_in_gdp 0, ret 32768&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;The log shows that ldiskfs_init_block_bitmap() initialized the bitmap, but the free-blocks count in the group descriptor is still zero, which caused the subsequent ldiskfs_mb_check_ondisk_bitmap() failure. Currently I cannot say it is corruption; it looks more like a logic issue. The patch sets the free-block count based on the real free bits in the bitmap. It may not be the perfect solution, but we can try it and see whether it resolves your trouble.&lt;/p&gt;</comment>
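The mismatch described above can be modeled outside the kernel. The sketch below (plain Python with hypothetical names, not the ldiskfs code itself) mirrors what ldiskfs_mb_check_ondisk_bitmap() is reported to do: count the free bits in the on-disk block bitmap and compare against the group descriptor's free-block count. A freshly initialized, all-free bitmap against a descriptor that still says 0 reproduces the "32768 blocks free in bitmap, 0 - in gd" message from the logs.

```python
# Illustrative model of the bitmap/descriptor consistency check;
# names and structure are assumptions, not kernel code.
BLOCKS_PER_GROUP = 32768  # 128 MiB groups with 4 KiB blocks

def free_blocks_in_bitmap(bitmap: bytes) -> int:
    """Count zero bits (free blocks) in a block bitmap; a set bit = used."""
    used = sum(bin(byte).count("1") for byte in bitmap)
    return len(bitmap) * 8 - used

def check_ondisk_bitmap(bitmap: bytes, gd_free_blocks: int) -> bool:
    """Return True if the bitmap and the group descriptor agree."""
    free = free_blocks_in_bitmap(bitmap)
    if free != gd_free_blocks:
        print(f"on-disk bitmap corrupted: {free} blocks free in bitmap, "
              f"{gd_free_blocks} - in gd")
        return False
    return True

# The failing case from the logs: bitmap initialized as all-free
# (32768 free blocks) while the descriptor still records 0 free.
all_free = bytes(BLOCKS_PER_GROUP // 8)   # all bits zero => all blocks free
assert not check_ondisk_bitmap(all_free, 0)
assert check_ondisk_bitmap(all_free, BLOCKS_PER_GROUP)
```

The fix Fan Yong describes corresponds to recomputing gd_free_blocks from the bitmap itself when the two disagree at initialization time.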
                            <comment id="205460" author="mhanafi" created="Tue, 15 Aug 2017 19:42:48 +0000"  >&lt;p&gt;I used systemtap to catch one of these bad groups and dump out the ldiskfs_group_desc struct.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;mballoc.c:826: first_group: 274007 bg_free_blocks_count_hi: 0 bg_block_bitmap_hi: 0 bg_free_blocks_count_lo: 0
mballoc.c:826:$desc {.bg_block_bitmap_lo=328727, .bg_inode_bitmap_lo=930551, .bg_inode_table_lo=3450424, .bg_free_blocks_count_lo=0, .bg_free_inodes_count_lo=128, .bg_used_dirs_count_lo=0, .bg_flags=7, .bg_reserved=[...], .bg_itable_unused_lo=128, .bg_checksum=55256, .bg_block_bitmap_hi=0, .bg_inode_bitmap_hi=0, .bg_inode_table_hi=0, .bg_free_blocks_count_hi=0, .bg_free_inodes_count_hi=0, .bg_used_dirs_count_hi=0, .bg_itable_unused_hi=0, .bg_reserved2=[...]}


&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;It also seems odd that dumpe2fs can produce different results for unused block groups. Sometimes it will show block_bitmap!=free_blocks and other times it will be OK.&lt;/p&gt;

&lt;p&gt;&#160;---&lt;/p&gt;

&lt;p&gt;In ldiskfs_valid_block_bitmap() I don&apos;t understand this:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt; &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (LDISKFS_HAS_INCOMPAT_FEATURE(sb, LDISKFS_FEATURE_INCOMPAT_FLEX_BG)) {
 /* with FLEX_BG, the inode/block bitmaps and itable
 * blocks may not be in the group at all
 * so the bitmap validation will be skipped &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; those groups
 * or it has to also read the block group where the bitmaps
 * are located to verify they are set.
 */
 &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; 1;
 }

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;We have flex_bg enabled; would this apply to us?&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;For the OSTs that are prone to the bitmap errors, running &lt;tt&gt;cat /proc/fs/ldiskfs/dm*/mb_groups&lt;/tt&gt; will reproduce the errors.&lt;/p&gt;</comment>
                            <comment id="205481" author="mhanafi" created="Wed, 16 Aug 2017 02:20:38 +0000"  >&lt;p&gt;Applied the new patch. After a full fsck, mounting the OSTs resulted in this many block groups getting corrected:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;----------------
service603
----------------
 4549 dm-33):

----------------
service604
----------------
 4425 dm-32):

----------------
service606
----------------
 4658 dm-29):

----------------
service610
----------------
 4631 dm-33):

----------------
service611
----------------
 4616 dm-28):

----------------
service616
----------------
 4652 dm-35):

----------------
service617
----------------
 4501 dm-21):

----------------
service619
----------------
 4657 dm-25):

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We need to rate-limit the warnings.&lt;/p&gt;</comment>
                            <comment id="205493" author="yong.fan" created="Wed, 16 Aug 2017 11:08:59 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=mhanafi&quot; class=&quot;user-hover&quot; rel=&quot;mhanafi&quot;&gt;mhanafi&lt;/a&gt;&lt;br/&gt;
It looks different from the original one; would you please show me more logs (dmesg, /var/log/messages) about the latest corruption? Is the system still accessible after the above warning?&lt;/p&gt;</comment>
                            <comment id="205495" author="mhanafi" created="Wed, 16 Aug 2017 13:07:51 +0000"  >&lt;p&gt;Here is part of dmesg. The high rate of messages caused the root drive&apos;s SCSI device to reset, but all but one server recovered. I had to turn the printk log level down to get the last one to recover.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;LDISKFS-fs warning (device dm-33): ldiskfs_init_block_bitmap: Set free blocks as 32768 &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 262310

LDISKFS-fs warning (device dm-33): ldiskfs_init_block_bitmap: Set free blocks as 32768 &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 262311

LDISKFS-fs warning (device dm-33): ldiskfs_init_block_bitmap: Set free blocks as 32768 &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 262312

LDISKFS-fs warning (device dm-33): ldiskfs_init_block_bitmap: Set free blocks as 32768 &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 262313

LDISKFS-fs warning (device dm-33): ldiskfs_init_block_bitmap: Set free blocks as 32768 &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; group 262314
LNet: 12178:0:(lib-move.c:1487:lnet_parse_put()) Dropping PUT from 12345-10.149.2.156@o2ib313 portal 28 match 1575300167923792 offset 0 length 520: 4
LNet: 12178:0:(lib-move.c:1487:lnet_parse_put()) Skipped 978380 previous similar messages
sd 0:0:1:0: attempting task abort! scmd(ffff880af433e0c0)
sd 0:0:1:0: [sdb] CDB: Write(10): 2a 00 00 a0 08 08 00 00 08 00
scsi target0:0:1: handle(0x000a), sas_address(0x4433221102000000), phy(2)
scsi target0:0:1: enclosure_logical_id(0x50030480198f7e01), slot(2)
scsi target0:0:1: enclosure level(0x0000),connector name(    ^C)
sd 0:0:1:0: task abort: SUCCESS scmd(ffff880af433e0c0)
sd 0:0:1:0: attempting task abort! scmd(ffff880a64ab46c0)
sd 0:0:1:0: [sdb] CDB: Write(10): 2a 00 00 e0 08 08 00 00 08 00
scsi target0:0:1: handle(0x000a), sas_address(0x4433221102000000), phy(2)
scsi target0:0:1: enclosure_logical_id(0x50030480198f7e01), slot(2)
scsi target0:0:1: enclosure level(0x0000),connector name(    ^C)
sd 0:0:1:0: task abort: SUCCESS scmd(ffff880a64ab46c0)
sd 0:0:1:0: attempting task abort! scmd(ffff880b21cec180)
sd 0:0:1:0: [sdb] CDB: Write(10): 2a 00 00 c0 08 08 00 00 08 00
scsi target0:0:1: handle(0x000a), sas_address(0x4433221102000000), phy(2)
DISKFS-fs (dm-23): mounted filesystem with ordered data mode. quota=on. Opts: 
LDISKFS-fs (dm-34): mounted filesystem with ordered data mode. quota=on. Opts: 
mounted filesystem with ordered data mode. quota=on. Opts: 
LDISKFS-fs (dm-29): mounted filesystem with ordered data mode. quota=on. Opts: 

LDISKFS-fs (dm-18): mounted filesystem with ordered data mode. quota=on. Opts: 
Lustre: nbp2-OST0081: Not available &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; connect from 10.151.43.107@o2ib (not set up)
Lustre: Skipped 3 previous similar messages
Lustre: nbp2-OST0081: Not available &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; connect from 10.151.29.130@o2ib (not set up)
Lustre: Skipped 113 previous similar messages
Lustre: nbp2-OST0081: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
Lustre: nbp2-OST0081: Will be in recovery &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; at least 2:30, or until 14441 clients reconnect
Lustre: nbp2-OST0081: Denying connection &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;new&lt;/span&gt; client 35b99837-9505-fc4d-270f-f2d1ca30372d (at 10.151.30.176@o2ib), waiting &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; all 14441 known clients (44 recovered, 1 in progress, and 0 evicted) to recover in 5:10


&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Here is /var/log/messages&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Aug 11 17:58:25 nbp2-oss10 kernel: LNet: 12075:0:(lib-move.c:1487:lnet_parse_put()) Dropping PUT from 12345-10.151.30.120@o2ib portal 28 match 1575477031778096 offset 0 length 520: 4
Aug 11 17:58:25 nbp2-oss10 kernel: LNet: 12075:0:(lib-move.c:1487:lnet_parse_put()) Skipped 1037319 previous similar messages
Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-30):
Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-28): mounted filesystem with ordered data mode. quota=on. Opts:
Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-31): mounted filesystem with ordered data mode. quota=on. Opts:
Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-18): mounted filesystem with ordered data mode. quota=on. Opts:
Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-21): mounted filesystem with ordered data mode. quota=on. Opts:
Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-19): mounted filesystem with ordered data mode. quota=on. Opts:
Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-22): mounted filesystem with ordered data mode. quota=on. Opts:
Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-20): mounted filesystem with ordered data mode. quota=on. Opts:
Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-26): mounted filesystem with ordered data mode. quota=on. Opts:
Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-33): mounted filesystem with ordered data mode. quota=on. Opts:
Aug 11 18:03:35 nbp2-oss10 kernel: mounted filesystem with ordered data mode. quota=on. Opts:
Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-23): mounted filesystem with ordered data mode. quota=on. Opts:
Aug 11 18:03:35 nbp2-oss10 kernel: LDISKFS-fs (dm-32): mounted filesystem with ordered data mode. quota=on. Opts:
Aug 11 18:03:40 nbp2-oss10 kernel: LDISKFS-fs (dm-34): mounted filesystem with ordered data mode. quota=on. Opts:
Aug 11 18:03:40 nbp2-oss10 kernel: LDISKFS-fs (dm-24): mounted filesystem with ordered data mode. quota=on. Opts:
Aug 11 18:03:40 nbp2-oss10 kernel: LDISKFS-fs (dm-25): mounted filesystem with ordered data mode. quota=on. Opts:
Aug 11 18:03:40 nbp2-oss10 kernel: 
Aug 11 18:03:41 nbp2-oss10 kernel: LDISKFS-fs (dm-29):
Aug 11 18:03:41 nbp2-oss10 kernel: LDISKFS-fs (dm-35): mounted filesystem with ordered data mode. quota=on. Opts:
Aug 11 18:03:41 nbp2-oss10 kernel: mounted filesystem with ordered data mode. quota=on. Opts:
Aug 11 18:03:49 nbp2-oss10 kernel: LDISKFS-fs (dm-27): mounted filesystem with ordered data mode. quota=on. Opts:
Aug 11 18:03:50 nbp2-oss10 kernel: LustreError: 137-5: nbp2-OST0009_UUID: not available &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; connect from 10.151.50.143@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Aug 11 18:03:50 nbp2-oss10 kernel: LustreError: Skipped 314 previous similar messages
Aug 11 18:03:51 nbp2-oss10 kernel: Lustre: nbp2-OST00d1: Not available &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; connect from 10.151.9.177@o2ib (not set up)
Aug 11 18:03:51 nbp2-oss10 kernel: Lustre: Skipped 11 previous similar messages
Aug 11 18:03:51 nbp2-oss10 kernel: LustreError: 137-5: nbp2-OST0009_UUID: not available &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; connect from 10.151.8.85@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Aug 11 18:03:51 nbp2-oss10 kernel: LustreError: Skipped 3632 previous similar messages
Aug 11 18:03:51 nbp2-oss10 kernel: Lustre: nbp2-OST00d1: Not available &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; connect from 10.151.50.241@o2ib (not set up)
Aug 11 18:03:51 nbp2-oss10 kernel: Lustre: Skipped 180 previous similar messages
Aug 11 18:03:52 nbp2-oss10 kernel: LustreError: 137-5: nbp2-OST0135_UUID: not available &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; connect from 10.151.48.113@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Aug 11 18:03:52 nbp2-oss10 kernel: LustreError: Skipped 6273 previous similar messages
Aug 11 18:03:52 nbp2-oss10 kernel: Lustre: nbp2-OST00d1: Not available &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; connect from 10.151.7.158@o2ib (not set up)
Aug 11 18:03:52 nbp2-oss10 kernel: Lustre: Skipped 402 previous similar messages
Aug 11 18:03:52 nbp2-oss10 kernel: Lustre: nbp2-OST00d1: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
Aug 11 18:03:52 nbp2-oss10 kernel: Lustre: nbp2-OST00d1: Will be in recovery &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; at least 2:30, or until 14452 clients reconnect

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="205496" author="gerrit" created="Wed, 16 Aug 2017 13:18:45 +0000"  >&lt;p&gt;Fan Yong (fan.yong@intel.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/28566&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/28566&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9410&quot; title=&quot;on-disk bitmap corrupted&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9410&quot;&gt;&lt;del&gt;LU-9410&lt;/del&gt;&lt;/a&gt; ldiskfs: no check mb bitmap if flex_bg enabled&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 8332a30959750c603bc572db1fcde8bc92f82a40&lt;/p&gt;</comment>
                            <comment id="205605" author="yong.fan" created="Thu, 17 Aug 2017 14:03:04 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=mhanafi&quot; class=&quot;user-hover&quot; rel=&quot;mhanafi&quot;&gt;mhanafi&lt;/a&gt;, I have to say that this issue may be related to the improper bitmap consistency verification in our ldiskfs patch, which does not handle the flex_bg case. I made a patch &lt;a href=&quot;https://review.whamcloud.com/28566&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/28566&lt;/a&gt; to handle the related issues. Would you please try it (the other former patches are not needed)? Thanks!&lt;/p&gt;</comment>
                            <comment id="205659" author="jaylan" created="Thu, 17 Aug 2017 18:41:56 +0000"  >&lt;p&gt;I did a build with #28566 and #28550 yesterday. For testing purposes, do these two conflict?&lt;br/&gt;
I will undo #28550, but if the two do not collide, we can test with the builds I did yesterday.&lt;/p&gt;</comment>

&lt;p&gt;Never mind. I just did another build with #28550 pulled out.&lt;/p&gt;</comment>
                            <comment id="205683" author="mhanafi" created="Thu, 17 Aug 2017 22:32:59 +0000"  >&lt;p&gt;The filesystem is stable with the workaround patch (&lt;a href=&quot;https://review.whamcloud.com/#/c/28489/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;/28489/&lt;/a&gt;). Can we run with this patch for some time without any underlying filesystem issues? Or should we replace it with 28566 ASAP?&lt;/p&gt;</comment>
                            <comment id="205710" author="yong.fan" created="Fri, 18 Aug 2017 00:47:51 +0000"  >&lt;p&gt;The patch 28550 takes effect before 28566, so if 28550 is applied, then 28566 is meaningless. But 28550 may do more than the necessary fixes, and I am afraid of some potential side-effects.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;The filesystem is stable with the workaround patch (/28489/). Can we run with this patch for some time without any underlying filesystem issues? Or should we replace it with 28566 ASAP?
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;It is interesting to know that. Because 28489 is just a debug patch, I cannot imagine how it can resolve your  issue. It may because your system has jumped over the groups with &quot;BLOCK_UNINIT&quot; flag and zero free blocks in GDP. If it is true, then applying 28566 will not show you more benefit. Since your system is stable running, you can replace the patches with 28566 when it &apos;corrupted&apos; next time.&lt;/p&gt;</comment>
                            <comment id="205711" author="mhanafi" created="Fri, 18 Aug 2017 01:07:45 +0000"  >&lt;p&gt;Sorry, I mistyped the patch number. I meant to say it is stable with 28550.&lt;/p&gt;</comment>
                            <comment id="205713" author="yong.fan" created="Fri, 18 Aug 2017 01:52:39 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Sorry, I mistyped the patch number. I meant to say it is stable with 28550.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Then it is reasonable. As I explained above, 28550 may do more than the necessary fixes. But since it runs stably, you can keep it until the next &apos;corruption&apos;.&lt;/p&gt;</comment>
                            <comment id="206062" author="mhanafi" created="Tue, 22 Aug 2017 20:23:59 +0000"  >&lt;p&gt;Update: we applied &lt;a href=&quot;https://review.whamcloud.com/28566&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/28566&lt;/a&gt; on Friday and the filesystem has been stable since.&lt;/p&gt;</comment>
                            <comment id="206112" author="yong.fan" created="Wed, 23 Aug 2017 08:07:51 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=mhanafi&quot; class=&quot;user-hover&quot; rel=&quot;mhanafi&quot;&gt;mhanafi&lt;/a&gt; Thanks for the update.&lt;/p&gt;</comment>
                            <comment id="206156" author="mhanafi" created="Wed, 23 Aug 2017 16:39:04 +0000"  >&lt;p&gt;Does this patch require any changes to e2fsck?&lt;/p&gt;</comment>
                            <comment id="206158" author="yong.fan" created="Wed, 23 Aug 2017 16:50:24 +0000"  >&lt;p&gt;I think there may be something that can be improved in mke2fs, not e2fsck.&lt;/p&gt;</comment>
                            <comment id="206330" author="jaylan" created="Thu, 24 Aug 2017 20:28:03 +0000"  >&lt;p&gt;Do I need this patch for 2.10.0?&lt;/p&gt;</comment>
                            <comment id="206376" author="yong.fan" created="Fri, 25 Aug 2017 01:24:32 +0000"  >&lt;p&gt;Yes, master also needs the patch 28566.&lt;/p&gt;</comment>
                            <comment id="206539" author="gerrit" created="Mon, 28 Aug 2017 06:25:08 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/28566/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/28566/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9410&quot; title=&quot;on-disk bitmap corrupted&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9410&quot;&gt;&lt;del&gt;LU-9410&lt;/del&gt;&lt;/a&gt; ldiskfs: no check mb bitmap if flex_bg enabled&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 5506c15a65b3eebb9f15000105e6eb7c02742a10&lt;/p&gt;</comment>
                            <comment id="206690" author="gerrit" created="Mon, 28 Aug 2017 18:28:42 +0000"  >&lt;p&gt;Minh Diep (minh.diep@intel.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/28765&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/28765&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9410&quot; title=&quot;on-disk bitmap corrupted&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9410&quot;&gt;&lt;del&gt;LU-9410&lt;/del&gt;&lt;/a&gt; ldiskfs: no check mb bitmap if flex_bg enabled&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_10&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 53d836b1e5d255558639fe8e4eae78a87a176d04&lt;/p&gt;</comment>
                            <comment id="207659" author="gerrit" created="Wed, 6 Sep 2017 17:01:03 +0000"  >&lt;p&gt;John L. Hammond (john.hammond@intel.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/28765/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/28765/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9410&quot; title=&quot;on-disk bitmap corrupted&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9410&quot;&gt;&lt;del&gt;LU-9410&lt;/del&gt;&lt;/a&gt; ldiskfs: no check mb bitmap if flex_bg enabled&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_10&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 27f5b8b16416b04a561d0b0121860e2a5188be4a&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                            <outwardlinks description="duplicates">
                                        <issuelink>
            <issuekey id="12976">LU-1026</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="31980">LU-7114</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to">
                                                        </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="51472">LU-10837</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="27846" name="bt.2017-07-26-02.48.00" size="783382" author="mhanafi" created="Wed, 26 Jul 2017 20:25:06 +0000"/>
                            <attachment id="27847" name="bt.2017-07-26-12.08.43" size="827901" author="mhanafi" created="Wed, 26 Jul 2017 20:25:06 +0000"/>
                            <attachment id="27838" name="foreach.out" size="753222" author="mhanafi" created="Wed, 26 Jul 2017 04:00:57 +0000"/>
                            <attachment id="27993" name="mballoc.c" size="148633" author="jaylan" created="Sat, 12 Aug 2017 02:25:28 +0000"/>
                            <attachment id="27990" name="ost258.dumpe2fs.after.fsck.gz" size="36138369" author="mhanafi" created="Fri, 11 Aug 2017 17:27:22 +0000"/>
                            <attachment id="27991" name="ost258.dumpe2fs.after.readonly.gz" size="36114533" author="mhanafi" created="Fri, 11 Aug 2017 17:27:12 +0000"/>
                            <attachment id="28001" name="syslog.gp270808.error.gz" size="14024695" author="mhanafi" created="Tue, 15 Aug 2017 02:37:21 +0000"/>
                            <attachment id="27839" name="vmcore-dmesg.txt" size="524288" author="mhanafi" created="Wed, 26 Jul 2017 04:00:57 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzbbz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>