<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:12:40 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1015] ldiskfs corruption with large LUNs</title>
                <link>https://jira.whamcloud.com/browse/LU-1015</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We have been running IOR testing on hyperion with TOSS 5 and have seen ldiskfs corruption.  Since I know you have access to hyperion, I was hoping you could log on and look around (the console logs are on hyperion577-pub, and santricity can be run from there as well).  I have set up a test filesystem called /p/ls1, created with large LUNs (22TB per LUN, with 6 LUNs on each RBOD) on a NetApp.  The MDS is on hyperion-agb25 and the two OSS nodes are hyperion-agb27 and hyperion-agb28.  I had 10 clients writing I/O to the filesystem and would power cycle an OSS every hour to simulate a node crashing.  After bringing the OSS back up I would run the full fsck to check for errors, bring Lustre up again, and continue the I/O load from the clients.  We hit a bug where fsck shows corruption and Lustre does not mount.  As a side note, I was running the same testing in parallel on the same hardware, but with a small 3TB LUN size, and did not hit this issue.&lt;/p&gt;

&lt;p&gt;zgrep  Mounting ../conman.old/console.hyperion-agb27-20120115.gz&lt;/p&gt;

&lt;p&gt;2012-01-14 14:24:02 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0000&lt;br/&gt;
2012-01-14 14:24:04 Mounting /dev/dm-0 on /mnt/lustre/local/ls1-OST0001&lt;br/&gt;
2012-01-14 14:24:06 Mounting /dev/dm-4 on /mnt/lustre/local/ls1-OST0002&lt;br/&gt;
2012-01-14 15:21:57 Mounting local filesystems:  [  OK  ]&lt;br/&gt;
2012-01-14 15:22:03 Mounting other filesystems:  [  OK  ]&lt;br/&gt;
2012-01-14 15:24:13 Mounting /dev/dm-0 on /mnt/lustre/local/ls1-OST0000&lt;br/&gt;
2012-01-14 15:24:15 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0001&lt;br/&gt;
2012-01-14 15:24:17 Mounting /dev/dm-1 on /mnt/lustre/local/ls1-OST0002&lt;br/&gt;
2012-01-14 16:21:49 Mounting local filesystems:  [  OK  ]&lt;br/&gt;
2012-01-14 16:21:55 Mounting other filesystems:  [  OK  ]&lt;br/&gt;
2012-01-14 16:23:51 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0000&lt;br/&gt;
2012-01-14 16:23:53 Mounting /dev/dm-1 on /mnt/lustre/local/ls1-OST0001&lt;br/&gt;
2012-01-14 16:23:55 Mounting /dev/dm-5 on /mnt/lustre/local/ls1-OST0002&lt;br/&gt;
2012-01-14 17:21:53 Mounting local filesystems:  [  OK  ]&lt;br/&gt;
2012-01-14 17:21:59 Mounting other filesystems:  [  OK  ]&lt;br/&gt;
2012-01-14 18:22:00 Mounting local filesystems:  [  OK  ]&lt;br/&gt;
2012-01-14 18:22:06 Mounting other filesystems:  [  OK  ]&lt;br/&gt;
2012-01-14 19:21:56 Mounting local filesystems:  [  OK  ]&lt;br/&gt;
2012-01-14 19:22:02 Mounting other filesystems:  [  OK  ]&lt;br/&gt;
2012-01-14 20:21:52 Mounting local filesystems:  [  OK  ]&lt;br/&gt;
2012-01-14 20:21:58 Mounting other filesystems:  [  OK  ]&lt;/p&gt;

&lt;p&gt;It appears that the corruption occurred back on Saturday 1/14, after 16:20.&lt;br/&gt;
I say this because the OSTs did not come back after the power cycle on 1/14 @ 17:20.&lt;/p&gt;

&lt;p&gt;Also the following fsck results only surfaced after that power cycle:&lt;/p&gt;

&lt;p&gt;2012-01-14 17:23:48 Group descriptor 0 checksum is invalid.  FIXED.&lt;br/&gt;
2012-01-14 17:23:48 Group descriptor 1 checksum is invalid.  FIXED.&lt;br/&gt;
2012-01-14 17:23:48 Group descriptor 2 checksum is invalid.  FIXED.&lt;/p&gt;
</description>
                <environment>lustre-2.1.0-13chaos_2.6.32_220.1chaos.ch5.x86_64.x86_64&lt;br/&gt;
toss/chaos 5&lt;br/&gt;
NetApp 22TB LUNs</environment>
        <key id="12923">LU-1015</key>
            <summary>ldiskfs corruption with large LUNs</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="adilger">Andreas Dilger</assignee>
                                    <reporter username="cindyheer">cindy heer</reporter>
                        <labels>
                            <label>ldiskfs</label>
                            <label>paj</label>
                    </labels>
                <created>Wed, 18 Jan 2012 12:24:52 +0000</created>
                <updated>Mon, 11 Jun 2012 22:40:04 +0000</updated>
                            <resolved>Mon, 11 Jun 2012 22:40:04 +0000</resolved>
                                    <version>Lustre 2.3.0</version>
                    <version>Lustre 2.1.1</version>
                    <version>Lustre 2.1.2</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>12</watches>
                                                                            <comments>
                            <comment id="27108" author="pjones" created="Fri, 20 Jan 2012 11:35:01 +0000"  >&lt;p&gt;Andreas&lt;/p&gt;

&lt;p&gt;Can you please comment on this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="27131" author="adilger" created="Fri, 20 Jan 2012 14:16:46 +0000"  >&lt;p&gt;I don&apos;t have Hyperion access myself, unfortunately.&lt;/p&gt;

&lt;p&gt;Are there any ldiskfs errors on the OSS consoles between 16:23, when the OSTs last successfully mounted, and 17:21, when they apparently failed to mount?&lt;/p&gt;

&lt;p&gt;Did all of the OSTs fail in a similar manner, or just a single one?&lt;/p&gt;

&lt;p&gt;Are the group descriptor checksum error messages just the first of many (i.e. are all the group checksums invalid), or is it only for groups 0, 1, 2?&lt;/p&gt;

&lt;p&gt;Could the OST(s) be mounted after the e2fsck run fixed those checksum errors, or were there other errors/corruption that prevented the OST(s) from mounting?&lt;/p&gt;

&lt;p&gt;It looks like you are running the RHEL6.2 (220) kernel, so this should be relatively up-to-date w.r.t. upstream ext4 patches, or at least would hopefully narrow down the number of upstream kernel patches to look at.&lt;/p&gt;</comment>
                            <comment id="27141" author="cliffw" created="Fri, 20 Jan 2012 15:39:18 +0000"  >&lt;p&gt;Syslog from 13:20 to ~17:30 2012-01-14&lt;/p&gt;</comment>
                            <comment id="27145" author="cliffw" created="Fri, 20 Jan 2012 15:55:38 +0000"  >&lt;p&gt;I have access, so I took the liberty.&lt;br/&gt;
I do see these errors throughout:&lt;br/&gt;
2012-01-14 04:22:21 Buffer I/O error on device sdj, logical block 1&lt;/p&gt;

&lt;p&gt;On 2012-01-14 the first correctable pfsck errors appear at the 13:23 pfsck.&lt;br/&gt;
At the 14:23 pfsck correctable errors appear on two disks.&lt;br/&gt;
The 15:23 and 16:23 pfsck are clean.&lt;br/&gt;
At 17:23, things explode.&lt;/p&gt;

&lt;p&gt;2012-01-14 17:23:47 ls1-OST0002: recovering journal&lt;br/&gt;
2012-01-14 17:23:48 fsck.ldiskfs: Group descriptors look bad... trying backup blocks...&lt;br/&gt;
2012-01-14 17:23:48 One or more block group descriptor checksums are invalid.  Fix? yes&lt;br/&gt;
2012-01-14 17:23:48&lt;br/&gt;
Followed by a large mess of errors.&lt;br/&gt;
I don&apos;t see any odd error messages to account for this; is it possible the device was dual-mounted?&lt;br/&gt;
Is the hardware healthy? The syslog from 13:23 to the failure is attached.&lt;/p&gt;</comment>
                            <comment id="27146" author="cindyheer" created="Fri, 20 Jan 2012 16:25:54 +0000"  >&lt;p&gt;Thanks for looking around.  I&apos;m currently trying to build another test with DDN hardware as a comparison.  The device was not dual-mounted and the NetApp hardware exhibits no errors.&lt;/p&gt;</comment>
                            <comment id="27159" author="morrone" created="Fri, 20 Jan 2012 17:58:41 +0000"  >&lt;p&gt;This corruption occurs as a result of unclean shutdowns of the OSS.&lt;/p&gt;

&lt;p&gt;No, there are no errors from ldiskfs before this occurs.  The disk state is inconsistent after the unclean OSS shutdown.  If allowed to run without a full fsck, it will result in ldiskfs panicking the node.&lt;/p&gt;

&lt;p&gt;A preen fsck (fsck -p, which is the default in our init scripts) will not necessarily catch this.  If it fails to catch it and the node starts up, ldiskfs will panic the node later.&lt;/p&gt;

&lt;p&gt;No, not all OSTs hit this.  It is more-or-less random which OSTs will be corrupt after the unclean shutdown.&lt;/p&gt;

&lt;p&gt;Yes, so far fsck will fix the problems and allow ldiskfs to run without error.  But we haven&apos;t looked too hard to make sure there was no data loss.  I think sometimes I did wind up with files in lost+found.&lt;/p&gt;

&lt;p&gt;Cindy, if you aren&apos;t already, after powering off the nodes you&apos;ll want to avoid using the default &quot;fsck.ldiskfs -p&quot; and instead do &quot;fsck.ldiskfs -f -n&quot; to really find the errors, and to avoid fixing them before Whamcloud can look at them.&lt;/p&gt;</comment>
                            <comment id="27160" author="morrone" created="Fri, 20 Jan 2012 18:02:33 +0000"  >&lt;p&gt;Actually, there WAS one time that I hosed an OST so badly that I gave up and just reformatted it (I was trying to make progress on &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-874&quot; title=&quot;Client eviction on lock callback timeout &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-874&quot;&gt;&lt;del&gt;LU-874&lt;/del&gt;&lt;/a&gt; at the time).  But I&apos;m not sure of the steps involved there...it may have been something like:&lt;/p&gt;

&lt;p&gt;  1) unclean power-off&lt;br/&gt;
  2) fsck -p (fix something, but not everything?)&lt;br/&gt;
  3) start ldiskfs, run for a while&lt;br/&gt;
  4) ldiskfs panics&lt;br/&gt;
  5) reboot and manually run fsck -f -y,  something not quite right...&lt;br/&gt;
  6) give up and format&lt;/p&gt;</comment>
                            <comment id="27630" author="cindyheer" created="Mon, 30 Jan 2012 18:59:18 +0000"  >&lt;p&gt;I have restarted testing on the ls1 filesystem on hyperion (large LUN testing with NetApp) with the fsck -n in place.  I was never running fsck -p for this testing, but I did run the full fsck with -y in the past.  That has been modified.  I am also running the same testing on the ls3 filesystem on hyperion (large LUNs with DDN hardware), which has been running for about a week and has not had any failures so far.  I continue to run IOR against the filesystems while every hour I simulate an unclean shutdown on the OSS (by powering it off).&lt;/p&gt;</comment>
                            <comment id="27673" author="cindyheer" created="Tue, 31 Jan 2012 13:35:57 +0000"  >&lt;p&gt;I think I hit a bit of corruption again on the large LUN with NetApp (still no corruption exhibited with the DDN).  Here is the output, after the OSS is powered off to simulate an OSS crash.  I&apos;m running a journal-replay fsck.ldiskfs -p and then a full fsck with the -n flag:&lt;/p&gt;

&lt;p&gt;ldev fsck.ldiskfs -p %d&lt;br/&gt;
ls1-OST0001: ls1-OST0001: clean, 337/22888320 files, 26023621/5859409133 blocks&lt;br/&gt;
ls1-OST0000: ls1-OST0000 contains a file system with errors, check forced.&lt;br/&gt;
ls1-OST0002: ls1-OST0002: clean, 340/22888320 files, 24807109/5859409133 blocks&lt;br/&gt;
ls1-OST0000: ls1-OST0000: Duplicate or bad block in use!&lt;br/&gt;
ls1-OST0000: ls1-OST0000: Multiply-claimed block(s) in inode 1769473: 4747954688&lt;br/&gt;
ls1-OST0000: ls1-OST0000: Multiply-claimed block(s) in inode 1769474: 4747984896 4747984897 4747984898&lt;br/&gt;
ls1-OST0000: ls1-OST0000: Multiply-claimed block(s) in inode 1769475: 4747988992 4747990016 4747990017&lt;br/&gt;
ls1-OST0000: ls1-OST0000: Multiply-claimed block(s) in inode 1769476: 4747986944 4747987968 4747987969&lt;br/&gt;
ls1-OST0000: ls1-OST0000: Multiply-claimed block(s) in inode 18546689: 4747954688&lt;br/&gt;
ls1-OST0000: ls1-OST0000: Multiply-claimed block(s) in inode 18546690: 4747984896 4747984897 4747984898&lt;br/&gt;
ls1-OST0000: ls1-OST0000: Multiply-claimed block(s) in inode 18546691: 4747988992 4747990016 4747990017&lt;br/&gt;
ls1-OST0000: ls1-OST0000: Multiply-claimed block(s) in inode 18546692: 4747986944 4747987968 4747987969&lt;br/&gt;
ls1-OST0000: ls1-OST0000: (There are 8 inodes containing multiply-claimed blocks.)&lt;br/&gt;
ls1-OST0000: &lt;br/&gt;
ls1-OST0000: ls1-OST0000: File /lost+found/#1769473 (inode #1769473, mod time Tue Jan 31 09:39:00 2012) &lt;br/&gt;
ls1-OST0000:   has 1 multiply-claimed block(s), shared with 1 file(s):&lt;br/&gt;
ls1-OST0000: ls1-OST0000:       /CONFIGS (inode #18546689, mod time Tue Jan 31 09:39:00 2012)&lt;br/&gt;
ls1-OST0000: ls1-OST0000: &lt;br/&gt;
ls1-OST0000: &lt;br/&gt;
ls1-OST0000: ls1-OST0000: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.&lt;br/&gt;
ls1-OST0000:    (i.e., without -a or -p options)&lt;br/&gt;
ldev: Fatal: parallel command execution failed&lt;/p&gt;

&lt;p&gt;e-agb27: pfsck.ldiskfs /dev/dm-1 /dev/dm-2 /dev/dm-0 -- -f -n -v -t&lt;/p&gt;

&lt;p&gt;e-agb27: fsck 1.41.90.1chaos (14-May-2011)&lt;br/&gt;
e-agb27: fsck.ldiskfs 1.41.90.1chaos (14-May-2011)&lt;br/&gt;
e-agb27: fsck.ldiskfs 1.41.90.1chaos (14-May-2011)&lt;/p&gt;

&lt;p&gt;e-agb27: fsck.ldiskfs 1.41.90.1chaos (14-May-2011)&lt;br/&gt;
e-agb27: Pass 1: Checking inodes, blocks, and sizes&lt;/p&gt;

&lt;p&gt;e-agb27: Pass 1: Checking inodes, blocks, and sizes&lt;br/&gt;
e-agb27: Pass 1: Checking inodes, blocks, and sizes&lt;/p&gt;

&lt;p&gt;e-agb27: Pass 2: Checking directory structure&lt;/p&gt;

&lt;p&gt;e-agb27: Pass 3: Checking directory connectivity&lt;br/&gt;
e-agb27: Pass 4: Checking reference counts&lt;/p&gt;

&lt;p&gt;e-agb27: Pass 5: Checking group summary information&lt;/p&gt;

&lt;p&gt;e-agb27: Pass 2: Checking directory structure&lt;/p&gt;

&lt;p&gt;e-agb27: Pass 3: Checking directory connectivity&lt;br/&gt;
e-agb27: Pass 4: Checking reference counts&lt;/p&gt;

&lt;p&gt;e-agb27: &lt;br/&gt;
e-agb27: Running additional passes to resolve blocks claimed by more than one inode...&lt;br/&gt;
e-agb27: Pass 1B: Rescanning for multiply-claimed blocks&lt;br/&gt;
e-agb27: Multiply-claimed block(s) in inode 1769473: 4747954688&lt;br/&gt;
e-agb27: Multiply-claimed block(s) in inode 1769474: 4747984896 4747984897 4747984898&lt;br/&gt;
e-agb27: Multiply-claimed block(s) in inode 1769475: 4747988992 4747990016 4747990017&lt;br/&gt;
e-agb27: Multiply-claimed block(s) in inode 1769476: 4747986944 4747987968 4747987969&lt;br/&gt;
e-agb27: Pass 5: Checking group summary information&lt;br/&gt;
e-agb27: Multiply-claimed block(s) in inode 18546689: 4747954688&lt;br/&gt;
e-agb27: Multiply-claimed block(s) in inode 18546690: 4747984896 4747984897 4747984898&lt;br/&gt;
e-agb27: Multiply-claimed block(s) in inode 18546691: 4747988992 4747990016 4747990017&lt;br/&gt;
e-agb27: Multiply-claimed block(s) in inode 18546692: 4747986944 4747987968 4747987969&lt;br/&gt;
e-agb27: Pass 1C: Scanning directories for inodes with multiply-claimed blocks&lt;/p&gt;

&lt;p&gt;e-agb27: Pass 1D: Reconciling multiply-claimed blocks&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27:      337 inodes used (0.00%)&lt;br/&gt;
e-agb27:      131 non-contiguous files (38.9%)&lt;br/&gt;
e-agb27:        0 non-contiguous directories (0.0%)&lt;br/&gt;
e-agb27:          # of inodes with ind/dind/tind blocks: 0/0/0&lt;br/&gt;
e-agb27:          Extent depth histogram: 178/149&lt;br/&gt;
e-agb27: 26023621 blocks used (0.44%)&lt;br/&gt;
e-agb27:        0 bad blocks&lt;br/&gt;
e-agb27:       19 large files&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27:      291 regular files&lt;br/&gt;
e-agb27:       37 directories&lt;br/&gt;
e-agb27:        0 character device files&lt;br/&gt;
e-agb27:        0 block device files&lt;br/&gt;
e-agb27:        0 fifos&lt;br/&gt;
e-agb27:        0 links&lt;br/&gt;
e-agb27:        0 symbolic links (0 fast symbolic links)&lt;br/&gt;
e-agb27:        0 sockets&lt;br/&gt;
e-agb27: --------&lt;br/&gt;
e-agb27:      328 files&lt;br/&gt;
e-agb27: Memory used: 43816k/715264k (16113k/27704k), time:  7.44/ 6.65/ 0.75&lt;br/&gt;
e-agb27: I/O read: 29MB, write: 0MB, rate: 3.90MB/s&lt;/p&gt;

&lt;p&gt;e-agb27: &lt;br/&gt;
e-agb27:      340 inodes used (0.00%)&lt;br/&gt;
e-agb27:      132 non-contiguous files (38.8%)&lt;br/&gt;
e-agb27:        0 non-contiguous directories (0.0%)&lt;br/&gt;
e-agb27:          # of inodes with ind/dind/tind blocks: 0/0/0&lt;br/&gt;
e-agb27:          Extent depth histogram: 180/150&lt;br/&gt;
e-agb27: 24807109 blocks used (0.42%)&lt;br/&gt;
e-agb27:        0 bad blocks&lt;br/&gt;
e-agb27:       18 large files&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27:      294 regular files&lt;br/&gt;
e-agb27:       37 directories&lt;br/&gt;
e-agb27:        0 character device files&lt;br/&gt;
e-agb27:        0 block device files&lt;br/&gt;
e-agb27:        0 fifos&lt;br/&gt;
e-agb27:        0 links&lt;br/&gt;
e-agb27:        0 symbolic links (0 fast symbolic links)&lt;br/&gt;
e-agb27:        0 sockets&lt;br/&gt;
e-agb27: --------&lt;br/&gt;
e-agb27:      331 files&lt;br/&gt;
e-agb27: Memory used: 43816k/715264k (16113k/27704k), time: 15.80/ 6.73/ 0.76&lt;br/&gt;
e-agb27: I/O read: 30MB, write: 0MB, rate: 1.90MB/s&lt;/p&gt;



&lt;p&gt;e-agb27: (There are 8 inodes containing multiply-claimed blocks.)&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: File /lost+found/#1769473 (inode #1769473, mod time Tue Jan 31 09:39:00 2012) &lt;br/&gt;
e-agb27:   has 1 multiply-claimed block(s), shared with 1 file(s):&lt;br/&gt;
e-agb27:        /CONFIGS (inode #18546689, mod time Tue Jan 31 09:39:00 2012)&lt;br/&gt;
e-agb27: Clone multiply-claimed blocks? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Delete file? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: File /lost+found/#1769474 (inode #1769474, mod time Wed Jan 11 12:54:06 2012) &lt;br/&gt;
e-agb27:   has 3 multiply-claimed block(s), shared with 1 file(s):&lt;br/&gt;
e-agb27:        /???/mountdata (inode #18546690, mod time Wed Jan 11 12:54:06 2012)&lt;br/&gt;
e-agb27: Clone multiply-claimed blocks? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Delete file? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: File ... (inode #1769475, mod time Tue Jan 31 09:39:00 2012) &lt;br/&gt;
e-agb27:   has 3 multiply-claimed block(s), shared with 1 file(s):&lt;br/&gt;
e-agb27:        /???/ls1-OST0000 (inode #18546691, mod time Tue Jan 31 09:39:00 2012)&lt;br/&gt;
e-agb27: Clone multiply-claimed blocks? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Delete file? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: File /lost+found/#1769476 (inode #1769476, mod time Sat Jan 14 16:23:53 2012) &lt;br/&gt;
e-agb27:   has 3 multiply-claimed block(s), shared with 1 file(s):&lt;br/&gt;
e-agb27:        ... (inode #18546692, mod time Sat Jan 14 16:23:53 2012)&lt;br/&gt;
e-agb27: Clone multiply-claimed blocks? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Delete file? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: File /CONFIGS (inode #18546689, mod time Tue Jan 31 09:39:00 2012) &lt;br/&gt;
e-agb27:   has 1 multiply-claimed block(s), shared with 1 file(s):&lt;br/&gt;
e-agb27:        /lost+found/#1769473 (inode #1769473, mod time Tue Jan 31 09:39:00 2012)&lt;br/&gt;
e-agb27: Clone multiply-claimed blocks? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Delete file? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: File /???/mountdata (inode #18546690, mod time Wed Jan 11 12:54:06 2012) &lt;br/&gt;
e-agb27:   has 3 multiply-claimed block(s), shared with 1 file(s):&lt;br/&gt;
e-agb27:        /lost+found/#1769474 (inode #1769474, mod time Wed Jan 11 12:54:06 2012)&lt;br/&gt;
e-agb27: Clone multiply-claimed blocks? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Delete file? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: File /???/ls1-OST0000 (inode #18546691, mod time Tue Jan 31 09:39:00 2012) &lt;br/&gt;
e-agb27:   has 3 multiply-claimed block(s), shared with 1 file(s):&lt;br/&gt;
e-agb27:        ... (inode #1769475, mod time Tue Jan 31 09:39:00 2012)&lt;br/&gt;
e-agb27: Clone multiply-claimed blocks? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Delete file? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: File ... (inode #18546692, mod time Sat Jan 14 16:23:53 2012) &lt;br/&gt;
e-agb27:   has 3 multiply-claimed block(s), shared with 1 file(s):&lt;br/&gt;
e-agb27:        /lost+found/#1769476 (inode #1769476, mod time Sat Jan 14 16:23:53 2012)&lt;br/&gt;
e-agb27: Clone multiply-claimed blocks? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Delete file? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Pass 2: Checking directory structure&lt;br/&gt;
e-agb27: Invalid inode number for &apos;.&apos; in directory inode 1769473.&lt;br/&gt;
e-agb27: Fix? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Pass 3: Checking directory connectivity&lt;br/&gt;
e-agb27: &apos;..&apos; in /lost+found/#1769473 (1769473) is / (2), should be /lost+found (11).&lt;br/&gt;
e-agb27: Fix? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Pass 4: Checking reference counts&lt;br/&gt;
e-agb27: Inode 2 ref count is 5, should be 6.  Fix? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Inode 11 ref count is 3, should be 2.  Fix? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Inode 1769473 ref count is 2, should be 1.  Fix? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Inode 1769474 ref count is 2, should be 1.  Fix? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Unattached inode 1769475&lt;br/&gt;
e-agb27: Connect to /lost+found? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Inode 18546689 ref count is 2, should be 3.  Fix? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Inode 18546691 ref count is 1, should be 2.  Fix? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Unattached inode 18546692&lt;br/&gt;
e-agb27: Connect to /lost+found? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Unattached zero-length inode 18546693.  Clear? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Unattached inode 18546693&lt;br/&gt;
e-agb27: Connect to /lost+found? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Inode 18546695 ref count is 1, should be 2.  Fix? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Pass 5: Checking group summary information&lt;/p&gt;

&lt;p&gt;e-agb27: Block bitmap differences:  -(6385--6391) -(453017600--453017602) -(453018113--453018114) -(453018626--453018627) -453019136 -453019648 -(453020162--453020416) -(453020672--453020673) -453021187 -453021696 -(453022720--453022721)&lt;br/&gt;
e-agb27: Fix? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Free blocks count wrong for group #13825 (32768, counted=32498).&lt;br/&gt;
e-agb27: Fix? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Free blocks count wrong (5835336736, counted=5835336466).&lt;br/&gt;
e-agb27: Fix? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Inode bitmap differences:  -1769477 -1769479&lt;br/&gt;
e-agb27: Fix? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Free inodes count wrong for group #13824 (125, counted=122).&lt;br/&gt;
e-agb27: Fix? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: Free inodes count wrong (22887977, counted=22887974).&lt;br/&gt;
e-agb27: Fix? no&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: ls1-OST0000: ********** WARNING: Filesystem still has errors **********&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27:      343 inodes used (0.00%)&lt;br/&gt;
e-agb27:      135 non-contiguous files (39.4%)&lt;br/&gt;
e-agb27:        0 non-contiguous directories (0.0%)&lt;br/&gt;
e-agb27:          # of inodes with ind/dind/tind blocks: 0/0/0&lt;br/&gt;
e-agb27:          Extent depth histogram: 184/150&lt;br/&gt;
e-agb27: 24072397 blocks used (0.41%)&lt;br/&gt;
e-agb27:        0 bad blocks&lt;br/&gt;
e-agb27:       18 large files&lt;br/&gt;
e-agb27: &lt;br/&gt;
e-agb27:      297 regular files&lt;br/&gt;
e-agb27:       38 directories&lt;br/&gt;
e-agb27:        0 character device files&lt;br/&gt;
e-agb27:        0 block device files&lt;br/&gt;
e-agb27:        0 fifos&lt;br/&gt;
e-agb27:        3 links&lt;br/&gt;
e-agb27:        0 symbolic links (0 fast symbolic links)&lt;br/&gt;
e-agb27:        0 sockets&lt;br/&gt;
e-agb27: --------&lt;br/&gt;
e-agb27:      335 files&lt;br/&gt;
e-agb27: Memory used: 46616k/1430528k (16114k/30503k), time: 142.55/129.36/ 1.75&lt;br/&gt;
e-agb27: I/O read: 721MB, write: 0MB, rate: 5.06MB/s&lt;br/&gt;
e-agb27: FAILED: pfsck.ldiskfs -- -f -n -v -t: 0&lt;/p&gt;
</comment>
                            <comment id="27674" author="cindyheer" created="Tue, 31 Jan 2012 13:43:29 +0000"  >&lt;p&gt;I will leave the OSS (hyperion-agb27) down for further examination unless I hear otherwise.&lt;/p&gt;</comment>
                            <comment id="27882" author="adilger" created="Fri, 3 Feb 2012 12:24:37 +0000"  >&lt;p&gt;Cindy, thanks for posting the e2fsck output.  Looking at the blocks that are duplicate allocated:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;e-agb27: Multiply-claimed block(s) in inode 1769473: 4747954688
e-agb27: Multiply-claimed block(s) in inode 1769474: 4747984896 4747984897 4747984898
e-agb27: Multiply-claimed block(s) in inode 1769475: 4747988992 4747990016 4747990017
e-agb27: Multiply-claimed block(s) in inode 1769476: 4747986944 4747987968 4747987969
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;These are all beyond the 2^32 block limit (4747986944 = 0x11b008800), so it may be that this problem relates to overflow of 32-bit block numbers somewhere in the IO stack.  If you have the logs from the previous e2fsck runs that showed corruption, it would be useful to know whether the type of corruption is always the same or not (attaching a few e2fsck logs would be useful).  I have a suspicion, based on this one log, that it may relate to corruption of the block bitmap during the &lt;em&gt;previous&lt;/em&gt; journal recovery or e2fsck (possibly due to 2^32-block truncation), which later causes the blocks to be allocated twice.&lt;/p&gt;
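As an illustrative aside (not part of the original comment), the overflow hypothesis can be sanity-checked in a few lines of shell: each multiply-claimed block number lies above 2^32, and masking to 32 bits shows where a truncated block number would land.

```shell
# Illustrative check: the multiply-claimed block numbers from the e2fsck log
# all exceed 2^32; the low 32 bits show where a truncated write would land.
for b in 4747954688 4747984896 4747988992 4747986944; do
    printf '%u = 0x%x -> low 32 bits: %u\n' "$b" "$b" "$(( b & 0xffffffff ))"
done
```

Notably, three of the truncated values (453017600, 453019648, 453021696) appear verbatim in the block bitmap differences that e2fsck printed in the earlier log, which is consistent with 32-bit truncation somewhere in the stack.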

&lt;p&gt;Several things can be done to begin debugging this, possibly in parallel:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;Verify IO integrity for &amp;gt; 16TB LUNs&lt;/li&gt;
&lt;/ul&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;take a &amp;gt; 16TB OST out of service&lt;/li&gt;
	&lt;li&gt;mount it directly with ldiskfs&lt;/li&gt;
	&lt;li&gt;run &quot;llverfs -v -l -c 32 {mountpoint}&quot; against the mounted OST filesystem&lt;/li&gt;
	&lt;li&gt;run a full e2fsck to determine if there is any corruption present&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
	&lt;li&gt;run the current test, but dump the journal contents before mounting or running e2fsck on the filesystem&lt;/li&gt;
&lt;/ul&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;dumpe2fs $ostdev &amp;gt; $logdir/$ostdev.$(date +%Y%m%d).stats&lt;/li&gt;
	&lt;li&gt;debugfs -c -R &quot;dump &amp;lt;8&amp;gt;&quot; $ostdev &amp;gt; $logdir/$ostdev.$(date +%y%m%d).journal&lt;/li&gt;
	&lt;li&gt;debugfs -c -R &quot;logdump -a&quot; $ostdev &amp;gt; $logdir/$ostdev.$(date +%Y%m%d).logdump&lt;/li&gt;
	&lt;li&gt;e2fsck -fp $ostdev 2&amp;gt;&amp;amp;1 | tee $logdir/$ostdev.$(date +%Y%m%d).e2fsck&lt;/li&gt;
	&lt;li&gt;dumpe2fs $ostdev &amp;gt; $logdir/$ostdev.$(date +%Y%m%d).stats.post&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
	&lt;li&gt;format some OSTs to be less than 16TB and try to reproduce the problem again on the NetApp hardware&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;llverfs is a non-destructive test that will write files until the device is full and then read them back and verify the data contents.  It is intended to catch errors in the IO path (filesystem, block layer, HBA, driver, controller) related to 32-bit address truncation, but it is not very fast (may take a couple of days, depending on LUN size and IO rate).  If this finds problems, there is a lower-level (filesystem destructive) &quot;llverdev&quot; tool that will run against the underlying block device and do a full write/read/verify cycle on the device, excluding the filesystem.  I don&apos;t have high hopes for this finding a problem, but it is useful to eliminate the chance that there are obvious bugs in the IO stack.  We ran full llverdev and llverfs tests previously with a DDN SFA10kE + RHEL5 on a 128TB LUN without problems, but there may be problems with RHEL6, the driver, the controller, etc. in your environment.&lt;/p&gt;

&lt;p&gt;The debugfs/e2fsck commands are intended to catch (or at least give us some chance to find post-facto) errors in the journal replay and/or e2fsck that are incorrectly marking blocks free, which are later being reallocated.  Also, what version of e2fsprogs is e2fsck-1.41.90-1chaos based on?  I now recall after working on this bug that there may have been some 64-bit bugs fixed in e2fsck that could potentially be causing problems as well.&lt;/p&gt;</comment>
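The per-OST capture steps listed in the comment above can be sketched as a small script. The $ostdev and $logdir values are placeholders taken from the comment, and the RUN=echo default makes this a dry run that only records the commands instead of touching a device; this is a sketch, not a tested procedure.

```shell
#!/bin/sh
# Sketch of the suggested capture sequence: dump filesystem stats and the
# journal BEFORE e2fsck modifies anything, then fsck, then dump stats again.
# ostdev/logdir are placeholders.  RUN defaults to "echo" (dry run, commands
# are only printed into the log files); set RUN= (empty) to execute for real.
ostdev=${ostdev:-/dev/dm-0}
logdir=${logdir:-/tmp/lu1015-logs}
RUN=${RUN-echo}
tag=$(basename "$ostdev")
stamp=$(date +%Y%m%d)
mkdir -p "$logdir"

$RUN dumpe2fs "$ostdev"                   > "$logdir/$tag.$stamp.stats"
$RUN debugfs -c -R "dump <8>" "$ostdev"   > "$logdir/$tag.$stamp.journal"
$RUN debugfs -c -R "logdump -a" "$ostdev" > "$logdir/$tag.$stamp.logdump"
$RUN e2fsck -fp "$ostdev" 2>&1 | tee        "$logdir/$tag.$stamp.e2fsck"
$RUN dumpe2fs "$ostdev"                   > "$logdir/$tag.$stamp.stats.post"
```

Capturing the journal with debugfs before the first mount or fsck matters here, because the hypothesis is that the corruption is introduced during journal replay; once replay runs, the evidence is gone.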
                            <comment id="27962" author="marc@llnl.gov" created="Mon, 6 Feb 2012 11:30:13 +0000"  >&lt;p&gt;Andreas, thanks for the excellent analysis.  We have been running this same test to isolate the extent of the corruption.  We see this behavior on the 22TB LUNs on the NetApp hardware, but not on a smaller 3TB partition created on similar LUNs.  We also have not seen it on a DDN SFA10K with 16TB LUNs, but are reconfiguring the DDN to have LUNs &amp;gt; 16TB to see if we can reproduce it on that hardware.&lt;/p&gt;

&lt;p&gt;Whamcloud has access to Hyperion, so please coordinate with the Hyperion team to reserve some hardware, and your folks can run the tests you describe above.&lt;/p&gt;</comment>
                            <comment id="30539" author="cliffw" created="Mon, 5 Mar 2012 13:53:19 +0000"  >&lt;p&gt;Setup with 4 OSTS&lt;/p&gt;

&lt;p&gt;/dev/sda               12T   39G   12T   1% /p/osta&lt;br/&gt;
/dev/sdb               22T   39G   21T   1% /p/ostb&lt;br/&gt;
/dev/sdc               10T   39G  9.5T   1% /p/ostc&lt;br/&gt;
/dev/sdd               22T   39G   21T   1% /p/ostd&lt;/p&gt;

&lt;p&gt;Ran llverfs on /dev/sdd per Andreas&apos;s instructions; no issues.&lt;br/&gt;
Mounted all as a Lustre FS, ran IOR from 3 clients, power-cycled e-abg26 in the middle of the run,&lt;br/&gt;
and replicated the failure. Captured before/after data per Andreas&apos;s instructions; script output attached.&lt;/p&gt;</comment>
                            <comment id="30540" author="cliffw" created="Mon, 5 Mar 2012 13:57:02 +0000"  >&lt;p&gt;Multiple files for size reason&lt;/p&gt;</comment>
                            <comment id="30541" author="cliffw" created="Mon, 5 Mar 2012 13:57:41 +0000"  >&lt;p&gt;Before run&lt;/p&gt;</comment>
                            <comment id="30542" author="cliffw" created="Mon, 5 Mar 2012 13:59:45 +0000"  >&lt;p&gt;small files&lt;/p&gt;</comment>
                            <comment id="30543" author="cliffw" created="Mon, 5 Mar 2012 14:04:53 +0000"  >&lt;p&gt;before run&lt;/p&gt;</comment>
                            <comment id="30544" author="cliffw" created="Mon, 5 Mar 2012 14:06:41 +0000"  >&lt;p&gt;after fail&lt;/p&gt;</comment>
                            <comment id="30545" author="cliffw" created="Mon, 5 Mar 2012 14:07:22 +0000"  >&lt;p&gt;after fail&lt;/p&gt;</comment>
                            <comment id="30546" author="cliffw" created="Mon, 5 Mar 2012 14:07:53 +0000"  >&lt;p&gt;Ran test on all disks, only sdd showed failure (so far)&lt;/p&gt;</comment>
                            <comment id="30548" author="cliffw" created="Mon, 5 Mar 2012 16:11:44 +0000"  >&lt;p&gt;Repeated test, sdb (2nd 21TB LUN) failed this time. Seems easy to repeat, awaiting further requests.&lt;/p&gt;</comment>
                            <comment id="30607" author="adilger" created="Tue, 6 Mar 2012 09:01:07 +0000"  >&lt;p&gt;Cliff, some questions:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;presumably the 4 LUNs (12 TB, 22TB, 10TB, 22TB) are on the same node?&lt;/li&gt;
	&lt;li&gt;you are not doing failover between multiple OSS nodes, but just rebooting the node in-place?&lt;/li&gt;
	&lt;li&gt;how many times did you run the test, and how many failures?  Two tests and two failures?&lt;/li&gt;
	&lt;li&gt;this is running on the NetApp hardware for both the under 16TB and over 16TB LUNs?&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Hitting three failures on the &amp;gt; 16TB LUNs is a fairly good indication that the problem is limited to &amp;gt; 16TB LUN support, and not to the NetApp itself (assuming the &amp;lt; 16TB LUNs are also on the NetApp).  One option would be to run IOR directly on the OSS node against one of the &amp;gt; 16TB LUNs mounted with &quot;-t ldiskfs&quot; instead of &quot;-t lustre&quot;, then do a similar hard reset + e2fsck (and other debugging, which is hopefully in a script by now).  This would let us see whether the problem is in the lustre/obdfilter/fsfilt code or in the core ldiskfs code.  If this local testing still fails with ldiskfs, then it would be useful to test with ext4 to determine if the problem is in the base ext4 code.&lt;/p&gt;
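For reference, the 16TB boundary in question comes straight from arithmetic: with ldiskfs's default 4 KiB block size, a 32-bit block number can address at most 2^32 blocks. A minimal sketch of that calculation (plain Python, assumptions noted in comments; not part of the issue itself):

```python
# The 16 TiB boundary, assuming the default 4 KiB ldiskfs block size.
# Historically ext3/ldiskfs carried block numbers in 32-bit fields, so
# anything past 2^32 blocks needs 64-bit-clean code everywhere.
BLOCK_SIZE = 4096           # bytes per filesystem block (default for OSTs)
MAX_BLOCKS_32BIT = 2 ** 32  # largest count addressable by a 32-bit field

limit_bytes = MAX_BLOCKS_32BIT * BLOCK_SIZE
print(limit_bytes // 2 ** 40)  # 16 -- TiB; LUNs above this exercise 64-bit paths
```

This is why the 22TB LUNs fail while 3TB partitions on the same hardware do not: only the former force block numbers past the 32-bit range.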

&lt;p&gt;Looking at the debug logs from the first test, it does appear that the corruption is in the block bitmap above the 16TB mark, allowing writes to reallocate those blocks.  The bug may be in ldiskfs or in ext4, so starting the above testing would also help cut down the number of variables.&lt;/p&gt;</comment>
                            <comment id="30619" author="cliffw" created="Tue, 6 Mar 2012 14:04:49 +0000"  >&lt;p&gt;Per LLNL: all disks are from NetApp &#8220;E5400 RBODs&#8221;.  All 4 LUNs are attached to a single OSS node. &lt;br/&gt;
I did not do failover, as there was nothing to fail to, I performed a hard powercycle using powerman. &lt;br/&gt;
The test was run twice, the debug data script was run prior to the first test. Two tests, two failures. &lt;/p&gt;

&lt;p&gt;Again, afaik these LUNs are all furnished by the same NetApp device.&lt;/p&gt;

&lt;p&gt;I will perform the local IOR test and report results. &lt;/p&gt;</comment>
                            <comment id="30625" author="cliffw" created="Tue, 6 Mar 2012 20:02:57 +0000"  >&lt;p&gt;Ran the test using IOR on a local ldiskfs mount. Failed. Results attached.&lt;br/&gt;
Reformatted all LUNs as ext4, re-ran the test twice - no failures. The second time, I ran IOR in a loop against all four drives (in sequence) for one hour prior to the powercycle. &lt;/p&gt;</comment>
                            <comment id="30626" author="cliffw" created="Tue, 6 Mar 2012 20:04:53 +0000"  >&lt;p&gt;Failure with local ldiskfs mount.&lt;/p&gt;</comment>
                            <comment id="30627" author="cliffw" created="Tue, 6 Mar 2012 20:05:52 +0000"  >&lt;p&gt;Options for Lustre format:&lt;/p&gt;

&lt;p&gt;mkfs.lustre --reformat --ost --fsname lu1015 --mgsnode=192.168.120.25@o2ib --mkfsoptions=&apos;-t ext4 -J size=2048 -O extents -G 256 -i 69905&apos; /dev/sd$i &amp;amp;&lt;/p&gt;</comment>
                            <comment id="30629" author="adilger" created="Tue, 6 Mar 2012 20:59:18 +0000"  >&lt;p&gt;Cliff, is this running the LLNL ldiskfs RPM, or the ldiskfs from the Lustre tree?  It would be good to run vanilla ext4 powerfail tests a couple of extra times to more positively verify that the bug is not present with ext4, since we know that it is definitely still there for ldiskfs.&lt;/p&gt;</comment>
                            <comment id="30630" author="cliffw" created="Tue, 6 Mar 2012 21:35:18 +0000"  >&lt;p&gt;rpm -qa |grep disk&lt;br/&gt;
lustre-ldiskfs-3.3.0-2.6.32_220.4.2.el6_lustre.gddd1a7c.x86_64_g0203c14.x86_64&lt;br/&gt;
lustre-ldiskfs-debuginfo-3.3.0-2.6.32_220.4.2.el6_lustre.gddd1a7c.x86_64_g0203c14.x86_64&lt;/p&gt;

&lt;p&gt;I will continue running the powerfail test on ext4.&lt;/p&gt;</comment>
                            <comment id="30649" author="cliffw" created="Wed, 7 Mar 2012 13:03:16 +0000"  >&lt;p&gt;Somewhat of a head-scratcher atm. I wanted to be certain only the large disks were seeing errors when formatted as ldiskfs, so last night I ran a test writing to all four disks mounted as ldiskfs in a loop (short iterations of IOR), creating a new file on each loop. &lt;br/&gt;
I let the test run for one hour before doing the power cycle; previously I had powercycled within 10 minutes of test start.&lt;br/&gt;
First reboot - no disks! - restarted ibsrp via /etc/init.d/ibsrp - no errors found.&lt;br/&gt;
Second reboot as a test: I restarted ibsrp by hand prior to running the fsck script, and again no errors found. &lt;br/&gt;
Not sure what to make of this outcome. &lt;/p&gt;

&lt;p&gt;Today, will re-run ext4 tests, so far no errors there. &lt;/p&gt;</comment>
                            <comment id="30733" author="cliffw" created="Thu, 8 Mar 2012 16:15:57 +0000"  >&lt;p&gt;I have continued running the local (ext4 and ldiskfs) tests, but have not had a failure in two days.&lt;br/&gt;
I am uncertain what, if anything, has changed. &lt;/p&gt;</comment>
                            <comment id="30749" author="adilger" created="Fri, 9 Mar 2012 03:26:34 +0000"  >&lt;p&gt;I thought of another possible way to positively exclude the NetApp from the picture here:&lt;/p&gt;

&lt;p&gt;Use LVM to create PVs and a volume group on the 22TB LUNs, something like (from memory, please check man pages):&lt;/p&gt;

&lt;p&gt;pvcreate /dev/sdb /dev/sdd&lt;br/&gt;
vgcreate vgtest /dev/sdb /dev/sdd&lt;/p&gt;

&lt;p&gt;Create a 32TB LV that is using only the first 16TB of these two LUNs:&lt;/p&gt;

&lt;p&gt;lvcreate -n lvtest_lo -L 16T vgtest /dev/sdb&lt;br/&gt;
lvextend -L +16T /dev/vgtest/lvtest_lo /dev/sdd&lt;/p&gt;

&lt;p&gt;Create a 12TB LV that is using only the last 6TB of these two LUNs (the size may need to be massaged to consume the rest of the space on both /dev/sdb and /dev/sdd):&lt;/p&gt;

&lt;p&gt;lvcreate -n lvtest_hi -L 6T vgtest /dev/sdb&lt;br/&gt;
lvextend -L +6T /dev/vgtest/lvtest_hi /dev/sdd&lt;/p&gt;

&lt;p&gt;This creates &quot;lvtest_lo&quot; using only storage blocks of the NetApp that are exclusively below the 16TB mark, even though the ldiskfs filesystem on it is larger than 16TB.  Conversely, &quot;lvtest_hi&quot; is smaller than 16TB, but uses blocks of the NetApp above the 16TB limit.&lt;/p&gt;
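The intent of the split can be sanity-checked numerically. A sketch, assuming LVM's default 4 MiB physical extent size (not stated in the comment above):

```python
# Extent math for the proposed split, assuming LVM's default 4 MiB
# physical extent (PE) size.
PE = 4 * 2 ** 20   # bytes per physical extent
TIB = 2 ** 40

# lvtest_lo: 16 TiB taken from the start of each PV
lo_extents_per_pv = 16 * TIB // PE
print(lo_extents_per_pv)  # 4194304 -- PV extents 0..4194303, all below 16 TiB

# lvtest_hi starts at PV extent 4194304 on each PV, i.e. exactly the
# 16 TiB byte offset, so every hi block sits above the 16 TiB mark.
hi_start_offset_tib = lo_extents_per_pv * PE // TIB
print(hi_start_offset_tib)  # 16
```

So a failure on lvtest_lo implicates the software stack, and a failure on lvtest_hi implicates the storage addressing above 16 TiB.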

&lt;p&gt;Run the original Lustre IOR test against all 4 LUNs.  If &quot;lvtest_lo&quot; hits the problem, then it is positively caused by Lustre or ldiskfs or ext4.  If &quot;lvtest_hi&quot; hits the problem, then it is positively a problem in the NetApp, because ldiskfs is smaller than 16TB (which didn&apos;t hit any failure before).  If it hits on /dev/sda or /dev/sdc then we are confused (it could be either again), but I hope not.&lt;/p&gt;</comment>
                            <comment id="30768" author="cliffw" created="Fri, 9 Mar 2012 13:06:19 +0000"  >&lt;p&gt;Here&apos;s the new setup for OSS, will report. &lt;br/&gt;
 &amp;#8212; Logical volume &amp;#8212;&lt;br/&gt;
  LV Name                /dev/vglu1015/lv1015_lo&lt;br/&gt;
  VG Name                vglu1015&lt;br/&gt;
  LV UUID                sgpofs-dO8q-nGd1-yNlj-Vgah-L1NT-cph17Y&lt;br/&gt;
  LV Write Access        read/write&lt;br/&gt;
  LV Status              available&lt;/p&gt;
&lt;p&gt;  # open                 0&lt;br/&gt;
  LV Size                32.00 TiB&lt;br/&gt;
  Current LE             8388608&lt;br/&gt;
  Segments               2&lt;br/&gt;
  Allocation             inherit&lt;br/&gt;
  Read ahead sectors     auto&lt;br/&gt;
  - currently set to     256&lt;br/&gt;
  Block device           253:0&lt;/p&gt;


&lt;p&gt;  &amp;#8212; Segments &amp;#8212;&lt;br/&gt;
  Logical extent 0 to 4194303:&lt;br/&gt;
    Type                linear&lt;br/&gt;
    Physical volume     /dev/sdb&lt;br/&gt;
    Physical extents    0 to 4194303&lt;/p&gt;

&lt;p&gt;  Logical extent 4194304 to 8388607:&lt;br/&gt;
    Type                linear&lt;br/&gt;
    Physical volume     /dev/sdd&lt;br/&gt;
    Physical extents    0 to 4194303&lt;/p&gt;


&lt;p&gt;  &amp;#8212; Logical volume &amp;#8212;&lt;br/&gt;
  LV Name                /dev/vglu1015/lv1015_hi&lt;br/&gt;
  VG Name                vglu1015&lt;br/&gt;
  LV UUID                qTWfgz-fr1M-Gss2-0WmE-EJfH-0is2-bTVKGZ&lt;br/&gt;
  LV Write Access        read/write&lt;br/&gt;
  LV Status              available&lt;/p&gt;
&lt;p&gt;  # open                 0&lt;br/&gt;
  LV Size                11.60 TiB&lt;br/&gt;
  Current LE             3040872&lt;br/&gt;
  Segments               2&lt;br/&gt;
  Allocation             inherit&lt;br/&gt;
  Read ahead sectors     auto&lt;br/&gt;
  - currently set to     256&lt;br/&gt;
  Block device           253:1&lt;/p&gt;


&lt;p&gt;  &amp;#8212; Segments &amp;#8212;&lt;br/&gt;
  Logical extent 0 to 1520435:&lt;br/&gt;
    Type                linear&lt;br/&gt;
    Physical volume     /dev/sdb&lt;br/&gt;
    Physical extents    4194304 to 5714739&lt;/p&gt;

&lt;p&gt;  Logical extent 1520436 to 3040871:&lt;br/&gt;
    Type                linear&lt;br/&gt;
    Physical volume     /dev/sdd&lt;br/&gt;
    Physical extents    4194304 to 5714739&lt;/p&gt;

</comment>
                            <comment id="30783" author="cliffw" created="Fri, 9 Mar 2012 22:20:47 +0000"  >&lt;p&gt;I am not certain this is going to uncover errors, as the performance through LVM is about 1/10 of the native performance. Any ideas for tuning this setup for better numbers?&lt;/p&gt;</comment>
                            <comment id="30784" author="adilger" created="Fri, 9 Mar 2012 23:09:04 +0000"  >&lt;p&gt;That is probably due to seeking between the LUNs and the LV segments spread across them.  I didn&apos;t think the test took so long to run; just start testing and then reboot. &lt;/p&gt;

&lt;p&gt;If it needs to be faster there are less &quot;good&quot; tests that could be run. &lt;/p&gt;

&lt;p&gt;For example, run on only the hi or lo LVs at one time, alternating, and then see which one fails.  That would take twice as long, but not 10x as long.  We would need to run it several times to be confident only one config is failing. &lt;/p&gt;</comment>
                            <comment id="30797" author="cliffw" created="Sun, 11 Mar 2012 13:45:03 +0000"  >&lt;p&gt;The current issue is that none of the configs are failing; I was running longer in hopes of generating a failure. But a failure has not occurred since last Tuesday on any configuration, so I am currently quite puzzled. There have been hardware changes during this; is it possible that this was a hardware issue that has been fixed by the LLNL controller changes? &lt;/p&gt;</comment>
                            <comment id="30798" author="cliffw" created="Sun, 11 Mar 2012 14:30:34 +0000"  >&lt;p&gt;To clarify, my concern is not the length of the tests; rather, at 1/10 the IO rate we are no longer driving the hardware very hard. If this issue is related to the speed or volume of IO at the device level, we won&apos;t be able to replicate it. &lt;/p&gt;</comment>
                            <comment id="30799" author="adilger" created="Sun, 11 Mar 2012 15:35:07 +0000"  >&lt;p&gt;Running on &lt;em&gt;either&lt;/em&gt; the hi or lo LVs at one time should get the IO rate to the one LUN back to the original level. If that does not return the symptoms again, then we need to go back to the full LUN testing without LVM to verify that the problem can still be hit. &lt;/p&gt;</comment>
                            <comment id="39218" author="cliffw" created="Tue, 22 May 2012 13:46:49 +0000"  >&lt;p&gt;Returned to the simpler setup; using straight ldiskfs I was able to re-create errors on the &amp;gt;21TB OST. Data attached. &lt;br/&gt;
Will try now with local IOR/ext4 to see if problem can be isolated to ldiskfs code. &lt;/p&gt;</comment>
                            <comment id="39250" author="cliffw" created="Tue, 22 May 2012 18:30:11 +0000"  >&lt;p&gt;I have reformatted with ext4, running IOR locally, and have had one failure; results attached.&lt;/p&gt;</comment>
                            <comment id="39251" author="adilger" created="Tue, 22 May 2012 18:36:52 +0000"  >&lt;p&gt;Cliff, over the weekend there was a posting on the linux-ext4 list with an e2fsck patch that may resolve this problem.  It seems that the root of the problem is in e2fsck itself, not ldiskfs or ext4, but is only seen if there are blocks in the journal to be recovered beyond 16TB, which is why it didn&apos;t show up regularly in testing.  &lt;/p&gt;

&lt;p&gt;The posted patch is larger, since it also fixes some further 64-bit block number problems on 32-bit systems, but the gist of the patch is below.&lt;/p&gt;
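As an illustration of the failure mode the patch addresses (hypothetical tag values; the real code reads big-endian on-disk fields via be32_to_cpu): dropping the high 32 bits of a journal tag's block number makes replay write the journaled data onto an unrelated low block.

```python
# Hypothetical journal block tag for a block beyond 16 TiB (4 KiB blocks).
t_blocknr_high = 1   # high 32 bits of the 64-bit block number
t_blocknr = 12345    # low 32 bits

correct = t_blocknr_high * 2 ** 32 + t_blocknr  # what read_tag_block computes
buggy = t_blocknr                               # high bits silently dropped

print(correct)  # 4294979641 -- the block the journal actually describes
print(buggy)    # 12345 -- the block a broken replay overwrites instead
```

That stray low-block write matches the observed symptom: block bitmaps and data below the 16TB mark corrupted only after a crash plus journal recovery.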

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;From 3b693d0b03569795d04920a04a0a21e5f64ffedc Mon Sep 17 00:00:00 2001
From: Theodore Ts&apos;o &amp;lt;tytso@mit.edu&amp;gt;
Date: Mon, 21 May 2012 21:30:45 -0400
Subject: [PATCH] e2fsck: fix 64-bit journal support

64-bit journal support was broken; we weren&apos;t using the high bits from
the journal descriptor blocks in some cases!

Signed-off-by: &quot;Theodore Ts&apos;o&quot; &amp;lt;tytso@mit.edu&amp;gt;
---
e2fsck/jfs_user.h |    4 ++--
e2fsck/journal.c  |   33 +++++++++++++++++----------------
e2fsck/recovery.c |   25 ++++++++++++-------------
3 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/e2fsck/jfs_user.h b/e2fsck/jfs_user.h
index 9e33306..92f8ae2 100644
--- a/e2fsck/jfs_user.h
+++ b/e2fsck/jfs_user.h
@@ -18,7 +18,7 @@ struct buffer_head {
 	e2fsck_t	b_ctx;
 	io_channel 	b_io;
 	int	 	b_size;
-	blk_t	 	b_blocknr;
+	unsigned long long b_blocknr;
 	int	 	b_dirty;
 	int	 	b_uptodate;
 	int	 	b_err;
diff --git a/e2fsck/recovery.c b/e2fsck/recovery.c
index b669941..e94ef4e 100644
--- a/e2fsck/recovery.c
+++ b/e2fsck/recovery.c
@@ -309,7 +309,6 @@ int journal_skip_recovery(journal_t *journal)
 	return err;
 }
 
-#if 0
 static inline unsigned long long read_tag_block(int tag_bytes, journal_block_tag_t *tag)
 {
 	unsigned long long block = be32_to_cpu(tag-&amp;gt;t_blocknr);
@@ -317,7 +316,6 @@ static inline unsigned long long read_tag_block(int tag_bytes, journal_block_tag
 		block |= (__u64)be32_to_cpu(tag-&amp;gt;t_blocknr_high) &amp;lt;&amp;lt; 32;
 	return block;
 }
-#endif
 
/*
 * calc_chksums calculates the checksums for the blocks described in the
 * descriptor block.
@@ -506,7 +504,8 @@ static int do_one_pass(journal_t *journal,
 					unsigned long blocknr;
 
 					J_ASSERT(obh != NULL);
-					blocknr = be32_to_cpu(tag-&amp;gt;t_blocknr);
+					blocknr = read_tag_block(tag_bytes,
+								 tag);
 
 					/* If the block has been
 					 * revoked, then we&apos;re all done
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="39300" author="adilger" created="Wed, 23 May 2012 18:34:17 +0000"  >&lt;p&gt;Bumping priority on this for tracking.  It is a bug in e2fsprogs, not Lustre, but making it a blocker ensures it will get continuous attention.&lt;/p&gt;</comment>
                            <comment id="39766" author="adilger" created="Thu, 31 May 2012 17:05:18 +0000"  >&lt;p&gt;e2fsprogs-1.42.3.wc1 (tag v1.42.3.wc1 in git) has been built and packages are available for testing:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://build.whamcloud.com/job/e2fsprogs-master/arch=x86_64,distro=el6/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://build.whamcloud.com/job/e2fsprogs-master/arch=x86_64,distro=el6/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cliff, could you please give this a test (even better to run it in a loop) and see if it resolves the problem?&lt;/p&gt;</comment>
                            <comment id="40310" author="cliffw" created="Fri, 8 Jun 2012 19:46:59 +0000"  >&lt;p&gt;Running with latest e2fsprogs, one error recovered, logs attached.&lt;/p&gt;

&lt;p&gt;/dev/vglu1015/lv1015_hi: catastrophic mode - not reading inode or group bitmaps&lt;br/&gt;
lu1015-OST0000: recovering journal&lt;br/&gt;
lu1015-OST0000: Truncating orphaned inode 78643270 (uid=0, gid=0, mode=0100666, size=0)&lt;br/&gt;
lu1015-OST0000: Inode 78643270, i_size is 0, should be 16777216.  FIXED.&lt;br/&gt;
lu1015-OST0000: 119/182453760 files (1.7% non-contiguous), 57592447/3113852928 blocks&lt;/p&gt;</comment>
                            <comment id="40311" author="cliffw" created="Fri, 8 Jun 2012 19:56:43 +0000"  >&lt;p&gt;file is lu1015.060812.tar.gz on the FTP site&lt;/p&gt;</comment>
                            <comment id="40328" author="adilger" created="Mon, 11 Jun 2012 03:42:52 +0000"  >&lt;p&gt;Cliff, how many runs did it take to hit this error?&lt;/p&gt;

&lt;p&gt;I don&apos;t think this is related to the problem seen before.  Truncating orphan inodes on recovery is normal behaviour when a file is in the middle of being truncated at crash time.  It looks like this handling isn&apos;t tested very often and has a bug because the &quot;Truncating orphaned inode&quot; message means the inode should be truncated to size=0 bytes, but then e2fsck gets confused and detects the file size is smaller than the allocated blocks and resets the size to cover the allocated blocks.  This should be filed &amp;amp; fixed separately.&lt;/p&gt;</comment>
                            <comment id="40347" author="cliffw" created="Mon, 11 Jun 2012 10:22:37 +0000"  >&lt;p&gt;The error occured on the second run. The system ran large-lun.sh successfully prior to this.&lt;/p&gt;</comment>
                            <comment id="40404" author="adilger" created="Mon, 11 Jun 2012 22:37:44 +0000"  >&lt;p&gt;I&apos;ve been able to reproduce this bug in vanilla e2fsck, and the problem exists only for large extent-mapped files that are being truncated at the time of a crash. &lt;/p&gt;</comment>
                            <comment id="40405" author="adilger" created="Mon, 11 Jun 2012 22:40:04 +0000"  >&lt;p&gt;Problem is fixed in released e2fsprogs-1.42.3.wc1.&lt;/p&gt;
</comment>
                    </comments>
                    <attachments>
                            <attachment id="10754" name="LU1015.log.gz" size="5558015" author="cliffw" created="Fri, 20 Jan 2012 15:39:18 +0000"/>
                            <attachment id="10925" name="after.tar" size="74240" author="cliffw" created="Mon, 5 Mar 2012 14:05:34 +0000"/>
                            <attachment id="10923" name="before.tar" size="4096" author="cliffw" created="Mon, 5 Mar 2012 13:59:45 +0000"/>
                            <attachment id="10921" name="full.sdd.1021.log.gz" size="2236717" author="cliffw" created="Mon, 5 Mar 2012 13:57:02 +0000"/>
                            <attachment id="10929" name="sdb.20120305.5143.stats.gz" size="9788605" author="cliffw" created="Mon, 5 Mar 2012 16:12:39 +0000"/>
                            <attachment id="10930" name="sdb.20120305.5143.stats.post.gz" size="9788679" author="cliffw" created="Mon, 5 Mar 2012 16:13:10 +0000"/>
                            <attachment id="10931" name="sdb.20121805.1805.stats.gz" size="9901112" author="cliffw" created="Mon, 5 Mar 2012 16:13:42 +0000"/>
                            <attachment id="10932" name="sdb.20125805.5815.stats.gz" size="9902941" author="cliffw" created="Mon, 5 Mar 2012 16:16:02 +0000"/>
                            <attachment id="10933" name="sdb.20125805.5815.stats.post.gz" size="9902944" author="cliffw" created="Mon, 5 Mar 2012 16:17:51 +0000"/>
                            <attachment id="10928" name="sdb.fail.tar" size="61440" author="cliffw" created="Mon, 5 Mar 2012 16:11:57 +0000"/>
                            <attachment id="10926" name="sdd.20120305.1910.stats.gz" size="9788342" author="cliffw" created="Mon, 5 Mar 2012 14:06:41 +0000"/>
                            <attachment id="10927" name="sdd.20120305.1910.stats.post.gz" size="9786730" author="cliffw" created="Mon, 5 Mar 2012 14:07:22 +0000"/>
                            <attachment id="10935" name="sdd.20120306.1204.stats.gz" size="9788599" author="cliffw" created="Tue, 6 Mar 2012 20:04:53 +0000"/>
                            <attachment id="10936" name="sdd.20120306.1204.stats.post.gz" size="9788631" author="cliffw" created="Tue, 6 Mar 2012 20:04:53 +0000"/>
                            <attachment id="11427" name="sdd.20120522.2011.e2fsck.gz" size="183" author="cliffw" created="Tue, 22 May 2012 13:47:23 +0000"/>
                            <attachment id="11428" name="sdd.20120522.2011.journal.gz" size="46" author="cliffw" created="Tue, 22 May 2012 13:47:39 +0000"/>
                            <attachment id="11429" name="sdd.20120522.2011.logdump.gz" size="1765656" author="cliffw" created="Tue, 22 May 2012 13:47:55 +0000"/>
                            <attachment id="11430" name="sdd.20120522.2011.stats.gz" size="9810492" author="cliffw" created="Tue, 22 May 2012 13:48:41 +0000"/>
                            <attachment id="11431" name="sdd.20120522.2011.stats.post.gz" size="9813267" author="cliffw" created="Tue, 22 May 2012 13:49:44 +0000"/>
                            <attachment id="11441" name="sdd.20120522.2323.e2fsck.gz" size="282" author="cliffw" created="Tue, 22 May 2012 18:31:49 +0000"/>
                            <attachment id="11440" name="sdd.20120522.2323.journal.gz" size="46" author="cliffw" created="Tue, 22 May 2012 18:31:40 +0000"/>
                            <attachment id="11439" name="sdd.20120522.2323.logdump.gz" size="26610" author="cliffw" created="Tue, 22 May 2012 18:31:29 +0000"/>
                            <attachment id="11438" name="sdd.20120522.2323.stats.gz" size="9898659" author="cliffw" created="Tue, 22 May 2012 18:31:17 +0000"/>
                            <attachment id="11437" name="sdd.20120522.2323.stats.post.gz" size="9898587" author="cliffw" created="Tue, 22 May 2012 18:30:52 +0000"/>
                            <attachment id="10924" name="sdd.20125905.5946.stats.gz" size="9789161" author="cliffw" created="Mon, 5 Mar 2012 14:04:53 +0000"/>
                            <attachment id="10922" name="sdd.20125905.5946.stats.post.gz" size="9789170" author="cliffw" created="Mon, 5 Mar 2012 13:57:41 +0000"/>
                            <attachment id="11436" name="sdd.ext4.full.fsck.txt.gz" size="1659" author="cliffw" created="Tue, 22 May 2012 18:30:25 +0000"/>
                            <attachment id="10934" name="sdd.fail.1.tar" size="10240" author="cliffw" created="Tue, 6 Mar 2012 20:04:53 +0000"/>
                            <attachment id="11432" name="sdd.full.fsck.txt.gz" size="60667" author="cliffw" created="Tue, 22 May 2012 13:49:58 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10040" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic</customfieldname>
                        <customfieldvalues>
                                        <label>metadata</label>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzuru7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>2172</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>