<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:53:19 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5648] corrupt files contain extra data</title>
                <link>https://jira.whamcloud.com/browse/LU-5648</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Users have reported several recent cases of file corruption.  The corrupt files are larger than expected, and contain all of the original written data plus additional data at the end of the file.  The additional data appears to be valid structured user data of unknown origin. We have not found anything unusual in the console logs from clients or servers at the time the files were written.&lt;/p&gt;

&lt;p&gt;In one case, the user made a copy of a Lustre directory tree using HPSS archival storage tools, then compared the copy to the original. He found one corrupt file in the copy.  The original file size was 2,000,000 bytes, but the copy size was 2,097,152 (2 MiB).  The archive tool reported 2,000,000 bytes written.  The extra 97,152 bytes appear to be valid structured user data of unknown origin. &lt;/p&gt;

&lt;p&gt;The corrupt file sizes are not always aligned on MiB boundaries, however.  Of the cases reported so far, these are the sizes involved:&lt;/p&gt;

&lt;div class=&apos;table-wrap&apos;&gt;
&lt;table class=&apos;confluenceTable&apos;&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;Example #&lt;/th&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;Expected Size&lt;/th&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;Actual Size&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;1&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;2000000&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;2097152&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;2&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;1008829&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;2097152&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;3&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;36473&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;1053224&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;4&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;1008829&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;1441432&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
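The gap between the expected and actual sizes in the table above, and the claim about MiB alignment, can be checked with a short script (a sketch only; the size pairs are copied from the table, and 1 MiB = 1048576 bytes):

```shell
# For each expected:actual pair from the table, report the number of extra
# bytes and whether the actual size is 1 MiB-aligned.
for pair in 2000000:2097152 1008829:2097152 36473:1053224 1008829:1441432; do
  expected=${pair%%:*}
  actual=${pair##*:}
  extra=$(( actual - expected ))
  aligned=$(( actual % 1048576 == 0 ))
  echo "expected=$expected actual=$actual extra=$extra mib_aligned=$aligned"
done
```

Only the two 2097152-byte cases come out MiB-aligned, consistent with the observation below that corrupt sizes are not always aligned.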


&lt;p&gt;In Example 1, the &quot;bad data&quot; begins immediately at the end of the expected data, with no sparse area between.  Seen below with &lt;tt&gt;od -A d -a&lt;/tt&gt;, the expected data is random bytes, whereas from offset 2000000 onward the unexpected data is structured.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;                                                                         
1999840   ! esc nul del   [ dc3   +   b   h   \ can   ;   f   h dc4   9            
1999856   D   +   U   1   j   q   g   ;   7   J   r   {   &quot;   j   )   D            
1999872 enq   *   C   `   =   o   C   &amp;amp;   K   \   a   1   D   v   k  ht            
1999888   !   A   ;  ff   2   &quot;   G   i   m   9   e dle   $  si   T   )            
1999904   9 etb  nl   w bel   N  rs   R   * nul eot   o   v   p   y can            
1999920   1   4   $   c   W   l   M   D   &amp;amp;   3   U   J   B   )   t   {            
1999936   A del   s   I   M dc1 esc   w dc1  sp   g  bs dle   `  sp   A            
1999952 nak   D   %   l   1   r   1   W   % ack   !   h   0 syn   c   r            
1999968 nak   W   ;   b   h   W   Z   z   w   B stx  bs   &quot;   #   J   7            
1999984   h   o   $  em   b   V   p bel   ] dc2   o  cr   )   S del   &amp;gt;            
2000000  sp   1   1   1   9  nl   B   o   x  sp   f   r   a   c   A   A            
2000016  sp   6   4   0  sp   6   7   9  sp   4   0  sp   7   9  sp   1            
2000032   1   0   0  sp   1   1   1   9  nl   B   o   x  sp   f   r   a            
2000048   c   A   A  sp   6   8   0  sp   7   1   9  sp   4   0  sp   7            
2000064   9  sp   1   1   0   0  sp   1   1   1   9  nl   B   o   x  sp            
2000080   f   r   a   c   A   A  sp   7   2   0  sp   7   5   9  sp   4            
2000096   0  sp   7   9  sp   1   1   0   0  sp   1   1   1   9  nl   B            
2000112   o   x  sp   f   r   a   c   A   A  sp   7   6   0  sp   7   9            
2000128   9  sp   4   0  sp   7   9  sp   1   1   0   0  sp   1   1   1            
2000144   9  nl   B   o   x  sp   f   r   a   c   A   A  sp   8   0   0            
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;                                                                         

&lt;p&gt;In Examples 2-4, there is a sparse region between the expected and unexpected data, ending on a 1 MiB boundary.  Here is another &lt;tt&gt;od&lt;/tt&gt; snippet illustrating the expected, sparse, and unexpected regions for Example 2:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;                                                                         
1008768   g  nl   .   /   t   e   s   t   d   i   r   /   c   h   m   o            
1008784   d   s   t   g  nl   .   /   t   e   s   t   d   i   r   /   l            
1008800   s   t   o   r   a   g   e  nl   .   /   t   e   s   t   f   i            
1008816   l   e   3  nl   .   /   z   z   j   u   n   k  nl nul nul nul            
1008832 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul            
*                                                                                  
1048576  sp   5   9   9  nl   B   o   x  sp   f   r   a   c   A   A  sp            
1048592   2   0   0  sp   2   3   9  sp   2   0   0  sp   2   3   9  sp            
1048608   5   8   0  sp   5   9   9  nl   B   o   x  sp   f   r   a   c            
1048624   A   A  sp   2   4   0  sp   2   7   9  sp   2   0   0  sp   2            
1048640   3   9  sp   5   8   0  sp   5   9   9  nl   B   o   x  sp   f            
1048656   r   a   c   A   A  sp   2   8   0  sp   3   1   9  sp   2   0            
1048672   0  sp   2   3   9  sp   5   8   0  sp   5   9   9  nl   B   o            
1048688   x  sp   f   r   a   c   A   A  sp   3   2   0  sp   3   5   9            
1048704  sp   2   0   0  sp   2   3   9  sp   5   8   0  sp   5   9   9            
1048720  nl   B   o   x  sp   f   r   a   c   A   A  sp   3   6   0  sp            
1048736   3   9   9  sp   2   0   0  sp   2   3   9  sp   5   8   0  sp            
1048752   5   9   9  nl   B   o   x  sp   f   r   a   c   A   A  sp   4            
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;                                                                         

&lt;p&gt;In all 4 examples the corrupt data resides within the second OST object.  In examples 2-4 the file should have only one object. This feels like some bug is causing the second OST object to be doubly linked, and our users have partially overwritten another user&apos;s data.&lt;/p&gt;</description>
                <environment>lustre-2.4.2-14chaos, ZFS OSD&lt;br/&gt;
</environment>
        <key id="26681">LU-5648</key>
            <summary>corrupt files contain extra data</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="green">Oleg Drokin</assignee>
                                    <reporter username="nedbass">Ned Bass</reporter>
                        <labels>
                            <label>HB</label>
                            <label>llnl</label>
                    </labels>
                <created>Mon, 22 Sep 2014 21:06:46 +0000</created>
                <updated>Thu, 20 Oct 2022 15:28:00 +0000</updated>
                            <resolved>Fri, 23 Jan 2015 05:18:44 +0000</resolved>
                                    <version>Lustre 2.4.2</version>
                                    <fixVersion>Lustre 2.7.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>16</watches>
                                                                            <comments>
                            <comment id="94660" author="jhammond" created="Mon, 22 Sep 2014 22:07:15 +0000"  >&lt;p&gt;Hi Ned,&lt;/p&gt;

&lt;p&gt;&amp;gt; This feels like some bug is causing the second OST object to be doubly linked, and our users have partially overwritten another user&apos;s data.&lt;/p&gt;

&lt;p&gt;Based on the report I don&apos;t think that this is a necessary conclusion. It could be that this is old data from a previously unlinked file and that the object sizes were corrupted. (Or you might be correct. Or something else...) What does stat return for these files? The real size or the expected size? Are you also able to stat the OST objects? Does the output of getstripe look normal?&lt;/p&gt;

&lt;p&gt;&amp;gt;  In examples 2-4 the file should have only one object.&lt;/p&gt;

&lt;p&gt;Do you mean that the files should have stripe count 1 but instead have stripe count 2?&lt;/p&gt;</comment>
                            <comment id="94665" author="nedbass" created="Mon, 22 Sep 2014 22:54:30 +0000"  >&lt;blockquote&gt;&lt;p&gt;What does stat return for these files? The real size or the expected size?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;stat returns the real size.&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;Are you also able to stat the OST objects?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I haven&apos;t done that yet.  I&apos;ll post my findings when I do.&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;Does the output of getstripe look normal?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Yes. For example, here is the getstripe output for Example 3 above (expected size 36473).&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@oslic1:bass6]# ls -l /p/lscratchv/czapotts/divt_vulcanlac3/htar_2nd_28025.d
-rw------- 1 czapotts czapotts 1053224 Sep 20 09:20 /p/lscratchv/czapotts/divt_vulcanlac3/htar_2nd_28025.d
[root@oslic1:bass6]# lfs getstripe -v /p/lscratchv/czapotts/divt_vulcanlac3/htar_2nd_28025.d
/p/lscratchv/czapotts/divt_vulcanlac3/htar_2nd_28025.d
lmm_magic:          0x0BD10BD0
lmm_seq:            0x20001484c
lmm_object_id:      0x7b50
lmm_stripe_count:   2
lmm_stripe_size:    1048576
lmm_stripe_pattern: 1
lmm_layout_gen:     0
lmm_stripe_offset:  17
        obdidx           objid           objid           group
            17         7853811       0x77d6f3                0
            51         7697453       0x75742d                0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;blockquote&gt;&lt;p&gt;Do you mean that the files should have stripe count 1 but instead have stripe count 2?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Yes.&lt;/p&gt;</comment>
                            <comment id="94669" author="nedbass" created="Mon, 22 Sep 2014 23:17:03 +0000"  >&lt;blockquote&gt;&lt;p&gt;Are you also able to stat the OST objects?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Note the objects have different owners and timestamps.  The contents of the objects match what I see in the file from a Lustre client.&lt;br/&gt;
Object 1:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@vesta18:0]# pwd
/mnt/toss-2688/O/0
[root@vesta18:0]# ls -li d$(( 7853811 % 32))/7853811
11240968 -rw-rw-rw- 1 40161 40161 36473 Sep 20 09:20 d19/7853811
[root@vesta18:0]# zdb -dddd vesta18/fsv-ost0@toss-2688 11240968
Dataset vesta18/fsv-ost0@toss-2688 [ZPL], ID 306, cr_txg 70829659, 24.1T, 3872191 objects, rootbp DVA[0]=&amp;lt;2:aafb5c4bc00:200&amp;gt; DVA[1]=&amp;lt;0:aafcede7200:200&amp;gt; [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=70829659L/70829659P fill=3872191 cksum=1cbcf56f2f:8d6df67891c:17fb1399c1fb3:2ed1492ff40927

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
  11240968    1    16K   128K  7.00K   128K  100.00  ZFS plain file
                                        312   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED 
        dnode maxblkid: 0
        path    ???&amp;lt;object#11240968&amp;gt;
        uid     40161
        gid     40161
        atime   Sat Sep 20 09:20:12 2014
        mtime   Sat Sep 20 09:20:10 2014
        ctime   Sat Sep 20 09:20:10 2014
        crtime  Sat Sep 20 09:19:53 2014
        gen     70763960
        mode    100666
        size    36473
        parent  0
        links   1
        pflags  0
        rdev    0x0000000000000000
        SA xattrs: 168 bytes, 3 entries

                trusted.lma = \000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\363\326w\000\000\000\000\000
                trusted.fid = LH\001\000\002\000\000\000P{\000\000\000\000\000\000
                trusted.version = 8\255\314\000\024\000\000\000

Note:
# printf &apos;LH\001\000\002\000\000\000P{\000\000\000\000\000\000&apos; | hexdump -C
00000000  4c 48 01 00 02 00 00 00  50 7b 00 00 00 00 00 00  |LH......P{......|
00000010
## f_seq = 0x20001484c
## f_oid = 0x7b50
## f_stripe_index = 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
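The two manual steps above (finding the object's subdirectory and decoding trusted.fid by eye) can be scripted. This is a hedged sketch: ost_obj_dir and decode_fid are hypothetical helper names, and the fid layout (le64 f_seq, le32 f_oid, le32 f_stripe_idx) is inferred from the manual hexdump decode in this comment, not from Lustre source.

```shell
#!/bin/bash
# Objects are hashed into subdirectories d0..d31 of O/0 by objid mod 32,
# matching the "ls -li d$(( objid % 32 ))/objid" commands in this ticket.
ost_obj_dir() {
  echo "d$(( $1 % 32 ))/$1"
}

# Decode a 16-byte trusted.fid value from stdin (assumed layout, matching
# the manual decode above: le64 f_seq, le32 f_oid, le32 f_stripe_idx).
decode_fid() {
  set -- $(od -A n -t u1 -N 16)   # the 16 bytes as unsigned decimals
  local seq=0 oid=0 idx=0 i
  for i in 8 7 6 5 4 3 2 1; do seq=$(( seq * 256 + ${!i} )); done
  for i in 12 11 10 9;      do oid=$(( oid * 256 + ${!i} )); done
  for i in 16 15 14 13;     do idx=$(( idx * 256 + ${!i} )); done
  printf 'f_seq=0x%x f_oid=0x%x f_stripe_idx=%d\n' "$seq" "$oid" "$idx"
}

ost_obj_dir 7853811   # d19/7853811, as in the ls above
# Reproduces f_seq=0x20001484c / f_oid=0x7b50 from the Note above:
printf 'LH\001\000\002\000\000\000P{\000\000\000\000\000\000' | decode_fid
```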

&lt;p&gt;Object 2:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@vesta52:~]# cd /mnt/toss-2688/O/0/                                                           
[root@vesta52:0]# ls -li d$(( 7697453 % 32 ))/7697453
11554471 -rw-rw-rw- 1 44793 44793 4648 Sep 19 14:55 d13/7697453
[root@vesta52:0]# zdb -dddd vesta52/fsv-ost0@toss-2688 11554471
Dataset vesta52/fsv-ost0@toss-2688 [ZPL], ID 327, cr_txg 74769068, 26.2T, 3803450 objects, rootbp DVA[0]=&amp;lt;1:b8de3f5f000:200&amp;gt; DVA[1]=&amp;lt;2:b8e5bdf5800:200&amp;gt; [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=74769068L/74769068P fill=3803450 cksum=18f2760b51:7b49838026d:151711df02cf7:29c5f44d190860

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
  11554471    1    16K   128K     8K   128K  100.00  ZFS plain file
                                        312   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED 
        dnode maxblkid: 0
        path    ???&amp;lt;object#11554471&amp;gt;
        uid     44793
        gid     44793
        atime   Wed Dec 31 16:00:00 1969
        mtime   Fri Sep 19 14:55:24 2014
        ctime   Fri Sep 19 14:55:24 2014
        crtime  Fri Sep 19 09:12:22 2014
        gen     74681220
        mode    100666
        size    4648
        parent  0
        links   1
        pflags  0
        rdev    0x0000000000000000
        SA xattrs: 168 bytes, 3 entries

                trusted.lma = \000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000-tu\000\000\000\000\000
                trusted.fid = \177&amp;gt;\001\000\002\000\000\000r\354\000\000\000\000\000\000
                trusted.version = \032g\267\000\021\000\000\000

# printf &apos;\177&amp;gt;\001\000\002\000\000\000r\354\000\000\000\000\000\000&apos; | hexdump -C
00000000  7f 3e 01 00 02 00 00 00  72 ec 00 00 00 00 00 00  |.&amp;gt;......r.......|
00000010
## f_seq = 0x200013e7f
## f_oid = 0xec72
## f_stripe_index = 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="94671" author="jfc" created="Mon, 22 Sep 2014 23:26:39 +0000"  >&lt;p&gt;I&apos;m looking after this ticket for the time being.&lt;br/&gt;
~ jfc.&lt;/p&gt;</comment>
                            <comment id="94675" author="nedbass" created="Tue, 23 Sep 2014 00:37:10 +0000"  >&lt;blockquote&gt;&lt;p&gt;Do you mean that the files should have stripe count 1 but instead have stripe count 2?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Please disregard my comments about the stripe count. The filesystem default stripe count is 2, so that is expected.&lt;/p&gt;</comment>
                            <comment id="94678" author="nedbass" created="Tue, 23 Sep 2014 00:40:52 +0000"  >&lt;p&gt;I have another example, a corrupted plain text file where all data fits within the first object. It clearly appears that another user&apos;s data is appended to the end:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@oslic1:bass6]# orig=/p/lscratchv/czapotts/divt_vulcanlac3/htar_1st_28025/Projects/nft_v3.02.01/edvrapi/.svn/prop-base/EdvrCopyCommand2.java.svn-base
[root@oslic1:bass6]# copy=/p/lscratchv/czapotts/divt_vulcanlac3/htar_2nd_28025/Projects/nft_v3.02.01/edvrapi/.svn/prop-base/EdvrCopyCommand2.java.svn-base

[root@oslic1:bass6]# ls -l $orig
-r-------- 1 czapotts czapotts 47 Jun  4 12:24 /p/lscratchv/czapotts/divt_vulcanlac3/htar_1st_28025/Projects/nft_v3.02.01/edvrapi/.svn/prop-base/EdvrCopyCommand2.java.svn-base
[root@oslic1:bass6]# ls -l $copy
-rw------- 1 czapotts czapotts 517 Jun  4 12:24 /p/lscratchv/czapotts/divt_vulcanlac3/htar_2nd_28025/Projects/nft_v3.02.01/edvrapi/.svn/prop-base/EdvrCopyCommand2.java.svn-base

[root@oslic1:bass6]# cat $orig
K 12
svn:keywords
V 19
URL Author Date Rev
END

[root@oslic1:bass6]# cat $copy
K 12
svn:keywords
V 19
URL Author Date Rev
END
s
#MSUB -l partition=cab          # explicitly say where to run
#MSUB -l walltime=15:29:00      # maximum total CPU time
#MSUB -q pbatch                 # queue
#MSUB -A pemwork                # bank account
#MSUB -e error                  # send error to the same file as output
#MSUB -m be                     # send email when job begins/ends
#MSUB -N P0.0T415.v.00005       # job name

cd /p/lscratchd/zepeda/&amp;lt;path snipped&amp;gt;

echo start job `date`

srun -n 576 ./lmp_cab.mgpt -in input.shear &amp;gt; output

echo &quot;job ended: `date`&quot;

[root@oslic1:bass6]# lfs getstripe -v $copy
/p/lscratchv/czapotts/divt_vulcanlac3/htar_2nd_28025/Projects/nft_v3.02.01/edvrapi/.svn/prop-base/EdvrCopyCommand2.java.svn-base
lmm_magic:          0x0BD10BD0
lmm_seq:            0x20001484c
lmm_object_id:      0x4f68
lmm_stripe_count:   2
lmm_stripe_size:    1048576
lmm_stripe_pattern: 1
lmm_layout_gen:     0
lmm_stripe_offset:  77
        obdidx           objid           objid           group
            77         7754996       0x7654f4                0
            21         7699516       0x757c3c                0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;Examining the object on the server:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@vesta78:~]# zdb -dddd vesta78/fsv-ost0@a 11560703
Dataset vesta78/fsv-ost0@a [ZPL], ID 328, cr_txg 77338879, 25.4T, 3835180 objects, rootbp DVA[0]=&amp;lt;0:b630dd63600:200&amp;gt; DVA[1]=&amp;lt;1:b6271568c00:200&amp;gt; [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=77338879L/77338879P fill=3835180 cksum=1cee422d3d:97fcac4292a:1b2c5923c2473:37578ea7aa912e

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
  11560703    1    16K   128K  5.00K   128K  100.00  ZFS plain file
                                        312   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 0
        path    ???&amp;lt;object#11560703&amp;gt;
        uid     7714
        gid     7714
        atime   Sat Sep 20 09:18:16 2014
        mtime   Wed Jun  4 12:24:23 2014
        ctime   Sat Sep 20 09:18:16 2014
        crtime  Fri Sep 19 12:46:01 2014
        gen     77254223
        mode    100666
        size    517
        parent  0
        links   1
        pflags  0
        rdev    0x0000000000000000
        SA xattrs: 168 bytes, 3 entries

                trusted.lma = \000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\364Tv\000\000\000\000\000
                trusted.fid = \3558\001\000\002\000\000\000\025\307\000\000\000\000\000\000
                trusted.version = m\265\275\000\021\000\000\000
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The OST object is owned by user 7714 (zepeda), not the owner of the original and copied files (czapotts). I looked for 517-byte files in zepeda&apos;s Lustre directory, and found two whose contents match the unexpected portion of the corrupt file. Note that the file is intact; it does not contain any data from the corrupt file:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@oslic1:bass6]# cat /p/lscratchv/zepeda/&amp;lt;path snipped&amp;gt;
#!/bin/bash
#MSUB -l nodes=36               # number of nodes
#MSUB -l partition=cab          # explicitly say where to run
#MSUB -l walltime=15:29:00      # maximum total CPU time
#MSUB -q pbatch                 # queue
#MSUB -A pemwork                # bank account
#MSUB -e error                  # send error to the same file as output
#MSUB -m be                     # send email when job begins/ends
#MSUB -N P0.0T415.v.00005       # job name

cd /p/lscratchd/zepeda/&amp;lt;path snipped&amp;gt;

echo start job `date`

srun -n 576 ./lmp_cab.mgpt -in input.shear &amp;gt; output

echo &quot;job ended: `date`&quot;

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
</comment>
                            <comment id="94680" author="bobijam" created="Tue, 23 Sep 2014 01:23:10 +0000"  >&lt;p&gt;For the record, Example 3&apos;s expected size is 36473 and its real size is 1053224.&lt;/p&gt;

&lt;p&gt;Its stripe size is 1048576, obj 1 size is 36473 (the expected size), and obj 2 size is 4648; the real size comes from stripe size (1048576) + obj 2 size (4648).&lt;/p&gt;</comment>
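The size arithmetic in this comment falls out of RAID-0 striping. A hedged sketch (file_end is a hypothetical helper; stripe_size=1048576 and stripe_count=2 are taken from the getstripe output earlier in the ticket):

```shell
stripe_size=1048576
stripe_count=2

# File offset just past object k's last byte under RAID-0 striping:
# chunk number c of object k is chunk (c * stripe_count + k) of the file.
file_end() {
  local k=$1 len=$2
  if [ "$len" -eq 0 ]; then echo 0; return; fi
  local last=$(( len - 1 ))               # object offset of last byte
  local chunk=$(( last / stripe_size ))   # which chunk of this object
  local off=$(( last % stripe_size ))
  echo $(( (chunk * stripe_count + k) * stripe_size + off + 1 ))
}

# Example 3: object 0 holds 36473 bytes, object 1 holds 4648 bytes.
# The file size is the larger of the two end offsets.
e0=$(file_end 0 36473)
e1=$(file_end 1 4648)
if [ "$e1" -gt "$e0" ]; then echo "$e1"; else echo "$e0"; fi   # 1053224
```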
                            <comment id="94681" author="bobijam" created="Tue, 23 Sep 2014 01:51:09 +0000"  >&lt;p&gt;Is it possible that you have a reliable reproducer and could grab -1 debug logs during that process? It seems that the file size update has some problem there.&lt;/p&gt;</comment>
                            <comment id="94687" author="green" created="Tue, 23 Sep 2014 03:07:32 +0000"  >&lt;p&gt;Just a few notes here:&lt;br/&gt;
If you suspect a doubly linked object: given the object id, you should be able to find the other file referencing it to prove it (e.g. with an lfs find -stripe-index + getstripe combination to list all files owned by this user and see if one of them happens to use this object too).&lt;/p&gt;

&lt;p&gt;I see there&apos;s trusted.fid (and trusted.lma); I wonder if the link back at the MDS points to the same file as the one you found it in? ll_decode_filter_fid should decode it, but I guess it only works on mounted (ldiskfs?) filesystems; I am not sure if you can make it work with a zfs mount.&lt;/p&gt;
</comment>
                            <comment id="94694" author="nedbass" created="Tue, 23 Sep 2014 05:39:16 +0000"  >&lt;p&gt;Zhenyu Xu, we don&apos;t have a reliable reproducer.  The cases we know about all happened on the same day, Saturday Sept. 20.  I&apos;m not sure what was going on with our systems at that time that may have contributed, but nothing obvious turned up in the system logs.&lt;/p&gt;

&lt;p&gt;Oleg, I&apos;ll investigate those avenues tomorrow.  Regarding trusted.{lma,fid} consistency, I noticed those attributes in the two objects I looked at in &lt;a href=&quot;https://jira.hpdd.intel.com/browse/LU-5648?focusedCommentId=94669&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-94669&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;this comment&lt;/a&gt; differ from each other. Should all objects belonging to the same file have the same values for those attributes?&lt;/p&gt;</comment>
                            <comment id="94721" author="green" created="Tue, 23 Sep 2014 15:33:57 +0000"  >&lt;p&gt;The values should not be the same, since they contain the object fid and the parent fid. Only the parent fid is the same for objects belonging to the same file.&lt;/p&gt;

&lt;p&gt;Also if all problems happened on about the same time - I wonder if there was a recovery/restart event on the ost or MDT at about then? Evictions?&lt;/p&gt;</comment>
                            <comment id="94725" author="jhammond" created="Tue, 23 Sep 2014 15:57:26 +0000"  >&lt;p&gt;Ned, can you see if &lt;span class=&quot;error&quot;&gt;&amp;#91;0x200013e7f:0xec72:0x0&amp;#93;&lt;/span&gt; exists? And if so, would you report its attributes, its striping, and the output of zdb for its objects?&lt;/p&gt;</comment>
                            <comment id="94731" author="nedbass" created="Tue, 23 Sep 2014 16:50:19 +0000"  >&lt;p&gt;John, it doesn&apos;t exist.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@oslic1:bass6]# lfs fid2path /p/lscratchv/ &apos;[0x200013e7f:0xec72:0x0]&apos;
fid2path error: No such file or directory
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="94789" author="nedbass" created="Tue, 23 Sep 2014 22:57:13 +0000"  >&lt;blockquote&gt;&lt;p&gt;Also if all problems happened on about the same time - I wonder if there was a recovery/restart event on the ost or MDT at about then? Evictions?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;The MDS crashed and was rebooted on Friday Sept. 19 around 4pm; recovery was manually aborted after about 15 minutes, and 316 of 416 clients were evicted:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2014-09-19 16:15:22 LustreError: 14196:0:(mdt_handler.c:6068:mdt_iocontrol()) fsv-MDT0000: Aborting recovery for device
2014-09-19 16:15:22 Lustre: fsv-MDT0000: Aborting recovery
2014-09-19 16:15:22 Lustre: 13669:0:(ldlm_lib.c:1428:abort_req_replay_queue()) @@@ aborted:  req@ffff880fd75cf800 x1478966131560584/t0(126365419020) o36-&amp;gt;a9203ee7-d515-f8b0-489f-93e7c500d1e6@172.20.17.40@o2ib500:0/0 lens 488/0 e 6 to 0 dl 1411168533 ref 1 fl Complete:/4/ffffffff rc 0/-1
2014-09-19 16:15:22 Lustre: 13669:0:(ldlm_lib.c:1428:abort_req_replay_queue()) @@@ aborted:  req@ffff880febf24c00 x1478965666564300/t0(126365419031) o36-&amp;gt;8d2dd478-7b5b-c326-aee6-9a80b0f770d4@172.20.17.20@o2ib500:0/0 lens 488/0 e 6 to 0 dl 1411168533 ref 1 fl Complete:/4/ffffffff rc 0/-1
2014-09-19 16:15:22 LustreError: 13669:0:(ldlm_lib.c:1449:abort_lock_replay_queue()) @@@ aborted:  req@ffff881ff7a11800 x1476428833129976/t0(0) o101-&amp;gt;80581121-e767-c582-2846-f150ac18daf0@172.16.65.5@tcp:0/0 lens 328/0 e 9 to 0 dl 1411168567 ref 1 fl Complete:/40/ffffffff rc 0/-1
2014-09-19 16:15:22 LustreError: 13669:0:(ldlm_lib.c:1449:abort_lock_replay_queue()) @@@ aborted:  req@ffff880fe116f800 x1476428340347788/t0(0) o101-&amp;gt;5e8e948f-4936-d26c-98de-80bffcc595b9@172.16.65.6@tcp:0/0 lens 328/0 e 12 to 0 dl 1411168537 ref 1 fl Complete:/40/ffffffff rc 0/-1
2014-09-19 16:15:22 Lustre: fsv-MDT0000: Recovery over after 17:09, of 416 clients 100 recovered and 316 were evicted.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The earliest report we have of a corrupted file (reported this morning) was from Friday 9/19 at 16:43, shortly after the MDS was back up.  The others we know of happened on Saturday morning.  To throw out some speculation, we are running a backport of the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5039&quot; title=&quot;MDS mount hangs on orphan recovery&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5039&quot;&gt;&lt;del&gt;LU-5039&lt;/del&gt;&lt;/a&gt; patch to do orphan cleanup asynchronously.  Perhaps there could be a race where orphaned objects get reused.  We have seen it take several hours to complete orphan cleanup, so it&apos;s possible that it was still running during the time when all the corrupt files were written. Just another angle to consider.&lt;/p&gt;</comment>
                            <comment id="94792" author="nedbass" created="Wed, 24 Sep 2014 00:34:37 +0000"  >&lt;p&gt;I have observed a correlation between OST object ownership and object creation time. Unexpected ownership indicates an object which may contain extra data belonging to another user.  I systematically dumped zdb output for all objects belonging to a list of ~4000 corrupt files (all created on morning of Saturday 9/20 by the same untar process).  Many of those objects are owned by some non-root user other than the file owner.  Each such object with a bad owner, without exception, has a creation time on Friday 9/19 before the MDS was rebooted (not all the same time, but within a few hours of each other).  Objects owned by the expected user have a mixture of creation times, some from before the MDS reboot, but others have a creation time that matches when the file was created.&lt;/p&gt;</comment>
                            <comment id="94808" author="green" created="Wed, 24 Sep 2014 04:27:28 +0000"  >&lt;p&gt;I am less certain about the LU-5039 patch&apos;s impact here, as all affected files would already be deleted either way.&lt;/p&gt;

&lt;p&gt;On the other hand, if you have corruption in precreate accounting (last_id) such that the MDS would try to reuse objects already allocated on OSTs, that would likely produce an effect similar to what you are seeing.&lt;/p&gt;

&lt;p&gt;I am still interested if there are any objects claimed by multiple files at the same time. I take it these objects are spread all across multiple OSTs, not just one?&lt;/p&gt;

&lt;p&gt;Why did the MDS crash, I wonder? I remember at one time LLNL was running MDS disk backends in some dangerous caching mode that needed a full fsck after every controller crash, though that was in ldiskfs days and such. Hopefully none of that is still true (and zfs protects everything tightly)?&lt;/p&gt;</comment>
                            <comment id="94830" author="nedbass" created="Wed, 24 Sep 2014 15:03:21 +0000"  >&lt;p&gt;Oleg, I haven&apos;t found any evidence of multiply linked objects yet, but I haven&apos;t done an exhaustive search.&lt;/p&gt;

&lt;p&gt;All of the objects are spread across multiple OSTs.&lt;/p&gt;

&lt;p&gt;I think the MDS crashed because of some sort of dm-multipath driver lockup, and the node eventually OOM-ed. No, we&apos;re no longer using a dangerous caching mode, and we have integrity checking from ZFS.&lt;/p&gt;</comment>
                            <comment id="94840" author="green" created="Wed, 24 Sep 2014 15:47:37 +0000"  >&lt;p&gt;Additionally, are the MDS and OSTs x86 nodes? I think PPC is only used on the clients?&lt;/p&gt;</comment>
                            <comment id="94846" author="nedbass" created="Wed, 24 Sep 2014 16:16:37 +0000"  >&lt;p&gt;Yes.&lt;/p&gt;</comment>
                            <comment id="94911" author="nedbass" created="Wed, 24 Sep 2014 23:08:36 +0000"  >&lt;p&gt;This subset of OSTs all logged &apos;deleting orphan object&apos; messages following the MDS reboot on Friday 9/19:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;vesta[2-4,6,8-12,14-16,18,21-22,24-27,29-30,32-37,40-45,49-50,53-58,63-64,66-68,70,72,75,77,81-82,85,91,93-96]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This subset of OSTs have objects owned by a non-root user who is &lt;em&gt;not&lt;/em&gt; the owner of the Lustre file:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;vesta[1,5,7,13,17,19-20,23,28,31,38-39,46-48,51-52,59-62,65,69,71,73-74,76,78-80,83-84,86-90,92]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that the sets do not intersect and together comprise all OSS nodes in vesta.  So it seems that the OSTs owning the bad objects did not complete orphan cleanup for some reason.&lt;/p&gt;</comment>
                            <comment id="94943" author="green" created="Thu, 25 Sep 2014 06:58:24 +0000"  >&lt;p&gt;hm, to me this sounds like last_id corruption/update miss on the MDS side.&lt;/p&gt;

&lt;p&gt;Potential path (as a speculation at this point):&lt;br/&gt;
MDS has trouble writing updates to disk, but keeps creating files (in memory), requesting OST objects and such. Clients write the files.&lt;br/&gt;
Eventually MDS crashes, lots of actual writes are lost on MDS, but objects on OSTs are populated.&lt;br/&gt;
Recovery is aborted (so the files are not recreated)&lt;br/&gt;
MDS sends a delete-orphan request to the OST, and in ofd_create() we have this:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;                if (-diff &amp;gt; OST_MAX_PRECREATE) {
                        /* FIXME: should reset precreate_next_id on MDS */
                        rc = 0;
                } else if (diff &amp;lt; 0) {
                        rc = ofd_orphans_destroy(env, exp, ofd, oa);
                        oseq-&amp;gt;os_destroys_in_progress = 0;
                } else {
                        /* XXX: Used by MDS for the first time! */
                        oseq-&amp;gt;os_destroys_in_progress = 0;
                }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Basically my reading of that is: if the MDS created way too many objects, we hit the first condition (-diff &amp;gt; OST_MAX_PRECREATE) and do nothing. There&apos;s a comment about readjustment, but I don&apos;t really see how it happens, unlike the orphan cleanup case where we actively skip used objects with ostid_set_id().&lt;br/&gt;
The objects are not destroyed.&lt;br/&gt;
Not yet clear how, but if the OST then starts precreating from the point the MDS thinks is the starting point, we&apos;ll reallocate all those objects, and the creating function ofd_precreate_objects() happily accepts already-existing objects:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;                if (unlikely(ofd_object_exists(fo))) {
                        /* object may exist being re-created by write replay */
                        CDEBUG(D_INODE, &quot;object &quot;LPX64&quot;/&quot;LPX64&quot; exists: &quot;
                               DFID&quot;\n&quot;, ostid_seq(&amp;amp;oseq-&amp;gt;os_oi), id,
                               PFID(&amp;amp;info-&amp;gt;fti_fid));
                        continue;
                }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Since clients already wrote to those objects previously, stale data would be present there. But replay on the MDS did not happen, so they might actually be attached to files other than the new ones.&lt;br/&gt;
Did OSTs (or some of them) restart as well at around that time?&lt;/p&gt;

&lt;p&gt;Also, about the mismatched object ownership - do those objects happen to be in sequentially numbered clusters?&lt;/p&gt;</comment>
                            <comment id="94973" author="nedbass" created="Thu, 25 Sep 2014 16:50:28 +0000"  >&lt;blockquote&gt;&lt;p&gt;Did OSTs (or some of them) restart as well at around then?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I&apos;ll double-check, but I don&apos;t think so.&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;Also, about the mismatched object ownership - do those objects happen to be in sequentially numbered clusters?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;It appears so, but it&apos;s hard to say for sure because the &lt;tt&gt;trusted.fid&lt;/tt&gt; can&apos;t be trusted (ha ha) so we can&apos;t easily go from OST object ID to Lustre file and determine the expected owner.  My data is based on a list of about 4000 corrupt files provided by a user who untarred a tar file. So there were gaps between the object IDs, but they were close together. Also, as I mentioned above, they all have a ZFS creation timestamp from Friday morning to afternoon before the MDS reboot.&lt;/p&gt;</comment>
                            <comment id="94996" author="nedbass" created="Thu, 25 Sep 2014 19:40:57 +0000"  >&lt;p&gt;Oleg, that certainly seems like a plausible scenario.  In particular, &lt;tt&gt;ofd_create()&lt;/tt&gt; here assumes that the MDS can&apos;t lose more than &lt;tt&gt;OST_MAX_PRECREATE&lt;/tt&gt; object IDs after a crash. &lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;                if (-diff &amp;gt; OST_MAX_PRECREATE) {
                        /* FIXME: should reset precreate_next_id on MDS */
                        rc = 0;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt; But I think it &lt;em&gt;is&lt;/em&gt; possible for more than one lastid update to happen within the same ZFS transaction group on the MDS. Suppose there were two lastid updates in the open transaction group when the MDS crashes. Then it could be up to &lt;tt&gt;2 * OST_MAX_PRECREATE&lt;/tt&gt; behind the OST on restart, and we would execute the code above and skip orphan cleanup. To prevent this, maybe OSP should not issue more precreate RPCs until the most recent update is stable on disk.&lt;/p&gt;
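&lt;p&gt;&lt;em&gt;[Editorial note]&lt;/em&gt; A minimal sketch of the arithmetic above (illustrative Python, not Lustre code; the constant value and last_id numbers are made up):&lt;/p&gt;

```python
# Sketch of the last_id accounting described above (illustrative only;
# the cap value and last_id numbers here are made up, not from a real system).
OST_MAX_PRECREATE = 20_000  # hypothetical cap, as in the quoted check

def skips_orphan_cleanup(mds_last_id: int, ost_last_id: int) -> bool:
    """Mirror the quoted ofd_create() check: diff = mds_last_id - ost_last_id.
    If the MDS is more than OST_MAX_PRECREATE behind the OST, the first
    branch is taken and orphan cleanup is silently skipped (rc = 0)."""
    diff = mds_last_id - ost_last_id
    return -diff > OST_MAX_PRECREATE

# One lost last_id update: the MDS is at most OST_MAX_PRECREATE behind,
# so the destroy-orphans branch still runs.
assert not skips_orphan_cleanup(100_000, 100_000 + OST_MAX_PRECREATE)

# Two updates lost in the same crashed transaction group: the MDS can be up
# to 2 * OST_MAX_PRECREATE behind, the check trips, and orphans survive.
assert skips_orphan_cleanup(100_000, 100_000 + 2 * OST_MAX_PRECREATE)
```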

&lt;p&gt;Also, do you know why the FIXME comment was never implemented?&lt;/p&gt;</comment>
                            <comment id="95002" author="green" created="Thu, 25 Sep 2014 23:17:59 +0000"  >&lt;p&gt;I don&apos;t know why it was never implemented.&lt;/p&gt;

&lt;p&gt;Just that lack of reset should not be too bad in itself; that would just break MDS precreate requests, I think. But the OST always seems to be precreating from ofd_seq_last_oid() anyway, and that one is not supposed to go backwards. So something else must have happened on the OST to also reset the seq last_id. Even if we jumped to the next sequence, last_id there is stored separately, so there should be no conflict.&lt;/p&gt;

&lt;p&gt;Hm, actually in ofd_create in that case the diff for creates is always below 0 and we skip the create, but I am not seeing how we communicate the error back to the MDS, so possibly it thinks we did allocate something up to where it was asked, and just reuses those objects (the default rc value is 0, which certainly helps that)? I&apos;ll try to construct a test case to verify this theory.&lt;/p&gt;

&lt;p&gt;As for sequences of objects, when you see those files owned by the wrong non-root user - are they named sequentially? (I think even on ZFS we still use the id as the filename.)&lt;/p&gt;</comment>
                            <comment id="95003" author="nedbass" created="Thu, 25 Sep 2014 23:53:59 +0000"  >&lt;blockquote&gt;&lt;p&gt;As for sequences of objects, when you see those files owned by wrong non-root user - are they named sequentially?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Not exactly, the objects with a wrong owner are interleaved with objects owned by root or the expected owner.  For example, some of the following sequential objects on a given OST are linked to files reported as corrupt.  All of those &lt;em&gt;not&lt;/em&gt; owned by uid 40161 or root are associated with corrupt files.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;11541167 -rw-rw-rw- 1 40161 40161 1294675 May  6 09:47 O/0/d27/7756091
11541168 -rwSrwSrw- 1 root root 0 May  6 09:47 O/0/d28/7756092
11541170 -rwSrwSrw- 1 root root 0 May  6 09:47 O/0/d29/7756093
11541171 -rw-rw-rw- 1 40161 40161 2097152 May  6 09:47 O/0/d30/7756094
11541172 -rw-rw-rw- 1 40161 40161 11682 May  6 09:46 O/0/d31/7756095
11541173 -rwSrwSrw- 1 root root 0 May  6 09:46 O/0/d0/7756096
11541174 -rw-rw-rw- 1 36782 36782 1048576 May  6 09:46 O/0/d1/7756097
11541175 -rw-rw-rw- 1 40161 40161 5061 May  6 09:46 O/0/d2/7756098
11541176 -rwSrwSrw- 1 root root 0 May  6 09:46 O/0/d3/7756099
11541177 -rw-rw-rw- 1 36782 36782 1294675 May  6 09:46 O/0/d4/7756100
11541178 -rwSrwSrw- 1 root root 0 May  6 09:46 O/0/d5/7756101
11541179 -rw-rw-rw- 1 36782 36782 1294675 May  6 09:46 O/0/d6/7756102
11541180 -rwSrwSrw- 1 root root 0 May  6 09:46 O/0/d7/7756103
11541181 -rw-rw-rw- 1 40161 40161 120 May  6 09:46 O/0/d8/7756104
11541182 -rw-rw-rw- 1 40161 40161 47 May  6 09:46 O/0/d9/7756105
11541183 -rw-rw-rw- 1 40161 40161 47 May  6 09:46 O/0/d10/7756106
11541184 -rw-rw-rw- 1 36782 36782 1048576 May  6 09:46 O/0/d11/7756107
11541185 -rw-rw-rw- 1 40161 40161 5470 May  6 09:46 O/0/d12/7756108
11541186 -rw-rw-rw- 1 40161 40161 1048576 May  6 09:46 O/0/d13/7756109
11541187 -rw-rw-rw- 1 40161 40161 6586 May  6 09:46 O/0/d14/7756110
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="95008" author="green" created="Fri, 26 Sep 2014 01:13:28 +0000"  >&lt;p&gt;the root-owned object means no write to it has happened yet (the special permissions mask tells us this; also the size is 0, of course).&lt;br/&gt;
The correct owner could be due to a later chown? Or because the previous allocation did not write to the file, while the new one did.&lt;/p&gt;</comment>
                            <comment id="95011" author="green" created="Fri, 26 Sep 2014 01:55:40 +0000"  >&lt;p&gt;Well, it seems I have a brute-force reproducer that I believe is the same as what happened to you. It also seems to imply that my reasoning for the bug cause is correct.&lt;/p&gt;

&lt;p&gt;To cause the bug I had this patch applied (this is against your chaos tree, but will probably work with master too, except a different ofd file will need to be patched):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;diff --git a/lustre/ofd/ofd_obd.c b/lustre/ofd/ofd_obd.c
index 7e21806..a0e2729 100644
--- a/lustre/ofd/ofd_obd.c
+++ b/lustre/ofd/ofd_obd.c
@@ -1235,7 +1235,7 @@ int ofd_create(const struct lu_env *env, struct obd_export *exp,
 		diff = ostid_id(&amp;amp;oa-&amp;gt;o_oi) - ofd_seq_last_oid(oseq);
 		CDEBUG(D_HA, &quot;ofd_last_id() = &quot;LPU64&quot; -&amp;gt; diff = %d\n&quot;,
 			ofd_seq_last_oid(oseq), diff);
-		if (-diff &amp;gt; OST_MAX_PRECREATE) {
+		if (-diff &amp;gt; 2 /*OST_MAX_PRECREATE*/) {
 			/* FIXME: should reset precreate_next_id on MDS */
 			rc = 0;
 		} else if (diff &amp;lt; 0) {
diff --git a/lustre/tests/replay-single.sh b/lustre/tests/replay-single.sh
index ed2531d..a24a446 100755
--- a/lustre/tests/replay-single.sh
+++ b/lustre/tests/replay-single.sh
@@ -678,6 +678,19 @@ test_32() {
 }
 run_test 32 &quot;close() notices client eviction; close() after client eviction&quot;
 
+test_1234() {
+mkdir $DIR/dir1234
+mkdir $DIR/dir1234-2
+touch $DIR/dir1234/0
+replay_barrier $SINGLEMDS
+for i in `seq 1000` ; do echo aaaa$i &amp;gt; $DIR/dir1234/file$i ; done
+fail_abort $SINGLEMDS
+for i in `seq 1000` ; do touch $DIR/dir1234-2/file$i ; done
+test -s $DIR/dir1234-2/file20 &amp;amp;&amp;amp; return 5
+exit
+}
+run_test 1234 &quot;test something&quot;
+
 test_33a() {
     createmany -o $DIR/$tfile-%d 10
     replay_barrier_nosync $SINGLEMDS
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After that run e.g. &lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;REFORMAT=yes ONLY=1234 sh replay-single.sh
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It&apos;ll fail with some nonsensical error message about the first df failing that I don&apos;t want to dig into. The most important part is: when you do ls -l /mnt/lustre/dir1234-2 you&apos;ll see that every other file is non-zero size! (Why every other? I have no idea at this point, as both OSTs should have entered this condition.)&lt;br/&gt;
When catting the files we also see stale garbage there.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@centos6-0 ~]# ls -l /mnt/lustre/dir1234-2/
...
-rw-r--r-- 1 root root 8 Sep 25 21:48 file979
-rw-r--r-- 1 root root 0 Sep 25 21:48 file98
-rw-r--r-- 1 root root 0 Sep 25 21:48 file980
-rw-r--r-- 1 root root 8 Sep 25 21:48 file981
-rw-r--r-- 1 root root 0 Sep 25 21:48 file982
-rw-r--r-- 1 root root 8 Sep 25 21:48 file983
-rw-r--r-- 1 root root 0 Sep 25 21:48 file984
-rw-r--r-- 1 root root 8 Sep 25 21:48 file985
-rw-r--r-- 1 root root 0 Sep 25 21:48 file986
-rw-r--r-- 1 root root 8 Sep 25 21:48 file987
-rw-r--r-- 1 root root 0 Sep 25 21:48 file988
-rw-r--r-- 1 root root 8 Sep 25 21:48 file989
-rw-r--r-- 1 root root 8 Sep 25 21:48 file99
-rw-r--r-- 1 root root 0 Sep 25 21:48 file990
-rw-r--r-- 1 root root 8 Sep 25 21:48 file991
-rw-r--r-- 1 root root 0 Sep 25 21:48 file992
-rw-r--r-- 1 root root 8 Sep 25 21:48 file993
-rw-r--r-- 1 root root 0 Sep 25 21:48 file994
-rw-r--r-- 1 root root 8 Sep 25 21:48 file995
-rw-r--r-- 1 root root 0 Sep 25 21:48 file996
-rw-r--r-- 1 root root 8 Sep 25 21:48 file997
-rw-r--r-- 1 root root 0 Sep 25 21:48 file998
-rw-r--r-- 1 root root 9 Sep 25 21:48 file999
[root@centos6-0 ~]# cat /mnt/lustre/dir1234-2/file989
aaaa990
[root@centos6-0 ~]# cat /mnt/lustre/dir1234-2/file999
aaaa1000
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="95013" author="green" created="Fri, 26 Sep 2014 02:52:34 +0000"  >&lt;p&gt;Ok, I have two patches (against master, you can trivially apply them to your 2.4 tree by changing the file name in the patch from ofd_dev.c to ofd_obd.c, the code is the same):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/12067&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12067&lt;/a&gt; - This one will let the MDS know that it needs to advance its last_id if we do not want to honor the delorphan request because it&apos;s way too far in the past (note that this, even before the patch, also leads to leaking those objects, though lfsck should hopefully find and fix that). This should fix your problem at hand.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/12068&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12068&lt;/a&gt; - This one fixes the other uncovered problem - should the MDS last_id somehow become corrupted and point into the &quot;past&quot;, we need to reject such precreate requests instead of honoring them and thereby promoting double usage of objects and other inconsistencies. I am open to debate on this one if somebody sees a better way of addressing it.&lt;/p&gt;</comment>
                            <comment id="95021" author="bzzz" created="Fri, 26 Sep 2014 07:51:21 +0000"  >&lt;p&gt;sounds like lots of the clients were evicted at some point? that&apos;s the only way to get a very old last used id after recovery (except an internal issue with ZFS, which I don&apos;t think is the case here).&lt;/p&gt;</comment>
                            <comment id="95038" author="green" created="Fri, 26 Sep 2014 13:21:49 +0000"  >&lt;p&gt;The recovery was aborted, so yes, clients were evicted.&lt;/p&gt;</comment>
                            <comment id="95101" author="green" created="Fri, 26 Sep 2014 23:31:08 +0000"  >&lt;p&gt;I was wrong about the direct applicability of the master patches to b2_4, btw.&lt;/p&gt;

&lt;p&gt;Here are two b2_5 versions of the same patches that are directly cherry-pickable to b2_4 without any changes (I tried):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/12081&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12081&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/12082&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12082&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="95108" author="nedbass" created="Fri, 26 Sep 2014 23:59:26 +0000"  >&lt;p&gt;Thanks Oleg, nice work.  I verified the patches using your reproducer.  It would be great to get a test case based on your reproducer into the Lustre regression test suite.&lt;/p&gt;</comment>
                            <comment id="95109" author="adilger" created="Sat, 27 Sep 2014 00:08:31 +0000"  >&lt;p&gt;I think it is worthwhile to mention that LLNL is running the &quot;debug&quot; patch &lt;a href=&quot;http://review.whamcloud.com/7761&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7761&lt;/a&gt; on their production systems with osd_txg_sync_delay_us=-1, which essentially disables sync operations on the MDS.  That means any operations which the MDS or OSS try to sync for consistency reasons are not actually committed to disk before replying to the client.  I haven&apos;t looked at the code specifically, but I can imagine lots of places where distributed consistency can be lost if Lustre thinks some updates are safe on disk, but the OSD is only caching this in RAM and the server crashes.&lt;/p&gt;</comment>
                            <comment id="95114" author="green" created="Sat, 27 Sep 2014 01:17:27 +0000"  >&lt;p&gt;oh! I did not realize that.&lt;/p&gt;

&lt;p&gt;I have been having second thoughts about what would happen if we have sequence advancement (less of a concern now with 48 bits, more of a concern with DNE where it&apos;s only 32 bits).&lt;br/&gt;
I was kind of expecting that all the sync logic would make sure this cannot happen, but...&lt;br/&gt;
If sync is skipped and we also get into the next sequence and allocate some objects there, then the MDS crashes and comes back with the old sequence - sure, we&apos;ll advance our pointer on reconnect to the end of the old sequence, and then the next sequence will be proposed, but numbering there would not start from 0, so we&apos;ll end up in the same situation (detected by the second patch, but only preventing corruption, not actually allowing any allocations).&lt;br/&gt;
I wonder how we should best handle that?&lt;/p&gt;</comment>
                            <comment id="95115" author="nedbass" created="Sat, 27 Sep 2014 01:42:01 +0000"  >&lt;blockquote&gt;&lt;p&gt;with osd_txg_sync_delay_us=-1, which essentially disables sync operations on the MDS.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;To clarify, -1 is the default value which preserves the safe full sync behavior.&lt;/p&gt;

&lt;p&gt;We run our MDS with &lt;tt&gt;osd_txg_sync_delay_us=-1&lt;/tt&gt; and &lt;tt&gt;osd_object_sync_delay_us=0&lt;/tt&gt;.  This mainly prevents client fsyncs from forcing a pool-wide sync.  &lt;tt&gt;dt_trans_stop()&lt;/tt&gt; still safely honors the sync behavior on our systems, so operations using that interface will be consistent after a crash.&lt;/p&gt;</comment>
                            <comment id="98807" author="adilger" created="Mon, 10 Nov 2014 20:42:42 +0000"  >&lt;p&gt;I&apos;m wondering if osd-zfs OSTs may be more susceptible to this problem than osd-ldiskfs.  With osd-ldiskfs, whenever there are writes to precreated objects they will trigger a journal commit, but with osd-zfs there may be a large number of writes in cache without committing the TXG.  While it is possible to hit a similar scenario on osd-ldiskfs OSTs by just precreating a large number of empty objects and not forcing a commit, this wouldn&apos;t actually lead to data corruption, since unwritten objects are still anonymous.&lt;/p&gt;

&lt;p&gt;As mentioned in &lt;a href=&quot;http://review.whamcloud.com/12067&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12067&lt;/a&gt; I think it makes sense to track the number of uncommitted precreates with a commit callback, and if this grows too large (&lt;tt&gt;OST_MAX_PRECREATE / 2&lt;/tt&gt; or so) it should &lt;em&gt;start&lt;/em&gt; a transaction commit.  I expect this wouldn&apos;t happen very often, so a precreate operation shouldn&apos;t be marked sync unless it really needs to be.&lt;/p&gt;</comment>
                            <comment id="102246" author="gerrit" created="Tue, 23 Dec 2014 12:44:36 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/12067/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12067/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5648&quot; title=&quot;corrupt files contain extra data&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5648&quot;&gt;&lt;del&gt;LU-5648&lt;/del&gt;&lt;/a&gt; ofd: In destroy orphan case always let MDS know last id&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: da5773e2b498a4edacc26fbf610d0b7628818d93&lt;/p&gt;</comment>
                            <comment id="102355" author="jlevi" created="Mon, 29 Dec 2014 14:18:43 +0000"  >&lt;p&gt;Patch landed to Master.&lt;/p&gt;</comment>
                            <comment id="102477" author="pjones" created="Fri, 2 Jan 2015 19:09:31 +0000"  >&lt;p&gt;Does the second patch need to land also? &lt;a href=&quot;http://review.whamcloud.com/#/c/12068/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/12068/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="102478" author="pjones" created="Fri, 2 Jan 2015 19:18:10 +0000"  >&lt;p&gt;Oleg confirms that it does&lt;/p&gt;</comment>
                            <comment id="102790" author="jlevi" created="Wed, 7 Jan 2015 19:04:58 +0000"  >&lt;p&gt;Emoly,&lt;br/&gt;
Could you please complete the test based on Oleg&apos;s comments found here:  &lt;a href=&quot;https://jira.hpdd.intel.com/browse/LU-5648?focusedCommentId=95011&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-95011&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://jira.hpdd.intel.com/browse/LU-5648?focusedCommentId=95011&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-95011&lt;/a&gt;&lt;br/&gt;
Thank you!&lt;/p&gt;</comment>
                            <comment id="102839" author="gerrit" created="Thu, 8 Jan 2015 05:16:35 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/12068/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12068/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5648&quot; title=&quot;corrupt files contain extra data&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5648&quot;&gt;&lt;del&gt;LU-5648&lt;/del&gt;&lt;/a&gt; ofd: Reject precreate requests below last_id&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 65a17be2d9e1acdec5d9aebaed007d4cb6d0ca11&lt;/p&gt;</comment>
                            <comment id="102982" author="gerrit" created="Fri, 9 Jan 2015 10:09:08 +0000"  >&lt;p&gt;Emoly Liu (emoly.liu@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/13309&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13309&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5648&quot; title=&quot;corrupt files contain extra data&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5648&quot;&gt;&lt;del&gt;LU-5648&lt;/del&gt;&lt;/a&gt; test: add test_101 in replay-single.sh&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 3e3939795114db6639153139935a5cd12d3b0cdd&lt;/p&gt;</comment>
                            <comment id="102993" author="jlevi" created="Fri, 9 Jan 2015 13:35:11 +0000"  >&lt;p&gt;Lowering priority for adding test.&lt;/p&gt;</comment>
                            <comment id="104451" author="gerrit" created="Fri, 23 Jan 2015 01:52:15 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/13309/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13309/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5648&quot; title=&quot;corrupt files contain extra data&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5648&quot;&gt;&lt;del&gt;LU-5648&lt;/del&gt;&lt;/a&gt; tests: a new test to avoid reallocated object IDs&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 5a914ccee212c8bf4920145a6b81fad83688e1a0&lt;/p&gt;</comment>
                            <comment id="104470" author="pjones" created="Fri, 23 Jan 2015 05:18:44 +0000"  >&lt;p&gt;Landed for 2.7&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                                        </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="66076">LU-15009</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwwv3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>15831</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>