Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.4.2
    • Environment: lustre-2.4.2-14chaos, ZFS OSD
    • Severity: 3
    • 15831

    Description

      Users have reported several recent cases of file corruption. The corrupt files are larger than expected, and contain all of the original written data plus additional data at the end of the file. The additional data appears to be valid structured user data of unknown origin. We have not found anything unusual in the console logs from clients or servers at the time the files were written.

      In one case, the user made a copy of a Lustre directory tree using HPSS archival storage tools, then compared the copy to the original. He found one corrupt file in the copy. The original file size was 2,000,000 bytes, but the copy size was 2,097,152 (2 MiB). The archive tool reported 2,000,000 bytes written. The extra 97,152 bytes appear to be valid structured user data of unknown origin.

      The corrupt file sizes are not always aligned on MiB boundaries, however. Of the cases reported so far, these are the sizes involved:

      Example   Expected size (bytes)   Actual size (bytes)
      1         2000000                 2097152
      2         1008829                 2097152
      3         36473                   1053224
      4         1008829                 1441432
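
      A quick, hedged way to inspect such a file (the path and expected size are placeholders, not values from these reports): skip to the expected end-of-data offset with od and dump whatever follows, in the same format used in the listings below.

      file=/path/to/suspect/file        # placeholder
      expected=2000000                  # expected size in bytes
      actual=$(stat -c %s "$file")
      if [ "$actual" -gt "$expected" ]; then
              echo "$file: expected $expected bytes, got $actual"
              # dump the unexpected tail with decimal offsets and named characters
              od -A d -a -j "$expected" "$file" | head -n 20
      fi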

      In Example 1, the "bad data" begins immediately at the end of the expected data, with no sparse area between. Seen below with od -A d -a, the expected data is random bytes, whereas from offset 2000000 onward the unexpected data is structured.

                                                                               
      1999840   ! esc nul del   [ dc3   +   b   h   \ can   ;   f   h dc4   9            
      1999856   D   +   U   1   j   q   g   ;   7   J   r   {   "   j   )   D            
      1999872 enq   *   C   `   =   o   C   &   K   \   a   1   D   v   k  ht            
      1999888   !   A   ;  ff   2   "   G   i   m   9   e dle   $  si   T   )            
      1999904   9 etb  nl   w bel   N  rs   R   * nul eot   o   v   p   y can            
      1999920   1   4   $   c   W   l   M   D   &   3   U   J   B   )   t   {            
      1999936   A del   s   I   M dc1 esc   w dc1  sp   g  bs dle   `  sp   A            
      1999952 nak   D   %   l   1   r   1   W   % ack   !   h   0 syn   c   r            
      1999968 nak   W   ;   b   h   W   Z   z   w   B stx  bs   "   #   J   7            
      1999984   h   o   $  em   b   V   p bel   ] dc2   o  cr   )   S del   >            
      2000000  sp   1   1   1   9  nl   B   o   x  sp   f   r   a   c   A   A            
      2000016  sp   6   4   0  sp   6   7   9  sp   4   0  sp   7   9  sp   1            
      2000032   1   0   0  sp   1   1   1   9  nl   B   o   x  sp   f   r   a            
      2000048   c   A   A  sp   6   8   0  sp   7   1   9  sp   4   0  sp   7            
      2000064   9  sp   1   1   0   0  sp   1   1   1   9  nl   B   o   x  sp            
      2000080   f   r   a   c   A   A  sp   7   2   0  sp   7   5   9  sp   4            
      2000096   0  sp   7   9  sp   1   1   0   0  sp   1   1   1   9  nl   B            
      2000112   o   x  sp   f   r   a   c   A   A  sp   7   6   0  sp   7   9            
      2000128   9  sp   4   0  sp   7   9  sp   1   1   0   0  sp   1   1   1            
      2000144   9  nl   B   o   x  sp   f   r   a   c   A   A  sp   8   0   0            
      

      In Examples 2-4, there is a sparse region between the expected and unexpected data, and the sparse region ends on a 1 MiB boundary. Here is another od snippet illustrating the expected, sparse, and unexpected regions for Example 2:

                                                                               
      1008768   g  nl   .   /   t   e   s   t   d   i   r   /   c   h   m   o            
      1008784   d   s   t   g  nl   .   /   t   e   s   t   d   i   r   /   l            
      1008800   s   t   o   r   a   g   e  nl   .   /   t   e   s   t   f   i            
      1008816   l   e   3  nl   .   /   z   z   j   u   n   k  nl nul nul nul            
      1008832 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul            
      *                                                                                  
      1048576  sp   5   9   9  nl   B   o   x  sp   f   r   a   c   A   A  sp            
      1048592   2   0   0  sp   2   3   9  sp   2   0   0  sp   2   3   9  sp            
      1048608   5   8   0  sp   5   9   9  nl   B   o   x  sp   f   r   a   c            
      1048624   A   A  sp   2   4   0  sp   2   7   9  sp   2   0   0  sp   2            
      1048640   3   9  sp   5   8   0  sp   5   9   9  nl   B   o   x  sp   f            
      1048656   r   a   c   A   A  sp   2   8   0  sp   3   1   9  sp   2   0            
      1048672   0  sp   2   3   9  sp   5   8   0  sp   5   9   9  nl   B   o            
      1048688   x  sp   f   r   a   c   A   A  sp   3   2   0  sp   3   5   9            
      1048704  sp   2   0   0  sp   2   3   9  sp   5   8   0  sp   5   9   9            
      1048720  nl   B   o   x  sp   f   r   a   c   A   A  sp   3   6   0  sp            
      1048736   3   9   9  sp   2   0   0  sp   2   3   9  sp   5   8   0  sp            
      1048752   5   9   9  nl   B   o   x  sp   f   r   a   c   A   A  sp   4            
      

      In all four examples the corrupt data resides in the second OST object, and in Examples 2-4 the file should have only one object. It looks as though some bug causes the second OST object to be doubly linked, so our users have partially overwritten another user's data.
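
      One hedged way to test the double-link theory (file paths are placeholders): lfs getstripe prints each file's OST index and object id, so if a corrupt file and another user's file really do share an OST object, the same obdidx/objid pair should appear in both listings.

      # Placeholders for two files suspected of sharing an OST object.
      for f in /path/to/corrupt_file /path/to/other_users_file; do
              echo "== $f"
              lfs getstripe "$f"
      done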

          Activity

            [LU-5648] corrupt files contain extra data

            bzzz Alex Zhuravlev added a comment -

            Sounds like lots of the clients were evicted at some point? That's the only way to get a very old last-used id after recovery (except for an internal issue with ZFS, which I don't think is the case here).
            green Oleg Drokin added a comment -

            OK, I have two patches (against master; you can trivially apply them to your 2.4 tree by changing the file name in the patch from ofd_dev.c to ofd_obd.c, the code is the same):

            http://review.whamcloud.com/12067 - This one lets the MDS know that it needs to advance its last id if we do not want to honor the delorphan request because it is way too far in the past (note that this, even before the patch, also leads to leaking those objects, though lfsck should hopefully find and fix that). This should fix your problem at hand.

            http://review.whamcloud.com/12068 - This one fixes the other uncovered problem: should the MDS last_id somehow become corrupted and point into the "past", we need to reject such precreate requests instead of honoring them and thus promoting double use of objects and other inconsistencies. I am open to debate on this one if somebody sees a better way of addressing it.

            green Oleg Drokin added a comment -

            Well, it seems I have a brute-force reproducer for what I believe is the same thing that happened to you. It also seems to confirm that my reasoning about the cause of the bug is correct.

            To cause the bug I had this patch applied (this is against your chaos tree, but it will probably work with master too, except that a different ofd file will need to be patched):

            diff --git a/lustre/ofd/ofd_obd.c b/lustre/ofd/ofd_obd.c
            index 7e21806..a0e2729 100644
            --- a/lustre/ofd/ofd_obd.c
            +++ b/lustre/ofd/ofd_obd.c
            @@ -1235,7 +1235,7 @@ int ofd_create(const struct lu_env *env, struct obd_export *exp,
             		diff = ostid_id(&oa->o_oi) - ofd_seq_last_oid(oseq);
             		CDEBUG(D_HA, "ofd_last_id() = "LPU64" -> diff = %d\n",
             			ofd_seq_last_oid(oseq), diff);
            -		if (-diff > OST_MAX_PRECREATE) {
            +		if (-diff > 2 /*OST_MAX_PRECREATE*/) {
             			/* FIXME: should reset precreate_next_id on MDS */
             			rc = 0;
             		} else if (diff < 0) {
            diff --git a/lustre/tests/replay-single.sh b/lustre/tests/replay-single.sh
            index ed2531d..a24a446 100755
            --- a/lustre/tests/replay-single.sh
            +++ b/lustre/tests/replay-single.sh
            @@ -678,6 +678,19 @@ test_32() {
             }
             run_test 32 "close() notices client eviction; close() after client eviction"
             
            +test_1234() {
            +mkdir $DIR/dir1234
            +mkdir $DIR/dir1234-2
            +touch $DIR/dir1234/0
            +replay_barrier $SINGLEMDS
            +for i in `seq 1000` ; do echo aaaa$i > $DIR/dir1234/file$i ; done
            +fail_abort $SINGLEMDS
            +for i in `seq 1000` ; do touch $DIR/dir1234-2/file$i ; done
            +test -s $DIR/dir1234-2/file20 && return 5
            +exit
            +}
            +run_test 1234 "test something"
            +
             test_33a() {
                 createmany -o $DIR/$tfile-%d 10
                 replay_barrier_nosync $SINGLEMDS
            

            After that run e.g.

            REFORMAT=yes ONLY=1234 sh replay-single.sh
            

            It'll fail with some nonsensical error message about "first df failed" that I don't want to dig into. The most important part is that when you do ls -l /mnt/lustre/dir1234-2 you'll see that every other file has a non-zero size! (Why every other? I have no idea at this point, as both OSTs should have entered this condition.)
            When catting the files we also see stale garbage there.

            [root@centos6-0 ~]# ls -l /mnt/lustre/dir1234-2/
            ...
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file979
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file98
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file980
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file981
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file982
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file983
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file984
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file985
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file986
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file987
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file988
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file989
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file99
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file990
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file991
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file992
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file993
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file994
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file995
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file996
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file997
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file998
            -rw-r--r-- 1 root root 9 Sep 25 21:48 file999
            [root@centos6-0 ~]# cat /mnt/lustre/dir1234-2/file989
            aaaa990
            [root@centos6-0 ~]# cat /mnt/lustre/dir1234-2/file999
            aaaa1000
            
            green Oleg Drokin added a comment -

            A root-owned object means no write to it has happened yet (the special permissions mask tells us this; the size is also 0, of course).
            A correct owner could be due to a later chown, or because the previous allocation did not write to the file while the new one did.
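
            A hedged sketch (the mount point is a placeholder, and it assumes the OST object tree has been made visible for inspection as in the listing further down): such never-written precreated objects can be enumerated by their signature of root ownership, zero size, and the setuid/setgid "S" bits.

            # Sketch only: list precreated-but-never-written objects under an
            # OST's object tree.  -perm -6000 matches the setuid+setgid bits
            # shown as -rwSrwSrw- in the listing below; root ownership and
            # zero size complete the signature.
            find /mnt/ost0/O -type f -user root -size 0 -perm -6000 | head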

            nedbass Ned Bass (Inactive) added a comment (edited) -

            As for sequences of objects, when you see those files owned by wrong non-root user - are they named sequentially?

            Not exactly, the objects with a wrong owner are interleaved with objects owned by root or the expected owner. For example, some of the following sequential objects on a given OST are linked to files reported as corrupt. All of those not owned by uid 40161 or root are associated with corrupt files.

            11541167 -rw-rw-rw- 1 40161 40161 1294675 May  6 09:47 O/0/d27/7756091
            11541168 -rwSrwSrw- 1 root root 0 May  6 09:47 O/0/d28/7756092
            11541170 -rwSrwSrw- 1 root root 0 May  6 09:47 O/0/d29/7756093
            11541171 -rw-rw-rw- 1 40161 40161 2097152 May  6 09:47 O/0/d30/7756094
            11541172 -rw-rw-rw- 1 40161 40161 11682 May  6 09:46 O/0/d31/7756095
            11541173 -rwSrwSrw- 1 root root 0 May  6 09:46 O/0/d0/7756096
            11541174 -rw-rw-rw- 1 36782 36782 1048576 May  6 09:46 O/0/d1/7756097
            11541175 -rw-rw-rw- 1 40161 40161 5061 May  6 09:46 O/0/d2/7756098
            11541176 -rwSrwSrw- 1 root root 0 May  6 09:46 O/0/d3/7756099
            11541177 -rw-rw-rw- 1 36782 36782 1294675 May  6 09:46 O/0/d4/7756100
            11541178 -rwSrwSrw- 1 root root 0 May  6 09:46 O/0/d5/7756101
            11541179 -rw-rw-rw- 1 36782 36782 1294675 May  6 09:46 O/0/d6/7756102
            11541180 -rwSrwSrw- 1 root root 0 May  6 09:46 O/0/d7/7756103
            11541181 -rw-rw-rw- 1 40161 40161 120 May  6 09:46 O/0/d8/7756104
            11541182 -rw-rw-rw- 1 40161 40161 47 May  6 09:46 O/0/d9/7756105
            11541183 -rw-rw-rw- 1 40161 40161 47 May  6 09:46 O/0/d10/7756106
            11541184 -rw-rw-rw- 1 36782 36782 1048576 May  6 09:46 O/0/d11/7756107
            11541185 -rw-rw-rw- 1 40161 40161 5470 May  6 09:46 O/0/d12/7756108
            11541186 -rw-rw-rw- 1 40161 40161 1048576 May  6 09:46 O/0/d13/7756109
            11541187 -rw-rw-rw- 1 40161 40161 6586 May  6 09:46 O/0/d14/7756110
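
            A small sketch consistent with the listing above: the object id also determines the d* subdirectory, O/<seq>/d(objid % 32)/<objid>, which is why a sequential run of ids fans out across d0..d31 (objid 7756091 lands in d27, and 7756096 wraps back to d0).

            # Sketch: compute the on-disk path for an object id in sequence 0,
            # matching the listing above (7756091 -> O/0/d27/7756091).
            objid=7756091
            echo "O/0/d$((objid % 32))/$objid"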
            
            green Oleg Drokin added a comment -

            I don't know why it was never implemented.

            Just the lack of a reset should not be too bad in itself; that would only break MDS precreate requests, I think. But the OST always seems to be precreating from ofd_seq_last_oid() anyway, and that one is not supposed to go backwards. So something else must happen on the OST to also reset the sequence's last_id. Even if we jumped to the next sequence, last_id there is stored separately, so there should be no conflict.

            Hm, actually in ofd_create() in that case the diff for creates is always below 0 and we skip the create, but I am not seeing how we communicate the error back to the MDS, so possibly it thinks we did allocate something up to where it was asked and just reuses those objects (the default rc value is 0, which certainly helps)? I'll try to construct a test case to verify this theory.

            As for sequences of objects, when you see those files owned by the wrong non-root user, are they named sequentially? (I think even on ZFS we still use the id as the filename.)


            nedbass Ned Bass (Inactive) added a comment -

            Oleg, that certainly seems like a plausible scenario. In particular, ofd_create() here assumes that the MDS can't lose more than OST_MAX_PRECREATE object IDs after a crash:

                            if (-diff > OST_MAX_PRECREATE) {
                                    /* FIXME: should reset precreate_next_id on MDS */
                                    rc = 0;
            

            But I think it is possible for more than one lastid update to happen within the same ZFS transaction group on the MDS. Suppose there were two lastid updates in the open transaction group when the MDS crashes. Then it could be up to 2 * OST_MAX_PRECREATE behind the OST on restart, and we would execute the code above and skip orphan cleanup. To prevent this, maybe OSP should not issue more precreate RPCs until the most recent update is stable on disk.
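
            A toy, hedged illustration of that window (the numbers are made up, and OST_MAX_PRECREATE is assumed to be 20000 here): if the on-disk lastid lags by roughly two precreate windows, -diff exceeds OST_MAX_PRECREATE and the branch quoted above silently returns 0 instead of cleaning up orphans.

            # Toy numbers only; OST_MAX_PRECREATE assumed to be 20000.
            ost_last_id=140000            # highest object id the OST actually precreated
            mds_next_id=100000            # stale lastid the MDS replays after the crash
            max_precreate=20000
            diff=$((mds_next_id - ost_last_id))      # -40000: two windows behind
            if [ $((-diff)) -gt "$max_precreate" ]; then
                    echo "delorphan skipped: MDS lastid is $((-diff)) ids behind the OST"
            fi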

            Also, do you know why the FIXME comment was never implemented?


            nedbass Ned Bass (Inactive) added a comment -

            Did OSTs (or some of them) restart as well at around then?

            I'll double-check, but I don't think so.

            Also, the mismatched object ownership thing - are those objects happen to be in clusters sequentially numbered?

            It appears so, but it's hard to say for sure because the trusted.fid can't be trusted (ha ha) so we can't easily go from OST object ID to Lustre file and determine the expected owner. My data is based on a list of about 4000 corrupt files provided by a user who untarred a tar file. So there were gaps between the object IDs, but they were close together. Also, as I mentioned above, they all have a ZFS creation timestamp from Friday morning to afternoon before the MDS reboot.

            green Oleg Drokin added a comment -

            Hm, to me this sounds like last_id corruption or a missed update on the MDS side.

            Potential path (speculation at this point):
            The MDS has trouble writing updates to disk, but keeps creating files (in memory), requesting OST objects and so on. Clients write the files.
            Eventually the MDS crashes; lots of actual writes are lost on the MDS, but the objects on the OSTs are populated.
            Recovery is aborted (so the files are not recreated).
            The MDS sends a delorphan request to the OST, and in ofd_create() we have this:

                            if (-diff > OST_MAX_PRECREATE) {
                                    /* FIXME: should reset precreate_next_id on MDS */
                                    rc = 0;
                            } else if (diff < 0) {
                                    rc = ofd_orphans_destroy(env, exp, ofd, oa);
                                    oseq->os_destroys_in_progress = 0;
                            } else {
                                    /* XXX: Used by MDS for the first time! */
                                    oseq->os_destroys_in_progress = 0;
                            }
            

            Basically my reading of that is: if the MDS created way too many objects, we hit the first condition (-diff > OST_MAX_PRECREATE) and do nothing. There is a comment about readjustment, but I don't see how it actually happens, unlike the orphan-cleanup case where we actively skip used objects with ostid_set_id().
            The objects are not destroyed.
            It is not yet clear how, but if the OST then starts precreating from the point the MDS thinks is the starting point, we will reallocate all those objects, and the creating function ofd_precreate_objects() happily accepts already existing objects:

                            if (unlikely(ofd_object_exists(fo))) {
                                    /* object may exist being re-created by write replay */
                                    CDEBUG(D_INODE, "object "LPX64"/"LPX64" exists: "
                                           DFID"\n", ostid_seq(&oseq->os_oi), id,
                                           PFID(&info->fti_fid));
                                    continue;
                            }
            

            Since clients already wrote to those objects previously, stale data would be present there. But replay on the MDS did not happen, so the objects may really be attached to files other than the new ones.
            Did any of the OSTs restart around that time as well?

            Also, about the mismatched object ownership: do those objects happen to be in sequentially numbered clusters?


            nedbass Ned Bass (Inactive) added a comment -

            This subset of OSTs all logged 'deleting orphan object' messages following the MDS reboot on Friday 9/19:

            vesta[2-4,6,8-12,14-16,18,21-22,24-27,29-30,32-37,40-45,49-50,53-58,63-64,66-68,70,72,75,77,81-82,85,91,93-96]
            

            This subset of OSTs has objects owned by a non-root user who is not the owner of the Lustre file:

            vesta[1,5,7,13,17,19-20,23,28,31,38-39,46-48,51-52,59-62,65,69,71,73-74,76,78-80,83-84,86-90,92]
            

            Note that the sets do not intersect and together comprise all OSS nodes in vesta. So it seems that the OSTs owning the bad objects did not complete orphan cleanup for some reason.
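
            A hedged way to double-check that claim, assuming ClusterShell's nodeset utility is available on the cluster: the intersection of the two sets should be empty, and their union should fold back to the complete list of vesta OSS nodes.

            # Sketch only; assumes ClusterShell's nodeset tool.
            cleaned='vesta[2-4,6,8-12,14-16,18,21-22,24-27,29-30,32-37,40-45,49-50,53-58,63-64,66-68,70,72,75,77,81-82,85,91,93-96]'
            stale='vesta[1,5,7,13,17,19-20,23,28,31,38-39,46-48,51-52,59-62,65,69,71,73-74,76,78-80,83-84,86-90,92]'
            nodeset -f "$cleaned" -i "$stale"   # intersection: expect no output
            nodeset -f "$cleaned" "$stale"      # union: expect the full vesta OSS list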


            People

              green Oleg Drokin
              nedbass Ned Bass (Inactive)
              Votes: 0
              Watchers: 16
