Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.7.0
    • Affects Version: Lustre 2.4.2
    • Environment: lustre-2.4.2-14chaos, ZFS OSD
    • Severity: 3
    • 15831

    Description

      Users have reported several recent cases of file corruption. The corrupt files are larger than expected, and contain all of the original written data plus additional data at the end of the file. The additional data appears to be valid structured user data of unknown origin. We have not found anything unusual in the console logs from clients or servers at the time the files were written.

      In one case, the user made a copy of a Lustre directory tree using HPSS archival storage tools, then compared the copy to the original. He found one corrupt file in the copy. The original file size was 2,000,000 bytes, but the copy size was 2,097,152 (2 MiB). The archive tool reported 2,000,000 bytes written. The extra 97,152 bytes appear to be valid structured user data of unknown origin.
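
      For reference, a size mismatch like this is easy to confirm with standard tools. A minimal sketch, using hypothetical paths for the original file and the archived copy (cmp reports EOF on the shorter file when one is a strict prefix of the other, which is what happens when a copy merely has extra data appended):

      stat -c %s /lustre/path/original /archive/path/copy   # print both sizes in bytes
      cmp /lustre/path/original /archive/path/copy          # reports EOF on the original if the copy only appends data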

      The corrupt file sizes are not always aligned on MiB boundaries, however. Of the cases reported so far, these are the sizes involved:

      Example   Expected size (bytes)   Actual size (bytes)
      1         2,000,000               2,097,152
      2         1,008,829               2,097,152
      3         36,473                  1,053,224
      4         1,008,829               1,441,432

      In Example 1, the unexpected data begins immediately at the end of the expected data, with no sparse region in between. As seen below with od -A d -a, the expected data is random bytes, whereas from offset 2000000 onward the unexpected data is structured.

                                                                               
      1999840   ! esc nul del   [ dc3   +   b   h   \ can   ;   f   h dc4   9            
      1999856   D   +   U   1   j   q   g   ;   7   J   r   {   "   j   )   D            
      1999872 enq   *   C   `   =   o   C   &   K   \   a   1   D   v   k  ht            
      1999888   !   A   ;  ff   2   "   G   i   m   9   e dle   $  si   T   )            
      1999904   9 etb  nl   w bel   N  rs   R   * nul eot   o   v   p   y can            
      1999920   1   4   $   c   W   l   M   D   &   3   U   J   B   )   t   {            
      1999936   A del   s   I   M dc1 esc   w dc1  sp   g  bs dle   `  sp   A            
      1999952 nak   D   %   l   1   r   1   W   % ack   !   h   0 syn   c   r            
      1999968 nak   W   ;   b   h   W   Z   z   w   B stx  bs   "   #   J   7            
      1999984   h   o   $  em   b   V   p bel   ] dc2   o  cr   )   S del   >            
      2000000  sp   1   1   1   9  nl   B   o   x  sp   f   r   a   c   A   A            
      2000016  sp   6   4   0  sp   6   7   9  sp   4   0  sp   7   9  sp   1            
      2000032   1   0   0  sp   1   1   1   9  nl   B   o   x  sp   f   r   a            
      2000048   c   A   A  sp   6   8   0  sp   7   1   9  sp   4   0  sp   7            
      2000064   9  sp   1   1   0   0  sp   1   1   1   9  nl   B   o   x  sp            
      2000080   f   r   a   c   A   A  sp   7   2   0  sp   7   5   9  sp   4            
      2000096   0  sp   7   9  sp   1   1   0   0  sp   1   1   1   9  nl   B            
      2000112   o   x  sp   f   r   a   c   A   A  sp   7   6   0  sp   7   9            
      2000128   9  sp   4   0  sp   7   9  sp   1   1   0   0  sp   1   1   1            
      2000144   9  nl   B   o   x  sp   f   r   a   c   A   A  sp   8   0   0            
      

      In Examples 2-4, there is a sparse region between the expected and unexpected data, ending on a 1 MiB boundary. Here is another od snippet illustrating the expected, sparse, and unexpected regions for Example 2:

                                                                               
      1008768   g  nl   .   /   t   e   s   t   d   i   r   /   c   h   m   o            
      1008784   d   s   t   g  nl   .   /   t   e   s   t   d   i   r   /   l            
      1008800   s   t   o   r   a   g   e  nl   .   /   t   e   s   t   f   i            
      1008816   l   e   3  nl   .   /   z   z   j   u   n   k  nl nul nul nul            
      1008832 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul            
      *                                                                                  
      1048576  sp   5   9   9  nl   B   o   x  sp   f   r   a   c   A   A  sp            
      1048592   2   0   0  sp   2   3   9  sp   2   0   0  sp   2   3   9  sp            
      1048608   5   8   0  sp   5   9   9  nl   B   o   x  sp   f   r   a   c            
      1048624   A   A  sp   2   4   0  sp   2   7   9  sp   2   0   0  sp   2            
      1048640   3   9  sp   5   8   0  sp   5   9   9  nl   B   o   x  sp   f            
      1048656   r   a   c   A   A  sp   2   8   0  sp   3   1   9  sp   2   0            
      1048672   0  sp   2   3   9  sp   5   8   0  sp   5   9   9  nl   B   o            
      1048688   x  sp   f   r   a   c   A   A  sp   3   2   0  sp   3   5   9            
      1048704  sp   2   0   0  sp   2   3   9  sp   5   8   0  sp   5   9   9            
      1048720  nl   B   o   x  sp   f   r   a   c   A   A  sp   3   6   0  sp            
      1048736   3   9   9  sp   2   0   0  sp   2   3   9  sp   5   8   0  sp            
      1048752   5   9   9  nl   B   o   x  sp   f   r   a   c   A   A  sp   4            
      

      In all four examples the corrupt data resides within the second OST object, and in Examples 2-4 the file should have only one object. It appears that some bug is causing the second OST object to be linked into two files, so that our users have partially overwritten another user's data.
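
      For context, the file-to-object mapping referred to above can be read on any client with lfs getstripe; a minimal sketch with a hypothetical path. Each obdidx/objid pair in its output identifies one OST object backing the file, which is how a file can be matched against OST object listings like the one later in this ticket:

      lfs getstripe /lustre/path/suspect_file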

    Attachments

    Issue Links

    Activity

      [LU-5648] corrupt files contain extra data
            green Oleg Drokin added a comment -

            oh! I did not realize that.

            I have been having second thoughts about what would happen if we have sequence advancement (less of a concern now with 48 bits, more of a concern with DNE where it's only 32 bits).
            I was kind of expecting that all the sync logic would make sure this cannot happen, but...
            If sync is skipped and we also get into the next sequence and allocate some objects there, and then the MDS crashes and comes back with the old sequence - sure, we'll advance our pointer on reconnect to the end of the old sequence, and then the next sequence will be proposed, but numbering there would not start from 0, so we'll end up in the same situation (detected by the second patch, but that only prevents corruption; it does not actually allow any allocations).
            I wonder how we should best handle it, then?


            adilger Andreas Dilger added a comment -

            I think it is worthwhile to mention that LLNL is running the "debug" patch http://review.whamcloud.com/7761 on their production systems with osd_txg_sync_delay_us=-1, which essentially disables sync operations on the MDS. That means any operations which the MDS or OSS try to sync for consistency reasons are not actually committed to disk before replying to the client. I haven't looked at the code specifically, but I can imagine lots of places where distributed consistency can be lost if Lustre thinks some updates are safe on disk while the OSD is only caching them in RAM, and the server then crashes.


            nedbass Ned Bass (Inactive) added a comment -

            Thanks Oleg, nice work. I verified the patches using your reproducer. It would be great to get a test case based on your reproducer into the Lustre regression test suite.

            green Oleg Drokin added a comment -

            I was wrong about the direct applicability of the master patches to b2_4, by the way.

            Here are two b2_5 versions of the same patches that are directly cherry-pickable to b2_4 without any changes (I tried):

            http://review.whamcloud.com/12081
            http://review.whamcloud.com/12082
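
            For anyone pulling these down, the standard Gerrit fetch/cherry-pick flow should work; a sketch that assumes the usual refs/changes layout and patchset 1 (check the review pages above for the actual project path and latest patchset number):

            git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/81/12081/1 && git cherry-pick FETCH_HEAD
            git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/82/12082/1 && git cherry-pick FETCH_HEAD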

            green Oleg Drokin added a comment -

            The recovery was aborted, so yes, clients were evicted.


            bzzz Alex Zhuravlev added a comment -

            Sounds like lots of the clients were evicted at some point? That's the only way to get a very old last-used id after recovery (except for an internal issue with ZFS, which I don't think is the case here).

            green Oleg Drokin added a comment -

            Ok, I have two patches (against master; you can trivially apply them to your 2.4 tree by changing the file name in the patch from ofd_dev.c to ofd_obd.c, since the code is the same):

            http://review.whamcloud.com/12067 - This one will let the MDS know that it needs to advance its last_id if we do not want to honor the delorphan request because it is way too far in the past (note that this, even before the patch, also leads to leaking those objects, though lfsck should hopefully find and fix that). This should fix your problem at hand.

            http://review.whamcloud.com/12068 - This one fixes the other uncovered problem: should the MDS last_id somehow become corrupted and point into the "past", we need to reject such precreate requests instead of honoring them and thus promoting double usage of objects and other inconsistencies. I am open to debate on this one if somebody sees a better way of addressing it.
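
            A minimal sketch of the rename-and-apply step described above, assuming the 12067 change has been saved to a hypothetical local file 12067.patch and that you are at the top of the 2.4 tree:

            # rewrite the path in the diff headers, then apply
            sed 's/ofd_dev\.c/ofd_obd.c/g' 12067.patch | patch -p1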

            green Oleg Drokin added a comment -

            Well, it seems I have a brute-force reproducer that I believe triggers the same thing that happened to you. It also seems to imply that my reasoning about the cause of the bug is correct.

            To cause the bug I had this patch applied (this is against your chaos tree, but it will probably work with master too, except that a different ofd file will need to be patched):

            diff --git a/lustre/ofd/ofd_obd.c b/lustre/ofd/ofd_obd.c
            index 7e21806..a0e2729 100644
            --- a/lustre/ofd/ofd_obd.c
            +++ b/lustre/ofd/ofd_obd.c
            @@ -1235,7 +1235,7 @@ int ofd_create(const struct lu_env *env, struct obd_export *exp,
             		diff = ostid_id(&oa->o_oi) - ofd_seq_last_oid(oseq);
             		CDEBUG(D_HA, "ofd_last_id() = "LPU64" -> diff = %d\n",
             			ofd_seq_last_oid(oseq), diff);
            -		if (-diff > OST_MAX_PRECREATE) {
            +		if (-diff > 2 /*OST_MAX_PRECREATE*/) {
             			/* FIXME: should reset precreate_next_id on MDS */
             			rc = 0;
             		} else if (diff < 0) {
            diff --git a/lustre/tests/replay-single.sh b/lustre/tests/replay-single.sh
            index ed2531d..a24a446 100755
            --- a/lustre/tests/replay-single.sh
            +++ b/lustre/tests/replay-single.sh
            @@ -678,6 +678,19 @@ test_32() {
             }
             run_test 32 "close() notices client eviction; close() after client eviction"
             
            +test_1234() {
            +mkdir $DIR/dir1234
            +mkdir $DIR/dir1234-2
            +touch $DIR/dir1234/0
            +replay_barrier $SINGLEMDS
            +for i in `seq 1000` ; do echo aaaa$i > $DIR/dir1234/file$i ; done
            +fail_abort $SINGLEMDS
            +for i in `seq 1000` ; do touch $DIR/dir1234-2/file$i ; done
            +test -s $DIR/dir1234-2/file20 && return 5
            +exit
            +}
            +run_test 1234 "test something"
            +
             test_33a() {
                 createmany -o $DIR/$tfile-%d 10
                 replay_barrier_nosync $SINGLEMDS
            

            After that, run e.g.:

            REFORMAT=yes ONLY=1234 sh replay-single.sh
            

            It'll fail with some nonsensical error message about the first df failing that I don't want to dig into. The most important part is this: when you do ls -l /mnt/lustre/dir1234-2 you'll see that every other file has a non-zero size! (Why every other? I have no idea at this point, as both OSTs should have entered this condition.)
            When catting the files we also see stale garbage there (a quick way to tally this is sketched after the output below).

            [root@centos6-0 ~]# ls -l /mnt/lustre/dir1234-2/
            ...
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file979
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file98
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file980
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file981
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file982
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file983
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file984
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file985
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file986
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file987
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file988
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file989
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file99
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file990
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file991
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file992
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file993
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file994
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file995
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file996
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file997
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file998
            -rw-r--r-- 1 root root 9 Sep 25 21:48 file999
            [root@centos6-0 ~]# cat /mnt/lustre/dir1234-2/file989
            aaaa990
            [root@centos6-0 ~]# cat /mnt/lustre/dir1234-2/file999
            aaaa1000
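
            A quick way to tally the alternating pattern, assuming the same mount point as above:

            find /mnt/lustre/dir1234-2 -type f -size +0c | wc -l   # files that picked up stale data
            find /mnt/lustre/dir1234-2 -type f -size 0c | wc -l    # files that are still empty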
            
            green Oleg Drokin added a comment -

            The root-owned object means no write to it has happened yet (the special permissions mask tells us this, and of course the size is 0).
            The correct owner could be due to a later chown? Or because the previous allocation did not write to the file, while the new one did.
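
            If it helps to spot such never-written objects on an OST directly, a rough sketch; this assumes the OST object tree is visible as a regular POSIX directory tree (e.g. an ldiskfs OST mounted read-only), whereas on a ZFS OSD, as used here, the equivalent inspection would have to go through ZFS/zdb tooling instead:

            # root-owned, zero-length objects carrying the setuid+setgid "never written" mask
            find O -type f -user root -size 0c -perm -6000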

            nedbass Ned Bass (Inactive) added a comment - - edited

            As for sequences of objects, when you see those files owned by wrong non-root user - are they named sequentially?

            Not exactly, the objects with a wrong owner are interleaved with objects owned by root or the expected owner. For example, some of the following sequential objects on a given OST are linked to files reported as corrupt. All of those not owned by uid 40161 or root are associated with corrupt files.

            11541167 -rw-rw-rw- 1 40161 40161 1294675 May  6 09:47 O/0/d27/7756091
            11541168 -rwSrwSrw- 1 root root 0 May  6 09:47 O/0/d28/7756092
            11541170 -rwSrwSrw- 1 root root 0 May  6 09:47 O/0/d29/7756093
            11541171 -rw-rw-rw- 1 40161 40161 2097152 May  6 09:47 O/0/d30/7756094
            11541172 -rw-rw-rw- 1 40161 40161 11682 May  6 09:46 O/0/d31/7756095
            11541173 -rwSrwSrw- 1 root root 0 May  6 09:46 O/0/d0/7756096
            11541174 -rw-rw-rw- 1 36782 36782 1048576 May  6 09:46 O/0/d1/7756097
            11541175 -rw-rw-rw- 1 40161 40161 5061 May  6 09:46 O/0/d2/7756098
            11541176 -rwSrwSrw- 1 root root 0 May  6 09:46 O/0/d3/7756099
            11541177 -rw-rw-rw- 1 36782 36782 1294675 May  6 09:46 O/0/d4/7756100
            11541178 -rwSrwSrw- 1 root root 0 May  6 09:46 O/0/d5/7756101
            11541179 -rw-rw-rw- 1 36782 36782 1294675 May  6 09:46 O/0/d6/7756102
            11541180 -rwSrwSrw- 1 root root 0 May  6 09:46 O/0/d7/7756103
            11541181 -rw-rw-rw- 1 40161 40161 120 May  6 09:46 O/0/d8/7756104
            11541182 -rw-rw-rw- 1 40161 40161 47 May  6 09:46 O/0/d9/7756105
            11541183 -rw-rw-rw- 1 40161 40161 47 May  6 09:46 O/0/d10/7756106
            11541184 -rw-rw-rw- 1 36782 36782 1048576 May  6 09:46 O/0/d11/7756107
            11541185 -rw-rw-rw- 1 40161 40161 5470 May  6 09:46 O/0/d12/7756108
            11541186 -rw-rw-rw- 1 40161 40161 1048576 May  6 09:46 O/0/d13/7756109
            11541187 -rw-rw-rw- 1 40161 40161 6586 May  6 09:46 O/0/d14/7756110
            
            green Oleg Drokin added a comment -

            I don't know why it was never implemented.

            Just the lack of a reset should not be too bad in itself; that would only break MDS precreate requests, I think. But the OST always seems to be precreating from ofd_seq_last_oid() anyway, and that one is not supposed to go backwards. So something else must happen on the OST to also reset the seq last_id. Even if we jumped to the next sequence, last_id there is stored separately, so there should be no conflict.

            Hm, actually in ofd_create in that case the diff for creates is always below 0 and we skip the create, but I am not seeing how we communicate the error back to the MDS, so possibly it thinks we did allocate something up to where it was asked and just reuses those objects (the default rc value is 0, which certainly helps that)? I'll try to construct a test case to verify this theory.

            As for the sequences of objects: when you see those files owned by the wrong non-root user, are they named sequentially? (I think even on ZFS we still use the id as the filename.)


            People

              Assignee: green Oleg Drokin
              Reporter: nedbass Ned Bass (Inactive)
              Votes: 0
              Watchers: 16
