Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.7.0
    • Affects Version: Lustre 2.4.2
    • Environment: lustre-2.4.2-14chaos, ZFS OSD
    • Severity: 3
    • 15831

    Description

      Users have reported several recent cases of file corruption. The corrupt files are larger than expected, and contain all of the original written data plus additional data at the end of the file. The additional data appears to be valid structured user data of unknown origin. We have not found anything unusual in the console logs from clients or servers at the time the files were written.

      In one case, the user made a copy of a Lustre directory tree using HPSS archival storage tools, then compared the copy to the original. He found one corrupt file in the copy. The original file size was 2,000,000 bytes, but the copy size was 2,097,152 (2 MiB). The archive tool reported 2,000,000 bytes written. The extra 97,152 bytes appear to be valid structured user data of unknown origin.
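
      For reference, a size mismatch like this is easy to confirm with standard tools. A minimal sketch, using hypothetical paths for the original file and the archived copy (cmp reports EOF on the shorter file when one is a strict prefix of the other, which is what happens when a copy merely has extra data appended):

      stat -c %s /lustre/path/original /archive/path/copy   # print both sizes in bytes
      cmp /lustre/path/original /archive/path/copy          # reports EOF on the original if the copy only appends data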

      The corrupt file sizes are not always aligned on MiB boundaries, however. Of the cases reported so far, these are the sizes involved:

      Example   Expected size (bytes)   Actual size (bytes)
      1         2,000,000               2,097,152
      2         1,008,829               2,097,152
      3         36,473                  1,053,224
      4         1,008,829               1,441,432

      In Example 1, the unexpected data begins immediately at the end of the expected data, with no sparse region in between. As seen below with od -A d -a, the expected data is random bytes, whereas from offset 2000000 onward the unexpected data is structured.

                                                                               
      1999840   ! esc nul del   [ dc3   +   b   h   \ can   ;   f   h dc4   9            
      1999856   D   +   U   1   j   q   g   ;   7   J   r   {   "   j   )   D            
      1999872 enq   *   C   `   =   o   C   &   K   \   a   1   D   v   k  ht            
      1999888   !   A   ;  ff   2   "   G   i   m   9   e dle   $  si   T   )            
      1999904   9 etb  nl   w bel   N  rs   R   * nul eot   o   v   p   y can            
      1999920   1   4   $   c   W   l   M   D   &   3   U   J   B   )   t   {            
      1999936   A del   s   I   M dc1 esc   w dc1  sp   g  bs dle   `  sp   A            
      1999952 nak   D   %   l   1   r   1   W   % ack   !   h   0 syn   c   r            
      1999968 nak   W   ;   b   h   W   Z   z   w   B stx  bs   "   #   J   7            
      1999984   h   o   $  em   b   V   p bel   ] dc2   o  cr   )   S del   >            
      2000000  sp   1   1   1   9  nl   B   o   x  sp   f   r   a   c   A   A            
      2000016  sp   6   4   0  sp   6   7   9  sp   4   0  sp   7   9  sp   1            
      2000032   1   0   0  sp   1   1   1   9  nl   B   o   x  sp   f   r   a            
      2000048   c   A   A  sp   6   8   0  sp   7   1   9  sp   4   0  sp   7            
      2000064   9  sp   1   1   0   0  sp   1   1   1   9  nl   B   o   x  sp            
      2000080   f   r   a   c   A   A  sp   7   2   0  sp   7   5   9  sp   4            
      2000096   0  sp   7   9  sp   1   1   0   0  sp   1   1   1   9  nl   B            
      2000112   o   x  sp   f   r   a   c   A   A  sp   7   6   0  sp   7   9            
      2000128   9  sp   4   0  sp   7   9  sp   1   1   0   0  sp   1   1   1            
      2000144   9  nl   B   o   x  sp   f   r   a   c   A   A  sp   8   0   0            
      

      In Examples 2-4, there is a sparse region between the expected and unexpected data, ending on a 1 MiB boundary. Here is another od snippet illustrating the expected, sparse, and unexpected regions for Example 2:

                                                                               
      1008768   g  nl   .   /   t   e   s   t   d   i   r   /   c   h   m   o            
      1008784   d   s   t   g  nl   .   /   t   e   s   t   d   i   r   /   l            
      1008800   s   t   o   r   a   g   e  nl   .   /   t   e   s   t   f   i            
      1008816   l   e   3  nl   .   /   z   z   j   u   n   k  nl nul nul nul            
      1008832 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul            
      *                                                                                  
      1048576  sp   5   9   9  nl   B   o   x  sp   f   r   a   c   A   A  sp            
      1048592   2   0   0  sp   2   3   9  sp   2   0   0  sp   2   3   9  sp            
      1048608   5   8   0  sp   5   9   9  nl   B   o   x  sp   f   r   a   c            
      1048624   A   A  sp   2   4   0  sp   2   7   9  sp   2   0   0  sp   2            
      1048640   3   9  sp   5   8   0  sp   5   9   9  nl   B   o   x  sp   f            
      1048656   r   a   c   A   A  sp   2   8   0  sp   3   1   9  sp   2   0            
      1048672   0  sp   2   3   9  sp   5   8   0  sp   5   9   9  nl   B   o            
      1048688   x  sp   f   r   a   c   A   A  sp   3   2   0  sp   3   5   9            
      1048704  sp   2   0   0  sp   2   3   9  sp   5   8   0  sp   5   9   9            
      1048720  nl   B   o   x  sp   f   r   a   c   A   A  sp   3   6   0  sp            
      1048736   3   9   9  sp   2   0   0  sp   2   3   9  sp   5   8   0  sp            
      1048752   5   9   9  nl   B   o   x  sp   f   r   a   c   A   A  sp   4            
      

      In all four examples the corrupt data resides within the second OST object, and in Examples 2-4 the file should have only one object. It appears that some bug is causing the second OST object to be linked into two files, so that our users have partially overwritten another user's data.
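
      For context, the file-to-object mapping referred to above can be read on any client with lfs getstripe; a minimal sketch with a hypothetical path. Each obdidx/objid pair in its output identifies one OST object backing the file, which is how a file can be matched against OST object listings like the one later in this ticket:

      lfs getstripe /lustre/path/suspect_file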

    Attachments

    Issue Links

    Activity

      [LU-5648] corrupt files contain extra data
            green Oleg Drokin added a comment -

            oh! I did not realize that.

            I have been having second thoughts about what would happen if we have sequence advancement (less of a concern now with 48 bits, more of a concern with DNE where it's only 32 bits).
            I was kind of expecting that all the sync logic would make sure this cannot happen, but...
            If sync is skipped and we also get into the next sequence and allocate some objects there, and then the MDS crashes and comes back with the old sequence - sure, we'll advance our pointer on reconnect to the end of the old sequence, and then the next sequence will be proposed, but numbering there would not start from 0, so we'll end up in the same situation (detected by the second patch, but that only prevents corruption; it does not actually allow any allocations).
            I wonder how we should best handle it, then?


            adilger Andreas Dilger added a comment -

            I think it is worthwhile to mention that LLNL is running the "debug" patch http://review.whamcloud.com/7761 on their production systems with osd_txg_sync_delay_us=-1, which essentially disables sync operations on the MDS. That means any operations which the MDS or OSS try to sync for consistency reasons are not actually committed to disk before replying to the client. I haven't looked at the code specifically, but I can imagine lots of places where distributed consistency can be lost if Lustre thinks some updates are safe on disk while the OSD is only caching them in RAM, and the server then crashes.


            nedbass Ned Bass (Inactive) added a comment -

            Thanks Oleg, nice work. I verified the patches using your reproducer. It would be great to get a test case based on your reproducer into the Lustre regression test suite.

            green Oleg Drokin added a comment -

            I was wrong about the direct applicability of the master patches to b2_4, by the way.

            Here are two b2_5 versions of the same patches that are directly cherry-pickable to b2_4 without any changes (I tried):

            http://review.whamcloud.com/12081
            http://review.whamcloud.com/12082
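
            For anyone pulling these down, the standard Gerrit fetch/cherry-pick flow should work; a sketch that assumes the usual refs/changes layout and patchset 1 (check the review pages above for the actual project path and latest patchset number):

            git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/81/12081/1 && git cherry-pick FETCH_HEAD
            git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/82/12082/1 && git cherry-pick FETCH_HEAD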

            green Oleg Drokin added a comment -

            The recovery was aborted, so yes, clients were evicted.


            bzzz Alex Zhuravlev added a comment -

            Sounds like lots of the clients were evicted at some point? That's the only way to get a very old last-used id after recovery (except for an internal issue with ZFS, which I don't think is the case here).

            green Oleg Drokin added a comment -

            Ok, I have two patches (against master; you can trivially apply them to your 2.4 tree by changing the file name in the patch from ofd_dev.c to ofd_obd.c, since the code is the same):

            http://review.whamcloud.com/12067 - This one will let the MDS know that it needs to advance its last_id if we do not want to honor the delorphan request because it is way too far in the past (note that this, even before the patch, also leads to leaking those objects, though lfsck should hopefully find and fix that). This should fix your problem at hand.

            http://review.whamcloud.com/12068 - This one fixes the other uncovered problem: should the MDS last_id somehow become corrupted and point into the "past", we need to reject such precreate requests instead of honoring them and thus promoting double usage of objects and other inconsistencies. I am open to debate on this one if somebody sees a better way of addressing it.
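
            A minimal sketch of the rename-and-apply step described above, assuming the 12067 change has been saved to a hypothetical local file 12067.patch and that you are at the top of the 2.4 tree:

            # rewrite the path in the diff headers, then apply
            sed 's/ofd_dev\.c/ofd_obd.c/g' 12067.patch | patch -p1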

            green Oleg Drokin added a comment -

            Well, it seems I have a brute-force reproducer that I believe triggers the same thing that happened to you. It also seems to imply that my reasoning about the cause of the bug is correct.

            To cause the bug I had this patch applied (this is against your chaos tree, but it will probably work with master too, except that a different ofd file will need to be patched):

            diff --git a/lustre/ofd/ofd_obd.c b/lustre/ofd/ofd_obd.c
            index 7e21806..a0e2729 100644
            --- a/lustre/ofd/ofd_obd.c
            +++ b/lustre/ofd/ofd_obd.c
            @@ -1235,7 +1235,7 @@ int ofd_create(const struct lu_env *env, struct obd_export *exp,
             		diff = ostid_id(&oa->o_oi) - ofd_seq_last_oid(oseq);
             		CDEBUG(D_HA, "ofd_last_id() = "LPU64" -> diff = %d\n",
             			ofd_seq_last_oid(oseq), diff);
            -		if (-diff > OST_MAX_PRECREATE) {
            +		if (-diff > 2 /*OST_MAX_PRECREATE*/) {
             			/* FIXME: should reset precreate_next_id on MDS */
             			rc = 0;
             		} else if (diff < 0) {
            diff --git a/lustre/tests/replay-single.sh b/lustre/tests/replay-single.sh
            index ed2531d..a24a446 100755
            --- a/lustre/tests/replay-single.sh
            +++ b/lustre/tests/replay-single.sh
            @@ -678,6 +678,19 @@ test_32() {
             }
             run_test 32 "close() notices client eviction; close() after client eviction"
             
            +test_1234() {
            +mkdir $DIR/dir1234
            +mkdir $DIR/dir1234-2
            +touch $DIR/dir1234/0
            +replay_barrier $SINGLEMDS
            +for i in `seq 1000` ; do echo aaaa$i > $DIR/dir1234/file$i ; done
            +fail_abort $SINGLEMDS
            +for i in `seq 1000` ; do touch $DIR/dir1234-2/file$i ; done
            +test -s $DIR/dir1234-2/file20 && return 5
            +exit
            +}
            +run_test 1234 "test something"
            +
             test_33a() {
                 createmany -o $DIR/$tfile-%d 10
                 replay_barrier_nosync $SINGLEMDS
            

            After that, run e.g.:

            REFORMAT=yes ONLY=1234 sh replay-single.sh
            

            It'll fail with some nonsensical error message about the first df failing that I don't want to dig into. The most important part is this: when you do ls -l /mnt/lustre/dir1234-2 you'll see that every other file has a non-zero size! (Why every other? I have no idea at this point, as both OSTs should have entered this condition.)
            When catting the files we also see stale garbage there (a quick way to tally this is sketched after the output below).

            [root@centos6-0 ~]# ls -l /mnt/lustre/dir1234-2/
            ...
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file979
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file98
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file980
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file981
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file982
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file983
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file984
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file985
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file986
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file987
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file988
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file989
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file99
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file990
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file991
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file992
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file993
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file994
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file995
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file996
            -rw-r--r-- 1 root root 8 Sep 25 21:48 file997
            -rw-r--r-- 1 root root 0 Sep 25 21:48 file998
            -rw-r--r-- 1 root root 9 Sep 25 21:48 file999
            [root@centos6-0 ~]# cat /mnt/lustre/dir1234-2/file989
            aaaa990
            [root@centos6-0 ~]# cat /mnt/lustre/dir1234-2/file999
            aaaa1000
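
            A quick way to tally the alternating pattern, assuming the same mount point as above:

            find /mnt/lustre/dir1234-2 -type f -size +0c | wc -l   # files that picked up stale data
            find /mnt/lustre/dir1234-2 -type f -size 0c | wc -l    # files that are still empty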
            
            green Oleg Drokin added a comment -

            The root-owned object means no write to it has happened yet (the special permissions mask tells us this, and of course the size is 0).
            The correct owner could be due to a later chown? Or because the previous allocation did not write to the file, while the new one did.
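
            If it helps to spot such never-written objects on an OST directly, a rough sketch; this assumes the OST object tree is visible as a regular POSIX directory tree (e.g. an ldiskfs OST mounted read-only), whereas on a ZFS OSD, as used here, the equivalent inspection would have to go through ZFS/zdb tooling instead:

            # root-owned, zero-length objects carrying the setuid+setgid "never written" mask
            find O -type f -user root -size 0c -perm -6000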

            nedbass Ned Bass (Inactive) added a comment - - edited

            As for sequences of objects, when you see those files owned by wrong non-root user - are they named sequentially?

            Not exactly, the objects with a wrong owner are interleaved with objects owned by root or the expected owner. For example, some of the following sequential objects on a given OST are linked to files reported as corrupt. All of those not owned by uid 40161 or root are associated with corrupt files.

            11541167 -rw-rw-rw- 1 40161 40161 1294675 May  6 09:47 O/0/d27/7756091
            11541168 -rwSrwSrw- 1 root root 0 May  6 09:47 O/0/d28/7756092
            11541170 -rwSrwSrw- 1 root root 0 May  6 09:47 O/0/d29/7756093
            11541171 -rw-rw-rw- 1 40161 40161 2097152 May  6 09:47 O/0/d30/7756094
            11541172 -rw-rw-rw- 1 40161 40161 11682 May  6 09:46 O/0/d31/7756095
            11541173 -rwSrwSrw- 1 root root 0 May  6 09:46 O/0/d0/7756096
            11541174 -rw-rw-rw- 1 36782 36782 1048576 May  6 09:46 O/0/d1/7756097
            11541175 -rw-rw-rw- 1 40161 40161 5061 May  6 09:46 O/0/d2/7756098
            11541176 -rwSrwSrw- 1 root root 0 May  6 09:46 O/0/d3/7756099
            11541177 -rw-rw-rw- 1 36782 36782 1294675 May  6 09:46 O/0/d4/7756100
            11541178 -rwSrwSrw- 1 root root 0 May  6 09:46 O/0/d5/7756101
            11541179 -rw-rw-rw- 1 36782 36782 1294675 May  6 09:46 O/0/d6/7756102
            11541180 -rwSrwSrw- 1 root root 0 May  6 09:46 O/0/d7/7756103
            11541181 -rw-rw-rw- 1 40161 40161 120 May  6 09:46 O/0/d8/7756104
            11541182 -rw-rw-rw- 1 40161 40161 47 May  6 09:46 O/0/d9/7756105
            11541183 -rw-rw-rw- 1 40161 40161 47 May  6 09:46 O/0/d10/7756106
            11541184 -rw-rw-rw- 1 36782 36782 1048576 May  6 09:46 O/0/d11/7756107
            11541185 -rw-rw-rw- 1 40161 40161 5470 May  6 09:46 O/0/d12/7756108
            11541186 -rw-rw-rw- 1 40161 40161 1048576 May  6 09:46 O/0/d13/7756109
            11541187 -rw-rw-rw- 1 40161 40161 6586 May  6 09:46 O/0/d14/7756110
            
            green Oleg Drokin added a comment -

            I don't know why it was never implemented.

            Just the lack of a reset should not be too bad in itself; that would only break MDS precreate requests, I think. But the OST always seems to be precreating from ofd_seq_last_oid() anyway, and that one is not supposed to go backwards. So something else must happen on the OST to also reset the seq last_id. Even if we jumped to the next sequence, last_id there is stored separately, so there should be no conflict.

            Hm, actually in ofd_create in that case the diff for creates is always below 0 and we skip the create, but I am not seeing how we communicate the error back to the MDS, so possibly it thinks we did allocate something up to where it was asked and just reuses those objects (the default rc value is 0, which certainly helps that)? I'll try to construct a test case to verify this theory.

            As for the sequences of objects: when you see those files owned by the wrong non-root user, are they named sequentially? (I think even on ZFS we still use the id as the filename.)


            People

              Assignee: green Oleg Drokin
              Reporter: nedbass Ned Bass (Inactive)
              Votes: 0
              Watchers: 16
