[LU-5648] corrupt files contain extra data Created: 22/Sep/14 Updated: 20/Oct/22 Resolved: 23/Jan/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.2 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Ned Bass | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB, llnl |
| Environment: | lustre-2.4.2-14chaos, ZFS OSD |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 15831 |
| Description |
|
Users have reported several recent cases of file corruption. The corrupt files are larger than expected and contain all of the original written data plus additional data at the end of the file. The additional data appears to be valid structured user data of unknown origin. We have not found anything unusual in the console logs from clients or servers at the time the files were written.

In one case, the user made a copy of a Lustre directory tree using HPSS archival storage tools, then compared the copy to the original. He found one corrupt file in the copy. The original file size was 2,000,000 bytes, but the copy size was 2,097,152 bytes (2 MiB), even though the archive tool reported 2,000,000 bytes written. The extra 97,152 bytes are again structured data of unknown origin. The corrupt file sizes are not always aligned on MiB boundaries, however. Of the cases reported so far, these are the sizes involved:
In Example 1, the "bad data" begins immediately at the end of the expected data, with no sparse area between. Seen below with od -A d -a, the expected data is random bytes, whereas from offset 2000000 onward the unexpected data is structured.
1999840 ! esc nul del [ dc3 + b h \ can ; f h dc4 9
1999856 D + U 1 j q g ; 7 J r { " j ) D
1999872 enq * C ` = o C & K \ a 1 D v k ht
1999888 ! A ; ff 2 " G i m 9 e dle $ si T )
1999904 9 etb nl w bel N rs R * nul eot o v p y can
1999920 1 4 $ c W l M D & 3 U J B ) t {
1999936 A del s I M dc1 esc w dc1 sp g bs dle ` sp A
1999952 nak D % l 1 r 1 W % ack ! h 0 syn c r
1999968 nak W ; b h W Z z w B stx bs " # J 7
1999984 h o $ em b V p bel ] dc2 o cr ) S del >
2000000 sp 1 1 1 9 nl B o x sp f r a c A A
2000016 sp 6 4 0 sp 6 7 9 sp 4 0 sp 7 9 sp 1
2000032 1 0 0 sp 1 1 1 9 nl B o x sp f r a
2000048 c A A sp 6 8 0 sp 7 1 9 sp 4 0 sp 7
2000064 9 sp 1 1 0 0 sp 1 1 1 9 nl B o x sp
2000080 f r a c A A sp 7 2 0 sp 7 5 9 sp 4
2000096 0 sp 7 9 sp 1 1 0 0 sp 1 1 1 9 nl B
2000112 o x sp f r a c A A sp 7 6 0 sp 7 9
2000128 9 sp 4 0 sp 7 9 sp 1 1 0 0 sp 1 1 1
2000144 9 nl B o x sp f r a c A A sp 8 0 0
In examples 2-4, there is a sparse region between the expected and unexpected data, ending on a 1 MiB boundary. Here is another od snippet illustrating the expected, sparse, and unexpected regions for example 2:
1008768 g nl . / t e s t d i r / c h m o
1008784 d s t g nl . / t e s t d i r / l
1008800 s t o r a g e nl . / t e s t f i
1008816 l e 3 nl . / z z j u n k nl nul nul nul
1008832 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
1048576 sp 5 9 9 nl B o x sp f r a c A A sp
1048592 2 0 0 sp 2 3 9 sp 2 0 0 sp 2 3 9 sp
1048608 5 8 0 sp 5 9 9 nl B o x sp f r a c
1048624 A A sp 2 4 0 sp 2 7 9 sp 2 0 0 sp 2
1048640 3 9 sp 5 8 0 sp 5 9 9 nl B o x sp f
1048656 r a c A A sp 2 8 0 sp 3 1 9 sp 2 0
1048672 0 sp 2 3 9 sp 5 8 0 sp 5 9 9 nl B o
1048688 x sp f r a c A A sp 3 2 0 sp 3 5 9
1048704 sp 2 0 0 sp 2 3 9 sp 5 8 0 sp 5 9 9
1048720 nl B o x sp f r a c A A sp 3 6 0 sp
1048736 3 9 9 sp 2 0 0 sp 2 3 9 sp 5 8 0 sp
1048752 5 9 9 nl B o x sp f r a c A A sp 4
In all 4 examples the corrupt data resides within the second OST object. In examples 2-4 the file should have only one object. This feels like some bug is causing the second OST object to be doubly linked, and our users have partially overwritten another user's data. |
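For reference, here is a minimal C sketch, not part of the original report, that checks for exactly the pattern described above: the copy must begin with the original's bytes, and anything beyond that is counted as the unexpected tail. The file paths are supplied on the command line; nothing here is specific to Lustre.
/* Minimal illustrative sketch (not a tool from this report): check that a
 * suspect copy is "all of the original's bytes plus an unexpected tail",
 * which is the corruption pattern described above. */
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <original> <copy>\n", argv[0]);
        return 2;
    }
    FILE *orig = fopen(argv[1], "rb");
    FILE *copy = fopen(argv[2], "rb");
    if (!orig || !copy) {
        perror("fopen");
        return 2;
    }

    long off = 0;
    int a;
    while ((a = fgetc(orig)) != EOF) {
        if (fgetc(copy) != a) {
            printf("copy diverges from original at offset %ld\n", off);
            return 1;
        }
        off++;
    }

    /* Original exhausted: anything left in the copy is the unexpected tail
     * (in examples 2-4 it would start with the sparse run of zero bytes). */
    long tail = 0;
    while (fgetc(copy) != EOF)
        tail++;
    printf("original length %ld, extra bytes in copy: %ld\n", off, tail);
    return tail != 0;
}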
| Comments |
| Comment by John Hammond [ 22/Sep/14 ] |
|
Hi Ned,
> This feels like some bug is causing the second OST object to be doubly linked, and our users have partially overwritten another user's data.
Based on the report I don't think that this is a necessary conclusion. It could be that this is old data from a previously unlinked file and that the object sizes were corrupted. (Or you might be correct. Or something else...)
What does stat return for these files? The real size or the expected size? Are you also able to stat the OST objects? Does the output of getstripe look normal?
> In examples 2-4 the file should have only one object.
Do you mean that the files should have stripe count 1 but instead have stripe count 2? |
| Comment by Ned Bass [ 22/Sep/14 ] |
> What does stat return for these files? The real size or the expected size?
stat returns the real size.
> Are you also able to stat the OST objects?
I haven't done that yet. I'll post my findings when I do.
> Does the output of getstripe look normal?
Yes. For example, here is the getstripe output for Example 3 above (expected size 36473):
[root@oslic1:bass6]# ls -l /p/lscratchv/czapotts/divt_vulcanlac3/htar_2nd_28025.d
-rw------- 1 czapotts czapotts 1053224 Sep 20 09:20 /p/lscratchv/czapotts/divt_vulcanlac3/htar_2nd_28025.d
[root@oslic1:bass6]# lfs getstripe -v /p/lscratchv/czapotts/divt_vulcanlac3/htar_2nd_28025.d
/p/lscratchv/czapotts/divt_vulcanlac3/htar_2nd_28025.d
lmm_magic: 0x0BD10BD0
lmm_seq: 0x20001484c
lmm_object_id: 0x7b50
lmm_stripe_count: 2
lmm_stripe_size: 1048576
lmm_stripe_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 17
obdidx objid objid group
17 7853811 0x77d6f3 0
51 7697453 0x75742d 0
> Do you mean that the files should have stripe count 1 but instead have stripe count 2?
Yes. |
| Comment by Ned Bass [ 22/Sep/14 ] |
Note the objects have different owners and timestamps. The contents of the objects match what I see in the file from a Lustre client.
[root@vesta18:0]# pwd
/mnt/toss-2688/O/0
[root@vesta18:0]# ls -li d$(( 7853811 % 32))/7853811
11240968 -rw-rw-rw- 1 40161 40161 36473 Sep 20 09:20 d19/7853811
[root@vesta18:0]# zdb -dddd vesta18/fsv-ost0@toss-2688 11240968
Dataset vesta18/fsv-ost0@toss-2688 [ZPL], ID 306, cr_txg 70829659, 24.1T, 3872191 objects, rootbp DVA[0]=<2:aafb5c4bc00:200> DVA[1]=<0:aafcede7200:200> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=70829659L/70829659P fill=3872191 cksum=1cbcf56f2f:8d6df67891c:17fb1399c1fb3:2ed1492ff40927
Object lvl iblk dblk dsize lsize %full type
11240968 1 16K 128K 7.00K 128K 100.00 ZFS plain file
312 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED
dnode maxblkid: 0
path ???<object#11240968>
uid 40161
gid 40161
atime Sat Sep 20 09:20:12 2014
mtime Sat Sep 20 09:20:10 2014
ctime Sat Sep 20 09:20:10 2014
crtime Sat Sep 20 09:19:53 2014
gen 70763960
mode 100666
size 36473
parent 0
links 1
pflags 0
rdev 0x0000000000000000
SA xattrs: 168 bytes, 3 entries
trusted.lma = \000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\363\326w\000\000\000\000\000
trusted.fid = LH\001\000\002\000\000\000P{\000\000\000\000\000\000
trusted.version = 8\255\314\000\024\000\000\000
Note:
# printf 'LH\001\000\002\000\000\000P{\000\000\000\000\000\000' | hexdump -C
00000000 4c 48 01 00 02 00 00 00 50 7b 00 00 00 00 00 00 |LH......P{......|
00000010
## f_seq = 0x20001484c
## f_oid = 0x7b50
## f_stripe_index = 0
Object 2:
[root@vesta52:~]# cd /mnt/toss-2688/O/0/
[root@vesta52:0]# ls -li d$(( 7697453 % 32 ))/7697453
11554471 -rw-rw-rw- 1 44793 44793 4648 Sep 19 14:55 d13/7697453
[root@vesta52:0]# zdb -dddd vesta52/fsv-ost0@toss-2688 11554471
Dataset vesta52/fsv-ost0@toss-2688 [ZPL], ID 327, cr_txg 74769068, 26.2T, 3803450 objects, rootbp DVA[0]=<1:b8de3f5f000:200> DVA[1]=<2:b8e5bdf5800:200> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=74769068L/74769068P fill=3803450 cksum=18f2760b51:7b49838026d:151711df02cf7:29c5f44d190860
Object lvl iblk dblk dsize lsize %full type
11554471 1 16K 128K 8K 128K 100.00 ZFS plain file
312 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED
dnode maxblkid: 0
path ???<object#11554471>
uid 44793
gid 44793
atime Wed Dec 31 16:00:00 1969
mtime Fri Sep 19 14:55:24 2014
ctime Fri Sep 19 14:55:24 2014
crtime Fri Sep 19 09:12:22 2014
gen 74681220
mode 100666
size 4648
parent 0
links 1
pflags 0
rdev 0x0000000000000000
SA xattrs: 168 bytes, 3 entries
trusted.lma = \000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000-tu\000\000\000\000\000
trusted.fid = \177>\001\000\002\000\000\000r\354\000\000\000\000\000\000
trusted.version = \032g\267\000\021\000\000\000
# printf '\177>\001\000\002\000\000\000r\354\000\000\000\000\000\000' | hexdump -C
00000000 7f 3e 01 00 02 00 00 00 72 ec 00 00 00 00 00 00 |.>......r.......|
00000010
## f_seq = 0x200013e7f
## f_oid = 0xec72
## f_stripe_index = 0
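As an editorial aside, the manual decode above can be reproduced with a small standalone C sketch. It assumes the trusted.fid xattr bytes are a little-endian FID layout (64-bit sequence, 32-bit object id, 32-bit stripe index), which is what the hand decode implies; it is an illustration, not Lustre's canonical definition.
/* Decode the 16 trusted.fid bytes shown for object 1 above, assuming a
 * little-endian layout of u64 f_seq, u32 f_oid, u32 f_stripe_index. */
#include <stdint.h>
#include <stdio.h>

static uint64_t le64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

int main(void)
{
    /* bytes from: printf 'LH\001\000\002\000\000\000P{...' | hexdump -C */
    const unsigned char fid[16] = {
        0x4c, 0x48, 0x01, 0x00, 0x02, 0x00, 0x00, 0x00,
        0x50, 0x7b, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    };

    printf("f_seq = 0x%llx\n", (unsigned long long)le64(fid)); /* 0x20001484c */
    printf("f_oid = 0x%x\n", le32(fid + 8));                   /* 0x7b50 */
    printf("f_stripe_index = 0x%x\n", le32(fid + 12));         /* 0 */
    return 0;
}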
|
| Comment by John Fuchs-Chesney (Inactive) [ 22/Sep/14 ] |
|
I'm looking after this ticket for the time being. |
| Comment by Ned Bass [ 23/Sep/14 ] |
Please disregard my comments about the stripe count. The filesystem default stripe count is 2, so that is expected. |
| Comment by Ned Bass [ 23/Sep/14 ] |
|
I have another example, a corrupted plain text file where all data fits within the first object. It clearly appears that another user's data is appended to the end:
[root@oslic1:bass6]# orig=/p/lscratchv/czapotts/divt_vulcanlac3/htar_1st_28025/Projects/nft_v3.02.01/edvrapi/.svn/prop-base/EdvrCopyCommand2.java.svn-base
[root@oslic1:bass6]# copy=/p/lscratchv/czapotts/divt_vulcanlac3/htar_2nd_28025/Projects/nft_v3.02.01/edvrapi/.svn/prop-base/EdvrCopyCommand2.java.svn-base
[root@oslic1:bass6]# ls -l $orig
-r-------- 1 czapotts czapotts 47 Jun 4 12:24 /p/lscratchv/czapotts/divt_vulcanlac3/htar_1st_28025/Projects/nft_v3.02.01/edvrapi/.svn/prop-base/EdvrCopyCommand2.java.svn-base
[root@oslic1:bass6]# ls -l $copy
-rw------- 1 czapotts czapotts 517 Jun 4 12:24 /p/lscratchv/czapotts/divt_vulcanlac3/htar_2nd_28025/Projects/nft_v3.02.01/edvrapi/.svn/prop-base/EdvrCopyCommand2.java.svn-base
[root@oslic1:bass6]# cat $orig
K 12
svn:keywords
V 19
URL Author Date Rev
END
[root@oslic1:bass6]# cat $copy
K 12
svn:keywords
V 19
URL Author Date Rev
END
s
#MSUB -l partition=cab # explicitly say where to run
#MSUB -l walltime=15:29:00 # maximum total CPU time
#MSUB -q pbatch # queue
#MSUB -A pemwork # bank account
#MSUB -e error # send error to the same file as output
#MSUB -m be # send email when job begins/ends
#MSUB -N P0.0T415.v.00005 # job name
cd /p/lscratchd/zepeda/<path snipped>
echo start job `date`
srun -n 576 ./lmp_cab.mgpt -in input.shear > output
echo "job ended: `date`"
[root@oslic1:bass6]# lfs getstripe -v $copy
/p/lscratchv/czapotts/divt_vulcanlac3/htar_2nd_28025/Projects/nft_v3.02.01/edvrapi/.svn/prop-base/EdvrCopyCommand2.java.svn-base
lmm_magic: 0x0BD10BD0
lmm_seq: 0x20001484c
lmm_object_id: 0x4f68
lmm_stripe_count: 2
lmm_stripe_size: 1048576
lmm_stripe_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 77
obdidx objid objid group
77 7754996 0x7654f4 0
21 7699516 0x757c3c 0
Examining the object on the server:
[root@vesta78:~]# zdb -dddd vesta78/fsv-ost0@a 11560703
Dataset vesta78/fsv-ost0@a [ZPL], ID 328, cr_txg 77338879, 25.4T, 3835180 objects, rootbp DVA[0]=<0:b630dd63600:200> DVA[1]=<1:b6271568c00:200> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=77338879L/77338879P fill=3835180 cksum=1cee422d3d:97fcac4292a:1b2c5923c2473:37578ea7aa912e
Object lvl iblk dblk dsize lsize %full type
11560703 1 16K 128K 5.00K 128K 100.00 ZFS plain file
312 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED
dnode maxblkid: 0
path ???<object#11560703>
uid 7714
gid 7714
atime Sat Sep 20 09:18:16 2014
mtime Wed Jun 4 12:24:23 2014
ctime Sat Sep 20 09:18:16 2014
crtime Fri Sep 19 12:46:01 2014
gen 77254223
mode 100666
size 517
parent 0
links 1
pflags 0
rdev 0x0000000000000000
SA xattrs: 168 bytes, 3 entries
trusted.lma = \000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\364Tv\000\000\000\000\000
trusted.fid = \3558\001\000\002\000\000\000\025\307\000\000\000\000\000\000
trusted.version = m\265\275\000\021\000\000\000
The OST object is owned by user 7714 (zepeda), not the owner of the original and copied files (czapotts). I looked for 517-byte files in zepeda's Lustre directory, and found two whose contents match the unexpected portion of the corrupt file. Note that the file is intact; it does not contain any data from the corrupt file:
[root@oslic1:bass6]# cat /p/lscratchv/zepeda/<path snipped>
#!/bin/bash
#MSUB -l nodes=36 # number of nodes
#MSUB -l partition=cab # explicitly say where to run
#MSUB -l walltime=15:29:00 # maximum total CPU time
#MSUB -q pbatch # queue
#MSUB -A pemwork # bank account
#MSUB -e error # send error to the same file as output
#MSUB -m be # send email when job begins/ends
#MSUB -N P0.0T415.v.00005 # job name
cd /p/lscratchd/zepeda/<path snipped>
echo start job `date`
srun -n 576 ./lmp_cab.mgpt -in input.shear > output
echo "job ended: `date`" |
| Comment by Zhenyu Xu [ 23/Sep/14 ] |
|
For the record, example 3's expected size is 36473 and its real size is 1053224. Its stripe size is 1048576, object 1's size is 36473 (the expected size), object 2's size is 4648, and the real size is the stripe size plus the object 2 size (1048576 + 4648 = 1053224). |
| Comment by Zhenyu Xu [ 23/Sep/14 ] |
|
Is it possible that you have a reliable reproducer, so that you could grab full debug (-1) logs during that process? It seems that the file size update has some problem there. |
| Comment by Oleg Drokin [ 23/Sep/14 ] |
|
Just a few notes here: I see there's a trusted.fid (and trusted.lma), so I wonder whether the link back to the MDS points to the same file as the one you found the object in. ll_decode_filter_fid should decode it, but I guess it only works on mounted (ldiskfs?) filesystems; I am not sure if you can make it work with a ZFS mount. |
| Comment by Ned Bass [ 23/Sep/14 ] |
|
Zhenyu Xu, we don't have a reliable reproducer. The cases we know about all happened on the same day, Saturday Sept. 20. I'm not sure what was going on with our systems at that time that may have contributed, but nothing obvious turned up in the system logs. Oleg, I'll investigate those avenues tomorrow. Regarding trusted.{lma,fid} consistency, I noticed that those attributes in the two objects I looked at in the comment above differ from each other. Should all objects belonging to the same file have the same values for those attributes? |
| Comment by Oleg Drokin [ 23/Sep/14 ] |
|
The values should not be the same, since they contain the object FID and the parent FID; only the parent FID is the same for objects belonging to the same file. Also, if all the problems happened at about the same time - I wonder if there was a recovery/restart event on an OST or the MDT at around that time? Evictions? |
| Comment by John Hammond [ 23/Sep/14 ] |
|
Ned, can you see if [0x200013e7f:0xec72:0x0] exists? And if so the would you report its attributes, its striping, and the output of zdb for its objects? |
| Comment by Ned Bass [ 23/Sep/14 ] |
|
John, it doesn't exist.
[root@oslic1:bass6]# lfs fid2path /p/lscratchv/ '[0x200013e7f:0xec72:0x0]'
fid2path error: No such file or directory |
| Comment by Ned Bass [ 23/Sep/14 ] |
The MDS crashed and was rebooted on Friday Sept. 19 around 4pm; recovery was manually aborted after about 15 minutes, and 316 of 416 clients were evicted:
2014-09-19 16:15:22 LustreError: 14196:0:(mdt_handler.c:6068:mdt_iocontrol()) fsv-MDT0000: Aborting recovery for device
2014-09-19 16:15:22 Lustre: fsv-MDT0000: Aborting recovery
2014-09-19 16:15:22 Lustre: 13669:0:(ldlm_lib.c:1428:abort_req_replay_queue()) @@@ aborted: req@ffff880fd75cf800 x1478966131560584/t0(126365419020) o36->a9203ee7-d515-f8b0-489f-93e7c500d1e6@172.20.17.40@o2ib500:0/0 lens 488/0 e 6 to 0 dl 1411168533 ref 1 fl Complete:/4/ffffffff rc 0/-1
2014-09-19 16:15:22 Lustre: 13669:0:(ldlm_lib.c:1428:abort_req_replay_queue()) @@@ aborted: req@ffff880febf24c00 x1478965666564300/t0(126365419031) o36->8d2dd478-7b5b-c326-aee6-9a80b0f770d4@172.20.17.20@o2ib500:0/0 lens 488/0 e 6 to 0 dl 1411168533 ref 1 fl Complete:/4/ffffffff rc 0/-1
2014-09-19 16:15:22 LustreError: 13669:0:(ldlm_lib.c:1449:abort_lock_replay_queue()) @@@ aborted: req@ffff881ff7a11800 x1476428833129976/t0(0) o101->80581121-e767-c582-2846-f150ac18daf0@172.16.65.5@tcp:0/0 lens 328/0 e 9 to 0 dl 1411168567 ref 1 fl Complete:/40/ffffffff rc 0/-1
2014-09-19 16:15:22 LustreError: 13669:0:(ldlm_lib.c:1449:abort_lock_replay_queue()) @@@ aborted: req@ffff880fe116f800 x1476428340347788/t0(0) o101->5e8e948f-4936-d26c-98de-80bffcc595b9@172.16.65.6@tcp:0/0 lens 328/0 e 12 to 0 dl 1411168537 ref 1 fl Complete:/40/ffffffff rc 0/-1
2014-09-19 16:15:22 Lustre: fsv-MDT0000: Recovery over after 17:09, of 416 clients 100 recovered and 316 were evicted.
The earliest report we have of a corrupted file (reported this morning) was Friday 9/19 at 16:43, so shortly after the MDS was back up. The others we know of happened on Saturday morning. To throw out some speculation, we are running a backport of the |
| Comment by Ned Bass [ 24/Sep/14 ] |
|
I have observed a correlation between OST object ownership and object creation time. Unexpected ownership indicates an object which may contain extra data belonging to another user. I systematically dumped zdb output for all objects belonging to a list of ~4000 corrupt files (all created on morning of Saturday 9/20 by the same untar process). Many of those objects are owned by some non-root user other than the file owner. Each such object with a bad owner, without exception, has a creation time on Friday 9/19 before the MDS was rebooted (not all the same time, but within a few hours of each other). Objects owned by the expected user have a mixture of creation times, some from before the MDS reboot, but others have a creation time that matches when the file was created. |
| Comment by Oleg Drokin [ 24/Sep/14 ] |
|
I am less certain about lu-5039 patch impact here as all affected files would be deleted already either way. On the other hand if you have corruption in precreate accounting (lastid) such that MDS would try to reuse objects already allocated on OSTs, that would likely produce similar effect to what you are seeing. I am still interested if there are any objects claimed by multiple files at the same time. I take it these objects are spread all across multiple OSTs, not just one? Why did the MDS crash, I wonder? I remember at one time LLNL was running MDS disk backends in some dangerous caching mode that needed a full fsck after every controller crash, though that was in ldiskfs days and such. Hopefully none of that is still true (and zfs protects everything tightly)? |
| Comment by Ned Bass [ 24/Sep/14 ] |
|
Oleg, I haven't found any evidence of multiply linked objects yet, but I haven't done an exhaustive search. All of the objects are spread across multiple OSTs. I think the MDS crashed because of some sort of dm-multipath driver lockup, and the node eventually OOM-ed. No, we're no longer using a dangerous caching mode, and we have integrity checking from ZFS. |
| Comment by Oleg Drokin [ 24/Sep/14 ] |
|
Additionally, are the MDS and OSTs x86 nodes? I think PPC is only used on the clients? |
| Comment by Ned Bass [ 24/Sep/14 ] |
|
Yes. |
| Comment by Ned Bass [ 24/Sep/14 ] |
|
This subset of OSTs all logged 'deleting orphan object' messages following the MDS reboot on Friday 9/19:
vesta[2-4,6,8-12,14-16,18,21-22,24-27,29-30,32-37,40-45,49-50,53-58,63-64,66-68,70,72,75,77,81-82,85,91,93-96]
This subset of OSTs has objects owned by a non-root user who is not the owner of the Lustre file:
vesta[1,5,7,13,17,19-20,23,28,31,38-39,46-48,51-52,59-62,65,69,71,73-74,76,78-80,83-84,86-90,92]
Note that the sets do not intersect and together comprise all OSS nodes in vesta. So it seems that the OSTs owning the bad objects did not complete orphan cleanup for some reason. |
| Comment by Oleg Drokin [ 25/Sep/14 ] |
|
hm, to me this sounds like a last_id corruption/update miss on the MDS side. Potential path (as a speculation at this point):
if (-diff > OST_MAX_PRECREATE) {
/* FIXME: should reset precreate_next_id on MDS */
rc = 0;
} else if (diff < 0) {
rc = ofd_orphans_destroy(env, exp, ofd, oa);
oseq->os_destroys_in_progress = 0;
} else {
/* XXX: Used by MDS for the first time! */
oseq->os_destroys_in_progress = 0;
}
Basically my reading of that is: if the MDS created way too many objects, we hit the first condition (-diff > OST_MAX_PRECREATE) and do nothing. There's a comment about readjustment, but I don't see how it really happens, unlike the orphan cleanup case where we actively skip used objects with ostid_set_id().
if (unlikely(ofd_object_exists(fo))) {
/* object may exist being re-created by write replay */
CDEBUG(D_INODE, "object "LPX64"/"LPX64" exists: "
DFID"\n", ostid_seq(&oseq->os_oi), id,
PFID(&info->fti_fid));
continue;
}
Since clients already wrote to those objects previously, stale data would be present there. But the replay on the MDS did not happen, so they might really be attached to files other than the new ones. Also, about the mismatched object ownership - do those objects happen to be in sequentially numbered clusters? |
| Comment by Ned Bass [ 25/Sep/14 ] |
I'll double-check, but I don't think so.
It appears so, but it's hard to say for sure because the trusted.fid can't be trusted (ha ha) so we can't easily go from OST object ID to Lustre file and determine the expected owner. My data is based on a list of about 4000 corrupt files provided by a user who untarred a tar file. So there were gaps between the object IDs, but they were close together. Also, as I mentioned above, they all have a ZFS creation timestamp from Friday morning to afternoon before the MDS reboot. |
| Comment by Ned Bass [ 25/Sep/14 ] |
|
Oleg, that certainly seems like a plausible scenario. In particular, ofd_create() here assumes that the MDS can't lose more than OST_MAX_PRECREATE object IDs after a crash:
if (-diff > OST_MAX_PRECREATE) {
/* FIXME: should reset precreate_next_id on MDS */
rc = 0;
But I think it is possible for more than one lastid update to happen within the same ZFS transaction group on the MDS. Suppose there were two lastid updates in the open transaction group when the MDS crashes. Then it could be up to 2 * OST_MAX_PRECREATE behind the OST on restart, and we would execute the code above and skip orphan cleanup. To prevent this, maybe OSP should not issue more precreate RPCs until the most recent update is stable on disk. Also, do you know why the FIXME comment was never implemented? |
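As an editorial aside, here is a tiny standalone C model of the arithmetic in this comment, with toy numbers and OST_MAX_PRECREATE assumed to be 20000. It is not Lustre code; it only shows how two uncommitted last_id updates can leave the replayed MDS more than OST_MAX_PRECREATE behind the OST, so the guard quoted above skips orphan cleanup and returns 0.
/* Toy model of the ofd_create() guard discussed above; the numbers and the
 * OST_MAX_PRECREATE value (assumed 20000 here) are illustrative. */
#include <stdio.h>

#define OST_MAX_PRECREATE 20000

int main(void)
{
    long long ost_last_oid = 1000000;   /* highest id the OST has created */
    /* Two precreate cycles were still in the open TXG when the MDS crashed,
     * so after restart the MDS asks about an id that is 2 * OST_MAX_PRECREATE
     * behind what the OST already allocated: */
    long long mds_request_id = ost_last_oid - 2 * OST_MAX_PRECREATE;

    long long diff = mds_request_id - ost_last_oid;   /* negative */

    if (-diff > OST_MAX_PRECREATE) {
        /* Mirrors the quoted branch: nothing is destroyed, rc stays 0, and
         * the MDS goes on to hand out ids for objects that already hold
         * another user's data. */
        printf("orphan cleanup skipped; %lld live objects can be reused\n", -diff);
    } else if (diff < 0) {
        printf("orphan cleanup would run\n");
    }
    return 0;
}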
| Comment by Oleg Drokin [ 25/Sep/14 ] |
|
I don't know why it was never implemented. Just the lack of a reset should not be too bad in itself; that would only break MDS precreate requests, I think. But the OST always seems to be precreating from ofd_seq_last_oid() anyway, and that one is not supposed to go backwards. So something else must happen on the OST to also reset the sequence's last_id. Even if we jumped to the next sequence, last_id there is stored separately, so there should be no conflict. Hm, actually in ofd_create() in that case the diff for creates is always below 0 and we skip the create, but I am not seeing how we communicate the error back to the MDS, so possibly it thinks we did allocate something up to where it was asked, and just reuses those objects (the default rc value of 0 certainly helps that)? I'll try to construct a test case to verify this theory. As for sequences of objects, when you see those files owned by the wrong non-root user - are they named sequentially? (I think even on ZFS we still use the id as the filename.) |
| Comment by Ned Bass [ 25/Sep/14 ] |
Not exactly, the objects with a wrong owner are interleaved with objects owned by root or the expected owner. For example, some of the following sequential objects on a given OST are linked to files reported as corrupt. All of those not owned by uid 40161 or root are associated with corrupt files.
11541167 -rw-rw-rw- 1 40161 40161 1294675 May 6 09:47 O/0/d27/7756091
11541168 -rwSrwSrw- 1 root root 0 May 6 09:47 O/0/d28/7756092
11541170 -rwSrwSrw- 1 root root 0 May 6 09:47 O/0/d29/7756093
11541171 -rw-rw-rw- 1 40161 40161 2097152 May 6 09:47 O/0/d30/7756094
11541172 -rw-rw-rw- 1 40161 40161 11682 May 6 09:46 O/0/d31/7756095
11541173 -rwSrwSrw- 1 root root 0 May 6 09:46 O/0/d0/7756096
11541174 -rw-rw-rw- 1 36782 36782 1048576 May 6 09:46 O/0/d1/7756097
11541175 -rw-rw-rw- 1 40161 40161 5061 May 6 09:46 O/0/d2/7756098
11541176 -rwSrwSrw- 1 root root 0 May 6 09:46 O/0/d3/7756099
11541177 -rw-rw-rw- 1 36782 36782 1294675 May 6 09:46 O/0/d4/7756100
11541178 -rwSrwSrw- 1 root root 0 May 6 09:46 O/0/d5/7756101
11541179 -rw-rw-rw- 1 36782 36782 1294675 May 6 09:46 O/0/d6/7756102
11541180 -rwSrwSrw- 1 root root 0 May 6 09:46 O/0/d7/7756103
11541181 -rw-rw-rw- 1 40161 40161 120 May 6 09:46 O/0/d8/7756104
11541182 -rw-rw-rw- 1 40161 40161 47 May 6 09:46 O/0/d9/7756105
11541183 -rw-rw-rw- 1 40161 40161 47 May 6 09:46 O/0/d10/7756106
11541184 -rw-rw-rw- 1 36782 36782 1048576 May 6 09:46 O/0/d11/7756107
11541185 -rw-rw-rw- 1 40161 40161 5470 May 6 09:46 O/0/d12/7756108
11541186 -rw-rw-rw- 1 40161 40161 1048576 May 6 09:46 O/0/d13/7756109
11541187 -rw-rw-rw- 1 40161 40161 6586 May 6 09:46 O/0/d14/7756110 |
| Comment by Oleg Drokin [ 26/Sep/14 ] |
|
the root owned object means no write to it happened yet (the special permissions mask tells us this, also size is 0 of course). |
| Comment by Oleg Drokin [ 26/Sep/14 ] |
|
Well, it seems I have a brute-force reproducer that I believe triggers the same thing that happened to you. It also seems to imply that my reasoning about the cause of the bug is correct. To cause the bug I had this patch applied (this is against your chaos tree, but it will probably work with master too, except that a different ofd file will need to be patched):
diff --git a/lustre/ofd/ofd_obd.c b/lustre/ofd/ofd_obd.c
index 7e21806..a0e2729 100644
--- a/lustre/ofd/ofd_obd.c
+++ b/lustre/ofd/ofd_obd.c
@@ -1235,7 +1235,7 @@ int ofd_create(const struct lu_env *env, struct obd_export *exp,
diff = ostid_id(&oa->o_oi) - ofd_seq_last_oid(oseq);
CDEBUG(D_HA, "ofd_last_id() = "LPU64" -> diff = %d\n",
ofd_seq_last_oid(oseq), diff);
- if (-diff > OST_MAX_PRECREATE) {
+ if (-diff > 2 /*OST_MAX_PRECREATE*/) {
/* FIXME: should reset precreate_next_id on MDS */
rc = 0;
} else if (diff < 0) {
diff --git a/lustre/tests/replay-single.sh b/lustre/tests/replay-single.sh
index ed2531d..a24a446 100755
--- a/lustre/tests/replay-single.sh
+++ b/lustre/tests/replay-single.sh
@@ -678,6 +678,19 @@ test_32() {
}
run_test 32 "close() notices client eviction; close() after client eviction"
+test_1234() {
+mkdir $DIR/dir1234
+mkdir $DIR/dir1234-2
+touch $DIR/dir1234/0
+replay_barrier $SINGLEMDS
+for i in `seq 1000` ; do echo aaaa$i > $DIR/dir1234/file$i ; done
+fail_abort $SINGLEMDS
+for i in `seq 1000` ; do touch $DIR/dir1234-2/file$i ; done
+test -s $DIR/dir1234-2/file20 && return 5
+exit
+}
+run_test 1234 "test something"
+
test_33a() {
createmany -o $DIR/$tfile-%d 10
replay_barrier_nosync $SINGLEMDS
After that, run e.g. REFORMAT=yes ONLY=1234 sh replay-single.sh
It'll fail with some nonsensical error message about the first df failing that I don't want to dig into. The most important part is: when you do ls -l /mnt/lustre/dir1234-2 you'll see that every other file is non-zero size! (Why every other? I have no idea at this point, as both OSTs should have entered this condition.)
[root@centos6-0 ~]# ls -l /mnt/lustre/dir1234-2/
...
-rw-r--r-- 1 root root 8 Sep 25 21:48 file979
-rw-r--r-- 1 root root 0 Sep 25 21:48 file98
-rw-r--r-- 1 root root 0 Sep 25 21:48 file980
-rw-r--r-- 1 root root 8 Sep 25 21:48 file981
-rw-r--r-- 1 root root 0 Sep 25 21:48 file982
-rw-r--r-- 1 root root 8 Sep 25 21:48 file983
-rw-r--r-- 1 root root 0 Sep 25 21:48 file984
-rw-r--r-- 1 root root 8 Sep 25 21:48 file985
-rw-r--r-- 1 root root 0 Sep 25 21:48 file986
-rw-r--r-- 1 root root 8 Sep 25 21:48 file987
-rw-r--r-- 1 root root 0 Sep 25 21:48 file988
-rw-r--r-- 1 root root 8 Sep 25 21:48 file989
-rw-r--r-- 1 root root 8 Sep 25 21:48 file99
-rw-r--r-- 1 root root 0 Sep 25 21:48 file990
-rw-r--r-- 1 root root 8 Sep 25 21:48 file991
-rw-r--r-- 1 root root 0 Sep 25 21:48 file992
-rw-r--r-- 1 root root 8 Sep 25 21:48 file993
-rw-r--r-- 1 root root 0 Sep 25 21:48 file994
-rw-r--r-- 1 root root 8 Sep 25 21:48 file995
-rw-r--r-- 1 root root 0 Sep 25 21:48 file996
-rw-r--r-- 1 root root 8 Sep 25 21:48 file997
-rw-r--r-- 1 root root 0 Sep 25 21:48 file998
-rw-r--r-- 1 root root 9 Sep 25 21:48 file999
[root@centos6-0 ~]# cat /mnt/lustre/dir1234-2/file989
aaaa990
[root@centos6-0 ~]# cat /mnt/lustre/dir1234-2/file999
aaaa1000 |
| Comment by Oleg Drokin [ 26/Sep/14 ] |
|
Ok, I have two patches (against master; you can trivially apply them to your 2.4 tree by changing the file name in the patch from ofd_dev.c to ofd_obd.c, the code is the same):
http://review.whamcloud.com/12067 - This one will let the MDS know that it needs to advance its last_id if we do not want to honor the delorphan request because it is way too far in the past (note that even before the patch this also leads to leaking those objects, though lfsck should hopefully find and fix that). This should fix your problem at hand.
http://review.whamcloud.com/12068 - This one fixes the other uncovered problem: should the MDS last_id somehow become corrupted and point into the "past", we need to reject such precreate requests instead of honoring them and thus promoting double usage of objects and other inconsistencies. I am open to debate on this one if somebody sees a better way of addressing it. |
| Comment by Alex Zhuravlev [ 26/Sep/14 ] |
|
Sounds like lots of the clients were evicted at some point? That's the only way to get a very old last-used id after recovery (except for an internal issue with ZFS, which I don't think is the case here). |
| Comment by Oleg Drokin [ 26/Sep/14 ] |
|
The recovery was aborted, so yes, clients were evicted. |
| Comment by Oleg Drokin [ 26/Sep/14 ] |
|
I was wrong about the direct applicability of the master patches to b2_4, btw. Here are two b2_5 patches of the same changes that are directly cherry-pickable to b2_4 without any modification (I tried): http://review.whamcloud.com/12081
| Comment by Ned Bass [ 26/Sep/14 ] |
|
Thanks Oleg, nice work. I verified the patches using your reproducer. It would be great to get a test case based on your reproducer into the Lustre regression test suite. |
| Comment by Andreas Dilger [ 27/Sep/14 ] |
|
I think it is worthwhile to mention that LLNL is running the "debug" patch http://review.whamcloud.com/7761 on their production systems with osd_txg_sync_delay_us=-1, which essentially disables sync operations on the MDS. That means any operations which the MDS or OSS try to sync for consistency reasons are not actually committed to disk before replying to the client. I haven't looked at the code specifically, but I can imagine lots of places where distributed consistency can be lost if Lustre thinks some updates are safe on disk, but the OSD is only caching this in RAM and the server crashes. |
| Comment by Oleg Drokin [ 27/Sep/14 ] |
|
oh! I did not realize that. I have been having second thoughts about what would happen if we have sequence advancement (less of a concern now with 48 bits, more of a concern with DNE where it's only 32 bits). |
| Comment by Ned Bass [ 27/Sep/14 ] |
To clarify, -1 is the default value, which preserves the safe full-sync behavior. We run our MDS with osd_txg_sync_delay_us=-1 and osd_object_sync_delay_us=0. This mainly prevents client fsyncs from forcing a pool-wide sync. dt_trans_stop() still safely honors the sync behavior on our systems, so operations using that interface will be consistent after a crash. |
| Comment by Andreas Dilger [ 10/Nov/14 ] |
|
I'm wondering if osd-zfs OSTs may be more susceptible to this problem than osd-ldiskfs. With osd-ldiskfs, whenever there are writes to precreated objects they will trigger a journal commit, but with osd-zfs there may be a large number of writes in cache without committing the TXG. While it is possible to hit a similar scenario on osd-ldiskfs OSTs just precreating a large number of empty objects and not forcing a commit, this wouldn't actually lead to the data corruption since unwritten objects are still anonymous. As mentioned in http://review.whamcloud.com/12067 I think it makes sense to track the number of uncommitted precreates with a commit callback, and if this grows too large (OST_MAX_PRECREATE / 2 or so) it should start a transaction commit. I expect this wouldn't happen very often, so a precreate operation shouldn't be marked sync unless it really needs to be. |
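As an editorial aside, here is a rough C sketch of that suggestion. The transaction handle and the txn_register_commit_cb()/txn_force_commit() helpers are hypothetical stand-ins for the real OSD primitives; only the counting idea (force a commit once more than OST_MAX_PRECREATE / 2 precreates are uncommitted) comes from the comment above.
/* Sketch of the suggestion above: track uncommitted precreates and force a
 * commit when too many exist only in RAM.  txn_register_commit_cb() and
 * txn_force_commit() are hypothetical stand-ins, not the Lustre/OSD API. */
#include <stdatomic.h>
#include <stdio.h>

#define OST_MAX_PRECREATE 20000   /* assumed value, for illustration */

struct txn { int id; };           /* placeholder transaction handle */

static atomic_int uncommitted_precreates;

/* Runs when the transaction that precreated 'nr' objects is stable on disk. */
static void precreates_committed(int nr)
{
    atomic_fetch_sub(&uncommitted_precreates, nr);
}

/* Stubs: in real code these would hook into the OSD's TXG machinery. */
static void txn_register_commit_cb(struct txn *t, void (*cb)(int), int nr)
{
    (void)t; (void)cb; (void)nr;  /* the callback would fire at TXG commit */
}

static void txn_force_commit(struct txn *t)
{
    printf("txn %d: forcing commit, too many uncommitted precreates\n", t->id);
    atomic_store(&uncommitted_precreates, 0);   /* everything is now durable */
}

static void ost_after_precreate_batch(struct txn *t, int nr_created)
{
    atomic_fetch_add(&uncommitted_precreates, nr_created);
    txn_register_commit_cb(t, precreates_committed, nr_created);

    /* If a crash now could roll last_id back far enough to defeat orphan
     * cleanup, make the precreated ids durable before handing out more. */
    if (atomic_load(&uncommitted_precreates) > OST_MAX_PRECREATE / 2)
        txn_force_commit(t);
}

int main(void)
{
    struct txn t = { .id = 1 };
    for (int i = 0; i < 12; i++)                /* 12 batches of 1000 ids */
        ost_after_precreate_batch(&t, 1000);    /* crosses 10000 -> commit */
    return 0;
}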
| Comment by Gerrit Updater [ 23/Dec/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12067/ |
| Comment by Jodi Levi (Inactive) [ 29/Dec/14 ] |
|
Patch landed to Master. |
| Comment by Peter Jones [ 02/Jan/15 ] |
|
Does the second patch need to land also? http://review.whamcloud.com/#/c/12068/ |
| Comment by Peter Jones [ 02/Jan/15 ] |
|
Oleg confirms that it does |
| Comment by Jodi Levi (Inactive) [ 07/Jan/15 ] |
|
Emoly, could you please look into adding a regression test for this issue based on Oleg's reproducer? |
| Comment by Gerrit Updater [ 08/Jan/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12068/ |
| Comment by Gerrit Updater [ 09/Jan/15 ] |
|
Emoly Liu (emoly.liu@intel.com) uploaded a new patch: http://review.whamcloud.com/13309 |
| Comment by Jodi Levi (Inactive) [ 09/Jan/15 ] |
|
Lowering priority for adding test. |
| Comment by Gerrit Updater [ 23/Jan/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13309/ |
| Comment by Peter Jones [ 23/Jan/15 ] |
|
Landed for 2.7 |