[LU-5648] corrupt files contain extra data Created: 22/Sep/14  Updated: 20/Oct/22  Resolved: 23/Jan/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.2
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Critical
Reporter: Ned Bass Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: HB, llnl
Environment:

lustre-2.4.2-14chaos, ZFS OSD


Issue Links:
Related
is related to LU-15009 precreate should cleanup orphans upon... Resolved
Severity: 3
Rank (Obsolete): 15831

 Description   

Users have reported several recent cases of file corruption. The corrupt files are larger than expected, and contain all of the original written data plus additional data at the end of the file. The additional data appears to be valid structured user data of unknown origin. We have not found anything unusual in the console logs from clients or servers at the time the files were written.

In one case, the user made a copy of a Lustre directory tree using HPSS archival storage tools, then compared the copy to the original. He found one corrupt file in the copy. The original file size was 2,000,000 bytes, but the copy size was 2,097,152 (2 MiB). The archive tool reported 2,000,000 bytes written. The extra 97,152 bytes appear to be valid structured user data of unknown origin.

The corrupt file sizes are not always aligned on MiB boundaries, however. Of the cases reported so far, these are the sizes involved:

Example #   Expected Size   Actual Size
1           2000000         2097152
2           1008829         2097152
3           36473           1053224
4           1008829         1441432
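
For reference, this kind of boundary region can be inspected directly with od by skipping to just before the expected size; a minimal sketch for Example 1, with a placeholder path:

# Dump named characters with decimal offsets, starting shortly before the
# expected EOF at offset 2000000 (the file path is a placeholder).
od -A d -a -j 1999840 /path/to/corrupt_file | head -20

# Confirm the size the filesystem reports versus the expected 2000000 bytes.
stat -c '%s' /path/to/corrupt_file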

In Example 1, the "bad data" begins immediately at the end of the expected data, with no sparse area between. Seen below with od -A d -a, the expected data is random bytes, whereas from offset 2000000 onward the unexpected data is structured.

                                                                         
1999840   ! esc nul del   [ dc3   +   b   h   \ can   ;   f   h dc4   9            
1999856   D   +   U   1   j   q   g   ;   7   J   r   {   "   j   )   D            
1999872 enq   *   C   `   =   o   C   &   K   \   a   1   D   v   k  ht            
1999888   !   A   ;  ff   2   "   G   i   m   9   e dle   $  si   T   )            
1999904   9 etb  nl   w bel   N  rs   R   * nul eot   o   v   p   y can            
1999920   1   4   $   c   W   l   M   D   &   3   U   J   B   )   t   {            
1999936   A del   s   I   M dc1 esc   w dc1  sp   g  bs dle   `  sp   A            
1999952 nak   D   %   l   1   r   1   W   % ack   !   h   0 syn   c   r            
1999968 nak   W   ;   b   h   W   Z   z   w   B stx  bs   "   #   J   7            
1999984   h   o   $  em   b   V   p bel   ] dc2   o  cr   )   S del   >            
2000000  sp   1   1   1   9  nl   B   o   x  sp   f   r   a   c   A   A            
2000016  sp   6   4   0  sp   6   7   9  sp   4   0  sp   7   9  sp   1            
2000032   1   0   0  sp   1   1   1   9  nl   B   o   x  sp   f   r   a            
2000048   c   A   A  sp   6   8   0  sp   7   1   9  sp   4   0  sp   7            
2000064   9  sp   1   1   0   0  sp   1   1   1   9  nl   B   o   x  sp            
2000080   f   r   a   c   A   A  sp   7   2   0  sp   7   5   9  sp   4            
2000096   0  sp   7   9  sp   1   1   0   0  sp   1   1   1   9  nl   B            
2000112   o   x  sp   f   r   a   c   A   A  sp   7   6   0  sp   7   9            
2000128   9  sp   4   0  sp   7   9  sp   1   1   0   0  sp   1   1   1            
2000144   9  nl   B   o   x  sp   f   r   a   c   A   A  sp   8   0   0            

In examples 2-4, there is a sparse region between the expected and unexpected data, ending on a 1 MiB boundary. Here is another od snippet illustrating the expected, sparse, and unexpected regions for example 2:

                                                                         
1008768   g  nl   .   /   t   e   s   t   d   i   r   /   c   h   m   o            
1008784   d   s   t   g  nl   .   /   t   e   s   t   d   i   r   /   l            
1008800   s   t   o   r   a   g   e  nl   .   /   t   e   s   t   f   i            
1008816   l   e   3  nl   .   /   z   z   j   u   n   k  nl nul nul nul            
1008832 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul            
*                                                                                  
1048576  sp   5   9   9  nl   B   o   x  sp   f   r   a   c   A   A  sp            
1048592   2   0   0  sp   2   3   9  sp   2   0   0  sp   2   3   9  sp            
1048608   5   8   0  sp   5   9   9  nl   B   o   x  sp   f   r   a   c            
1048624   A   A  sp   2   4   0  sp   2   7   9  sp   2   0   0  sp   2            
1048640   3   9  sp   5   8   0  sp   5   9   9  nl   B   o   x  sp   f            
1048656   r   a   c   A   A  sp   2   8   0  sp   3   1   9  sp   2   0            
1048672   0  sp   2   3   9  sp   5   8   0  sp   5   9   9  nl   B   o            
1048688   x  sp   f   r   a   c   A   A  sp   3   2   0  sp   3   5   9            
1048704  sp   2   0   0  sp   2   3   9  sp   5   8   0  sp   5   9   9            
1048720  nl   B   o   x  sp   f   r   a   c   A   A  sp   3   6   0  sp            
1048736   3   9   9  sp   2   0   0  sp   2   3   9  sp   5   8   0  sp            
1048752   5   9   9  nl   B   o   x  sp   f   r   a   c   A   A  sp   4            

In all 4 examples the corrupt data resides within the second OST object. In examples 2-4 the file should have only one object. This feels like some bug is causing the second OST object to be doubly linked, and our users have partially overwritten another user's data.



 Comments   
Comment by John Hammond [ 22/Sep/14 ]

Hi Ned,

> This feels like some bug is causing the second OST object to be doubly linked, and our users have partially overwritten another user's data.

Based on the report I don't think that this is a necessary conclusion. It could be that this is old data from a previously unlinked file and that the object sizes were corrupted. (Or you might be correct. Or something else...) What does stat return for these files? The real size or the expected size? Are you also able to stat the OST objects? Does the output of getstripe look normal?
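
For reference, a rough sketch of gathering that information (the file path, OST mount point, and object id below are placeholders):

# On a client: size/ownership as seen by stat, plus the file layout.
stat -c 'size=%s uid=%u gid=%g' /path/to/suspect_file
lfs getstripe -v /path/to/suspect_file

# On the OSS (ZFS OSD), with the OST dataset mounted or snapshotted: OST
# objects live under O/<group>/d<objid % 32>/<objid>, so the object can be
# listed directly once the decimal objid is known from getstripe.
objid=7853811      # placeholder decimal object id
ls -li /mnt/ost/O/0/d$(( objid % 32 ))/$objid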

> In examples 2-4 the file should have only one object.

Do you mean that the files should have stripe count 1 but instead have stripe count 2?

Comment by Ned Bass [ 22/Sep/14 ]

What does stat return for these files? The real size or the expected size?

stat returns the real size.

Are you also able to stat the OST objects?

I haven't done that yet. I'll post my findings when I do.

Does the output of getstripe look normal?

Yes. For example, here is the getstripe output for Example 3 above (expected size 36473).

[root@oslic1:bass6]# ls -l /p/lscratchv/czapotts/divt_vulcanlac3/htar_2nd_28025.d
-rw------- 1 czapotts czapotts 1053224 Sep 20 09:20 /p/lscratchv/czapotts/divt_vulcanlac3/htar_2nd_28025.d
[root@oslic1:bass6]# lfs getstripe -v /p/lscratchv/czapotts/divt_vulcanlac3/htar_2nd_28025.d
/p/lscratchv/czapotts/divt_vulcanlac3/htar_2nd_28025.d
lmm_magic:          0x0BD10BD0
lmm_seq:            0x20001484c
lmm_object_id:      0x7b50
lmm_stripe_count:   2
lmm_stripe_size:    1048576
lmm_stripe_pattern: 1
lmm_layout_gen:     0
lmm_stripe_offset:  17
        obdidx           objid           objid           group
            17         7853811       0x77d6f3                0
            51         7697453       0x75742d                0

Do you mean that the files should have stripe count 1 but instead have stripe count 2?

Yes.

Comment by Ned Bass [ 22/Sep/14 ]

Are you also able to stat the OST objects?

Note the objects have different owners and timestamps. The contents of the objects match what I see in the file from a Lustre client.
Object 1:

[root@vesta18:0]# pwd
/mnt/toss-2688/O/0
[root@vesta18:0]# ls -li d$(( 7853811 % 32))/7853811
11240968 -rw-rw-rw- 1 40161 40161 36473 Sep 20 09:20 d19/7853811
[root@vesta18:0]# zdb -dddd vesta18/fsv-ost0@toss-2688 11240968
Dataset vesta18/fsv-ost0@toss-2688 [ZPL], ID 306, cr_txg 70829659, 24.1T, 3872191 objects, rootbp DVA[0]=<2:aafb5c4bc00:200> DVA[1]=<0:aafcede7200:200> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=70829659L/70829659P fill=3872191 cksum=1cbcf56f2f:8d6df67891c:17fb1399c1fb3:2ed1492ff40927

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
  11240968    1    16K   128K  7.00K   128K  100.00  ZFS plain file
                                        312   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED 
        dnode maxblkid: 0
        path    ???<object#11240968>
        uid     40161
        gid     40161
        atime   Sat Sep 20 09:20:12 2014
        mtime   Sat Sep 20 09:20:10 2014
        ctime   Sat Sep 20 09:20:10 2014
        crtime  Sat Sep 20 09:19:53 2014
        gen     70763960
        mode    100666
        size    36473
        parent  0
        links   1
        pflags  0
        rdev    0x0000000000000000
        SA xattrs: 168 bytes, 3 entries

                trusted.lma = \000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\363\326w\000\000\000\000\000
                trusted.fid = LH\001\000\002\000\000\000P{\000\000\000\000\000\000
                trusted.version = 8\255\314\000\024\000\000\000

Note:
# printf 'LH\001\000\002\000\000\000P{\000\000\000\000\000\000' | hexdump -C
00000000  4c 48 01 00 02 00 00 00  50 7b 00 00 00 00 00 00  |LH......P{......|
00000010
## f_seq = 0x20001484c
## f_oid = 0x7b50
## f_stripe_index = 0

Object 2:

[root@vesta52:~]# cd /mnt/toss-2688/O/0/                                                           
[root@vesta52:0]# ls -li d$(( 7697453 % 32 ))/7697453
11554471 -rw-rw-rw- 1 44793 44793 4648 Sep 19 14:55 d13/7697453
[root@vesta52:0]# zdb -dddd vesta52/fsv-ost0@toss-2688 11554471
Dataset vesta52/fsv-ost0@toss-2688 [ZPL], ID 327, cr_txg 74769068, 26.2T, 3803450 objects, rootbp DVA[0]=<1:b8de3f5f000:200> DVA[1]=<2:b8e5bdf5800:200> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=74769068L/74769068P fill=3803450 cksum=18f2760b51:7b49838026d:151711df02cf7:29c5f44d190860

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
  11554471    1    16K   128K     8K   128K  100.00  ZFS plain file
                                        312   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED 
        dnode maxblkid: 0
        path    ???<object#11554471>
        uid     44793
        gid     44793
        atime   Wed Dec 31 16:00:00 1969
        mtime   Fri Sep 19 14:55:24 2014
        ctime   Fri Sep 19 14:55:24 2014
        crtime  Fri Sep 19 09:12:22 2014
        gen     74681220
        mode    100666
        size    4648
        parent  0
        links   1
        pflags  0
        rdev    0x0000000000000000
        SA xattrs: 168 bytes, 3 entries

                trusted.lma = \000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000-tu\000\000\000\000\000
                trusted.fid = \177>\001\000\002\000\000\000r\354\000\000\000\000\000\000
                trusted.version = \032g\267\000\021\000\000\000

# printf '\177>\001\000\002\000\000\000r\354\000\000\000\000\000\000' | hexdump -C
00000000  7f 3e 01 00 02 00 00 00  72 ec 00 00 00 00 00 00  |.>......r.......|
00000010
## f_seq = 0x200013e7f
## f_oid = 0xec72
## f_stripe_index = 0
Comment by John Fuchs-Chesney (Inactive) [ 22/Sep/14 ]

I'm looking after this ticket for the time being.
~ jfc.

Comment by Ned Bass [ 23/Sep/14 ]

Do you mean that the files should have stripe count 1 but instead have stripe count 2?

Please disregard my comments about the stripe count. The filesystem default stripe count is 2, so that is expected.

Comment by Ned Bass [ 23/Sep/14 ]

I have another example, a corrupted plain text file where all data fits within the first object. It clearly appears that another user's data is appended to the end:

[root@oslic1:bass6]# orig=/p/lscratchv/czapotts/divt_vulcanlac3/htar_1st_28025/Projects/nft_v3.02.01/edvrapi/.svn/prop-base/EdvrCopyCommand2.java.svn-base
[root@oslic1:bass6]# copy=/p/lscratchv/czapotts/divt_vulcanlac3/htar_2nd_28025/Projects/nft_v3.02.01/edvrapi/.svn/prop-base/EdvrCopyCommand2.java.svn-base

[root@oslic1:bass6]# ls -l $orig
-r-------- 1 czapotts czapotts 47 Jun  4 12:24 /p/lscratchv/czapotts/divt_vulcanlac3/htar_1st_28025/Projects/nft_v3.02.01/edvrapi/.svn/prop-base/EdvrCopyCommand2.java.svn-base
[root@oslic1:bass6]# ls -l $copy
-rw------- 1 czapotts czapotts 517 Jun  4 12:24 /p/lscratchv/czapotts/divt_vulcanlac3/htar_2nd_28025/Projects/nft_v3.02.01/edvrapi/.svn/prop-base/EdvrCopyCommand2.java.svn-base

[root@oslic1:bass6]# cat $orig
K 12
svn:keywords
V 19
URL Author Date Rev
END

[root@oslic1:bass6]# cat $copy
K 12
svn:keywords
V 19
URL Author Date Rev
END
s
#MSUB -l partition=cab          # explicitly say where to run
#MSUB -l walltime=15:29:00      # maximum total CPU time
#MSUB -q pbatch                 # queue
#MSUB -A pemwork                # bank account
#MSUB -e error                  # send error to the same file as output
#MSUB -m be                     # send email when job begins/ends
#MSUB -N P0.0T415.v.00005       # job name

cd /p/lscratchd/zepeda/<path snipped>

echo start job `date`

srun -n 576 ./lmp_cab.mgpt -in input.shear > output

echo "job ended: `date`"

[root@oslic1:bass6]# lfs getstripe -v $copy
/p/lscratchv/czapotts/divt_vulcanlac3/htar_2nd_28025/Projects/nft_v3.02.01/edvrapi/.svn/prop-base/EdvrCopyCommand2.java.svn-base
lmm_magic:          0x0BD10BD0
lmm_seq:            0x20001484c
lmm_object_id:      0x4f68
lmm_stripe_count:   2
lmm_stripe_size:    1048576
lmm_stripe_pattern: 1
lmm_layout_gen:     0
lmm_stripe_offset:  77
        obdidx           objid           objid           group
            77         7754996       0x7654f4                0
            21         7699516       0x757c3c                0

Examining the object on the server:

[root@vesta78:~]# zdb -dddd vesta78/fsv-ost0@a 11560703
Dataset vesta78/fsv-ost0@a [ZPL], ID 328, cr_txg 77338879, 25.4T, 3835180 objects, rootbp DVA[0]=<0:b630dd63600:200> DVA[1]=<1:b6271568c00:200> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=77338879L/77338879P fill=3835180 cksum=1cee422d3d:97fcac4292a:1b2c5923c2473:37578ea7aa912e

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
  11560703    1    16K   128K  5.00K   128K  100.00  ZFS plain file
                                        312   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 0
        path    ???<object#11560703>
        uid     7714
        gid     7714
        atime   Sat Sep 20 09:18:16 2014
        mtime   Wed Jun  4 12:24:23 2014
        ctime   Sat Sep 20 09:18:16 2014
        crtime  Fri Sep 19 12:46:01 2014
        gen     77254223
        mode    100666
        size    517
        parent  0
        links   1
        pflags  0
        rdev    0x0000000000000000
        SA xattrs: 168 bytes, 3 entries

                trusted.lma = \000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\364Tv\000\000\000\000\000
                trusted.fid = \3558\001\000\002\000\000\000\025\307\000\000\000\000\000\000
                trusted.version = m\265\275\000\021\000\000\000

The OST object is owned by user 7714 (zepeda), not the owner of the original and copied files (czapotts). I looked for 517-byte files in zepeda's Lustre directory and found two whose contents match the unexpected portion of the corrupt file (a sketch of that kind of search follows the listing below). Note that zepeda's file is intact; it does not contain any data from the corrupt file:

[root@oslic1:bass6]# cat /p/lscratchv/zepeda/<path snipped>
#!/bin/bash
#MSUB -l nodes=36               # number of nodes
#MSUB -l partition=cab          # explicitly say where to run
#MSUB -l walltime=15:29:00      # maximum total CPU time
#MSUB -q pbatch                 # queue
#MSUB -A pemwork                # bank account
#MSUB -e error                  # send error to the same file as output
#MSUB -m be                     # send email when job begins/ends
#MSUB -N P0.0T415.v.00005       # job name

cd /p/lscratchd/zepeda/<path snipped>

echo start job `date`

srun -n 576 ./lmp_cab.mgpt -in input.shear > output

echo "job ended: `date`"

Comment by Zhenyu Xu [ 23/Sep/14 ]

For the record, example 3's expected size is 36473 and its real size is 1053224.

Its stripe size is 1048576, obj 1's size is 36473 (the expected size), obj 2's size is 4648, and the real size comes from the stripe size (1048576) + the obj 2 size (4648).
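
A quick arithmetic check of that decomposition:

# One full 1 MiB stripe on object 1 plus the 4648 bytes on object 2 gives the
# apparent file size reported for example 3.
echo $(( 1048576 + 4648 ))     # -> 1053224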

Comment by Zhenyu Xu [ 23/Sep/14 ]

Is it possible that you could find a reliable reproducer and grab -1 debug logs during that process? It seems the file size update has some problem there.
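
For reference, a minimal sketch of grabbing full (-1) debug logs around a reproduction attempt (the dump path is arbitrary):

# On the involved clients and servers: enable all debug flags and enlarge the
# debug buffer before running the reproducer.
lctl set_param debug=-1
lctl set_param debug_mb=1024

# ... reproduce the problem ...

# Dump the accumulated kernel debug log for analysis.
lctl dk /tmp/lustre-debug.$(hostname)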

Comment by Oleg Drokin [ 23/Sep/14 ]

Just a few notes here:
If you suspect a doubly linked object: having the object id, you should be able to find the other file referencing it to prove it (e.g. with an lfs find -stripe-index + getstripe combination to list all files owned by this user and see if one of them happens to use this object too).
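
A rough sketch of that search (the scan directory, OST UUID, and object id are placeholders, and lfs option names may vary by version):

# List files striped on the suspect OST, then look for the decimal object id
# in each file's layout. This is a potentially expensive scan and the grep is
# only approximate.
objid=7697453
lfs find --obd fsv-OST0033_UUID /p/lscratchv |
while read -r f; do
    lfs getstripe "$f" | grep -qw "$objid" && echo "references object $objid: $f"
done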

I see there's trusted.fid (and trusted.lma); I wonder if the link back at the MDS points to the same file as the one you found it in? ll_decode_filter_fid should decode it, but I guess it only works on mounted (ldiskfs?) filesystems; I am not sure if you can make it work with a ZFS mount.

Comment by Ned Bass [ 23/Sep/14 ]

Zhenyu Xu, we don't have a reliable reproducer. The cases we know about all happened on the same day, Saturday Sept. 20. I'm not sure what was going on with our systems at that time that may have contributed, but nothing obvious turned up in the system logs.

Oleg, I'll investigate those avenues tomorrow. Regarding trusted.{lma,fid} consistency, I noticed those attributes in the two objects I looked at in this comment differ from each other. Should all objects belonging to the same file have the same values for those attributes?

Comment by Oleg Drokin [ 23/Sep/14 ]

The values should not be the same, since they contain the object fid and the parent fid. Only the parent fid is the same for objects belonging to the same file.

Also, if all the problems happened at about the same time - I wonder if there was a recovery/restart event on the OST or MDT around then? Evictions?

Comment by John Hammond [ 23/Sep/14 ]

Ned, can you see if [0x200013e7f:0xec72:0x0] exists? And if so, would you report its attributes, its striping, and the output of zdb for its objects?

Comment by Ned Bass [ 23/Sep/14 ]

John, it doesn't exist.

[root@oslic1:bass6]# lfs fid2path /p/lscratchv/ '[0x200013e7f:0xec72:0x0]'
fid2path error: No such file or directory
Comment by Ned Bass [ 23/Sep/14 ]

Also, if all the problems happened at about the same time - I wonder if there was a recovery/restart event on the OST or MDT around then? Evictions?

The MDS was crashed and rebooted on Friday Sept. 19 around 4pm, and recovery was manually aborted after about 15 minutes, and 316 of 416 clients were evicted:

2014-09-19 16:15:22 LustreError: 14196:0:(mdt_handler.c:6068:mdt_iocontrol()) fsv-MDT0000: Aborting recovery for device
2014-09-19 16:15:22 Lustre: fsv-MDT0000: Aborting recovery
2014-09-19 16:15:22 Lustre: 13669:0:(ldlm_lib.c:1428:abort_req_replay_queue()) @@@ aborted:  req@ffff880fd75cf800 x1478966131560584/t0(126365419020) o36->a9203ee7-d515-f8b0-489f-93e7c500d1e6@172.20.17.40@o2ib500:0/0 lens 488/0 e 6 to 0 dl 1411168533 ref 1 fl Complete:/4/ffffffff rc 0/-1
2014-09-19 16:15:22 Lustre: 13669:0:(ldlm_lib.c:1428:abort_req_replay_queue()) @@@ aborted:  req@ffff880febf24c00 x1478965666564300/t0(126365419031) o36->8d2dd478-7b5b-c326-aee6-9a80b0f770d4@172.20.17.20@o2ib500:0/0 lens 488/0 e 6 to 0 dl 1411168533 ref 1 fl Complete:/4/ffffffff rc 0/-1
2014-09-19 16:15:22 LustreError: 13669:0:(ldlm_lib.c:1449:abort_lock_replay_queue()) @@@ aborted:  req@ffff881ff7a11800 x1476428833129976/t0(0) o101->80581121-e767-c582-2846-f150ac18daf0@172.16.65.5@tcp:0/0 lens 328/0 e 9 to 0 dl 1411168567 ref 1 fl Complete:/40/ffffffff rc 0/-1
2014-09-19 16:15:22 LustreError: 13669:0:(ldlm_lib.c:1449:abort_lock_replay_queue()) @@@ aborted:  req@ffff880fe116f800 x1476428340347788/t0(0) o101->5e8e948f-4936-d26c-98de-80bffcc595b9@172.16.65.6@tcp:0/0 lens 328/0 e 12 to 0 dl 1411168537 ref 1 fl Complete:/40/ffffffff rc 0/-1
2014-09-19 16:15:22 Lustre: fsv-MDT0000: Recovery over after 17:09, of 416 clients 100 recovered and 316 were evicted.

The earliest report we have of a corrupted file (reported this morning) was Friday 9/19 at 16:43, so shortly after the MDS was back up. The others we know of happened on Saturday morning. To throw out some speculation, we are running a backport of the LU-5039 patch to do orphan cleanup asynchronously. Perhaps there could be a race where orphaned objects get reused. We have seen it take several hours to complete orphan cleanup, so it's possible that it was still running during the time when all the corrupt files were written. Just another angle to consider.

Comment by Ned Bass [ 24/Sep/14 ]

I have observed a correlation between OST object ownership and object creation time. Unexpected ownership indicates an object which may contain extra data belonging to another user. I systematically dumped zdb output for all objects belonging to a list of ~4000 corrupt files (all created on morning of Saturday 9/20 by the same untar process). Many of those objects are owned by some non-root user other than the file owner. Each such object with a bad owner, without exception, has a creation time on Friday 9/19 before the MDS was rebooted (not all the same time, but within a few hours of each other). Objects owned by the expected user have a mixture of creation times, some from before the MDS reboot, but others have a creation time that matches when the file was created.
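
For reference, the survey loop might look roughly like this (the corrupt-file list is the one provided by the user; the OSS-side inspection of owner and crtime is done separately with ls -l / zdb as shown in earlier comments):

# For each corrupt file, extract (obdidx, objid) pairs from its layout so the
# backing objects can be checked on the matching OSS.
while read -r f; do
    lfs getstripe "$f" |
    awk -v file="$f" 'NF == 4 && $1 ~ /^[0-9]+$/ { print file, $1, $2 }'
done < corrupt_file_list.txt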

Comment by Oleg Drokin [ 24/Sep/14 ]

I am less certain about the LU-5039 patch's impact here, as all the affected files would already have been deleted either way.

On the other hand, if you have corruption in the precreate accounting (last_id) such that the MDS tries to reuse objects already allocated on the OSTs, that would likely produce an effect similar to what you are seeing.

I am still interested in whether there are any objects claimed by multiple files at the same time. I take it these objects are spread across multiple OSTs, not just one?

Why did the MDS crash, I wonder? I remember at one time LLNL was running MDS disk backends in some dangerous caching mode that needed a full fsck after every controller crash, though that was in ldiskfs days and such. Hopefully none of that is still true (and zfs protects everything tightly)?

Comment by Ned Bass [ 24/Sep/14 ]

Oleg, I haven't found any evidence of multiply linked objects yet, but I haven't done an exhaustive search.

All of the objects are spread across multiple OSTs.

I think the MDS crashed because of some sort of dm-multipath driver lockup, and the node eventually OOM-ed. No, we're no longer using a dangerous caching mode, and we have integrity checking from ZFS.

Comment by Oleg Drokin [ 24/Sep/14 ]

Additionally, are the MDS and OSTs x86 nodes? I think PPC is only used on the clients?

Comment by Ned Bass [ 24/Sep/14 ]

Yes.

Comment by Ned Bass [ 24/Sep/14 ]

This subset of OSTs all logged 'deleting orphan object' messages following the MDS reboot on Friday 9/19:

vesta[2-4,6,8-12,14-16,18,21-22,24-27,29-30,32-37,40-45,49-50,53-58,63-64,66-68,70,72,75,77,81-82,85,91,93-96]

This subset of OSTs has objects owned by a non-root user who is not the owner of the Lustre file:

vesta[1,5,7,13,17,19-20,23,28,31,38-39,46-48,51-52,59-62,65,69,71,73-74,76,78-80,83-84,86-90,92]

Note that the sets do not intersect and together comprise all OSS nodes in vesta. So it seems that the OSTs owning the bad objects did not complete orphan cleanup for some reason.
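
For reference, that classification could be derived with something like the following (the console log location and host list are hypothetical):

# Hosts whose OST console logs mention orphan cleanup after the MDS reboot.
pdsh -w vesta[1-96] "grep -c 'deleting orphan object' /var/log/messages" 2>/dev/null |
    awk -F': *' '$2 > 0 { print $1 }' | sort -V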

Comment by Oleg Drokin [ 25/Sep/14 ]

Hm, to me this sounds like last_id corruption or a missed last_id update on the MDS side.

Potential path (as speculation at this point):
The MDS has trouble writing updates to disk, but keeps creating files (in memory), requesting OST objects and so on. Clients write the files.
Eventually the MDS crashes; lots of actual writes are lost on the MDS, but the objects on the OSTs are populated.
Recovery is aborted (so the files are not recreated).
The MDS sends a delete-orphan request to the OST, and in ofd_create() we have this:

                if (-diff > OST_MAX_PRECREATE) {
                        /* FIXME: should reset precreate_next_id on MDS */
                        rc = 0;
                } else if (diff < 0) {
                        rc = ofd_orphans_destroy(env, exp, ofd, oa);
                        oseq->os_destroys_in_progress = 0;
                } else {
                        /* XXX: Used by MDS for the first time! */
                        oseq->os_destroys_in_progress = 0;
                }

Basically my reading of that is: if the MDS created way too many objects, we hit the first condition (-diff > OST_MAX_PRECREATE) and do nothing. There's a comment about readjustment, but I don't see how it really happens, unlike the orphan cleanup case where we actively skip used objects with ostid_set_id().
The objects are not destroyed.
It's not yet clear how, but if the OST then starts precreating from the point the MDS thinks is the starting point, we'll reallocate all those objects, and the creating function ofd_precreate_objects() happily accepts already existing objects:

                if (unlikely(ofd_object_exists(fo))) {
                        /* object may exist being re-created by write replay */
                        CDEBUG(D_INODE, "object "LPX64"/"LPX64" exists: "
                               DFID"\n", ostid_seq(&oseq->os_oi), id,
                               PFID(&info->fti_fid));
                        continue;
                }

Since clients already wrote to those objects previously, stale data would be present there. But replay on the MDS did not happen, so they might really be attached to files other than the new ones.
Did OSTs (or some of them) restart as well at around then?

Also, regarding the mismatched object ownership - do those objects happen to be in sequentially numbered clusters?

Comment by Ned Bass [ 25/Sep/14 ]

Did OSTs (or some of them) restart as well at around then?

I'll double-check, but I don't think so.

Also, regarding the mismatched object ownership - do those objects happen to be in sequentially numbered clusters?

It appears so, but it's hard to say for sure because the trusted.fid can't be trusted (ha ha) so we can't easily go from OST object ID to Lustre file and determine the expected owner. My data is based on a list of about 4000 corrupt files provided by a user who untarred a tar file. So there were gaps between the object IDs, but they were close together. Also, as I mentioned above, they all have a ZFS creation timestamp from Friday morning to afternoon before the MDS reboot.

Comment by Ned Bass [ 25/Sep/14 ]

Oleg, that certainly seems like a plausible scenario. In particular, ofd_create() here assumes that the MDS can't lose more than OST_MAX_PRECREATE object IDs after a crash.

                if (-diff > OST_MAX_PRECREATE) {
                        /* FIXME: should reset precreate_next_id on MDS */
                        rc = 0;

But I think it is possible for more than one lastid update to happen within the same ZFS transaction group on the MDS. Suppose there were two lastid updates in the open transaction group when the MDS crashes. Then it could be up to 2 * OST_MAX_PRECREATE behind the OST on restart, and we would execute the code above and skip orphan cleanup. To prevent this, maybe OSP should not issue more precreate RPCs until the most recent update is stable on disk.
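
A worked example of that window (OST_MAX_PRECREATE is assumed to be 20000 here, and the id values are made up):

# If two last_id updates sat in the open TXG at crash time, the MDS can come
# back up to 2 * OST_MAX_PRECREATE behind the OST, so -diff exceeds the limit
# and ofd_create() skips orphan cleanup entirely.
OST_MAX_PRECREATE=20000                                  # assumed value
ost_last_id=7760000                                      # hypothetical on-disk OST last_id
mds_next_id=$(( ost_last_id - 2 * OST_MAX_PRECREATE ))   # what the MDS remembers
diff=$(( mds_next_id - ost_last_id ))                    # as computed in ofd_create()
echo "diff=$diff, -diff=$(( -diff )) > $OST_MAX_PRECREATE -> orphan cleanup skipped"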

Also, do you know why the FIXME comment was never implemented?

Comment by Oleg Drokin [ 25/Sep/14 ]

I don't know why it was never implemented.

Just that the lack of a reset should not be too bad in itself; that would just break MDS precreate requests, I think. But the OST always seems to be precreating from ofd_seq_last_oid() anyway, and that one is not supposed to go backwards. So something else must happen on the OST to also reset the sequence last_id. Even if we jumped to the next sequence, last_id there is stored separately, so there should be no conflict.

Hm, actually in ofd_create() in that case the diff for creates is always below 0 and we skip the create, but I am not seeing how we communicate the error back to the MDS, so possibly it thinks we did allocate something up to where it asked, and just reuses those objects (the default rc value is 0, which certainly helps that)? I'll try to construct a test case to verify this theory.

As for sequences of objects, when you see those files owned by the wrong non-root user - are they named sequentially? (I think even on ZFS we still use the id as the filename.)

Comment by Ned Bass [ 25/Sep/14 ]

As for sequences of objects, when you see those files owned by the wrong non-root user - are they named sequentially?

Not exactly, the objects with a wrong owner are interleaved with objects owned by root or the expected owner. For example, some of the following sequential objects on a given OST are linked to files reported as corrupt. All of those not owned by uid 40161 or root are associated with corrupt files.

11541167 -rw-rw-rw- 1 40161 40161 1294675 May  6 09:47 O/0/d27/7756091
11541168 -rwSrwSrw- 1 root root 0 May  6 09:47 O/0/d28/7756092
11541170 -rwSrwSrw- 1 root root 0 May  6 09:47 O/0/d29/7756093
11541171 -rw-rw-rw- 1 40161 40161 2097152 May  6 09:47 O/0/d30/7756094
11541172 -rw-rw-rw- 1 40161 40161 11682 May  6 09:46 O/0/d31/7756095
11541173 -rwSrwSrw- 1 root root 0 May  6 09:46 O/0/d0/7756096
11541174 -rw-rw-rw- 1 36782 36782 1048576 May  6 09:46 O/0/d1/7756097
11541175 -rw-rw-rw- 1 40161 40161 5061 May  6 09:46 O/0/d2/7756098
11541176 -rwSrwSrw- 1 root root 0 May  6 09:46 O/0/d3/7756099
11541177 -rw-rw-rw- 1 36782 36782 1294675 May  6 09:46 O/0/d4/7756100
11541178 -rwSrwSrw- 1 root root 0 May  6 09:46 O/0/d5/7756101
11541179 -rw-rw-rw- 1 36782 36782 1294675 May  6 09:46 O/0/d6/7756102
11541180 -rwSrwSrw- 1 root root 0 May  6 09:46 O/0/d7/7756103
11541181 -rw-rw-rw- 1 40161 40161 120 May  6 09:46 O/0/d8/7756104
11541182 -rw-rw-rw- 1 40161 40161 47 May  6 09:46 O/0/d9/7756105
11541183 -rw-rw-rw- 1 40161 40161 47 May  6 09:46 O/0/d10/7756106
11541184 -rw-rw-rw- 1 36782 36782 1048576 May  6 09:46 O/0/d11/7756107
11541185 -rw-rw-rw- 1 40161 40161 5470 May  6 09:46 O/0/d12/7756108
11541186 -rw-rw-rw- 1 40161 40161 1048576 May  6 09:46 O/0/d13/7756109
11541187 -rw-rw-rw- 1 40161 40161 6586 May  6 09:46 O/0/d14/7756110
Comment by Oleg Drokin [ 26/Sep/14 ]

The root-owned objects mean that no write to them has happened yet (the special permission mask tells us this; the size is 0, of course).
A correct owner could be due to a later chown? Or because the previous allocation did not write to the file, while the new one did.

Comment by Oleg Drokin [ 26/Sep/14 ]

Well, it seems I have a brute-force reproducer that I believe is the same as what happened to you. It also seems to imply that my reasoning for the bug cause is correct.

To cause the bug I had this patch applied (this is against your chaos tree, but it will probably work with master too, except a different ofd file will need to be patched):

diff --git a/lustre/ofd/ofd_obd.c b/lustre/ofd/ofd_obd.c
index 7e21806..a0e2729 100644
--- a/lustre/ofd/ofd_obd.c
+++ b/lustre/ofd/ofd_obd.c
@@ -1235,7 +1235,7 @@ int ofd_create(const struct lu_env *env, struct obd_export *exp,
 		diff = ostid_id(&oa->o_oi) - ofd_seq_last_oid(oseq);
 		CDEBUG(D_HA, "ofd_last_id() = "LPU64" -> diff = %d\n",
 			ofd_seq_last_oid(oseq), diff);
-		if (-diff > OST_MAX_PRECREATE) {
+		if (-diff > 2 /*OST_MAX_PRECREATE*/) {
 			/* FIXME: should reset precreate_next_id on MDS */
 			rc = 0;
 		} else if (diff < 0) {
diff --git a/lustre/tests/replay-single.sh b/lustre/tests/replay-single.sh
index ed2531d..a24a446 100755
--- a/lustre/tests/replay-single.sh
+++ b/lustre/tests/replay-single.sh
@@ -678,6 +678,19 @@ test_32() {
 }
 run_test 32 "close() notices client eviction; close() after client eviction"
 
+test_1234() {
+mkdir $DIR/dir1234
+mkdir $DIR/dir1234-2
+touch $DIR/dir1234/0
+replay_barrier $SINGLEMDS
+for i in `seq 1000` ; do echo aaaa$i > $DIR/dir1234/file$i ; done
+fail_abort $SINGLEMDS
+for i in `seq 1000` ; do touch $DIR/dir1234-2/file$i ; done
+test -s $DIR/dir1234-2/file20 && return 5
+exit
+}
+run_test 1234 "test something"
+
 test_33a() {
     createmany -o $DIR/$tfile-%d 10
     replay_barrier_nosync $SINGLEMDS

After that run e.g.

REFORMAT=yes ONLY=1234 sh replay-single.sh

It'll fail with some nonsensical error message about the first df failing that I don't want to dig into. The most important part is: when you do ls -l /mnt/lustre/dir1234-2 you'll see that every other file has non-zero size! (Why every other? I have no idea at this point, as both OSTs should have entered this condition.)
When catting the files we also see stale garbage there.

[root@centos6-0 ~]# ls -l /mnt/lustre/dir1234-2/
...
-rw-r--r-- 1 root root 8 Sep 25 21:48 file979
-rw-r--r-- 1 root root 0 Sep 25 21:48 file98
-rw-r--r-- 1 root root 0 Sep 25 21:48 file980
-rw-r--r-- 1 root root 8 Sep 25 21:48 file981
-rw-r--r-- 1 root root 0 Sep 25 21:48 file982
-rw-r--r-- 1 root root 8 Sep 25 21:48 file983
-rw-r--r-- 1 root root 0 Sep 25 21:48 file984
-rw-r--r-- 1 root root 8 Sep 25 21:48 file985
-rw-r--r-- 1 root root 0 Sep 25 21:48 file986
-rw-r--r-- 1 root root 8 Sep 25 21:48 file987
-rw-r--r-- 1 root root 0 Sep 25 21:48 file988
-rw-r--r-- 1 root root 8 Sep 25 21:48 file989
-rw-r--r-- 1 root root 8 Sep 25 21:48 file99
-rw-r--r-- 1 root root 0 Sep 25 21:48 file990
-rw-r--r-- 1 root root 8 Sep 25 21:48 file991
-rw-r--r-- 1 root root 0 Sep 25 21:48 file992
-rw-r--r-- 1 root root 8 Sep 25 21:48 file993
-rw-r--r-- 1 root root 0 Sep 25 21:48 file994
-rw-r--r-- 1 root root 8 Sep 25 21:48 file995
-rw-r--r-- 1 root root 0 Sep 25 21:48 file996
-rw-r--r-- 1 root root 8 Sep 25 21:48 file997
-rw-r--r-- 1 root root 0 Sep 25 21:48 file998
-rw-r--r-- 1 root root 9 Sep 25 21:48 file999
[root@centos6-0 ~]# cat /mnt/lustre/dir1234-2/file989
aaaa990
[root@centos6-0 ~]# cat /mnt/lustre/dir1234-2/file999
aaaa1000
Comment by Oleg Drokin [ 26/Sep/14 ]

Ok, I have two patches (against master, you can trivially apply them to your 2.4 tree by changing the file name in the patch from ofd_dev.c to ofd_obd.c, the code is the same):

http://review.whamcloud.com/12067 - This one will let the MDS know that it needs to advance its last id if we do not want to honor the delorphan request because it is way too far in the past (note that this, even before the patch, also leads to leaking those objects, though lfsck should hopefully find and fix that). This should fix your problem at hand.

http://review.whamcloud.com/12068 - This one fixes the other uncovered problem: should the MDS last_id somehow become corrupted and point into the "past", we need to reject such precreate requests instead of honoring them and thus promoting double usage of objects and other inconsistencies. I am open to debate on this one if somebody sees a better way of addressing it.

Comment by Alex Zhuravlev [ 26/Sep/14 ]

Sounds like lots of the clients were evicted at some point? That's the only way to get a very old last used id after recovery (except for an internal issue with ZFS, which I don't think is the case here).

Comment by Oleg Drokin [ 26/Sep/14 ]

The recovery was aborted, so yes, clients were evicted.

Comment by Oleg Drokin [ 26/Sep/14 ]

I was wrong about the direct applicability of the master patches to b2_4, by the way.

Here are two b2_5 versions of the same patches that are directly cherry-pickable to b2_4 without any changes (I tried):

http://review.whamcloud.com/12081
http://review.whamcloud.com/12082

Comment by Ned Bass [ 26/Sep/14 ]

Thanks Oleg, nice work. I verified the patches using your reproducer. It would be great to get a test case based on your reproducer into the Lustre regression test suite.

Comment by Andreas Dilger [ 27/Sep/14 ]

I think it is worthwhile to mention that LLNL is running the "debug" patch http://review.whamcloud.com/7761 on their production systems with osd_txg_sync_delay_us=-1, which essentially disables sync operations on the MDS. That means any operations which the MDS or OSS try to sync for consistency reasons are not actually committed to disk before replying to the client. I haven't looked at the code specifically, but I can imagine lots of places where distributed consistency can be lost if Lustre thinks some updates are safe on disk, but the OSD is only caching this in RAM and the server crashes.

Comment by Oleg Drokin [ 27/Sep/14 ]

oh! I did not realize that.

I have been having second thoughts about what would happen if we have sequence advancement (less of a concern now with 48 bits, more of a concern with DNE where it's only 32 bits).
I was kind of expecting that all the sync logic would make sure this cannot happen, but...
If sync is skipped and we also get into the next sequence and allocate some objects there, and then the MDS crashes and comes back with the old sequence: sure, we'll advance our pointer on reconnect to the end of the old sequence, and then the next sequence will be proposed, but numbering there would not start from 0, so we'll end up in the same situation (detected by the second patch, but that only prevents the corruption; it does not actually allow any allocations).
I wonder how we can better handle that?

Comment by Ned Bass [ 27/Sep/14 ]

with osd_txg_sync_delay_us=-1, which essentially disables sync operations on the MDS.

To clarify, -1 is the default value which preserves the safe full sync behavior.

We run our MDS with osd_txg_sync_delay_us=-1 and osd_object_sync_delay_us=0. This mainly prevents client fsyncs from forcing a pool-wide sync. dt_trans_stop() still safely honors the sync behavior on our systems, so operations using that interface will be consistent after a crash.

Comment by Andreas Dilger [ 10/Nov/14 ]

I'm wondering if osd-zfs OSTs may be more susceptible to this problem than osd-ldiskfs. With osd-ldiskfs, whenever there are writes to precreated objects they will trigger a journal commit, but with osd-zfs there may be a large number of writes in cache without committing the TXG. While it is possible to hit a similar scenario on osd-ldiskfs OSTs just by precreating a large number of empty objects and not forcing a commit, that wouldn't actually lead to data corruption since unwritten objects are still anonymous.

As mentioned in http://review.whamcloud.com/12067 I think it makes sense to track the number of uncommitted precreates with a commit callback, and if this grows too large (OST_MAX_PRECREATE / 2 or so) it should start a transaction commit. I expect this wouldn't happen very often, so a precreate operation shouldn't be marked sync unless it really needs to be.

Comment by Gerrit Updater [ 23/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12067/
Subject: LU-5648 ofd: In destroy orphan case always let MDS know last id
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: da5773e2b498a4edacc26fbf610d0b7628818d93

Comment by Jodi Levi (Inactive) [ 29/Dec/14 ]

Patch landed to Master.

Comment by Peter Jones [ 02/Jan/15 ]

Does the second patch need to land also? http://review.whamcloud.com/#/c/12068/

Comment by Peter Jones [ 02/Jan/15 ]

Oleg confirms that it does

Comment by Jodi Levi (Inactive) [ 07/Jan/15 ]

Emoly,
Could you please complete the test based on Oleg's comments found here: https://jira.hpdd.intel.com/browse/LU-5648?focusedCommentId=95011&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-95011
Thank you!

Comment by Gerrit Updater [ 08/Jan/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12068/
Subject: LU-5648 ofd: Reject precreate requests below last_id
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 65a17be2d9e1acdec5d9aebaed007d4cb6d0ca11

Comment by Gerrit Updater [ 09/Jan/15 ]

Emoly Liu (emoly.liu@intel.com) uploaded a new patch: http://review.whamcloud.com/13309
Subject: LU-5648 test: add test_101 in replay-single.sh
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3e3939795114db6639153139935a5cd12d3b0cdd

Comment by Jodi Levi (Inactive) [ 09/Jan/15 ]

Lowering priority for adding test.

Comment by Gerrit Updater [ 23/Jan/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13309/
Subject: LU-5648 tests: a new test to avoid reallocated object IDs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5a914ccee212c8bf4920145a6b81fad83688e1a0

Comment by Peter Jones [ 23/Jan/15 ]

Landed for 2.7
