
data corruption when copying a file to a new directory (sles11sp2 only)

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.1
    • Component/s: None
    • Environment:
      server: centos 2.1.5 server OR centos 2.4.1 server
      client: sles11sp2 2.4.1 client

      Source can be found at github.com/jlan/lustre-nas. The tag for the client is 2.4.1-1nasC.
    • Severity: 3
    • 12006

    Description

      Users reported a data corruption problem. We have a test script to reproduce the problem.

      When run in a Lustre file system with a sles11sp2 host as the remote host, the script fails (sum reports 00000). It works if the remote host is running sles11sp1 or CentOS.

      — cut here for test5.sh —
      #!/bin/sh

      host=${1:-endeavour2}
      rm -fr zz hosts
      cp /etc/hosts hosts
      #fsync hosts
      ssh $host "cd $PWD && mkdir -p zz && cp hosts zz/"
      sum hosts zz/hosts
      — cut here —

      Good result:
      ./test5.sh r301i0n0
      61609 41 hosts
      61609 41 zz/hosts

      Bad result:
      ./test5.sh r401i0n2
      61609 41 hosts
      00000 41 zz/hosts

      Notes:

      • If the copied file is small enough (e.g., /etc/motd), the script succeeds.
      • If you uncomment the fsync, the script succeeds.
      • When it fails, stat reports no blocks have been allocated to the zz/hosts file (a quick shell check for this is sketched after these notes):

      $ stat zz/hosts
      File: `zz/hosts'
      Size: 41820 Blocks: 0 IO Block: 2097152 regular file
      Device: 914ef3a8h/2437870504d Inode: 163153538715835056 Links: 1
      Access: (0644/-rw-r--r--) Uid: (10491/dtalcott) Gid: ( 1179/ cstaff)
      Access: 2013-12-12 09:24:46.000000000 -0800
      Modify: 2013-12-12 09:24:46.000000000 -0800
      Change: 2013-12-12 09:24:46.000000000 -0800

      • If you run in an NFS file system, the script usually succeeds, but sometimes reports a no such file error on the sum of zz/hosts. After a few seconds, though, the file appears, with the correct sum. (Typical NFS behavior.)
      • Acts the same on nbp7 and nbp8.
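
      A quick way to check for that condition from the shell (a sketch only; it assumes GNU coreutils stat and simply compares the bytes actually allocated against the reported size):

      size=$(stat -c %s zz/hosts)      # st_size in bytes
      blocks=$(stat -c %b zz/hosts)    # number of allocated blocks (in units of %B)
      blksz=$(stat -c %B zz/hosts)     # size of each block reported by %b (normally 512)
      alloc=$((blocks * blksz))
      echo "size=$size allocated=$alloc"
      [ "$alloc" -lt "$size" ] && echo "zz/hosts is partly or wholly unallocated"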

      Attachments

        1. LU4380.dbg.20121230.resend.tgz
          2.17 MB
        2. LU4380.dbg.20121230.tgz
          2.17 MB
        3. LU4380.dbg.20131224
          2.76 MB
        4. LU-4380-debug.patch
          0.5 kB

        Issue Links

          Activity

            [LU-4380] data corruption when copying a file to a new directory (sles11sp2 only)

            Here is the second part of Dale's reply in response to Andreas' strace request. I did not include it in my first attempt. He actually did try with --sparse=never.

            == quote on ==
            So, there are two bugs here. First, Lustre did not update st_blocks for the source file soon enough. Second, sles11sp2's cp is too "smart" for its own good.

            FWIW:

            • I used the sles11sp1 version of cp under sles11sp2 and it produced a correct copy, in spite of the bad st_blocks value.
            • I tried adding the --sparse=never option to cp to see if I could get it to ignore st_blocks. That made it even stupider: It copied the 7 MiB as before, then explicitly filled the rest of st_size with zeros.
              == quote off ==
            jaylan Jay Lan (Inactive) added a comment
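
            To illustrate the first point (st_blocks lagging behind the written data), a minimal sketch that watches the source file's block count after a buffered copy; the ten-second polling loop and file names are arbitrary, and per this report the count only catches up once the dirty pages reach the OSTs (or after an explicit fsync):

            cp /etc/hosts hosts
            for i in 1 2 3 4 5 6 7 8 9 10; do
                # on the affected client st_blocks is expected to stay at 0 for a while
                echo "t=${i}s blocks=$(stat -c %b hosts) size=$(stat -c %s hosts)"
                sleep 1
            done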

            Just to collect some additional data, could you please add the --sparse=never option to your cp commands, see if that avoids failures, and again get straces of the cp.

            bogl Bob Glossman (Inactive) added a comment
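
            For reference, the requested variant only changes the cp line of the reproducer (a sketch; $host and $PWD as in test5.sh):

            ssh $host "cd $PWD && rm -rf zz && mkdir -p zz && cp --sparse=never hosts zz/"
            sum hosts zz/hosts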

            All the sles versions I'm testing with are newer than that. I have some 3.0.93-0.5, the most recent version of sles11sp2 we build and ship on, and some 3.0.101-05, the most recent kernel update version for sles11sp2.

            bogl Bob Glossman (Inactive) added a comment

            I believe the ioctl(3, 0xc020660b, 0x7fffffffd390) shown in the strace output is an FS_IOC_FIEMAP ioctl. I suspect that is where cp is getting the sizes to read/write. It is interesting that it matches the file allocation size (in 512-byte blocks) reported in st_blocks of the stat call, and is smaller than the st_size reported in the stat call.

            bogl Bob Glossman (Inactive) added a comment
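
            For what it's worth, 0xc020660b decodes to _IOWR('f', 11, struct fiemap), i.e. FS_IOC_FIEMAP, which matches that guess. The same interface can be exercised from the shell with filefrag from e2fsprogs (a diagnostic sketch; filefrag uses FIEMAP where available and prints the extents the filesystem reports):

            filefrag -v hosts        # extents of the source file
            filefrag -v zz/hosts     # on a bad copy this should report no extents at all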

            The sles11sp2 version we are running in production is 3.0.74-0.6.6.2.

            jaylan Jay Lan (Inactive) added a comment

            What is the specific kernel version of your sles11sp2? I ask because over time there have been several. I'm wondering whether different VFS-level changes in the version you are using could explain why I'm not seeing the problem reproduce.

            bogl Bob Glossman (Inactive) added a comment

            Data from our admin:

            == quote on ==
            This gets very interesting. Here is the good stuff from the strace from the cp that happens on the remote host:

            stat("zz/",

            {st_dev=makedev(3827, 595112), st_ino=163703357997847957, st_mode=S_ IFDIR|0755, st_nlink=2, st_uid=10491, st_gid=1179, st_blksize=4096, st_blocks=8, st_size=4096, st_atime=2013/12/18-11:30:20, st_mtime=2013/12/18-11:30:20, st_ctime=2013/12/18-11:30:20}

            ) = 0
            stat("hosts",

            {st_dev=makedev(3827, 595112), st_ino=163571199807331126, st_mode=S_IFREG|0644, st_nlink=1, st_uid=10491, st_gid=1179, st_blksize=4194304, st_blocks=14336, st_size=8037670, st_atime=2013/12/18-11:30:20, st_mtime=2013/12/18-11:30:20, st_ctime=2013/12/18-11:30:20}

            ) = 0
            stat("zz/hosts", 0x7fffffffe6c0) = -1 ENOENT (No such file or directory)
            open("hosts", O_RDONLY) = 3
            fstat(3,

            {st_dev=makedev(3827, 595112), st_ino=163571199807331126, st_mode=S_IFREG|0644, st_nlink=1, st_uid=10491, st_gid=1179, st_blksize=4194304, st_blocks=14336, st_size=8037670, st_atime=2013/12/18-11:30:20, st_mtime=2013/12/18-11:30:20, st_ctime=2013/12/18-11:30:20}

            ) = 0
            open("zz/hosts", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
            fstat(4,

            {st_dev=makedev(3827, 595112), st_ino=163703357997847959, st_mode=S_IFREG|0644, st_nlink=1, st_uid=10491, st_gid=1179, st_blksize=4194304, st_blocks=0, st_size=0, st_atime=2013/12/18-11:30:20, st_mtime=2013/12/18-11:30:20, st_ctime=2013/12/18-11:30:20}

            ) = 0
            mmap(NULL, 4202496, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fffec518000
            ioctl(3, 0xc020660b, 0x7fffffffd390) = 0
            read(3, "\37\213\10\0\373\353\202R\2\3\324<\375W\333\306\262\375\25\375\25\33!\32p\20\376H \201\3249"..., 4194304) = 4194304
            write(4, "\37\213\10\0\373\353\202R\2\3\324<\375W\333\306\262\375\25\375\25\33!\32p\20\376H \201\3249"..., 4194304) = 4194304
            read(3, "r\342B\316~\206g\324\35dn\263P\324.\302QAn\205\352\267\3640\370G\205L\222\17\242\327"..., 3145728) = 3145728
            write(4, "r\342B\316~\206g\324\35dn\263P\324.\302QAn\205\352\267\3640\370G\205L\222\17\242\327"..., 3145728) = 3145728
            ftruncate(4, 8037670) = 0
            close(4) = 0
            close(3) = 0

            Now, if you study this, you see that cp did a read/write of 4 MiB and then a read/write of 3 MiB, and then used ftruncate to set the size of the destination file to the st_size reported by the fstat(3, ...) call. Where did cp come up with 7 MiB as the amount to copy? From the st_blocks field in the fstat call. Apparently, the sles11sp2 cp has been "upgraded" to pay attention to st_blocks, rather than just do the copy.

            == quote off ==

            jaylan Jay Lan (Inactive) added a comment
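
            The sizes in that strace line up exactly with the reported allocation (arithmetic only, using the values from the trace above):

            st_blocks:  14336 blocks * 512 bytes = 7340032 bytes  (7 MiB allocated)
            copied:     4194304 + 3145728        = 7340032 bytes  (the two read/write pairs)
            st_size:    8037670 bytes, restored by ftruncate(4, 8037670), leaving the final
                        697638 bytes of the destination as an unwritten hole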

            I passed along Andreas' request to the tester.

            jaylan Jay Lan (Inactive) added a comment

            Continued efforts to reproduce the problem locally haven't panned out. I went to a bigger test file, went to 2.4.1 clients, went to launching from one sles11sp2 client onto another as described. All cases succeeded; no failures seen. I must be doing something significantly different, but I'm not sure what.

            bogl Bob Glossman (Inactive) added a comment

            Jay, could you please run strace as part of your reproducer and attach the output from a failed run, to see whether the cp is using FIEMAP, and what results it gets back. It may be that cp is not even trying to copy the data if it gets back a result that indicates the file is sparse or something.

            adilger Andreas Dilger added a comment
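
            One way to capture that trace with the existing reproducer is to wrap only the remote cp in strace (a sketch; the output path is arbitrary):

            ssh $host "cd $PWD && rm -rf zz && mkdir -p zz && strace -f -o /tmp/cp.strace cp hosts zz/"
            scp $host:/tmp/cp.strace .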

            Hi Bob,

            Here is more input from our admin, who came up with the reproducer. In the testing below he used adjacent sles11sp2 nodes as the local and remote nodes.

            == quote on ==
            I'm not surprised they could not reproduce.

            I used various hosts as local or remote targets. It always failed when the remote host was sles11sp2 and succeeded in other cases.

            I have one additional failure variant:

            $ ssh r401i0n0 "cd $PWD && ./test5.sh" r401i0n1
            61393 1670 hosts
            48836 1670 zz/hosts

            $ cmp -l hosts zz/hosts | head
            1048577 151 0
            1048578 61 0
            1048579 156 0
            1048580 66 0
            1048581 55 0
            1048582 151 0
            1048583 142 0
            1048584 60 0
            1048585 40 0
            1048586 162 0

            Here, the local and remote hosts are adjacent sles11sp2 nodes. Instead of the second copy of the file being completely empty, the missing blocks start after exactly 1 MiB. I tried tweaking the stripe sizes of the source and destination directories, but the result was the same.

            I then used a bigger file. The result is that all 1 MiB chunks but the last, partial one get copied okay. But, remember that if the source file is very small, it gets copied completely okay also.
            == quote off ==

            jaylan Jay Lan (Inactive) added a comment
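
            Two notes on that result: the first differing byte reported by cmp, 1048577, is exactly 1 MiB + 1 (1024 * 1024 = 1048576), so the copy goes bad immediately after the first 1 MiB of data; and the stripe layouts the admin was tweaking can be inspected or changed with the standard lfs commands (shown only as a sketch, with example values):

            lfs getstripe hosts                                   # layout of the source file
            lfs getstripe -d zz                                   # default layout of the destination directory
            lfs setstripe --stripe-size 4M --stripe-count 1 zz    # example: 4 MiB stripes, one stripe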

            People

              Assignee: bogl Bob Glossman (Inactive)
              Reporter: jaylan Jay Lan (Inactive)
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:
                Resolved: