LU-4380: data corruption when copying a file to a new directory (sles11sp2 only)

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • None
    • Affects Version/s: Lustre 2.4.1
    • None
    • Environment:
      server: centos 2.1.5 server OR centos 2.4.1 server
      client: sles11sp2 2.4.1 client

      Source can be found at github.com/jlan/lustre-nas. The tag for the client is 2.4.1-1nasC.
    • Severity: 3
    • 12006

    Description

      Users reported a data corruption problem. We have a test script to reproduce the problem.

      When run in a Lustre file system with a sles11sp2 host as the remote host, the script fails (sum reports 00000). It works if the remote host is running sles11sp1 or CentOS.

      — cut here for test5.sh —
      #!/bin/sh

      host=${1:-endeavour2}
      rm -fr zz hosts
      cp /etc/hosts hosts
      #fsync hosts
      ssh $host "cd $PWD && mkdir -p zz && cp hosts zz/"
      sum hosts zz/hosts
      — cut here —

      Good result:
      ./test5.sh r301i0n0
      61609 41 hosts
      61609 41 zz/hosts

      Bad result:
      ./test5.sh r401i0n2
      61609 41 hosts
      00000 41 zz/hosts

      Notes:

      • If the copied file is small enough (e.g., /etc/motd), the script succeeds.
      • If you uncomment the fsync, the script succeeds. (A minimal sketch of such an fsync helper appears after these notes.)
      • When it fails, stat reports no blocks have been allocated to the zz/hosts file:

      $ stat zz/hosts
      File: `zz/hosts'
      Size: 41820 Blocks: 0 IO Block: 2097152 regular file
      Device: 914ef3a8h/2437870504d Inode: 163153538715835056 Links: 1
      Access: (0644/-rw-r--r--) Uid: (10491/dtalcott) Gid: ( 1179/ cstaff)
      Access: 2013-12-12 09:24:46.000000000 -0800
      Modify: 2013-12-12 09:24:46.000000000 -0800
      Change: 2013-12-12 09:24:46.000000000 -0800

      • If you run in an NFS file system, the script usually succeeds, but sometimes reports a no such file error on the sum of zz/hosts. After a few seconds, though, the file appears, with the correct sum. (Typical NFS behavior.)
      • Acts the same on nbp7 and nbp8.
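
      The fsync invoked in test5.sh is not a standard utility, so presumably it is a small site-local helper. A minimal stand-in, assuming it simply opens each named file and calls fsync(2) to force the client's dirty pages out (which also brings st_blocks up to date), could look like this:

      /* fsync.c - minimal stand-in for the "fsync hosts" helper in test5.sh.
       * Assumption: the real helper just opens each argument and fsyncs it. */
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>

      int main(int argc, char **argv)
      {
              for (int i = 1; i < argc; i++) {
                      int fd = open(argv[i], O_RDONLY);  /* fsync(2) is allowed on a read-only fd on Linux */
                      if (fd < 0 || fsync(fd) < 0) {
                              perror(argv[i]);
                              return 1;
                      }
                      close(fd);
              }
              return 0;
      }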

      Attachments

        1. LU4380.dbg.20121230.resend.tgz
          2.17 MB
        2. LU4380.dbg.20121230.tgz
          2.17 MB
        3. LU4380.dbg.20131224
          2.76 MB
        4. LU-4380-debug.patch
          0.5 kB

          Activity

            [LU-4380] data corruption when copying a file to a new directory (sles11sp2 only)

            bogl Bob Glossman (Inactive) added a comment:

            I have coreutils-8.12-6.25.29.1 on sles11sp2.

            niu Niu Yawei (Inactive) added a comment:

            I checked the source code of coreutils-8.12 from gnu.org; it looks like FIEMAP_FLAG_SYNC is always set when reading extents, so I'm not sure if there is any difference with the copy on sles11sp2. (I'm not sure where to get the source code of coreutils for sles11sp2.)

            Bob, I guess your coreutils version isn't the same as Jay's, which is why you can't reproduce the problem. Could you try coreutils-8.12-6.23.1?

            niu Niu Yawei (Inactive) added a comment:

            This looks like the same problem as LU-2580.

            Some data of the source file 'hosts' is still cached on the client and not yet flushed back to the OST, so the st_blocks reported by stat is less than the actual file size. 'cp' then thinks it is a sparse file and tries to copy only the extents returned by the fiemap ioctl.

            So what we need to figure out is whether the 'cp' in sles11sp2 calls fiemap with the FIEMAP_FLAG_SYNC flag to make sure all cached data is flushed back before getting the extents.
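
            For reference, FIEMAP_FLAG_SYNC is the flag that asks the filesystem to flush dirty data before reporting extents. A minimal sketch of an FS_IOC_FIEMAP request with the flag set (just an illustration of the flag in question, not the coreutils code) looks like this:

            #include <string.h>
            #include <sys/ioctl.h>
            #include <linux/fs.h>      /* FS_IOC_FIEMAP */
            #include <linux/fiemap.h>  /* struct fiemap, FIEMAP_FLAG_SYNC, FIEMAP_MAX_OFFSET */

            /* Ask the filesystem for the extent map of an open file.  FIEMAP_FLAG_SYNC
             * requests that dirty cached data be flushed first, so the returned extents
             * are up to date.  The caller must allocate fm with room for max_extents
             * trailing struct fiemap_extent entries. */
            static int get_extents_synced(int fd, struct fiemap *fm, unsigned int max_extents)
            {
                    memset(fm, 0, sizeof(*fm));
                    fm->fm_start = 0;
                    fm->fm_length = FIEMAP_MAX_OFFSET;   /* map the whole file */
                    fm->fm_flags = FIEMAP_FLAG_SYNC;     /* flush before mapping */
                    fm->fm_extent_count = max_extents;
                    return ioctl(fd, FS_IOC_FIEMAP, fm);
            }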

            jaylan Jay Lan (Inactive) added a comment:

            Hi Bob,

            The '/bin/cp' command is packaged in coreutils on sles11sp2.
            My version is coreutils-8.12-6.23.1. What version is yours?

            jaylan Jay Lan (Inactive) added a comment:

            Here is the second part of Dale's reply in response to Andreas' strace request. I did not include it in the first attempt. He actually did try with --sparse=never.

            == quote on ==
            So, there are two bugs here. First, Lustre did not update st_blocks for the source file soon enough. Second, sles11sp2's cp is too "smart" for its own good.

            FWIW:

            • I used the sles11sp1 version of cp under sles11sp2 and it produced a correct copy, in spite of the bad st_blocks value.
            • I tried adding the --sparse=never option to cp to see if I could get it to ignore st_blocks. That made it even stupider: it copied the 7 MiB as before, then explicitly filled the rest of st_size with zeros.
            == quote off ==
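
            The "pays attention to st_blocks" behaviour described above comes down to a heuristic along these lines (an illustrative sketch, not the actual coreutils source): if the allocated blocks cover less than the apparent size, assume the file is sparse and copy only the mapped extents.

            #include <stdbool.h>
            #include <sys/stat.h>

            /* Illustrative sparse-file heuristic: a file "looks sparse" when its
             * allocated 512-byte blocks cover less than its apparent size.  With a
             * stale st_blocks (dirty data still cached on the Lustre client), a fully
             * dense file can be misclassified, and only its mapped extents get copied. */
            static bool looks_sparse(const struct stat *st)
            {
                    return (unsigned long long)st->st_blocks * 512 <
                           (unsigned long long)st->st_size;
            }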

            bogl Bob Glossman (Inactive) added a comment:

            Just to collect some additional data, could you please add the --sparse=never option to your cp commands, see if that avoids the failures, and again get straces of the cp.

            bogl Bob Glossman (Inactive) added a comment:

            All the sles versions I'm testing with are newer than that. I have some 3.0.93-0.5, the most recent version of sles11sp2 we build and ship on, and some 3.0.101-05, the most recent kernel update version for sles11sp2.

            bogl Bob Glossman (Inactive) added a comment:

            I believe the ioctl(3, 0xc020660b, 0x7fffffffd390) shown in the strace output is a FS_IOC_FIEMAP ioctl. I suspect that is where cp is getting the sizes to read/write. It is interesting that it matches the file allocation size of 512-byte blocks reported in st_blocks by the stat call and is smaller than the st_size reported in the stat call.
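
            That identification can be confirmed by expanding the ioctl encoding: FS_IOC_FIEMAP is defined in linux/fs.h as _IOWR('f', 11, struct fiemap), which evaluates to exactly the 0xc020660b seen in the trace:

            #include <stdio.h>
            #include <linux/fiemap.h>  /* struct fiemap (needed for the sizeof in the macro) */
            #include <linux/fs.h>      /* FS_IOC_FIEMAP = _IOWR('f', 11, struct fiemap) */

            int main(void)
            {
                    /* Prints "FS_IOC_FIEMAP = 0xc020660b" */
                    printf("FS_IOC_FIEMAP = %#lx\n", (unsigned long)FS_IOC_FIEMAP);
                    return 0;
            }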

            jaylan Jay Lan (Inactive) added a comment:

            The sles11sp2 version we are running in production is 3.0.74-0.6.6.2.

            bogl Bob Glossman (Inactive) added a comment:

            What is the specific kernel version of your sles11sp2? I ask because over time there have been several. I'm wondering whether different VFS-level changes in the version you are using could explain why I'm not seeing the problem reproduce.

            jaylan Jay Lan (Inactive) added a comment:

            Data from our admin:

            == quote on ==
            This gets very interesting. Here is the good stuff from the strace from the cp that happens on the remote host:

            stat("zz/",

            {st_dev=makedev(3827, 595112), st_ino=163703357997847957, st_mode=S_ IFDIR|0755, st_nlink=2, st_uid=10491, st_gid=1179, st_blksize=4096, st_blocks=8, st_size=4096, st_atime=2013/12/18-11:30:20, st_mtime=2013/12/18-11:30:20, st_ctime=2013/12/18-11:30:20}

            ) = 0
            stat("hosts",

            {st_dev=makedev(3827, 595112), st_ino=163571199807331126, st_mode=S_IFREG|0644, st_nlink=1, st_uid=10491, st_gid=1179, st_blksize=4194304, st_blocks=14336, st_size=8037670, st_atime=2013/12/18-11:30:20, st_mtime=2013/12/18-11:30:20, st_ctime=2013/12/18-11:30:20}

            ) = 0
            stat("zz/hosts", 0x7fffffffe6c0) = -1 ENOENT (No such file or directory)
            open("hosts", O_RDONLY) = 3
            fstat(3,

            {st_dev=makedev(3827, 595112), st_ino=163571199807331126, st_mode=S_IFREG|0644, st_nlink=1, st_uid=10491, st_gid=1179, st_blksize=4194304, st_blocks=14336, st_size=8037670, st_atime=2013/12/18-11:30:20, st_mtime=2013/12/18-11:30:20, st_ctime=2013/12/18-11:30:20}

            ) = 0
            open("zz/hosts", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
            fstat(4,

            {st_dev=makedev(3827, 595112), st_ino=163703357997847959, st_mode=S_IFREG|0644, st_nlink=1, st_uid=10491, st_gid=1179, st_blksize=4194304, st_blocks=0, st_size=0, st_atime=2013/12/18-11:30:20, st_mtime=2013/12/18-11:30:20, st_ctime=2013/12/18-11:30:20}

            ) = 0
            mmap(NULL, 4202496, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fffec518000
            ioctl(3, 0xc020660b, 0x7fffffffd390) = 0
            read(3, "\37\213\10\0\373\353\202R\2\3\324<\375W\333\306\262\375\25\375\25\33!\32p\20\376H \201\3249"..., 4194304) = 4194304
            write(4, "\37\213\10\0\373\353\202R\2\3\324<\375W\333\306\262\375\25\375\25\33!\32p\20\376H \201\3249"..., 4194304) = 4194304
            read(3, "r\342B\316~\206g\324\35dn\263P\324.\302QAn\205\352\267\3640\370G\205L\222\17\242\327"..., 3145728) = 3145728
            write(4, "r\342B\316~\206g\324\35dn\263P\324.\302QAn\205\352\267\3640\370G\205L\222\17\242\327"..., 3145728) = 3145728
            ftruncate(4, 8037670) = 0
            close(4) = 0
            close(3) = 0

            Now, if you study this, you see that cp did a read/write of 4 MiB and then a read/write of 3 MiB, and then used ftruncate to set the size of the destination file to the st_size reported by the fstat(3, ...) call. Where did cp come up with 7 MiB as the amount to copy? From the st_blocks field in the fstat call: 14336 blocks × 512 bytes = 7,340,032 bytes = 7 MiB, while st_size is 8,037,670 bytes. Apparently, the sles11sp2 cp has been "upgraded" to pay attention to st_blocks, rather than just do the copy.

            == quote off ==


            People

              Assignee: bogl Bob Glossman (Inactive)
              Reporter: jaylan Jay Lan (Inactive)
              Votes: 0
              Watchers: 6
