
LU-4380: data corruption when copying a file to a new directory (sles11sp2 only)

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Affects Version: Lustre 2.4.1
    • Environment:
      server: CentOS 2.1.5 server OR CentOS 2.4.1 server
      client: sles11sp2 2.4.1 client

      Source can be found at github.com/jlan/lustre-nas. The tag for the client is 2.4.1-1nasC.

    Description

      Users reported a data corruption problem. We have a test script to reproduce the problem.

      When run in a Lustre file system with a sles11sp2 host as the remote host, the script fails (sum reports 00000). It works if the remote host is running sles11sp1 or CentOS.

      — cut here for test5.sh —
      #!/bin/sh

      host=${1:-endeavour2}
      rm -fr zz hosts
      cp /etc/hosts hosts
      #fsync hosts
      ssh $host "cd $PWD && mkdir -p zz && cp hosts zz/"
      sum hosts zz/hosts
      — cut here —

      Good result:
      ./test5.sh r301i0n0
      61609 41 hosts
      61609 41 zz/hosts

      Bad result:
      ./test5.sh r401i0n2
      61609 41 hosts
      00000 41 zz/hosts
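      To confirm that the bad copy has the right size but reads back as all zero bytes (consistent with the stat output under Notes below), a check like the following can be run on the reader host. This is a hedged suggestion for diagnosis only; these commands were not part of the original report:

      cmp hosts zz/hosts                # reports the first differing byte
      tr -d '\0' < zz/hosts | wc -c     # prints 0 if the copy contains only NUL bytes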

      Notes:

      • If the copied file is small enough (e.g., /etc/motd), the script succeeds.
      • If you uncomment the fsync, the script succeeds. (A portable stand-in for that fsync line is sketched after these notes.)
      • When it fails, stat reports no blocks have been allocated to the zz/hosts file:

      $ stat zz/hosts
      File: `zz/hosts'
      Size: 41820 Blocks: 0 IO Block: 2097152 regular file
      Device: 914ef3a8h/2437870504d Inode: 163153538715835056 Links: 1
      Access: (0644/-rw-r--r--) Uid: (10491/dtalcott) Gid: ( 1179/ cstaff)
      Access: 2013-12-12 09:24:46.000000000 -0800
      Modify: 2013-12-12 09:24:46.000000000 -0800
      Change: 2013-12-12 09:24:46.000000000 -0800

      • If you run in an NFS file system, the script usually succeeds, but sometimes reports a no such file error on the sum of zz/hosts. After a few seconds, though, the file appears, with the correct sum. (Typical NFS behavior.)
      • Acts the same on nbp7 and nbp8.
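      As noted above, uncommenting the fsync in test5.sh makes the script succeed. Not every system ships an fsync(1) utility, so a portable stand-in for that line is sketched below; this is a suggestion only and was not part of the original script. A plain "sync" would also work, at the cost of flushing everything on the client.

      — cut here for the fsync stand-in —
      # call fsync(2) on the hosts file before the remote cp
      python -c 'import os,sys; fd=os.open(sys.argv[1], os.O_RDONLY); os.fsync(fd); os.close(fd)' hosts
      — cut here —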

      Attachments

        1. LU4380.dbg.20121230.resend.tgz
          2.17 MB
        2. LU4380.dbg.20121230.tgz
          2.17 MB
        3. LU4380.dbg.20131224
          2.76 MB
        4. LU-4380-debug.patch
          0.5 kB


          Activity

            jaylan Jay Lan (Inactive) added a comment - - edited

            I tried to run the test on a Lustre filesystem that uses older hardware but sees much less activity. I set the debug file size to 2G.

            The problem was that "lctl debug_daemon stop" hung until the 2G ran out, so the debug file missed most of the test. The same thing happened when I specified 1G.


            jaylan Jay Lan (Inactive) added a comment -

            I tried to run the test again, with debugging on the OSS. The debug output did not contain the lctl marks; the 300M specified to debug_daemon was not big enough.

            jaylan Jay Lan (Inactive) added a comment -

            The tar.gz file contains:
            LU4380.dbg.rd.20131230
            LU4380.dbg.wr.20131230
            Hmm, I hope I did not get it wrong: the rd file is from the local host where the command was executed, and the wr file is from the remote host where the file was created.

            I did not include a debug trace for the OSS. The 'lfs getstripe zz/hosts' showed:
            zz/hosts
            lmm_magic: 0x0BD10BD0
            lmm_seq: 0x247f48d69
            lmm_object_id: 0x69e
            lmm_stripe_count: 1
            lmm_stripe_size: 1048576
            lmm_stripe_pattern: 1
            lmm_layout_gen: 0
            lmm_stripe_offset: 80
            obdidx objid objid group
            80 3494638 0x3552ee 0

            and there are 26 OSTs on that fs. So, does it fall on oss3, if it starts from oss1?

            Do I have to turn on +trace +dlmtrace +cache on the whole OSS?
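            To answer the "which OSS" question, a hedged way to map the obdidx reported by lfs getstripe to the OSS serving it is shown below, run from a client. The filesystem name nbp8 is only an example taken from this ticket; obdidx 80 is OST0050 in hex:

            lfs getstripe zz/hosts                               # reports obdidx 80
            lctl get_param osc.nbp8-OST0050-*.ost_conn_uuid      # NID of the OSS serving OST0050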



            niu Niu Yawei (Inactive) added a comment -

            Note that in our case, the writer is on one host and the reader is on a different one. Is this why FIEMAP_FLAG_SYNC has no effect: the _SYNC flag is on the reader host, but the cached data are on the writer host?

            Ah, I was thinking it was on the same client, where the fix for LU-3219 would force the writer to flush before the reader calls fiemap (Andreas & Bob mentioned it above).
            Then we may need logs from both clients and the OST. Could you rerun the test and collect logs on the two clients and the OSTs the object is striped on? Please enable D_TRACE, D_DLMTRACE and D_CACHE this time.
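            A minimal sketch of that collection on each node (both clients and the relevant OSS), following the same /proc and lctl usage shown elsewhere in this ticket; the output paths and the 1024 MB buffer size are placeholders, not values used here:

            echo +trace > /proc/sys/lnet/debug
            echo +dlmtrace > /proc/sys/lnet/debug
            echo +cache > /proc/sys/lnet/debug
            lctl debug_daemon start /tmp/lu4380-$(hostname).dbg 1024
            # ... rerun test5.sh from the writer client ...
            lctl debug_daemon stop
            lctl debug_file /tmp/lu4380-$(hostname).dbg /tmp/lu4380-$(hostname).log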

            jaylan Jay Lan (Inactive) added a comment - - edited

            I was asked to check with you guys whether having Lustre not implement the FIEMAP ioctl would be an acceptable quick workaround.

            Note that in our case, the writer is on one host and the reader is on a different one. Is this why FIEMAP_FLAG_SYNC has no effect: The _SYNC flag is on the reader host, but the cached data are on the writer host?
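            For what it's worth, one way to see what FIEMAP reports for the bad copy is filefrag, which uses the FIEMAP ioctl; -s requests FIEMAP_FLAG_SYNC. This is only a suggested check run on the reader host, not something done in this ticket:

            filefrag -v zz/hosts      # extents as reported by a plain FIEMAP call
            filefrag -sv zz/hosts     # same, but with FIEMAP_FLAG_SYNC set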


            Attached is the debug output Niu requested. I did not run the test with Niu's patch, though, since I need authorization to put a new binary onto the production system.


            niu Niu Yawei (Inactive) added a comment -

            It's better to have this patch applied when collecting debug logs.

            niu Niu Yawei (Inactive) added a comment -

            Jay, could you try to reproduce with the D_TRACE log enabled? Let's see from the Lustre log whether the sync flag is specified in the fiemap call (these steps are also wrapped into a single script sketch after the list):

            • echo +trace > /proc/sys/lnet/debug
            • lctl debug_daemon start $tmpfile 300
            • lctl mark "=== cp test ==="
            • cp test
            • lctl mark "=== cp test end ==="
            • lctl debug_daemon stop
            • lctl debug_file $tmpfile $logfile
            • attach the $logfile to this ticket.
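            The same steps wrapped into one script, as a sketch only; the temp and log paths are placeholders, and "cp test" is taken to mean rerunning test5.sh from the Description:

            tmpfile=/tmp/lu4380.dbg
            logfile=/tmp/lu4380.log
            echo +trace > /proc/sys/lnet/debug
            lctl debug_daemon start $tmpfile 300
            lctl mark "=== cp test ==="
            ./test5.sh r401i0n2          # the failing reproducer
            lctl mark "=== cp test end ==="
            lctl debug_daemon stop
            lctl debug_file $tmpfile $logfile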

            jaylan Jay Lan (Inactive) added a comment -

            Niu, LU-2580 referred to fixes for LU-2267 and LU-2286. We have both patches in our 2.4.1 branch.

            bogl Bob Glossman (Inactive) added a comment -

            Tried backing down to the -6.23.1 coreutils version. Still couldn't make the problem happen. The cp binary looks identical between the two versions anyway; I checked.
            Package diffs must be elsewhere.

            bogl Bob Glossman (Inactive) added a comment -

            I have coreutils-8.12-6.25.29.1 on sles11sp2.

            People

              Assignee: bogl Bob Glossman (Inactive)
              Reporter: jaylan Jay Lan (Inactive)
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:
                Resolved: