
data corruption when copying a file to a new directory (sles11sp2 only)

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • None
    • Lustre 2.4.1
    • None
    • Environment:
      server: centos 2.1.5 server OR centos 2.4.1 server
      client: sles11sp2 2.4.1 client

      Source can be found at github.com/jlan/lustre-nas. The tag for the client is 2.4.1-1nasC.

      Source can be found at github.com/jlan/lustre-nas. The tag for the client is 2.4.1-1nasC.
    • 3
    • 12006

    Description

      Users reported a data corruption problem. We have a test script to reproduce the problem.

      When run in a Lustre file system with a sles11sp2 host as the remote host, the script fails (sum reports 00000). It works if the remote host is running sles11sp1 or CentOS.

      — cut here for test5.sh —
      #!/bin/sh

      # Remote host to use; defaults to endeavour2.
      host=${1:-endeavour2}
      # Start clean, then make a local copy of /etc/hosts.
      rm -fr zz hosts
      cp /etc/hosts hosts
      #fsync hosts
      # From the remote host, create zz/ in this directory and copy hosts into it.
      ssh $host "cd $PWD && mkdir -p zz && cp hosts zz/"
      # Compare checksums of the original and the remotely written copy.
      sum hosts zz/hosts
      — cut here —

      Good result:
      ./test5.sh r301i0n0
      61609 41 hosts
      61609 41 zz/hosts

      Bad result:
      ./test5.sh r401i0n2
      61609 41 hosts
      00000 41 zz/hosts

      Notes:

      • If the copied file is small enough (e.g., /etc/motd), the script succeeds.
      • If you uncomment the fsync, the script succeeds (a variant that forces writeback is sketched after these notes).
      • When it fails, stat reports no blocks have been allocated to the zz/hosts file:

      $ stat zz/hosts
      File: `zz/hosts'
      Size: 41820 Blocks: 0 IO Block: 2097152 regular file
      Device: 914ef3a8h/2437870504d Inode: 163153538715835056 Links: 1
      Access: (0644/-rw-r--r--) Uid: (10491/dtalcott) Gid: ( 1179/ cstaff)
      Access: 2013-12-12 09:24:46.000000000 -0800
      Modify: 2013-12-12 09:24:46.000000000 -0800
      Change: 2013-12-12 09:24:46.000000000 -0800

      • If you run in an NFS file system, the script usually succeeds, but sometimes sum reports a 'No such file' error for zz/hosts. After a few seconds, though, the file appears with the correct sum. (Typical NFS behavior.)
      • Acts the same on nbp7 and nbp8.
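
      For reference, a hedged variant of the reproducer that forces the dirty data out before the remote copy. This is not from the original report: it substitutes a plain sync(1), which flushes all dirty data system-wide, for the per-file fsync in the commented-out line, since a standalone fsync utility is not available on every distribution.

      — cut here for test5-sync.sh (illustrative only) —
      #!/bin/sh

      host=${1:-endeavour2}
      rm -fr zz hosts
      cp /etc/hosts hosts
      # Force local dirty pages out before the remote host copies the file.
      # A per-file fsync would be more targeted; sync flushes everything.
      sync
      ssh $host "cd $PWD && mkdir -p zz && cp hosts zz/"
      sum hosts zz/hosts
      — cut here —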

      Attachments

        1. LU4380.dbg.20121230.resend.tgz
          2.17 MB
        2. LU4380.dbg.20121230.tgz
          2.17 MB
        3. LU4380.dbg.20131224
          2.76 MB
        4. LU-4380-debug.patch
          0.5 kB

        Issue Links

          Activity

            [LU-4380] data corruption when copying a file to a new directory (sles11sp2 only)
            pjones Peter Jones added a comment -

            ok - thanks Jay!


            jaylan Jay Lan (Inactive) added a comment -

            We tested the 2.1.5 server with the LU-3219 patch and the problem went away.

            Since we are somehow no longer able to reproduce the problem with our 2.4.0 server (yes, LU-3219 was included in the 2.4.0 release), we can close this ticket. Thanks for your help!
            niu Niu Yawei (Inactive) added a comment -

            00000010:00000001:0.0:1389375898.914586:0:15540:0:(ost_handler.c:1261:ost_get_info()) Process leaving (rc=0 : 0 : 0)

            Mahmoud, it looks like your OST is running 2.1.5 without patch 58444c4e9bc58e192f0bc0c163a5d51d42ba4255 (LU-3219) applied, so data corruption is expected.

            mhanafi Mahmoud Hanafi added a comment -

            I have gathered clean debug logs from the local client, the remote client, and the OSS. The files are too large to attach here, so I have uploaded them to your ftp site 'ftp://ftp.whamcloud.com/uploads'.

            The filename is "LU_4380.debug.tgz".

            $ tar tzvf LU_4380.debug.tgz
            -rw-r--r-- root/root 215807901 2014-01-10 09:45 lu-4380.out.LOCALHOST
            -rw-r--r-- root/root   1198791 2014-01-10 09:45 lu-4380.out.OSS
            -rw-r--r-- root/root 135327548 2014-01-10 09:45 lu-4380.out.REMOTEHOST
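
            For context, a hedged sketch of how per-node dumps like these might be collected and bundled; the node names localclient, remoteclient and oss are placeholders rather than the actual NAS hostnames, and lctl dk simply dumps the kernel debug buffer to the named file.

            — cut here for gather-debug.sh (illustrative only) —
            #!/bin/sh
            # Dump the Lustre debug buffer on each node involved, pull the
            # dumps back here, and bundle them for upload.
            for node in localclient remoteclient oss; do
                ssh $node "lctl dk /tmp/lu-4380.out.$node"
                scp $node:/tmp/lu-4380.out.$node .
            done
            tar czf LU_4380.debug.tgz lu-4380.out.*
            — cut here —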

            niu Niu Yawei (Inactive) added a comment -

            Which of the above could have changed the outcome?

            All of those patches seem unrelated to this problem, and I don't see why upgrading the MDS would change the outcome (I think this problem involves only the client and the OST). Could you verify the client and OSS versions? Do they all have patch 58444c4e9bc58e192f0bc0c163a5d51d42ba4255 (LU-3219)?

            Also, do you expect it to work correctly when running a 2.4.1 client against a 2.1.5 server? I am still able to reproduce against the 2.1.5 server.

            Does the 2.1.5 server have patch 58444c4e9bc58e192f0bc0c163a5d51d42ba4255 applied?
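
            A minimal sketch of how one might answer both questions on a given node: which Lustre version it is running, and whether the LU-3219 commit is contained in the tag its packages were built from. The ~/lustre-nas path is an assumption; the repository and the 2.4.1-1nasC tag are taken from the environment notes above, so adjust both for the node being checked.

            — cut here for check-lu3219.sh (illustrative only) —
            #!/bin/sh
            # Print the Lustre version this node is running.
            lctl get_param -n version

            # Assuming the build tree (github.com/jlan/lustre-nas) is checked out here:
            cd ~/lustre-nas || exit 1
            if git merge-base --is-ancestor 58444c4e9bc58e192f0bc0c163a5d51d42ba4255 2.4.1-1nasC; then
                echo "LU-3219 is included in 2.4.1-1nasC"
            else
                echo "LU-3219 is NOT included in 2.4.1-1nasC"
            fi
            — cut here —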

            jaylan Jay Lan (Inactive) added a comment -

            I was wrong in saying that the reproducer can be run against a 2.4.1 CentOS server. It was actually a 2.4.0 server with patches. The branch was nas-2.4.0-1 and the tag was 2.4.0-3nasS.

            We recently updated the 2.4.0 MDS (for testing LU-4403). Well, I am not able to reproduce the problem any more. The patches I picked up were:
            LU-4179 mdt: skip open lock enqueue during resent
            LU-3992 libcfs: Fix NUMA emulated mode
            LU-4139 quota: improve write performance when over softlimit
            LU-4336 quota: improper assert in osc_quota_chkdq()
            LU-4403 mds: extra lock during resend lock lookup
            LU-4028 quota: improve lfs quota output

            Which of the above could have changed the outcome?

            Also, do you expect it to work correctly when running a 2.4.1 client against a 2.1.5 server? I am still able to reproduce against the 2.1.5 server.

            niu Niu Yawei (Inactive) added a comment -

            I had a problem that I was not able to stop debug_daemon until good data were flushed out of the debug file at the OST side. You need to tell me how to address that problem so that I can produce an OST log for you.

            You can try to execute 'lctl clear' on the OSS to clear the debug buffer before testing.
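
            For completeness, a rough sketch of that collection cycle as it might be run on the OSS around a single invocation of the reproducer; the debug mask and output path are illustrative and not taken from this ticket.

            — cut here for collect-oss-debug.sh (illustrative only) —
            #!/bin/sh
            # Run on the OSS: widen the debug mask, clear the buffer,
            # run the reproducer from the client, then dump the buffer.
            lctl set_param debug=-1
            lctl clear
            echo "Run ./test5.sh on the client now, then press Enter"
            read dummy
            lctl dk /tmp/lu-4380.oss.dbg
            — cut here —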

            People

              Assignee: bogl Bob Glossman (Inactive)
              Reporter: jaylan Jay Lan (Inactive)
              Votes: 0
              Watchers: 6
