
LU-4380: data corruption when copying a file to a new directory (sles11sp2 only)

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • None
    • Affects Version/s: Lustre 2.4.1
    • None
    • Environment:
      server: centos 2.1.5 server OR centos 2.4.1 server
      client: sles11sp2 2.4.1 client

      Source can be found at github.com/jlan/lustre-nas. The tag for the client is 2.4.1-1nasC.
    • Severity: 3
    • 12006

    Description

      Users reported a data corruption problem. We have a test script to reproduce the problem.

      When run in a Lustre file system with a sles11sp2 host as the remote host, the script fails (sum reports 00000). It works if the remote host is running sles11sp1 or CentOS.

      — cut here for test5.sh —
      #!/bin/sh

      host=${1:-endeavour2}
      rm -fr zz hosts
      cp /etc/hosts hosts
      #fsync hosts
      ssh $host "cd $PWD && mkdir -p zz && cp hosts zz/"
      sum hosts zz/hosts
      — cut here —

      Good result:
      ./test5.sh r301i0n0
      61609 41 hosts
      61609 41 zz/hosts

      Bad result:
      ./test5.sh r401i0n2
      61609 41 hosts
      00000 41 zz/hosts

      Notes:

      • If the copied file is small enough (e.g., /etc/motd), the script succeeds.
      • If you uncomment the fsync, the script succeeds. (A related check is sketched after these notes.)
      • When it fails, stat reports no blocks have been allocated to the zz/hosts file:

      $ stat zz/hosts
      File: `zz/hosts'
      Size: 41820 Blocks: 0 IO Block: 2097152 regular file
      Device: 914ef3a8h/2437870504d Inode: 163153538715835056 Links: 1
      Access: (0644/-rw-r--r--) Uid: (10491/dtalcott) Gid: ( 1179/ cstaff)
      Access: 2013-12-12 09:24:46.000000000 -0800
      Modify: 2013-12-12 09:24:46.000000000 -0800
      Change: 2013-12-12 09:24:46.000000000 -0800

      • If you run in an NFS file system, the script usually succeeds, but it sometimes reports a "no such file" error on the sum of zz/hosts. After a few seconds, though, the file appears with the correct sum. (Typical NFS behavior.)
      • Acts the same on nbp7 and nbp8.
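
      The fsync and stat notes above suggest the remote host simply does not yet see the locally cached data as allocated blocks. A minimal sketch of one way to check that from the remote side (hypothetical, not part of the original report; it reuses the host convention from test5.sh and assumes GNU stat on the remote host):

      #!/bin/sh
      # Compare the remote host's view of the source file's allocated blocks
      # before and after the locally cached data is forced to disk.
      host=${1:-endeavour2}
      rm -f hosts
      cp /etc/hosts hosts
      ssh $host "cd $PWD && stat -c 'before flush: %b blocks' hosts"
      sync   # plays a role similar to the commented-out fsync in test5.sh
      ssh $host "cd $PWD && stat -c 'after flush: %b blocks' hosts"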

      Attachments

        1. LU4380.dbg.20121230.resend.tgz
          2.17 MB
        2. LU4380.dbg.20121230.tgz
          2.17 MB
        3. LU4380.dbg.20131224
          2.76 MB
        4. LU-4380-debug.patch
          0.5 kB

        Issue Links

          Activity

            [LU-4380] data corruption when copying a file to a new directory (sles11sp2 only)
            pjones Peter Jones added a comment -

            ok - thanks Jay!


            jaylan Jay Lan (Inactive) added a comment -

            We tested the 2.1.5 server with the LU-3219 patch and the problem went away.

            Since we are somehow no longer able to reproduce the problem with our 2.4.0 server (yes, LU-3219 was included in the 2.4.0 release), we can close this ticket. Thanks for your help!

            niu Niu Yawei (Inactive) added a comment -

            00000010:00000001:0.0:1389375898.914586:0:15540:0:(ost_handler.c:1261:ost_get_info()) Process leaving (rc=0 : 0 : 0)

            Mahmoud, it looks like your OST is running 2.1.5 and does not have patch 58444c4e9bc58e192f0bc0c163a5d51d42ba4255 (LU-3219) applied, so data corruption is expected.
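
            A minimal sketch (hypothetical, for reference; it assumes the OSS packages were built from a checkout of the github.com/jlan/lustre-nas tree named in the Environment field) of how one could check whether that commit is present in a given server build:

            # Run in the source tree the server was built from; reports whether the
            # LU-3219 commit is an ancestor of the checked-out build revision.
            cd lustre-nas
            if git merge-base --is-ancestor 58444c4e9bc58e192f0bc0c163a5d51d42ba4255 HEAD; then
                echo "LU-3219 patch present"
            else
                echo "LU-3219 patch missing"
            fi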

            mhanafi Mahmoud Hanafi added a comment -

            I have gathered clean debug logs from the local client, remote client, and OSS. The files are too large to attach here, so I have uploaded them to your FTP site 'ftp://ftp.whamcloud.com/uploads'.

            The filename is "LU_4380.debug.tgz".

            $ tar tzvf LU_4380.debug.tgz
            -rw-r--r-- root/root 215807901 2014-01-10 09:45 lu-4380.out.LOCALHOST
            -rw-r--r-- root/root   1198791 2014-01-10 09:45 lu-4380.out.OSS
            -rw-r--r-- root/root 135327548 2014-01-10 09:45 lu-4380.out.REMOTEHOST

            niu Niu Yawei (Inactive) added a comment -

            > Which of the above could have changed the outcome?

            None of the patches seems related to this problem, and I don't see why upgrading the MDS would change the outcome (I think this is a problem involving only the client and OST). Could you verify the client and OSS versions? Do they all have patch 58444c4e9bc58e192f0bc0c163a5d51d42ba4255 (LU-3219) applied?

            > Also, do you expect it to work correctly when running 2.4.1 client against 2.1.5 server? I am still able to reproduce against 2.1.5 server.

            Does the 2.1.5 server have the patch 58444c4e9bc58e192f0bc0c163a5d51d42ba4255 applied?
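
            For reference, a minimal sketch of the kind of check being asked for here (hypothetical; it assumes lctl is in PATH and an RPM-based install on each node):

            # On each client and on the OSS:
            lctl get_param -n version    # running Lustre version
            rpm -qa | grep -i lustre     # installed Lustre packages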

            jaylan Jay Lan (Inactive) added a comment -

            I was wrong in saying that the reproducer can be run against a 2.4.1 CentOS server. It was actually a 2.4.0 server with patches. The branch was nas-2.4.0-1 and the tag was 2.4.0-3nasS.

            We recently updated the 2.4.0 MDS (for testing LU-4403). Well, I am not able to reproduce the problem any more. The patches I picked up were:
            LU-4179 mdt: skip open lock enqueue during resent
            LU-3992 libcfs: Fix NUMA emulated mode
            LU-4139 quota: improve write performance when over softlimit
            LU-4336 quota: improper assert in osc_quota_chkdq()
            LU-4403 mds: extra lock during resend lock lookup
            LU-4028 quota: improve lfs quota output

            Which of the above could have changed the outcome?

            Also, do you expect it to work correctly when running a 2.4.1 client against a 2.1.5 server? I am still able to reproduce against the 2.1.5 server.

            niu Niu Yawei (Inactive) added a comment -

            > I had a problem that I was not able to stop debug_daemon until good data were flushed out of the debug file at the OST side. You need to tell me how to address that problem so that I can produce an OST log for you.

            You can try executing 'lctl clear' on the OSS to clear the debug buffer before testing.
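
            For reference, a minimal sketch (hypothetical file names and buffer size) of a full collection sequence on the OSS around one run of the reproducer:

            # On the OSS:
            lctl clear                                          # drop whatever is already buffered
            lctl debug_daemon start /tmp/lu4380-oss.dbg 1024    # dump the debug buffer to a file, up to 1024 MB
            # ... run ./test5.sh from the client ...
            lctl debug_daemon stop
            lctl debug_file /tmp/lu4380-oss.dbg /tmp/lu4380-oss.txt   # convert the binary dump to text
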
            jaylan Jay Lan (Inactive) added a comment - - edited

            Please discard LU4380.dbg.20121230.tgz. The two files contained in that tarball had confusing names (besides, the date should be 2013). Here is the new tarball: LU4380.dbg.20131230.resend.tgz. It contains the same two files with new names:
            LU4380.dbg.local.20131230
            LU4380.dbg.remote.20131230

            The fiemap actually happened on the remote client, the one that actually did the file creation and content copying.

            I had a problem that I was not able to stop debug_daemon until good data were flushed out of the debug file at the OST side. You need to tell me how to address that problem so that I can produce an OST log for you.

            niu Niu Yawei (Inactive) added a comment -

            Thank you Jay.

            I guess wr is the client which executes the test script (the one that cp's the hosts file from /etc/hosts)? I ask because I see the blocking AST in the wr log:

            00010000:00000001:28.0F:1388433400.002345:0:95598:0:(ldlm_lockd.c:1694:ldlm_handle_bl_callback()) Process entered
            00000100:00000001:22.0:1388433400.002345:0:4805:0:(events.c:407:reply_out_callback()) Process leaving
            00000100:00000001:29.0:1388433400.002346:0:54104:0:(service.c:1571:ptlrpc_server_hpreq_fini()) Process entered
            00000100:00000001:29.0:1388433400.002346:0:54104:0:(service.c:1582:ptlrpc_server_hpreq_fini()) Process leaving
            00000100:00000001:29.0:1388433400.002347:0:54104:0:(service.c:2078:ptlrpc_server_handle_request()) Process leaving (rc=1 : 1 : 1)
            00000400:00000001:29.0:1388433400.002348:0:54104:0:(watchdog.c:448:lc_watchdog_disable()) Process entered
            00010000:00010000:28.0:1388433400.002348:0:95598:0:(ldlm_lockd.c:1696:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: nbp8-OST0050-osc-ffff8807e657ec00 lock: ffff880aa9efb480/0x11c2317b4b200a63 lrc: 3/0,0 mode: PW/PW res: [0x3552ee:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 40960->18446744073709551615) flags: 0x420000000000 nid: local remote: 0x42e20173aa80c345 expref: -99 pid: 63286 timeout: 0 lvb_type: 1
            00000400:00000001:29.0:1388433400.002349:0:54104:0:(watchdog.c:456:lc_watchdog_disable()) Process leaving
            00010000:00010000:28.0:1388433400.002354:0:95598:0:(ldlm_lockd.c:1709:ldlm_handle_bl_callback()) Lock ffff880aa9efb480 already unused, calling callback (ffffffffa08f79e0)
            00000020:00000001:28.0:1388433400.002372:0:95598:0:(cl_lock.c:357:cl_lock_get_trust()) acquiring trusted reference: 0 ffff88089dfea238 18446744072108337004
            00000020:00000001:28.0:1388433400.002374:0:95598:0:(cl_lock.c:150:cl_lock_trace0()) got mutex: ffff88089dfea238@(1 ffff880f181f8340 1 5 0 0 0 0)(ffff880404adca70/1/1) at cl_lock_mutex_tail():668
            00000020:00000001:28.0:1388433400.002377:0:95598:0:(cl_lock.c:1839:cl_lock_cancel()) Process entered
            00000020:00010000:28.0:1388433400.002378:0:95598:0:(cl_lock.c:150:cl_lock_trace0()) cancel lock: ffff88089dfea238@(1 ffff880f181f8340 1 5 0 0 0 0)(ffff880404adca70/1/1) at cl_lock_cancel():1840
            00000020:00000001:28.0:1388433400.002381:0:95598:0:(cl_lock.c:804:cl_lock_cancel0()) Process entered
            00000008:00000001:28.0:1388433400.002382:0:95598:0:(osc_lock.c:1305:osc_lock_flush()) Process entered
            00000008:00000001:28.0:1388433400.002383:0:95598:0:(osc_cache.c:2827:osc_cache_writeback_range()) Process entered
            00000008:00000001:28.0:1388433400.002386:0:95598:0:(osc_cache.c:2770:osc_cache_wait_range()) Process entered
            00000008:00000020:28.0:1388433400.002387:0:95598:0:(osc_cache.c:2807:osc_cache_wait_range()) obj ffff88105cc78408 ready 0|-|- wr 0|-|- rd 0|- sync file range.
            00000008:00000001:28.0:1388433400.002388:0:95598:0:(osc_cache.c:2808:osc_cache_wait_range()) Process leaving (rc=0 : 0 : 0)
            00000008:00000020:28.0:1388433400.002389:0:95598:0:(osc_cache.c:2923:osc_cache_writeback_range()) obj ffff88105cc78408 ready 0|-|- wr 0|-|- rd 0|- pageout [0, 18446744073709551615], 0.
            

            I think obj ffff88105cc78408 should be the source hosts file on Lustre, and when the remote client tries to cp it into the zz directory, a blocking AST should be sent to the local client.

            The interesting thing is that I didn't see fiemap calls on the remote client (the 'rd' client); maybe it did the copy by a normal read. Anyway, I didn't see anything wrong in the log. Did the test succeed or not?

            Since the remote client didn't call fiemap, we don't need the OST log for now, thank you.
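
            For reference, a minimal sketch of the kind of scan referred to here (hypothetical; it assumes a case-insensitive grep on the decoded dumps from the resent tarball above is enough, since fiemap handling shows up as function names containing "fiemap" in a full debug trace):

            grep -ic fiemap LU4380.dbg.remote.20131230    # count of matching lines, 0 if none
            grep -i  fiemap LU4380.dbg.local.20131230 | head
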
            jaylan Jay Lan (Inactive) added a comment - - edited

            I tried to run the test on a Lustre filesystem that uses older hardware but has much less activity. I set the debug file size to 2G.

            The problem was that "lctl debug_daemon stop" hung until the 2G ran out, so the debug file missed most of the test. The same thing happened when I specified 1G.
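
            One hedged possibility for keeping the window of interest in the dump (the particular debug flags and buffer size below are only an illustration, not something recommended in this ticket) is to cut the trace volume before starting debug_daemon:

            # On the node being traced, before 'lctl debug_daemon start':
            lctl set_param debug="dlmtrace rpctrace vfstrace cache"   # restrict debug flags to the areas of interest
            lctl set_param debug_mb=256                               # shrink the in-memory debug buffer
            lctl clear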

            People

              bogl Bob Glossman (Inactive)
              jaylan Jay Lan (Inactive)
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:
                Resolved: