[LU-4380] data corruption when copying a file to a new directory (sles11sp2 only) Created: 12/Dec/13 Updated: 13/Feb/14 Resolved: 14/Jan/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Jay Lan (Inactive) | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
server: CentOS 2.1.5 server OR CentOS 2.4.1 server. Source can be found at github.com/jlan/lustre-nas. The tag for the client is 2.4.1-1nasC. |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 12006 | ||||||||
| Description |
|
Users reported a data corruption problem. We have a test script to reproduce the problem. When run in a Lustre file system with a sles11sp2 host as the remote host, the script fails (sum reports 00000). It works if the remote host is running sles11sp1 or CentOS.

— cut here for test5.sh —
host=${1:-endeavour2}

Good result:
Bad result:
Notes:
$ stat zz/hosts
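(The body of test5.sh, the good/bad results, and the notes did not survive in this export. The following is only a hedged sketch of the kind of reproducer described above; the default hostname and paths are assumptions.)

#!/bin/bash
# Hypothetical reconstruction, not the original test5.sh.
host=${1:-endeavour2}                         # remote Lustre client that performs the copy
cp /etc/hosts hosts                           # stage the source file in the current Lustre directory
rm -rf zz && mkdir zz                         # fresh destination directory
ssh "$host" "cd $PWD && cp hosts zz/hosts"    # copy on the remote host into the new directory
sum hosts zz/hosts                            # a bad run reports 00000 for zz/hosts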
|
| Comments |
| Comment by Jay Lan (Inactive) [ 12/Dec/13 ] |
|
More data points. These clients worked correctly: |
| Comment by Peter Jones [ 12/Dec/13 ] |
|
Bob, could you please try and reproduce this issue? Thanks, Peter |
| Comment by Andreas Dilger [ 14/Dec/13 ] |
|
Is the "cp" on SLES11SP2 using the FIEMAP ioctl to determine if the file has data in it? This sounds like an old bug ( |
| Comment by Bob Glossman (Inactive) [ 16/Dec/13 ] |
|
Looking at the tagged version source 2.4.1-1nasC from the github.com/jlan/lustre-nas repo mentioned, it appears the fix for

So far I haven't been able to reproduce this in a small test environment. I suspect the /etc/hosts file I have isn't nearly big enough to show the problem; I think I will need a bigger test file. It seems to me a text file like /etc/hosts should be full of contiguous text and wouldn't trigger FIEMAP problems related to sparse files. I wouldn't expect such a file to be sparse. |
| Comment by Bob Glossman (Inactive) [ 16/Dec/13 ] |
|
Went to a 2.4.1 client; still can't reproduce the problem. You say the remote client, the one your script ssh's to, is sles11sp2, but you never mention what version of Lustre or kernel the local client (the one your script starts on and does the local cp's) is running. |
| Comment by Jay Lan (Inactive) [ 17/Dec/13 ] |
|
Hi Bob, here is more input from our admin who came up with the reproducer.

== quote on ==
I used various hosts as local or remote targets. It always failed when the remote host was sles11sp2 and succeeded in other cases. I have one additional failure variant:

$ ssh r401i0n0 "cd $PWD && ./test5.sh" r401i0n1
$ cmp -l hosts zz/hosts | head

Here, the local and remote hosts are adjacent sles11sp2 nodes. Instead of the second copy of the file being completely empty, the missing blocks start after exactly 1 MiB. I tried tweaking the stripe sizes of the source and destination directories, but the result was the same. I then used a bigger file. The result is that all 1 MiB chunks but the last, partial one get copied okay. But remember that if the source file is very small, it gets copied completely okay also. |
| Comment by Andreas Dilger [ 18/Dec/13 ] |
|
Jay, could you please run strace as part of your reproducer and attach the output from a failed run, to see whether the cp is using FIEMAP, and what results it gets back. It may be that cp is not even trying to copy the data if it gets back a result that indicates the file is sparse or something. |
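(A possible way to capture this, assuming the reproducer layout above; the directory is a placeholder for the Lustre test directory.)

$ cd /path/to/lustre/testdir
$ strace -f -o cp.strace -e trace=ioctl,read,write,ftruncate cp hosts zz/hosts
$ grep -E 'ioctl|ftruncate' cp.strace    # look for FS_IOC_FIEMAP (0xc020660b) and the sizes copied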
| Comment by Bob Glossman (Inactive) [ 18/Dec/13 ] |
|
Continued efforts to reproduce the problem locally haven't panned out. I went to a bigger test file, went to 2.4.1 clients, and went to launching from one sles11sp2 client onto another as described. All cases succeeded; no failures were seen. I must be doing something significantly different, but I'm not sure what. |
| Comment by Jay Lan (Inactive) [ 18/Dec/13 ] |
|
I passed along Andreas' request to the tester. |
| Comment by Jay Lan (Inactive) [ 18/Dec/13 ] |
|
Data from our admin:

== quote on ==
stat("zz/", {st_dev=makedev(3827, 595112), st_ino=163703357997847957, st_mode=S_IFDIR|0755, st_nlink=2, st_uid=10491, st_gid=1179, st_blksize=4096, st_blocks=8, st_size=4096, st_atime=2013/12/18-11:30:20, st_mtime=2013/12/18-11:30:20, st_ctime=2013/12/18-11:30:20}) = 0
) = 0
) = 0
) = 0

Now, if you study this, you see that cp did a read/write of 4 MiB and then a read/write of 3 MiB, and then uses ftruncate to set the size of the destination file to the st_size reported by the fstat(3, ...) call. Where did cp come up with 7 MiB as the amount to copy? From the st_blocks field in the fstat call. Apparently, the sles11sp2 cp has been "upgraded" to pay attention to st_blocks, rather than just do the copy.
== quote off == |
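(An illustrative check, not from the original report: on Lustre, st_blocks can lag st_size while dirty data is still cached on the writing client, which is the mismatch cp appears to be reacting to.)

$ stat -c 'st_size=%s bytes  st_blocks=%b (512-byte blocks)' hosts   # on the writing client
$ sync                                                               # flush cached dirty pages to the OSTs
$ stat -c 'st_size=%s bytes  st_blocks=%b (512-byte blocks)' hosts   # st_blocks should now cover the file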
| Comment by Bob Glossman (Inactive) [ 18/Dec/13 ] |
|
What is the specific kernel version of your sles11sp2? I ask because over time there have been several. I'm wondering whether different VFS-level changes in the version you are using could explain why I'm not seeing the problem reproduce. |
| Comment by Jay Lan (Inactive) [ 18/Dec/13 ] |
|
The sles11sp2 version we are running in production is 3.0.74-0.6.6.2. |
| Comment by Bob Glossman (Inactive) [ 18/Dec/13 ] |
|
I believe the ioctl(3, 0xc020660b, 0x7fffffffd390) shown in the strace output is an FS_IOC_FIEMAP ioctl. I suspect that is where cp is getting the sizes to read/write. It is interesting that it matches the file allocation size (the 512-byte blocks reported in st_blocks of the stat call) and is smaller than the st_size reported in the stat call. |
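(For reference, 0xc020660b decodes to FS_IOC_FIEMAP, i.e. _IOWR('f', 11, struct fiemap). As a hedged cross-check, filefrag from e2fsprogs issues the same ioctl and can show the extent map cp would have seen:)

$ filefrag -v hosts       # extents FIEMAP reports for the source file
$ filefrag -v zz/hosts    # and for the (possibly short) copy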
| Comment by Bob Glossman (Inactive) [ 18/Dec/13 ] |
|
All the SLES versions I'm testing with are newer than that. I have some 3.0.93-0.5, the most recent version of sles11sp2 we build and ship on, and some 3.0.101-05, the most recent kernel update version for sles11sp2. |
| Comment by Bob Glossman (Inactive) [ 18/Dec/13 ] |
|
Just to collect some additional data, could you please add the --sparse=never option to your cp commands, see if that avoids failures, and again get straces on the cp? |
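(For example, with the same placeholder paths as before:)

$ strace -f -o cp_never.strace cp --sparse=never hosts zz/hosts
$ sum hosts zz/hosts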
| Comment by Jay Lan (Inactive) [ 18/Dec/13 ] |
|
Here is the second part of Dale's reply in response to Andreas' strace request. I did not include it in the first attempt. He actually did try with --sparse=never.

== quote on ==
FWIW:
|
| Comment by Jay Lan (Inactive) [ 19/Dec/13 ] |
|
Hi Bob, the '/bin/cp' command is packaged in coreutils on sles11sp2. |
| Comment by Niu Yawei (Inactive) [ 19/Dec/13 ] |
|
This looks like the same problem as

Some data of the source file 'hosts' is still cached on the client and not yet flushed back to the OST, so the st_blocks reported by stat is less than the actual file size; 'cp' then thinks it is a sparse file and tries to copy only the extents returned by the fiemap ioctl. So what we need to figure out is: does the 'cp' in sles11sp2 call fiemap with the FIEMAP_FLAG_SYNC flag to make sure all cached data is flushed back before getting the extents? |
| Comment by Niu Yawei (Inactive) [ 19/Dec/13 ] |
|
I checked the source code of coreutils-8.12 from gnu.org; it looks like FIEMAP_FLAG_SYNC is always set when reading extents, but I'm not sure whether the cp shipped with sles11sp2 differs. (I'm not sure where to get the coreutils source for sles11sp2.) Bob, I guess your coreutils version isn't the same as Jay's, which is why you can't reproduce the problem. Could you try coreutils-8.12-6.23.1? |
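(One hedged way to check the upstream behaviour; the SLES package may carry extra patches, so the SUSE source RPM would be the authoritative reference.)

$ wget https://ftp.gnu.org/gnu/coreutils/coreutils-8.12.tar.gz
$ tar xzf coreutils-8.12.tar.gz
$ grep -rn FIEMAP_FLAG_SYNC coreutils-8.12/src/    # where cp's extent scan sets the flag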
| Comment by Bob Glossman (Inactive) [ 19/Dec/13 ] |
|
I have coreutils-8.12-6.25.29.1 on sles11sp2. |
| Comment by Bob Glossman (Inactive) [ 19/Dec/13 ] |
|
I tried backing down to the -6.23.1 coreutils version and still couldn't make the problem happen. I checked, and the cp binary looks identical between the two versions anyway. |
| Comment by Jay Lan (Inactive) [ 19/Dec/13 ] |
|
Niu, |
| Comment by Niu Yawei (Inactive) [ 24/Dec/13 ] |
|
Jay, could you try to reproduce with the D_TRACE log enabled? Let's see from the Lustre log whether the sync flag is specified in the fiemap call.
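(A sketch of how that could be captured on the clients; the output path is a placeholder.)

$ lctl set_param debug=+trace      # add D_TRACE to the current debug mask
$ lctl clear                       # empty the kernel debug buffer
# ... run the reproducer ...
$ lctl dk /tmp/lu4380-client.dbg   # dump the decoded debug log to a file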
|
| Comment by Niu Yawei (Inactive) [ 24/Dec/13 ] |
|
It's better to have this patch applied when collecting debug logs. |
| Comment by Jay Lan (Inactive) [ 24/Dec/13 ] |
|
Attached is the debug output Niu requested. I did not run the test with Niu's patch, though, since I need to get authorization to put a new binary onto the production system. |
| Comment by Jay Lan (Inactive) [ 24/Dec/13 ] |
|
I was asked to check with you whether having Lustre not implement the FIEMAP ioctl could be a good quick workaround. Note that in our case the writer is on one host and the reader is on a different one. Is this why FIEMAP_FLAG_SYNC has no effect: the _SYNC flag is issued on the reader host, but the cached data is on the writer host? |
| Comment by Niu Yawei (Inactive) [ 25/Dec/13 ] |
Ah, I was thinking it was all on the same client, but the fix of |
| Comment by Jay Lan (Inactive) [ 30/Dec/13 ] |
|
The tar.gz file contains

I did not include a debug trace for the OSS. The 'lfs getstripe zz/hosts' showed:

and there are 26 OSTs on that fs. So does it fall on oss3, if counting starts from oss1? Do I have to turn on +trace +dlmtrace +cache on the entire OSS? |
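(For what it's worth, a hedged way to map the obdidx reported by 'lfs getstripe' to the OSS serving it, from a client; the target name below is only an example pattern, substitute the index from the actual getstripe output.)

$ lfs getstripe zz/hosts                        # note the obdidx of the object
$ lctl get_param osc.*OST0003*.ost_conn_uuid    # NID of the OSS currently serving that OST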
| Comment by Jay Lan (Inactive) [ 30/Dec/13 ] |
|
I tried to run the test again with debugging enabled on the OSS. The debug output did not contain the lctl marks; the 300M specified for debug_daemon was not big enough. |
| Comment by Jay Lan (Inactive) [ 30/Dec/13 ] |
|
I tried to run the test on a Lustre filesystem that uses older hardware but has much less activity. I set the debug file size to 2G. The problem was that "lctl debug_daemon stop" hung until the 2G ran out, and the debug file missed most of the test. The same thing happened when I specified 1G. |
| Comment by Niu Yawei (Inactive) [ 31/Dec/13 ] |
|
Thank you Jay. I guess wr is the client which executes the test script (which cp's the hosts file from /etc/hosts)? I ask because I see the blocking AST in the wr log:

00010000:00000001:28.0F:1388433400.002345:0:95598:0:(ldlm_lockd.c:1694:ldlm_handle_bl_callback()) Process entered
00000100:00000001:22.0:1388433400.002345:0:4805:0:(events.c:407:reply_out_callback()) Process leaving
00000100:00000001:29.0:1388433400.002346:0:54104:0:(service.c:1571:ptlrpc_server_hpreq_fini()) Process entered
00000100:00000001:29.0:1388433400.002346:0:54104:0:(service.c:1582:ptlrpc_server_hpreq_fini()) Process leaving
00000100:00000001:29.0:1388433400.002347:0:54104:0:(service.c:2078:ptlrpc_server_handle_request()) Process leaving (rc=1 : 1 : 1)
00000400:00000001:29.0:1388433400.002348:0:54104:0:(watchdog.c:448:lc_watchdog_disable()) Process entered
00010000:00010000:28.0:1388433400.002348:0:95598:0:(ldlm_lockd.c:1696:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: nbp8-OST0050-osc-ffff8807e657ec00 lock: ffff880aa9efb480/0x11c2317b4b200a63 lrc: 3/0,0 mode: PW/PW res: [0x3552ee:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 40960->18446744073709551615) flags: 0x420000000000 nid: local remote: 0x42e20173aa80c345 expref: -99 pid: 63286 timeout: 0 lvb_type: 1
00000400:00000001:29.0:1388433400.002349:0:54104:0:(watchdog.c:456:lc_watchdog_disable()) Process leaving
00010000:00010000:28.0:1388433400.002354:0:95598:0:(ldlm_lockd.c:1709:ldlm_handle_bl_callback()) Lock ffff880aa9efb480 already unused, calling callback (ffffffffa08f79e0)
00000020:00000001:28.0:1388433400.002372:0:95598:0:(cl_lock.c:357:cl_lock_get_trust()) acquiring trusted reference: 0 ffff88089dfea238 18446744072108337004
00000020:00000001:28.0:1388433400.002374:0:95598:0:(cl_lock.c:150:cl_lock_trace0()) got mutex: ffff88089dfea238@(1 ffff880f181f8340 1 5 0 0 0 0)(ffff880404adca70/1/1) at cl_lock_mutex_tail():668
00000020:00000001:28.0:1388433400.002377:0:95598:0:(cl_lock.c:1839:cl_lock_cancel()) Process entered
00000020:00010000:28.0:1388433400.002378:0:95598:0:(cl_lock.c:150:cl_lock_trace0()) cancel lock: ffff88089dfea238@(1 ffff880f181f8340 1 5 0 0 0 0)(ffff880404adca70/1/1) at cl_lock_cancel():1840
00000020:00000001:28.0:1388433400.002381:0:95598:0:(cl_lock.c:804:cl_lock_cancel0()) Process entered
00000008:00000001:28.0:1388433400.002382:0:95598:0:(osc_lock.c:1305:osc_lock_flush()) Process entered
00000008:00000001:28.0:1388433400.002383:0:95598:0:(osc_cache.c:2827:osc_cache_writeback_range()) Process entered
00000008:00000001:28.0:1388433400.002386:0:95598:0:(osc_cache.c:2770:osc_cache_wait_range()) Process entered
00000008:00000020:28.0:1388433400.002387:0:95598:0:(osc_cache.c:2807:osc_cache_wait_range()) obj ffff88105cc78408 ready 0|-|- wr 0|-|- rd 0|- sync file range.
00000008:00000001:28.0:1388433400.002388:0:95598:0:(osc_cache.c:2808:osc_cache_wait_range()) Process leaving (rc=0 : 0 : 0)
00000008:00000020:28.0:1388433400.002389:0:95598:0:(osc_cache.c:2923:osc_cache_writeback_range()) obj ffff88105cc78408 ready 0|-|- wr 0|-|- rd 0|- pageout [0, 18446744073709551615], 0.

I think obj ffff88105cc78408 should be the source hosts file on Lustre, and when the remote client tried to cp it to the zz directory, a blocking AST should have been sent to the local client. The interesting thing is that I didn't see fiemap calls on the remote client (the 'rd' client); maybe it did the copy by a normal read. Anyway, I didn't see anything wrong in the log. Did the test succeed or not? Since the remote client didn't call fiemap, we don't need the OST log for now, thank you. |
| Comment by Jay Lan (Inactive) [ 31/Dec/13 ] |
|
Please discard LU4380.dbg.20121230.tgz. The two files contained in the tarball had confusing names (besides, it should be 2013).

The fiemap actually happened at the remote client, the one which actually did the file creation and contents copying. I had a problem in that I was not able to stop debug_daemon until the good data had already been flushed out of the debug file on the OST side. You need to tell me how to address that problem so that I can produce the OST log for you. |
| Comment by Niu Yawei (Inactive) [ 02/Jan/14 ] |
You can try to execute 'lctl clear' on the OSS to clear the debug buffer before testing. |
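(A possible capture sequence on the OSS that keeps the window small; the file path and size are placeholders.)

$ lctl clear                                    # drop whatever is already buffered
$ lctl debug_daemon start /tmp/oss-debug 1024   # stream debug to a 1024 MB file
# ... run the reproducer on the clients ...
$ lctl debug_daemon stop
$ lctl debug_file /tmp/oss-debug /tmp/oss-debug.txt   # convert the binary dump to readable text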
| Comment by Jay Lan (Inactive) [ 07/Jan/14 ] |
|
I was wrong in saying that the reproducer can be run against a 2.4.1 CentOS server; it actually was a 2.4.0 server with patches. The branch was nas-2.4.0-1 and the tag was 2.4.0-3nasS. We recently updated the 2.4.0 MDS (for testing

Which of the above could have changed the outcome? Also, do you expect it to work correctly when running a 2.4.1 client against a 2.1.5 server? I am still able to reproduce against the 2.1.5 server. |
| Comment by Niu Yawei (Inactive) [ 08/Jan/14 ] |
None of the patches seems related to this problem, and I don't see why upgrading the MDS could change the outcome (I think this problem involves only the client and OST). Could you verify the client and OSS versions? Do they all have the patch 58444c4e9bc58e192f0bc0c163a5d51d42ba4255 (
Does the 2.1.5 server have the patch 58444c4e9bc58e192f0bc0c163a5d51d42ba4255 applied? |
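(A hedged way to check, assuming the servers are built from a git checkout such as the lustre-nas tree. Note that a cherry-picked fix gets a different hash, so searching the history for the fiemap sync change is more reliable than matching the hash alone.)

$ git log --oneline | grep -i fiemap    # look for the fiemap sync fix in the branch history
$ git merge-base --is-ancestor 58444c4e9bc58e192f0bc0c163a5d51d42ba4255 HEAD 2>/dev/null && echo "patch present"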
| Comment by Mahmoud Hanafi [ 10/Jan/14 ] |
|
I have gathered clean debug logs from the local client, remote client, and OSS. The files are too large to attach here, so I have uploaded them to your ftp site 'ftp://ftp.whamcloud.com/uploads'. The filename is "LU_4380.debug.tgz".

$ tar tzvf LU_4380.debug.tgz |
| Comment by Niu Yawei (Inactive) [ 13/Jan/14 ] |
00000010:00000001:0.0:1389375898.914586:0:15540:0:(ost_handler.c:1261:ost_get_info()) Process leaving (rc=0 : 0 : 0)
Mahmoud, it looks like your OST is running 2.1.5 and doesn't have the patch 58444c4e9bc58e192f0bc0c163a5d51d42ba4255 ( |
| Comment by Jay Lan (Inactive) [ 14/Jan/14 ] |
|
We tested the 2.1.5 server with

Since somehow we are no longer able to reproduce the problem with our 2.4.0 |
| Comment by Peter Jones [ 14/Jan/14 ] |
|
ok - thanks Jay! |