[LU-2994] mmap IO performance problem Created: 20/Mar/13 Updated: 08/Feb/18 Resolved: 08/Feb/18 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.9 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 2 | ||||
| Rank (Obsolete): | 7301 | ||||
| Description |
|
One of our customers has an interesting configuration with Lustre. They run a VM environment with KVM (Kernel-based Virtual Machine). The VM host node is RHEL6.2; it is the Lustre client and mounts Lustre. The guest OS images are located on Lustre. Hadoop runs on these guest OSes and HDFS is created on the VM images.

We see errors like the following:

Mar 21 04:01:59 s08 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742): from 192.168.100.95@o2ib inum 22/1194173787 object 7/0 extent [18041946112-18041950207]
Mar 21 04:01:59 s08 kernel: LustreError: 3308:0:(osc_request.c:1423:check_write_checksum()) original client csum 9f200f04 (type 2), server csum cb180f07 (type 2), client csum now ce430f5f
Mar 21 04:01:59 s08 kernel: LustreError: 3308:0:(osc_request.c:1652:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff88086754e400 x1430178466264362/t4304523663 o4->lustre-OST0001_UUID@192.168.100.95@o2ib:6/4 lens 448/608 e 0 to 1 dl 1363806126 ref 1 fl Interpret:R/0/0 rc 0/0
Mar 21 04:02:34 s08 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed on the client after we checksummed it - likely false positive due to mmap IO (bug 11742): from 192.168.100.95@o2ib inum 22/1194173787 object 7/0 extent [18041978880-18041991167]
Mar 21 04:02:34 s08 kernel: LustreError: Skipped 4 previous similar messages
Mar 21 04:02:34 s08 kernel: LustreError: 3308:0:(osc_request.c:1423:check_write_checksum()) original client csum a32dae6e (type 2), server csum 991aae8f (type 2), client csum now 991aae8f
Mar 21 04:02:34 s08 kernel: LustreError: 3308:0:(osc_request.c:1423:check_write_checksum()) Skipped 4 previous similar messages
Mar 21 04:02:34 s08 kernel: LustreError: 3308:0:(osc_request.c:1652:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff88086754e400 x1430178466359938/t4304619111 o4->lustre-OST0001_UUID@192.168.100.95@o2ib:6/4 lens 448/608 e 0 to 1 dl 1363806161 ref 1 fl Interpret:R/0/0 rc 0/0
Mar 21 04:02:34 s08 kernel: LustreError: 3308:0:(osc_request.c:1652:osc_brw_redo_request()) Skipped 4 previous similar messages

In addition, we see a lot of timeout error messages for the local disks (the VM images). This is reproducible and I've demonstrated the same problem in our lab. I'm going to collect debug logs and attach them here. |
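For background, the "likely false positive due to mmap IO" wording refers to pages that an application keeps modifying through a shared file mapping while they are in flight. Below is a minimal sketch (the image path is hypothetical, not from this ticket) of the kind of file-backed mmap write that can change a page after the client has already checksummed it for the bulk RPC:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/mnt/lustre/vm/disk.img", O_RDWR); /* hypothetical path */
        char *p;

        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* Map one page of the file; stores through "p" dirty the page cache
         * directly, bypassing write(2). */
        p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                close(fd);
                return 1;
        }

        /* The application may keep storing into the page at any time,
         * including while the client is flushing it to the OST.  A store that
         * lands after the client checksummed the bulk write but before the
         * server verified it produces exactly the mismatch reported above. */
        memcpy(p, "new data", 8);

        munmap(p, 4096);
        close(fd);
        return 0;
}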
| Comments |
| Comment by Peter Jones [ 20/Mar/13 ] |
|
Ihara, which version of Lustre are they using? Peter |
| Comment by Jinshan Xiong (Inactive) [ 20/Mar/13 ] |
|
The mmap write checksum issue should have been fixed by introducing the ll_page_mkwrite() method. Yes, a debug log is the first step for this problem. |
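For reference, ->page_mkwrite is the VM hook that notifies a filesystem just before a mapped page becomes writable. A minimal sketch of wiring it up, assuming the RHEL6-era (2.6.32) prototype and not the actual Lustre ll_page_mkwrite() implementation, looks like this:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/* A sketch only: demo_* names are placeholders, not Lustre code. */
static int demo_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
        struct page *page = vmf->page;

        /* With this hook the filesystem gets a chance to take its locks and
         * mark the page dirty under its own control, so a later flush sees a
         * stable page instead of one changing underneath the checksum. */
        lock_page(page);
        set_page_dirty(page);
        unlock_page(page);

        return 0;
}

static const struct vm_operations_struct demo_vm_ops = {
        .fault        = filemap_fault,
        .page_mkwrite = demo_page_mkwrite,
};

static int demo_file_mmap(struct file *file, struct vm_area_struct *vma)
{
        file_accessed(file);
        vma->vm_ops = &demo_vm_ops;
        return 0;
}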
| Comment by Oleg Drokin [ 20/Mar/13 ] |
|
For the error messages, I believe we have a silencing patch in 2.x, commit f8995c83720e999b13f057739f8217822a3951fa. The sad reality is that Lustre mmap is on the heavy side, and mmap performance is not great given all the single-page IO it usually encounters, especially the forced read-modify-write cycles. I am not totally sure KVM itself uses mmap IO for disk access; I just checked a running copy of my KVM and there is no disk in /proc/fd/map. It might also depend on which disk drivers are used, so try playing with that. The other thing I was sure KVM uses mmap for is the system RAM image, which I believed it stores in an unlinked file somewhere in /tmp, but I don't see any evidence of that either. If that does happen, converting /tmp to a tmpfs-based filesystem might be prudent. So, sorry, I am coming at this from a totally different angle, but I think it should be possible to reduce KVM's use of mmap and sidestep the problem that way. |
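One way to perform that check, roughly equivalent to reading /proc/<pid>/maps by hand, is to look for the disk image path among the process mappings; a small sketch follows (the default PID and image name are placeholders). Anonymous mappings carry no path in /proc/<pid>/maps, so only a real file-backed mmap of the image will match:

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
        const char *pid = argc > 1 ? argv[1] : "1234";     /* qemu-kvm pid */
        const char *img = argc > 2 ? argv[2] : "disk.img"; /* image name   */
        char maps[64], line[4096];
        FILE *f;

        snprintf(maps, sizeof(maps), "/proc/%s/maps", pid);
        f = fopen(maps, "r");
        if (!f) {
                perror(maps);
                return 1;
        }

        /* Print every mapping whose backing path mentions the image. */
        while (fgets(line, sizeof(line), f))
                if (strstr(line, img))
                        fputs(line, stdout);

        fclose(f);
        return 0;
}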
| Comment by Oleg Drokin [ 20/Mar/13 ] |
|
Jinshan: I suspect this is some sort of 1.8 deployment, so page_mkwrite might not be there. |
| Comment by Shuichi Ihara (Inactive) [ 21/Mar/13 ] |
|
This is lustre-1.8.9.wc1. |
| Comment by Shuichi Ihara (Inactive) [ 21/Mar/13 ] |
|
debug files attached. |
| Comment by Shuichi Ihara (Inactive) [ 21/Mar/13 ] |
|
Collected a stack trace of the qemu-kvm process before the VM node hung. No mmap IO... |
| Comment by Shuichi Ihara (Inactive) [ 22/Mar/13 ] |
|
Sorry, qemu-kvm did call mmap. I will post the correct syslog and debug log. |
| Comment by Shuichi Ihara (Inactive) [ 22/Mar/13 ] |
|
This is my setup and test configuration. Sorry for the confusion, but it seems to be related to mmap calls on the Lustre client.

oss, mds - Lustre OSS/MDS running lustre-1.8.9

A Hadoop sample job is started on the Hadoop master node (hadoop1), writing a 10GB file to HDFS:

# hadoop jar /usr/lib/hadoop/hadoop-test.jar TestDFSIO -write -nrFiles 1 -fileSize 10GB
13/03/23 03:56:18 INFO fs.TestDFSIO: TestDFSIO.0.0.6
13/03/23 03:56:18 INFO fs.TestDFSIO: nrFiles = 1
13/03/23 03:56:18 INFO fs.TestDFSIO: fileSize (MB) = 10240.0
13/03/23 03:56:18 INFO fs.TestDFSIO: bufferSize = 1000000
13/03/23 03:56:18 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
13/03/23 03:56:18 INFO fs.TestDFSIO: creating control file: 10737418240 bytes, 1 files
13/03/23 03:56:18 INFO fs.TestDFSIO: created control files for: 1 files
13/03/23 03:56:19 INFO mapred.FileInputFormat: Total input paths to process : 1
13/03/23 03:56:20 INFO mapred.JobClient: Running job: job_201303240831_0029
13/03/23 03:56:21 INFO mapred.JobClient: map 0% reduce 0%
13/03/23 03:56:36 INFO mapred.JobClient: map 100% reduce 0%
...

I monitored the kvm process on the KVM host node and it called mmap:

# grep -i mmap strace-log.txt
03:56:34 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f309d1000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f500dccc000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f500c2f1000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f50075ff000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f3f3fe000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f3e0f4000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f3c8ee000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f3beed000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f3b4ec000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f3aaeb000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f3a0ea000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f36ddb000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f363da000
03:58:48 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f500dccc000
03:58:57 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f500c2f1000
03:58:57 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f50075ff000
03:58:57 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f5005ffb000
04:02:49 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f5004af8000
04:02:50 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f3e0f4000

From the Lustre client's debug log:
00000008:02020000:6:1363978813.043771:0:8798:0:(osc_request.c:1420:check_write_checksum()) 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742): from 192.168.100.95@o2ib inum 43/1177415599 object 738/0 extent [4158808064-4158820351]
00000008:00020000:6:1363978813.043772:0:8798:0:(osc_request.c:1423:check_write_checksum()) original client csum 7335b83c (type 2), server csum d105b83e (type 2), client csum now 2ee4b840
00000008:00000001:6:1363978813.043773:0:8798:0:(osc_request.c:1478:osc_brw_fini_request()) Process leaving (rc=18446744073709551605 : -11 : fffffffffffffff5)

On the VM side (hadoop3), there are a lot of device IO errors:

Mar 23 03:57:31 hadoop3 kernel: end_request: I/O error, dev vda, sector 5689384
Mar 23 03:57:31 hadoop3 kernel: Buffer I/O error on device dm-0, logical block 582661
Mar 23 03:57:31 hadoop3 kernel: lost page write due to I/O error on dm-0
Mar 23 03:57:31 hadoop3 kernel: Buffer I/O error on device dm-0, logical block 582662
Mar 23 03:57:31 hadoop3 kernel: lost page write due to I/O error on dm-0
Mar 23 03:57:31 hadoop3 kernel: Buffer I/O error on device dm-0, logical block 582663
Mar 23 03:57:31 hadoop3 kernel: lost page write due to I/O error on dm-0
Mar 23 03:57:31 hadoop3 kernel: Buffer I/O error on device dm-0, logical block 582664
Mar 23 03:57:31 hadoop3 kernel: lost page write due to I/O error on dm-0
Mar 23 03:57:31 hadoop3 kernel: end_request: I/O error, dev vda, sector 54641960
Mar 23 03:57:31 hadoop3 kernel: Buffer I/O error on device dm-0, logical block 6701733
Mar 23 03:57:31 hadoop3 kernel: lost page write due to I/O error on dm-0
Mar 23 03:57:31 hadoop3 kernel: Buffer I/O error on device dm-0, logical block 6701734
Mar 23 03:57:31 hadoop3 kernel: lost page write due to I/O error on dm-0
Mar 23 03:57:31 hadoop3 kernel: Buffer I/O error on device dm-0, logical block 6701735
... |
| Comment by Shuichi Ihara (Inactive) [ 22/Mar/13 ] |
|
All syslog and debug files are uploaded to /uploads/. Please have a look at them and let me know if you need further information. |
| Comment by Oleg Drokin [ 22/Mar/13 ] |
|
The mmap call you quote is an "anonymous" mmap, meaning it does not attach itself to any filesystem; it is just a fancy way of allocating memory. I wonder if the bad checksum is valid and you really have a network problem. Do you get a similar bad checksum message on the server? |
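To illustrate the distinction (the Lustre path below is hypothetical), the two kinds of mapping look like this; only the second one touches the filesystem and could generate mmap IO against the Lustre client:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        /* Anonymous mapping, like the ones in the strace output above: no
         * file descriptor, no filesystem involved, just memory allocation. */
        void *anon = mmap(NULL, 10489856, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* File-backed shared mapping: this kind generates mmap IO against
         * the Lustre client and could trip the write checksum. */
        int fd = open("/mnt/lustre/vm/disk.img", O_RDWR); /* hypothetical */
        void *file = fd >= 0 ? mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                                    MAP_SHARED, fd, 0) : MAP_FAILED;

        if (anon != MAP_FAILED)
                munmap(anon, 10489856);
        if (file != MAP_FAILED)
                munmap(file, 1 << 20);
        if (fd >= 0)
                close(fd);
        return 0;
}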
| Comment by Shuichi Ihara (Inactive) [ 23/Mar/13 ] |
|
No, this exact problem happened at the customer site, and we reproduced the same problem in our lab here. |
| Comment by Shuichi Ihara (Inactive) [ 24/Mar/13 ] |
|
A few updates. As far as we can tell from looking at qemu in detail, if cache=none is set as the disk option for a VM's virtual disk, the host opens the image with O_DIRECT, which means no page cache on the Lustre client:

<driver name='qemu' type='qcow2' cache='none'/>

However, "none" is NOT "no caching" from the VM's point of view; the VM still has a small cache to emulate the disk's "write back" mode. So, from the host's perspective, this is not a true O_DIRECT mode. We changed the cache mode to writeback and writethrough to disable O_DIRECT on the host and use async IO instead. So far we have not seen the problem again... |
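For reference, the change amounts to switching the cache attribute in the libvirt disk definition, for example (the qcow2 type is carried over from the snippet above; cache='writethrough' is set the same way):

<driver name='qemu' type='qcow2' cache='writeback'/>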
| Comment by Shuichi Ihara (Inactive) [ 24/Mar/13 ] |
|
With lustre-2.x, if the VM images are located on Lustre and we then start the VMs, it doesn't work (the VMs can't even start) because qemu-kvm gets -EINVAL from ll_direct_IO_26() here:

/* FIXME: io smaller than PAGE_SIZE is broken on ia64 ??? */
if ((file_offset & ~CFS_PAGE_MASK) || (count & ~CFS_PAGE_MASK))
RETURN(-EINVAL);
|
| Comment by Oleg Drokin [ 24/Mar/13 ] |
|
Ah! Now this makes total sense! As far as EINVAL goes, can you confirm the count IS in fact a multiple of 4k, and at an offset that is a multiple of 4k? |
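To make the alignment question concrete, here is a small sketch of what an O_DIRECT caller has to do to pass the check quoted above (the image path is hypothetical); a request like the second pwrite below is the case that would get -EINVAL:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* The ll_direct_IO_26() check rejects any O_DIRECT request whose
         * file offset or transfer size is not page-aligned, so this is the
         * alignment a caller must provide. */
        int fd = open("/mnt/lustre/vm/disk.img", O_RDWR | O_DIRECT);
        void *buf;
        ssize_t rc;

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (posix_memalign(&buf, 4096, 4096)) {
                close(fd);
                return 1;
        }
        memset(buf, 0, 4096);

        rc = pwrite(fd, buf, 4096, 8192);  /* 4k count at a 4k offset: accepted */
        printf("aligned write: %zd\n", rc);

        rc = pwrite(fd, buf, 512, 8192);   /* 512-byte count: -EINVAL expected  */
        printf("unaligned write: %zd (errno %d)\n", rc, rc < 0 ? errno : 0);

        free(buf);
        close(fd);
        return 0;
}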
| Comment by Jinshan Xiong (Inactive) [ 08/Feb/18 ] |
|
close old tickets |