[LU-2994] mmap IO performance problem Created: 20/Mar/13  Updated: 08/Feb/18  Resolved: 08/Feb/18

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.9
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Shuichi Ihara (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Won't Fix Votes: 0
Labels: None

Attachments: File debug.txt.gz     Text File strace-qemu-kvm.log    
Issue Links:
Related
Severity: 2
Rank (Obsolete): 7301

 Description   

One of our customers has an interesting Lustre configuration.

They run a VM environment with KVM (Kernel-based Virtual Machine). The VM host node is RHEL 6.2; it is a Lustre client and mounts Lustre. The guest OS images are located on Lustre.

Hadoop is running on these guest OSes, and HDFS is created on the VM images.
When we ran the Hadoop example codes (teragen), we saw a lot of error messages like the following on the Lustre clients (the VM host nodes):

Mar 21 04:01:59 s08 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742): from 192.168.100.95@o2ib inum 22/1194173787 object 7/0 extent [18041946112-18041950207]
Mar 21 04:01:59 s08 kernel: LustreError: 3308:0:(osc_request.c:1423:check_write_checksum()) original client csum 9f200f04 (type 2), server csum cb180f07 (type 2), client csum now ce430f5f
Mar 21 04:01:59 s08 kernel: LustreError: 3308:0:(osc_request.c:1652:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff88086754e400 x1430178466264362/t4304523663 o4->lustre-OST0001_UUID@192.168.100.95@o2ib:6/4 lens 448/608 e 0 to 1 dl 1363806126 ref 1 fl Interpret:R/0/0 rc 0/0
Mar 21 04:02:34 s08 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed on the client after we checksummed it - likely false positive due to mmap IO (bug 11742): from 192.168.100.95@o2ib inum 22/1194173787 object 7/0 extent [18041978880-18041991167]
Mar 21 04:02:34 s08 kernel: LustreError: Skipped 4 previous similar messages
Mar 21 04:02:34 s08 kernel: LustreError: 3308:0:(osc_request.c:1423:check_write_checksum()) original client csum a32dae6e (type 2), server csum 991aae8f (type 2), client csum now 991aae8f
Mar 21 04:02:34 s08 kernel: LustreError: 3308:0:(osc_request.c:1423:check_write_checksum()) Skipped 4 previous similar messages
Mar 21 04:02:34 s08 kernel: LustreError: 3308:0:(osc_request.c:1652:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff88086754e400 x1430178466359938/t4304619111 o4->lustre-OST0001_UUID@192.168.100.95@o2ib:6/4 lens 448/608 e 0 to 1 dl 1363806161 ref 1 fl Interpret:R/0/0 rc 0/0
Mar 21 04:02:34 s08 kernel: LustreError: 3308:0:(osc_request.c:1652:osc_brw_redo_request()) Skipped 4 previous similar messages

Also, we see a lot of timeout error messages for the local disks (the VM images). This is reproducible, and I have demonstrated the same problem in our lab.
This is similar to LU-2001, and we do not see the performance regression when Lustre is accessed through NFS.

I'm going to collect debug logs and attach them here.
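For reference, the "false positive" in these messages comes from the data changing underneath the client: the client checksums the pages, the application (here, qemu-kvm on behalf of the guest) modifies them again before or during the bulk transfer, and the server's checksum of what actually arrived no longer matches. The sketch below is a minimal userspace illustration of that race, not Lustre code; the file name and the trivial checksum are made up (Lustre uses its own checksum types over the RPC bulk pages). Build with gcc -pthread.

/*
 * Illustrative sketch only: a page that is mapped MAP_SHARED and modified
 * concurrently yields different checksums "before" and "after" an I/O,
 * which is the false-positive pattern the BAD WRITE CHECKSUM message
 * warns about. File name and checksum are hypothetical.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAPLEN 4096

static volatile int stop;

static uint32_t simple_csum(const unsigned char *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = sum * 31 + buf[i];
    return sum;
}

/* Simulates the application (e.g. the guest) dirtying the mapping. */
static void *writer(void *arg)
{
    unsigned char *p = arg;
    while (!stop)
        p[rand() % MAPLEN]++;
    return NULL;
}

int main(void)
{
    int fd = open("csum-demo.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 || ftruncate(fd, MAPLEN) < 0)
        return 1;

    unsigned char *p = mmap(NULL, MAPLEN, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    pthread_t t;
    pthread_create(&t, NULL, writer, p);

    /* "Client" checksums the page, then the bulk transfer takes a while. */
    uint32_t before = simple_csum(p, MAPLEN);
    usleep(100000);
    /* Checksumming again afterwards plays the role of the server/recheck. */
    uint32_t after = simple_csum(p, MAPLEN);

    stop = 1;
    pthread_join(t, NULL);

    printf("csum before %08x, csum now %08x -> %s\n",
           (unsigned)before, (unsigned)after,
           before == after ? "match" : "MISMATCH (mmap-style race)");

    munmap(p, MAPLEN);
    close(fd);
    return 0;
}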



 Comments   
Comment by Peter Jones [ 20/Mar/13 ]

Ihara

Which version of Lustre are they using?

Peter

Comment by Jinshan Xiong (Inactive) [ 20/Mar/13 ]

mmap writes with checksums should have been fixed by the introduction of the ll_page_mkwrite() method. Yes, a debug log is the first step for this problem.

Comment by Oleg Drokin [ 20/Mar/13 ]

For the error messages we have a silencing patch in 2.x, I believe, commit f8995c83720e999b13f057739f8217822a3951fa.

The sad reality is that Lustre mmap is on the heavy side, and mmap performance is not great with all the single-page IO it usually encounters, especially the forced read-modify-write cycles.
NFS hides that because it does not really care about data consistency, so it can be more lax about page tracking and such.

I am not totally sure kvm itself is using mmap IO for disk access; I just checked a running copy of my kvm and there's no disk in /proc/fd/map. It might also depend on which disk drivers are used, so try playing with that.

The other thing I was sure kvm uses mmap for is the system RAM image, which I believed it stores in an unlinked file somewhere in /tmp, but I don't see any evidence of that either. If this does happen, converting /tmp to a tmpfs-based filesystem might be prudent.

So, sorry, I am coming at this from a totally different angle, but I think it should be entirely possible to reduce kvm's usage of mmap and sidestep the problem that way.
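If it helps to confirm whether a running qemu-kvm process has the image mmap'ed at all: file-backed mappings show up with their path in /proc/<pid>/maps. A rough sketch along those lines (names are illustrative; this only detects mappings, it says nothing about ordinary read()/write() access):

/*
 * Rough check for whether a process has a given file mmap'ed: file-backed
 * mappings list their path in /proc/<pid>/maps. Anonymous mappings (and
 * plain read()/write() I/O) will not show up here.
 * Usage: ./mapcheck <pid> <path-substring>
 */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <path-substring>\n", argv[0]);
        return 1;
    }

    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/maps", argv[1]);

    FILE *f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }

    char line[512];
    int found = 0;
    while (fgets(line, sizeof(line), f)) {
        if (strstr(line, argv[2])) {
            fputs(line, stdout);    /* print any mapping of the image */
            found = 1;
        }
    }
    fclose(f);

    if (!found)
        printf("no mapping of '%s' found in %s\n", argv[2], path);
    return 0;
}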

Comment by Oleg Drokin [ 20/Mar/13 ]

Jinshan: I suspect this is some sort of 1.8 deployment, so page_mkwrite might not be there.

Comment by Shuichi Ihara (Inactive) [ 21/Mar/13 ]

This is lustre-1.8.9.wc1.

Comment by Shuichi Ihara (Inactive) [ 21/Mar/13 ]

debug files attached.

Comment by Shuichi Ihara (Inactive) [ 21/Mar/13 ]

I collected a stack trace of the qemu-kvm process before the VM host node hung. No mmap IO..

Comment by Shuichi Ihara (Inactive) [ 22/Mar/13 ]

Sorry, qemu-kvm did call mmap.. I will post the collected syslog and debug log.

Comment by Shuichi Ihara (Inactive) [ 22/Mar/13 ]

Here is my setup and test configuration. Sorry for the confusion, but it does seem to be related to mmap calls on the Lustre client.

oss, mds - Lustre OSS/MDS, running lustre-1.8.9
s08 - Lustre client (Lustre mounted on /lustre) and the KVM host node
hadoop1-3 - VMs running on s08; their VM images are located on /lustre/images/

A Hadoop sample job is started on the Hadoop master node (hadoop1), writing a 10GB file to HDFS.
# hadoop jar /usr/lib/hadoop/hadoop-test.jar TestDFSIO -write -nrFiles 1 -fileSize 10GB
13/03/23 03:56:18 INFO fs.TestDFSIO: TestDFSIO.0.0.6
13/03/23 03:56:18 INFO fs.TestDFSIO: nrFiles = 1
13/03/23 03:56:18 INFO fs.TestDFSIO: fileSize (MB) = 10240.0
13/03/23 03:56:18 INFO fs.TestDFSIO: bufferSize = 1000000
13/03/23 03:56:18 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
13/03/23 03:56:18 INFO fs.TestDFSIO: creating control file: 10737418240 bytes, 1 files
13/03/23 03:56:18 INFO fs.TestDFSIO: created control files for: 1 files
13/03/23 03:56:19 INFO mapred.FileInputFormat: Total input paths to process : 1
13/03/23 03:56:20 INFO mapred.JobClient: Running job: job_201303240831_0029
13/03/23 03:56:21 INFO mapred.JobClient:  map 0% reduce 0%
13/03/23 03:56:36 INFO mapred.JobClient:  map 100% reduce 0%
...

I monitored the kvm process on the KVM host node, and it called mmap.

# grep -i mmap strace-log.txt 
03:56:34 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f309d1000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f500dccc000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f500c2f1000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f50075ff000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f3f3fe000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f3e0f4000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f3c8ee000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f3beed000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f3b4ec000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f3aaeb000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f3a0ea000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f36ddb000
03:57:32 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f363da000
03:58:48 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f500dccc000
03:58:57 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f500c2f1000
03:58:57 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f50075ff000
03:58:57 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f5005ffb000
04:02:49 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f5004af8000
04:02:50 mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4f3e0f4000

From the Lustre client's debug log:

00000008:02020000:6:1363978813.043771:0:8798:0:(osc_request.c:1420:check_write_checksum()) 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742): from 192.168.100.95@o2ib inum 43/1177415599 object 738/0 extent [4158808064-4158820351]
00000008:00020000:6:1363978813.043772:0:8798:0:(osc_request.c:1423:check_write_checksum()) original client csum 7335b83c (type 2), server csum d105b83e (type 2), client csum now 2ee4b840
00000008:00000001:6:1363978813.043773:0:8798:0:(osc_request.c:1478:osc_brw_fini_request()) Process leaving (rc=18446744073709551605 : -11 : fffffffffffffff5)

On the VM side (hadoop3), there are a lot of device IO errors:

Mar 23 03:57:31 hadoop3 kernel: end_request: I/O error, dev vda, sector 5689384
Mar 23 03:57:31 hadoop3 kernel: Buffer I/O error on device dm-0, logical block 582661
Mar 23 03:57:31 hadoop3 kernel: lost page write due to I/O error on dm-0
Mar 23 03:57:31 hadoop3 kernel: Buffer I/O error on device dm-0, logical block 582662
Mar 23 03:57:31 hadoop3 kernel: lost page write due to I/O error on dm-0
Mar 23 03:57:31 hadoop3 kernel: Buffer I/O error on device dm-0, logical block 582663
Mar 23 03:57:31 hadoop3 kernel: lost page write due to I/O error on dm-0
Mar 23 03:57:31 hadoop3 kernel: Buffer I/O error on device dm-0, logical block 582664
Mar 23 03:57:31 hadoop3 kernel: lost page write due to I/O error on dm-0
Mar 23 03:57:31 hadoop3 kernel: end_request: I/O error, dev vda, sector 54641960
Mar 23 03:57:31 hadoop3 kernel: Buffer I/O error on device dm-0, logical block 6701733
Mar 23 03:57:31 hadoop3 kernel: lost page write due to I/O error on dm-0
Mar 23 03:57:31 hadoop3 kernel: Buffer I/O error on device dm-0, logical block 6701734
Mar 23 03:57:31 hadoop3 kernel: lost page write due to I/O error on dm-0
Mar 23 03:57:31 hadoop3 kernel: Buffer I/O error on device dm-0, logical block 6701735
...
Comment by Shuichi Ihara (Inactive) [ 22/Mar/13 ]

All syslog and debug files are uploaded to /uploads/LU-2994/20130323 on the ftp site.

Please have a look at them and let me know if you need further information.

Comment by Oleg Drokin [ 22/Mar/13 ]

The mmap call you quote is an "anonymous" mmap, meaning it does not attach itself to any filesystem; it's just a fancy way of allocating memory.
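(For completeness, a small sketch of that distinction: MAP_ANONYMOUS with fd = -1 never touches a filesystem, while a MAP_SHARED mapping of a real file descriptor is the case that generates mmap-style page writeback on Lustre. The scratch file path below is hypothetical.)

/*
 * Sketch of the distinction: MAP_ANONYMOUS (fd = -1) is plain memory
 * allocation and never involves a filesystem; a MAP_SHARED mapping of a
 * real fd is what produces mmap-style page writeback. Path is hypothetical.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 4096;

    /* Anonymous mapping: what the strace lines above show (fd == -1). */
    void *anon = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* File-backed shared mapping: only this kind involves the filesystem. */
    int fd = open("/lustre/scratch/mmap-demo", O_RDWR | O_CREAT, 0644);
    void *filemap = MAP_FAILED;
    if (fd >= 0 && ftruncate(fd, len) == 0)
        filemap = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    printf("anonymous map:   %p (no file behind it)\n", anon);
    printf("file-backed map: %p (dirty pages get written back to the file)\n",
           filemap);

    if (filemap != MAP_FAILED) {
        ((char *)filemap)[0] = 1;   /* dirties a file page; writeback hits the FS */
        munmap(filemap, len);
    }
    if (anon != MAP_FAILED)
        munmap(anon, len);
    if (fd >= 0)
        close(fd);
    return 0;
}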

I wonder if the bad checksum is valid and you really have a network problem? Do you get a similar bad checksum message on the server?

Comment by Shuichi Ihara (Inactive) [ 23/Mar/13 ]

I wonder if the bad checksum is valid and you really have a network problem? Do you get a similar bad checksum message on the server?

No, this is exactly what happened at the customer site, and we demonstrated the same problem in our lab here.

Comment by Shuichi Ihara (Inactive) [ 24/Mar/13 ]

Few updates..

As far as we have investigated qemu in detail: if cache=none is set in the disk option of a VM's virtual disk, the host opens the image with O_DIRECT, which means no page cache on the Lustre client.

<driver name='qemu' type='qcow2' cache='none'/>

However, "none" is NOT "no caching" on the VMs. The VM still has a small cache to emulate the disk's "write back" mode. So, from the host's perspective, this is not a true O_DIRECT mode.

We changed the cache mode to writeback and writethrough to disable O_DIRECT on the host and use async IO instead. So far, we haven't seen the problem...

Comment by Shuichi Ihara (Inactive) [ 24/Mar/13 ]

With lustre-2.x, if the VM images are located on Lustre and we then start the VMs, it doesn't work (the VMs can't even start) because qemu-kvm gets -EINVAL from ll_direct_IO_26() here:

       /* FIXME: io smaller than PAGE_SIZE is broken on ia64 ??? */
       if ((file_offset & ~CFS_PAGE_MASK) || (count & ~CFS_PAGE_MASK))
               RETURN(-EINVAL);

Comment by Oleg Drokin [ 24/Mar/13 ]

Ah! Now this makes total sense!
It seems we always assumed that mmap was the only way to change pages while IO was in progress, but in reality O_DIRECT is very similar: the pages are under user control, so parallel threads might still change them while the IO thread is blocked.

As far as the EINVAL goes, can you confirm the count IS in fact a multiple of 4k, and at an offset that's a multiple of 4k?
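For reference, the check quoted from ll_direct_IO_26() is the usual direct IO constraint: the file offset and the transfer size must both be multiples of the page size, and the user buffer should be suitably aligned as well (e.g. via posix_memalign()). Below is a minimal sketch of an O_DIRECT write that would pass that check; the file path is hypothetical.

/*
 * Minimal O_DIRECT write satisfying the alignment check quoted above:
 * the buffer, the count and the file offset are all multiples of the page
 * size. Dropping any of the alignments would get -EINVAL from a direct IO
 * path like ll_direct_IO_26(). File path is hypothetical.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);          /* 4096 on x86_64 */
    size_t count = 16 * pagesize;                   /* multiple of 4k */
    off_t offset = 128 * pagesize;                  /* multiple of 4k */

    void *buf;
    if (posix_memalign(&buf, pagesize, count) != 0) /* aligned user buffer */
        return 1;
    memset(buf, 0xab, count);

    int fd = open("/lustre/images/test.img",
                  O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open(O_DIRECT)");
        return 1;
    }

    ssize_t rc = pwrite(fd, buf, count, offset);
    printf("pwrite(count=%zu, offset=%lld) = %zd\n",
           count, (long long)offset, rc);

    close(fd);
    free(buf);
    return 0;
}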

Comment by Jinshan Xiong (Inactive) [ 08/Feb/18 ]

Closing old tickets.
