Details
-
Bug
-
Resolution: Won't Fix
-
Major
-
None
-
Lustre 1.8.9
-
None
-
2
-
7301
Description
One of our customers has an interesting configuration with Lustre.
They run a VM environment on KVM (Kernel-based Virtual Machine). The VM host nodes run RHEL 6.2; they are Lustre clients and mount the Lustre filesystem. The guest OS images are located on Lustre.
Hadoop runs on these guest OSes, and HDFS is created inside the VM images.
When we tested the Hadoop example code (teragen), we saw many error messages like the following on the Lustre clients (the VM host nodes):
Mar 21 04:01:59 s08 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742): from 192.168.100.95@o2ib inum 22/1194173787 object 7/0 extent [18041946112-18041950207]
Mar 21 04:01:59 s08 kernel: LustreError: 3308:0:(osc_request.c:1423:check_write_checksum()) original client csum 9f200f04 (type 2), server csum cb180f07 (type 2), client csum now ce430f5f
Mar 21 04:01:59 s08 kernel: LustreError: 3308:0:(osc_request.c:1652:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff88086754e400 x1430178466264362/t4304523663 o4->lustre-OST0001_UUID@192.168.100.95@o2ib:6/4 lens 448/608 e 0 to 1 dl 1363806126 ref 1 fl Interpret:R/0/0 rc 0/0
Mar 21 04:02:34 s08 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed on the client after we checksummed it - likely false positive due to mmap IO (bug 11742): from 192.168.100.95@o2ib inum 22/1194173787 object 7/0 extent [18041978880-18041991167]
Mar 21 04:02:34 s08 kernel: LustreError: Skipped 4 previous similar messages
Mar 21 04:02:34 s08 kernel: LustreError: 3308:0:(osc_request.c:1423:check_write_checksum()) original client csum a32dae6e (type 2), server csum 991aae8f (type 2), client csum now 991aae8f
Mar 21 04:02:34 s08 kernel: LustreError: 3308:0:(osc_request.c:1423:check_write_checksum()) Skipped 4 previous similar messages
Mar 21 04:02:34 s08 kernel: LustreError: 3308:0:(osc_request.c:1652:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff88086754e400 x1430178466359938/t4304619111 o4->lustre-OST0001_UUID@192.168.100.95@o2ib:6/4 lens 448/608 e 0 to 1 dl 1363806161 ref 1 fl Interpret:R/0/0 rc 0/0
Mar 21 04:02:34 s08 kernel: LustreError: 3308:0:(osc_request.c:1652:osc_brw_redo_request()) Skipped 4 previous similar messages
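For triage, the three checksums reported by check_write_checksum() can be compared to see which case each message falls into. The following is only a rough sketch for sorting through the logs, not the Lustre implementation: the helper name classify() and the regular expression are my own, and the "changed in transit" branch is an assumption, since that case does not appear in the messages above.

```python
import re

# Matches the "original client csum ..." line logged by check_write_checksum().
# The pattern is inferred from the log excerpt in this ticket only.
CSUM_RE = re.compile(
    r"original client csum (?P<orig>[0-9a-f]+) \(type \d+\), "
    r"server csum (?P<server>[0-9a-f]+) \(type \d+\), "
    r"client csum now (?P<now>[0-9a-f]+)"
)

def classify(line):
    """Return a short description of the mismatch, or None if the line does not match."""
    m = CSUM_RE.search(line)
    if m is None:
        return None
    orig, server, now = (int(m.group(k), 16) for k in ("orig", "server", "now"))
    if now == server:
        # The page hashed differently when the client first checksummed it,
        # but what the server received equals what the client holds now.
        return "changed on the client after checksumming"
    if now == orig:
        # Assumed case (not seen in this ticket): client data is stable,
        # so the server must have received something different.
        return "changed in transit"
    # Neither comparison matches: the client page is still being modified.
    return "changed in transit and does not match the original"
```

Applied to the two checksum lines above, the first (all three values differ) falls into the last case and the second (server csum equals client csum now) into the first, which agrees with the BAD WRITE CHECKSUM text logged just before each of them.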
We also see many timeout error messages for the local disks (the VM images). The problem is reproducible, and I have reproduced it in our lab.
This is similar to LU-2001, and we do not see performance regressions when Lustre is accessed through NFS.
I'm going to collect debug logs and attach them here.