Type: Bug
Resolution: Won't Fix
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 1.8.9
Labels: None
Severity: 2
Rank (Obsolete): 7301

One of our customers has an interesting Lustre configuration.

They have a VM environment based on KVM (Kernel-based Virtual Machine). The VM host nodes run RHEL 6.2 and act as Lustre clients, mounting the Lustre filesystem. The guest OS images are stored on Lustre.

Hadoop runs on these guest OSes, and HDFS is created on the VM images.
When we ran the Hadoop example code (teragen), we saw many error messages like the following on the Lustre clients (the VM host nodes):

Mar 21 04:01:59 s08 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742): from 192.168.100.95@o2ib inum 22/1194173787 object 7/0 extent [18041946112-18041950207]
Mar 21 04:01:59 s08 kernel: LustreError: 3308:0:(osc_request.c:1423:check_write_checksum()) original client csum 9f200f04 (type 2), server csum cb180f07 (type 2), client csum now ce430f5f
Mar 21 04:01:59 s08 kernel: LustreError: 3308:0:(osc_request.c:1652:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff88086754e400 x1430178466264362/t4304523663 o4->lustre-OST0001_UUID@192.168.100.95@o2ib:6/4 lens 448/608 e 0 to 1 dl 1363806126 ref 1 fl Interpret:R/0/0 rc 0/0
Mar 21 04:02:34 s08 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed on the client after we checksummed it - likely false positive due to mmap IO (bug 11742): from 192.168.100.95@o2ib inum 22/1194173787 object 7/0 extent [18041978880-18041991167]
Mar 21 04:02:34 s08 kernel: LustreError: Skipped 4 previous similar messages
Mar 21 04:02:34 s08 kernel: LustreError: 3308:0:(osc_request.c:1423:check_write_checksum()) original client csum a32dae6e (type 2), server csum 991aae8f (type 2), client csum now 991aae8f
Mar 21 04:02:34 s08 kernel: LustreError: 3308:0:(osc_request.c:1423:check_write_checksum()) Skipped 4 previous similar messages
Mar 21 04:02:34 s08 kernel: LustreError: 3308:0:(osc_request.c:1652:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff88086754e400 x1430178466359938/t4304619111 o4->lustre-OST0001_UUID@192.168.100.95@o2ib:6/4 lens 448/608 e 0 to 1 dl 1363806161 ref 1 fl Interpret:R/0/0 rc 0/0
Mar 21 04:02:34 s08 kernel: LustreError: 3308:0:(osc_request.c:1652:osc_brw_redo_request()) Skipped 4 previous similar messages
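The two kinds of BAD WRITE CHECKSUM message above differ in how the three reported checksums (original client, server, client "now") relate to each other. Below is a minimal Python sketch, assuming the classification works the way the message text suggests; it is not the Lustre source. The labels, `classify_write_checksum` and `simulate_mmap_race` are hypothetical names, the checksum-type check is omitted, and "type 2" is assumed to be Adler-32. The simulation shows how a page rewritten after it was checksummed (as a guest writing through qemu-kvm could do) yields the "changed on the client" case.

```python
import zlib


def classify_write_checksum(client_cksum, server_cksum, new_cksum):
    """Hypothetical classifier modeled on the log text, not on osc_request.c."""
    if server_cksum == client_cksum:
        return "OK"
    if new_cksum == server_cksum:
        # Client data changed after the original checksum was computed,
        # e.g. a page rewritten while the bulk write RPC was in flight.
        return "CLIENT_MODIFIED"
    if new_cksum == client_cksum:
        # Client data is unchanged, so the difference arose on the wire.
        return "TRANSIT"
    # Matches neither: the data may have changed more than once.
    return "TRANSIT_AND_MODIFIED"


def simulate_mmap_race(page_size=4096):
    """Hypothetical race: a page is modified after it has been checksummed."""
    page = bytearray(page_size)
    client_cksum = zlib.adler32(page)  # checksum taken when the RPC is built
    page[100] = 0xFF                   # another writer touches the page afterwards
    server_cksum = zlib.adler32(page)  # server checksums the bytes it received
    new_cksum = zlib.adler32(page)     # client recomputes after the mismatch
    return classify_write_checksum(client_cksum, server_cksum, new_cksum)


# Checksum values copied from the 04:01:59 and 04:02:34 log entries above.
print(classify_write_checksum(0x9f200f04, 0xcb180f07, 0xce430f5f))  # TRANSIT_AND_MODIFIED
print(classify_write_checksum(0xa32dae6e, 0x991aae8f, 0x991aae8f))  # CLIENT_MODIFIED
print(simulate_mmap_race())                                         # CLIENT_MODIFIED
```

Under these assumptions, neither logged case is "TRANSIT" alone, which is in line with the "likely false positive due to mmap IO (bug 11742)" hint that both messages carry.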

We also see many timeout errors for the guests' local disks (the VM images). This is reproducible, and I have demonstrated the same problem in our lab.
This is similar to LU-2001, and we did not see these performance regressions when accessing Lustre through NFS.

I'm going to collect debug logs and attach them here.


Attachments:
- debug.txt.gz (3.17 MB, 21/Mar/13 3:55 AM)
- strace-qemu-kvm.log (768 kB, 21/Mar/13 5:59 PM)

Assignee: Jinshan Xiong (Inactive)
Reporter: Shuichi Ihara (Inactive)
Votes: 0
Watchers: 9
Created: 20/Mar/13 7:15 PM
Updated: 08/Feb/18 6:30 PM
Resolved: 08/Feb/18 6:30 PM
