Details
Bug
Resolution: Fixed
Critical
Description
Seeing data corruption on two IOR jobs while running the LNet failover test with "Regression write/verify DNE2 PFL". In both failures the file contains data from the immediately preceding run, which suggests a cache-related issue. fsync-related warnings are also logged.
CL_IOR_all_wr_20iter_1666K_rand
access = file-per-process
pattern = strided (33 segments)
ordering in a file = random offsets
ordering inter file=constant task offsets = 1
clients = 48 (4 per node)
repetitions = 20
xfersize = 1.63 MiB
blocksize = 27.66 MiB
aggregate filesize = 42.78 GiB

read 1316.54 28322 1666.00 0.001729 33.27 0.010429 33.28 16 XXCEL
Using Time Stamp 1594461626 (0x5f098dba) for Data Signature
Commencing write performance test.
Sat Jul 11 05:00:26 2020
read 1852.87 28322 1666.00 0.001450 23.64 0.004490 23.64 17 XXCEL
Using Time Stamp 1594461701 (0x5f098e05) for Data Signature
Commencing write performance test.
Sat Jul 11 05:01:41 2020
WARNING: cannot perform fsync on file.
WARNING: cannot perform fsync on file.
write 197.50 28322 1666.00 0.001447 221.82 0.421875 221.82 18 XXCEL
Verifying contents of the file(s) just written.
Sat Jul 11 05:05:23 2020
[6] At transfer buffer #3, index #0 (file byte offset 235425792):
[6] Expected: 0x0000000e5f098e05
[6] Actual: 0x0000000e5f098dba
[6] At transfer buffer #3, index #2 (file byte offset 235425808):
[6] Expected: 0x0000000e5f098e05
[6] Actual: 0x0000000e5f098dba

IOR job: -a POSIX -i 20 -w -r -W -t 1666K -b 28322K -C -e -k -vv -E -F -q -s 33 -x -z
File in question: /lus/snx11205/ostest.vers/alsorun.20200711032404.2543.pollux-p4/CL_IOR_all_wr_20iter_1666K_rand.2.9t3jzC.1594459731/CL_IOR_all_wr_20iter_1666K_rand/IORfile.00000004
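
The stale-data observation can be cross-checked by splitting the mismatched words into halves. The following is a minimal sketch, not part of the original report. It assumes that the low 32 bits of each reported word carry the run's data-signature time stamp, which the matching hex values in the log suggest. The helper name split_signature is hypothetical.

```python
from datetime import datetime, timezone


def split_signature(word):
    """Split a 64-bit data word into (upper 32 bits, lower 32 bits)."""
    return word >> 32, word & 0xFFFFFFFF


# Values copied from the verification failure in the log above.
expected = 0x0000000E5F098E05
actual = 0x0000000E5F098DBA

exp_hi, exp_stamp = split_signature(expected)
act_hi, act_stamp = split_signature(actual)

# Assumption: the lower half is a Unix time stamp (seconds since the epoch).
for label, stamp in (("expected", exp_stamp), ("actual", act_stamp)):
    when = datetime.fromtimestamp(stamp, tz=timezone.utc)
    print(f"{label}: stamp={stamp} (0x{stamp:08x}) utc={when:%Y-%m-%d %H:%M:%S}")

print("upper halves match:", exp_hi == act_hi)
print("actual is older by (s):", exp_stamp - act_stamp)
```

The lower half of the actual word is 1594461626 (0x5f098dba), the time stamp announced for the previous iteration, and it is 75 seconds older than the expected 1594461701 (0x5f098e05). That matches the gap between the two "Commencing write performance test" lines (05:00:26 and 05:01:41).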
Landed for 2.14