Details
Type: Bug
Resolution: Fixed
Priority: Blocker
Affects Version(s): Lustre 2.12.0, Lustre 2.10.5, Lustre 2.10.6
Environment:
client catalyst: lustre-2.8.2_5.chaos-1.ch6.x86_64
server: porter lustre-2.10.5_2.chaos-3.ch6.x86_64
kernel-3.10.0-862.14.4.1chaos.ch6.x86_64 (RHEL 7.5 derivative)
Severity: 2
Description
The apparent contents of a file change after dropping caches:
[root@catalyst110:toss-4371.umm1t]# ./proc6.olaf
+ dd if=/dev/urandom of=testfile20K.in bs=10240 count=2
2+0 records in
2+0 records out
20480 bytes (20 kB) copied, 0.024565 s, 834 kB/s
+ dd if=testfile20K.in of=testfile20K.out bs=10240 count=2
2+0 records in
2+0 records out
20480 bytes (20 kB) copied, 0.0451045 s, 454 kB/s
++ md5sum testfile20K.out
+ original_md5sum='1060a4c01a415d7c38bdd00dcf09dd22 testfile20K.out'
+ echo 3
++ md5sum testfile20K.out
+ echo after drop_caches 1060a4c01a415d7c38bdd00dcf09dd22 testfile20K.out 717122f4dd25f2e75834a8b21c79ce50 testfile20K.out
after drop_caches 1060a4c01a415d7c38bdd00dcf09dd22 testfile20K.out 717122f4dd25f2e75834a8b21c79ce50 testfile20K.out

[root@catalyst110:toss-4371.umm1t]# cat proc6.olaf
#!/bin/bash
set -x
dd if=/dev/urandom of=testfile.in bs=10240 count=2
dd if=testfile.in of=testfile.out bs=10240 count=2
#dd if=/dev/urandom of=testfile.in bs=102400 count=2
#dd if=testfile.in of=testfile.out bs=102400 count=2
original_md5sum=$(md5sum testfile.out)
echo 3 >/proc/sys/vm/drop_caches
echo after drop_caches $original_md5sum $(md5sum testfile.out)
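For convenience, here is a slightly generalized sketch of the same reproducer (file names, block size, and count are placeholders rather than the exact values above) that reports PASS/FAIL instead of leaving the two checksums to be compared by eye:

#!/bin/bash
# Sketch of the reproducer: write a random file, copy it, checksum the copy,
# drop the page cache, and checksum the copy again; any difference indicates
# the corruption described in this ticket. Must run as root on the Lustre mount.
set -e

bs=${1:-10240}    # dd block size (placeholder default matching the run above)
count=${2:-2}     # number of blocks (placeholder default matching the run above)

dd if=/dev/urandom of=testfile.in bs=$bs count=$count
dd if=testfile.in of=testfile.out bs=$bs count=$count

before=$(md5sum testfile.out | awk '{print $1}')
echo 3 > /proc/sys/vm/drop_caches
after=$(md5sum testfile.out | awk '{print $1}')

if [ "$before" = "$after" ]; then
    echo "PASS: md5sum unchanged ($before)"
else
    echo "FAIL: md5sum changed from $before to $after"
fi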
Oleg, I've uploaded lu-11663-2018-11-26.tgz, which contains the test files and the debug logs from both client and server for two test runs: one that reproduces the issue, on client catalyst101, and one where the corruption does not occur, on client catalyst106. A typescript file shows the output of the tests as they ran. In both cases the stripe index of the files is 0.
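For reference, the stripe index was taken from lfs getstripe; a check along these lines (file names as in the transcript above) shows the layout and the starting OST index:

# Full layout: stripe count, stripe size, and the OST objects backing the files.
lfs getstripe testfile20K.in testfile20K.out
# Just the starting OST index of the copy.
lfs getstripe -i testfile20K.out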
The node which fails the test takes much longer to write the data, consistent with the sync writes you saw in the last debug logs.
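A rough way to compare the two clients, assuming the same test directory on each node, is to time the write step and look at the OSC RPC statistics; the commands below are illustrative, not taken from the attached logs:

# Run on both the failing and the non-failing client and compare elapsed time.
time dd if=/dev/urandom of=timing-test.out bs=10240 count=2

# Per-OSC RPC histograms; small write RPCs with little concurrency on the
# slow node would be consistent with the synchronous writes seen in the logs.
lctl get_param osc.*.rpc_stats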
The file system where this is occurring is 28% full, with individual OSTs ranging from 25% full to 31% full.
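The fill levels come from lfs df; something like the following (the mount point /p/lustre is a placeholder) gives the per-OST usage:

# Human-readable per-OST usage for the filesystem.
lfs df -h /p/lustre
# Just the OST fill percentages, sorted.
lfs df /p/lustre | awk '/OST/ {print $5}' | sort -n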
The amount of data I personally have stored on each OST ranges from 23M to 308M across the 80 OSTs. My total usage is 5.37G and my total quota is 18T. lfs quota reports a total allocated block limit of 5T, with each OST reporting a limit of 64G.
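The quota numbers above come from lfs quota; roughly (the user name and mount point are placeholders):

# Aggregate usage and limits for the user.
lfs quota -u myuser /p/lustre
# -v adds the per-OST breakdown, including the block limit granted on each OST.
lfs quota -v -u myuser /p/lustre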