Details
Type: Bug
Resolution: Fixed
Priority: Blocker
Versions: Lustre 2.12.0, Lustre 2.10.5, Lustre 2.10.6
Environment:
client catalyst: lustre-2.8.2_5.chaos-1.ch6.x86_64
server: porter lustre-2.10.5_2.chaos-3.ch6.x86_64
kernel-3.10.0-862.14.4.1chaos.ch6.x86_64 (RHEL 7.5 derivative)
Description
The apparent contents of a file change after dropping caches:
[root@catalyst110:toss-4371.umm1t]# ./proc6.olaf
+ dd if=/dev/urandom of=testfile20K.in bs=10240 count=2
2+0 records in
2+0 records out
20480 bytes (20 kB) copied, 0.024565 s, 834 kB/s
+ dd if=testfile20K.in of=testfile20K.out bs=10240 count=2
2+0 records in
2+0 records out
20480 bytes (20 kB) copied, 0.0451045 s, 454 kB/s
++ md5sum testfile20K.out
+ original_md5sum='1060a4c01a415d7c38bdd00dcf09dd22 testfile20K.out'
+ echo 3
++ md5sum testfile20K.out
+ echo after drop_caches 1060a4c01a415d7c38bdd00dcf09dd22 testfile20K.out 717122f4dd25f2e75834a8b21c79ce50 testfile20K.out
after drop_caches 1060a4c01a415d7c38bdd00dcf09dd22 testfile20K.out 717122f4dd25f2e75834a8b21c79ce50 testfile20K.out
[root@catalyst110:toss-4371.umm1t]# cat proc6.olaf
#!/bin/bash
set -x
dd if=/dev/urandom of=testfile.in bs=10240 count=2
dd if=testfile.in of=testfile.out bs=10240 count=2
#dd if=/dev/urandom of=testfile.in bs=102400 count=2
#dd if=testfile.in of=testfile.out bs=102400 count=2
original_md5sum=$(md5sum testfile.out)
echo 3 >/proc/sys/vm/drop_caches
echo after drop_caches $original_md5sum $(md5sum testfile.out)
I reviewed the -1 logs.
A few interesting observations:
1. Before the unmount you had some strange grant shortage, which means every write was actually synchronous. You can track this by the messages reporting it in the log (a sketch of how to check the grant directly follows this list).
2. So this leads to a sync write in the middle of a page, twice.
3. This actually hit both the .in and the .out file, but only the .out file is somehow damaged, huh?
4. We know that we are writing the correct data to the server, because we can observe both write requests (to the .in and .out files) and the checksums come out the same; see the "checksum at write origin" message repeated twice for the same request.
We cannot see whether that is also what was read back, though, because the final read goes through readahead, so all 4 pages are read in one go and the checksum is not comparable. (An interesting experiment would be to disable readahead, or do direct I/O reads, or some such, to see whether the bad data comes straight from the server; I think it does, but we cannot be 100% sure. A sketch of both variants is below.)
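For reference, here is a rough sketch of how the grant check from point 1 and the readahead-free re-read from the parenthetical above could be done. It assumes the standard osc/llite tunables and reuses the file names from the reproducer, so treat it as an illustration rather than a verified recipe:

# how much grant each OSC currently holds; very low values here are what
# push the client into synchronous writes
lctl get_param osc.*.cur_grant_bytes

# variant A: disable client readahead, drop caches again, and re-read, so
# reads are not merged by readahead and the checksums become comparable
lctl set_param llite.*.max_read_ahead_mb=0
echo 3 >/proc/sys/vm/drop_caches
md5sum testfile20K.out

# variant B: bypass the client page cache entirely with a direct I/O read
# (bs is 4096 here because O_DIRECT wants page-aligned I/O)
dd if=testfile20K.out bs=4096 iflag=direct 2>/dev/null | md5sum

If the checksum from the direct read already differs from the original one, the bad data is coming straight from the server.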
Now, looking at the successful iterations after the remounts, there are two major differences:
1. There's plenty of grant so no sync writes are happening.
2. The drop_caches does nothing: there are NO write RPCs in those logs (grep for 'o4-' to confirm). There are no reads either (grep for 'checksum .* confirmed'; you see only two requests with the fffffff checksum, which is the empty read at EOF). The exact grep invocations are sketched below.
These two things combined mean that whatever corruption you had, even if it is happening, would not be seen.
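To make the two checks above concrete, this is roughly what they would look like against a dumped debug log; the log file name is just an example:

# dump the current kernel debug buffer to a file
lctl debug_kernel /tmp/client-dk.log

# write RPCs sent after drop_caches (none expected in the good iterations)
grep 'o4-' /tmp/client-dk.log

# read-side checksum confirmations; only the empty EOF reads should show up
grep 'checksum .* confirmed' /tmp/client-dk.log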
Anyway, my current conclusion is that the corruption is actually happening on the server. It could be the disk or Lustre somewhere, I don't know about that, but the client seems to be doing everything ok.
As such, I suspect we would need client+server logs of a reproduced case; a rough outline of the collection steps is below. Also please include both the .in and .out files so we can compare them. To make reproduction easier you might want to dramatically shrink grant availability somehow (is the fs more prone to this when it is mostly full? are there quotas in place that are getting lowish?). I do wonder whether the same thing happens when you use direct I/O straight from dd, but since this I/O is not page-aligned, that cannot happen, and we have no easy way of triggering the sync I/O otherwise, huh.
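A sketch of how the logs could be captured around one reproduction, assuming the usual lctl debug facilities; the buffer size, file names and the matching OSS-side steps would need adjusting to the actual systems:

# on the client (and, in parallel, on the porter OSS):
lctl set_param debug=-1 debug_mb=1024    # full debug mask, larger buffer
lctl clear                               # start from an empty buffer
./proc6.olaf                             # run the reproducer
lctl debug_kernel /tmp/lu-client-dk.log  # dump the log right after the run

# keep the data files for comparison; cmp -l lists the differing bytes
cmp -l testfile.in testfile.out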
I'll see if I can find a way to trigger sync I/O deliberately.
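One thing that might approximate it in the meantime (an assumption on my part, not something I have verified against this code path) is starving the client's dirty cache so writes get pushed out almost immediately:

# shrink the per-OSC dirty limit; 1 MB is a guess at a low-but-accepted value
lctl set_param osc.*.max_dirty_mb=1
./proc6.olaf
# restore to the site default afterwards (commonly around 32 MB)
lctl set_param osc.*.max_dirty_mb=32

Note that this only forces early flushes rather than the exact grant-shortage sync path, so it may or may not hit the same code.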