
corrupt data after page-unaligned write with zfs backend lustre 2.10

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.6
    • Affects Version/s: Lustre 2.12.0, Lustre 2.10.5, Lustre 2.10.6
    • Environment:
      client catalyst: lustre-2.8.2_5.chaos-1.ch6.x86_64
      server: porter lustre-2.10.5_2.chaos-3.ch6.x86_64
      kernel-3.10.0-862.14.4.1chaos.ch6.x86_64 (RHEL 7.5 derivative)
    • Severity: 2

    Description

      The apparent contents of a file change after dropping caches:

      [root@catalyst110:toss-4371.umm1t]# ./proc6.olaf
      + dd if=/dev/urandom of=testfile20K.in bs=10240 count=2
      2+0 records in
      2+0 records out
      20480 bytes (20 kB) copied, 0.024565 s, 834 kB/s
      + dd if=testfile20K.in of=testfile20K.out bs=10240 count=2
      2+0 records in
      2+0 records out
      20480 bytes (20 kB) copied, 0.0451045 s, 454 kB/s
      ++ md5sum testfile20K.out
      + original_md5sum='1060a4c01a415d7c38bdd00dcf09dd22  testfile20K.out'
      + echo 3
      ++ md5sum testfile20K.out
      + echo after drop_caches 1060a4c01a415d7c38bdd00dcf09dd22 testfile20K.out 717122f4dd25f2e75834a8b21c79ce50 testfile20K.out
      after drop_caches 1060a4c01a415d7c38bdd00dcf09dd22 testfile20K.out 717122f4dd25f2e75834a8b21c79ce50 testfile20K.out                                                                        
      
      [root@catalyst110:toss-4371.umm1t]# cat proc6.olaf
      #!/bin/bash
      
      set -x
      
      dd if=/dev/urandom of=testfile.in bs=10240 count=2
      dd if=testfile.in of=testfile.out bs=10240 count=2
      
      #dd if=/dev/urandom of=testfile.in bs=102400 count=2
      #dd if=testfile.in of=testfile.out bs=102400 count=2
      original_md5sum=$(md5sum testfile.out)
      echo 3 >/proc/sys/vm/drop_caches
      
      echo after drop_caches $original_md5sum $(md5sum testfile.out)
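
      For reference, here is a sketch (not part of the original reproducer) of a variant that also reports which byte ranges of the copy differ from the source once caches are dropped; file names follow the script above:

      #!/bin/bash
      # Sketch only: like proc6.olaf, but report the corrupted byte ranges
      # by comparing the copy against its source after drop_caches.
      set -e

      dd if=/dev/urandom of=testfile.in bs=10240 count=2
      dd if=testfile.in of=testfile.out bs=10240 count=2
      echo 3 >/proc/sys/vm/drop_caches

      # cmp -l prints "<offset> <octal> <octal>" for every differing byte,
      # with 1-based decimal offsets; coalesce them into hex ranges.
      cmp -l testfile.in testfile.out | awk '
          {
              off = $1 - 1
              if (NR == 1) { start = off; prev = off; next }
              if (off != prev + 1) { printf "%07x - %07x\n", start, prev; start = off }
              prev = off
          }
          END { if (NR) printf "%07x - %07x\n", start, prev }'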
      

      Attachments

        Issue Links

          Activity

            [LU-11663] corrupt data after page-unaligned write with zfs backend lustre 2.10

            ofaaland Olaf Faaland added a comment -

            I'm still working on trying previous client versions. I should have at least one other version tested today.

            For context, this issue has been observed on client cluster catalyst, which mounts three lustre file systems.

            • lustre3 hosted on porter. This is lustre 2.10.5 based.
            • lustre1 hosted on copper. This is lustre 2.8.2 based.
            • lscratchh hosted on zinc. This is lustre 2.8.2 based.

            Connections are through routers. The routers in catalyst run the same version as the clients. All nodes are x86_64. I don't recall the IB-to-IP router nodes' Lustre or kernel versions, but I can find out.

            catalyst-compute <> catalyst-router <> lustre3
            catalyst-compute <> catalyst-router <> IB-to-IP-router <> IP-to-IB-router <> (lustre1 and lscratchh)

            We have observed this issue only on lustre3 so far.

            During testing this weekend I ran two 1000-iteration test sets on 20 dedicated catalyst nodes. During both sets:

            • one node, catalyst110, reproduced the problem > 95% of the time
            • a different node reproduced the problem about 15% of the time
            • fifteen nodes never reproduced the problem

            In the first test set, I only ran the reproducer against lustre3, where the issue was first identified last week.
            In the second test set, I ran the reproducer first against lustre3 and then against lustre1. The problem was reproduced only with lustre3, the 2.10 file system. It was never reproduced with lustre1.
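
            For reference, each node's run in those test sets would look roughly like the sketch below (this is not the actual test harness; the iteration count and the proc6.olaf reproducer are the ones described above):

            ITERATIONS=1000
            failures=0
            for i in $(seq 1 "$ITERATIONS"); do
                # The reproducer prints both md5sums on its "after drop_caches"
                # line; a mismatch between the two digest fields means the file
                # changed after caches were dropped.
                result=$(./proc6.olaf 2>/dev/null |
                         awk '/^after drop_caches/ {print ($3 != $5) ? "BAD" : "OK"}')
                [ "$result" = "BAD" ] && failures=$((failures + 1))
            done
            echo "$(hostname): $failures corrupt runs out of $ITERATIONS"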

            ofaaland Olaf Faaland added a comment -

            I guess the other question is whether you tried running the reproducer on some previous version on the client? Is it possible that this is a newly introduced problem? It seems a bit strange that there would be a problem like this going unnoticed since 2.8 was released.

            I agree.  I'll try that.

            ofaaland Olaf Faaland added a comment -

            In testing since yesterday I'm sometimes finding the corruption does not occur - that is, if I run the same reproducer 60 times in a row on the same client, for example, it may show corruption 50 times in a row and then show no corruption for the last 10.

            I attached for-upload-lu-11663.tar.bz2, which contains -1 debug logs for 3 attempts, along with the terminal output from the reproducer runs and an index matching the results to the log files. I run lctl dk before and after each attempt, so there are 6 log files.

            After the first attempt, which shows the corruption, I umount all the lustre file systems and then mount them again.  I then run the same reproducer twice and no corruption occurs.  I'm not sure whether that's due to the umount/remount or not.
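
            For anyone trying to repeat this, a hedged sketch of that per-attempt debug collection; the log file names are placeholders and the debug_mb value is a guess, not the setting actually used:

            lctl set_param debug=-1                       # full ("-1") debug on the client
            lctl set_param debug_mb=1024                  # debug buffer size (a guess)
            lctl dk > /tmp/lu11663-attempt1-before.log    # dump the debug buffer before the attempt
            ./proc6.olaf                                  # run the reproducer
            lctl dk > /tmp/lu11663-attempt1-after.log     # everything logged during the attempt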


            adilger Andreas Dilger added a comment -

            I guess the other question is whether you tried running the reproducer on some previous version on the client? Is it possible that this is a newly introduced problem? It seems a bit strange that there would be a problem like this going unnoticed since 2.8 was released.


            adilger Andreas Dilger added a comment -

            Olaf, as Sarah is having trouble reproducing this, can you please run a test with -1 debug on the client? My first guess is that this is somehow related to the client IO stack. Given that there would only be a handful of operations in the log, it shouldn't be too bad to look through.


            ofaaland Olaf Faaland added a comment -

            Sarah,
            If there's any information I can provide let me know. Thanks.

            sarah Sarah Liu added a comment - edited

            Cannot reproduce it with a tip-of-master server (build 3826, el7.5, kernel-3.10.0-862.14.4.el7_lustre.x86_64) and a 2.8.0 client:
            2 MDS with 1 MDT on each; 1 OSS with 2 OSTs, ldiskfs
            1 client

            [root@trevis-60vm4 lustre]# ./rp.sh 
            + dd if=/dev/urandom of=testfile.in bs=10240 count=2
            2+0 records in
            2+0 records out
            20480 bytes (20 kB) copied, 0.00276562 s, 7.4 MB/s
            + dd if=testfile.in of=testfile.out bs=10240 count=2
            2+0 records in
            2+0 records out
            20480 bytes (20 kB) copied, 0.00142726 s, 14.3 MB/s
            ++ md5sum testfile.out
            + original_md5sum='f6bcdb9f1b674d29cd313a46a1c0cedb  testfile.out'
            + echo 3
            [ 1748.385888] rp.sh (21490): drop_caches: 3
            ++ md5sum testfile.out
            + echo after drop_caches f6bcdb9f1b674d29cd313a46a1c0cedb testfile.out f6bcdb9f1b674d29cd313a46a1c0cedb testfile.out
            after drop_caches f6bcdb9f1b674d29cd313a46a1c0cedb testfile.out f6bcdb9f1b674d29cd313a46a1c0cedb testfile.out
            [root@trevis-60vm4 lustre]# ls
            

             

            ofaaland Olaf Faaland added a comment -

            I mounted the file system from one of the OSS nodes (porter), so that the client is the same version (lustre-2.10.5_2.chaos-3.ch6.x86_64) as all the servers and communicates directly with them, not through routers.
            On catalyst, the lustre 2.8 compute cluster, I created a file using dd and bs=10240 as described above.

            When I read the file from the client mounted on the OSS, I see the corrupted data.

            This seems to me to indicate that the problem is occurring in the write path, not the read path. Does that make sense?
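
            A hedged sketch of that cross-mount check; the mount points and file path below are placeholders, not the real paths:

            FILE=toss-4371.umm1t/testfile20K.out          # placeholder path
            md5sum /mnt/lustre3-catalyst/$FILE            # 2.8.2 client on catalyst, via routers
            md5sum /mnt/lustre3-on-oss/$FILE              # 2.10.5 client mounted on porter itself
            # If both mounts return the same wrong digest, the bad bytes are
            # already on the OST objects, pointing at the write path.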

            ofaaland Olaf Faaland added a comment -

            I haven't found the objects on disk yet; I'm going back to that in a minute. But from the client, with a sample 100k test file, copies made via dd with bs=10240 always have damage in the following extents (offsets in hex). The actual content of the damaged areas is different every time.
            0002800 - 0002fff
            0007800 - 0007fff
            000c800 - 000cfff
            0011800 - 0011fff
            0016800 - 0016fff

            The rest of the file is correct.
            PAGESIZE is 4096
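
            Those offsets line up with the page-unaligned record size: 0x2800 = 10240 = 2.5 pages, so every odd-numbered dd record starts 2048 bytes into a page, and each damaged 0x800-byte extent runs from that unaligned boundary to the end of its page. A quick check of that arithmetic (mine, not from the ticket):

            BS=10240; PAGE=4096
            for rec in 1 3 5 7 9; do
                start=$((rec * BS))                       # page-unaligned record boundary
                end=$(( (start / PAGE + 1) * PAGE - 1 ))  # last byte of the page it lands in
                printf '%07x - %07x  (offset within page: %d)\n' \
                    "$start" "$end" "$((start % PAGE))"
            done
            # prints exactly the five damaged extents listed above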

            ofaaland Olaf Faaland added a comment -

            btw the script you are providing appears to be single node, but in the comment you say this requires two nodes. What's the second node for?

            Originally we reproduced the problem using two nodes: one to write the data and another to read and checksum it, to detect the problem. Once we started dropping caches, we did not need a second node.
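
            Roughly, that two-node form would look like the sketch below; the hostnames and shared directory are placeholders:

            WRITER=catalyst110                 # placeholder hostnames
            READER=catalyst111
            DIR=/path/to/lustre3/testdir       # placeholder shared directory

            ssh "$WRITER" "cd $DIR &&
                dd if=/dev/urandom of=testfile.in bs=10240 count=2 &&
                dd if=testfile.in of=testfile.out bs=10240 count=2 &&
                md5sum testfile.in testfile.out"

            # The reader sees the data through its own client cache, which is
            # cold for this file, so no drop_caches is needed on the reader.
            ssh "$READER" "cd $DIR && md5sum testfile.in testfile.out"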

            green Oleg Drokin added a comment -

            btw the script you are providing appears to be single node, but in the comment you say this requires two nodes. What's the second node for?


            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 16
