LU-11663: corrupt data after page-unaligned write with ZFS backend (Lustre 2.10)

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.6
    • Affects Version/s: Lustre 2.12.0, Lustre 2.10.5, Lustre 2.10.6
    • Environment:
      client catalyst: lustre-2.8.2_5.chaos-1.ch6.x86_64
      server: porter lustre-2.10.5_2.chaos-3.ch6.x86_64
      kernel-3.10.0-862.14.4.1chaos.ch6.x86_64 (RHEL 7.5 derivative)
    • Severity: 2

    Description

      The apparent contents of a file change after dropping caches:

      [root@catalyst110:toss-4371.umm1t]# ./proc6.olaf
      + dd if=/dev/urandom of=testfile20K.in bs=10240 count=2
      2+0 records in
      2+0 records out
      20480 bytes (20 kB) copied, 0.024565 s, 834 kB/s
      + dd if=testfile20K.in of=testfile20K.out bs=10240 count=2
      2+0 records in
      2+0 records out
      20480 bytes (20 kB) copied, 0.0451045 s, 454 kB/s
      ++ md5sum testfile20K.out
      + original_md5sum='1060a4c01a415d7c38bdd00dcf09dd22  testfile20K.out'
      + echo 3
      ++ md5sum testfile20K.out
      + echo after drop_caches 1060a4c01a415d7c38bdd00dcf09dd22 testfile20K.out 717122f4dd25f2e75834a8b21c79ce50 testfile20K.out
      after drop_caches 1060a4c01a415d7c38bdd00dcf09dd22 testfile20K.out 717122f4dd25f2e75834a8b21c79ce50 testfile20K.out                                                                        
      
      [root@catalyst110:toss-4371.umm1t]# cat proc6.olaf
      #!/bin/bash
      
      set -x
      
      dd if=/dev/urandom of=testfile.in bs=10240 count=2
      dd if=testfile.in of=testfile.out bs=10240 count=2
      
      #dd if=/dev/urandom of=testfile.in bs=102400 count=2
      #dd if=testfile.in of=testfile.out bs=102400 count=2
      original_md5sum=$(md5sum testfile.out)
      echo 3 >/proc/sys/vm/drop_caches
      
      echo after drop_caches $original_md5sum $(md5sum testfile.out)
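
      For repeated runs, a small wrapper along these lines (an illustrative sketch, not part of the original reproducer; the iteration count and working directory are assumptions) can run the same sequence many times and count how often the checksum changes after drop_caches:

      #!/bin/bash
      # Hedged sketch: repeat the reproducer above N times (default 100) and
      # count how many iterations show a different md5sum after drop_caches.
      # Run from a directory on the Lustre mount, as root.
      runs=${1:-100}
      bad=0
      for i in $(seq 1 "$runs"); do
          dd if=/dev/urandom of=testfile.in bs=10240 count=2 2>/dev/null
          dd if=testfile.in of=testfile.out bs=10240 count=2 2>/dev/null
          before=$(md5sum < testfile.out)
          echo 3 > /proc/sys/vm/drop_caches
          after=$(md5sum < testfile.out)
          [ "$before" != "$after" ] && bad=$((bad + 1))
      done
      echo "$bad of $runs iterations changed checksum after drop_caches"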
      

    Attachments

    Issue Links

    Activity

            ofaaland Olaf Faaland added a comment -

            Oleg, I've uploaded lu-11663-2018-11-26.tgz, which contains the test files and debug logs from both client and server during two tests: one iteration that reproduces the issue, on client catalyst101, and one where the corruption does not occur, on client catalyst106. There's a typescript file that shows the output of the test as it ran. In both cases the stripe index of the files is 0.

            The node which fails the test takes much longer to write the data, consistent with the sync writes you saw in the last debug logs.

            The file system where this is occurring is 28% full, with individual OSTs ranging from 25% full to 31% full.
            The amount of data I personally have stored on each OST ranges from 23M to 308M; there are 80 OSTs. My total usage is 5.37G and total quota is 18T. lfs quota says total allocated block limit is 5T, and each OST reports a limit of 64G.
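
            For reference, a hedged sketch of how the per-OST usage and quota figures above could be gathered; the mount point and user name are assumptions for illustration:

            lfs df -h /mnt/lustre3                 # per-OST capacity and fullness
            lfs quota -v -u ofaaland /mnt/lustre3  # per-OST usage and block limits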

            green Oleg Drokin added a comment -

            I reviewed the -1 logs.

            Interesting observations I have:
            1. Before the unmount, you had some strange grant shortage, which means every write was actually synchronous. You can track this by messages like the following:

            00000008:00000020:37.0:1542388236.271202:0:6433:0:(osc_cache.c:1608:osc_enter_cache()) lustre3-OST002a-osc-ffff8a1c6ed62800: grant { dirty: 0/8192 dirty_pages: 0/16449536 dropped: 0 avail: 0, reserved: 0, flight: 0 }lru {in list: 0, left: 3, waiters: 0 }no grant space, fall back to sync i/o
            

            2. So this leads to a sync write in the middle of a page, twice.
            3. This actually happens for both the .in and .out files, but only the .out file is somehow damaged, huh?
            4. We know that we are writing the correct data to the server, because we can observe both write requests (to the .in and .out files) and the checksums come out the same; see the "checksum at write origin" message repeated twice for the same request.
            We cannot see whether that is what was read back, though, because the final read comes after readahead, so all 4 pages are read in one go and the checksum is not comparable. (An interesting experiment would be to disable readahead or do direct I/O reads, to see whether the bad data comes straight from the server; I think it does, but we cannot be 100% sure.)

            Now looking at the successful iterations after remounts, there are two major differences there:
            1. There's plenty of grant so no sync writes are happening.
            2. The drop_caches does nothing; there are NO write RPCs in those logs (grep for 'o4-' to confirm). There are no reads either (grep for 'checksum .* confirmed'; you see only two requests with the fffffff checksum, which is the empty read at EOF).

            These two things combined mean that whatever corruption you had, even if it is happening, would not be seen.

            Anyway, my current conclusion is that the corruption is actually happening on the server. It could be the disk or Lustre somewhere, I don't know about that, but the client seems to be doing everything OK.

            As such, I suspect we will need client+server logs of a reproduced case. Please also include both the .in and .out files so we can compare them. To make reproduction easier, you might want to dramatically shrink grant availability somehow (is the fs more prone to this when it is mostly full? are there quotas in place that are getting low?). I do wonder whether the same thing happens when you use direct I/O straight from dd, but since the writes are not page-aligned, that cannot happen, and we have no easy way of triggering the sync I/O otherwise.

            I'll see if I can find a way to trigger sync io deliberately.
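
            Following up on the readahead point above, a hedged sketch (not from the ticket) of checking the per-OSC grant and then re-reading the corrupt file with readahead disabled and again with O_DIRECT, to see whether the bad data comes straight from the server. The file path and the llite/osc parameter names are assumptions based on standard Lustre 2.x tunables:

            #!/bin/bash
            set -x
            f=/mnt/lustre3/testfile.out            # hypothetical path to the bad file

            # Current grant per OSC (a shortage here forces sync writes):
            lctl get_param osc.*.cur_grant_bytes

            # 1. Cold-cache read with client readahead turned off.
            old_ra=$(lctl get_param -n llite.*.max_read_ahead_mb | head -1)
            lctl set_param llite.*.max_read_ahead_mb=0
            echo 3 > /proc/sys/vm/drop_caches
            md5sum "$f"
            lctl set_param llite.*.max_read_ahead_mb="$old_ra"

            # 2. O_DIRECT read, bypassing the client page cache entirely
            #    (the 20480-byte file is a multiple of 4096, so aligned reads are fine).
            echo 3 > /proc/sys/vm/drop_caches
            dd if="$f" bs=4096 iflag=direct 2>/dev/null | md5sum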

            ofaaland Olaf Faaland added a comment -

            Can you suggest information to capture from my clients where the problem is reproducing? As I mentioned, even on my cluster during testing over the weekend, some nodes reproduced reliably, but many never did.

            Poke

            ofaaland Olaf Faaland added a comment -

            Is there some pattern around which nodes can hit this issue vs those that don't?

            Not that I've been able to find.

            pjones Peter Jones added a comment -

            Olaf

            Is there some pattern around which nodes can hit this issue vs those that don't?

            Peter

            ofaaland Olaf Faaland added a comment -

            I've had issues getting other versions of the client to work due to changes in IB with map_on_demand, peer_credits, etc. in recent versions, but I think I'm past that.

            Today I reproduced the issue with client version 2.8.2_2.chaos.  I'll try with earlier and later clients tomorrow.
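
            For reference, a hedged sketch of inspecting the o2iblnd tunables mentioned above on a running node; the sysfs paths are assumptions and may differ depending on how the module is packaged:

            # Check the ko2iblnd module options that changed between versions.
            for p in map_on_demand peer_credits concurrent_sends; do
                printf '%s = ' "$p"
                cat /sys/module/ko2iblnd/parameters/"$p" 2>/dev/null || echo '(unavailable)'
            done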

            ofaaland Olaf Faaland added a comment -

            Thanks for trying those, Sarah. Can you suggest information to capture from my clients where the problem is reproducing? As I mentioned, even on my cluster during testing over the weekend, some nodes reproduced reliably, but many never did.

            sarah Sarah Liu added a comment - - edited

            I downloaded lustre-2.10.5_2.chaos.tar.gz and lustre-2.8.2_1.chaos.tar.gz (I cannot find 2.8.2_5) from https://github.com/LLNL/lustre/releases, compiled them, and cannot reproduce the issue.

            Server: compiled lustre-2.10.5_2.chaos on kernel 3.10.0-862.14.4.el7_lustre.x86_64
            1 MDT, 1 OST on a single node

            [root@trevis-60vm1 utils]# uname -a
            Linux trevis-60vm1.trevis.whamcloud.com 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Thu Nov 8 07:41:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
            [root@trevis-60vm1 utils]#
            [root@trevis-60vm1 ~]# lu-11663/lustre-2.10.5_2.chaos/lustre/utils/lctl get_param version
            version=2.10.5
            [root@trevis-60vm1 ~]#
            
            [root@trevis-60vm1 ~]# rpm -qa|grep zfs
            libzfs2-0.7.9-1.el7.x86_64
            kmod-lustre-osd-zfs-2.11.56_140_g2339e1b-1.el7.x86_64
            libzfs2-devel-0.7.9-1.el7.x86_64
            lustre-osd-zfs-mount-2.11.56_140_g2339e1b-1.el7.x86_64
            zfs-0.7.9-1.el7.x86_64
            ...
            

            Single client: compiled lustre-2.8.2_1.chaos on kernel 3.10.0-327.3.1.el7.x86_64 (the 2.8.0 kernel)

            [root@trevis-60vm4 ~]# lu-11663/lustre-2.8.2_1.chaos/lustre/utils/lctl get_param version
            version=lustre: 2.8.2
            kernel: patchless_client
            build:  2.8.2
            [root@trevis-60vm4 ~]#
            [root@trevis-60vm4 ~]# uname -a
            Linux trevis-60vm4.trevis.whamcloud.com 3.10.0-327.3.1.el7.x86_64 #1 SMP Fri Nov 20 05:40:26 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
            [root@trevis-60vm4 ~]#
            

            Result:

            [root@trevis-60vm4 ~]# cd /mnt/lustre/
            [root@trevis-60vm4 lustre]# sh foo.sh 
            + dd if=/dev/urandom of=testfile.in bs=10240 count=2
            2+0 records in
            2+0 records out
            20480 bytes (20 kB) copied, 0.00236147 s, 8.7 MB/s
            + dd if=testfile.in of=testfile.out bs=10240 count=2
            2+0 records in
            2+0 records out
            20480 bytes (20 kB) copied, 0.00111802 s, 18.3 MB/s
            ++ md5sum testfile.out
            + original_md5sum='20dd24fb015feb7de67bbdc12f2c16bf  testfile.out'
            + echo 3
            ++ md5sum testfile.out
            + echo after drop_caches 20dd24fb015feb7de67bbdc12f2c16bf testfile.out 20dd24fb015feb7de67bbdc12f2c16bf testfile.out
            after drop_caches 20dd24fb015feb7de67bbdc12f2c16bf testfile.out 20dd24fb015feb7de67bbdc12f2c16bf testfile.out
            [root@trevis-60vm4 lustre]#
            

            Also tried a client built from branch origin/2.8.2-llnl of fs/lustre-release-fe-llnl,
            top commit 8356dd88e2e59edd1462bb4647f61d5a210d4262.
            Ran the reproducer 10 times; cannot reproduce.


            ofaaland Olaf Faaland added a comment -

            I'm still working on trying previous client versions. I should have at least one other version tested today.

            For context, this issue has been observed on client cluster catalyst, which mounts three lustre file systems.

            • lustre3 hosted on porter. This is lustre 2.10.5 based.
            • lustre1 hosted on copper. This is lustre 2.8.2 based.
            • lscratchh hosted on zinc. This is lustre 2.8.2 based.

            Connections are through routers. The routers in catalyst are the same version as the clients. All nodes are x86_64. I don't recall the IB-to-IP router nodes' Lustre or kernel versions, but I can find out.

            catalyst-compute <> catalyst-router <> lustre3
            catalyst-compute <> catalyst-router <> IB-to-IP-router <> IP-to-IB-router <> (lustre1 and lscratchh)

            We have observed this issue only on lustre3 so far.

            During testing this weekend I ran two 1000-iteration test sets on 20 dedicated catalyst nodes. During both sets:

            • one node, catalyst110, reproduced the problem > 95% of the time
            • a different node reproduced the problem about 15% of the time
            • fifteen nodes never reproduced the problem

            In the first test set, I only ran the reproducer against lustre3, where the issue was first identified last week.
            In the second test set, I ran the reproducer first against lustre3 and then against lustre1. The problem was reproduced only with lustre3, the 2.10 file system. It was never reproduced with lustre1.

            ofaaland Olaf Faaland added a comment -

            I guess the other question is whether you tried running the reproducer on some previous version on the client? Is it possible that this is a newly introduced problem? It seems a bit strange that there would be a problem like this going unnoticed since 2.8 was released.

            I agree.  I'll try that.

            ofaaland Olaf Faaland added a comment -

            In testing since yesterday I'm sometimes finding the corruption does not occur - that is, if I run the same reproducer 60 times in a row on the same client, for example, it may show corruption 50 times in a row and then show no corruption for the last 10.

            I attached for-upload-lu-11663.tar.bz2, which has -1 debug logs for 3 attempts, along with the terminal output from when I ran the reproducer and an index matching the results to the log files.  I ran lctl dk before and after each attempt, so there are 6 log files.

            After the first attempt, which shows the corruption, I unmounted all the Lustre file systems and then mounted them again.  I then ran the same reproducer twice and no corruption occurred.  I'm not sure whether that's due to the umount/remount or not.
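
            For reference, a hedged sketch of the capture procedure described above: enable full (-1) debug logging and dump the kernel debug buffer before and after one reproducer attempt. The log file names and the reproducer path are assumptions.

            #!/bin/bash
            attempt=$1
            lctl set_param debug=-1
            lctl dk > "lu-11663.attempt${attempt}.before.log"   # drain anything already buffered
            bash ./proc6.olaf                                    # the reproducer shown in the description
            lctl dk > "lu-11663.attempt${attempt}.after.log"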


            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 16
