[LU-11663] corrupt data after page-unaligned write with zfs backend lustre 2.10

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.6
    • Affects Version/s: Lustre 2.12.0, Lustre 2.10.5, Lustre 2.10.6
    • Environment: client catalyst: lustre-2.8.2_5.chaos-1.ch6.x86_64
      server: porter lustre-2.10.5_2.chaos-3.ch6.x86_64
      kernel-3.10.0-862.14.4.1chaos.ch6.x86_64 (RHEL 7.5 derivative)
    • Severity: 2

    Description

      The apparent contents of a file change after dropping caches:

      [root@catalyst110:toss-4371.umm1t]# ./proc6.olaf
      + dd if=/dev/urandom of=testfile20K.in bs=10240 count=2
      2+0 records in
      2+0 records out
      20480 bytes (20 kB) copied, 0.024565 s, 834 kB/s
      + dd if=testfile20K.in of=testfile20K.out bs=10240 count=2
      2+0 records in
      2+0 records out
      20480 bytes (20 kB) copied, 0.0451045 s, 454 kB/s
      ++ md5sum testfile20K.out
      + original_md5sum='1060a4c01a415d7c38bdd00dcf09dd22  testfile20K.out'
      + echo 3
      ++ md5sum testfile20K.out
      + echo after drop_caches 1060a4c01a415d7c38bdd00dcf09dd22 testfile20K.out 717122f4dd25f2e75834a8b21c79ce50 testfile20K.out
      after drop_caches 1060a4c01a415d7c38bdd00dcf09dd22 testfile20K.out 717122f4dd25f2e75834a8b21c79ce50 testfile20K.out                                                                        
      
      [root@catalyst110:toss-4371.umm1t]# cat proc6.olaf
      #!/bin/bash
      
      set -x
      
      dd if=/dev/urandom of=testfile.in bs=10240 count=2
      dd if=testfile.in of=testfile.out bs=10240 count=2
      
      #dd if=/dev/urandom of=testfile.in bs=102400 count=2
      #dd if=testfile.in of=testfile.out bs=102400 count=2
      original_md5sum=$(md5sum testfile.out)
      echo 3 >/proc/sys/vm/drop_caches
      
      echo after drop_caches $original_md5sum $(md5sum testfile.out)
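
      The bs=10240 writes are what makes this a page-unaligned case: with 4 KiB pages, each 10240-byte dd record covers two full pages plus a 2048-byte partial page, so every record ends in the middle of a page. A quick check, assuming a 4096-byte PAGE_SIZE (see getconf PAGESIZE):

      echo $((10240 % 4096))   # -> 2048, i.e. each record ends 2048 bytes into a page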
      

    Activity

            bzzz Alex Zhuravlev added a comment - - edited

            There are two options here: 1) revert LU-10683 (but potentially get the bad RPC checksum messages back), or 2) apply the https://review.whamcloud.com/#/c/33726/ patch, which is still under testing.
            Both options have worked against our reproducer (see the patch in option #2) on b2_10.
            We are still investigating the root cause of LU-10683 (bad checksums).


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33748
            Subject: LU-11663 osd-zfs: write partial pages with correct offset
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 6f9a0292eacb0d603b14cc03290a574cb7f0c846

            pjones Peter Jones added a comment -

            Strange. Alex's patch did not get an auto comment - https://review.whamcloud.com/#/c/33726/. As I understand it, this patch seems to be holding up well against the reproducer, but the test cases need some refinement. Are we now at the point where LLNL can use a b2_10 port of this patch on their affected filesystem?


            paf Patrick Farrell (Inactive) added a comment -

            Olaf,

            If a bug is opened for the grant issue, could you tag me on it? Thx.

            green Oleg Drokin added a comment -

            By the way, since we are concentrating this ticket on the data corruption: if you want to pursue why some nodes are stuck with no grant, and do not appear to be getting any more grant until remount, you should probably open another ticket for that.

            pjones Peter Jones added a comment -

            Alex

            Can you please investigate?

            Peter

            green Oleg Drokin added a comment -

            Shortest reproducer:

            lctl set_param fail_loc=0x411
            dd if=/dev/urandom of=testfile.in bs=10240 count=2
            md5sum testfile.in
            lctl set_param ldlm.namespaces.*osc*.lru_size=clear
            md5sum testfile.in
            
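            A slightly expanded sketch of the same reproducer, which records both checksums, resets the fail_loc afterwards, and flags a mismatch. 0x411 is the client-side fail_loc used above to force the no-grant sync-write path, and clearing the DLM LRU cancels the client locks so cached pages are dropped and the second md5sum is served by the OST:

            #!/bin/bash
            set -ex
            # Force the no-grant codepath so the writes go out as sync RPCs
            lctl set_param fail_loc=0x411
            dd if=/dev/urandom of=testfile.in bs=10240 count=2
            sum_before=$(md5sum testfile.in)
            # Cancel client DLM locks; cached pages are discarded and the
            # next read has to come back from the OST
            lctl set_param ldlm.namespaces.*osc*.lru_size=clear
            sum_after=$(md5sum testfile.in)
            lctl set_param fail_loc=0
            echo "before: $sum_before"
            echo "after:  $sum_after"
            [ "$sum_before" = "$sum_after" ] || echo "MISMATCH: contents changed after lock cancel"
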
            green Oleg Drokin added a comment - - edited

            Ok, I can reproduce this on master now too. There are two requirements: sync writes due to lack of grant/quota and ZFS. ldiskfs works fine.

            In order to force the lack of quota codepath we can use the 0x411 failloc on the client like this: lctl set_param fail_loc=0x411

            Then run a script inspired by the original one in a Lustre dir:

            dd if=/dev/urandom of=testfile.in bs=10240 count=2 
            dd if=testfile.in of=testfile.out bs=10240 count=2
            original_md5sum=$(md5sum testfile.in)
            echo 3 | sudo tee /proc/sys/vm/drop_caches ; sleep 2
            md5sum=$(md5sum testfile.out)
            echo after drop_caches $md5sum before $original_md5sum
            

            Set this and you'll see the problem 100% of the time. What's interesting is that passing oflag=sync to dd does not help, as it still results in full-page writes in the RPC for partial-page writes on the VFS side.

            It appears that the problem is either in ZFS or, more likely, in osd-zfs: when a partial page write happens, the previous content of the page is not read from disk, so we update the page with the partial content we got in the RPC but overwrite whatever was supposed to remain in the part of the page we are not updating.

            Comparing osd_write_prep, we can see it's a noop in osd-zfs, whereas in osd-ldiskfs it actually prereads all partial pages. On the other hand, osd_write in osd-zfs uses dmu_write(_by_dnode) with an offset, so perhaps ZFS itself is expected to handle this?

            Either way at least it's clear what's going on now, hence this update.

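            To make the suspected mechanism concrete, here is a plain-file sketch (no Lustre involved) of the failure mode described above, under the assumption that the 2048-byte partial page at offset 8192 gets flushed as a full 4096-byte page whose tail was never read back from disk:

            # Reference file: 2 x 10240 bytes, so the first write ends halfway into page 2
            dd if=/dev/urandom of=ref.bin bs=10240 count=2
            cp ref.bin broken.bin
            # Take page 2 of the file (bytes 8192-12287)...
            dd if=ref.bin of=page.bin bs=4096 skip=2 count=1
            # ...pretend its second half was never read from disk (stale, zero-filled tail)...
            dd if=/dev/zero of=page.bin bs=1 seek=2048 count=2048 conv=notrunc
            # ...and write it back as a full page at the page-aligned offset
            dd if=page.bin of=broken.bin bs=4096 seek=2 count=1 conv=notrunc
            # broken.bin now differs from ref.bin in the byte range 10240-12287
            cmp ref.bin broken.bin
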
            adilger Andreas Dilger added a comment - - edited

            Olaf, I think Oleg was referring to the space grant, which can be seen on the OSS with "lctl get_param obdfilter.*.tot_granted", and the amount granted to the client with "lctl get_param osc.*.cur_grant_bytes" (probably only relevant for the OST the file was striped over). Also useful would be "lctl get_param osc.*.max_dirty_mb".

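            For convenience, the checks mentioned above, run on the respective nodes:

            # On the OSS: total grant the OSTs have handed out
            lctl get_param obdfilter.*.tot_granted
            # On the client: grant currently held per OSC, plus the dirty-page cap
            lctl get_param osc.*.cur_grant_bytes osc.*.max_dirty_mb
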
            ofaaland Olaf Faaland added a comment -

            Oleg, I've uploaded lu-11663-2018-11-26.tgz, which contains the test files and debug logs on both client and server during two tests: one iteration that reproduces the issue, on client catalyst101, and one where the corruption does not occur, on client catalyst106. There's a typescript file that shows the output of the test as it ran. In both cases the stripe index of the files is 0.

            The node which fails the test takes much longer to write the data, consistent with the sync writes you saw in the last debug logs.

            The file system where this is occurring is 28% full, with individual OSTs ranging from 25% full to 31% full.
            The amount of data I personally have stored on each OST ranges from 23M to 308M; there are 80 OSTs. My total usage is 5.37G and total quota is 18T. lfs quota says total allocated block limit is 5T, and each OST reports a limit of 64G.

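            The per-OST allocated limits mentioned above can be listed with "lfs quota -v", which adds the per-OST/per-MDT breakdown of the block limits (user and mount point below are placeholders):

            lfs quota -v -u $USER /path/to/lustre/mount
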
            green Oleg Drokin added a comment -

            I reviewed the -1 logs.

            Interesting observations I have:
            1. Before the unmount, you had some strange grant shortage, which means every write was actually synchronous. You can track this by messages that say this:

            00000008:00000020:37.0:1542388236.271202:0:6433:0:(osc_cache.c:1608:osc_enter_cache()) lustre3-OST002a-osc-ffff8a1c6ed62800: grant { dirty: 0/8192 dirty_pages: 0/16449536 dropped: 0 avail: 0, reserved: 0, flight: 0 }lru {in list: 0, left: 3, waiters: 0 }no grant space, fall back to sync i/o
            

            2. So this leads to a sync write in the middle of a page, twice.
            3. This actually hit both the .in and the .out file, but only the .out file is somehow damaged, huh?
            4. We know that we are writing the correct data to the server because we can observe both write requests, to the .in and .out files, and the checksums come out the same; see the "checksum at write origin" message repeated twice for the same request.
            We cannot see whether that is also what was read back, though, because the final read comes after readahead, so all 4 pages are read in one go and the checksum is not comparable (an interesting experiment would have been to disable readahead, or do directio reads or some such, to see whether the bad data comes straight from the server, which I think it does, but we cannot be 100% sure).

            Now looking at the successful iterations after remounts, there are two major differences:
            1. There's plenty of grant, so no sync writes are happening.
            2. The drop_caches does nothing: there are NO write RPCs in those logs (grep for 'o4-' to confirm). There are no reads either (grep for 'checksum .* confirmed'; you see only two requests with an fffffff checksum, which is the empty read at EOF).

            These two things combined mean that whatever corruption you had, even if it is happening, would not be seen.

            Anyway, my current conclusion is that the corruption is actually happening on the server; it could be the disk or Lustre somewhere, I don't know about that, but the client seems to be doing everything ok.

            As such, I suspect we would need client+server logs of a reproduced case. Please also include both the .in and .out files so we can compare them. To make reproducing easier, you might want to dramatically shrink grant availability somehow (is the fs more prone to this when it is mostly full? are there quotas in place that are getting low?). I do wonder if the same thing happens when you use directio straight from dd, but since the writes are not page-aligned, that cannot happen, so we have no easy way of triggering the sync io otherwise, huh.

            I'll see if I can find a way to trigger sync io deliberately.

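            For anyone re-running this analysis, the specific checks referenced above can be run against a dumped client debug log (assuming it was saved with something like "lctl dk client.log"; the file name is just an example):

            # Did writes fall back to sync I/O because of a grant shortage?
            grep 'no grant space, fall back to sync i/o' client.log
            # Checksums of the data as the client sent it
            grep 'checksum at write origin' client.log
            # Any OST write RPCs at all after drop_caches?
            grep 'o4-' client.log
            # Read replies and their checksums
            grep -E 'checksum .* confirmed' client.log
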

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 16