[LU-6768] Data corruption when write and truncate run in parallel in an almost-full file system Created: 26/Jun/15 Updated: 14/Sep/15 Resolved: 03/Aug/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Jingwang Zhang | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Reproduced in a virtual machine using loop device as OSD-ldiskfs disk. |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
To test the stability of the Lustre file system under continuous workload and extreme resource usage, I wrote a tool that writes data to files continuously until 'ENOSPC' occurs; the tool then truncates and deletes some old files to free space and resumes writing. Before a file is truncated, its content is verified. The problem is that after the tool has run for a while, the content of some files is wrong. If the data is corrupted, it will fail with: You can get the tool at: |
| Comments |
| Comment by Jingwang Zhang [ 26/Jun/15 ] |
|
My analysis of this problem, from an email: After a long investigation, we finally found the root cause of the problem. In short, it is a race condition between data IO (which Lustre performs directly against the block device via submit_bio()) and metadata IO (which ldiskfs performs through the block device, and which may be cached in the block device's page cache). Lustre commits data IO from clients through the osd_submit_bio() function in osd-ldiskfs/osd_io.c:

```c
static void osd_submit_bio(int rw, struct bio *bio)
{
	LASSERTF(rw == 0 || rw == 1, "%x\n", rw);
	if (rw == 0)
		submit_bio(READ, bio);
	else
		submit_bio(WRITE, bio);
}
```

However, there might be dirty data for the same blocks in the block device's page cache. It is rare, but it can happen if things occur in the following order: So I think this is a problem, and the right thing to do is to invalidate the corresponding pages in the block device's page cache before we issue the bio, to make sure there are no dirty pages that might later overwrite our data. So we propose the following change to osd_submit_bio() to invalidate the page cache:

```c
static void osd_submit_bio(int rw, struct bio *bio)
{
	struct inode *bdinode = bio->bi_bdev->bd_inode;

	LASSERTF(rw == 0 || rw == 1, "%x\n", rw);
	if (rw == 0) {
		submit_bio(READ, bio);
	} else {
		loff_t start = bio->bi_sector << 9;
		loff_t endbyte = start + bio->bi_size - 1;

		/* Invalidate the page cache in the block device, otherwise
		 * the dirty data in the block device's page cache might
		 * corrupt the data we are going to write. */
		truncate_pagecache_range(bdinode, start, endbyte);
		submit_bio(WRITE, bio);
	}
}
```
 |
| Comment by Peter Jones [ 03/Jul/15 ] |
|
Alex, could you please advise on this issue? Thanks. Peter |
| Comment by Alex Zhuravlev [ 13/Jul/15 ] |
|
> 1. A file got truncated, and its extent blocks are updated to complete the truncation. So those blocks became dirty.

(2) is not quite correct: metadata blocks aren't freed immediately; instead they are scheduled for release upon commit. Also, we already have calls to unmap_underlying_metadata(), which is supposed to do what you suggested. |
| Comment by Alex Zhuravlev [ 13/Jul/15 ] |
|
What kernel version are you using? |
| Comment by Gerrit Updater [ 13/Jul/15 ] |
|
Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/15593 |
| Comment by Alex Zhuravlev [ 13/Jul/15 ] |
|
Jingwang, would you mind trying http://review.whamcloud.com/15593, please? |
| Comment by Jingwang Zhang [ 14/Jul/15 ] |
|
Thanks for looking into this. I'm using CentOS 6.5 with kernel version 2.6.32-431.29.2. I will try the fix and get back to you later. |
| Comment by Jingwang Zhang [ 14/Jul/15 ] |
|
I ran the reproducer for 7 hours and it didn't fail after applying the patch, whereas it would fail within minutes without the patch, so I believe the problem is fixed. |
| Comment by Alex Zhuravlev [ 14/Jul/15 ] |
|
Thanks for the report and testing. Please inspect the patch and help move it forward. |
| Comment by Gerrit Updater [ 03/Aug/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15593/ |
| Comment by Peter Jones [ 03/Aug/15 ] |
|
Landed for 2.8 |
| Comment by Jay Lan (Inactive) [ 04/Aug/15 ] |
|
This may help us on |
| Comment by Jay Lan (Inactive) [ 06/Aug/15 ] |
|
Is the patch needed by server or client, or both? |
| Comment by Gerrit Updater [ 06/Aug/15 ] |
|
Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/15904 |
| Comment by Jian Yu [ 06/Aug/15 ] |
|
Hi Jay, |