[LU-6768] Data corruption when write and truncate in parallel in an almost-full file system Created: 26/Jun/15  Updated: 14/Sep/15  Resolved: 03/Aug/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Jingwang Zhang Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None
Environment:

Reproduced in a virtual machine using loop device as OSD-ldiskfs disk.


Issue Links:
Related
is related to LU-6925 oss buffer cache corruption Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

In order to test the stability of the Lustre file system under continuous workload and extreme resource usage, I wrote a tool that writes data to files continuously until ENOSPC occurs; the tool then truncates and deletes some old files to free space and continues writing. Before a file is truncated, its content is verified. The problem is that after the tool runs for a while, the content of some files is wrong.

When corruption is detected, the tool aborts with:
yaft: main.cpp:81: void check_file_content(const std::string&): Assertion `rbuf.checkAt(pos)' failed.
Aborted

You can get the tool at:
https://github.com/zhang-jingwang/yaft.git



 Comments   
Comment by Jingwang Zhang [ 26/Jun/15 ]

My analysis of this problem, from an email:
During our tests on Lustre, we found data corruption when we write to some files while truncating/deleting other files at the same time. Usually one block in the file contains wrong data, which looks very much like a metadata block (ext4 extents).

After a long investigation, we finally found the root cause of the problem. In short, it is a race condition between data IO (which Lustre submits directly to the block device via submit_bio()) and metadata IO (which ldiskfs performs through the block device and which may be cached in the block device's page cache).

Lustre commits data IO from clients through the osd_submit_bio() function in osd-ldiskfs/osd_io.c:

osd_io.c
static void osd_submit_bio(int rw, struct bio *bio)
{
        LASSERTF(rw == 0 || rw == 1, "%x\n", rw);
        if (rw == 0)
                submit_bio(READ, bio);
        else
                submit_bio(WRITE, bio);
}

However, there might be dirty data for the same blocks in the block device's page cache. It is rare, but it can happen when events occur in the following order:
1. A file is truncated, and its extent blocks are updated to complete the truncation, so those blocks become dirty in the page cache.
2. The file is deleted, so all of its metadata blocks are now free.
3. One of those metadata blocks is reallocated as a data block to hold a client's data, and is written by osd_submit_bio().
4. The kernel then decides to flush the dirty pages; the data block is overwritten by the stale metadata and the data is corrupted.

So I think the right fix is to invalidate the corresponding pages in the block device's page cache before issuing the bio, to make sure there are no dirty pages left that could overwrite our data later. We therefore propose the following change to osd_submit_bio() to invalidate the page cache:

osd_io.c
static void osd_submit_bio(int rw, struct bio *bio)
{
        struct inode *bdinode = bio->bi_bdev->bd_inode;

        LASSERTF(rw == 0 || rw == 1, "%x\n", rw);
        if (rw == 0) {
                submit_bio(READ, bio);
        } else {
                /* bi_sector counts 512-byte sectors, hence the << 9 */
                loff_t start = bio->bi_sector << 9;
                loff_t endbyte = start + bio->bi_size - 1;

                /* Invalidate the page cache in the block device; otherwise
                 * dirty data in the block device's page cache might corrupt
                 * the data we are about to write. */
                truncate_pagecache_range(bdinode, start, endbyte);
                submit_bio(WRITE, bio);
        }
}
Comment by Peter Jones [ 03/Jul/15 ]

Alex

Could you please advise on this issue?

Thanks

Peter

Comment by Alex Zhuravlev [ 13/Jul/15 ]

> 1. A file got truncated, and its extent blocks are updated to complete the truncation. So those blocks became dirty.
> 2. The file is deleted, so all its metadata blocks are free now.
> 3. One of the metadata blocks is used as data block to hold client’s data. It will be updated by osd_submit_bio().

(2) is not quite correct: metadata blocks aren't freed immediately; instead they are scheduled for release upon commit.
See ldiskfs_mb_free_blocks() for the details (the metadata != 0 case).

Also, we already have calls to unmap_underlying_metadata(), which is supposed to do what you suggest.
I'm still looking at the code.

Comment by Alex Zhuravlev [ 13/Jul/15 ]

what kernel version do you use?

Comment by Gerrit Updater [ 13/Jul/15 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/15593
Subject: LU-6768 osd: unmap reallocated blocks
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: facf7f47ced7debdbaf0d814158661857176b4bf

Comment by Alex Zhuravlev [ 13/Jul/15 ]

Jingwang, would you mind trying http://review.whamcloud.com/15593 please?

Comment by Jingwang Zhang [ 14/Jul/15 ]

Thanks for looking into this.

I'm using CentOS 6.5 with kernel version 2.6.32.431.29.2. I will try the fix and get back to you later.

Comment by Jingwang Zhang [ 14/Jul/15 ]

I ran the reproducer for 7 hours and it didn't fail after applying the patch, whereas it fails within minutes without the patch, so I believe the problem is fixed.

Comment by Alex Zhuravlev [ 14/Jul/15 ]

Thanks for the report and testing. Please inspect the patch and help move it forward.

Comment by Gerrit Updater [ 03/Aug/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15593/
Subject: LU-6768 osd: unmap reallocated blocks
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bcef61a80ab4fa6cee847722184738ba4deeb971

Comment by Peter Jones [ 03/Aug/15 ]

Landed for 2.8

Comment by Jay Lan (Inactive) [ 04/Aug/15 ]

This may help us on LU-6925. Can we get a b2_5 backport? Thanks!

Comment by Jay Lan (Inactive) [ 06/Aug/15 ]

Is the patch needed on the server, the client, or both?
It looks like a server patch.

Comment by Gerrit Updater [ 06/Aug/15 ]

Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/15904
Subject: LU-6768 lvfs: unmap reallocated blocks
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: c8448bce0ad13aeb65c48905e080ddb0c536fc91

Comment by Jian Yu [ 06/Aug/15 ]

Hi Jay,
The patch is only needed on the server.

Generated at Sat Feb 10 02:03:04 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.