[LU-1573] avoid data corruption for direct io data Created: 27/Jun/12  Updated: 03/Feb/17  Resolved: 03/Feb/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Critical
Reporter: Alexey Lyashkov Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None
Environment:

any lustre version


Issue Links:
Duplicate
Story Points: 4
Severity: 3
Rank (Obsolete): 4001

 Description   

when we call a shutdown (without -f) we a set a 'notransno' flag to put all requests in replay queue.

case 'A':
                                LCONSOLE_WARN("Failing over %s\n",
                                              obd->obd_name);
                                obd->obd_fail = 1;
                                obd->obd_no_transno = 1;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

if that will be raced with obd_commitrw_write() which process a DIO request - reply will be sent without last_commited trasno update.
ptlrpc client will be put that request in replay queue as transno > last_commited and send a completion event to user land application... we have a request with brw pages directly pointed in user data in replay queue.
OOPS.

If user land application will be reused a same buffer for different data after exit from write(2) call, ptlrpc will started to replay that request, but send a invalid data to the OST. so we a corrupt data on OST side.

replicate that bug is very easy.
use lctl --device notransno command and use directio write. that will don't blocked and exited - but ptlrpc request will have a pointer to userspace.

we found that bug in testing DIO under failover.
we call a default replay_barier / fail functions on ost side and see - sometimes file a corrupted.
corruption fully addressed to the requests replayed after reconnect.
after disable sending a reply from a OST to the client for sync journal case - we have found that bug fixes,
but looks it's affected not just testing environment - but race window be smaller.



 Comments   
Comment by Alexey Lyashkov [ 27/Jun/12 ]

remote: http://review.whamcloud.com/3197
remote:

patch created as port of 2.1 code, so question about create same for osd_io.c and possible for osd-zfs.

Comment by Andreas Dilger [ 04/Jul/12 ]

Alex, I was looking to see how this would apply to OFD/osd-ldiskfs/osd-zfs, but it is not at all obvious due to changes in this code.

Could you (or Mike, as you see fit) please look into porting this patch for the updated OFD/OSD code.

Comment by Alex Zhuravlev [ 05/Jul/12 ]

shouldn't we just always set transno to 0 on OSS side for sync requests?

Comment by Oleg Drokin [ 05/Jul/12 ]

I personally think this is not really a "critical" issue.

Sure the corruption is real, but the condition is not all that normal.
It requires a lustre umount (a soft failover) that's hopefully a pretty rare operation. Typically a node just dies and that's it.

Soft failovers are usually only done in testing for simplicity and speed, or for administrative reasons.

I guess the only case not like this is the fail back once the primary node comes back online.

Comment by Alexey Lyashkov [ 05/Jul/12 ]

Alex,

for sync requests - we need always reply with request transno <= committed transno, or don't send reply at all.
or provide more changes on client side to avoid moving that requests from sending to replay list.

Comment by Alex Zhuravlev [ 05/Jul/12 ]

for sync request transno makes no sense, IMHO.

Comment by Nathan Rutman [ 21/Nov/12 ]

Xyratex-bug-id: MRP-542

Comment by Alex Zhuravlev [ 18/Dec/12 ]

not sure what's expected in this ticket.. as I said before, if RPC is not a subject to replay, then just do not set transno in reply.

Comment by Alexey Lyashkov [ 15/Mar/13 ]

Alex,

that is not enough because DIO page have a pointer to user page, instead of buffered IO have a copy user data inside.
if you set a transno to zero - you will leave that is from recovery so if we have uncommited data we will lost these changes.

Comment by Alex Zhuravlev [ 15/Mar/13 ]

sorry, I don't follow... it doesn't matter where the buffers are, imo. sync request means OST replies when the write is committed. in turn, this mean recovery/replay is not needed in the first place. then setting transno to 0 should be enough.

Comment by Alexey Lyashkov [ 27/Mar/13 ]

when we have set a no_trasno - we have stop sending a transo updates to a clients (same way used to create recovery barrier in tests). after it - all requests put in recovery queue as rq_transno isn't less then last_transo from a clients. Ptlrpc don't have a differences between buffered write (where kernel have a copy of userspace data) and DIO - don't have a copy and have a direct pointer to userspace. both types of requests put in replay queue to able to replay after fail.

that description enough? or that is fixes in 2.4 in different way? may you verify DIO requests don't put in recovery queue after recovery barrier?

Comment by Alex Zhuravlev [ 27/Mar/13 ]

you're confusing last_committed (which we do not update when notransno flag is set) and per-rpc transno which we set always:

void target_committed_to_req(struct ptlrpc_request *req)
{
struct obd_export *exp = req->rq_export;

if (!exp->exp_obd->obd_no_transno && req->rq_repmsg != NULL)
lustre_msg_set_last_committed(req->rq_repmsg,
exp->exp_last_committed);

Comment by Alex Zhuravlev [ 26/Apr/13 ]

as per discussion with Alexey at LUG: DIO should be waiting for IO completion on OST (otherwise it can get stuck awaiting for last transno to be transfered by ping), given so it makes sense to set transno to 0 in this case.

Comment by Alexey Lyashkov [ 26/Apr/13 ]

so my original patch looks fine. that is drop a reply for a no trasno mode, simulating to waiting a commit from journal and prevent a moving from sending to replay list.

Comment by Gerrit Updater [ 18/Jan/17 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/24940
Subject: LU-1573 test: replay-ost-single test_9 issue extra write
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a9a40f1130d2355443b0a08f77c21135f2ff7a66

Comment by Gerrit Updater [ 03/Feb/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/16680/
Subject: LU-1573 recovery: Avoid data corruption for DIO during FOFB
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1d2fbade1b658db4386091e7938d9483f7aa4a05

Comment by Peter Jones [ 03/Feb/17 ]

Landed for 2.10

Generated at Sat Feb 10 01:17:49 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.