[LU-1573] avoid data corruption for direct io data - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.10.0
Affects Version/s: None
Labels:
None
Environment:
any lustre version

Story Points:
4
Severity:
3
Rank (Obsolete):
4001

Description

When we call a shutdown (without -f), we set the 'notransno' flag so that all requests are put in the replay queue.

        case 'A':
                LCONSOLE_WARN("Failing over %s\n",
                              obd->obd_name);
                obd->obd_fail = 1;
                obd->obd_no_transno = 1;
                ^^^^^^^^^^^^^^^^^^^^^^^^

If that races with obd_commitrw_write() while it is processing a DIO request, the reply is sent without a last_committed transno update.
The ptlrpc client then puts that request in the replay queue, because transno > last_committed, and sends a completion event to the userland application. We now have a request in the replay queue whose brw pages point directly at user data.
OOPS.

If the userland application reuses the same buffer for different data after write(2) returns, and ptlrpc then starts to replay that request, it sends invalid data to the OST, so we corrupt data on the OST side.
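
The sequence above can be modelled in a few lines of plain C. This is only a userspace sketch, not Lustre code: struct sim_request, struct sim_ost and the sim_* functions are hypothetical names invented for illustration, and the only rule taken from the report is that a request stays eligible for replay while transno > last_committed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a bulk write request; not a Lustre structure. */
struct sim_request {
        unsigned long long transno; /* transaction number of the write */
        const char *pages;          /* DIO: points straight at the caller's buffer */
        size_t len;
};

/* Hypothetical stand-in for the object contents on the OST. */
struct sim_ost {
        char data[64];
};

/* "Send" the request: the OST stores whatever the pages hold right now. */
static void sim_send(struct sim_ost *ost, const struct sim_request *req)
{
        assert(req->len <= sizeof(ost->data));
        memcpy(ost->data, req->pages, req->len);
}

/* Client-side rule from the report: keep for replay while transno > last_committed. */
int sim_needs_replay(const struct sim_request *req,
                     unsigned long long last_committed)
{
        return req->transno > last_committed;
}

/* Walk through the scenario; returns 1 if the replay changed the OST data. */
int sim_replay_corrupts(void)
{
        char user_buf[16] = "first-write";
        struct sim_ost ost = { { 0 } };
        struct sim_request req = { 10, user_buf, sizeof(user_buf) };
        unsigned long long last_committed = 9; /* reply did not advance it */

        sim_send(&ost, &req);               /* original write reaches the OST */

        /* write(2) has returned, so the application reuses its buffer. */
        strcpy(user_buf, "other-data");

        if (sim_needs_replay(&req, last_committed))
                sim_send(&ost, &req);       /* replay after reconnect */

        return strcmp(ost.data, "first-write") != 0;
}
```

With last_committed left behind the request's transno, the replay resends the buffer's new contents and overwrites the data that was originally written. If last_committed had reached the transno before the reply was sent, the request would not be replayed at all.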

Replicating the bug is very easy:
use the lctl --device notransno command and then do a direct I/O write. The write does not block and returns, but the ptlrpc request still holds a pointer to userspace.
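
The application side of such a reproduction might look like the sketch below. It is an assumption about what a suitable test program could be, not the reporter's actual test: the function name, the 4096-byte block size and the fill bytes are all made up for illustration, and the notransno step still has to be done separately with lctl. The caller passes O_DIRECT in extra_flags to get direct I/O.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SKETCH_BLOCK 4096 /* assumed alignment and I/O size for direct I/O */

/*
 * Write a block of 'A' bytes, then reuse the SAME buffer for a block of
 * 'B' bytes written right after it. A correct file holds A-block, B-block.
 * If the first request were replayed after the buffer was refilled, the
 * first block would contain 'B' bytes instead.
 * Returns 0 on success, -1 on error.
 */
int sketch_write_two_blocks(const char *path, int extra_flags)
{
        void *buf = NULL;
        int rc = -1;
        int fd;

        /* Direct I/O generally needs an aligned buffer. */
        if (posix_memalign(&buf, SKETCH_BLOCK, SKETCH_BLOCK) != 0)
                return -1;

        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | extra_flags, 0644);
        if (fd >= 0) {
                memset(buf, 'A', SKETCH_BLOCK);
                if (write(fd, buf, SKETCH_BLOCK) == SKETCH_BLOCK) {
                        /* write(2) returned: the buffer is ours again. */
                        memset(buf, 'B', SKETCH_BLOCK);
                        if (write(fd, buf, SKETCH_BLOCK) == SKETCH_BLOCK)
                                rc = 0;
                }
                close(fd);
        }
        free(buf);
        return rc;
}
```

On Linux, O_DIRECT is only visible when _GNU_SOURCE is defined before the includes. After failover and replay, reading the file back and checking that the first block is still all 'A' shows whether the replayed request picked up the reused buffer.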

We found this bug while testing DIO under failover.
We call the default replay_barrier / fail functions on the OST side and see that the file is sometimes corrupted.
The corruption is fully attributable to the requests replayed after reconnect.
After disabling the sending of a reply from the OST to the client in the sync journal case, we found that the bug is fixed,
but it looks like this affects more than just the testing environment, although with a smaller race window.

People

Assignee: Alex Zhuravlev

Reporter: Alexey Lyashkov

Votes: 0

Watchers: 11

Dates

Created: 27/Jun/12 3:51 AM

Updated: 03/Feb/17 12:43 AM

Resolved: 03/Feb/17 12:43 AM