
[LU-8202] Data corruption during failover due to overlapping extent locks

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Affects Version/s: Lustre 2.9.0
    • Fix Version/s: None
    • Labels: None
    • Severity: 3

    Description

      Due to the "just grant already granted locks" behavior on servers during failover, it is possible to get overlapping extent locks granted, which can lead to data corruption in a number of ways.

      The specific case we have seen in customer testing is shared file I/O from multiple clients, when two clients are trying to write to the same page. (Note these are NOT overlapping writes - They're writing to different parts of the page.)

      One client has a granted LDLM lock covering this page and the other client is waiting on a conflicting lock when the OST is failed over. During recovery, the client which was waiting contacts the server first; its waiting lock is processed and granted, because nothing is in the way. Then the client which already had a granted lock arrives, and its lock is re-granted immediately.

      The two clients are then allowed to write to this page at the same time. Since each write is a partial-page write, it is a read-modify-write operation, and if two read-modify-write operations happen concurrently, they can end up with old data in part of the page: both clients can read the page before the other modifies it, so whichever one writes last will update its own part of the page but re-write the stale contents it read for the rest of the page.
      (Feel free to read 'page' as 'disk block' in parts of the above. Also, this scenario assumes the writethrough cache is disabled, as I believe cache locking would prevent this particular scenario.)
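
      For concreteness, here is a minimal sketch of the kind of I/O each client performs in this scenario. It is an illustration only (the file name, sizes, and role split are assumptions, not taken from the reproducer); the point is simply that each client writes only part of a page, so the server completes each write as a read-modify-write of the whole page.

      /* Sketch of the I/O pattern described above (illustration only).
       * Each of two clients writes a different, non-overlapping half of
       * the same 4 KiB page; each write is a partial-page write that the
       * server completes as a read-modify-write of the whole page. */
      #include <fcntl.h>
      #include <string.h>
      #include <unistd.h>

      #define PAGE_SIZE 4096
      #define HALF      (PAGE_SIZE / 2)

      /* role 0 = "client 1" fills the first half with 'A',
       * role 1 = "client 2" fills the second half with 'B'. */
      static int write_half(const char *path, int role)
      {
          char buf[HALF];
          ssize_t rc;
          int fd;

          fd = open(path, O_WRONLY | O_CREAT, 0644);
          if (fd < 0)
              return -1;

          memset(buf, role == 0 ? 'A' : 'B', sizeof(buf));
          /* If both clients hold extent locks covering this page, the two
           * read-modify-write cycles can interleave and one client's half
           * can be overwritten with the stale data the other client read. */
          rc = pwrite(fd, buf, sizeof(buf), role == 0 ? 0 : HALF);
          close(fd);
          return rc == (ssize_t)sizeof(buf) ? 0 : -1;
      }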

      This is the simplest example of corruption due to this behavior, and we have a test case which can reproduce it fairly reliably. However, many other scenarios are possible.

      Consider this additional case:
      Client 1 is writing to an area; the write completes on the client (returns to userspace). Client 1 then tells client 2 "data is ready", and client 2 reads the same area, generating a waiting read lock behind client 1's write lock. Failover happens, and the waiting read lock from client 2 is granted before the write lock from client 1 is re-granted.

      Client 2 reads the data and gets stale data.
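
      A sketch of this second pattern, again purely illustrative (the notification callback and names are assumptions, not the reproducer): client 1 writes and then signals client 2 out of band, and client 2 expects its subsequent read to see the new data.

      #include <fcntl.h>
      #include <string.h>
      #include <unistd.h>

      #define LEN 4096

      /* Client 1: write the region, then tell client 2 the data is ready.
       * The notify_peer callback stands in for whatever out-of-band channel
       * the application uses (MPI, sockets, etc.). */
      static void producer(const char *path, void (*notify_peer)(void))
      {
          char buf[LEN];
          int fd = open(path, O_WRONLY | O_CREAT, 0644);

          memset(buf, 'X', sizeof(buf));
          pwrite(fd, buf, sizeof(buf), 0);   /* write returns to userspace */
          close(fd);
          notify_peer();                     /* "data is ready" */
      }

      /* Client 2: runs after the notification arrives.  If, during recovery,
       * its waiting read lock is granted ahead of client 1's write lock, this
       * read can return the old contents of the region. */
      static int consumer(const char *path)
      {
          char buf[LEN];
          int fd = open(path, O_RDONLY);
          ssize_t rc = pread(fd, buf, sizeof(buf), 0);

          close(fd);
          return (rc == LEN && buf[0] == 'X') ? 0 : -1;   /* -1: stale data */
      }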

      There are a variety of other possible scenarios as well, and I should stress that none of this is limited to multiple clients working on the same page.

      I will attach a test case and provide logs of an example in the comments.

      Attachments

        Activity

          paf Patrick Farrell (Inactive) added a comment:

          Duplicate of LU-8175 and LU-8347 (mostly LU-8347).

          paf Patrick Farrell (Inactive) added a comment:

          Zam is right. I think it might be worth keeping the test case, but this issue is resolved by those two changes, so we can close it out.

          zam Alexander Zarochentsev added a comment:

          LU-8347 and LU-8175 are intended to fix this issue.

          paf Patrick Farrell (Inactive) added a comment:

          This is the aforementioned test case.

          It must be run on at least three nodes, with the ranks arranged round-robin across the nodes. This can be done with mpirun's --map-by node option, for example:

          mpirun -n 3 --map-by node --host centclient02,centclient03,centclient04 mpi_test.o

          Once you've got at least three nodes, you can run any number of copies of the job. They write to files named test_file#, where # is the 'group' number. (Every three ranks is a group, and every group has a file.)
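
          As a small sketch of the rank-to-group mapping described above (the role split is an assumption, not code from the attached source):

          #include <stdio.h>

          /* Every three ranks form a group and share one file, test_file#,
           * where # is the group number. */
          static void group_file(int rank, char *path, size_t len, int *role)
          {
              int group = rank / 3;   /* ranks 0-2 -> group 0, 3-5 -> group 1, ... */

              *role = rank % 3;       /* e.g. roles 0 and 1 write, role 2 reads */
              snprintf(path, len, "test_file%d", group);
          }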

          Start the job, then do failover. We find that with just one instance of this job (three ranks, three nodes), it seems to fail most of the time.

          Surprisingly, we've seen our best results with just one instance of the job. We have not tried, say, 2 or 3 instances (6 or 9 ranks), but we did try a few hundred ranks on ~20-30 nodes and had trouble reproducing the problem.


          paf Patrick Farrell (Inactive) added a comment:

          The test case runs on three Lustre clients - two writers and one reader. It works like this:

          Two writers each write half of the same page with two different recognizable strings, then a reader reads the result and checks it. If the result is incorrect, the reader aborts the job.

          Here's a more detailed explanation. There are MPI_Barrier calls at the start and end of both the write and read functions. The write function skips the start barrier the first time, which has the effect of linking the "end of write" barrier to the "start of read" barrier and the "end of read" barrier to the "start of write" barrier.

          Step 1:
          Writer 1 writes the first 2048 bytes of the file with all 'A's, writer 2 writes the next 2048 bytes (the second half of the page) with all 'B's. There is no ordering control between these two writers, but since the writes don't really overlap, we should get a page of AAA...BBB....
          Step 2:
          MPI_Barrier at the end of the write. Reader is already waiting at this barrier, so once writers are done, reader starts.
          Writers start over, but this time through, they wait at a barrier before writing.
          Step 3:
          Reader reads the file and verifies the expected contents (and aborts if the contents are bad).
          Reader hits the 'end of read' barrier, which wakes up the writers to write again, and the reader waits on the "start of read" barrier.
          Step 4:
          Writers switch halves of the page - Writer 1 now writes the second half of the page with 'A's, writer 2 writes the first half with 'B's.

          Job continues like this.

          The verifier looks for anything other than the expected output. The possibilities for bad data are:

          1) The same corruption issue as in the customer test - overlapping read-modify-write - in which case we get either all 'A's or all 'B's in the file on disk.
          2) The reader gets in before one or both of the writers have finished. In this case, the reader will report an issue, but the final file on disk may be OK.
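
          To make the structure concrete, here is a minimal sketch of an MPI program following the pattern described above. It is an illustration only, not the attached test case; the role assignment (ranks 0 and 1 write, rank 2 reads), the file naming, and the details of the verification are assumptions based on this description.

          #include <fcntl.h>
          #include <mpi.h>
          #include <stdio.h>
          #include <string.h>
          #include <unistd.h>

          #define PAGE 4096
          #define HALF (PAGE / 2)

          int main(int argc, char **argv)
          {
              int rank, iter;
              char path[64], buf[PAGE];

              MPI_Init(&argc, &argv);
              MPI_Comm_rank(MPI_COMM_WORLD, &rank);
              snprintf(path, sizeof(path), "test_file%d", rank / 3);

              for (iter = 0; ; iter++) {
                  if (rank % 3 != 2) {                    /* writers: ranks 0 and 1 */
                      /* Skip the "start of write" barrier on the first pass; from
                       * then on it pairs with the reader's "end of read" barrier,
                       * while the "end of write" barrier pairs with the reader's
                       * "start of read" barrier. */
                      if (iter > 0)
                          MPI_Barrier(MPI_COMM_WORLD);    /* start of write */

                      /* Writer 1 writes 'A's, writer 2 writes 'B's; the writers
                       * swap halves of the page every iteration. */
                      char c = (rank % 3 == 0) ? 'A' : 'B';
                      off_t off = ((rank % 3) ^ (iter & 1)) ? HALF : 0;
                      int fd = open(path, O_WRONLY | O_CREAT, 0644);

                      memset(buf, c, HALF);
                      pwrite(fd, buf, HALF, off);
                      close(fd);

                      MPI_Barrier(MPI_COMM_WORLD);        /* end of write */
                  } else {                                /* reader: rank 2 */
                      MPI_Barrier(MPI_COMM_WORLD);        /* start of read */

                      int fd = open(path, O_RDONLY);
                      pread(fd, buf, PAGE, 0);
                      close(fd);

                      /* Expect one half all 'A' and the other all 'B', alternating
                       * each iteration; anything else is either the read-modify-write
                       * corruption (all 'A' or all 'B' on disk) or the reader getting
                       * in before the writers finished. */
                      char first  = (iter & 1) ? 'B' : 'A';
                      char second = (iter & 1) ? 'A' : 'B';
                      int i;

                      for (i = 0; i < PAGE; i++) {
                          char want = (i < HALF) ? first : second;

                          if (buf[i] != want) {
                              fprintf(stderr, "iter %d, offset %d: got '%c', want '%c'\n",
                                      iter, i, buf[i], want);
                              MPI_Abort(MPI_COMM_WORLD, 1);
                          }
                      }

                      MPI_Barrier(MPI_COMM_WORLD);        /* end of read */
                  }
              }

              /* Not reached: the job runs until verification fails or it is
               * killed as part of the failover testing. */
              MPI_Finalize();
              return 0;
          }

          Because the writers skip their first "start of write" barrier, the barrier calls pair up exactly as described: each "end of write" matches the reader's "start of read", and each "end of read" matches the writers' next "start of write".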


          People

            Assignee: wc-triage WC Triage
            Reporter: paf Patrick Farrell (Inactive)
            Votes: 0
            Watchers: 6
