[LU-8202] Data corruption during failover due to overlapping extent locks Created: 24/May/16 Updated: 01/Jul/16 Resolved: 01/Jul/16 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Patrick Farrell (Inactive) | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Due to the "just grant already-granted locks" behavior on servers during failover, it is possible for overlapping extent locks to be granted, which can lead to data corruption in a number of ways.

The specific case we have seen in customer testing is shared-file I/O from multiple clients, where two clients are trying to write to the same page. (Note these are NOT overlapping writes - they are writing to different parts of the page.) One client has a granted LDLM lock covering this page and the other client is waiting for its lock when the OST is failed over. During recovery, the client which was waiting contacts the server first; its waiting lock is processed and granted, because nothing is in the way. Then the client which already had a granted lock arrives, and its lock is granted immediately. The two clients are then allowed to write to this page at the same time. Since each client writes only part of the page, each write is a read-modify-write operation, and if two read-modify-write operations happen concurrently, both clients can read the page before the other has modified it. Whichever client writes last modifies its own part of the page but re-writes the rest of the page with the stale contents it read, so old data ends up in part of the page.

This is the simplest example of corruption due to this behavior, and we have a test case which can replicate it with reasonable confidence. There are many other possible scenarios; in one additional case, with overlapping locks granted after recovery, Client 2 reads data and gets bad data. I should stress that none of this is limited to multiple clients working on the same page. I will attach a test case and provide logs of an example in the comments. |
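To make the failure mode concrete, here is a minimal standalone sketch of the read-modify-write race (illustration only, not Lustre code; the 4096-byte page size and the 'A'/'B' fill characters are assumptions):

{code}
/* Illustration only: two clients each update half of a page via
 * read-modify-write.  With overlapping extent locks nothing serializes
 * the two operations, so both can read the page before either writes it
 * back, and the last write-back re-introduces stale data. */
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

static char disk_page[PAGE_SIZE];       /* the page as stored on the OST */

/* Read the whole page, then modify part of it in a private copy. */
static void read_modify(char fill, size_t off, size_t len, char *copy)
{
    memcpy(copy, disk_page, PAGE_SIZE);     /* read   */
    memset(copy + off, fill, len);          /* modify */
}

int main(void)
{
    char copy1[PAGE_SIZE], copy2[PAGE_SIZE];

    /* Both clients read the page before either has written it back. */
    read_modify('A', 0, PAGE_SIZE / 2, copy1);             /* client 1 */
    read_modify('B', PAGE_SIZE / 2, PAGE_SIZE / 2, copy2); /* client 2 */

    /* The write-backs land in some order; the last one wins the page. */
    memcpy(disk_page, copy1, PAGE_SIZE);
    memcpy(disk_page, copy2, PAGE_SIZE);

    /* Result: the second half holds client 2's 'B's, but the first half
     * holds the stale data client 2 read (zeroes here), even though
     * client 1's 'A's were "written". */
    printf("byte 0 = 0x%02x, byte %d = '%c'\n",
           (unsigned char)disk_page[0], PAGE_SIZE / 2,
           disk_page[PAGE_SIZE / 2]);
    return 0;
}
{code}

With non-overlapping LDLM extent locks the two read-modify-write operations would be serialized, so the second one would read the first one's result before modifying its own half.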
| Comments |
| Comment by Patrick Farrell (Inactive) [ 24/May/16 ] |
|
The test case runs on three Lustre clients - two writers, one reader. It works like this: the two writers each write half of the same page with two different recognizable strings, then a reader reads the result and checks it. If the result is incorrect, the reader aborts the job.

In more detail: there are MPI_Barrier calls at the start and end of both the write and read functions. The write function skips the start barrier the first time, which has the effect of linking the "end of write" barrier to the "start of read" barrier and the "end of read" barrier to the "start of write" barrier, so the writes and the verifying read alternate. Step 1 is the two writers each writing their half of the page, and the job continues cycling through write and read phases like this. The verifier looks for anything other than the expected output. One possibility for bad data is the same corruption issue as in the customer test - overlapping read-modify-write - in which case we get either all 'A's or all 'B's in the file on disk. A sketch of this structure follows. |
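The following is a rough single-group (three-rank) sketch of the structure described above. It is not the attached mpi_test source; the fill characters, page size, file name, and loop bound are assumptions, and error handling is omitted:

{code}
/* Sketch of the test described above (not the attached source).
 * Ranks 0 and 1 each write half of the same page ('A's and 'B's); rank 2
 * reads the page back and verifies it.  Skipping the writers' start
 * barrier on the first pass links "end of write" to "start of read" and
 * "end of read" to "start of write", so writes and reads alternate. */
#define _XOPEN_SOURCE 500
#include <mpi.h>
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define PAGE_SIZE 4096

static void do_write(int fd, int rank, int first_pass)
{
    char buf[PAGE_SIZE / 2];

    if (!first_pass)
        MPI_Barrier(MPI_COMM_WORLD);           /* start of write */
    memset(buf, rank == 0 ? 'A' : 'B', sizeof(buf));
    pwrite(fd, buf, sizeof(buf), rank == 0 ? 0 : PAGE_SIZE / 2);
    MPI_Barrier(MPI_COMM_WORLD);               /* end of write */
}

static void do_read(int fd)
{
    char buf[PAGE_SIZE];
    size_t i;

    MPI_Barrier(MPI_COMM_WORLD);               /* start of read */
    pread(fd, buf, sizeof(buf), 0);
    for (i = 0; i < PAGE_SIZE; i++) {
        char want = i < PAGE_SIZE / 2 ? 'A' : 'B';

        if (buf[i] != want) {
            fprintf(stderr, "corruption at offset %zu: 0x%02x\n",
                    i, (unsigned char)buf[i]);
            MPI_Abort(MPI_COMM_WORLD, 1);      /* reader aborts the job */
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);               /* end of read */
}

int main(int argc, char **argv)
{
    int rank, size, fd, pass;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 3)                             /* sketch assumes one group */
        MPI_Abort(MPI_COMM_WORLD, 1);
    fd = open("test_file0", O_CREAT | O_RDWR, 0644);

    /* In practice this runs until the verifier aborts the job or the job
     * is killed; failover is performed while it is running. */
    for (pass = 0; pass < 1000000; pass++) {
        if (rank == 2)
            do_read(fd);
        else
            do_write(fd, rank, pass == 0);
    }

    if (rank != 2)                             /* match the reader's final barrier */
        MPI_Barrier(MPI_COMM_WORLD);
    close(fd);
    MPI_Finalize();
    return 0;
}
{code}

The real test extends this to multiple groups, with every three ranks writing their own file, as described in the next comment.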
| Comment by Patrick Farrell (Inactive) [ 24/May/16 ] |
|
This is the aforementioned test case. It must be run on at least three nodes, with the ranks arranged round-robin across the nodes. This can be done with mpirun's --map-by node option, for example:

mpirun -n 3 --map-by node --host centclient02,centclient03,centclient04 mpi_test.o

Once you have at least three nodes, you can run any number of copies of the job. They write to files named test_file#, where # is the 'group' number (every three ranks is a group, and every group has a file; see the mapping sketch after this comment). Start the job, then perform the failover. Surprisingly, we've seen our best results with just one instance of the job: a single instance (three ranks, three nodes) seems to fail most of the time. We have not tried, say, two or three instances (6 or 9 ranks), but we did try a few hundred ranks on roughly 20-30 nodes and had trouble reproducing the problem. |
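For clarity, here is a tiny standalone helper (hypothetical, not part of the attached test) that prints the rank-to-group/role/file mapping described above, assuming the third rank of each group acts as the reader:

{code}
/* Hypothetical helper: print the rank -> group/role/file mapping described
 * above, assuming three ranks per group and that the third rank of each
 * group is the reader. */
#include <stdio.h>

int main(void)
{
    int rank, nranks = 6;                  /* e.g. two groups' worth of ranks */

    for (rank = 0; rank < nranks; rank++) {
        int group = rank / 3;              /* every three ranks form a group  */
        int role  = rank % 3;              /* roles 0 and 1 write, 2 verifies */

        printf("rank %d: group %d, %s, file test_file%d\n",
               rank, group, role == 2 ? "reader" : "writer", group);
    }
    return 0;
}
{code}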
| Comment by Alexander Zarochentsev [ 01/Jul/16 ] |
| Comment by Patrick Farrell (Inactive) [ 01/Jul/16 ] |
|
Zam is right. I think it might be worth keeping the test case, but this issue is resolved by those two changes, so we can close it out. |
| Comment by Patrick Farrell (Inactive) [ 01/Jul/16 ] |