[LU-67] write_disjoint: data corruption Created: 09/Feb/11 Updated: 28/Jun/11 Resolved: 14/Mar/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.0.0, Lustre 2.1.0, Lustre 1.8.6 |
| Fix Version/s: | Lustre 2.1.0, Lustre 1.8.6 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Oleg Drokin | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 2 |
| Bugzilla ID: | 24375 |
| Rank (Obsolete): | 10282 |
| Description |
|
Write disjoint occasionally fails with a data corruption pattern like this: Originally reproduced at Oracle (see the bug and the attachment for the logs). |
| Comments |
| Comment by Andreas Dilger [ 09/Feb/11 ] |
|
Note that there has been a similar problem with write_disjoint for ages on 1.6 and 1.8; I think it is https://bugzilla.lustre.org/show_bug.cgi?id=3654. |
| Comment by Oleg Drokin [ 09/Feb/11 ] |
|
Could be. The issue that I have found has existed forever. It is a race between the enqueue reply and the completion AST: the completion AST arrives first with the correct LVB, and then the RPC reply overwrites the correct LVB with an incorrect one. (I know there is a fix for this race, but that fix is itself racy.) |
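The ordering problem described in the comment above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Lustre code: the `Lock` class, the handler names, the `replay` helper, and the size values are all hypothetical, and the LVB is reduced to a single cached size. It shows that when the reply handler overwrites the LVB unconditionally, the final value depends on which message is processed last.

```python
from dataclasses import dataclass


@dataclass
class Lock:
    """Hypothetical client-side lock state; not a real Lustre structure."""
    lvb_size: int = 0       # cached value taken from the lock value block
    granted: bool = False


def on_completion_ast(lock: Lock, lvb_size: int) -> None:
    # The completion AST carries the LVB as of the moment the lock was granted.
    lock.lvb_size = lvb_size
    lock.granted = True


def on_enqueue_reply(lock: Lock, lvb_size: int) -> None:
    # Assumed naive behavior: the reply overwrites the LVB unconditionally,
    # even if a completion AST has already delivered a newer one.
    lock.lvb_size = lvb_size


def replay(events):
    """Apply (kind, size) events in order and return the final cached size."""
    handlers = {"ast": on_completion_ast, "reply": on_enqueue_reply}
    lock = Lock()
    for kind, size in events:
        handlers[kind](lock, size)
    return lock.lvb_size


STALE, FRESH = 4096, 8192   # made-up sizes for illustration only

expected_order = replay([("reply", STALE), ("ast", FRESH)])
racy_order = replay([("ast", FRESH), ("reply", STALE)])
print(expected_order, racy_order)
```

In this model the usual ordering ends with the fresh value, while the racy ordering leaves the stale value from the reply in place, which is the kind of overwrite the comment describes.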
| Comment by Chris Gearing (Inactive) [ 10/Feb/11 ] |
|
Is severity 3 the highest or lowest? I ask because data corruption would seem to me to always be highest severity. An important point, I guess, is whether this is silent data corruption or not. I'm not knowledgeable enough to know what is detecting the error from your log. |
| Comment by Peter Jones [ 10/Feb/11 ] |
|
Severity 3 is the default and means a minor issue. Bumping the severity to major issue (2). |
| Comment by Peter Jones [ 14/Mar/11 ] |
|
This fix has been landed upstream for 1.8.6 |
| Comment by Oleg Drokin [ 14/Mar/11 ] |
|
Landed on the 2.1 and 1.8.6 branches.