[LU-67] write_disjoint: data corruption Created: 09/Feb/11  Updated: 28/Jun/11  Resolved: 14/Mar/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.0.0, Lustre 2.1.0, Lustre 1.8.6
Fix Version/s: Lustre 2.1.0, Lustre 1.8.6

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: None

Attachments: File 24375.tar.bz2    
Severity: 2
Bugzilla ID: 24375
Rank (Obsolete): 10282

 Description   

Write disjoint occasionally fails with a data corruption pattern like this:
loop 0: chunk_size 103399
loop 1000: chunk_size 69125
loop 2000: chunk_size 104360
loop 3000: chunk_size 11295
loop 4000: chunk_size 77918
loop 4370: chunk_size 51125
loop 4371: chunk 3 corrupted with chunk_size 93369, page_size 4096
ranks: page boundry chunk boundry page boundry
A -> B: 90112 93369 94208
B -> C: 184320 186738 188416
C -> D: 278528 280107 282624
D -> E: 372736 373476 376832
E -> F: 462848 466845 466944
F -> G: 557056 560214 561152
G -> H: 651264 653583 655360
H -> I: 745472 746952 749568
I -> J: 839680 840321 843776
J -> K: 929792 933690 933888
K -> L: 1024000 1027059 1028096
0000000 A A A A A A A A A A A A A A A A
*
0093360 A A A A A A A A A B B B B B B B
0093376 B B B B B B B B B B B B B B B B
*
0186736 B B C C C C C C C C C C C C C C
0186752 C C C C C C C C C C C C C C C C
*
0280096 C C C C C C C C C C C D D D D D
0280112 D D D D D D D D D D D D D D D D
*
0372736 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0373472 nul nul nul nul E E E E E E E E E E E E
0373488 E E E E E E E E E E E E E E E E
*
0466832 E E E E E E E E E E E E E F F F
0466848 F F F F F F F F F F F F F F F F
*
0560208 F F F F F F G G G G G G G G G G
0560224 G G G G G G G G G G G G G G G G
*
0653568 G G G G G G G G G G G G G G G H
0653584 H H H H H H H H H H H H H H H H
*
0746944 H H H H H H H H I I I I I I I I
0746960 I I I I I I I I I I I I I I I I
*
0840320 I J J J J J J J J J J J J J J J
0840336 J J J J J J J J J J J J J J J J
*
0933680 J J J J J J J J J J K K K K K K
0933696 K K K K K K K K K K K K K K K K
*
1027056 K K K L L L L L L L L L L L L L
1027072 L L L L L L L L L L L L L L L L
*
1120416 L L L L L L L L L L L L
1120428
rank 0, loop 4371: data check error - exiting
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -1.

Originally reproduced at Oracle (see the bug and the attachment for the logs).
I have now also reproduced the issue locally, but after examining the logs I believe there may be two separate issues, since my logs differ considerably from the Oracle logs.
The issue that I can reproduce also affects Lustre 1.8.
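
For reading the dump above, here is a minimal sketch (not taken from the write_disjoint source) of how the boundary table can be recomputed from chunk_size and page_size, and how the zeroed span maps back to a rank. The function names are hypothetical. The rank count of 12 is inferred from the letters A through L and the final offset 1120428 = 12 * 93369. The handling of a chunk boundary that falls exactly on a page boundary is an assumption.

```python
PAGE_SIZE = 4096  # page_size reported in the failing loop


def boundary_table(chunk_size, nranks, page_size=PAGE_SIZE):
    """Return (page_below, chunk_boundary, page_above) for each rank boundary.

    Hypothetical helper: it mirrors the three columns of the table in the
    log, not the actual write_disjoint implementation.
    """
    rows = []
    for rank in range(1, nranks):
        chunk_boundary = rank * chunk_size
        # Round down to the start of the page containing the boundary.
        page_below = (chunk_boundary // page_size) * page_size
        rows.append((page_below, chunk_boundary, page_below + page_size))
    return rows


def owner_rank(offset, chunk_size):
    """Rank whose chunk covers the given file offset (rank 0 is 'A')."""
    return offset // chunk_size


if __name__ == "__main__":
    chunk_size, nranks = 93369, 12  # nranks inferred from the dump
    for i, (below, chunk, above) in enumerate(boundary_table(chunk_size, nranks)):
        print(f"{chr(65 + i)} -> {chr(66 + i)}: {below} {chunk} {above}")

    # Offsets of the nul run as read from the dump (decimal).
    zero_start, zero_end = 372736, 373476
    rank = owner_rank(zero_start, chunk_size)
    print(f"zeroed bytes: {zero_end - zero_start}, rank {rank} ({chr(65 + rank)})")
```

With these inputs the nul run starts at the page boundary below the D -> E chunk boundary and ends at that chunk boundary, so the lost bytes are the tail of rank 3's chunk (D), which matches the "chunk 3 corrupted" message.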



 Comments   
Comment by Andreas Dilger [ 09/Feb/11 ]

Note that there has been a similar problem with write_disjoint for ages on 1.6 and 1.8; I think it is https://bugzilla.lustre.org/show_bug.cgi?id=3654.

Comment by Oleg Drokin [ 09/Feb/11 ]

Could be. The issue that I have found has existed forever. It is a race between the enqueue reply and the completion AST: the completion AST arrives first with the correct LVB, and then the RPC reply overwrites the correct LVB with an incorrect one. (I know there is a fix for this race, but it is racy itself.)
I have been testing a patch for most of today and it seems to be holding up well, so I plan to give it to ORNL tomorrow for more testing and also submit it for inspection.
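
To illustrate the ordering problem described in this comment, here is a toy, single-threaded model. It is not Lustre code and not the patch that landed. `ToyLock`, the handler names, and the `lvb_from_ast` flag are all hypothetical, and the sizes are arbitrary. It only shows that applying LVB updates in arrival order lets a late enqueue reply clobber the value delivered by the completion AST.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyLock:
    """Hypothetical stand-in for a client-side lock carrying an LVB."""
    lvb_size: Optional[int] = None
    lvb_from_ast: bool = False  # assumed flag, for illustration only


def on_completion_ast(lock: ToyLock, size: int) -> None:
    # The completion AST carries the correct LVB.
    lock.lvb_size = size
    lock.lvb_from_ast = True


def on_enqueue_reply_naive(lock: ToyLock, size: int) -> None:
    # Unconditional update: a late reply overwrites whatever is there.
    lock.lvb_size = size


def on_enqueue_reply_guarded(lock: ToyLock, size: int) -> None:
    # Illustrative guard: keep the LVB if the completion AST already set it.
    if not lock.lvb_from_ast:
        lock.lvb_size = size


def run(reply_handler, correct: int = 2000, stale: int = 1000) -> Optional[int]:
    """Replay the problematic order: completion AST first, reply second."""
    lock = ToyLock()
    on_completion_ast(lock, correct)
    reply_handler(lock, stale)
    return lock.lvb_size


if __name__ == "__main__":
    print("naive:  ", run(on_enqueue_reply_naive))
    print("guarded:", run(on_enqueue_reply_guarded))
```

The comment notes that an existing fix for this race is itself racy, so a real guard would need the check and the update to be properly serialized. This single-threaded sketch sidesteps that entirely.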

Comment by Chris Gearing (Inactive) [ 10/Feb/11 ]

Is severity 3 the highest or lowest? I ask because data corruption would seem to me to always be highest severity. An important point, I guess, is whether this is silent data corruption or not. I'm not knowledgeable enough to know what is detecting the error in your log.

Comment by Peter Jones [ 10/Feb/11 ]

Severity 3 is the default and means a minor issue. Bumping the severity to major issue (2).

Comment by Peter Jones [ 14/Mar/11 ]

This fix has been landed upstream for 1.8.6

Comment by Oleg Drokin [ 14/Mar/11 ]

landed to 2.1 and 1.8.6 branches
