Details
Bug
Resolution: Fixed
Blocker
Lustre 2.0.0, Lustre 2.1.0, Lustre 1.8.6
None
2
24,375
10282
Description
The write_disjoint test occasionally fails with a data corruption pattern like this:
loop 0: chunk_size 103399
loop 1000: chunk_size 69125
loop 2000: chunk_size 104360
loop 3000: chunk_size 11295
loop 4000: chunk_size 77918
loop 4370: chunk_size 51125
loop 4371: chunk 3 corrupted with chunk_size 93369, page_size 4096
ranks: page boundry chunk boundry page boundry
A -> B: 90112 93369 94208
B -> C: 184320 186738 188416
C -> D: 278528 280107 282624
D -> E: 372736 373476 376832
E -> F: 462848 466845 466944
F -> G: 557056 560214 561152
G -> H: 651264 653583 655360
H -> I: 745472 746952 749568
I -> J: 839680 840321 843776
J -> K: 929792 933690 933888
K -> L: 1024000 1027059 1028096
0000000 A A A A A A A A A A A A A A A A
*
0093360 A A A A A A A A A B B B B B B B
0093376 B B B B B B B B B B B B B B B B
*
0186736 B B C C C C C C C C C C C C C C
0186752 C C C C C C C C C C C C C C C C
*
0280096 C C C C C C C C C C C D D D D D
0280112 D D D D D D D D D D D D D D D D
*
0372736 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0373472 nul nul nul nul E E E E E E E E E E E E
0373488 E E E E E E E E E E E E E E E E
*
0466832 E E E E E E E E E E E E E F F F
0466848 F F F F F F F F F F F F F F F F
*
0560208 F F F F F F G G G G G G G G G G
0560224 G G G G G G G G G G G G G G G G
*
0653568 G G G G G G G G G G G G G G G H
0653584 H H H H H H H H H H H H H H H H
*
0746944 H H H H H H H H I I I I I I I I
0746960 I I I I I I I I I I I I I I I I
*
0840320 I J J J J J J J J J J J J J J J
0840336 J J J J J J J J J J J J J J J J
*
0933680 J J J J J J J J J J K K K K K K
0933696 K K K K K K K K K K K K K K K K
*
1027056 K K K L L L L L L L L L L L L L
1027072 L L L L L L L L L L L L L L L L
*
1120416 L L L L L L L L L L L L
1120428
rank 0, loop 4371: data check error - exiting
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -1.
Originally reproduced at Oracle (see the bug and the attachment for the logs).
I have now also reproduced the issue locally, but after examining the logs I believe there may be two separate issues, since my logs differ considerably from the Oracle logs.
The issue that I can reproduce also affects Lustre 1.8.
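For reading the dump above, the boundary table can be recomputed from the reported chunk_size (93369) and page_size (4096). The following is a minimal sketch, not the write_disjoint source; the function names and the assumption of 12 ranks (A through L) are illustrative only. It also checks where the zero-filled range seen in the dump (offsets 372736 up to 373476) falls relative to the chunk and page boundaries.

```python
PAGE_SIZE = 4096      # page_size reported in the failure message
CHUNK_SIZE = 93369    # chunk_size reported for loop 4371
NUM_RANKS = 12        # assumption: ranks A..L, as they appear in the dump


def boundaries(chunk_size, page_size, num_ranks):
    """Return (page boundary below, chunk boundary, next page boundary)
    for each boundary between consecutive ranks."""
    rows = []
    for i in range(1, num_ranks):
        chunk_end = i * chunk_size
        page_below = (chunk_end // page_size) * page_size
        rows.append((page_below, chunk_end, page_below + page_size))
    return rows


def owner(offset, chunk_size):
    """Index of the rank whose chunk contains the given byte offset."""
    return offset // chunk_size


rows = boundaries(CHUNK_SIZE, PAGE_SIZE, NUM_RANKS)
for idx, (below, chunk_end, above) in enumerate(rows):
    left = chr(ord("A") + idx)
    right = chr(ord("A") + idx + 1)
    print(f"{left} -> {right}: {below} {chunk_end} {above}")

# Zero-filled range read off the dump: first nul at 0372736,
# first 'E' byte at 0373472 + 4 = 373476.
hole_start, hole_end = 372736, 373476
print("hole length:", hole_end - hole_start)
print("hole owner:", owner(hole_start, CHUNK_SIZE), owner(hole_end - 1, CHUNK_SIZE))
print("starts on page boundary:", hole_start % PAGE_SIZE == 0)
print("ends on chunk boundary:", hole_end % CHUNK_SIZE == 0)
```

Under these assumptions, the missing bytes lie entirely inside chunk 3 (rank D), beginning exactly at the last page boundary before the D -> E chunk boundary and ending at that chunk boundary. This matches the "chunk 3 corrupted" message in the log.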