Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.1.0, Lustre 1.8.6
    • Lustre 2.0.0, Lustre 2.1.0, Lustre 1.8.6
    • None
    • 2
    • 24,375
    • 10282

    Description

      The write_disjoint test occasionally fails with a data corruption pattern like this:
      loop 0: chunk_size 103399
      loop 1000: chunk_size 69125
      loop 2000: chunk_size 104360
      loop 3000: chunk_size 11295
      loop 4000: chunk_size 77918
      loop 4370: chunk_size 51125
      loop 4371: chunk 3 corrupted with chunk_size 93369, page_size 4096
      ranks: page boundry chunk boundry page boundry
      A -> B: 90112 93369 94208
      B -> C: 184320 186738 188416
      C -> D: 278528 280107 282624
      D -> E: 372736 373476 376832
      E -> F: 462848 466845 466944
      F -> G: 557056 560214 561152
      G -> H: 651264 653583 655360
      H -> I: 745472 746952 749568
      I -> J: 839680 840321 843776
      J -> K: 929792 933690 933888
      K -> L: 1024000 1027059 1028096
      0000000 A A A A A A A A A A A A A A A A
      *
      0093360 A A A A A A A A A B B B B B B B
      0093376 B B B B B B B B B B B B B B B B
      *
      0186736 B B C C C C C C C C C C C C C C
      0186752 C C C C C C C C C C C C C C C C
      *
      0280096 C C C C C C C C C C C D D D D D
      0280112 D D D D D D D D D D D D D D D D
      *
      0372736 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
      *
      0373472 nul nul nul nul E E E E E E E E E E E E
      0373488 E E E E E E E E E E E E E E E E
      *
      0466832 E E E E E E E E E E E E E F F F
      0466848 F F F F F F F F F F F F F F F F
      *
      0560208 F F F F F F G G G G G G G G G G
      0560224 G G G G G G G G G G G G G G G G
      *
      0653568 G G G G G G G G G G G G G G G H
      0653584 H H H H H H H H H H H H H H H H
      *
      0746944 H H H H H H H H I I I I I I I I
      0746960 I I I I I I I I I I I I I I I I
      *
      0840320 I J J J J J J J J J J J J J J J
      0840336 J J J J J J J J J J J J J J J J
      *
      0933680 J J J J J J J J J J K K K K K K
      0933696 K K K K K K K K K K K K K K K K
      *
      1027056 K K K L L L L L L L L L L L L L
      1027072 L L L L L L L L L L L L L L L L
      *
      1120416 L L L L L L L L L L L L
      1120428
      rank 0, loop 4371: data check error - exiting
      --------------------------------------------------------------------------
      MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
      with errorcode -1.
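
      The boundary table and the zero-filled region in the dump above can be cross-checked with a little arithmetic. The following is a minimal sketch in Python, not code from the test itself. It assumes that rank r writes the byte range [r * chunk_size, (r + 1) * chunk_size), which is what the dump suggests. The function names `boundaries` and `zeroed_tail` are hypothetical helpers introduced only for this illustration.

```python
# Values taken from the failing iteration (loop 4371) in the log.
PAGE_SIZE = 4096
CHUNK_SIZE = 93369


def boundaries(rank, chunk_size=CHUNK_SIZE, page_size=PAGE_SIZE):
    """Return (page start, chunk boundary, next page start) for the
    boundary between `rank` and `rank + 1`.

    Assumption: rank r owns bytes [r * chunk_size, (r + 1) * chunk_size).
    """
    chunk_end = (rank + 1) * chunk_size
    page_lo = (chunk_end // page_size) * page_size  # page containing the boundary
    page_hi = page_lo + page_size
    return page_lo, chunk_end, page_hi


def zeroed_tail(rank, chunk_size=CHUNK_SIZE, page_size=PAGE_SIZE):
    """Number of bytes of `rank`'s chunk that sit in the page it shares
    with the next rank, i.e. the part that would read back as zeros if
    that partial-page write were lost."""
    page_lo, chunk_end, _ = boundaries(rank, chunk_size, page_size)
    return chunk_end - page_lo


if __name__ == "__main__":
    # Ranks 0..10 correspond to the rows A -> B through K -> L.
    for rank in range(11):
        label = f"{chr(ord('A') + rank)} -> {chr(ord('A') + rank + 1)}"
        print(label, *boundaries(rank))
    print("zeroed bytes at end of chunk 3:", zeroed_tail(3))
```

      Under these assumptions, rank 3 (chunk D) ends at offset 373476 in the page that starts at 372736, so the nul run in the dump covers exactly the last 740 bytes of chunk D: the part of D that shares a page with chunk E.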

      Originally reproduced at Oracle (see the bug and the attachment for the logs).
      I have now also reproduced the issue locally, but after examining the logs I believe there may be two separate issues, since my logs differ substantially from the Oracle logs.
      The issue that I can reproduce also affects Lustre 1.8.

          People

            Assignee: Oleg Drokin (green)
            Reporter: Oleg Drokin (green)