Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.1.0, Lustre 1.8.6
    • Affects Version/s: Lustre 2.0.0, Lustre 2.1.0, Lustre 1.8.6
    • None
    • Severity: 2
    • 24,375
    • 10282

    Description

      write_disjoint occasionally fails with a data corruption pattern like this:
      loop 0: chunk_size 103399
      loop 1000: chunk_size 69125
      loop 2000: chunk_size 104360
      loop 3000: chunk_size 11295
      loop 4000: chunk_size 77918
      loop 4370: chunk_size 51125
      loop 4371: chunk 3 corrupted with chunk_size 93369, page_size 4096
      ranks:   page boundry  chunk boundry  page boundry
      A -> B:         90112          93369         94208
      B -> C:        184320         186738        188416
      C -> D:        278528         280107        282624
      D -> E:        372736         373476        376832
      E -> F:        462848         466845        466944
      F -> G:        557056         560214        561152
      G -> H:        651264         653583        655360
      H -> I:        745472         746952        749568
      I -> J:        839680         840321        843776
      J -> K:        929792         933690        933888
      K -> L:       1024000        1027059       1028096
      0000000 A A A A A A A A A A A A A A A A
      *
      0093360 A A A A A A A A A B B B B B B B
      0093376 B B B B B B B B B B B B B B B B
      *
      0186736 B B C C C C C C C C C C C C C C
      0186752 C C C C C C C C C C C C C C C C
      *
      0280096 C C C C C C C C C C C D D D D D
      0280112 D D D D D D D D D D D D D D D D
      *
      0372736 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
      *
      0373472 nul nul nul nul E E E E E E E E E E E E
      0373488 E E E E E E E E E E E E E E E E
      *
      0466832 E E E E E E E E E E E E E F F F
      0466848 F F F F F F F F F F F F F F F F
      *
      0560208 F F F F F F G G G G G G G G G G
      0560224 G G G G G G G G G G G G G G G G
      *
      0653568 G G G G G G G G G G G G G G G H
      0653584 H H H H H H H H H H H H H H H H
      *
      0746944 H H H H H H H H I I I I I I I I
      0746960 I I I I I I I I I I I I I I I I
      *
      0840320 I J J J J J J J J J J J J J J J
      0840336 J J J J J J J J J J J J J J J J
      *
      0933680 J J J J J J J J J J K K K K K K
      0933696 K K K K K K K K K K K K K K K K
      *
      1027056 K K K L L L L L L L L L L L L L
      1027072 L L L L L L L L L L L L L L L L
      *
      1120416 L L L L L L L L L L L L
      1120428
      rank 0, loop 4371: data check error - exiting
      --------------------------------------------------------------------------
      MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
      with errorcode -1.

      Originally reproduced at Oracle (see the bug and the attachment for the logs).
      I have now also reproduced the issue locally, but after examining the logs I believe there may be two separate issues, since my logs differ considerably from the Oracle logs.
      The issue that I can reproduce also affects Lustre 1.8.
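
      The boundary table in the log above can be recomputed from the two values the test prints, chunk_size 93369 and page_size 4096. The sketch below is only a reading aid: the function name `boundaries` and the rank count of 12 (ranks A through L) are assumptions taken from the log, not code from write_disjoint itself. It shows that the run of nul bytes starts at the page boundary just below the D -> E chunk boundary and ends at that chunk boundary, so the zeroed data is the last partial page of chunk 3.

```python
PAGE_SIZE = 4096  # page_size reported in the log


def boundaries(chunk_size, nranks, page_size=PAGE_SIZE):
    """Return (page_before, chunk_boundary, page_after) per rank boundary.

    Hypothetical helper for illustration; it assumes rank i writes the
    byte range [i * chunk_size, (i + 1) * chunk_size).
    """
    rows = []
    for i in range(1, nranks):
        chunk_end = i * chunk_size
        page_before = (chunk_end // page_size) * page_size
        page_after = page_before + page_size
        rows.append((page_before, chunk_end, page_after))
    return rows


rows = boundaries(93369, 12)
for (lo, mid, hi), a, b in zip(rows, "ABCDEFGHIJK", "BCDEFGHIJKL"):
    print(f"{a} -> {b}: {lo} {mid} {hi}")

# The nul run in the dump covers offsets 372736 up to (not including) 373476.
d_to_e = rows[3]
hole_bytes = d_to_e[1] - d_to_e[0]
print("zeroed tail of chunk 3:", hole_bytes, "bytes")
```

      With these inputs the D -> E row comes out as 372736 373476 376832, matching the log, and the zeroed range is 740 bytes long.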

      Attachments

        Activity

          [LU-67] write_disjoint: data corruption
          pjones Peter Jones made changes -
          Affects Version/s Original: Lustre 1.8.x [ 10010 ]
          green Oleg Drokin made changes -
          Resolution New: Fixed [ 1 ]
          Status Original: Open [ 1 ] New: Resolved [ 5 ]
          green Oleg Drokin added a comment -

          landed to 2.1 and 1.8.6 branches

          pjones Peter Jones added a comment -

          This fix has been landed upstream for 1.8.6

          pjones Peter Jones made changes -
          Priority Original: Major [ 3 ] New: Blocker [ 1 ]
          pjones Peter Jones made changes -
          Severity Original: 3 New: 2
          pjones Peter Jones added a comment -

          Severity 3 is the default and means a minor issue. Bumping the severity to major issue (2)

          chris Chris Gearing (Inactive) added a comment -

          Is severity 3 the highest or lowest? I ask because data corruption would seem to me to always be highest severity. An important point, I guess, is whether this is silent data corruption or not. I'm not knowledgeable enough to know what is detecting the error from your log.
          green Oleg Drokin added a comment -

          Could be. The issue that I have found has existed forever. It is a race between the enqueue reply and the completion AST: the completion AST arrives first with the correct LVB, and then the RPC reply overwrites the correct LVB with an incorrect one (I know there is a fix for this race, but it is racy by itself).
          I have been testing a patch for most of today and it seems to be holding well, so I plan to give it to ORNL tomorrow for more testing and also submit it for inspection.
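
          The ordering problem described in this comment can be pictured with a toy model. This is a sketch under stated assumptions, not Lustre code and not the patch that landed: `ToyLock`, `completion_ast`, the two reply handlers, and the sizes are all hypothetical names and values, and the "guarded" variant is just one conceivable way to avoid the overwrite in a single-threaded model.

```python
from dataclasses import dataclass


@dataclass
class ToyLock:
    """Hypothetical stand-in for a client-side lock with an LVB."""
    granted: bool = False
    lvb_size: int = 0  # stand-in for the value carried in the LVB


def completion_ast(lock, size):
    # The completion AST delivers the up-to-date value and grants the lock.
    lock.lvb_size = size
    lock.granted = True


def enqueue_reply_unguarded(lock, size):
    # Blindly installs the value from the RPC reply, even if it is older.
    lock.lvb_size = size


def enqueue_reply_guarded(lock, size):
    # Assumed guard: skip the reply's value once the AST has granted the lock.
    if not lock.granted:
        lock.lvb_size = size


def replay(reply_handler, fresh=8192, stale=4096):
    """Replay the problematic order: completion AST first, reply second."""
    lock = ToyLock()
    completion_ast(lock, fresh)
    reply_handler(lock, stale)
    return lock.lvb_size


unguarded_result = replay(enqueue_reply_unguarded)
guarded_result = replay(enqueue_reply_guarded)
print("unguarded:", unguarded_result)
print("guarded:  ", guarded_result)
```

          In this model the unguarded handler leaves the stale value in place, while the guarded one keeps the value delivered by the completion AST. The real code is concurrent, which is why, as the comment notes, a guard of this kind can itself be racy.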

          adilger Andreas Dilger added a comment -

          Note that there has been a similar problem with write_disjoint for ages on 1.6 and 1.8; I think it is https://bugzilla.lustre.org/show_bug.cgi?id=3654.

          People

            green Oleg Drokin
            green Oleg Drokin
            Votes: 0
            Watchers: 1
