  1. Lustre
  2. LU-12681

Data corruption due to incorrect KMS with SEL files


    Description

      While testing the SEL port to the 2.12 branch, Grev found a data corruption issue.
      I checked with the latest master on the client side and the issue is still present.
      An example:

      [4] FAILED comparison of buffer containing 8-byte ints:
      [4]   File name = /mnt/lustre/SEL/IOR.o
      [4]   In transfer 392, 256 errors between buffer indices 212992 and 213247.
      [4]   File byte offset = 5454028800:
      [4]     Expected: 0x000000045d5e330f 00000000001a0008 000000045d5e330f 00000000001a0018
      [4]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
      ior ERROR: data check error, aborting execution, errno 0, Success (ior.c:414)
      

      Current state of the investigation: the KMS is not valid in some cases, so ll_prepare_partial_page fills a full page with zeros even though part of that page has already been sent to the OST.
      The quick and dirty fix below resolves the issue, but the KMS problem needs further investigation.

      @@ -598,25 +597,30 @@ static int ll_prepare_partial_page(const struct lu_env *env, struct cl_io *io,
                      GOTO(out, result);
              }
      
      +#if 0
              /*
               * If are writing to a new page, no need to read old data.
               * The extent locking will have updated the KMS, and for our
               * purposes here we can treat it like i_size.
               */
      -       if (attr->cat_kms <= offset) {
      +       if (attr->cat_kms < offset) {
                      char *kaddr = ll_kmap_atomic(vpg->vpg_page, KM_USER0);
      
                      memset(kaddr, 0, cl_page_size(obj));
                      ll_kunmap_atomic(kaddr, KM_USER0);
      +               CDEBUG(D_INFO, "kms-skip %llu <> %llu\n", attr->cat_kms, offset);
                      GOTO(out, result = 0);
              }
      +#endif
      
      00000080:00200000:1.0:1566400964.212391:0:28647:0:(rw26.c:833:ll_write_end()) pos 3891347456, len 2048, copied 2048
      00000080:00000040:1.0:1566400964.407924:0:28647:0:(rw26.c:611:ll_prepare_partial_page()) kms-skip 3643416576 <> 3891347456
      00000080:00200000:1.0:1566400964.407925:0:28647:0:(rw26.c:833:ll_write_end()) pos 3891349504, len 2048, copied 2048
      The BRW then sends a full page to the OST:
      00000008:00000040:1.0:1566400964.653615:0:28647:0:(osc_request.c:1556:osc_brw_prep_request()) buff[1] [3891347456,966656,520]
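
The numbers in the log and in the IOR report can be cross-checked with a little page arithmetic. The sketch below is only an illustration: it assumes a 4096-byte client page size (not stated in this ticket), and the helper names are hypothetical, not Lustre functions. Under that assumption, both ll_write_end() positions fall on the same page (in-page offsets 0 and 2048), the KMS reported by kms-skip lies below the start of that page, and the 256 corrupted 8-byte ints in the IOR report cover exactly the first 2048 bytes of a page.

```c
#include <assert.h>
#include <stdint.h>

#define ASSUMED_PAGE_SIZE 4096ULL /* assumption: 4 KiB client pages */

/* File offset of the first byte of the page containing pos. */
static uint64_t page_start(uint64_t pos)
{
	return pos - (pos % ASSUMED_PAGE_SIZE);
}

/* Byte offset of pos inside its page. */
static uint64_t in_page_offset(uint64_t pos)
{
	return pos % ASSUMED_PAGE_SIZE;
}

/* Cross-check values quoted in this ticket; aborts if one does not hold. */
static void check_ticket_values(void)
{
	const uint64_t first = 3891347456ULL;     /* first ll_write_end() pos */
	const uint64_t second = 3891349504ULL;    /* second ll_write_end() pos */
	const uint64_t kms = 3643416576ULL;       /* KMS from the kms-skip line */
	const uint64_t brw_count = 966656ULL;     /* byte count in buff[1] */
	const uint64_t ior_offset = 5454028800ULL; /* IOR "File byte offset" */

	/* Both 2048-byte writes land on the same page, back to back. */
	assert(in_page_offset(first) == 0);
	assert(page_start(second) == first);
	assert(in_page_offset(second) == 2048);

	/* The KMS seen by the check is below the page being written. */
	assert(kms < page_start(second));

	/* The BRW covers whole pages only. */
	assert(brw_count % ASSUMED_PAGE_SIZE == 0);

	/* The IOR error starts on a page boundary and spans half a page. */
	assert(in_page_offset(ior_offset) == 0);
	assert(256 * sizeof(uint64_t) == 2048);
}
```

Calling check_ticket_values() from a small main() is enough to run the checks; it returns silently when every relation holds.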
      

      It is not a problem with the offset being on the same page. I tried to fix this check to use the real offset, but sometimes the KMS value is very strange, e.g. KMS = 3G while the offset is at 5G, so it looks like a more complex problem.
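
The suspected failure sequence can be modeled in user space. This is a rough sketch, not Lustre code: the struct and function names are hypothetical, the page size is assumed to be 4096 bytes, and the model assumes the page is no longer in the client cache when the second partial write arrives. It only mirrors the "kms <= offset" shortcut from ll_prepare_partial_page and the full-page writeback seen in the log.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/* Hypothetical model of one file page and the client's view of the size. */
struct model_file {
	unsigned char ost_page[MODEL_PAGE_SIZE]; /* page content on the OST */
	uint64_t page_off;                       /* file offset of that page */
	uint64_t kms;                            /* client's known minimum size */
};

/*
 * Model one partial write into a page that is not cached on the client.
 * If the KMS says the page lies beyond known data, the page is zero-filled
 * instead of being read back; afterwards the whole page is written out.
 */
static void model_partial_write(struct model_file *f, unsigned in_page,
				unsigned len, unsigned char fill)
{
	unsigned char cache[MODEL_PAGE_SIZE];

	assert(in_page + len <= MODEL_PAGE_SIZE);

	if (f->kms <= f->page_off)
		memset(cache, 0, sizeof(cache));           /* zero-fill shortcut */
	else
		memcpy(cache, f->ost_page, sizeof(cache)); /* read old data */

	memset(cache + in_page, fill, len);            /* the new user data */
	memcpy(f->ost_page, cache, sizeof(cache));     /* full-page writeback */
}

/*
 * Two 2048-byte writes to the same page. Returns how many bytes of the
 * first write are lost on the OST after the second write.
 */
static unsigned model_run(int kms_is_stale)
{
	struct model_file f;
	unsigned i, lost = 0;

	memset(&f, 0, sizeof(f));
	f.page_off = 3891347456ULL; /* page offset from the log */
	f.kms = 3643416576ULL;      /* KMS from the kms-skip line */

	model_partial_write(&f, 0, 2048, 0xAA);
	if (!kms_is_stale)
		f.kms = f.page_off + 2048; /* KMS follows the first write */
	model_partial_write(&f, 2048, 2048, 0xBB);

	for (i = 0; i < 2048; i++)
		if (f.ost_page[i] != 0xAA)
			lost++;
	return lost;
}
```

In this model, model_run(1) (stale KMS) loses the 2048 bytes of the first write, which matches the half page of zeros in the IOR report, while model_run(0) loses nothing. It does not explain why the KMS is stale, which is the open question here.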

            People

              vitaly_fertman Vitaly Fertman
              shadow Alexey Lyashkov