Lustre / LU-12681

Data corruption due to incorrect KMS with SEL files

Details


    Description

      In testing the SEL port to the 2.12 branch, Grev found a data corruption issue.
      I checked it with the latest master on the client side and the issue is still present.
      An example:

      [4] FAILED comparison of buffer containing 8-byte ints:
      [4]   File name = /mnt/lustre/SEL/IOR.o
      [4]   In transfer 392, 256 errors between buffer indices 212992 and 213247.
      [4]   File byte offset = 5454028800:
      [4]     Expected: 0x000000045d5e330f 00000000001a0008 000000045d5e330f 00000000001a0018
      [4]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
      ior ERROR: data check error, aborting execution, errno 0, Success (ior.c:414)
      

      Current state of the investigation: the KMS is not valid in some cases, and ll_prepare_partial_page fills the whole page with zeros even though part of that page has already been sent to the OST.
      This quick-and-dirty fix resolves the issue, but the underlying KMS problem still needs to be investigated.

      @@ -598,25 +597,30 @@ static int ll_prepare_partial_page(const struct lu_env *env, struct cl_io *io,
                      GOTO(out, result);
              }
      
      +#if 0
              /*
               * If are writing to a new page, no need to read old data.
               * The extent locking will have updated the KMS, and for our
               * purposes here we can treat it like i_size.
               */
      -       if (attr->cat_kms <= offset) {
      +       if (attr->cat_kms < offset) {
                      char *kaddr = ll_kmap_atomic(vpg->vpg_page, KM_USER0);
      
                      memset(kaddr, 0, cl_page_size(obj));
                      ll_kunmap_atomic(kaddr, KM_USER0);
      +               CDEBUG(D_INFO, "kms-skip %llu <> %llu\n", attr->cat_kms, offset);
                      GOTO(out, result = 0);
              }
      +#endif
      
      00000080:00200000:1.0:1566400964.212391:0:28647:0:(rw26.c:833:ll_write_end()) pos 3891347456, len 2048, copied 2048
      00000080:00000040:1.0:1566400964.407924:0:28647:0:(rw26.c:611:ll_prepare_partial_page()) kms-skip 3643416576 <> 3891347456
      00000080:00200000:1.0:1566400964.407925:0:28647:0:(rw26.c:833:ll_write_end()) pos 3891349504, len 2048, copied 2048
      and brw sends a full page to the OST
      00000008:00000040:1.0:1566400964.653615:0:28647:0:(osc_request.c:1556:osc_brw_prep_request()) buff[1] [3891347456,966656,520]
      

      It is not a problem with the offset within the same page; I tried to fix this check against the real offset, but sometimes the KMS value is very strange (e.g. KMS = 3G while the offset is around 5G), so it looks like a more complex problem. A simplified illustration of the failure follows.
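
      A standalone sketch of the effect (illustration only, not Lustre code; the helper name and the simplified page model below are assumptions):

      /*
       * Simplified model of the check in ll_prepare_partial_page(): if the
       * KMS is at or below the page offset, the page is zero-filled instead
       * of being read back.  With a stale (too low) KMS, data already
       * written into the page is replaced by zeros and later sent to the OST.
       */
      #include <stdio.h>
      #include <string.h>

      #define PAGE_SZ 4096ULL

      static void prepare_partial_page(char *page, unsigned long long page_off,
                                       unsigned long long kms,
                                       const char *backing)
      {
              if (kms <= page_off)
                      memset(page, 0, PAGE_SZ);        /* "new" page: just zero it */
              else
                      memcpy(page, backing, PAGE_SZ);  /* page has data: read it back */
      }

      int main(void)
      {
              char backing[PAGE_SZ];
              char page[PAGE_SZ];
              unsigned long long page_off = 3891347456ULL;    /* from the log above */

              memset(backing, 0xAB, 2048);                    /* first 2 KiB already written */
              memset(backing + 2048, 0, PAGE_SZ - 2048);

              /* Correct KMS covers the earlier 2 KiB write: the page is read back. */
              prepare_partial_page(page, page_off, page_off + 2048, backing);
              printf("good KMS : first byte = 0x%02x\n", (unsigned)(unsigned char)page[0]);

              /* Stale KMS (3643416576 in the log) is below the page offset:
               * the whole page is zeroed and the earlier 2 KiB are lost. */
              prepare_partial_page(page, page_off, 3643416576ULL, backing);
              printf("stale KMS: first byte = 0x%02x\n", (unsigned)(unsigned char)page[0]);
              return 0;
      }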

      Attachments

        Issue Links

          Activity

            [LU-12681] Data corruption due to incorrect KMS with SEL files

            gerrit Gerrit Updater added a comment -

            Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40739
            Subject: LU-12681 osc: wrong cache of LVB attrs
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 5aa6767d63368e630531669e51b4c8e11e1788b8
            pjones Peter Jones added a comment -

            Landed for 2.13


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36200/
            Subject: LU-12681 osc: wrong cache of LVB attrs, part2
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 40319db5bc649adaf3dad066e2c1bb49f7f1c04a

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36199/
            Subject: LU-12681 osc: wrong cache of LVB attrs
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 8ac020df4592fc6e85edd75d54cb3795a4e50f8e

            pfarrell Patrick Farrell (Inactive) added a comment -

            Sorry, I meant the read required to do a partial page write.  I know what you're talking about.

            The read is of the full page, but then the write overwrites part of it.  So that's why I said a "partial page read", but that is incorrect, as you pointed out.

            My point about the corruption being limited in this case to just zeroes, not garbage data, still stands.  That's why LU-12786 doesn't match here.

            shadow Alexey Lyashkov added a comment -

            Patrick,

            you are wrong. This bug is on the partial page write path. POSIX requires the page to be zeroed before it is modified, and Lustre does that in:

            static int ll_prepare_partial_page(const struct lu_env *env, struct cl_io *io,
                                               struct cl_page *pg, struct file *file)
            {
                    /*
                     * If are writing to a new page, no need to read old data.
                     * The extent locking will have updated the KMS, and for our
                     * purposes here we can treat it like i_size.
                     */
                    if (attr->cat_kms <= offset) {
                            char *kaddr = ll_kmap_atomic(vpg->vpg_page, KM_USER0);
            
                            memset(kaddr, 0, cl_page_size(obj));
                            ll_kunmap_atomic(kaddr, KM_USER0);
                            GOTO(out, result = 0);
                    }
            

            This causes zeros to be sent to the OST and written to disk.


            pfarrell Patrick Farrell (Inactive) added a comment -

            Ihara,

            I don't think so - this bug results mostly in zeroes in partial page reads.  In most scenarios this is transient corruption that stays on the client and doesn't go to disk (it can go to disk, but only under special scenarios).  You have much larger regions of bad data (most of a write in all cases I saw), the data is bad on disk, and the data is not zeroes but mostly random garbage.  (It is zeroes in one case, but it's possible for unwritten regions of disk to contain zeroes rather than garbage.)

            sihara Shuichi Ihara added a comment -

            OK, so the patch for LU-12681 would be worth trying for my workload, then?

            shadow Alexey Lyashkov added a comment -

            It's a very generic problem; the primary case is PFL or SEL, where the layout changes from time to time.
            But in some cases this error can hit any client with CLIO, so Lustre 2.0+ is affected.
            It requires the inode to be flushed from the inode cache while the LDLM locks stay cached, likely pinned by IO.
            I remember similar problems from a long time ago, when the inode info was moved from the lock to the LDLM resource.

            sihara Shuichi Ihara added a comment -

            I got a similar problem on an unaligned single-shared-file workload with IOR: LU-12786.
            I'm not using SEL, but is LU-12681 a general problem or is it specific to SEL?
            vitaly_fertman Vitaly Fertman added a comment - - edited

            The problem is regularly reproduced with IOR running with a transfer size not aligned to the page size on a SEL layout, e.g.:

            1st problem offset:
            redpill-client01: [27] At transfer buffer #32, index #212992 (file byte offset 2666450944):

            the IO that happened is:
            redpill-client01: api = POSIX
            redpill-client01: test filename = /mnt/fs1//ha.sh-94333/redpill-client01-ior/f.ior
            redpill-client01: access = single-shared-file
            redpill-client01: pattern = strided (33 segments)
            redpill-client01: ordering in a file = sequential offsets
            redpill-client01: ordering inter file=constant task offsets = 1
            redpill-client01: clients = 48 (8 per node)
            redpill-client01: repetitions = 2
            redpill-client01: xfersize = 1.63 MiB
            redpill-client01: blocksize = 27.66 MiB
            redpill-client01: aggregate filesize = 42.78 GiB
            redpill-client01:
            redpill-client01: Using Time Stamp 1565463864 (0x5d4f1538) for Data Signature
            redpill-client01: Commencing write performance test.
            redpill-client01: Sat Aug 10 19:04:24 2019
            redpill-client01:
            redpill-client01: access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
            redpill-client01: ------ --------- ---------- --------- -------- -------- -------- -------- ----
            redpill-client01: write 171.50 28322 1666.00 0.015037 255.36 0.116421 255.46 0 XXCEL

            i.e. each client writes 33 * 28322K, and each 28322K block is written in 1666K transfers

            the file itself:
            lfs getstripe -y /mnt/fs1//ha.sh-94333/redpill-client01-ior/f.ior
            lcm_layout_gen: 64
            lcm_mirror_count: 1
            lcm_entry_count: 3

            • component0:
              lcme_id: 1
              lcme_mirror_id: 0
              lcme_flags: init
              lcme_extent.e_start: 0
              lcme_extent.e_end: 67108864
              sub_layout:
              lmm_stripe_count: 1
              lmm_stripe_size: 1048576
              lmm_pattern: raid0
              lmm_layout_gen: 0
              lmm_stripe_offset: 0
              lmm_objects:
              l_ost_idx: 0
              l_fid: 0x100000000:0x23a551c4:0x0
            • component1:
              lcme_id: 3
              lcme_mirror_id: 0
              lcme_flags: init
              lcme_extent.e_start: 67108864
              lcme_extent.e_end: 45969571840
              sub_layout:
              lmm_stripe_count: 1
              lmm_stripe_size: 1048576
              lmm_pattern: raid0
              lmm_layout_gen: 0
              lmm_stripe_offset: 6
              lmm_objects:
              l_ost_idx: 6
              l_fid: 0x100060000:0x2368b93b:0x0
            • component2:
              lcme_id: 4
              lcme_mirror_id: 0
              lcme_flags: extension
              lcme_extent.e_start: 45969571840
              lcme_extent.e_end: EOF
              sub_layout:
              lmm_stripe_count: 0
              lmm_extension_size: 134217728
              lmm_pattern: raid0
              lmm_layout_gen: 0
              lmm_stripe_offset: -1

            The problem is that the first 2k of the block are zeros, while the next 2k of the block is the start of the next transfer-size chunk ((2666450944 + 2k) / 1666k = 1563); however, this is not a segment border, so it was written by the same client.
            It means that the last half page of one transfer-size chunk of data is lost when it is followed by another transfer-size chunk of data from the same client, as the check below confirms.
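
            A quick arithmetic check of the numbers above (illustration only):

            #include <stdio.h>

            int main(void)
            {
                    unsigned long long xfer = 1666ULL * 1024;       /* 1666 KiB transfer size */
                    unsigned long long page = 4096;
                    unsigned long long off  = 2666450944ULL;        /* first corrupted offset */

                    /* 1666 KiB is not a multiple of the page size: every transfer
                     * boundary lands exactly half a page (2 KiB) into a page. */
                    printf("xfer %% page = %llu\n", xfer % page);   /* prints 2048 */

                    /* The zeroed 2 KiB end exactly where transfer #1563 of the same
                     * client begins, i.e. (off + 2 KiB) is an exact transfer boundary. */
                    printf("(off + 2048) / xfer = %llu, remainder %llu\n",
                           (off + 2048) / xfer, (off + 2048) % xfer);   /* 1563, 0 */
                    return 0;
            }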

            SEL is involved only because it triggers layout changes regularly; the problem really belongs to the layout lock handling in CLIO. The osc objects do not survive a layout lock change, so the cached LVB attributes, including the information about the modifications that have already been made, are lost; the new osc objects are then re-populated from the LVB cached in the outdated lock. A later preparation of a partial page checks the KMS, which is now low enough that the page is not read in advance, and the untouched part of the page is simply left zeroed (see the sketch below).

            the patch descriptions have more detailed explanations of the particular scenarios being fixed.
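
            A minimal sketch of the sequence described above (the structures below are simplified assumptions, not the real CLIO/LDLM data types):

            #include <stdio.h>

            struct lvb     { unsigned long long kms; };  /* attrs cached in the DLM lock */
            struct osc_obj { unsigned long long kms; };  /* attrs cached in the osc object */

            int main(void)
            {
                    struct lvb lock_lvb = { .kms = 3643416576ULL };  /* filled at enqueue time */
                    struct osc_obj obj  = { .kms = lock_lvb.kms };

                    /* 1. The client writes past the old KMS: the osc object is
                     *    updated, but the lock's LVB copy is not. */
                    obj.kms = 3891349504ULL;

                    /* 2. SEL triggers a layout change: the osc object does not
                     *    survive it and the up-to-date KMS is lost; the new osc
                     *    object is re-populated from the outdated lock LVB. */
                    struct osc_obj new_obj = { .kms = lock_lvb.kms };
                    printf("kms before layout change %llu, after %llu\n",
                           obj.kms, new_obj.kms);

                    /* 3. A later partial-page write prepares the page: the stale
                     *    KMS is below the page offset, so the page is zero-filled
                     *    instead of being read, and the zeros reach the OST. */
                    unsigned long long page_off = 3891347456ULL;
                    if (new_obj.kms <= page_off)
                            printf("stale kms %llu <= offset %llu -> page zero-filled\n",
                                   new_obj.kms, page_off);
                    return 0;
            }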


            People

              vitaly_fertman Vitaly Fertman
              shadow Alexey Lyashkov
              Votes: 0
              Watchers: 8
