Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9714

Changelog consumer test reports 'Local llog found corrupted'

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.11.0
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      [admin@snx11253n002 ~]$ lctl get_param mdd.snx11253-MDT0000.changelog_users;
      mdd.snx11253-MDT0000.changelog_users=
      current index: 8936329
      ID index
      cl2 7049068
      cl31347 3348651
      cl31349 3348651
      cl31351 3348651
      cl33628 6330946
      cl33632 6335474
      cl2 7049068
      cl33962 7379382
      cl33963 7379382
      cl33964 7379382

      TEST PROCEDURE:
      For each rank:
      (I) CONSUME_LOGS (STARTREC=<arg> or 1)
      [1] Set endrec = startrec + params.chunksize
      [2] Read CL in parallel with N threads with range (startrec, endrec), and extract
      actual last record read (lastrec) .
      a] If no new files are found ...
      i] Retry until allowed (by refresh timeout, refresh retries).
      Otherwise quit.
      b] If new files are found ...
      i] If clearing is enabled, clear logs up to lastrec.
      ii] Increment record ranges
      (startrec = lastrec, endrec = startrec + chunksize)
      d] repeat step [2]
      TEST EXECUTION:
      The test was executed with the follow arguments:
      $ changelog-consumer.py -v --read-clear --clusers cl2 cl2 cl2 cl2 --mdt snx11253-MDT0000 --chunksize 9999 --read-threads 8 --update-retries 6 --update-after 30
      Basically this is reading 9999 Changelogs in 8 threads, then clearing 9999 Changelogs in 1 thread (implied, not show). These actions are being duplicated across 4 separate nodes, and against the "cl2" user.

      The root cause is parallel processing llog with modification.
      The llog has llog_handle which include

      lgh_cur_idx; /* used during llog_process */
      lgh_cur_offset; /* used during llog_process */

      They both are modified by llog_process_thread every time we process record from llog.

      Those fields are used at llog_osd_write_rec for modification/rewrite llog record. The race exist when two or more threads are processing the same llog and at least one of them do modification.
      1) llog_process_thread lgh_cur_idx=10 lgh_cur_offset=2000,
      2) llog_process_thread lgh_cur_idx=5 lgh_cur_offset=1000,

      1) log_process_thread >lpi_cb>mdd_changelog_user_purge_cb->llog_write... llog_osd_write_rec and then

      if (idx != loghandle->lgh_cur_idx) {                         
                                      CERROR("%s: modify index mismatch %d %d\n",          
                                             o->do_lu.lo_dev->ld_obd->obd_name, idx,       
                                             loghandle->lgh_cur_idx);                      
                                     RETURN(-EFAULT);                                     
                }  
      
      
      

      From logs

      May 4 19:55:29 snx11253n002 kernel: LustreError: 29737:0:(llog_osd.c:441:llog_osd_write_rec()) snx11253-MDT0000-osd: modify index mismatch 2 34132

      This was the first case.

      (2) Lets imagine that second thread modify lgh_cur_idx/lgh_cur_offset right after this check

      if (idx != loghandle->lgh_cur_idx)

      then

      lgi->lgi_off = loghandle->lgh_cur_offset;

      lgi->lgi_off became 1000 instead of 2000. And the wrong modification will happen.

      Attachments

        Activity

          [LU-9714] Changelog consumer test reports 'Local llog found corrupted'
          pjones Peter Jones added a comment -

          Alex

          The functional change tracked here has landed to master. Do you still intend to land the accompanying test? If so, please could you rebase it so that it can complete testing and reviews?

          Thanks

          Peter

          pjones Peter Jones added a comment - Alex The functional change tracked here has landed to master. Do you still intend to land the accompanying test? If so, please could you rebase it so that it can complete testing and reviews? Thanks Peter

          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27838/
          Subject: LU-9714 llog: fix llog_process_thread race
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 52b693c588555c55dd44fe3a27a1bf8c8cccac31

          gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27838/ Subject: LU-9714 llog: fix llog_process_thread race Project: fs/lustre-release Branch: master Current Patch Set: Commit: 52b693c588555c55dd44fe3a27a1bf8c8cccac31

          Alexander Boyko (alexander.boyko@seagate.com) uploaded a new patch: https://review.whamcloud.com/29035
          Subject: LU-9714 test: checking llog_process_thread race
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: cd7ad26a3208ea69e393a632b3496ed77d767d52

          gerrit Gerrit Updater added a comment - Alexander Boyko (alexander.boyko@seagate.com) uploaded a new patch: https://review.whamcloud.com/29035 Subject: LU-9714 test: checking llog_process_thread race Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: cd7ad26a3208ea69e393a632b3496ed77d767d52

          user mode test, so it was

          lfs changelog_clear

          aboyko Alexander Boyko added a comment - user mode test, so it was lfs changelog_clear

          what kind of modification is used in that test? You've mentioned only 'clearing' of changelog, is it done with llog_cancel or you are using llog_write() to wipe these records entirely?

          tappro Mikhail Pershin added a comment - what kind of modification is used in that test? You've mentioned only 'clearing' of changelog, is it done with llog_cancel or you are using llog_write() to wipe these records entirely?

          Alexander Boyko (alexander.boyko@seagate.com) uploaded a new patch: https://review.whamcloud.com/27838
          Subject: LU-9714 llog: fix llog_process_thread race
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 8c8ffad1a2e01b371351bec82037958fe8b282b3

          gerrit Gerrit Updater added a comment - Alexander Boyko (alexander.boyko@seagate.com) uploaded a new patch: https://review.whamcloud.com/27838 Subject: LU-9714 llog: fix llog_process_thread race Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 8c8ffad1a2e01b371351bec82037958fe8b282b3

          People

            aboyko Alexander Boyko
            aboyko Alexander Boyko
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: