Lustre LU-6146: LFSCK waits forever because of a race condition when checking/setting cfs_fail_val

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.7.0
    • Lustre 2.7.0
    • 3
    • 17159

    Description

      There is a race condition in LFSCK when a failure stub is injected for testing. For example:

       764                 if (OBD_FAIL_CHECK(OBD_FAIL_LFSCK_DELAY2) &&
       765                     cfs_fail_val > 0) {
       766                         struct l_wait_info lwi;
       767 
       768                         lwi = LWI_TIMEOUT(cfs_time_seconds(cfs_fail_val),
       769                                           NULL, NULL);
       770                         l_wait_event(thread->t_ctl_waitq,
       771                                      !thread_is_running(thread),
       772                                      &lwi);
       773 
       774                         if (unlikely(!thread_is_running(thread))) {
       775                                 CDEBUG(D_LFSCK, "%s: scan dir exit for engine "
       776                                        "stop, parent "DFID", cookie "LPX64"\n",
       777                                        lfsck_lfsck2name(lfsck),
       778                                        PFID(lfsck_dto2fid(dir)),
       779                                        lfsck->li_cookie_dir);
       780                                 RETURN(0);
       781                         }
       782                 }
      

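      The value of "cfs_fail_val" may be changed by another thread after the check at line 765 but before it is used at line 768. If it is reset to 0 in that window, the timeout passed to LWI_TIMEOUT is 0, which l_wait_event treats as no timeout, so the LFSCK engine waits until someone runs "lfsck_stop".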
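One common way to close this kind of check-then-use window is to read the shared value once into a local variable and use only that copy afterwards. The following is a minimal, self-contained sketch of that idea, not the merged patch: `fail_val` and `fail_delay_snapshot()` are hypothetical names standing in for `cfs_fail_val` and the LFSCK delay check, and a C11 atomic is assumed in place of the kernel primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared fault-injection value (cfs_fail_val). */
atomic_uint fail_val;

/*
 * Hypothetical helper: returns true and stores a non-zero delay in
 * *delay_sec when an injected delay should be applied.
 *
 * The shared value is read exactly once, so the value that passes the
 * "non-zero" check is the same value later used to build the timeout.
 * A concurrent reset of fail_val to 0 can no longer produce a zero timeout.
 */
bool fail_delay_snapshot(unsigned int *delay_sec)
{
	unsigned int val = atomic_load(&fail_val);	/* single read */

	if (val == 0)
		return false;	/* no delay requested, *delay_sec untouched */

	*delay_sec = val;	/* caller uses this private copy only */
	return true;
}
```

A caller would build its timeout from `*delay_sec` rather than re-reading the shared variable, so a later change to the shared value only affects the next check.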

        Activity

          pjones Peter Jones added a comment - Landed for 2.7

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13481/
          Subject: LU-6146 tests: race condition for check/use cfs_fail_val
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: f6ef1b797f2f6b28e7c5860b6cf16759cadfc9a4

          yong.fan nasf (Inactive) added a comment - Recently there have been many failures of sanity-lfsck test_4. Some of them are caused by LU-6109/LU-6147; the others are caused by this ticket.

          adilger Andreas Dilger added a comment - I don't see any tests in Maloo that have been marked with this bug. I do see a large number of test failures due to LU-5121 that are not related to interop testing. Is this bug related to that?

          yong.fan nasf (Inactive) added a comment - This issue may cause many sanity-lfsck test failures, so we have to resolve it before Lustre 2.7 is released.

          gerrit Gerrit Updater added a comment -

          Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13481
          Subject: LU-6146 tests: race condition for check/use cfs_fail_val
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 868327e31d2d3c5b4aaaeb5a18aad34cc5e27b79

          People

            yong.fan nasf (Inactive)
            yong.fan nasf (Inactive)