LU-6146: LFSCK waits forever because of a race condition when checking and setting cfs_fail_val

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.7.0
    • Lustre 2.7.0
    • 3
    • 17159

    Description

      There is a race condition in LFSCK when a failure stub is injected for testing. For example:

       764                 if (OBD_FAIL_CHECK(OBD_FAIL_LFSCK_DELAY2) &&
       765                     cfs_fail_val > 0) {
       766                         struct l_wait_info lwi;
       767 
       768                         lwi = LWI_TIMEOUT(cfs_time_seconds(cfs_fail_val),
       769                                           NULL, NULL);
       770                         l_wait_event(thread->t_ctl_waitq,
       771                                      !thread_is_running(thread),
       772                                      &lwi);
       773 
       774                         if (unlikely(!thread_is_running(thread))) {
       775                                 CDEBUG(D_LFSCK, "%s: scan dir exit for engine "
       776                                        "stop, parent "DFID", cookie "LPX64"\n",
       777                                        lfsck_lfsck2name(lfsck),
       778                                        PFID(lfsck_dto2fid(dir)),
       779                                        lfsck->li_cookie_dir);
       780                                 RETURN(0);
       781                         }
       782                 }
      

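      Another thread may change "cfs_fail_val" after the check at line 765 but before the value is used at line 768. If it has been reset to 0 in that window, LWI_TIMEOUT() is built with a zero timeout, which l_wait_event() treats as "no timeout". The LFSCK engine then stays in the wait until someone runs "lfsck_stop".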
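
The check-then-use window described above can be illustrated outside the kernel. The following is a minimal user-space sketch, not the patch applied for this ticket: `fail_val`, `delay_racy()`, `delay_snapshot()` and `reset_fail_val()` are hypothetical names, and the `between` hook merely stands in for another thread writing the global between the check and the use. It contrasts reading the global twice with reading it once into a local copy.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the global cfs_fail_val; any thread may write it. */
static atomic_uint fail_val;

/* Simulates a concurrent writer that clears the value inside the race window. */
static void reset_fail_val(void)
{
	atomic_store(&fail_val, 0);
}

/*
 * Racy pattern: the global is read once for the check and again for the use.
 * Returns true if the delay branch was entered; *timeout receives the value
 * that would be handed to the wait (0 is assumed to mean "no timeout").
 */
static bool delay_racy(void (*between)(void), unsigned int *timeout)
{
	if (atomic_load(&fail_val) > 0) {
		if (between != NULL)
			between();              /* another thread changes the value here */
		*timeout = atomic_load(&fail_val);  /* second read may now see 0 */
		return true;
	}
	return false;
}

/*
 * Snapshot pattern: the global is read exactly once, and both the check and
 * the use operate on the same local copy, so a later write cannot turn a
 * positive delay into an unbounded wait.
 */
static bool delay_snapshot(void (*between)(void), unsigned int *timeout)
{
	unsigned int val = atomic_load(&fail_val);

	if (val > 0) {
		if (between != NULL)
			between();              /* a concurrent write no longer matters */
		*timeout = val;
		return true;
	}
	return false;
}
```

With `fail_val` set to 5 and `reset_fail_val` passed as the hook, `delay_racy()` enters the branch but reports a timeout of 0, while `delay_snapshot()` reports 5. Whether the real code should snapshot the value or handle the zero case separately is a design decision for the LFSCK maintainers; the sketch only shows why a single read closes the window.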

          People

            yong.fan nasf (Inactive)
            yong.fan nasf (Inactive)