[LU-6146] LFSCK fall into wait for ever because of race condition when check/set cfs_fail_val Created: 21/Jan/15  Updated: 25/Jan/15  Resolved: 25/Jan/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Critical
Reporter: nasf (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: HB

Severity: 3
Rank (Obsolete): 17159

 Description   

There is a race condition in LFSCK when a failure stub is injected for testing. For example:

 764                 if (OBD_FAIL_CHECK(OBD_FAIL_LFSCK_DELAY2) &&
 765                     cfs_fail_val > 0) {
 766                         struct l_wait_info lwi;
 767 
 768                         lwi = LWI_TIMEOUT(cfs_time_seconds(cfs_fail_val),
 769                                           NULL, NULL);
 770                         l_wait_event(thread->t_ctl_waitq,
 771                                      !thread_is_running(thread),
 772                                      &lwi);
 773 
 774                         if (unlikely(!thread_is_running(thread))) {
 775                                 CDEBUG(D_LFSCK, "%s: scan dir exit for engine "
 776                                        "stop, parent "DFID", cookie "LPX64"\n",
 777                                        lfsck_lfsck2name(lfsck),
 778                                        PFID(lfsck_dto2fid(dir)),
 779                                        lfsck->li_cookie_dir);
 780                                 RETURN(0);
 781                         }
 782                 }

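The "cfs_fail_val" may be changed by another thread after the check at line 765 but before it is used at line 768. If it is reset to 0 in that window, the timeout built at line 768 is 0, which l_wait_event treats as no timeout, so the LFSCK engine will wait until someone runs "lfsck_stop".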
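
The check-then-use window can be reproduced in a small standalone sketch. This is not Lustre code and not necessarily how the landed patch fixes it: "demo_fail_val", "racy_timeout", "snapshot_timeout" and the "between" callback are hypothetical names, and the callback stands in for another thread changing the value between the check and the use. The sketch only illustrates one common remedy, reading the shared value once into a local variable and using that copy for both the check and the timeout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared cfs_fail_val. */
static volatile unsigned int demo_fail_val;

/* Simulates another thread running between the check and the use. */
typedef void (*interleave_fn)(void);

static void reset_fail_val(void)
{
	demo_fail_val = 0;
}

/*
 * Racy pattern: the shared value is read once for the check and again
 * for the timeout. Returns the timeout in seconds that would be used,
 * or -1 if the delay is skipped.
 */
static long racy_timeout(interleave_fn between)
{
	if (demo_fail_val > 0) {
		if (between != NULL)
			between();
		return (long)demo_fail_val;	/* second read, may now be 0 */
	}
	return -1;
}

/* Snapshot pattern: read the shared value once, then check and use the copy. */
static long snapshot_timeout(interleave_fn between)
{
	unsigned int val = demo_fail_val;

	if (val > 0) {
		if (between != NULL)
			between();
		return (long)val;	/* unaffected by later changes */
	}
	return -1;
}

/* Small usage example comparing the two patterns. */
void demo_compare(void)
{
	demo_fail_val = 5;
	assert(racy_timeout(reset_fail_val) == 0);	/* would wait with no timeout */

	demo_fail_val = 5;
	assert(snapshot_timeout(reset_fail_val) == 5);	/* keeps the checked value */
}
```

With the snapshot, the value that passed the "> 0" check is the same value used for the timeout, so a concurrent reset cannot turn a bounded delay into an unbounded wait.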



 Comments   
Comment by Gerrit Updater [ 21/Jan/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13481
Subject: LU-6146 tests: race condition for check/use cfs_fail_val
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 868327e31d2d3c5b4aaaeb5a18aad34cc5e27b79

Comment by nasf (Inactive) [ 21/Jan/15 ]

This issue may cause many sanity-lfsck test failures, so we have to resolve it before Lustre 2.7 is released.

Comment by Andreas Dilger [ 21/Jan/15 ]

I don't see any tests in Maloo that have been marked with this bug. I do see a large number of test failures due to LU-5121 that are not related to interop testing. Is this bug related to that?

Comment by nasf (Inactive) [ 23/Jan/15 ]

Recently there have been many failures of sanity-lfsck test_4. Some of them are caused by LU-6109/LU-6147; the others are caused by this ticket.

Comment by Gerrit Updater [ 25/Jan/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13481/
Subject: LU-6146 tests: race condition for check/use cfs_fail_val
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f6ef1b797f2f6b28e7c5860b6cf16759cadfc9a4

Comment by Peter Jones [ 25/Jan/15 ]

Landed for 2.7

Generated at Sat Feb 10 01:57:40 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.