[LU-5075] Test failure on test suite sanity-lfsck, subtest test_8 Fail to start LFSCK for namespace! Created: 17/May/14 Updated: 25/Aug/14 Resolved: 25/Aug/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 14009 | ||||||||
| Description |
|
This issue was created by maloo for wangdi <di.wang@intel.com>. This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/ffe8ec08-dd4b-11e3-9f1e-52540035b04c. The sub-test test_8 failed with the following error:
Info required for matching: sanity-lfsck 8 |
| Comments |
| Comment by Andreas Dilger [ 20/May/14 ] |
|
Looks like MDT0001 was evicted by MDT0000 just before the LFSCK started, which caused it to fail. |
| Comment by nasf (Inactive) [ 31/Jul/14 ] |
|
The eviction is not the root cause of the test_8 failure. test_8 only starts the namespace LFSCK on MDT0, so as long as MDT0 works well, the test should pass.

    echo "stop $SINGLEMDS"
    stop $SINGLEMDS > /dev/null || error "(18) Fail to stop MDS!"

    #define OBD_FAIL_LFSCK_NO_AUTO 0x160b
    (1) ===> do_facet $SINGLEMDS $LCTL set_param fail_loc=0x160b
    echo "start $SINGLEMDS"
    (2) ===> start $SINGLEMDS $MDT_DEVNAME $MOUNT_OPTS_SCRUB > /dev/null || error "(19) Fail to start MDS!"

    (3) ===> STATUS=$($SHOW_NAMESPACE | awk '/^status/ { print $2 }')
    [ "$STATUS" == "paused" ] || error "(20) Expect 'paused', but got '$STATUS'"

    #define OBD_FAIL_LFSCK_DELAY3 0x1602
    (4) ===> do_facet $SINGLEMDS $LCTL set_param fail_val=2 fail_loc=0x1602
    (5) ===> $START_NAMESPACE || error "(21) Fail to start LFSCK for namespace!"

As shown above, before restarting the MDS at point (2), the test sets fail_loc=0x160b at point (1) to prevent the LFSCK from being restarted automatically. However, the automatic LFSCK restart is triggered after recovery. There is a race condition: when the test reaches point (3), recovery may not have completed yet, so the check at point (3) passes, and the test then resets fail_loc to 0x1602 at point (4). If recovery completes after that reset, fail_loc=0x160b is no longer in effect, so the LFSCK is restarted automatically. When the test reaches point (5), it finds the LFSCK already running. I will make a patch to fix the race in the test scripts. |
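The ticket does not show the patch itself, so the following is only a sketch of one general way to close this kind of race: poll until recovery reports completion while fail_loc=0x160b is still set, and only then move on to points (3) and (4). The helper `wait_for_output`, the stand-in `fake_recovery_status`, and the status strings `RECOVERING` and `COMPLETE` are hypothetical names used for illustration; they are not taken from the Lustre test framework or from the landed fix.

```shell
# Hypothetical helper: run a command once per second until it prints the
# expected value, or give up after the given number of seconds.
wait_for_output() {
    local expected timeout elapsed actual
    expected=$1
    timeout=$2
    shift 2
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        actual=$("$@")
        [ "$actual" = "$expected" ] && return 0
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

# Stand-in for a real recovery-status query (assumption: the real test would
# ask the MDS instead). It reports COMPLETE from the third call onward. The
# call counter lives in a file because $(...) runs the function in a subshell.
POLL_COUNT_FILE=$(mktemp)
echo 0 > "$POLL_COUNT_FILE"
fake_recovery_status() {
    local n
    n=$(( $(cat "$POLL_COUNT_FILE") + 1 ))
    echo "$n" > "$POLL_COUNT_FILE"
    if [ "$n" -ge 3 ]; then
        echo COMPLETE
    else
        echo RECOVERING
    fi
}

# Only after this wait succeeds would the test check for 'paused' and
# change fail_loc, so the automatic restart cannot slip in between.
if wait_for_output COMPLETE 10 fake_recovery_status; then
    echo "recovery finished; safe to check status and reset fail_loc"
else
    echo "timed out waiting for recovery"
fi
```

With a wait like this placed between points (2) and (3), the 'paused' check at point (3) would reflect the state after recovery, which is when the automatic restart decision is made.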
| Comment by nasf (Inactive) [ 31/Jul/14 ] |
|
Here is the patch: |
| Comment by Jian Yu [ 19/Aug/14 ] |
|
More instances on master branch: This failure is blocking patches from passing review testing. |
| Comment by nasf (Inactive) [ 25/Aug/14 ] |
|
The patch has landed on master. |