[LU-5241] 2.4.3<->2.5.2 interop: sanity-lfsck test_0: FAIL: (9) Expect 'completed', but got 'scanning-phase1' Created: 21/Jun/14  Updated: 14/Jun/15  Resolved: 04/Dec/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.2, Lustre 2.4.3
Fix Version/s: Lustre 2.5.4

Type: Bug Priority: Critical
Reporter: Jian Yu Assignee: Emoly Liu
Resolution: Fixed Votes: 0
Labels: mn4
Environment:

Lustre client build: https://build.hpdd.intel.com/job/lustre-b2_4/73/ (2.4.3)
Lustre server build: https://build.hpdd.intel.com/job/lustre-b2_5/74/ (2.5.2 RC2)


Issue Links:
Blocker
is blocked by LU-5248 Test failure on sanity-lfsck.sh, subt... Resolved
Severity: 3
Rank (Obsolete): 14614

 Description   

sanity-lfsck test 0 failed as follows:

Started LFSCK on the device lustre-MDT0000: namespace.
CMD: shadow-4vm4 /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: scanning-phase1
flags:
param:
time_since_last_completed: N/A
time_since_latest_start: 1 seconds
time_since_last_checkpoint: N/A
latest_start_position: 13, N/A, N/A
last_checkpoint_position: N/A, N/A, N/A
first_failure_position: N/A, N/A, N/A
checked_phase1: 0
checked_phase2: 0
updated_phase1: 0
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
dirs: 0
M-linked: 0
nlinks_repaired: 0
lost_found: 0
success_count: 0
run_time_phase1: 0 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 0 items/sec
average_speed_phase2: N/A
real-time_speed_phase1: 0 items/sec
real-time_speed_phase2: N/A
current_position: 12, N/A, N/A
CMD: shadow-4vm4 /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace
CMD: shadow-4vm4 /usr/sbin/lctl lfsck_stop -M lustre-MDT0000
Stopped LFSCK on the device lustre-MDT0000.
CMD: shadow-4vm4 /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace
CMD: shadow-4vm4 /usr/sbin/lctl lfsck_start -M lustre-MDT0000 -t namespace
Started LFSCK on the device lustre-MDT0000: namespace.
CMD: shadow-4vm4 /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace
CMD: shadow-4vm4 /usr/sbin/lctl set_param fail_loc=0
fail_loc=0
CMD: shadow-4vm4 /usr/sbin/lctl set_param fail_val=0
fail_val=0
CMD: shadow-4vm4 /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace
 sanity-lfsck test_0: @@@@@@ FAIL: (9) Expect 'completed', but got 'scanning-phase1' 

Maloo report: https://maloo.whamcloud.com/test_sets/8d110b74-f903-11e3-9283-52540035b04c



 Comments   
Comment by Jian Yu [ 21/Jun/14 ]

Hi Nasf,

Could you please take a look at the failure to see whether this is an issue on the Lustre b2_5 side? Thanks.

Comment by nasf (Inactive) [ 22/Jun/14 ]

The failure is related to the following test script:

        do_facet $SINGLEMDS $LCTL set_param fail_loc=0
        do_facet $SINGLEMDS $LCTL set_param fail_val=0
        sleep 3
        STATUS=$($SHOW_NAMESPACE | awk '/^status/ { print $2 }')
        [ "$STATUS" == "completed" ] ||
                error "(9) Expect 'completed', but got '$STATUS'"

From the test log, I cannot find anything abnormal that would indicate a Lustre bug. Instead, I suspect the failure is caused by the "sleep 3". Three seconds is an estimate of the average time LFSCK needs to finish scanning, but that estimate can be thrown off by various factors, such as VM scheduling delays. We have improved the test script in master as follows:

        do_facet $SINGLEMDS $LCTL set_param fail_loc=0 fail_val=0
        wait_update_facet $SINGLEMDS "$LCTL get_param -n \
                mdd.${MDT_DEV}.lfsck_namespace |
                awk '/^status/ { print \\\$2 }'" "completed" 6 || {
                $SHOW_NAMESPACE
                error "(9) unexpected status"
        }

So, if possible, we should back-port the patch http://review.whamcloud.com/9704 to b2_5 and b2_4. That patch improves the sanity-scrub and sanity-lfsck test scripts.
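
For illustration only, the difference between a fixed sleep and polling can be sketched in plain shell. This is a minimal sketch, not the code of wait_update_facet from the Lustre test framework: the helper name poll_until, its arguments, and its one-second polling interval are all assumptions made for this example. It re-runs a caller-supplied command until the output matches the expected value or the timeout expires, so a slow VM delays the test instead of failing it.

```shell
# Hypothetical helper (not part of Lustre's test-framework.sh):
# re-run a command once per second until its output equals the
# expected value, or give up after a timeout given in seconds.
poll_until() {
    local cmd=$1 expected=$2 timeout=$3
    local elapsed=0 result

    while :; do
        # Run the caller-supplied command and capture its output.
        result=$(eval "$cmd")
        [ "$result" = "$expected" ] && return 0

        # Stop once the time budget is used up.
        [ "$elapsed" -ge "$timeout" ] && break

        sleep 1
        elapsed=$((elapsed + 1))
    done

    echo "timed out after ${timeout}s: expected '$expected', got '$result'" >&2
    return 1
}
```

On a real MDS this might be called as `poll_until "lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace | awk '/^status/ { print \$2 }'" completed 6`, reusing the command and the value 6 from the scripts above (assuming that value is a timeout in seconds). Unlike "sleep 3", the call returns as soon as the status reads 'completed' and only fails after the whole timeout has passed.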

Comment by Peter Jones [ 22/Jun/14 ]

Thanks Fanyong. Emoly, could you please make the appropriate change to the test for b2_5 and b2_4?

Comment by Emoly Liu [ 27/Jun/14 ]

The backported patch to b2_5 is here: http://review.whamcloud.com/#/c/10818/
The backported patch to b2_4 is here: http://review.whamcloud.com/#/c/10892/

Comment by Emoly Liu [ 22/Aug/14 ]

This issue is blocked by LU-5248. We should land that fix first.

Comment by Jian Yu [ 20/Sep/14 ]

The back-ported patch for Lustre b2_5 branch http://review.whamcloud.com/10818 was updated to depend on http://review.whamcloud.com/11006, which is the patch for LU-5248.

Comment by Gerrit Updater [ 04/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/10818/
Subject: LU-5241 tests: speed up sanity-lfsck and sanity-scrub tests
Project: fs/lustre-release
Branch: b2_5
Current Patch Set:
Commit: 2ab8b98ea5dafbce59043e5d8477e794197116a0

Comment by Peter Jones [ 04/Dec/14 ]

Landed for 2.5.4
