[LU-2266] recovery-small test 27 waits for wrong condition Created: 02/Nov/12  Updated: 31/Jan/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.4, Lustre 2.4.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Oleg Drokin Assignee: Oleg Drokin
Resolution: Unresolved Votes: 0
Labels: patch

Issue Links:
Duplicate
duplicates LU-5965 recovery-small 27 looks works incorre... Resolved
Severity: 3
Rank (Obsolete): 5420

 Description   

Long ago, the patch for bug 23542 that made test 27 time-bound introduced an error that disables the test most of the time and potentially introduces unknown side effects for later tests:

@@ -725,12 +725,8 @@ test_27() {
 #define OBD_FAIL_OSC_SHUTDOWN            0x407
        do_facet $SINGLEMDS lctl set_param fail_loc=0x80000407
        # need to wait for reconnect
-       echo -n waiting for fail_loc
-       while [ $(do_facet $SINGLEMDS lctl get_param -n fail_loc) -eq -214748261
-           sleep 1
-           echo -n .
-       done
-       do_facet $SINGLEMDS lctl get_param -n fail_loc
+       echo waiting for fail_loc
+       wait_update_facet $SINGLEMDS "lctl get_param -n fail_loc" "-2147482617"

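Clearly the wait should be for 3221226503, which is 0xc0000407 (= 0x80000407 + 0x40000000, where 0x40000000 is CFS_FAILED, set once the test has triggered). The value -2147482617 used in the patch is just 0x80000407 read as a signed 32-bit integer, i.e. the fail_loc as it was originally set.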
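The arithmetic above can be checked with a minimal bash sketch. The variable names are illustrative only and do not come from the test suite; the two hex constants are the ones quoted in this ticket, and bash's 64-bit integer arithmetic is assumed.

```shell
#!/bin/bash
# Illustrative names only; the constants are those quoted in this ticket.
fail_loc_set=$((0x80000407))   # value written by "lctl set_param fail_loc=0x80000407"
cfs_failed=$((0x40000000))     # CFS_FAILED, added once the fail_loc has triggered

# Value to wait for after the fail_loc has fired.
expected=$((fail_loc_set + cfs_failed))
printf 'expected: %d (0x%x)\n' "$expected" "$expected"
# prints: expected: 3221226503 (0xc0000407)

# The untriggered value read as a signed 32-bit integer,
# which is what the patched test waits for.
signed_set=$((fail_loc_set - (1 << 32)))
printf 'signed 32-bit view of 0x%x: %d\n' "$fail_loc_set" "$signed_set"
# prints: signed 32-bit view of 0x80000407: -2147482617
```

In other words, the patched test waits for the value it has just written rather than for the value that indicates the fail_loc was hit.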

I found this after a bizarre failure of test 27 like this:

14:53:22 (1351623202) network interface is UP
Starting mds1:   -o loop /tmp/lustre-mdt1 /mnt/mds1
Started lustre-MDT0000
fail_loc=0x80000407
waiting for fail_loc
Waiting 90 secs for update
Waiting 80 secs for update
Waiting 70 secs for update
Waiting 60 secs for update
Waiting 50 secs for update
Waiting 40 secs for update
Waiting 30 secs for update
Waiting 20 secs for update
Waiting 10 secs for update
Update not seen after 90s: wanted '-2147482617' got '3221226503'
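The timeout in the log above comes from a time-bound poll that compares the command output against a fixed expected string. A standalone sketch of that pattern follows; `wait_for_value` is a hypothetical helper written for illustration, not the real `wait_update_facet` from the Lustre test framework, and its arguments (command, wanted value, timeout, interval) are assumptions.

```shell
#!/bin/bash
# Hypothetical helper for illustration; NOT the real wait_update_facet.
# Runs a command repeatedly until its output equals the wanted string
# or the timeout (in seconds) expires.
wait_for_value() {
    local cmd=$1 wanted=$2 timeout=${3:-90} interval=${4:-1}
    local waited=0 got

    while :; do
        got=$(eval "$cmd")
        [ "$got" = "$wanted" ] && return 0          # condition met
        [ "$waited" -ge "$timeout" ] && break       # time budget used up
        sleep "$interval"
        waited=$((waited + interval))
    done

    echo "Update not seen after ${waited}s: wanted '$wanted' got '$got'" >&2
    return 1
}

# Stand-in for "lctl get_param -n fail_loc" after the fail_loc has triggered.
read_fail_loc() { echo 3221226503; }

wait_for_value read_fail_loc 3221226503 5 1 && echo "fail_loc triggered"
```

With the wanted value set to -2147482617 instead, the same call can never succeed once the fail_loc has fired, which matches the failure shown in the log.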


 Comments   
Comment by Oleg Drokin [ 02/Nov/12 ]

Patch in http://review.whamcloud.com/#change,4451

Comment by Oleg Drokin [ 06/Nov/12 ]

Hm, it seems the problem actually runs deeper here.

Not only was the value wrong, but there is also a race between the MDS OSC reconnect and setting fail_loc, which could lead to this patch never hitting anything at all.

Comment by Oleg Drokin [ 06/Nov/12 ]

OK, looking at the test and a bit of the history of the bug (bz5949), I must admit I don't fully understand what's going on, but I know how to replicate what's needed.
