[LU-2266] recovery-small test 27 waits for wrong condition Created: 02/Nov/12 Updated: 31/Jan/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.4, Lustre 2.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Oleg Drokin | Assignee: | Oleg Drokin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | patch | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 5420 | ||||||||
| Description |
|
Long ago, the patch for bug 23542 that made test 27 time-bound introduced an error that disables the test most of the time and potentially introduces unknown side effects for later tests:
@@ -725,12 +725,8 @@ test_27() {
#define OBD_FAIL_OSC_SHUTDOWN 0x407
do_facet $SINGLEMDS lctl set_param fail_loc=0x80000407
# need to wait for reconnect
- echo -n waiting for fail_loc
- while [ $(do_facet $SINGLEMDS lctl get_param -n fail_loc) -eq -214748261
- sleep 1
- echo -n .
- done
- do_facet $SINGLEMDS lctl get_param -n fail_loc
+ echo waiting for fail_loc
+ wait_update_facet $SINGLEMDS "lctl get_param -n fail_loc" "-2147482617"
Clearly the wait should be for 3221226503, which is 0xc0000407 (= 0x80000407 + 0x40000000, where 0x40000000 is CFS_FAILED, set when the test has triggered). I found this after a bizarre failure of test 27 like this:
14:53:22 (1351623202) network interface is UP
Starting mds1: -o loop /tmp/lustre-mdt1 /mnt/mds1
Started lustre-MDT0000
fail_loc=0x80000407
waiting for fail_loc
Waiting 90 secs for update
Waiting 80 secs for update
Waiting 70 secs for update
Waiting 60 secs for update
Waiting 50 secs for update
Waiting 40 secs for update
Waiting 30 secs for update
Waiting 20 secs for update
Waiting 10 secs for update
Update not seen after 90s: wanted '-2147482617' got '3221226503' |
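The arithmetic behind the two values in the log can be checked with a few lines of shell. This is only an illustrative sketch, not part of the test suite; the variable names are made up here. It shows that the value the test waits for, -2147482617, is just 0x80000407 read as a signed 32-bit integer (that is, the fail_loc before it has triggered), while the value actually reported, 3221226503, is the same fail_loc with the 0x40000000 bit added.

```shell
# Illustrative variable names; only the constants come from the report.
armed=$((0x80000407))          # fail_loc value set by the test
failed_bit=$((0x40000000))     # CFS_FAILED, per the description above

# The same 32-bit pattern read as a signed integer
armed_signed32=$((armed - (1 << 32)))

# Value after the fail_loc has triggered
triggered=$((armed + failed_bit))

printf 'armed (unsigned):  %d\n' "$armed"
printf 'armed (signed 32): %d\n' "$armed_signed32"
printf 'triggered:         %d (0x%x)\n' "$triggered" "$triggered"
```

The second line printed matches the 'wanted' value in the log and the third matches the 'got' value, so the wait can only succeed while the fail_loc has not yet fired.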
| Comments |
| Comment by Oleg Drokin [ 02/Nov/12 ] |
| Comment by Oleg Drokin [ 06/Nov/12 ] |
|
Hm, it seems the problem actually runs deeper here. Not only was the value wrong, but additionally there is a race between mds osc reconnect and setting fail_loc, which could lead to this patch never hitting anything at all. |
| Comment by Oleg Drokin [ 06/Nov/12 ] |
|
OK, looking at the test and a bit of the history of the bug (bz5949), I must admit I don't fully understand what's going on, but I know how to replicate what's needed.