
recovery-small test 27 waits for wrong condition

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.1.4, Lustre 2.4.1
    • 3
    • 5420

    Description

      Long ago, the patch for bug 23542, which made test 27 time-bound, introduced an error that disables the test most of the time and may introduce unknown side effects in later tests:

      @@ -725,12 +725,8 @@ test_27() {
       #define OBD_FAIL_OSC_SHUTDOWN            0x407
              do_facet $SINGLEMDS lctl set_param fail_loc=0x80000407
              # need to wait for reconnect
      -       echo -n waiting for fail_loc
      -       while [ $(do_facet $SINGLEMDS lctl get_param -n fail_loc) -eq -214748261
      -           sleep 1
      -           echo -n .
      -       done
      -       do_facet $SINGLEMDS lctl get_param -n fail_loc
      +       echo waiting for fail_loc
      +       wait_update_facet $SINGLEMDS "lctl get_param -n fail_loc" "-2147482617"
      

      Clearly the wait should be for 3221226503, which is 0xc0000407 (= 0x80000407 + 0x40000000, where 0x40000000 is CFS_FAILED, set once the test has triggered).
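
      The arithmetic can be checked with a short bash sketch. The two constants are taken from the description above; the helper name to_signed32 is made up for illustration and is not part of test-framework.sh. It shows that the value the patched test waits for is just the originally set 0x80000407 read as a signed 32-bit integer, while the triggered value is what the log reports.

```shell
#!/bin/bash
# Values taken from the issue description.
FAIL_LOC_SET=0x80000407   # value written by test 27
CFS_FAILED=0x40000000     # bit set once the fail point has triggered

# Hypothetical helper: reinterpret a value as a signed 32-bit integer.
to_signed32() {
    local v=$(( $1 & 0xffffffff ))
    if (( v >= 0x80000000 )); then
        v=$(( v - 0x100000000 ))
    fi
    echo "$v"
}

# Value expected after the fail point has fired.
triggered=$(( FAIL_LOC_SET | CFS_FAILED ))

printf 'set value as signed 32-bit: %d\n' "$(to_signed32 "$FAIL_LOC_SET")"
printf 'triggered value, unsigned:  %d (0x%x)\n' "$triggered" "$triggered"
```

      The first line printed is the value the patched test waits for, the second is the value lctl actually returned in the failure below.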

      I found this after a bizarre failure of test 27 like this:

      14:53:22 (1351623202) network interface is UP
      Starting mds1:   -o loop /tmp/lustre-mdt1 /mnt/mds1
      Started lustre-MDT0000
      fail_loc=0x80000407
      waiting for fail_loc
      Waiting 90 secs for update
      Waiting 80 secs for update
      Waiting 70 secs for update
      Waiting 60 secs for update
      Waiting 50 secs for update
      Waiting 40 secs for update
      Waiting 30 secs for update
      Waiting 20 secs for update
      Waiting 10 secs for update
      Update not seen after 90s: wanted '-2147482617' got '3221226503'
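
      One way to make such a check independent of whether fail_loc is printed signed or unsigned is to test the CFS_FAILED bit directly instead of comparing against a fixed decimal string. This is only a sketch under that assumption: fail_loc_triggered is a hypothetical helper, not an existing test-framework.sh function, and it does not address any ordering problem between setting fail_loc and the reconnect.

```shell
#!/bin/bash
# Hypothetical predicate: succeeds if the CFS_FAILED bit (0x40000000, per the
# description) is set in the given fail_loc value. Accepts signed or unsigned
# decimal as well as hex input.
fail_loc_triggered() {
    local v=$(( $1 & 0xffffffff ))   # normalize to unsigned 32-bit
    (( (v & 0x40000000) != 0 ))
}

# The two values from the failure log above.
for value in -2147482617 3221226503; do
    if fail_loc_triggered "$value"; then
        echo "$value: triggered"
    else
        echo "$value: not triggered yet"
    fi
done
```

      In a real test the argument would come from "lctl get_param -n fail_loc" on the MDS facet, polled inside a time-bound loop.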
      

          Activity

            [LU-2266] recovery-small test 27 waits for wrong condition
            green Oleg Drokin added a comment -

            Ok, looking at the test and a bit of history of the bug (bz5949) I must admit I don't fully understand what's going on, but I know how to replicate what's needed

            green Oleg Drokin added a comment -

            Hm, it seems the problem actually runs deeper here.

            Not only was the value wrong, but additionally there is a race between mds osc reconnect and setting fail_loc, which could lead to this patch never hitting anything at all.

            green Oleg Drokin added a comment - patch in http://review.whamcloud.com/#change,4451

            People

              green Oleg Drokin
              green Oleg Drokin