Lustre / LU-4065

sanity-hsm test_300 failure: 'cdt state is not stopped'

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.5.0
    • Lustre master build # 1715
      OpenSFS cluster with a combined MGS/MDS, a single OSS with two OSTs, and three clients: one agent + client, one running robinhood/db + client, and one acting only as a Lustre client
    • 3
    • 10892

    Description

      The test results are at: https://maloo.whamcloud.com/test_sets/8e9cca2c-2c8b-11e3-85ee-52540035b04c

      From the client test_log:

      == sanity-hsm test 300: On disk coordinator state kept between MDT umount/mount == 14:22:47 (1380835367)
      Stop coordinator and remove coordinator state at mount
      mdt.scratch-MDT0000.hsm_control=shutdown
      Changed after 0s: from '' to 'stopping'
      Waiting 10 secs for update
      Updated after 8s: wanted 'stopped' got 'stopped'
      Failing mds1 on mds
      Stopping /lustre/scratch/mdt0 (opts:) on mds
      pdsh@c15: mds: ssh exited with exit code 1
      reboot facets: mds1
      Failover mds1 to mds
      14:23:15 (1380835395) waiting for mds network 900 secs ...
      14:23:15 (1380835395) network interface is UP
      mount facets: mds1
      Starting mds1:   /dev/sda3 /lustre/scratch/mdt0
      Started scratch-MDT0000
      c15: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 25 sec
      Changed after 0s: from '' to 'enabled'
      Waiting 20 secs for update
      Waiting 10 secs for update
      Update not seen after 20s: wanted 'stopped' got 'enabled'
       sanity-hsm test_300: @@@@@@ FAIL: cdt state is not stopped 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
        = /usr/lib64/lustre/tests/test-framework.sh:4291:error()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:298:cdt_check_state()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:3063:test_300()
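
      The "Changed after 0s" / "Waiting 10 secs for update" / "Update not seen after 20s" lines above come from the test framework's polling helper. A minimal sketch of that poll-and-compare pattern (illustrative names only, with a temp file standing in for "lctl get_param -n mdt.*.hsm_control"):

```shell
# Minimal sketch (not the actual test-framework code) of the poll-and-wait
# pattern behind log lines like "Waiting 10 secs for update" and
# "Update not seen after 20s": poll a parameter until it reaches the
# expected value or a deadline expires.
STATE_FILE=$(mktemp)
echo "enabled" > "$STATE_FILE"

wait_for_state() {
    expect=$1; deadline=$2; elapsed=0
    while [ "$elapsed" -lt "$deadline" ]; do
        cur=$(cat "$STATE_FILE")
        if [ "$cur" = "$expect" ]; then
            echo "Updated after ${elapsed}s: wanted '$expect' got '$cur'"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "Update not seen after ${deadline}s: wanted '$expect' got '$cur'"
    return 1
}

# Simulate the coordinator reaching 'stopped' after 2 seconds.
( sleep 2; echo "stopped" > "$STATE_FILE" ) &
wait_for_state stopped 10
status=$?
rm -f "$STATE_FILE"
```

      In the failing run, the polled value flips to 'enabled' instead of 'stopped' before the deadline, which is the symptom the comments below explain.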
      

      Attachments

        Activity


          Landed for 2.8.0

          jgmitter Joseph Gmitter (Inactive) added a comment

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12783/
          Subject: LU-4065 tests: hsm copytool_cleanup improvement
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 73bca6c1f4923cdf673fa11486aec04ec3576051

          gerrit Gerrit Updater added a comment

          There is a +1 from Andreas in Gerrit. Can somebody else do a review so this can move forward?

          scherementsev Sergey Cheremencev added a comment

          Yes, we hit this bug in a lot of sanity-hsm subtests: 402, 3, 106, ...
          In our case it was a race between copytool_cleanup and cdt_set_mount_state, so it was usually copytool_cleanup that failed.

          About why test_300 may fail (my view):

          1. cdt_set_mount_state sets the parameter using -P.
          2. "cdt_check_state stopped" waits for hsm_control to become "stopped" (after cdt_shutdown and cdt_clear_mount_state).
          3. At this moment (while "cdt_check_state stopped" waits for "stopped"), the MGC retrieves and applies the configuration from the server with hsm_control=enabled, which was set in step 1.

          Also, did you find the detailed behavior you have described by analyzing the MGS/MDS nodes' debug log?

          yes

          About reproducing the problem: you may try to make a custom build with MGC_TIMEOUT_MIN_SECONDS = 10 or 15. If that does not break anything else, it may help.

          scherementsev Sergey Cheremencev added a comment

          Sergey, thanks for this comment and patch, but I wonder if you found the related issue/failure only when running the sanity-hsm/test_300 sub-test, or also with other usages of "cdt_set_mount_state()" within sanity-hsm? BTW, I don't understand why you refer to copytool_cleanup() in your description.

          Also, did you find the detailed behavior you have described by analyzing the MGS/MDS nodes' debug log?

          And your patch has failed to pass auto-tests due to unrelated failures; I think you need to rebase it to avoid such issues.

          Last, an auto-tests/Maloo query indicates that the issue for this ticket has been encountered only once (on 2014-11-19 15:49:17 UTC) during the past 4 months. I am presently analyzing that failure's associated MDS/MGS debug log, and in parallel I am trying to reproduce it in-house.

          bfaccini Bruno Faccini (Inactive) added a comment

          Hello, we faced the same error at Seagate and are now using the following solution: http://review.whamcloud.com/#/c/12783/.

          Short explanation:

          1. "cdt_set_mount_state enabled" sets the parameter on the server. It uses lctl set_param -P.
          2. copytool_cleanup sets hsm_control=shutdown and waits for hsm_control to become "stopped" (after "shutdown").
          3. At this moment (while copytool_cleanup waits for "stopped"), the MGC retrieves and applies the configuration from the server with hsm_control=enabled, which was set in step 1.
          4. The MGC starts log processing with a delay of about 10 seconds (see mgc_requeue_thread), so copytool_cleanup waits 20 seconds and gets hsm_control==enabled because the parameter was modified in step 3.
          scherementsev Sergey Cheremencev added a comment
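
          The four steps above can be simulated without Lustre. This is a hypothetical sketch (timings and names are illustrative; a temp file stands in for mdt.*.hsm_control) in which a persistent "-P" setting is re-applied asynchronously, as mgc_requeue_thread would, clobbering the shutdown that copytool_cleanup just performed:

```shell
# Hypothetical race simulation: the delayed, persistent 'enabled' setting
# overwrites the 'stopped' state while the cleanup code is still polling.
PARAM=$(mktemp)

# Step 1: "cdt_set_mount_state enabled" was recorded persistently; the MGC
# processes the config log with a delay and re-applies it later.
( sleep 2; echo "enabled" > "$PARAM" ) &

# Step 2: copytool_cleanup shuts the coordinator down.
echo "stopped" > "$PARAM"

# Steps 3-4: copytool_cleanup polls for 'stopped', but the delayed MGC
# update lands in the middle of the wait, so the poll sees 'enabled'.
sleep 4
final=$(cat "$PARAM")
echo "final state: $final"
rm -f "$PARAM"
```

          Running this prints "final state: enabled" rather than "stopped", mirroring the "wanted 'stopped' got 'enabled'" failure in the test log.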

          James, according to this auto-test failure report it is unclear to me how this happened ...
          BTW, I ran a Maloo/auto-tests search for sanity-hsm/test_300 failures, and it seems the test runs pretty well: the failure you reported is the only one in the past 2 months:

          Name            Status          Run at                  Duration        Return code     Error                                           Test log        Bugs
          test_300        FAIL    2014-09-05 01:47:58 UTC         31              1               Restart of mds1 failed!                         Preview 10
          test_300        FAIL    2014-09-02 13:14:59 UTC         80              1               hsm_control state is not 'stopped' on mds1      Preview 10
          test_300        FAIL    2014-08-27 01:19:32 UTC         107             1               post-failover df: 1                             Preview 10
          test_300        FAIL    2014-08-21 19:55:39 UTC         714             1               import is not in FULL state                     Preview 10      LU-4018
          test_300        FAIL    2014-07-01 22:43:31 UTC         599             1               hsm_control state is not 'stopped' on mds1      Preview 10
          test_300        FAIL    2014-07-01 10:47:59 UTC         107             1               hsm_control state is not 'stopped' on mds1      Preview 10
          test_300        FAIL    2014-06-21 14:32:23 UTC         723             1               import is not in FULL state                     Preview 10      LU-4018
          test_300        FAIL    2014-06-02 17:16:48 UTC         331             1               hsm_control state is not 'stopped' on mds1      Preview 10
          test_300        FAIL    2014-05-28 05:46:52 UTC         714             1               import is not in FULL state                     Preview 10      LU-4018
          test_300        FAIL    2014-05-27 09:56:15 UTC         975             1               import is not in FULL state                     Preview 10      LU-4018
          test_300        FAIL    2014-05-22 19:10:52 UTC         105             1               hsm_control state is not 'stopped' on mds1      Preview 10      LU-4065
          test_300        FAIL    2014-05-08 18:55:20 UTC         716             1               import is not in FULL state                     Preview 10      LU-4018
          test_300        FAIL    2014-05-07 22:34:02 UTC         746             1               import is not in FULL state                     Preview 10      LU-4018
          test_300        FAIL    2014-03-26 22:02:07 UTC         77              1               hsm_control state is not 'stopped' on mds1      Preview 10      LU-4125, LU-4065
          test_300        FAIL    2014-03-11 20:01:49 UTC         79              1               hsm_control state is not 'stopped' on mds1      Preview 10      LU-4065, LU-4065
          test_300        FAIL    2013-12-08 19:39:12 UTC         755             1               import is not in FULL state                                     LU-4018, LU-4361
          
          bfaccini Bruno Faccini (Inactive) added a comment

          People

            bfaccini Bruno Faccini (Inactive)
            jamesanunez James Nunez (Inactive)
            Votes: 0
            Watchers: 13
