
sanity-hsm test_300 failure: 'cdt state is not stopped'

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.5.0
    • Lustre master build # 1715
      OpenSFS cluster with combined MGS/MDS, single OSS with two OSTs, and three clients: one agent + client, one running robinhood/db + client, and one running only as a Lustre client
    • 3
    • 10892

    Description

      The test results are at: https://maloo.whamcloud.com/test_sets/8e9cca2c-2c8b-11e3-85ee-52540035b04c

      From the client test_log:

      == sanity-hsm test 300: On disk coordinator state kept between MDT umount/mount == 14:22:47 (1380835367)
      Stop coordinator and remove coordinator state at mount
      mdt.scratch-MDT0000.hsm_control=shutdown
      Changed after 0s: from '' to 'stopping'
      Waiting 10 secs for update
      Updated after 8s: wanted 'stopped' got 'stopped'
      Failing mds1 on mds
      Stopping /lustre/scratch/mdt0 (opts:) on mds
      pdsh@c15: mds: ssh exited with exit code 1
      reboot facets: mds1
      Failover mds1 to mds
      14:23:15 (1380835395) waiting for mds network 900 secs ...
      14:23:15 (1380835395) network interface is UP
      mount facets: mds1
      Starting mds1:   /dev/sda3 /lustre/scratch/mdt0
      Started scratch-MDT0000
      c15: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 25 sec
      Changed after 0s: from '' to 'enabled'
      Waiting 20 secs for update
      Waiting 10 secs for update
      Update not seen after 20s: wanted 'stopped' got 'enabled'
       sanity-hsm test_300: @@@@@@ FAIL: cdt state is not stopped 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
        = /usr/lib64/lustre/tests/test-framework.sh:4291:error()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:298:cdt_check_state()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:3063:test_300()
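The log above follows a poll-with-timeout pattern: the test keeps reading hsm_control and gives up after 20 seconds. The following is only a minimal sketch of that pattern, not the actual test-framework.sh code. The names `poll_until` and `read_state` are hypothetical, and `read_state` is a stub standing in for reading the parameter on the MDS.

```shell
#!/bin/bash
# Hypothetical helper (not from test-framework.sh): run a command repeatedly
# until it prints the wanted value or the timeout in seconds expires.
poll_until() {
    local cmd=$1 wanted=$2 timeout=$3 interval=${4:-1}
    local waited=0 got

    while true; do
        got=$($cmd)
        if [ "$got" = "$wanted" ]; then
            echo "Updated after ${waited}s: wanted '$wanted' got '$got'"
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            echo "Update not seen after ${waited}s: wanted '$wanted' got '$got'"
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}

# Stub standing in for reading hsm_control. It always reports 'enabled',
# which is the situation seen after the MDT remount in the log.
read_state() { echo enabled; }

poll_until read_state stopped 2 1 || echo "FAIL: cdt state is not stopped"
```

With the stub always returning 'enabled', the call times out and reports the failure, as the real test did. Polling like this can only observe the state; it cannot tell whether 'enabled' was never cleared or was set again after the shutdown.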
      

        Activity

          [LU-4065] sanity-hsm test_300 failure: 'cdt state is not stopped'
          mdiep Minh Diep made changes -
          Link New: This issue is related to LDEV-243 [ LDEV-243 ]
          mdiep Minh Diep made changes -
          Link Original: This issue is related to LDEV-342 [ LDEV-342 ]
          mdiep Minh Diep made changes -
          Link New: This issue is related to LDEV-342 [ LDEV-342 ]
          mdiep Minh Diep made changes -
          Link New: This issue is related to JFC-17 [ JFC-17 ]
          bfaccini Bruno Faccini (Inactive) made changes -
          Link New: This issue is related to SGI-288 [ SGI-288 ]
          jgmitter Joseph Gmitter (Inactive) made changes -
          Fix Version/s New: Lustre 2.8.0 [ 11113 ]
          Resolution New: Fixed [ 1 ]
          Status Original: Reopened [ 4 ] New: Resolved [ 5 ]

          jgmitter Joseph Gmitter (Inactive) added a comment - Landed for 2.8.0

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12783/
          Subject: LU-4065 tests: hsm copytool_cleanup improvement
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 73bca6c1f4923cdf673fa11486aec04ec3576051

          scherementsev Sergey Cheremencev added a comment - There is a +1 from Andreas in Gerrit. Could somebody else review the patch so it can move forward?

          scherementsev Sergey Cheremencev added a comment -

          Yes, we hit this bug in many sanity-hsm subtests: 402, 3, 106, ...
          In our case it was a race between copytool_cleanup and cdt_mount_state, so it was usually copytool_cleanup that failed.

          Why test_300 may fail, in my view:

          1. cdt_set_mount_state sets the parameter using -P.
          2. "cdt_check_state stopped" waits for hsm_control to become "stopped" (after cdt_shutdown and cdt_clear_mount_state).
          3. While "cdt_check_state stopped" is still waiting for "stopped", the MGC retrieves and applies the configuration from the server with hsm_control=enabled, which was set in step 1.

          Also, did you find the detailed behavior you have described by analyzing MGS/MDS nodes debug-log?

          Yes.

          About reproducing the problem: you could try a custom build with MGC_TIMEOUT_MIN_SECONDS = 10 or 15. If that does not break something else, it may help.
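The three-step sequence in the comment above can be illustrated with a toy model. This is only a sketch of the suspected ordering problem. It runs no Lustre commands, and every function and variable name in it is hypothetical.

```shell
#!/bin/bash
# Toy model of the suspected race. No Lustre commands are involved and
# every name below is hypothetical.
local_state=""      # what the MDT currently reports for hsm_control
pending_update=""   # persistent setting not yet delivered to the MDT

# Step 1: a persistent setting is recorded on the configuration server,
# but it reaches the target only when the config client next refreshes.
set_persistent() { pending_update=$1; }

# Step 2: the test shuts the coordinator down locally.
shutdown_cdt() { local_state=stopped; }

# Step 3: the delayed refresh applies whatever is still pending.
config_refresh() {
    if [ -n "$pending_update" ]; then
        local_state=$pending_update
        pending_update=""
    fi
}

set_persistent enabled
shutdown_cdt
echo "after shutdown: $local_state"
config_refresh
echo "after refresh:  $local_state"
```

In this model the state reads 'stopped' right after the shutdown and 'enabled' once the delayed refresh runs. That matches the reported symptom, where a check for 'stopped' finds 'enabled' instead. A longer refresh delay widens the window in which this ordering can occur, which is presumably why raising MGC_TIMEOUT_MIN_SECONDS was suggested as a way to reproduce it.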

          People

            bfaccini Bruno Faccini (Inactive)
            jamesanunez James Nunez (Inactive)
            Votes: 0
            Watchers: 13
