sanity-hsm test 9A fails with “uuid Dumping not found in agent list on mds1”

    Description

      sanity-hsm test_9A fails with multiple errors in Ubuntu client testing. Looking at results starting 01 APRIL 2019, this test fails 100% of the time for Ubuntu 16.04 and ~73% of the time for Ubuntu 18.04. For master, this test has failed 100% of the time since April. For b2_12, it looks like something bad landed on or before 02 JULY 2019 with Lustre version 2.12.2.69, because this test has failed 100% of the time from that date until today; failures start with test session https://testing.whamcloud.com/test_sets/86cbddd4-9e26-11e9-8fc1-52540065bddc.

      When sanity-hsm test 9A fails, about 50 or more other sanity-hsm tests also fail and, in all cases, a later test eventually times out; see https://testing.whamcloud.com/test_sets/c3676bb2-eb26-11e9-b62b-52540065bddc or https://testing.whamcloud.com/test_sets/a8bded48-5db6-11e9-92fe-52540065bddc.

      Looking at the suite_log for https://testing.whamcloud.com/test_sets/c3676bb2-eb26-11e9-b62b-52540065bddc, we see

      trevis-63vm10: trevis-63vm10.trevis.whamcloud.com: executing libtool execute ps -C lhsmtool_posix -o args=
      trevis-63vm10: rpc.sh: line 21: libtool: command not found
       sanity-hsm test_9A: @@@@@@ FAIL: Found no Agent or with no mount-point  parameter 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:5864:error()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:859:get_agent_uuid()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:1177:test_9A()
        = /usr/lib64/lustre/tests/test-framework.sh:6166:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:6205:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:6051:run_test()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:1188:main()
      CMD: trevis-63vm10,trevis-63vm11,trevis-63vm12,trevis-63vm9.trevis.whamcloud.com /usr/sbin/lctl dk > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-ubuntu1804-x86_64--full--1_9__52___8b37fabf-7e63-43bb-bb1c-9ed34b31d532/sanity-hsm.test_9A.debug_log.\$(hostname -s).1570578420.log;
               dmesg > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-ubuntu1804-x86_64--full--1_9__52___8b37fabf-7e63-43bb-bb1c-9ed34b31d532/sanity-hsm.test_9A.dmesg.\$(hostname -s).1570578420.log
      CMD: trevis-63vm10,trevis-63vm11,trevis-63vm12,trevis-63vm9.trevis.whamcloud.com lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      CMD: trevis-63vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agents |		 grep Dumping
       sanity-hsm test_9A: @@@@@@ FAIL: uuid Dumping not found in agent list on mds1 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:5864:error()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:819:check_agent_registered_by_mdt()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:839:check_agent_registered()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:1178:test_9A()
        = /usr/lib64/lustre/tests/test-framework.sh:6166:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:6205:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:6051:run_test()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:1188:main()
      Dumping lctl log to /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-ubuntu1804-x86_64--full--1_9__52___8b37fabf-7e63-43bb-bb1c-9ed34b31d532/sanity-hsm.test_9A.*.1570578423.log
      CMD: trevis-63vm10,trevis-63vm11,trevis-63vm12,trevis-63vm9.trevis.whamcloud.com /usr/sbin/lctl dk > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-ubuntu1804-x86_64--full--1_9__52___8b37fabf-7e63-43bb-bb1c-9ed34b31d532/sanity-hsm.test_9A.debug_log.\$(hostname -s).1570578423.log;
               dmesg > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-ubuntu1804-x86_64--full--1_9__52___8b37fabf-7e63-43bb-bb1c-9ed34b31d532/sanity-hsm.test_9A.dmesg.\$(hostname -s).1570578423.log
      Resetting fail_loc on all nodes...CMD: trevis-63vm10,trevis-63vm11,trevis-63vm12,trevis-63vm9.trevis.whamcloud.com lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      done.
      CMD: trevis-63vm10 libtool execute pkill -x lhsmtool_posix
      trevis-63vm10: sh: libtool: command not found
      CMD: trevis-63vm10 rm -rf /tmp/arc1/sanity-hsm.test_9A/
      FAIL 9A (7s)
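
      One possible reading of the second failure, offered as an assumption rather than something this ticket confirms: the word "Dumping" in the error matches the first word of the "Dumping lctl log to ..." line, which is what a caller would end up with if the diagnostic output of a failed UUID lookup were captured as the UUID itself. The sketch below uses a hypothetical lookup_uuid function (a stand-in, not the real get_agent_uuid) to show that shell behavior and one way to guard against it.

```shell
#!/bin/bash
# Hypothetical stand-in for a helper that reports its failure on stdout.
lookup_uuid() {
	# Diagnostic text goes to stdout, so a caller that captures output gets it.
	echo "Dumping lctl log to /tmp/example.log"
	return 1
}

# Unchecked capture: the exit status is deliberately ignored here, so the
# variable ends up holding the diagnostic text instead of a UUID.
uuid=$(lookup_uuid) || true
first_word=${uuid%% *}
echo "would search the agent list for: $first_word"

# Checked capture: test the exit status before trusting the value.
if checked_uuid=$(lookup_uuid); then
	echo "uuid: $checked_uuid"
else
	checked_uuid=""
	echo "lookup failed; skipping the agent-list check"
fi
```

      If this reading is right, the "uuid Dumping not found" message is a follow-on symptom of the first failure rather than a separate problem.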
      

      In LU-12632, Hongchao looked at some of the sanity-hsm errors and said:

      On LDiskFS
      the related HSM archive operations are not started, and it could be caused by the absence of "libtool"

      CMD: onyx-34vm7 libtool --mode=e pkill -x lhsmtool_posix
      onyx-34vm7: sh: libtool: command not found
      CMD: onyx-34vm7 rm -rf /tmp/arc1/sanity-hsm.test_90/
      This means the previous copy tool can't be killed, which affects the following copy tool.

      As Hongchao points out, we do see the 'libtool command missing' message in the logs for the failed test sessions.
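
      The failure mode described above, where a command wrapped in libtool never runs because libtool itself is missing, can be illustrated with a small wrapper. This is a minimal sketch under stated assumptions: run_maybe_libtool and the LIBTOOL variable are hypothetical names, the helper is not part of test-framework.sh, and it is not the change that was made for this ticket.

```shell
#!/bin/bash
# Hypothetical wrapper, not part of test-framework.sh.
# LIBTOOL names the libtool program to look for (assumed variable name).
run_maybe_libtool() {
	local libtool_cmd=${LIBTOOL:-libtool}
	if command -v "$libtool_cmd" >/dev/null 2>&1; then
		# Same invocation style as in the logs: "libtool execute <cmd>".
		"$libtool_cmd" execute "$@"
	else
		# libtool is absent: run the command directly instead of failing.
		"$@"
	fi
}

# Force the fallback path by naming a program that does not exist.
LIBTOOL=no-such-libtool run_maybe_libtool echo "ran without libtool"
```

      With a wrapper like this, a cleanup step such as pkill would still run on a node without libtool, so a stale copytool would not be left behind for later tests.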

          Activity

            [LU-12870] sanity-hsm test 9A fails with “uuid Dumping not found in agent list on mds1”

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38822/
            Subject: LU-12870 build: sanity-hsm test depends on libtool
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: ddbfc253af323f41c9fd2301d4a0167b23252ad6

            jhammond John Hammond added a comment -

            > Is this a matter of adding libtool as a Requires (or Debian equivalent) to the package, and installing this on the test nodes? To be honest, I'm not thrilled about the requirement for this, maybe it is only for the test packages?

            adilger, this is not really needed. We just need to remove the libtool uses altogether. See LU-14034.

            pjones Peter Jones added a comment -

            Seems like this fix was landed

            gerrit Gerrit Updater added a comment -

            James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38822
            Subject: LU-12870 build: sanity-hsm test depends on libtool
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 7a15115683aea1a59f77b1e21a30b2ab78cfd085

            jamesanunez James Nunez (Inactive) added a comment -

            We're seeing sanity-hsm test 9A fail with the same errors for 2.12.4; https://testing.whamcloud.com/test_sets/4dc3aaa6-3414-11ea-b1e8-52540065bddc

            pjones Peter Jones added a comment -

            Landed for 2.14

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36471/
            Subject: LU-12870 build: sanity-hsm test depends on libtool
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: dbce727a3633ce03d24c28defce9a0ed6d1ef106

            gerrit Gerrit Updater added a comment -

            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36471
            Subject: LU-12870 build: sanity-hsm test depends on libtool
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: da105a8c4581220aaa4792f9ff518c734e74c767

            mdiep Minh Diep added a comment -

            we need libtool-bin

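            A missing tool like this is cheaper to catch before the suite starts than after dozens of tests have failed. The sketch below is a hypothetical preflight check: missing_tools is an illustrative name and is not part of the Lustre test framework. It only reports which of the named commands are not on PATH and does not install anything.

```shell
#!/bin/bash
# Hypothetical preflight helper, not part of the Lustre test framework.
# Prints each named command that is not found on PATH, one per line.
missing_tools() {
	local tool
	for tool in "$@"; do
		command -v "$tool" >/dev/null 2>&1 || echo "$tool"
	done
}

# Check for the commands the HSM tests are seen invoking in the logs.
missing=$(missing_tools sh libtool)
if [ -n "$missing" ]; then
	echo "missing before running sanity-hsm: $missing" >&2
fi
```

            On a node where the check reports libtool, installing the package named in the comment above (libtool-bin) would be the expected remedy on Ubuntu.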

            adilger Andreas Dilger added a comment -

            Is this a matter of adding libtool as a Requires (or Debian equivalent) to the package, and installing this on the test nodes? To be honest, I'm not thrilled about the requirement for this, maybe it is only for the test packages?

            I couldn't see an obvious patch that landed on April 1st that might have caused this, but it should be relatively straightforward to bisect the landings on that day to find why it is failing 100% of the time.

            People

              mdiep Minh Diep
              jamesanunez James Nunez (Inactive)