Lustre / LU-12870

sanity-hsm test 9A fails with “uuid Dumping not found in agent list on mds1”

    Description

      sanity-hsm test_9A fails with multiple errors in Ubuntu client testing. Looking at results starting 01 APRIL 2019, this test fails 100% of the time for Ubuntu 16.04 and ~73% of the time for Ubuntu 18.04. For master, this test has failed 100% of the time since April. For b2_12, it looks like a regression landed on or before 02 JULY 2019 with Lustre version 2.12.2.69, because this test has failed 100% of the time from that date until today; the failures start with test session https://testing.whamcloud.com/test_sets/86cbddd4-9e26-11e9-8fc1-52540065bddc.

      When sanity-hsm test 9A fails, 50 or more other sanity-hsm tests also fail and, in every case, a later test eventually times out; see https://testing.whamcloud.com/test_sets/c3676bb2-eb26-11e9-b62b-52540065bddc or https://testing.whamcloud.com/test_sets/a8bded48-5db6-11e9-92fe-52540065bddc .

      Looking at the suite_log for https://testing.whamcloud.com/test_sets/c3676bb2-eb26-11e9-b62b-52540065bddc, we see:

      trevis-63vm10: trevis-63vm10.trevis.whamcloud.com: executing libtool execute ps -C lhsmtool_posix -o args=
      trevis-63vm10: rpc.sh: line 21: libtool: command not found
       sanity-hsm test_9A: @@@@@@ FAIL: Found no Agent or with no mount-point  parameter 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:5864:error()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:859:get_agent_uuid()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:1177:test_9A()
        = /usr/lib64/lustre/tests/test-framework.sh:6166:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:6205:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:6051:run_test()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:1188:main()
      CMD: trevis-63vm10,trevis-63vm11,trevis-63vm12,trevis-63vm9.trevis.whamcloud.com /usr/sbin/lctl dk > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-ubuntu1804-x86_64--full--1_9__52___8b37fabf-7e63-43bb-bb1c-9ed34b31d532/sanity-hsm.test_9A.debug_log.\$(hostname -s).1570578420.log;
               dmesg > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-ubuntu1804-x86_64--full--1_9__52___8b37fabf-7e63-43bb-bb1c-9ed34b31d532/sanity-hsm.test_9A.dmesg.\$(hostname -s).1570578420.log
      CMD: trevis-63vm10,trevis-63vm11,trevis-63vm12,trevis-63vm9.trevis.whamcloud.com lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      CMD: trevis-63vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agents |		 grep Dumping
       sanity-hsm test_9A: @@@@@@ FAIL: uuid Dumping not found in agent list on mds1 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:5864:error()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:819:check_agent_registered_by_mdt()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:839:check_agent_registered()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:1178:test_9A()
        = /usr/lib64/lustre/tests/test-framework.sh:6166:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:6205:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:6051:run_test()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:1188:main()
      Dumping lctl log to /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-ubuntu1804-x86_64--full--1_9__52___8b37fabf-7e63-43bb-bb1c-9ed34b31d532/sanity-hsm.test_9A.*.1570578423.log
      CMD: trevis-63vm10,trevis-63vm11,trevis-63vm12,trevis-63vm9.trevis.whamcloud.com /usr/sbin/lctl dk > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-ubuntu1804-x86_64--full--1_9__52___8b37fabf-7e63-43bb-bb1c-9ed34b31d532/sanity-hsm.test_9A.debug_log.\$(hostname -s).1570578423.log;
               dmesg > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-ubuntu1804-x86_64--full--1_9__52___8b37fabf-7e63-43bb-bb1c-9ed34b31d532/sanity-hsm.test_9A.dmesg.\$(hostname -s).1570578423.log
      Resetting fail_loc on all nodes...CMD: trevis-63vm10,trevis-63vm11,trevis-63vm12,trevis-63vm9.trevis.whamcloud.com lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      done.
      CMD: trevis-63vm10 libtool execute pkill -x lhsmtool_posix
      trevis-63vm10: sh: libtool: command not found
      CMD: trevis-63vm10 rm -rf /tmp/arc1/sanity-hsm.test_9A/
      FAIL 9A (7s)
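
One possible reading of the second failure in the log above ("uuid Dumping not found"): the first failure happens inside get_agent_uuid(), and if the caller collects the UUID through command substitution, the error handler's "Dumping lctl log to ..." text could be captured in place of a UUID. This is an assumption, not something confirmed against sanity-hsm.sh. The sketch below reproduces only the shell behavior; fake_error, fake_get_agent_uuid and first_word are hypothetical stand-ins, not the real test-framework functions.

```shell
#!/bin/bash
# Hypothetical stand-ins for the test-framework helpers; names and bodies
# are illustrative only, not the real sanity-hsm.sh code.

# Simulates an error handler that prints a diagnostic line to stdout.
fake_error() {
    echo "Dumping lctl log to /tmp/example.log"
}

# Simulates a lookup that finds no agent and reports through the handler.
fake_get_agent_uuid() {
    fake_error
}

# Echoes only its first argument, like a function that reads "$1" as the UUID.
first_word() {
    echo "$1"
}

# Command substitution captures the diagnostic text as if it were a UUID.
uuid=$(fake_get_agent_uuid)

# Unquoted expansion splits on whitespace, so only the first word is passed on.
captured=$(first_word $uuid)
echo "uuid passed to the next check: $captured"
```

If this reading is right, the "uuid Dumping" error is a side effect of the earlier "Found no Agent" failure rather than a separate problem.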
      

      In LU-12632, Hongchao looked at some of the sanity-hsm errors and said:

      On LDiskFS
      the related HSM archive operations are not started, and it could be caused by the absence of "libtool"

      CMD: onyx-34vm7 libtool --mode=e pkill -x lhsmtool_posix
      onyx-34vm7: sh: libtool: command not found
      CMD: onyx-34vm7 rm -rf /tmp/arc1/sanity-hsm.test_90/
      This prevents the previous copy tool from being killed, which affects the following copy tool.

      As Hongchao points out, we do see the "libtool: command not found" message in the logs for the failed test sessions.
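
To illustrate the failure mode, here is a minimal sketch of a wrapper that uses "libtool execute" only when libtool is present on the node and otherwise runs the command directly. This is not the fix applied for this ticket; run_maybe_libtool and LIBTOOL_BIN are hypothetical names, and whether skipping libtool is acceptable depends on how the test binaries were built and installed. Installing libtool on the Ubuntu clients would be the other obvious option.

```shell
#!/bin/bash
# Hypothetical helper, not part of test-framework.sh.
# LIBTOOL_BIN is an illustrative variable naming the libtool binary to look for.
LIBTOOL_BIN=${LIBTOOL_BIN:-libtool}

# Run a command under "libtool execute" if libtool exists, else run it directly.
run_maybe_libtool() {
    if command -v "$LIBTOOL_BIN" >/dev/null 2>&1; then
        "$LIBTOOL_BIN" execute "$@"
    else
        "$@"
    fi
}

# Demonstrate the fallback path by pointing at a name that should not exist.
LIBTOOL_BIN=no_such_libtool_for_demo
result=$(run_maybe_libtool echo hello)
echo "$result"
```

With a guard like this, a call such as "run_maybe_libtool pkill -x lhsmtool_posix" would still reach pkill on a node without libtool, instead of stopping at "command not found" and leaving the previous copy tool running.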

            People

              mdiep Minh Diep
              jamesanunez James Nunez (Inactive)