[LU-12870] sanity-hsm test 9A fails with “uuid Dumping not found in agent list on mds1” Created: 16/Oct/19  Updated: 22/Oct/20  Resolved: 24/Sep/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.3, Lustre 2.12.4
Fix Version/s: Lustre 2.14.0, Lustre 2.12.6

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Minh Diep
Resolution: Fixed Votes: 0
Labels: ubuntu, ubuntu16, ubuntu18
Environment:

Ubuntu


Issue Links:
Related
is related to LU-12632 sanity-hsm test_90: FAIL: requests di... Resolved
is related to LU-14034 test-framework and sanity-hsm use lib... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-hsm test_9A fails with multiple errors in Ubuntu client testing. In results since 01 April 2019, this test fails 100% of the time on Ubuntu 16.04 and ~73% of the time on Ubuntu 18.04. On master, this test has failed 100% of the time since April. On b2_12, it looks like something bad landed on or before 02 July 2019 with Lustre version 2.12.2.69, because this test has failed 100% of the time from that date until today; the failures start with test session https://testing.whamcloud.com/test_sets/86cbddd4-9e26-11e9-8fc1-52540065bddc.

When sanity-hsm test 9A fails, about 50 or more other sanity-hsm tests also fail and, in every case, a later test eventually times out; see https://testing.whamcloud.com/test_sets/c3676bb2-eb26-11e9-b62b-52540065bddc or https://testing.whamcloud.com/test_sets/a8bded48-5db6-11e9-92fe-52540065bddc.

Looking at the suite_log for https://testing.whamcloud.com/test_sets/c3676bb2-eb26-11e9-b62b-52540065bddc, we see:

trevis-63vm10: trevis-63vm10.trevis.whamcloud.com: executing libtool execute ps -C lhsmtool_posix -o args=
trevis-63vm10: rpc.sh: line 21: libtool: command not found
 sanity-hsm test_9A: @@@@@@ FAIL: Found no Agent or with no mount-point  parameter 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5864:error()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:859:get_agent_uuid()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:1177:test_9A()
  = /usr/lib64/lustre/tests/test-framework.sh:6166:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:6205:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:6051:run_test()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:1188:main()
CMD: trevis-63vm10,trevis-63vm11,trevis-63vm12,trevis-63vm9.trevis.whamcloud.com /usr/sbin/lctl dk > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-ubuntu1804-x86_64--full--1_9__52___8b37fabf-7e63-43bb-bb1c-9ed34b31d532/sanity-hsm.test_9A.debug_log.\$(hostname -s).1570578420.log;
         dmesg > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-ubuntu1804-x86_64--full--1_9__52___8b37fabf-7e63-43bb-bb1c-9ed34b31d532/sanity-hsm.test_9A.dmesg.\$(hostname -s).1570578420.log
CMD: trevis-63vm10,trevis-63vm11,trevis-63vm12,trevis-63vm9.trevis.whamcloud.com lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
CMD: trevis-63vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agents |		 grep Dumping
 sanity-hsm test_9A: @@@@@@ FAIL: uuid Dumping not found in agent list on mds1 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5864:error()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:819:check_agent_registered_by_mdt()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:839:check_agent_registered()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:1178:test_9A()
  = /usr/lib64/lustre/tests/test-framework.sh:6166:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:6205:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:6051:run_test()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:1188:main()
Dumping lctl log to /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-ubuntu1804-x86_64--full--1_9__52___8b37fabf-7e63-43bb-bb1c-9ed34b31d532/sanity-hsm.test_9A.*.1570578423.log
CMD: trevis-63vm10,trevis-63vm11,trevis-63vm12,trevis-63vm9.trevis.whamcloud.com /usr/sbin/lctl dk > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-ubuntu1804-x86_64--full--1_9__52___8b37fabf-7e63-43bb-bb1c-9ed34b31d532/sanity-hsm.test_9A.debug_log.\$(hostname -s).1570578423.log;
         dmesg > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-ubuntu1804-x86_64--full--1_9__52___8b37fabf-7e63-43bb-bb1c-9ed34b31d532/sanity-hsm.test_9A.dmesg.\$(hostname -s).1570578423.log
Resetting fail_loc on all nodes...CMD: trevis-63vm10,trevis-63vm11,trevis-63vm12,trevis-63vm9.trevis.whamcloud.com lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
done.
CMD: trevis-63vm10 libtool execute pkill -x lhsmtool_posix
trevis-63vm10: sh: libtool: command not found
CMD: trevis-63vm10 rm -rf /tmp/arc1/sanity-hsm.test_9A/
FAIL 9A (7s)
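
The second failure searches the agent list for the literal word "Dumping" rather than for a UUID. One plausible reading, not confirmed in this ticket, is that the diagnostic text printed by the failed lookup was captured in place of the UUID. The sketch below only illustrates that shell behaviour; fake_get_agent_uuid and the log path are hypothetical stand-ins, not code from sanity-hsm.sh.

```shell
# Hypothetical stand-in for a lookup that fails and prints a diagnostic
# on stdout instead of a UUID. Not the real get_agent_uuid().
fake_get_agent_uuid() {
	echo "Dumping lctl log to /tmp/example.log"
	return 1
}

# Command substitution captures whatever the function printed,
# whether or not the function succeeded.
uuid=$(fake_get_agent_uuid) || true

# Unquoted expansion splits on whitespace, so the first word is what a
# later "grep $uuid" style check would end up searching for.
set -- $uuid
first_word=$1
echo "searching agent list for: $first_word"
```

If this is what happened, the "uuid Dumping not found" error is a follow-on symptom of the first failure rather than a separate problem.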

In LU-12632, Hongchao looked at some of the sanity-hsm errors and said:

On LDiskFS
the related HSM archive operations are not started, and it could be caused by the absence of "libtool"

CMD: onyx-34vm7 libtool --mode=e pkill -x lhsmtool_posix
onyx-34vm7: sh: libtool: command not found
CMD: onyx-34vm7 rm -rf /tmp/arc1/sanity-hsm.test_90/
This means the previous copy tool can't be killed, which affects the following copy tool.

As Hongchao points out, we do see the 'libtool: command not found' message in the logs for the failed test sessions.
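
A quick way to confirm this on a node is to check whether the command resolves at all. The following is a minimal sketch, assuming a POSIX shell; require_cmd is a hypothetical helper and is not part of test-framework.sh.

```shell
# Hypothetical helper: succeed if the named command is on PATH,
# otherwise print a message to stderr and return 1.
require_cmd() {
	if command -v "$1" >/dev/null 2>&1; then
		return 0
	fi
	echo "$1: command not found on this node" >&2
	return 1
}

# Example: check for libtool before starting or killing a copytool.
require_cmd libtool || echo "libtool is missing; copytool start/stop would fail"
```

Running the same check on each agent node before the suite starts would surface the missing dependency up front instead of through dozens of later test failures.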



 Comments   
Comment by Andreas Dilger [ 16/Oct/19 ]

Is this a matter of adding libtool as a Requires (or Debian equivalent) to the package, and installing this on the test nodes?  To be honest, I'm not thrilled about the requirement for this, maybe it is only for the test packages?

I couldn't see an obvious patch that landed on April 1st that might have caused this, but it should be relatively straightforward to bisect the landings on that day to find why the test is failing 100% of the time.

Comment by Minh Diep [ 16/Oct/19 ]

we need libtool-bin
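
For context: on Debian and Ubuntu the libtool executable is shipped in the libtool-bin package, not in libtool itself. A minimal sketch of that mapping follows; libtool_pkg_for is a hypothetical helper, it only handles the os-release IDs listed, and the default branch is an assumption that other distributions ship the executable in a package named libtool.

```shell
# Hypothetical helper: map an os-release ID to the package that ships
# the libtool executable.
libtool_pkg_for() {
	case "$1" in
	ubuntu|debian)
		echo "libtool-bin"
		;;
	*)
		# Assumption: the executable is in the "libtool" package elsewhere.
		echo "libtool"
		;;
	esac
}

# Usage: look up the ID of the running system when os-release is readable.
if [ -r /etc/os-release ]; then
	os_id=$(. /etc/os-release && echo "${ID:-}")
	echo "package providing libtool here: $(libtool_pkg_for "$os_id")"
fi
```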

Comment by Gerrit Updater [ 17/Oct/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36471
Subject: LU-12870 build: sanity-hsm test depends on libtool
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: da105a8c4581220aaa4792f9ff518c734e74c767

Comment by Gerrit Updater [ 12/Nov/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36471/
Subject: LU-12870 build: sanity-hsm test depends on libtool
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: dbce727a3633ce03d24c28defce9a0ed6d1ef106

Comment by Peter Jones [ 12/Nov/19 ]

Landed for 2.14

Comment by James Nunez (Inactive) [ 16/Jan/20 ]

We're seeing sanity-hsm test 9A fail with the same errors for 2.12.4; https://testing.whamcloud.com/test_sets/4dc3aaa6-3414-11ea-b1e8-52540065bddc

Comment by Gerrit Updater [ 03/Jun/20 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38822
Subject: LU-12870 build: sanity-hsm test depends on libtool
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 7a15115683aea1a59f77b1e21a30b2ab78cfd085

Comment by Peter Jones [ 24/Sep/20 ]

Seems like this fix was landed

Comment by John Hammond [ 14/Oct/20 ]

> Is this a matter of adding libtool as a Requires (or Debian equivalent) to the package, and installing this on the test nodes? To be honest, I'm not thrilled about the requirement for this, maybe it is only for the test packages?

adilger, this is not really needed. We just need to remove the libtool uses altogether. See LU-14034.

Comment by Gerrit Updater [ 22/Oct/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38822/
Subject: LU-12870 build: sanity-hsm test depends on libtool
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: ddbfc253af323f41c9fd2301d4a0167b23252ad6
