Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.13.0, Lustre 2.12.3, Lustre 2.12.4
-
Ubuntu
-
3
Description
sanity-hsm test_9A fails with multiple errors in Ubuntu client testing. Looking at results starting 01 April 2019, this test fails 100% of the time for Ubuntu 16.04 and ~73% of the time for Ubuntu 18.04. For master, this test has failed 100% of the time since April. For b2_12, it looks like something bad landed on or before 02 July 2019 with Lustre version 2.12.2.69, because this test has failed 100% of the time from that date until today; the failures start with test session https://testing.whamcloud.com/test_sets/86cbddd4-9e26-11e9-8fc1-52540065bddc.
When sanity-hsm test_9A fails, about 50 or more other sanity-hsm tests also fail and, in all cases, a later test eventually times out; see https://testing.whamcloud.com/test_sets/c3676bb2-eb26-11e9-b62b-52540065bddc or https://testing.whamcloud.com/test_sets/a8bded48-5db6-11e9-92fe-52540065bddc.
Looking at the suite_log for https://testing.whamcloud.com/test_sets/c3676bb2-eb26-11e9-b62b-52540065bddc, we see:
trevis-63vm10: trevis-63vm10.trevis.whamcloud.com: executing libtool execute ps -C lhsmtool_posix -o args=
trevis-63vm10: rpc.sh: line 21: libtool: command not found
 sanity-hsm test_9A: @@@@@@ FAIL: Found no Agent or with no mount-point parameter
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5864:error()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:859:get_agent_uuid()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:1177:test_9A()
  = /usr/lib64/lustre/tests/test-framework.sh:6166:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:6205:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:6051:run_test()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:1188:main()
CMD: trevis-63vm10,trevis-63vm11,trevis-63vm12,trevis-63vm9.trevis.whamcloud.com /usr/sbin/lctl dk > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-ubuntu1804-x86_64--full--1_9__52___8b37fabf-7e63-43bb-bb1c-9ed34b31d532/sanity-hsm.test_9A.debug_log.\$(hostname -s).1570578420.log;
         dmesg > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-ubuntu1804-x86_64--full--1_9__52___8b37fabf-7e63-43bb-bb1c-9ed34b31d532/sanity-hsm.test_9A.dmesg.\$(hostname -s).1570578420.log
CMD: trevis-63vm10,trevis-63vm11,trevis-63vm12,trevis-63vm9.trevis.whamcloud.com lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
CMD: trevis-63vm12 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agents | grep Dumping
 sanity-hsm test_9A: @@@@@@ FAIL: uuid Dumping not found in agent list on mds1
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5864:error()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:819:check_agent_registered_by_mdt()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:839:check_agent_registered()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:1178:test_9A()
  = /usr/lib64/lustre/tests/test-framework.sh:6166:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:6205:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:6051:run_test()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:1188:main()
Dumping lctl log to /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-ubuntu1804-x86_64--full--1_9__52___8b37fabf-7e63-43bb-bb1c-9ed34b31d532/sanity-hsm.test_9A.*.1570578423.log
CMD: trevis-63vm10,trevis-63vm11,trevis-63vm12,trevis-63vm9.trevis.whamcloud.com /usr/sbin/lctl dk > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-ubuntu1804-x86_64--full--1_9__52___8b37fabf-7e63-43bb-bb1c-9ed34b31d532/sanity-hsm.test_9A.debug_log.\$(hostname -s).1570578423.log;
         dmesg > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-ubuntu1804-x86_64--full--1_9__52___8b37fabf-7e63-43bb-bb1c-9ed34b31d532/sanity-hsm.test_9A.dmesg.\$(hostname -s).1570578423.log
Resetting fail_loc on all nodes...CMD: trevis-63vm10,trevis-63vm11,trevis-63vm12,trevis-63vm9.trevis.whamcloud.com lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
done.
CMD: trevis-63vm10 libtool execute pkill -x lhsmtool_posix
trevis-63vm10: sh: libtool: command not found
CMD: trevis-63vm10 rm -rf /tmp/arc1/sanity-hsm.test_9A/
FAIL 9A (7s)
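A quick way to confirm which commands a suite_log reports as missing is to count the "command not found" messages per command. The following is only a rough sketch: the function name count_missing_commands is hypothetical, and it assumes the suite_log has been saved to a local file.

```shell
#!/bin/sh
# Hypothetical helper, not part of the Lustre test framework.
# Usage: count_missing_commands LOGFILE
# Prints "<count> <command>" for every command the shell reported as missing.
count_missing_commands() {
    # Extract "<name>: command not found", keep only <name>, then tally.
    grep -o '[A-Za-z0-9_.-]*: command not found' "$1" |
        sed 's/: command not found$//' |
        sort | uniq -c | sort -rn
}
```

Run against the log above, this should list libtool as the only missing command, once for the ps call and once for the pkill call.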
In LU-12632, Hongchao looked at some of the sanity-hsm errors and said:
On LDiskFS
the related HSM archive operations are not started, and it could be caused by the absence of "libtool"
CMD: onyx-34vm7 libtool --mode=e pkill -x lhsmtool_posix
onyx-34vm7: sh: libtool: command not found
CMD: onyx-34vm7 rm -rf /tmp/arc1/sanity-hsm.test_90/
it causes the previous copytool to not be killed, which affects the following copytool.
As Hongchao points out, we do see the 'libtool: command not found' message in the logs for the failed test sessions.
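One possible way to make such calls tolerate a missing libtool is to use the wrapper only when it is installed and to run the command directly otherwise. This is a minimal sketch under stated assumptions, not necessarily how this issue was resolved: run_maybe_libtool and the LIBTOOL override variable are hypothetical names that do not come from test-framework.sh.

```shell
#!/bin/sh
# Hypothetical helper, not taken from test-framework.sh.
# Runs "$@" through "libtool execute" when libtool is on PATH,
# and runs "$@" directly when it is not.
# LIBTOOL is an assumed override for the wrapper name (default: libtool).
run_maybe_libtool() {
    if command -v "${LIBTOOL:-libtool}" >/dev/null 2>&1; then
        "${LIBTOOL:-libtool}" execute "$@"
    else
        "$@"
    fi
}
```

With this in place, a call such as run_maybe_libtool pkill -x lhsmtool_posix would still reach pkill on a client without libtool instead of failing with "command not found".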