[LU-10522] recovery-random-scale test_fail_client_mds: test_fail_client_mds returned 4
Created: 16/Jan/18
Updated: 03/Dec/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.3
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: None
Environment:

Failover
Client/Server: 2.10.3 RC1
b2_10, build 68


Severity: 3

 Description   

recovery-random-scale test_fail_client_mds - test_fail_client_mds returned 4

This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>

This issue relates to the following test suite run:
https://testing.hpdd.intel.com/test_sets/b02af984-f65d-11e7-94c7-52540065bddc

test_fail_client_mds failed with the following error:

test_fail_client_mds returned 4

Test logs:

==== Checking the clients loads BEFORE failover -- failure NOT OK              ELAPSED=3962 DURATION=86400 PERIOD=1200
10:34:00 (1515580440) waiting for onyx-41vm3 network 5 secs ...
10:34:00 (1515580440) network interface is UP
CMD: onyx-41vm3 rc=0;
			val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
			if [[ \$? -eq 0 && \$val -ne 0 ]]; then
				echo \$(hostname -s): \$val;
				rc=\$val;
			fi;
			exit \$rc
CMD: onyx-41vm3 ps auxwww | grep -v grep | grep -q run_dd.sh
Client load failed on node onyx-41vm3, rc=1
2018-01-10 10:34:31 Terminating clients loads ...
Duration:               86400
Server failover period: 1200 seconds
Exited after:           3962 seconds
Number of failovers before exit:
mds1 failed over 4 times
Status: FAIL: rc=4
CMD: onyx-41vm3,onyx-41vm4 test -f /tmp/client-load.pid &&
        { kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
onyx-41vm3: sh: line 1: kill: (8054) - No such process
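
The last log line suggests that /tmp/client-load.pid on onyx-41vm3 still named a process (8054) that had already exited by the time cleanup ran, which is consistent with the earlier "Client load failed" message. A minimal sketch of a cleanup step that tolerates such a stale pid file follows. The helper name `stop_client_load` and its message text are hypothetical, not part of the Lustre test framework; only the pid file path is taken from the log.

```shell
# Hypothetical helper: stop the process named in a pid file, tolerating
# a stale file whose process has already exited.
stop_client_load() {
    pidfile=$1

    # Nothing to do if no pid file was written.
    [ -f "$pidfile" ] || return 0

    pid=$(cat "$pidfile")

    # Signal 0 delivers nothing; it only tests whether the process exists.
    if kill -0 "$pid" 2>/dev/null; then
        kill -s TERM "$pid"
    else
        echo "stale pid file: process $pid is not running"
    fi

    # Remove the pid file in both cases so later runs start clean.
    rm -f "$pidfile"
}
```

Calling `stop_client_load /tmp/client-load.pid` on each client would report the stale pid explicitly instead of surfacing a raw `kill` error.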

run_tar_debug.onyx-41vm4.log

tar: etc/ssl: Cannot stat: No such file or directory
tar: etc/systemd/system/getty.target.wants: Cannot stat: No such file or directory
tar: etc/systemd/system/sockets.target.wants: Cannot stat: No such file or directory
tar: etc/systemd/system/multi-user.target.wants: Cannot stat: No such file or directory
tar: etc/systemd/system/sysinit.target.wants: Cannot stat: No such file or directory
tar: etc/systemd/system/dev-virtio\\x2dports-org.qemu.guest_agent.0.device.wants: Cannot stat: No such file or directory
tar: etc/systemd/system/remote-fs.target.wants: Cannot stat: No such file or directory
tar: etc/systemd/system/basic.target.wants: Cannot stat: No such file or directory
tar: etc/systemd/system/default.target.wants: Cannot stat: No such file or directory
tar: etc/systemd/system: Cannot stat: No such file or directory
tar: etc/systemd: Cannot stat: No such file or directory
tar: etc/rc.d/rc1.d: Cannot stat: No such file or directory
tar: etc/rc.d/rc3.d: Cannot stat: No such file or directory
tar: etc/rc.d/rc2.d: Cannot stat: No such file or directory
tar: etc/rc.d/rc4.d: Cannot stat: No such file or directory
tar: etc/rc.d/rc0.d: Cannot stat: No such file or directory
tar: etc/rc.d/rc5.d: Cannot stat: No such file or directory
tar: etc/rc.d/rc6.d: Cannot stat: No such file or directory
tar: etc/rc.d: Cannot stat: No such file or directory
tar: etc/alternatives: Cannot stat: No such file or directory
tar: Exiting with failure status due to previous errors


 Comments   
Comment by Alena Nikitenko [ 03/Dec/21 ]

Found a similar issue with the recovery-random-scale test set on 2.12.8: https://testing.whamcloud.com/test_sets/e735d4c7-0211-48dc-82e6-a6ba45ceb281

The return code differs because it is a different test:

...
Starting client: onyx-112vm10:  -o user_xattr,flock onyx-70vm3:onyx-70vm4:/lustre /mnt/lustre
CMD: onyx-112vm10 mkdir -p /mnt/lustre
CMD: onyx-112vm10 mount -t lustre -o user_xattr,flock onyx-70vm3:onyx-70vm4:/lustre /mnt/lustre
onyx-112vm10: mount.lustre: according to /etc/mtab onyx-70vm3:onyx-70vm4:/lustre is already mounted on /mnt/lustre
2021-11-20 21:05:50 Terminating clients loads ...
Duration:               86400
Server failover period: 1200 seconds
Exited after:           65095 seconds
Number of failovers before exit:
mds1 failed over 55 times
Status: FAIL: rc=1
CMD: onyx-112vm10,onyx-112vm9 test -f /tmp/client-load.pid &&
        { kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; } 
...
tar: etc/pki/tls/certs: Cannot stat: No such file or directory
tar: etc/pki/tls: Cannot stat: No such file or directory
tar: etc/pki/java: Cannot stat: No such file or directory
tar: etc/pki/ca-trust/source: Cannot stat: No such file or directory
tar: etc/pki/ca-trust: Cannot stat: No such file or directory
tar: etc/pki: Cannot stat: No such file or directory
tar: etc/ssl: Cannot stat: No such file or directory
tar: etc/pam.d: Cannot stat: No such file or directory
tar: etc/rc.d/rc0.d: Cannot stat: No such file or directory
tar: etc/rc.d/rc6.d: Cannot stat: No such file or directory
tar: etc/rc.d/rc1.d: Cannot stat: No such file or directory
tar: etc/rc.d/rc4.d: Cannot stat: No such file or directory
tar: etc/rc.d/rc5.d: Cannot stat: No such file or directory
tar: etc/rc.d/rc3.d: Cannot stat: No such file or directory
tar: etc/rc.d/rc2.d: Cannot stat: No such file or directory
tar: etc/rc.d: Cannot stat: No such file or directory
tar: etc/sysconfig/network-scripts: Cannot stat: No such file or directory
tar: etc/sysconfig: Cannot stat: No such file or directory
tar: etc/profile.d: Cannot stat: No such file or directory
tar: etc/sysctl.d: Cannot stat: No such file or directory
tar: Exiting with failure status due to previous errors 