[LU-7767] Failover: replay-dual test_3: test_3 returned 1 Created: 09/Feb/16  Updated: 27/Feb/17  Resolved: 27/Feb/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

EL6.7 Server/Client - ZFS
master, build# 3314


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/7d01907a-cb55-11e5-b49e-5254006e85c2.

The sub-test test_3 failed with the following error:

test_3 returned 1

test log:

== replay-dual test 3: |X| mkdir adir, mkdir adir/bdir == 07:34:39 (1454571279)
CMD: shadow-45vm7 sync; sync; sync
Filesystem           1K-blocks  Used Available Use% Mounted on
shadow-45vm7:shadow-45vm3:/lustre
                      14220416 16128  14189952   1% /mnt/lustre
CMD: shadow-45vm1.shadow.whamcloud.com,shadow-45vm5,shadow-45vm6 mcreate /mnt/lustre/fsa-\$(hostname); rm /mnt/lustre/fsa-\$(hostname)
CMD: shadow-45vm1.shadow.whamcloud.com,shadow-45vm5,shadow-45vm6 if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-\$(hostname); rm /mnt/lustre2/fsa-\$(hostname); fi
CMD: shadow-45vm7 /usr/sbin/lctl --device lustre-MDT0000 notransno
CMD: shadow-45vm7 /usr/sbin/lctl --device lustre-MDT0000 readonly
CMD: shadow-45vm7 /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
CMD: shadow-45vm7 /usr/sbin/lctl dl
Failing mds1 on shadow-45vm7
+ pm -h powerman --off shadow-45vm7
Command completed successfully
reboot facets: mds1
+ pm -h powerman --on shadow-45vm7
Command completed successfully
Failover mds1 to shadow-45vm3
07:34:56 (1454571296) waiting for shadow-45vm3 network 900 secs ...
07:34:56 (1454571296) network interface is UP
CMD: shadow-45vm3 hostname
pdsh@shadow-45vm1: shadow-45vm3: mcmd: connect failed: Connection refused
CMD: shadow-45vm3 hostname
mount facets: mds1
CMD: shadow-45vm3 zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
			zpool import -f -o cachefile=none -d /dev/lvm-Role_MDS lustre-mdt1
Starting mds1:   lustre-mdt1/mdt1 /mnt/mds1
CMD: shadow-45vm3 mkdir -p /mnt/mds1; mount -t lustre   		                   lustre-mdt1/mdt1 /mnt/mds1
CMD: shadow-45vm3 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/openmpi/bin:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"-1\" \"all -lnet -lnd -pinger\" 4 
CMD: shadow-45vm3 zfs get -H -o value 				lustre:svname lustre-mdt1/mdt1 2>/dev/null | 				grep -E ':[a-zA-Z]{3}[0-9]{4}'
CMD: shadow-45vm3 zfs get -H -o value 				lustre:svname lustre-mdt1/mdt1 2>/dev/null | 				grep -E ':[a-zA-Z]{3}[0-9]{4}'
CMD: shadow-45vm3 zfs get -H -o value lustre:svname 		                           lustre-mdt1/mdt1 2>/dev/null
Started lustre-MDT0000
CMD: shadow-45vm1.shadow.whamcloud.com,shadow-45vm5,shadow-45vm6 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/openmpi/bin:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid 
shadow-45vm1: CMD: shadow-45vm1.shadow.whamcloud.com lctl get_param -n at_max
shadow-45vm5: CMD: shadow-45vm5.shadow.whamcloud.com lctl get_param -n at_max
shadow-45vm6: CMD: shadow-45vm6.shadow.whamcloud.com lctl get_param -n at_max
shadow-45vm1: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 17 sec
shadow-45vm5: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 17 sec
shadow-45vm6: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 17 sec
Resetting fail_loc on all nodes...CMD: shadow-45vm1.shadow.whamcloud.com,shadow-45vm3,shadow-45vm5,shadow-45vm6,shadow-45vm8 lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null || true
pdsh@shadow-45vm1: shadow-45vm8: mcmd: xpoll (setting up stderr): Interrupted system call
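For reference, the log above follows the usual replay-barrier/failover pattern from the Lustre test framework (replay_barrier issues the notransno/readonly/"REPLAY BARRIER" steps, fail power-cycles the facet and remounts it on the failover node). A minimal sketch of what replay-dual test_3 does, assuming the standard test-framework.sh helpers; the directory names are taken from the test banner and the return codes are illustrative, not the verbatim replay-dual.sh source:

    # Hedged sketch of the test, not the exact upstream code.
    test_3() {
        replay_barrier $SINGLEMDS            # stop committing transnos on MDT0000
        mkdir $MOUNT1/adir || return 1       # uncommitted mkdir from client 1
        mkdir $MOUNT2/adir/bdir || return 2  # uncommitted mkdir from client 2
        fail $SINGLEMDS                      # power-cycle mds1, fail over (vm7 -> vm3 above), wait for recovery
        checkstat -t dir $MOUNT2/adir || return 3       # both mkdirs must survive replay
        checkstat -t dir $MOUNT1/adir/bdir || return 4
        rm -rf $MOUNT1/adir
    }
    run_test 3 "|X| mkdir adir, mkdir adir/bdir"

In that framing, "test_3 returned 1" just means the test function exited with status 1; the log itself ends at the fail_loc reset with the pdsh "mcmd: xpoll" error rather than at an explicit test assertion, which is consistent with the environment-problem assessment in the comments below.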


 Comments   
Comment by James Nunez (Inactive) [ 11/Feb/16 ]

I'm seeing similar 'mcmd: xpoll' errors in sanity-scrub tests 1c and 11:

pdsh@shadow-45vm1: shadow-45vm8: mcmd: xpoll (setting up stderr): Interrupted system call
 sanity-scrub test_1c: @@@@@@ FAIL: server shadow-45vm8 environments are insane! 

and in sanity-lfsck test_19a:

cat: /mnt/lustre/d19a.sanity-lfsck/a0: Input/output error
fail_loc=0
Resetting fail_loc on all nodes...CMD: shadow-45vm1.shadow.whamcloud.com,shadow-45vm2,shadow-45vm3,shadow-45vm7,shadow-45vm8 lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null || true
pdsh@shadow-45vm1: shadow-45vm8: mcmd: xpoll (setting up stderr): Interrupted system call

Logs are at https://testing.hpdd.intel.com/test_sessions/ddd50a9c-d002-11e5-be99-5254006e85c2

Comment by Andreas Dilger [ 27/Feb/17 ]

This looks like it was an environment problem.
