[LU-5619] Hard Failover replay-dual test_0b: mount MDS failed Created: 12/Sep/14  Updated: 14/Dec/21  Resolved: 14/Dec/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: zfs
Environment:

server and client: lustre-master build #2642


Severity: 3
Rank (Obsolete): 15719

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/718edf3e-37c4-11e4-a2a6-5254006e85c2.

The sub-test test_0b failed with the following error:

mount1 fais

== replay-dual test 0b: lost client during waiting for next transno == 13:28:17 (1410182897)
CMD: shadow-12vm8 sync; sync; sync
Filesystem           1K-blocks  Used Available Use% Mounted on
shadow-12vm12:shadow-12vm8:/lustre
                      14223104 19712  14189056   1% /mnt/lustre
CMD: shadow-12vm5,shadow-12vm6,shadow-12vm9.shadow.whamcloud.com mcreate /mnt/lustre/fsa-\$(hostname); rm /mnt/lustre/fsa-\$(hostname)
CMD: shadow-12vm5,shadow-12vm6,shadow-12vm9.shadow.whamcloud.com if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-\$(hostname); rm /mnt/lustre2/fsa-\$(hostname); fi
CMD: shadow-12vm8 /usr/sbin/lctl --device lustre-MDT0000 notransno
CMD: shadow-12vm8 /usr/sbin/lctl --device lustre-MDT0000 readonly
CMD: shadow-12vm8 /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
CMD: shadow-12vm8 /usr/sbin/lctl dl
Failing mds1 on shadow-12vm8
+ pm -h powerman --off shadow-12vm8
Command completed successfully
reboot facets: mds1
+ pm -h powerman --on shadow-12vm8
Command completed successfully
Failover mds1 to shadow-12vm12
13:28:33 (1410182913) waiting for shadow-12vm12 network 900 secs ...
13:28:33 (1410182913) network interface is UP
CMD: shadow-12vm12 hostname
mount facets: mds1
CMD: shadow-12vm12 zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
			zpool import -f -o cachefile=none -d /dev/lvm-Role_MDS lustre-mdt1
Starting mds1:   lustre-mdt1/mdt1 /mnt/mds1
CMD: shadow-12vm12 mkdir -p /mnt/mds1; mount -t lustre   		                   lustre-mdt1/mdt1 /mnt/mds1
CMD: shadow-12vm12 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"-1\" \"all -lnet -lnd -pinger\" 4 
CMD: shadow-12vm12 zfs get -H -o value lustre:svname 		                           lustre-mdt1/mdt1 2>/dev/null
Started lustre-MDT0000
Starting client: shadow-12vm9.shadow.whamcloud.com:  -o user_xattr,flock shadow-12vm12:shadow-12vm8:/lustre /mnt/lustre
CMD: shadow-12vm9.shadow.whamcloud.com mkdir -p /mnt/lustre
CMD: shadow-12vm9.shadow.whamcloud.com mount -t lustre -o user_xattr,flock shadow-12vm12:shadow-12vm8:/lustre /mnt/lustre
mount.lustre: mount shadow-12vm12:shadow-12vm8:/lustre at /mnt/lustre failed: Input/output error
Is the MGS running?
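
The client mount is attempted immediately after the failed-over MDT is started, and it fails with EIO ("Is the MGS running?"). For reference, a minimal sketch of waiting for the failover MGS NID to answer before retrying the client mount, reusing the node names, fsname and mount options from the log above (the lctl ping loop, the @tcp network type and the retry budget are illustrative assumptions, not part of the test framework):

# Hypothetical wrapper: poll the failover MGS with lctl ping, then mount the client.
# Node names and options come from the log above; @tcp and the 30 x 5s budget are assumed.
MGSSPEC=shadow-12vm12:shadow-12vm8
MNT=/mnt/lustre
for i in $(seq 1 30); do
    lctl ping shadow-12vm12@tcp >/dev/null 2>&1 && break
    sleep 5
done
mkdir -p ${MNT}
mount -t lustre -o user_xattr,flock ${MGSSPEC}:/lustre ${MNT}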


 Comments   
Comment by Oleg Drokin [ 16/Sep/14 ]

With no console logs attached to the run, it is hard to correlate what happened when.
It may well be that ZFS startup takes longer than usual.
I do see the MDS mount just fine and then enter recovery, but there is no indication of how long that lasted; we also never see recovery complete before the failure is declared.

Is it possible to fetch the console logs?
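
For reference, a minimal way to see how long MDT recovery runs on the failover node and whether it completes is the standard recovery_status parameter; the watch loop below is only an illustration, not something the test suite runs:

# On the failover MDS node (shadow-12vm12 in this run), dump the recovery state
# every 10 seconds until it reports COMPLETE; the loop itself is hypothetical.
while ! lctl get_param -n mdt.lustre-MDT0000.recovery_status | grep -q 'status: COMPLETE'; do
    lctl get_param -n mdt.lustre-MDT0000.recovery_status
    sleep 10
done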

Comment by Sarah Liu [ 22/Nov/14 ]

Here is another instance which has console logs

https://testing.hpdd.intel.com/test_sets/cb1abc08-6a96-11e4-9c96-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 10/Dec/15 ]

master, build #3264, tag 2.7.64
Hard Failover: EL6.7 Server/Client - ZFS
It is blocking a series of tests (0b, 1, 2, 3, 4, 5, 6, 8, 10, 19, 21a).
https://testing.hpdd.intel.com/test_sets/2dfea442-9ebc-11e5-98a4-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ]

Another instance found for hard failover: EL7 Server/Client - ZFS
build #3305
https://testing.hpdd.intel.com/test_sets/02982ada-bbc7-11e5-8506-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 24/Feb/16 ]

Another instance found on b2_8 for failover testing, build #6.
https://testing.hpdd.intel.com/test_sessions/54ec62da-d99d-11e5-9ebe-5254006e85c2
https://testing.hpdd.intel.com/test_sessions/c5a8e44c-d9c7-11e5-85dd-5254006e85c2
