[LU-10708] replay-single test_20b: Restart of mds1 failed! Created: 23/Feb/18  Updated: 20/Nov/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.1, Lustre 2.12.3
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: zfs
Environment:

Hard Failover:
RHEL 7.4 Server/ZFS
RHEL 7.4 Client
2.10.58 master, build 3707


Issue Links:
Related
is related to LU-8104 Failover : replay-single test_70b: Re... Open
is related to LU-11256 replay-vbr test 7f is failing with 'R... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-single test_20b - Restart of mds1 failed!
^^^^^^^^^^^^^ DO NOT REMOVE LINE ABOVE ^^^^^^^^^^^^^

This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/7f078076-15ba-11e8-bd00-52540065bddc

test_20b failed with the following error:

Restart of mds1 failed!

test_logs:

== replay-single test 20b: write, unlink, eviction, replay (test mds_cleanup_orphans) ================ 19:46:25 (1519069585)
CMD: onyx-32vm7 lctl set_param -n os[cd]*.*MDT*.force_sync=1
CMD: onyx-32vm6 lctl set_param -n osd*.*OS*.force_sync=1
/mnt/lustre/f20b.replay-single
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 0
	obdidx		 objid		 objid		 group
	     0	          4770	       0x12a2	             0

CMD: onyx-32vm7 /usr/sbin/lctl set_param -n mdt.lustre-MDT0000.evict_client 425b1455-3c86-3ef4-e5f3-8752f5bdb612
10000+0 records in
10000+0 records out
40960000 bytes (41 MB) copied, 1.08016 s, 37.9 MB/s
CMD: onyx-32vm7 lctl set_param -n osd*.*MDT*.force_sync=1
CMD: onyx-32vm7 /usr/sbin/lctl dl
Failing mds1 on onyx-32vm7
+ pm -h powerman --off onyx-32vm7
Command completed successfully
reboot facets: mds1
+ pm -h powerman --on onyx-32vm7
Command completed successfully
Failover mds1 to onyx-32vm8
19:46:42 (1519069602) waiting for onyx-32vm8 network 900 secs ...
19:46:42 (1519069602) network interface is UP
CMD: onyx-32vm8 hostname
mount facets: mds1
CMD: onyx-32vm8 lsmod | grep zfs >&/dev/null || modprobe zfs;
			zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
			zpool import -f -o cachefile=none -o failmode=panic -d /dev/lvm-Role_MDS lustre-mdt1
onyx-32vm8: cannot import 'lustre-mdt1': no such pool available
 replay-single test_20b: @@@@@@ FAIL: Restart of mds1 failed! 
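
The mount path in the log first probes for the pool with zpool list and only then imports it from the shared device directory; the failure is that the import finds no pool on the failover node. Below is a minimal sketch of that step with a bounded retry and some visibility into what the node can actually see. The pool name, device directory, and import options are taken from the log above; the retry loop, timeout, and messages are illustrative assumptions, not part of the test framework.

#!/bin/bash
# Sketch only: retry the ZFS pool import on the failover node until the
# shared devices appear, rather than failing on the first attempt.
POOL=lustre-mdt1                 # from the log above
DEVDIR=/dev/lvm-Role_MDS         # from the log above
DEADLINE=$((SECONDS + 90))       # assumed grace period, not a framework value

lsmod | grep -q zfs || modprobe zfs

until zpool list -H "$POOL" >/dev/null 2>&1; do
    # Show which pools the node can see right now; this distinguishes
    # "devices not visible yet" from "pool genuinely missing".
    zpool import -d "$DEVDIR"
    if zpool import -f -o cachefile=none -o failmode=panic \
            -d "$DEVDIR" "$POOL"; then
        break
    fi
    if [ "$SECONDS" -ge "$DEADLINE" ]; then
        echo "import of $POOL from $DEVDIR still failing after retries"
        exit 1
    fi
    sleep 5
done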


 Comments   
Comment by James Nunez (Inactive) [ 09/May/18 ]

We see the same problem when remounting the MDS after a failover in recovery-mds-scale test_failover_mds. See https://testing.hpdd.intel.com/test_sets/c57c0bda-527d-11e8-b9d3-52540065bddc for the logs.

From the client test_log

Failing mds1 on trevis-8vm7
+ pm -h powerman --off trevis-8vm7
Command completed successfully
reboot facets: mds1
+ pm -h powerman --on trevis-8vm7
Command completed successfully
Failover mds1 to trevis-8vm8
19:58:09 (1525550289) waiting for trevis-8vm8 network 900 secs ...
19:58:09 (1525550289) network interface is UP
CMD: trevis-8vm8 hostname
mount facets: mds1
CMD: trevis-8vm8 lsmod | grep zfs >&/dev/null || modprobe zfs;
			zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
			zpool import -f -o cachefile=none -d /dev/lvm-Role_MDS lustre-mdt1
trevis-8vm8: cannot import 'lustre-mdt1': no such pool available
 recovery-mds-scale test_failover_mds: @@@@@@ FAIL: Restart of mds1 failed! 
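
When this reproduces, it would help to capture the state of the failover node at the moment the import fails. A short hedged checklist, assuming the MDT backing devices are LVM volumes under /dev/lvm-Role_MDS as in the logs (whether device visibility or LVM activation is the actual cause is not established):

# Run on the failover node right after the failed import (diagnostic sketch).
ls -l /dev/lvm-Role_MDS/           # are the backing devices present at all?
lvs                                # are the logical volumes known and active?
zpool import -d /dev/lvm-Role_MDS  # which pools does ZFS consider importable?
dmesg | tail -n 50                 # recent device or ZFS errors
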
Comment by James Nunez (Inactive) [ 17/Dec/18 ]

A similar failure occurred in recovery-random-scale test_fail_client_mds at https://testing.whamcloud.com/test_sets/e3b58552-fea5-11e8-b837-52540065bddc

CMD: trevis-25vm11 hostname
mount facets: mds1
CMD: trevis-25vm11 lsmod | grep zfs >&/dev/null || modprobe zfs;
			zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
			zpool import -f -o cachefile=none -o failmode=panic -d /dev/lvm-Role_MDS lustre-mdt1
trevis-25vm11: cannot import 'lustre-mdt1': no such pool available
 recovery-random-scale test_fail_client_mds: @@@@@@ FAIL: Restart of mds1 failed! 
Comment by James Nunez (Inactive) [ 29/Apr/19 ]

Another similar failure with replay-single test 3c at https://testing.whamcloud.com/test_sets/4051fd66-682a-11e9-bd0e-52540065bddc .
