
recovery-double-scale test_pairwise_fail: FAIL: Restart of ost2 failed!

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
    • Affects Version/s: Lustre 2.4.0, Lustre 2.4.1

    • Environment:
      FSTYPE=zfs
      FAILURE_MODE=HARD
      TEST_GROUP=failover
    • Severity: 3

    Description

      While running the recovery-double-scale test with FSTYPE=zfs and FAILURE_MODE=HARD to verify patch http://review.whamcloud.com/6258, the test failed as follows:

      ==== START === test 1: failover MDS, then OST ==========
      ==== Checking the clients loads BEFORE failover -- failure NOT OK
      <snip>
      Done checking client loads. Failing type1=MDS item1=mds1 ... 
      CMD: wtm-82 /usr/sbin/lctl dl
      Failing mds1 on wtm-82
      CMD: wtm-82 zpool set cachefile=none lustre-mdt1; sync
      + pm -h powerman --reset wtm-82
      Command completed successfully
      reboot facets: mds1
      + pm -h powerman --on wtm-82
      Command completed successfully
      Failover mds1 to wtm-83
      21:37:40 (1367901460) waiting for wtm-83 network 900 secs ...
      21:37:40 (1367901460) network interface is UP
      CMD: wtm-83 hostname
      mount facets: mds1
      CMD: wtm-83 zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
      			zpool import -f -o cachefile=none lustre-mdt1
      Starting mds1:   lustre-mdt1/mdt1 /mnt/mds1
      CMD: wtm-83 mkdir -p /mnt/mds1; mount -t lustre   		                   lustre-mdt1/mdt1 /mnt/mds1
      CMD: wtm-83 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin::/sbin:/bin:/usr/sbin: NAME=ncli sh rpc.sh set_default_debug \"-1\" \"all -lnet -lnd -pinger\" 256 
      CMD: wtm-83 zfs get -H -o value lustre:svname 		                           lustre-mdt1/mdt1 2>/dev/null
      Started lustre-MDT0000
                                  Failing type2=OST item2=ost4 ... 
      CMD: wtm-85 /usr/sbin/lctl dl
      CMD: wtm-85 /usr/sbin/lctl dl
      CMD: wtm-85 /usr/sbin/lctl dl
      CMD: wtm-85 zpool set cachefile=none lustre-ost4; sync
      CMD: wtm-85 zpool set cachefile=none lustre-ost6; sync
      Failing ost2,ost4,ost6 on wtm-85
      CMD: wtm-85 zpool set cachefile=none lustre-ost2; sync
      + pm -h powerman --reset wtm-85
      Command completed successfully
      reboot facets: ost2,ost4,ost6
      + pm -h powerman --on wtm-85
      Command completed successfully
      Failover ost2 to wtm-84
      Failover ost4 to wtm-84
      Failover ost6 to wtm-84
      21:38:19 (1367901499) waiting for wtm-84 network 900 secs ...
      21:38:19 (1367901499) network interface is UP
      CMD: wtm-84 hostname
      mount facets: ost2,ost4,ost6
      CMD: wtm-84 zpool list -H lustre-ost2 >/dev/null 2>&1 ||
      			zpool import -f -o cachefile=none lustre-ost2
      Starting ost2:   lustre-ost2/ost2 /mnt/ost2
      CMD: wtm-84 mkdir -p /mnt/ost2; mount -t lustre   		                   lustre-ost2/ost2 /mnt/ost2
      wtm-84: mount.lustre: mount lustre-ost2/ost2 at /mnt/ost2 failed: Input/output error
      wtm-84: Is the MGS running?
      Start of lustre-ost2/ost2 on ost2 failed 5
       recovery-double-scale test_pairwise_fail: @@@@@@ FAIL: Restart of ost2 failed! 
      

      Dmesg on OSS wtm-84 showed:

      LustreError: 9681:0:(obd_mount_server.c:1123:server_register_target()) lustre-OST0001: error registering with the MGS: rc = -5 (not fatal)
      LustreError: 6180:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff88062faed400 x1434348208262360/t0(0) o101->MGC10.10.18.253@tcp@10.10.18.253@tcp:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      LustreError: 6180:0:(client.c:1052:ptlrpc_import_delay_req()) Skipped 1 previous similar message
      LustreError: 15c-8: MGC10.10.18.253@tcp: The configuration from log 'lustre-OST0001' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      LustreError: 9681:0:(obd_mount_server.c:1257:server_start_targets()) failed to start server lustre-OST0001: -5
      LustreError: 9681:0:(obd_mount_server.c:1699:server_fill_super()) Unable to start targets: -5
      LustreError: 9681:0:(obd_mount_server.c:844:lustre_disconnect_lwp()) lustre-MDT0000-lwp-OST0001: Can't end config log lustre-client.
      LustreError: 9681:0:(obd_mount_server.c:1426:server_put_super()) lustre-OST0001: failed to disconnect lwp. (rc=-2)
      LustreError: 9681:0:(obd_mount_server.c:1456:server_put_super()) no obd lustre-OST0001
      Lustre: server umount lustre-OST0001 complete
      LustreError: 9681:0:(obd_mount.c:1267:lustre_fill_super()) Unable to mount  (-5)
      Lustre: DEBUG MARKER: /usr/sbin/lctl mark  recovery-double-scale test_pairwise_fail: @@@@@@ FAIL: Restart of ost2 failed!
      

      Dmesg on MDS wtm-83 showed:

      Lustre: DEBUG MARKER: Failing type2=OST item2=ost4 ...
      Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 4 clients reconnect
      Lustre: lustre-MDT0000: Recovery over after 0:08, of 4 clients 4 recovered and 0 were evicted.
      Lustre: 5225:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1367901499/real 1367901499]  req@ffff880c17898400 x1434348659147084/t0(0) o400->lustre-OST0001-osc-MDT0000@10.10.19.26@tcp:28/4 lens 224/224 e 0 to 1 dl 1367901543 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: lustre-OST0003-osc-MDT0000: Connection to lustre-OST0003 (at 10.10.19.26@tcp) was lost; in progress operations using this service will wait for recovery to complete
      Lustre: 5225:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      Lustre: lustre-OST0005-osc-MDT0000: Connection to lustre-OST0005 (at 10.10.19.26@tcp) was lost; in progress operations using this service will wait for recovery to complete
      Lustre: DEBUG MARKER: /usr/sbin/lctl mark  recovery-double-scale test_pairwise_fail: @@@@@@ FAIL: Restart of ost2 failed! 
      Lustre: DEBUG MARKER: recovery-double-scale test_pairwise_fail: @@@@@@ FAIL: Restart of ost2 failed!
      

      Maloo report:
      https://maloo.whamcloud.com/test_sets/ebe1f318-b6e0-11e2-b6f1-52540035b04c


        Activity

          pjones Peter Jones added a comment -

          Thanks Lai. That is probably too big a change to include in a maintenance release, so let's close this as fixed in 2.6.

          laisiyao Lai Siyao added a comment -

          Yujian, this patch depends on http://review.whamcloud.com/#/c/5049/19, which has not been backported to 2.4 yet. Should I backport both?

          yujian Jian Yu added a comment -

          The patch landed on the master branch for Lustre 2.6.0.

          Hi Lai,
          Could you please back-port the patch to the Lustre b2_4 branch? Thanks.

          laisiyao Lai Siyao added a comment -

          Patch is on http://review.whamcloud.com/#/c/8286/
          laisiyao Lai Siyao added a comment -

          Mike is on the watching list.

          server_start_targets() calls server_mgc_set_fs() only when lsi->lsi_srv_mnt is not NULL, because server_mgc_set_fs() takes a superblock argument. server_mgc_set_fs() then calls mgc_fs_setup() to set up the local configs directory.

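          The control flow Lai describes above can be sketched as follows. This is a minimal, self-contained C sketch of the gating on lsi_srv_mnt, not the actual Lustre source: the struct layout, return values and printf calls are simplified stand-ins, and only the names server_start_targets(), server_mgc_set_fs(), mgc_fs_setup() and lsi_srv_mnt come from the comment itself.

          /* Simplified stand-ins -- not the real Lustre definitions. */
          #include <stdio.h>

          struct vfsmount;                        /* opaque; only ever used as a pointer */

          struct lustre_sb_info {
                  struct vfsmount *lsi_srv_mnt;   /* NULL for zfs-osd: no local kernel mount */
          };

          /* stand-in for mgc_fs_setup(): prepares the local "configs" dir that
           * the MGC uses to keep local copies of configuration llogs */
          static int mgc_fs_setup(struct lustre_sb_info *lsi)
          {
                  (void)lsi;
                  printf("mgc_fs_setup: local configs dir ready\n");
                  return 0;
          }

          /* stand-in for server_mgc_set_fs(): needs a superblock/mount to work on */
          static int server_mgc_set_fs(struct lustre_sb_info *lsi)
          {
                  return mgc_fs_setup(lsi);
          }

          /* stand-in for the relevant part of server_start_targets() */
          static int server_start_targets(struct lustre_sb_info *lsi)
          {
                  if (lsi->lsi_srv_mnt != NULL)
                          /* ldiskfs: a real vfsmount exists, so local llog copies work */
                          return server_mgc_set_fs(lsi);

                  /* zfs-osd: lsi_srv_mnt is NULL, server_mgc_set_fs() is skipped and
                   * no local copy of the config llog is made; if the MGS cannot be
                   * reached at mount time, the target start fails (the -5/EIO above) */
                  printf("no vfsmount: skipping server_mgc_set_fs()\n");
                  return 0;
          }

          int main(void)
          {
                  struct lustre_sb_info zfs_lsi = { .lsi_srv_mnt = NULL };

                  return server_start_targets(&zfs_lsi);
          }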

          bzzz Alex Zhuravlev added a comment -

          Sorry, can you explain in more detail? vfsmount is no longer a valid notion in the server code (except in osd-ldiskfs/). Have you contacted Mike?
          laisiyao Lai Siyao added a comment -

          The patch is http://review.whamcloud.com/#/c/5049/19, but it doesn't solve this issue: for zfs-osd the vfsmount object is NULL, so server_mgc_set_fs() is not called and the llog local copy cannot be made.


          bzzz Alex Zhuravlev added a comment -

          There is a patch to support local copies of llogs using the OSD API. I can't find it; please talk to Mike.
          laisiyao Lai Siyao added a comment -

          As Alex pointed out in LU-2059, lsi_srv_mnt is NULL for the zfs osd, so the llog local copy is not supported.

          osd_conf_get() suggests introducing a new fs abstraction layer instead of reading from the vfsmount structure directly. This looks reasonable because zfs-osd doesn't do a full mount but uses the DMU interface directly, which means zfs-osd has no vfsmount or superblock object.

          But this looks like a big project; I need to understand more of the zfs-related code to continue.

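          For illustration only, the abstraction layer Lai mentions above could look something like the sketch below. The names local_fs_ops, open_configs_dir and setup_local_llog_copy are hypothetical and do not exist in Lustre; only the underlying idea, stop dereferencing a vfsmount directly and let zfs-osd provide the same service through the OSD/DMU API, comes from the comment.

          /* Hypothetical interface, for illustration only. */
          #include <stdio.h>

          struct local_fs_ops {
                  /* open/create the backend-local "configs" area for llog copies */
                  int (*open_configs_dir)(void *backend_private);
          };

          /* an ldiskfs-style backend could use its vfsmount/superblock internally */
          static int ldiskfs_open_configs_dir(void *priv)
          {
                  (void)priv;
                  printf("ldiskfs backend: using the local mount for configs\n");
                  return 0;
          }

          /* a zfs-style backend would go through the OSD/DMU API instead */
          static int zfs_open_configs_dir(void *priv)
          {
                  (void)priv;
                  printf("zfs backend: using OSD/DMU objects for configs\n");
                  return 0;
          }

          /* the caller (e.g. the MGC) no longer cares whether a vfsmount exists */
          static int setup_local_llog_copy(const struct local_fs_ops *ops, void *priv)
          {
                  return ops->open_configs_dir(priv);
          }

          int main(void)
          {
                  const struct local_fs_ops ldiskfs_ops = { ldiskfs_open_configs_dir };
                  const struct local_fs_ops zfs_ops     = { zfs_open_configs_dir };

                  setup_local_llog_copy(&ldiskfs_ops, NULL);
                  return setup_local_llog_copy(&zfs_ops, NULL);
          }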
          yujian Jian Yu added a comment -

          Lustre Tag: v2_4_1_RC1
          Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/44/
          Distro/Arch: RHEL6.4/x86_64
          Testgroup: failover
          FSTYPE=zfs

          recovery-double-scale hit the same failure:
          https://maloo.whamcloud.com/test_sets/2864e15c-1757-11e3-aa87-52540035b04c


          People

            Assignee: Lai Siyao (laisiyao)
            Reporter: Jian Yu (yujian)
            Votes: 0
            Watchers: 9
