Details
Type: Bug
Resolution: Fixed
Priority: Critical
Affects Version/s: Lustre 2.4.0, Lustre 2.4.1
Environment: FSTYPE=zfs, FAILURE_MODE=HARD, TEST_GROUP=failover
Severity: 3
Rank: 8129
Description
While running the recovery-double-scale test with FSTYPE=zfs and FAILURE_MODE=HARD to verify patch http://review.whamcloud.com/6258, the test failed as follows:
==== START === test 1: failover MDS, then OST ==========
==== Checking the clients loads BEFORE failover -- failure NOT OK
<snip>
Done checking client loads. Failing type1=MDS item1=mds1 ...
CMD: wtm-82 /usr/sbin/lctl dl
Failing mds1 on wtm-82
CMD: wtm-82 zpool set cachefile=none lustre-mdt1; sync
+ pm -h powerman --reset wtm-82
Command completed successfully
reboot facets: mds1
+ pm -h powerman --on wtm-82
Command completed successfully
Failover mds1 to wtm-83
21:37:40 (1367901460) waiting for wtm-83 network 900 secs ...
21:37:40 (1367901460) network interface is UP
CMD: wtm-83 hostname
mount facets: mds1
CMD: wtm-83 zpool list -H lustre-mdt1 >/dev/null 2>&1 || zpool import -f -o cachefile=none lustre-mdt1
Starting mds1: lustre-mdt1/mdt1 /mnt/mds1
CMD: wtm-83 mkdir -p /mnt/mds1; mount -t lustre lustre-mdt1/mdt1 /mnt/mds1
CMD: wtm-83 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin::/sbin:/bin:/usr/sbin: NAME=ncli sh rpc.sh set_default_debug \"-1\" \"all -lnet -lnd -pinger\" 256
CMD: wtm-83 zfs get -H -o value lustre:svname lustre-mdt1/mdt1 2>/dev/null
Started lustre-MDT0000
Failing type2=OST item2=ost4 ...
CMD: wtm-85 /usr/sbin/lctl dl
CMD: wtm-85 /usr/sbin/lctl dl
CMD: wtm-85 /usr/sbin/lctl dl
CMD: wtm-85 zpool set cachefile=none lustre-ost4; sync
CMD: wtm-85 zpool set cachefile=none lustre-ost6; sync
Failing ost2,ost4,ost6 on wtm-85
CMD: wtm-85 zpool set cachefile=none lustre-ost2; sync
+ pm -h powerman --reset wtm-85
Command completed successfully
reboot facets: ost2,ost4,ost6
+ pm -h powerman --on wtm-85
Command completed successfully
Failover ost2 to wtm-84
Failover ost4 to wtm-84
Failover ost6 to wtm-84
21:38:19 (1367901499) waiting for wtm-84 network 900 secs ...
21:38:19 (1367901499) network interface is UP
CMD: wtm-84 hostname
mount facets: ost2,ost4,ost6
CMD: wtm-84 zpool list -H lustre-ost2 >/dev/null 2>&1 || zpool import -f -o cachefile=none lustre-ost2
Starting ost2: lustre-ost2/ost2 /mnt/ost2
CMD: wtm-84 mkdir -p /mnt/ost2; mount -t lustre lustre-ost2/ost2 /mnt/ost2
wtm-84: mount.lustre: mount lustre-ost2/ost2 at /mnt/ost2 failed: Input/output error
wtm-84: Is the MGS running?
Start of lustre-ost2/ost2 on ost2 failed 5
recovery-double-scale test_pairwise_fail: @@@@@@ FAIL: Restart of ost2 failed!
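For reference, a run like the one above is typically launched from the Lustre test framework roughly as follows. This is a minimal sketch under the environment shown in this ticket; the working directory and invocation details are assumptions, not taken from this report:

# Sketch only: FSTYPE/FAILURE_MODE match this report, everything else is assumed.
export FSTYPE=zfs              # back the MDTs/OSTs with ZFS, as in this failure
export FAILURE_MODE=HARD       # power-cycle the failed node (via powerman) instead of a soft failover
cd /usr/lib64/lustre/tests     # test scripts path seen in the log above
sh recovery-double-scale.sh    # runs test_pairwise_fail: fail MDS, then OST, etc.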
Dmesg on OSS wtm-84 showed:
LustreError: 9681:0:(obd_mount_server.c:1123:server_register_target()) lustre-OST0001: error registering with the MGS: rc = -5 (not fatal)
LustreError: 6180:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff88062faed400 x1434348208262360/t0(0) o101->MGC10.10.18.253@tcp@10.10.18.253@tcp:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LustreError: 6180:0:(client.c:1052:ptlrpc_import_delay_req()) Skipped 1 previous similar message
LustreError: 15c-8: MGC10.10.18.253@tcp: The configuration from log 'lustre-OST0001' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 9681:0:(obd_mount_server.c:1257:server_start_targets()) failed to start server lustre-OST0001: -5
LustreError: 9681:0:(obd_mount_server.c:1699:server_fill_super()) Unable to start targets: -5
LustreError: 9681:0:(obd_mount_server.c:844:lustre_disconnect_lwp()) lustre-MDT0000-lwp-OST0001: Can't end config log lustre-client.
LustreError: 9681:0:(obd_mount_server.c:1426:server_put_super()) lustre-OST0001: failed to disconnect lwp. (rc=-2)
LustreError: 9681:0:(obd_mount_server.c:1456:server_put_super()) no obd lustre-OST0001
Lustre: server umount lustre-OST0001 complete
LustreError: 9681:0:(obd_mount.c:1267:lustre_fill_super()) Unable to mount (-5)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark recovery-double-scale test_pairwise_fail: @@@@@@ FAIL: Restart of ost2 failed!
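So the mount fails while the restarted OST tries to register with the MGS (rc = -5, "Is the MGS running?"). As a side note, a quick way to check MGS reachability from the failover OSS in a case like this is along these lines; this is a generic sketch, not commands run in this report, with the MGS NID and pool name taken from the logs above:

lctl ping 10.10.18.253@tcp                            # LNet-level check that the MGS NID answers from the OSS
lctl dl                                               # confirm the MGC and target devices were set up on the OSS
zfs get -H -o value lustre:svname lustre-ost2/ost2    # confirm the Lustre target label on the imported pool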
Dmesg on MDS wtm-83 showed:
Lustre: DEBUG MARKER: Failing type2=OST item2=ost4 ...
Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 4 clients reconnect
Lustre: lustre-MDT0000: Recovery over after 0:08, of 4 clients 4 recovered and 0 were evicted.
Lustre: 5225:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1367901499/real 1367901499] req@ffff880c17898400 x1434348659147084/t0(0) o400->lustre-OST0001-osc-MDT0000@10.10.19.26@tcp:28/4 lens 224/224 e 0 to 1 dl 1367901543 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: lustre-OST0003-osc-MDT0000: Connection to lustre-OST0003 (at 10.10.19.26@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 5225:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: lustre-OST0005-osc-MDT0000: Connection to lustre-OST0005 (at 10.10.19.26@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: DEBUG MARKER: /usr/sbin/lctl mark recovery-double-scale test_pairwise_fail: @@@@@@ FAIL: Restart of ost2 failed!
Lustre: DEBUG MARKER: recovery-double-scale test_pairwise_fail: @@@@@@ FAIL: Restart of ost2 failed!
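The MDT itself completed recovery (4 of 4 clients reconnected) and only lost its connections to the OSTs on the rebooted OSS. For completeness, the recovery state reported above can be confirmed on the MDS with something like the following; a generic sketch, not commands run in this report:

lctl get_param mdt.lustre-MDT0000.recovery_status   # e.g. "status: COMPLETE" plus connected/recovered client counts
lctl dl                                             # list the OST connections (lustre-OST000x-osc-MDT0000) and their state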
Maloo report:
https://maloo.whamcloud.com/test_sets/ebe1f318-b6e0-11e2-b6f1-52540035b04c
Thanks, Lai. That is probably too big a change to include in a maintenance release, so let's close this as fixed in 2.6.