Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Affects Versions: Lustre 2.11.0, Lustre 2.10.3
Description
replay-single test_53g fails in failover test sessions. The last lines of the client test_log are:
Failover mds1 to onyx-42vm8
02:53:10 (1515725590) waiting for onyx-42vm8 network 900 secs ...
02:53:10 (1515725590) network interface is UP
CMD: onyx-42vm8 hostname
mount facets: mds1
CMD: onyx-42vm8 lsmod | grep zfs >&/dev/null || modprobe zfs;
zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
zpool import -f -o cachefile=none -d /dev/lvm-Role_MDS lustre-mdt1
CMD: onyx-42vm8 zfs get -H -o value lustre:svname lustre-mdt1/mdt1
Starting mds1: lustre-mdt1/mdt1 /mnt/lustre-mds1
CMD: onyx-42vm8 mkdir -p /mnt/lustre-mds1; mount -t lustre lustre-mdt1/mdt1 /mnt/lustre-mds1
CMD: onyx-42vm8 /usr/sbin/lctl get_param -n health_check
CMD: onyx-42vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/compat-openmpi16/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck\" \"all\" 4
onyx-42vm8: onyx-42vm8.onyx.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
CMD: onyx-42vm8 zfs get -H -o value lustre:svname lustre-mdt1/mdt1 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
CMD: onyx-42vm8 zfs get -H -o value lustre:svname lustre-mdt1/mdt1 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
CMD: onyx-42vm8 zfs get -H -o value lustre:svname lustre-mdt1/mdt1 2>/dev/null
Started lustre-MDT0000
replay-single test_53g: @@@@@@ FAIL: close_pid should not exist
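The mount path in the log above uses a check-then-import idiom for the ZFS backing pool: probe cheaply with `zpool list`, and only fall back to `zpool import` when the pool is not yet visible on the failover node. A minimal sketch of that idiom, with the pool name and device directory taken from the log (the helper name is illustrative, not a Lustre function):

```shell
# Sketch of the check-then-import idiom from the failover mount log.
# ensure_pool_imported is a hypothetical name for illustration only.
ensure_pool_imported() {
        pool=$1
        # Cheap probe: succeeds if the pool is already imported.
        zpool list -H "$pool" >/dev/null 2>&1 ||
                # Otherwise force-import it from the shared device directory,
                # without writing a cachefile on this node.
                zpool import -f -o cachefile=none -d /dev/lvm-Role_MDS "$pool"
}
```

On a failover pair this keeps the mount step idempotent: re-running it on a node that already holds the pool is a no-op, while a freshly failed-over node picks the pool up from the shared devices.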
Test 53g, up to the failing check, looks like this:
test_53g() {
	cancel_lru_locks mdc # cleanup locks from former test cases

	mkdir $DIR/${tdir}-1 || error "mkdir $DIR/${tdir}-1 failed"
	mkdir $DIR/${tdir}-2 || error "mkdir $DIR/${tdir}-2 failed"
	multiop $DIR/${tdir}-1/f O_c &
	close_pid=$!

	#define OBD_FAIL_MDS_REINT_NET_REP 0x119
	do_facet $SINGLEMDS "lctl set_param fail_loc=0x119"
	mcreate $DIR/${tdir}-2/f &
	open_pid=$!
	sleep 1

	#define OBD_FAIL_MDS_CLOSE_NET 0x115
	do_facet $SINGLEMDS "lctl set_param fail_loc=0x80000115"
	kill -USR1 $close_pid
	cancel_lru_locks mdc # force the close
	do_facet $SINGLEMDS "lctl set_param fail_loc=0"

	#bz20647: make sure all pids exist before failover
	[ -d /proc/$close_pid ] || error "close_pid doesn't exist"
	[ -d /proc/$open_pid ] || error "open_pid doesn't exist"
	replay_barrier_nodf $SINGLEMDS
	fail_nodf $SINGLEMDS
	wait $open_pid || error "open_pid failed"
	sleep 2
	# close should be gone
	[ -d /proc/$close_pid ] && error "close_pid should not exist"
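The checks at the end rely on a still-running (or not-yet-reaped) process having a directory under /proc. A minimal standalone sketch of that liveness idiom, with `sleep` standing in for the `multiop ... O_c` child (names here are illustrative, not the Lustre helpers):

```shell
#!/bin/sh
# Sketch of the /proc/<pid> liveness check used by test_53g.
# 'sleep' stands in for "multiop $DIR/${tdir}-1/f O_c", which opens a
# file and blocks until SIGUSR1 tells it to close.
sleep 60 &
close_pid=$!

# While the child runs, /proc/<pid> exists.
[ -d /proc/$close_pid ] && echo "close_pid alive"

kill "$close_pid"               # the real test sends SIGUSR1 instead
wait "$close_pid" 2>/dev/null   # reap it; its /proc entry disappears

# Once reaped, the directory is gone, which is what the final
# assertion in test_53g expects for close_pid.
[ -d /proc/$close_pid ] || echo "close_pid gone"
```

In the test itself close_pid is a child of the test script, so the failure mode here appears to be that the multiop close never completes after failover, leaving /proc/$close_pid in place when the final check runs.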
This test has failed with this error only twice so far:
2018-01-12 - b2_10 2.10.3.RC1 - https://testing.hpdd.intel.com/test_sets/22ac34a8-f750-11e7-a10a-52540065bddc
2018-01-11 - master 2.10.56.102 - https://testing.hpdd.intel.com/test_sets/be07ca94-f6cd-11e7-bd00-52540065bddc