[LU-10518] replay-single test 53g failed with 'close_pid should not exist' Created: 16/Jan/18  Updated: 13/Jan/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.3
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-single test_53g fails for failover test sessions. The last lines in the client test_log are:

Failover mds1 to onyx-42vm8
02:53:10 (1515725590) waiting for onyx-42vm8 network 900 secs ...
02:53:10 (1515725590) network interface is UP
CMD: onyx-42vm8 hostname
mount facets: mds1
CMD: onyx-42vm8 lsmod | grep zfs >&/dev/null || modprobe zfs;
			zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
			zpool import -f -o cachefile=none -d /dev/lvm-Role_MDS lustre-mdt1
CMD: onyx-42vm8 zfs get -H -o value 						lustre:svname lustre-mdt1/mdt1
Starting mds1:   lustre-mdt1/mdt1 /mnt/lustre-mds1
CMD: onyx-42vm8 mkdir -p /mnt/lustre-mds1; mount -t lustre   		                   lustre-mdt1/mdt1 /mnt/lustre-mds1
CMD: onyx-42vm8 /usr/sbin/lctl get_param -n health_check
CMD: onyx-42vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/compat-openmpi16/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck\" \"all\" 4 
onyx-42vm8: onyx-42vm8.onyx.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
CMD: onyx-42vm8 zfs get -H -o value 				lustre:svname lustre-mdt1/mdt1 2>/dev/null | 				grep -E ':[a-zA-Z]{3}[0-9]{4}'
CMD: onyx-42vm8 zfs get -H -o value 				lustre:svname lustre-mdt1/mdt1 2>/dev/null | 				grep -E ':[a-zA-Z]{3}[0-9]{4}'
CMD: onyx-42vm8 zfs get -H -o value lustre:svname 		                           lustre-mdt1/mdt1 2>/dev/null
Started lustre-MDT0000
 replay-single test_53g: @@@@@@ FAIL: close_pid should not exist

Test 53g looks like the following, up to the error:

test_53g() {
        cancel_lru_locks mdc    # cleanup locks from former test cases

        mkdir $DIR/${tdir}-1 || error "mkdir $DIR/${tdir}-1 failed"
        mkdir $DIR/${tdir}-2 || error "mkdir $DIR/${tdir}-2 failed"
        multiop $DIR/${tdir}-1/f O_c &
        close_pid=$!

        #define OBD_FAIL_MDS_REINT_NET_REP 0x119
        do_facet $SINGLEMDS "lctl set_param fail_loc=0x119"
        mcreate $DIR/${tdir}-2/f &
        open_pid=$!
        sleep 1

        #define OBD_FAIL_MDS_CLOSE_NET 0x115
        do_facet $SINGLEMDS "lctl set_param fail_loc=0x80000115"
        kill -USR1 $close_pid
        cancel_lru_locks mdc    # force the close
        do_facet $SINGLEMDS "lctl set_param fail_loc=0"

        #bz20647: make sure all pids are exists before failover
        [ -d /proc/$close_pid ] || error "close_pid doesn't exist"
        [ -d /proc/$open_pid ] || error "open_pid doesn't exists"
        replay_barrier_nodf $SINGLEMDS
        fail_nodf $SINGLEMDS
        wait $open_pid || error "open_pid failed"
        sleep 2
        # close should be gone
        [ -d /proc/$close_pid ] && error "close_pid should not exist"
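
The failing check above tests for /proc/$close_pid after a fixed "sleep 2". As an illustration only, the sketch below shows the same /proc-based liveness test wrapped in a polling loop with a timeout, which would distinguish a close that is merely slow after failover from one that never completes. The helper name wait_for_pid_exit and its arguments are hypothetical; it is not part of the Lustre test framework, and nothing in this ticket establishes that timing is the cause of the failure.

```shell
# Hypothetical helper (an assumption for illustration, not Lustre
# test-framework code). Polls until the process with the given PID has
# exited or the timeout in seconds expires.
# Returns 0 if /proc/<pid> is gone, 1 if it still exists at the timeout.
wait_for_pid_exit() {
	local pid=$1
	local timeout=${2:-10}
	local waited=0

	# Same liveness test as test_53g: /proc/<pid> exists while the
	# process does (Linux only, requires /proc to be mounted).
	while [ -d "/proc/$pid" ]; do
		[ "$waited" -ge "$timeout" ] && return 1
		sleep 1
		waited=$((waited + 1))
	done
	return 0
}
```

Under these assumptions, a caller could write `wait_for_pid_exit $close_pid 10 || error "close_pid should not exist"` in place of the fixed sleep and the one-shot check. A timeout of 0 reduces the helper to the original single test.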

This test has failed with this error only a couple of times:
2018-01-12 - b2_10 2.10.3.RC1 - https://testing.hpdd.intel.com/test_sets/22ac34a8-f750-11e7-a10a-52540065bddc
2018-01-11 - master 2.10.56.102 - https://testing.hpdd.intel.com/test_sets/be07ca94-f6cd-11e7-bd00-52540065bddc



 Comments   
Comment by Jian Yu [ 06/Feb/18 ]

More failure instances on master branch under failover test group:
https://testing.hpdd.intel.com/test_sets/b226317a-08a2-11e8-a10a-52540065bddc
https://testing.hpdd.intel.com/test_sets/fe6bae0a-0ad9-11e8-a7cd-52540065bddc
https://testing.hpdd.intel.com/test_sets/4d1259a8-06d6-11e8-a7cd-52540065bddc
