Lustre / LU-10518

replay-single test 53g failed with 'close_pid should not exist'

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.11.0, Lustre 2.10.3

    Description

      replay-single test_53g fails for failover test sessions. The last lines in the client test_log are:

      Failover mds1 to onyx-42vm8
      02:53:10 (1515725590) waiting for onyx-42vm8 network 900 secs ...
      02:53:10 (1515725590) network interface is UP
      CMD: onyx-42vm8 hostname
      mount facets: mds1
      CMD: onyx-42vm8 lsmod | grep zfs >&/dev/null || modprobe zfs;
      			zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
      			zpool import -f -o cachefile=none -d /dev/lvm-Role_MDS lustre-mdt1
      CMD: onyx-42vm8 zfs get -H -o value 						lustre:svname lustre-mdt1/mdt1
      Starting mds1:   lustre-mdt1/mdt1 /mnt/lustre-mds1
      CMD: onyx-42vm8 mkdir -p /mnt/lustre-mds1; mount -t lustre   		                   lustre-mdt1/mdt1 /mnt/lustre-mds1
      CMD: onyx-42vm8 /usr/sbin/lctl get_param -n health_check
      CMD: onyx-42vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/compat-openmpi16/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck\" \"all\" 4 
      onyx-42vm8: onyx-42vm8.onyx.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      CMD: onyx-42vm8 zfs get -H -o value 				lustre:svname lustre-mdt1/mdt1 2>/dev/null | 				grep -E ':[a-zA-Z]{3}[0-9]{4}'
      CMD: onyx-42vm8 zfs get -H -o value 				lustre:svname lustre-mdt1/mdt1 2>/dev/null | 				grep -E ':[a-zA-Z]{3}[0-9]{4}'
      CMD: onyx-42vm8 zfs get -H -o value lustre:svname 		                           lustre-mdt1/mdt1 2>/dev/null
      Started lustre-MDT0000
       replay-single test_53g: @@@@@@ FAIL: close_pid should not exist
      

      Test 53g looks like the following, up to the error:

      test_53g() {
              cancel_lru_locks mdc    # cleanup locks from former test cases

              mkdir $DIR/${tdir}-1 || error "mkdir $DIR/${tdir}-1 failed"
              mkdir $DIR/${tdir}-2 || error "mkdir $DIR/${tdir}-2 failed"
              multiop $DIR/${tdir}-1/f O_c &
              close_pid=$!

              #define OBD_FAIL_MDS_REINT_NET_REP 0x119
              do_facet $SINGLEMDS "lctl set_param fail_loc=0x119"
              mcreate $DIR/${tdir}-2/f &
              open_pid=$!
              sleep 1

              #define OBD_FAIL_MDS_CLOSE_NET 0x115
              do_facet $SINGLEMDS "lctl set_param fail_loc=0x80000115"
              kill -USR1 $close_pid
              cancel_lru_locks mdc    # force the close
              do_facet $SINGLEMDS "lctl set_param fail_loc=0"

              #bz20647: make sure all pids are exists before failover
              [ -d /proc/$close_pid ] || error "close_pid doesn't exist"
              [ -d /proc/$open_pid ] || error "open_pid doesn't exists"
              replay_barrier_nodf $SINGLEMDS
              fail_nodf $SINGLEMDS
              wait $open_pid || error "open_pid failed"
              sleep 2
              # close should be gone
              [ -d /proc/$close_pid ] && error "close_pid should not exist"
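
      The two fail_loc values in the excerpt differ in more than the fail-point id: 0x80000115 also has the top bit set. The sketch below decodes such a value, assuming the usual Lustre convention that 0x80000000 is the one-shot ("fail once") flag and that the low 16 bits carry the fail-point id. The names decode_fail_loc, FAIL_ONCE and FAIL_ID_MASK are hypothetical and exist only for this illustration.

```shell
#!/bin/bash
# Assumption: 0x80000000 marks a one-shot fail point and the low 16 bits
# hold the fail-point id. These names are illustrative, not Lustre's own.
FAIL_ONCE=$(( 0x80000000 ))
FAIL_ID_MASK=$(( 0x0000ffff ))

# Print the fail-point id and whether the one-shot flag is set.
decode_fail_loc() {
    local value=$(( $1 ))
    local once=no
    if (( value & FAIL_ONCE )); then
        once=yes
    fi
    printf 'id=0x%x once=%s\n' $(( value & FAIL_ID_MASK )) "$once"
}

decode_fail_loc 0x119        # prints: id=0x119 once=no
decode_fail_loc 0x80000115   # prints: id=0x115 once=yes
```

      Under that assumption, the REINT reply drop (0x119) stays armed until fail_loc is overwritten, while the CLOSE drop (0x115) would fire for a single request only.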
      

      This test has failed with this error only a couple of times:
      2018-01-12 - b2_10 2.10.3.RC1 - https://testing.hpdd.intel.com/test_sets/22ac34a8-f750-11e7-a10a-52540065bddc
      2018-01-11 - master 2.10.56.102 - https://testing.hpdd.intel.com/test_sets/be07ca94-f6cd-11e7-bd00-52540065bddc
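
      The error comes from the final /proc check in the test excerpt, which runs after a fixed 2-second sleep. A minimal standalone sketch of that liveness pattern follows, with a plain sleep process standing in for multiop and a hypothetical helper pid_exists. It assumes a Linux /proc filesystem and does not reproduce the failover itself.

```shell
#!/bin/bash
# Hypothetical helper: true while /proc has an entry for the given pid.
pid_exists() {
    [ -d "/proc/$1" ]
}

# Stand-in for "multiop ... &": any long-running background process.
sleep 30 &
close_pid=$!

# Same shape as the pre-failover check: the process must still be there.
pid_exists "$close_pid" && echo "before: close_pid exists"

# End the process and reap it so its /proc entry is removed.
kill "$close_pid"
wait "$close_pid" 2>/dev/null || true

# Same shape as the failing check: the process should now be gone.
if pid_exists "$close_pid"; then
    echo "after: close_pid still exists"
else
    echo "after: close_pid is gone"
fi
```

      In the sketch the explicit wait guarantees the process has exited before the second check. The test excerpt only waits on open_pid, so close_pid is checked after the fixed sleep without a matching wait.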

            People

              Assignee: WC Triage (wc-triage)
              Reporter: James Nunez (jamesanunez, Inactive)