[LU-5619] Hard Failover replay-dual test_0b: mount MDS failed

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0, Lustre 2.8.0
    • Environment: server and client: lustre-master build #2642
    • Severity: 3
    • 15719

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/718edf3e-37c4-11e4-a2a6-5254006e85c2.

      The sub-test test_0b failed with the following error:

      mount1 fais

      == replay-dual test 0b: lost client during waiting for next transno == 13:28:17 (1410182897)
      CMD: shadow-12vm8 sync; sync; sync
      Filesystem           1K-blocks  Used Available Use% Mounted on
      shadow-12vm12:shadow-12vm8:/lustre
                            14223104 19712  14189056   1% /mnt/lustre
      CMD: shadow-12vm5,shadow-12vm6,shadow-12vm9.shadow.whamcloud.com mcreate /mnt/lustre/fsa-\$(hostname); rm /mnt/lustre/fsa-\$(hostname)
      CMD: shadow-12vm5,shadow-12vm6,shadow-12vm9.shadow.whamcloud.com if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-\$(hostname); rm /mnt/lustre2/fsa-\$(hostname); fi
      CMD: shadow-12vm8 /usr/sbin/lctl --device lustre-MDT0000 notransno
      CMD: shadow-12vm8 /usr/sbin/lctl --device lustre-MDT0000 readonly
      CMD: shadow-12vm8 /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
      CMD: shadow-12vm8 /usr/sbin/lctl dl
      Failing mds1 on shadow-12vm8
      + pm -h powerman --off shadow-12vm8
      Command completed successfully
      reboot facets: mds1
      + pm -h powerman --on shadow-12vm8
      Command completed successfully
      Failover mds1 to shadow-12vm12
      13:28:33 (1410182913) waiting for shadow-12vm12 network 900 secs ...
      13:28:33 (1410182913) network interface is UP
      CMD: shadow-12vm12 hostname
      mount facets: mds1
      CMD: shadow-12vm12 zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
      			zpool import -f -o cachefile=none -d /dev/lvm-Role_MDS lustre-mdt1
      Starting mds1:   lustre-mdt1/mdt1 /mnt/mds1
      CMD: shadow-12vm12 mkdir -p /mnt/mds1; mount -t lustre   		                   lustre-mdt1/mdt1 /mnt/mds1
      CMD: shadow-12vm12 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"-1\" \"all -lnet -lnd -pinger\" 4 
      CMD: shadow-12vm12 zfs get -H -o value lustre:svname 		                           lustre-mdt1/mdt1 2>/dev/null
      Started lustre-MDT0000
      Starting client: shadow-12vm9.shadow.whamcloud.com:  -o user_xattr,flock shadow-12vm12:shadow-12vm8:/lustre /mnt/lustre
      CMD: shadow-12vm9.shadow.whamcloud.com mkdir -p /mnt/lustre
      CMD: shadow-12vm9.shadow.whamcloud.com mount -t lustre -o user_xattr,flock shadow-12vm12:shadow-12vm8:/lustre /mnt/lustre
      mount.lustre: mount shadow-12vm12:shadow-12vm8:/lustre at /mnt/lustre failed: Input/output error
      Is the MGS running?
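
      For reference, the "Input/output error ... Is the MGS running?" failure on the final client mount generally means the client could not reach an MGS at either NID it was given. A minimal manual check of the failover target, assuming the combined MGS/MDT0000 service really did start on shadow-12vm12 as the log above suggests (host, pool, and dataset names are taken from this log; the @tcp network type is an assumption):

      # On the failover MDS node (shadow-12vm12): confirm the MGS and MDT devices are set up
      lctl dl | egrep 'MGS|MDT0000'

      # Confirm the ZFS dataset really carries the expected Lustre target
      zfs get -H -o value lustre:svname lustre-mdt1/mdt1

      # Check whether MDT0000 is still in recovery, and for how long
      lctl get_param mdt.lustre-MDT0000.recovery_status

      # On the client (shadow-12vm9): verify the failover server is reachable over LNet
      lctl ping shadow-12vm12@tcp

      # Then retry the client mount exactly as the test does
      mount -t lustre -o user_xattr,flock shadow-12vm12:shadow-12vm8:/lustre /mnt/lustre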
      

          Activity

            Saurabh Tandan (Inactive) added a comment (edited):
            Another instance found on b2_8 for failover testing, build #6.
            https://testing.hpdd.intel.com/test_sessions/54ec62da-d99d-11e5-9ebe-5254006e85c2
            https://testing.hpdd.intel.com/test_sessions/c5a8e44c-d9c7-11e5-85dd-5254006e85c2

            Saurabh Tandan (Inactive) added a comment:
            Another instance found for hard failover: EL7 Server/Client - ZFS
            build #3305
            https://testing.hpdd.intel.com/test_sets/02982ada-bbc7-11e5-8506-5254006e85c2

            Saurabh Tandan (Inactive) added a comment:
            master, build #3264, 2.7.64 tag
            Hard Failover: EL6.7 Server/Client - ZFS
            It's blocking a series of tests (tests 0b, 1, 2, 3, 4, 5, 6, 8, 10, 19, 21a).
            https://testing.hpdd.intel.com/test_sets/2dfea442-9ebc-11e5-98a4-5254006e85c2
            Sarah Liu added a comment:
            Here is another instance which has console logs:
            https://testing.hpdd.intel.com/test_sets/cb1abc08-6a96-11e4-9c96-5254006e85c2
            Oleg Drokin added a comment:

            With no console logs attached to the run, it is hard to correlate what happens when.
            It may well be that zfs startup takes longer than usual.
            I do see the MDS mount just fine and then enter recovery, but there is no indication of how long that lasted, and we never see recovery complete before the failure is declared.

            Is it possible to fetch console logs?

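            As an illustration of the point above about not knowing how long recovery ran: a small polling loop like the following, run on the acting MDS during the failover, would record when (or whether) recovery completes within the test window. This is only a sketch; the target name lustre-MDT0000 comes from the log above, and the 900-second cap simply mirrors the test's network wait, not any official timeout.

            # Poll MDT0000 recovery status and report how long it takes to reach COMPLETE
            start=$(date +%s)
            while true; do
                status=$(lctl get_param -n mdt.lustre-MDT0000.recovery_status 2>/dev/null | awk '/^status:/ {print $2}')
                elapsed=$(( $(date +%s) - start ))
                echo "$(date '+%H:%M:%S') recovery_status=${status:-unknown} elapsed=${elapsed}s"
                [ "$status" = "COMPLETE" ] && break
                [ "$elapsed" -ge 900 ] && { echo "recovery still not complete after ${elapsed}s"; break; }
                sleep 5
            done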

            People

              Assignee: WC Triage
              Reporter: Maloo
              Votes: 0
              Watchers: 4
