Lustre / LU-8228

replay-single test 87 fails with 'Restart of ost1 failed!'


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.9.0
    • Component/s: None
    • Labels: autotest review-dne-part-2
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      replay-single test_87 fails with

       'Restart of ost1 failed!' 
      

      From the test log, we see that the OST mount fails after failover:

      Failover ost1 to trevis-4vm3
      10:12:00 (1464603120) waiting for trevis-4vm3 network 900 secs ...
      10:12:00 (1464603120) network interface is UP
      CMD: trevis-4vm3 hostname
      mount facets: ost1
      CMD: trevis-4vm3 test -b /dev/lvm-Role_OSS/P1
      CMD: trevis-4vm3 e2label /dev/lvm-Role_OSS/P1
      Starting ost1:   /dev/lvm-Role_OSS/P1 /mnt/ost1
      CMD: trevis-4vm3 mkdir -p /mnt/ost1; mount -t lustre   		                   /dev/lvm-Role_OSS/P1 /mnt/ost1
      trevis-4vm3: mount.lustre: mount /dev/mapper/lvm--Role_OSS-P1 at /mnt/ost1 failed: No such file or directory
      trevis-4vm3: Is the MGS specification correct?
      trevis-4vm3: Is the filesystem name correct?
      trevis-4vm3: If upgrading, is the copied client log valid? (see upgrade docs)
      Start of /dev/lvm-Role_OSS/P1 on ost1 failed 2
      

      This issue has occurred four times in the past two weeks. The logs are not consistent across all failures, but on two of them the OST console shows:

      18:17:53:[ 5498.685473] Lustre: Failing over lustre-OST0000
      18:17:53:[ 5498.697452] Removing read-only on unknown block (0xfc00000)
      18:17:53:[ 5498.707991] Lustre: server umount lustre-OST0000 complete
      18:17:53:[ 5498.869364] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      18:17:53:[ 5499.538728] LustreError: 137-5: lustre-OST0000_UUID: not available for connect from 10.2.4.216@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      18:17:53:[ 5499.548875] LustreError: Skipped 7 previous similar messages
      18:17:53:[ 5509.256355] Lustre: DEBUG MARKER: hostname
      18:17:53:[ 5509.645443] Lustre: DEBUG MARKER: test -b /dev/lvm-Role_OSS/P1
      18:17:53:[ 5509.955019] Lustre: DEBUG MARKER: mkdir -p /mnt/ost1; mount -t lustre   		                   /dev/lvm-Role_OSS/P1 /mnt/ost1
      18:17:53:[ 5510.316293] LDISKFS-fs (dm-0): recovery complete
      18:17:53:[ 5510.334090] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: ,errors=remount-ro,no_mbcache
      18:17:53:[ 5510.375257] LustreError: 26044:0:(llog_osd.c:256:llog_osd_read_header()) lustre-OST0000-osd: bad log lustre-client [0xa:0x74:0x0] header magic: 0x2083300 (expected 0x10645539)
      18:17:53:[ 5510.385449] LustreError: 26044:0:(mgc_request.c:1741:mgc_llog_local_copy()) MGC10.2.4.216@tcp: failed to copy remote log lustre-client: rc = -5
      18:17:53:[ 5510.389682] LustreError: 13a-8: Failed to get MGS log lustre-client and no local copy.
      18:17:53:[ 5510.391675] LustreError: 15c-8: MGC10.2.4.216@tcp: The configuration from log 'lustre-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      18:17:53:[ 5510.397556] LustreError: 26044:0:(obd_mount_server.c:1326:server_start_targets()) lustre-OST0000: failed to start LWP: -2
      18:17:53:[ 5510.399960] LustreError: 26044:0:(obd_mount_server.c:1798:server_fill_super()) Unable to start targets: -2
      18:17:53:[ 5510.402275] Lustre: Failing over lustre-OST0000
      18:17:53:[ 5510.447063] Lustre: server umount lustre-OST0000 complete
      18:17:53:[ 5510.452435] LustreError: 26044:0:(obd_mount.c:1426:lustre_fill_super()) Unable to mount  (-2)
      18:17:53:[ 5510.807253] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_87: @@@@@@ FAIL: Restart of ost1 failed! 
      

      All failures are on review-dne-part-2.

      The following log links cover all failures seen to date:
      2016-05-16 19:50:47 - 2.8.52.70.g49cd5fd - https://testing.hpdd.intel.com/test_sets/4ad651a2-1c0a-11e6-855a-5254006e85c2
      2016-05-25 23:55:13 - 2.8.51.3.ge28e633 - https://testing.hpdd.intel.com/test_sets/3b82c9ec-2326-11e6-a8f9-5254006e85c2
      2016-05-30 08:47:44 - 2.8.53.38.gd685fc5 - https://testing.hpdd.intel.com/test_sets/20dd7d0e-2690-11e6-ab39-5254006e85c2
      2016-05-31 18:40:42 - 2.8.53.28.g3c47b56 - https://testing.hpdd.intel.com/test_sets/a2e361f2-27b1-11e6-aac3-5254006e85c2


          People

            Assignee: Mikhail Pershin (tappro)
            Reporter: James Nunez (jamesanunez) (Inactive)
            Votes: 0
            Watchers: 3
