Lustre / LU-893

system hang when running recovery-mds-scale FLAVOR=OSS

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.2.0
    • Labels: None
    • Environment: lustre-master build #353 RHEL6-x86_64 for both server and client
    • Severity: 3
    • Rank: 10260

    Description

      When running recovery-mds-scale with FLAVOR=OSS, quota enabled, and HARD failure mode, the console log shows that the network interface on one of the OSS nodes came up, but after a while the node could no longer be reached. After the node was rebooted, it became usable again.

      ==== Checking the clients loads AFTER failover – failure NOT OK
      ost6 has failed over 1 times, and counting...
      sleeping 421 seconds ...
      ==== Checking the clients loads BEFORE failover – failure NOT OK ELAPSED=179 DURATION=86400 PERIOD=600
      Wait ost4 recovery complete before doing next failover ....
      affected facets: ost1,ost2,ost3,ost4,ost5,ost6
      client-12: *.lustre-OST0000.recovery_status status: INACTIVE
      client-12: *.lustre-OST0001.recovery_status status: COMPLETE
      client-12: *.lustre-OST0002.recovery_status status: INACTIVE
      client-12: *.lustre-OST0003.recovery_status status: COMPLETE
      client-12: *.lustre-OST0004.recovery_status status: INACTIVE
      client-12: *.lustre-OST0005.recovery_status status: COMPLETE
      Checking clients are in FULL state before doing next failover
      client-13: osc.lustre-OST0000-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
      client-13: cannot run remote command on client-13,client-17,client-18 with
      client-13: osc.lustre-OST0001-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
      client-18: osc.lustre-OST0000-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
      client-13: cannot run remote command on client-13,client-17,client-18 with
      client-18: cannot run remote command on client-13,client-17,client-18 with
      client-13: osc.lustre-OST0002-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
      client-18: osc.lustre-OST0001-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
      client-13: cannot run remote command on client-13,client-17,client-18 with
      client-13: osc.lustre-OST0003-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
      client-18: cannot run remote command on client-13,client-17,client-18 with
      client-13: cannot run remote command on client-13,client-17,client-18 with
      client-17: osc.lustre-OST0000-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
      client-13: osc.lustre-OST0004-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
      client-18: osc.lustre-OST0002-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
      client-13: cannot run remote command on client-13,client-17,client-18 with
      client-17: cannot run remote command on client-13,client-17,client-18 with
      client-18: cannot run remote command on client-13,client-17,client-18 with
      client-13: osc.lustre-OST0005-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
      client-17: osc.lustre-OST0001-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
      client-18: osc.lustre-OST0003-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
      client-13: cannot run remote command on client-13,client-17,client-18 with
      client-17: cannot run remote command on client-13,client-17,client-18 with
      client-18: cannot run remote command on client-13,client-17,client-18 with
      client-17: osc.lustre-OST0002-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
      client-18: osc.lustre-OST0004-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
      client-17: cannot run remote command on client-13,client-17,client-18 with
      client-18: cannot run remote command on client-13,client-17,client-18 with
      client-17: osc.lustre-OST0003-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
      client-18: osc.lustre-OST0005-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
      client-17: cannot run remote command on client-13,client-17,client-18 with
      client-18: cannot run remote command on client-13,client-17,client-18 with
      client-17: osc.lustre-OST0004-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
      client-17: cannot run remote command on client-13,client-17,client-18 with
      client-17: osc.lustre-OST0005-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
      client-17: cannot run remote command on client-13,client-17,client-18 with
      Starting failover on ost4
      Failing ost4 on node client-12
      + pm -h powerman --off client-12
      Command completed successfully
      affected facets: ost1,ost2,ost3,ost4,ost5,ost6
      + pm -h powerman --on client-12
      Command completed successfully
      Failover ost1 to fat-amd-2
      Failover ost2 to fat-amd-2
      Failover ost3 to fat-amd-2
      Failover ost4 to fat-amd-2
      Failover ost5 to fat-amd-2
      Failover ost6 to fat-amd-2
      15:04:41 (1322867081) waiting for fat-amd-2 network 900 secs ...
      15:04:41 (1322867081) network interface is UP
      Starting ost1: /dev/disk/by-id/scsi-1IET_00020001 /mnt/ost1
      fat-amd-2: debug=0xb3f0405
      fat-amd-2: subsystem_debug=0xffb7efff
      fat-amd-2: debug_mb=48
      Started lustre-OST0000
      Starting ost2: /dev/disk/by-id/scsi-1IET_00030001 /mnt/ost2
      -------------------------------------------------------------------

      PING fat-amd-2.lab.whamcloud.com (10.10.4.133) 56(84) bytes of data.
      From brent.lab.whamcloud.com (10.10.0.1) icmp_seq=1 Destination Host Unreachable
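
      For triage, the recovery_status lines in the console log above can be tallied per status. The following is a minimal sketch and not part of the Lustre test framework: the function name summarize_recovery, the regular expression, and the SAMPLE text are assumptions based only on the line format shown in this log.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the console log in this ticket:
#   client-12: *.lustre-OST0000.recovery_status status: INACTIVE
RECOVERY_RE = re.compile(
    r"^\s*(?P<node>\S+):\s+\*\.(?P<target>lustre-OST[0-9a-fA-F]{4})"
    r"\.recovery_status status:\s+(?P<status>\w+)\s*$"
)


def summarize_recovery(log_text):
    """Group OST targets by the recovery status reported in the log text."""
    by_status = defaultdict(list)
    for line in log_text.splitlines():
        match = RECOVERY_RE.match(line)
        if match:
            by_status[match.group("status")].append(match.group("target"))
    return dict(by_status)


# Lines copied from the log above; unrelated lines are ignored by the parser.
SAMPLE = """\
affected facets: ost1,ost2,ost3,ost4,ost5,ost6
client-12: *.lustre-OST0000.recovery_status status: INACTIVE
client-12: *.lustre-OST0001.recovery_status status: COMPLETE
client-12: *.lustre-OST0002.recovery_status status: INACTIVE
client-12: *.lustre-OST0003.recovery_status status: COMPLETE
client-12: *.lustre-OST0004.recovery_status status: INACTIVE
client-12: *.lustre-OST0005.recovery_status status: COMPLETE
Checking clients are in FULL state before doing next failover
"""

if __name__ == "__main__":
    for status, targets in sorted(summarize_recovery(SAMPLE).items()):
        print(f"{status}: {', '.join(targets)}")
```

      On the excerpt above, this groups OST0000, OST0002, and OST0004 under INACTIVE and OST0001, OST0003, and OST0005 under COMPLETE, which matches what client-12 reported before the failover of ost4.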

            People

              wc-triage WC Triage
              sarah Sarah Liu