Details
Type: Bug
Resolution: Cannot Reproduce
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.2.0
Labels: None
Environment: lustre-master build #353 RHEL6-x86_64 for both server and client
Severity: 3
Rank (Obsolete): 10260
Description
Running recovery-mds-scale with FLAVOR=OSS, quota enabled, and the HARD failure mode, the console log shows that one OSS's network interface comes up, but after a while the node cannot be accessed at all. After rebooting the node, it is usable again.
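For reference, a minimal sketch of how such a run is typically launched from the Lustre test framework. The exact invocation used here is an assumption; DURATION and PERIOD are taken from the log below, and the variable names follow the test framework's conventions of that era.

  # Hedged sketch: drive recovery-mds-scale against the OSSs with quota on
  # and hard (power-cycle) failovers. This exact variable set is assumed,
  # not taken from the ticket.
  FLAVOR=OSS FAILURE_MODE=HARD ENABLE_QUOTA=yes \
  DURATION=86400 PERIOD=600 \
  sh recovery-mds-scale.sh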
==== Checking the clients loads AFTER failover -- failure NOT OK
ost6 has failed over 1 times, and counting...
sleeping 421 seconds ...
==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=179 DURATION=86400 PERIOD=600
Wait ost4 recovery complete before doing next failover ....
affected facets: ost1,ost2,ost3,ost4,ost5,ost6
client-12: *.lustre-OST0000.recovery_status status: INACTIVE
client-12: *.lustre-OST0001.recovery_status status: COMPLETE
client-12: *.lustre-OST0002.recovery_status status: INACTIVE
client-12: *.lustre-OST0003.recovery_status status: COMPLETE
client-12: *.lustre-OST0004.recovery_status status: INACTIVE
client-12: *.lustre-OST0005.recovery_status status: COMPLETE
Checking clients are in FULL state before doing next failover
client-13: osc.lustre-OST0000-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-13: cannot run remote command on client-13,client-17,client-18 with
client-13: osc.lustre-OST0001-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-18: osc.lustre-OST0000-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-13: cannot run remote command on client-13,client-17,client-18 with
client-18: cannot run remote command on client-13,client-17,client-18 with
client-13: osc.lustre-OST0002-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-18: osc.lustre-OST0001-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-13: cannot run remote command on client-13,client-17,client-18 with
client-13: osc.lustre-OST0003-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-18: cannot run remote command on client-13,client-17,client-18 with
client-13: cannot run remote command on client-13,client-17,client-18 with
client-17: osc.lustre-OST0000-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-13: osc.lustre-OST0004-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-18: osc.lustre-OST0002-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-13: cannot run remote command on client-13,client-17,client-18 with
client-17: cannot run remote command on client-13,client-17,client-18 with
client-18: cannot run remote command on client-13,client-17,client-18 with
client-13: osc.lustre-OST0005-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-17: osc.lustre-OST0001-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-18: osc.lustre-OST0003-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-13: cannot run remote command on client-13,client-17,client-18 with
client-17: cannot run remote command on client-13,client-17,client-18 with
client-18: cannot run remote command on client-13,client-17,client-18 with
client-17: osc.lustre-OST0002-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-18: osc.lustre-OST0004-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-17: cannot run remote command on client-13,client-17,client-18 with
client-18: cannot run remote command on client-13,client-17,client-18 with
client-17: osc.lustre-OST0003-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-18: osc.lustre-OST0005-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-17: cannot run remote command on client-13,client-17,client-18 with
client-18: cannot run remote command on client-13,client-17,client-18 with
client-17: osc.lustre-OST0004-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-17: cannot run remote command on client-13,client-17,client-18 with
client-17: osc.lustre-OST0005-osc-[^M]*.ost_server_uuid in FULL state after 0 sec
client-17: cannot run remote command on client-13,client-17,client-18 with
Starting failover on ost4
Failing ost4 on node client-12
+ pm -h powerman --off client-12
Command completed successfully
affected facets: ost1,ost2,ost3,ost4,ost5,ost6
+ pm -h powerman --on client-12
Command completed successfully
Failover ost1 to fat-amd-2
Failover ost2 to fat-amd-2
Failover ost3 to fat-amd-2
Failover ost4 to fat-amd-2
Failover ost5 to fat-amd-2
Failover ost6 to fat-amd-2
15:04:41 (1322867081) waiting for fat-amd-2 network 900 secs ...
15:04:41 (1322867081) network interface is UP
Starting ost1: /dev/disk/by-id/scsi-1IET_00020001 /mnt/ost1
fat-amd-2: debug=0xb3f0405
fat-amd-2: subsystem_debug=0xffb7efff
fat-amd-2: debug_mb=48
Started lustre-OST0000
Starting ost2: /dev/disk/by-id/scsi-1IET_00030001 /mnt/ost2
-------------------------------------------------------------------
PING fat-amd-2.lab.whamcloud.com (10.10.4.133) 56(84) bytes of data.
From brent.lab.whamcloud.com (10.10.0.1) icmp_seq=1 Destination Host Unreachable
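The ping above shows the node is unreachable at the IP layer, not merely at the Lustre level. A minimal triage sketch for distinguishing the two cases; the @tcp NID and the eth0 interface name are assumptions, not taken from this log:

  # From a surviving node: IP-layer reachability of the failover OSS
  ping -c 3 10.10.4.133
  # LNet-layer reachability (assumes the OSS NID is 10.10.4.133@tcp)
  lctl ping 10.10.4.133@tcp
  # On the OSS console, if it can still be reached there:
  ip addr show eth0      # interface name is an assumption
  lctl list_nids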
Issue Links
- is related to: LU-885 recovery-mds-scale (FLAVOR=mds) fail, network is not available (Resolved)
Trackbacks
- Lustre 2.2.0 mini release testing tracker: Lustre 2.2.0 Mini Release Tag: 2.1.52.0 Build: https://newbuild.whamcloud....
- Lustre 2.2.0 release testing tracker: Lustre 2.2.0 RC1 Tag: 2.2.0RC1 Build: https://build.whamcloud.com/job/lustreb22/11/ Google doc: https://docs.google.com/a/whamcloud.com/spreadsheet/ccc?key=0AkK5hBTd2cvHdDFsSWt2RlBocE5kdi03OUYtX21ZYkE#gid=3 Lustre 2.2....