Details
Description
Today we experienced a hardware failure on our MDS. The MDS rebooted and then came back. We restarted the MDS, but IR behaved strangely. Four clients were evicted, but when the time-to-completion timer got down to zero, IR restarted all over again. Then, once it got down to the 700-second range, the timer started to go up. It did this a few times before letting the timer run out. Once the timer finally reached zero, the recovery state was still reported as being in recovery. It remained this way for several more minutes before finally reaching a recovered state. In all, it took 54 minutes to recover.
Is the failover mode the same for both tests, on Dec 31, 2014 and on Jan 14, 2015? That is, is there a separate node
running only the MGS, to which the MDS, OSSs, and client nodes connect, and are the MDS and OSSs failed over together?