Details
Description
Today we experienced a hardware failure on our MDS. The MDS rebooted and then came back. We restarted the MDS, but IR behaved strangely. Four clients were evicted, but when the time-to-completion timer got down to zero, IR restarted all over again. Then, once it got down to the 700-second range, the timer started to go up. It did this a few times before letting the timer run out. Once the timer finally reached zero, the recovery state was still reported as being in recovery. It remained this way for several more minutes before finally reaching a recovered state. In all, it took 54 minutes to recover.
Were both the MGS and the MDS failed over in this test?
The IR status is set to IR_STARTUP after the MGS is started and changes to IR_FULL after "ir_timeout" seconds
(the default is OBD_IR_MGS_TIMEOUT = "4*obd_timeout"). A target (MDT or OST) registered with the MGS is marked as
"LDD_F_IR_CAPABLE" only if the IR status is IR_FULL; otherwise "IR" is printed as "DISABLED".
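
To make the timing rule above concrete, here is a minimal sketch in C. It is not the Lustre implementation: the enum and the helper names (default_ir_timeout, mgs_ir_state, target_ir_capable) are hypothetical and exist only for illustration, and times are plain seconds. It assumes the state depends only on how long the MGS has been up.

```c
#include <stdbool.h>

/* Illustrative stand-ins only; the real definitions live in the Lustre source. */
enum ir_state { IR_STARTUP, IR_FULL };

/* Default IR window described above: 4 * obd_timeout, in seconds. */
static long default_ir_timeout(long obd_timeout)
{
    return 4 * obd_timeout;
}

/* IR state of the MGS at time 'now', given its start time and the IR timeout. */
static enum ir_state mgs_ir_state(long mgs_start, long now, long ir_timeout)
{
    return (now - mgs_start < ir_timeout) ? IR_STARTUP : IR_FULL;
}

/* A registering target is treated as IR-capable only once the MGS is in IR_FULL. */
static bool target_ir_capable(enum ir_state state)
{
    return state == IR_FULL;
}
```

With obd_timeout = 100, for example, a target that registers 399 seconds after the MGS started would still see IR_STARTUP under this sketch and would not be treated as IR-capable.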
On the client side, imperative_recovery is marked as "Enabled" if the connection with the server supports imperative recovery
((imp->imp_connect_data.ocd_connect_flags & OBD_CONNECT_IMP_RECOV) != 0).
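
The client-side check is a bit test on the connect flags. The sketch below shows the shape of such a test in C. DEMO_CONNECT_IMP_RECOV and import_supports_ir are hypothetical names, and the bit value is a placeholder rather than the real OBD_CONNECT_IMP_RECOV from the Lustre headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit for illustration; not the real OBD_CONNECT_IMP_RECOV value. */
#define DEMO_CONNECT_IMP_RECOV (UINT64_C(1) << 3)

/* Returns true when the negotiated connect flags include the IR bit. */
static bool import_supports_ir(uint64_t connect_flags)
{
    /* Parenthesize the mask: in C, == and != bind tighter than &. */
    return (connect_flags & DEMO_CONNECT_IMP_RECOV) != 0;
}
```

The parentheses are the point of the example: written as `flags & FLAG == TRUE`, the expression parses as `flags & (FLAG == TRUE)`, which tests something else entirely.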