[LU-5724] IR recovery doesn't behave properly with Lustre 2.5

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • Lustre 2.5.3
    • MDS server running RHEL 6.5 with the ORNL 2.5.3 branch plus about 12 patches.
    • 2
    • 16076

    Description

      Today we experienced a hardware failure on our MDS. The MDS rebooted and came back up. We restarted the MDS, but IR behaved strangely. Four clients were evicted, but when the time-to-completion timer got down to zero, IR restarted all over again. Then, once it got to the 700-second range, the timer started to go up. It did this a few times before letting the timer run out. Once the timer finally reached zero, the recovery state was still reported as being in recovery. It remained this way for several more minutes before finally reaching a recovered state. In all, it took 54 minutes to recover.

          Activity

            yujian Jian Yu added a comment -

            Close this ticket as a duplicate of LU-4119.

            yujian Jian Yu added a comment -

            Large scale testing passed.

            Hi James, can we close this ticket now?

            yujian Jian Yu added a comment -

            With the patch for LU-4119, the recovery issue did not occur in small-scale testing at ORNL. Large-scale testing will be performed.

            hongchao.zhang Hongchao Zhang added a comment - - edited

            Is the failover mode the same for both tests, on Dec 31, 2014 and on Jan 14, 2015? That is, a separate node
            runs only the MGS, the MDS, OSSs, and client nodes connect to it, and the MDS and OSSs are failed over together?

            simmonsja James A Simmons added a comment -

            No, the MGS is left up. We failed over the MDS and OSS together.

            hongchao.zhang Hongchao Zhang added a comment -

            Are both the MGS and the MDS failed over in this test?

            The IR status is set to IR_STARTUP after the MGS is started and changes to IR_FULL after "ir_timeout" seconds
            (the default is OBD_IR_MGS_TIMEOUT = "4*obd_timeout"). A target (MDT or OST) registering with the MGS is marked as
            "LDD_F_IR_CAPABLE" only if the IR status is IR_FULL; otherwise "IR" is printed as "DISABLED".

            On the client side, imperative_recovery is marked as "Enabled" if the connection with the server supports imperative recovery
            ((imp->imp_connect_data.ocd_connect_flags & OBD_CONNECT_IMP_RECOV) != 0).
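The IR state rules described in the comment above can be modelled in a few lines. This is only an illustrative Python sketch, not Lustre code: the function names are hypothetical, the bit value used for OBD_CONNECT_IMP_RECOV is assumed for the example (the real definition lives in the Lustre headers), the exact boundary at which IR_STARTUP becomes IR_FULL is assumed, and "ENABLED" is assumed as the counterpart of the "DISABLED" label mentioned in the comment.

```python
from enum import Enum

# Assumed bit value, for illustration only; see the Lustre headers for the real one.
OBD_CONNECT_IMP_RECOV = 1 << 42


class IRState(Enum):
    IR_STARTUP = "IR_STARTUP"
    IR_FULL = "IR_FULL"


def ir_timeout(obd_timeout):
    """Default IR window per the comment: OBD_IR_MGS_TIMEOUT = 4 * obd_timeout."""
    return 4 * obd_timeout


def mgs_ir_state(seconds_since_mgs_start, obd_timeout):
    """The MGS starts in IR_STARTUP and moves to IR_FULL after ir_timeout seconds.

    Whether the switch happens exactly at the boundary is an assumption here.
    """
    if seconds_since_mgs_start >= ir_timeout(obd_timeout):
        return IRState.IR_FULL
    return IRState.IR_STARTUP


def target_ir_capable(state):
    """A registering MDT/OST is marked LDD_F_IR_CAPABLE only when the MGS is IR_FULL."""
    return state is IRState.IR_FULL


def target_ir_label(state):
    """Label shown for the target; "ENABLED" is an assumed counterpart of "DISABLED"."""
    return "ENABLED" if target_ir_capable(state) else "DISABLED"


def client_ir_enabled(connect_flags):
    """Client side: IR is on when the negotiated connect flags carry the IR bit."""
    return (connect_flags & OBD_CONNECT_IMP_RECOV) != 0


if __name__ == "__main__":
    obd_timeout = 100  # assumed value, in seconds
    for elapsed in (60, 500):
        state = mgs_ir_state(elapsed, obd_timeout)
        print(elapsed, state.value, target_ir_label(state))
```

Under these assumptions, a target that registers shortly after an MGS restart would be reported with IR "DISABLED", while one that registers after the window has passed would be IR-capable. Since the MGS was left up in the test discussed here, this model alone would not predict a disabled state.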

            People

              hongchao.zhang Hongchao Zhang
              simmonsja James A Simmons