LU-5724

IR recovery doesn't behave properly with Lustre 2.5

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.3
    • Environment: MDS server running RHEL 6.5 with the ORNL 2.5.3 branch and about 12 patches.
    • Severity: 2
    • 16076

    Description

      Today we experienced a hardware failure with our MDS. The MDS rebooted and then came back, and we restarted the MDS, but IR (Imperative Recovery) behaved strangely. Four clients got evicted, but when the timer to completion got down to zero, IR restarted all over again. Then, once it got into the 700-second range, the timer started to go up. It did this a few times before finally letting the timer run out. Once the timer did reach zero, the recovery state was still reported as being in recovery. It remained this way for several more minutes before finally reaching a recovered state. In all it took 54 minutes to recover.
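      (How the recovery state described above is reported by the servers can be watched with lctl; a minimal sketch, assuming the filesystem name atlastds that appears in the logs later in this ticket, and noting that parameter paths can differ slightly between Lustre versions:)

          # On the MDS: recovery status of the MDT (status, time remaining, clients recovered/evicted)
          lctl get_param mdt.atlastds-MDT0000.recovery_status
          # On each OSS: recovery status of its OSTs
          lctl get_param obdfilter.atlastds-OST*.recovery_status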

      Attachments

        Issue Links

          Activity

            [LU-5724] IR recovery doesn't behave properly with Lustre 2.5

            simmonsja James A Simmons added a comment -

            Here are the kern logs for a client and a router. If you want the logs for all the clients, let me know.
            hongchao.zhang Hongchao Zhang added a comment - edited

            Is there only one Lustre client at 10.38.144.11 in this configuration? Are these logs from the same failover test described above?

            [ 2267.379541] Lustre: atlastds-MDT0000: Will be in recovery for at least 30:00, or until 1 client reconnects
            Dec 29 14:31:02 atlas-tds-mds1.ccs.ornl.gov kernel: [ 2267.409294] Lustre: atlastds-MDT0000: Denying connection for new client 3ae0ecec-84ef-cf8f-c128-51873c53d1ad (at 10.38.144.11@o2ib4), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 29:59
            Dec 29 14:31:08 atlas-tds-mds1.ccs.ornl.gov kernel: [ 2272.910080] Lustre: atlastds-MDT0000: Denying connection for new client 5116891d-0ace-dffd-7497-218db0b23e98 (at 10.38.144.11@o2ib4), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 29:54
            

            The MDT and OSSs are waiting for the client to reconnect for recovery, but it somehow failed to reconnect and appears to be connecting as a new Lustre client, which the MDS and OSSs denied because they were still recovering from the failover.

            Could you please attach the console and syslog of the client? Thanks!

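            (The *dump*.log files attached in the following comments are Lustre debug logs; a minimal sketch of how such logs are typically captured with lctl, assuming the default debug mask plus RPC tracing:)

                # Optionally widen the debug mask to include RPC tracing
                lctl set_param debug=+rpctrace
                # Dump the in-kernel Lustre debug buffer to a file ("dk" is short for debug_kernel)
                lctl dk /tmp/lustre-debug.log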

            simmonsja James A Simmons added a comment -

            Here you go. These are the logs from the clients and servers.

            simmonsja James A Simmons added a comment -

            The OSS reconnected to the MDS, but none of the clients ever reconnected. The clients appeared stuck. The client logs are from the client nodes we used. As for the configuration, the MGS is a standalone node and we tested with 4 nodes. Will grab the logs.

            hongchao.zhang Hongchao Zhang added a comment -

            As per the log "dump_atlas-tds-mds1-after-recovery.log", 3 out of 4 clients completed the recovery at the MDT:

            00010000:02000000:13.0:1419964653.561987:0:15786:0:(ldlm_lib.c:1392:target_finish_recovery()) atlastds-MDT0000: Recovery over after 30:00, of 4 clients 3 recovered and 1 was evicted.

            Which nodes does the client log "client-dump.log" cover? No eviction record was found in that log.

            By the way, did you use 4 clients and a separate MGS in this test? And could you please attach the console/sys logs along with those debug logs?

            Thanks!

            simmonsja James A Simmons added a comment -

            We did another test run of recovery for the case where both the MDS and an OSS fail. I collected logs and placed them at ftp.whamcloud.com/uploads/LU-5724/*.log. The OSSs seemed to recover, but the MDS did not recover properly.

            simmonsja James A Simmons added a comment -

            No. Only the MDS and OSS were restarted.

            Does "single server node" mean that the MGS was also restarted in the test?

            jay Jinshan Xiong (Inactive) added a comment - Does "single server node" mean that the MGS was also restarted in the test?

            simmonsja James A Simmons added a comment -

            Some more info from today's testing: the failure to recover occurred when both the MDS and an OSS were failed over. If we failed over just the MDS or just an OSS, recovery would complete. When we did the second round of testing with a single server node, we noticed that IR was reported as disabled even though we have no non-IR clients. We checked that on the MGS.
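            (The IR state checked on the MGS above is normally visible through lctl; a minimal sketch, assuming the filesystem name atlastds and that the first command is run on the MGS node:)

                # Imperative Recovery state as seen by the MGS (includes the non-IR client count)
                lctl get_param mgs.MGS.live.atlastds
                # IR state as seen by an individual client or server via its MGC
                lctl get_param mgc.*.ir_state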

            simmonsja James A Simmons added a comment -

            Today we tested the latest 2.5 Lustre code with the following patches:

            LU-793
            LU-3338
            LU-5485
            LU-5651
            LU-5740

            With 500 client nodes, recovery completely failed to complete. After an hour and 22 minutes we gave up and ended recovery. During recovery we lost an OSS node; I have attached the Lustre log it dumped. We also have a core from that OSS that I can post as well.

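            (Ending recovery by hand, as described above, is normally done by aborting recovery on the stuck target; a minimal sketch, assuming it is run on the MDS and using the MDT device name seen in the logs:)

                # Abort recovery on the MDT so it stops waiting for the remaining clients to replay
                lctl --device atlastds-MDT0000 abort_recovery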

            simmonsja James A Simmons added a comment -

            The cause of our recovery issues was three things: LU-5079, LU-5287, and lastly LU-5651. Of those, only LU-5651 is left to be merged to b2_5, so this ticket should remain open until that patch lands.

            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 16
