Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.7.0
Labels: None
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

After upgrading Lustre from 2.5.1 to 2.7.61 on snx11117, clients cannot be mounted (the client mount hangs) because recovery never ends:

[423225.578209] Lustre: snx11117-MDT0000: Denying connection for new client 3ec3f6c4-e172-39d7-383c-d4c19737f54c(at 10.9.100.9@o2ib3), waiting for 2 known clients (2 recovered, 0 in progress, and 0 evicted) to recover in 21188498:05
[423225.601237] Lustre: Skipped 41 previous similar messages

It seems that "LU-3540 lod: update recovery thread" broke the recovery_time_hard functionality.
check_for_recovery_ready causes an endless loop in target_recovery_overseer when the tdtd_replay_ready flag is not set:

static int check_for_recovery_ready(struct lu_target *lut)
...
        if (!obd->obd_abort_recovery && !obd->obd_recovery_expired) {
                LASSERT(clnts <= obd->obd_max_recoverable_clients);
                if (clnts + obd->obd_stale_clients <
                    obd->obd_max_recoverable_clients)
                        return 0;
        }

        if (lut->lut_tdtd != NULL) {
                if (!lut->lut_tdtd->tdtd_replay_ready) {
                        /* Let's extend recovery timer, in case the recovery
                         * timer expired, and some clients got evicted */
                        extend_recovery_timer(obd, obd->obd_recovery_timeout,
                                              true);
                        return 0;
                } else {
                        dtrq_list_dump(lut->lut_tdtd, D_HA);
                }    
        }

check_for_recovery_ready does not return 1 even though all clients have already connected and the recovery timer has expired:

00010000:00080000:0.0:1450170133.405945:0:243397:0:(ldlm_lib.c:2081:check_for_recovery_ready()) connected 2 stale 0 max_recoverable_clients 2 abort 0 expired 1

Because the tdtd_replay_ready flag is not set, check_for_recovery_ready returns 0 and tries to extend the recovery timer (without success):

00010000:00080000:0.0:1450170133.405947:0:243397:0:(ldlm_lib.c:1745:extend_recovery_timer()) snx11117-MDT0000: recovery timer will expire in 4294905278 seconds
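
The expiry value in this log line is just below 2^32, which would be consistent with a small negative number of seconds being printed as an unsigned 32-bit integer. This is only an interpretation of the number; the report does not state how the value is computed. A minimal sketch of the arithmetic, using hypothetical helper names that are not part of Lustre:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: how a signed number of seconds would look if it
 * were stored or printed as an unsigned 32-bit value. */
static uint32_t seconds_as_unsigned32(int32_t seconds_left)
{
	/* Conversion to an unsigned type is well defined: modulo 2^32. */
	return (uint32_t)seconds_left;
}

/* Hypothetical helper: distance of a logged value from 2^32. */
static uint64_t distance_below_2_pow_32(uint32_t logged)
{
	return (UINT64_C(1) << 32) - (uint64_t)logged;
}

/* Checks the arithmetic against the value reported in the log above. */
static void check_reported_value(void)
{
	const uint32_t logged = UINT32_C(4294905278);

	assert(distance_below_2_pow_32(logged) == 62018);
	assert(seconds_as_unsigned32(-62018) == logged);
}
```

If this reading applies, the timer would already be about 62018 seconds (roughly 17 hours) past its deadline rather than about 136 years in the future.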

IMO the lines below break the previous logic of target_recovery_overseer and recovery_time_hard:

        if (!obd->obd_abort_recovery && !obd->obd_recovery_expired) {
                LASSERT(clnts <= obd->obd_max_recoverable_clients);
                if (clnts + obd->obd_stale_clients <
                    obd->obd_max_recoverable_clients)
                        return 0;
        }

Compare this with check_for_clients, which was used before LU-3540:

static int check_for_clients(struct obd_device *obd)
{
       unsigned int clnts = atomic_read(&obd->obd_connected_clients);

       if (obd->obd_abort_recovery || obd->obd_recovery_expired)
               return 1;
       LASSERT(clnts <= obd->obd_max_recoverable_clients);
       return (clnts + obd->obd_stale_clients ==
               obd->obd_max_recoverable_clients);
}
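
The difference between the two checks can be illustrated with a small stand-alone model. This is only a sketch. `struct recovery_model` and the `model_*` functions are hypothetical simplifications, not the real Lustre structures or functions. The final `return 1` in the new check is an assumption, because the excerpt quoted in this report is truncated. The field values are taken from the check_for_recovery_ready debug line quoted above (connected 2, stale 0, max_recoverable_clients 2, abort 0, expired 1), with tdtd_replay_ready unset as described.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the obd_device and lu_target
 * fields referenced in this report; not the real Lustre structures. */
struct recovery_model {
	unsigned int connected_clients;
	unsigned int stale_clients;
	unsigned int max_recoverable_clients;
	bool abort_recovery;
	bool recovery_expired;
	bool has_tdtd;          /* models lut->lut_tdtd != NULL */
	bool tdtd_replay_ready; /* models lut->lut_tdtd->tdtd_replay_ready */
};

/* Models check_for_clients() as quoted above (before LU-3540). */
static int model_check_for_clients(const struct recovery_model *m)
{
	if (m->abort_recovery || m->recovery_expired)
		return 1;
	assert(m->connected_clients <= m->max_recoverable_clients);
	return m->connected_clients + m->stale_clients ==
	       m->max_recoverable_clients;
}

/* Models the quoted part of check_for_recovery_ready(). The real code
 * also calls extend_recovery_timer() before returning 0 in the second
 * branch; the final return value is an assumption. */
static int model_check_for_recovery_ready(const struct recovery_model *m)
{
	if (!m->abort_recovery && !m->recovery_expired) {
		assert(m->connected_clients <= m->max_recoverable_clients);
		if (m->connected_clients + m->stale_clients <
		    m->max_recoverable_clients)
			return 0;
	}

	if (m->has_tdtd && !m->tdtd_replay_ready)
		return 0; /* expiry is ignored here, so the overseer keeps waiting */

	return 1; /* assumed: the quoted excerpt ends before this point */
}

/* State reported in the debug log, with tdtd_replay_ready not set. */
static struct recovery_model reported_state(void)
{
	struct recovery_model m = {
		.connected_clients = 2,
		.stale_clients = 0,
		.max_recoverable_clients = 2,
		.abort_recovery = false,
		.recovery_expired = true,
		.has_tdtd = true,
		.tdtd_replay_ready = false,
	};
	return m;
}
```

With the reported state, the old check in this model returns 1 as soon as the timer has expired, while the new check returns 0 for as long as tdtd_replay_ready stays unset, regardless of expiry.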

People

Assignee: Di Wang (Inactive)

Reporter: Sergey Cheremencev

Votes: 0

Watchers: 5

Dates

Created: 21/Dec/15 6:54 PM

Updated: 11/May/18 6:27 PM

Resolved: 09/Sep/16 5:48 PM