[LU-1166] recovery never finished Created: 02/Mar/12  Updated: 18/Nov/16  Resolved: 29/Mar/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: Lustre 2.3.0, Lustre 2.1.2

Type: Bug Priority: Minor
Reporter: Alexey Lyashkov Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: None
Environment:

2.1.0 + with minimal back porting from 2.2


Issue Links:
Related
is related to LU-1522 ASSERTION(cfs_atomic_read(&obd->obd_r... Resolved
Severity: 3
Rank (Obsolete): 4669

 Description   

While testing, we hit a situation where recovery never finished and the recovery timer exceeded the hard recovery limit.

00010000:00080000:20.0:1330709620.108824:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 70 seconds
00010000:00000400:20.0:1330709690.108951:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
00010000:00080000:20.0:1330709690.120847:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 70 seconds
00010000:00000400:1.0:1330709760.120868:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
00010000:00080000:1.0:1330709760.132776:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 70 seconds
00010000:00000400:1.0:1330709830.131858:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
00010000:00080000:1.0:1330709830.143745:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 70 seconds
00010000:00000400:1.0:1330709900.142871:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
00010000:00080000:1.0:1330709900.154725:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 40 seconds
00010000:00000400:1.0:1330709940.153865:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
00010000:00080000:1.0:1330709940.165727:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 0 seconds
00010000:00000400:13.0:1330709940.165827:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
00010000:00080000:13.0:1330709940.177697:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 0 seconds
00010000:00000400:1.0:1330709940.178088:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
00010000:00080000:1.0:1330709940.189941:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 0 seconds
00010000:00000400:13.0:1330709940.190014:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
00010000:00080000:13.0:1330709940.201864:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 0 seconds
00010000:00000400:1.0:1330709940.202082:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
00010000:00080000:1.0:1330709940.213933:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 0 seconds
00010000:00000400:1.0:1330709940.214821:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports

...

After analyzing the logs, the hang appears to be in the wait inside target_recovery_overseer(), called with check_for_clients() as its argument.
The hang looks like a result of this check:
       if (obd->obd_no_conn == 0 &&
           obd->obd_connected_clients + obd->obd_stale_clients ==
           obd->obd_max_recoverable_clients)

In the MDT case,
obd_no_conn is set by post-recovery once at least one OST is connected and the config llog has been processed,
but mdt_postrecov can't be called because recovery hasn't finished.

The second issue in this area is the reset_recovery_timer() function.
If there is a race and reset_recovery_timer() is called at the moment recovery should finish, but before the timer fires, we set '0' (and a negative number on the next pass) as the next timer expiry.

00010000:00080000:1.0:1330709942.007696:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 4294967294 seconds
00010000:00080000:1.1:1330709942.007794:0:9:0:(ldlm_lib.c:1887:target_recovery_expired()) snxs4-MDT0000: recovery timed out; 36 clients are still in recovery after 902s (49 clients connected)
00010000:00000400:13.0:1330709942.007802:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
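
The value 4294967294 in the log above is 2^32 - 2, which is what -2 looks like when stored in an unsigned 32-bit integer. A minimal sketch of how that can happen, assuming the remaining time is computed as "deadline minus now" in unsigned arithmetic (the helper names below are hypothetical and are not taken from ldlm_lib.c):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: seconds left until the deadline, computed in
 * unsigned 32-bit arithmetic without checking that the deadline is
 * still in the future. */
static uint32_t timer_left_unchecked(uint32_t end, uint32_t now)
{
        return end - now;       /* wraps modulo 2^32 when now > end */
}

/* Hypothetical clamped variant: a deadline in the past yields 0
 * instead of a huge wrapped value. */
static uint32_t timer_left_clamped(uint32_t end, uint32_t now)
{
        return end > now ? end - now : 0;
}
```

With a deadline of 1330709940 and a current time of 1330709942 (two seconds later, as in the log timestamps), the unchecked helper returns 4294967294 while the clamped one returns 0.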



 Comments   
Comment by Mikhail Pershin [ 03/Mar/12 ]

obd_no_conn is set in mdt_obd_notify -> mdt_allow_cli; moreover, it would be useless to set it after recovery, since that would mean no client could ever participate in recovery. I don't think this is related to the issue.

About reset_recovery_timer(): can you show in detail how the value can become negative?

Also note that there were recovery timer changes after 2.1.0:

2012-02-16 Jinshan Xiong LU-889 recovery: rework extend_recovery_timer()
2011-11-03 Jinshan Xiong ORNL-28 recovery: rework extend_recovery_timer()
2011-10-21 Jinshan Xiong ORNL-28: Set recovery timeout correctly

Unfortunately, the first one broke the recovery timer, and it is restored only with the last one. Please check that these patches weren't ported to your 2.1.0 from 2.2 separately. It may make sense to port them all together; this function now differs from 2.1.0, so the issue is probably fixed already.

Comment by Alexey Lyashkov [ 03/Mar/12 ]

Hm.. you are right.
obd_no_conn == 0.
I found a crash dump from a similar situation.
$7 = {
obd_type = 0xffff88043e9d4740,
obd_magic = 2874988271,
obd_name = "snxs4-MDT0000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\00
0\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\
000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
obd_uuid = { uuid = "snxs4-MDT0000_UUID\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000" },
obd_lu_dev = 0xffff880441754000,
obd_minor = 2,
obd_attached = 1,
obd_set_up = 1,
obd_recovering = 1,
obd_abort_recovery = 1,
obd_version_recov = 1,
obd_replayable = 1,
obd_no_transno = 0,
obd_no_recov = 0,
obd_stopping = 0,
obd_starting = 1,
obd_force = 0,
obd_fail = 0,
obd_async_recov = 0,
obd_no_conn = 0,
obd_inactive = 0,
obd_process_conf = 0,
obd_recovery_expired = 0
...
obd_refcount = { counter = 4171 },
...

obd_unlinked_exports = { next = 0xffff88042b7fd4a0, prev = 0xffff88042b7fd4a0 },
obd_delayed_exports = { next = 0xffff88045033c1a8, prev = 0xffff88045033c1a8 },
obd_num_exports = 2072,
..
obd_exports_timed = { next = 0xffff88042b98e4d0, prev = 0xffff88042b6f68d0 },
obd_eviction_timer = 0,
obd_max_recoverable_clients = 4471,
obd_connected_clients = 2095,
obd_stale_clients = 2400,
obd_delayed_clients = 0,
...

obd_recovery_start = 1330601299,
obd_recovery_end = 1330602199,
obd_recovery_time_hard = 900,
obd_recovery_timeout = 900,
obd_recovery_data = {
  trd_recovery_handler = 0xffffffffa09a18b0 <mdt_recovery_handle>,
  trd_processing_task = 15692,
  trd_starting = {
    done = 0,
    wait = {
      lock = {
        raw_lock = { slock = 196611 }
      },
      task_list = { next = 0xffff88045033c510, prev = 0xffff88045033c510 }
    }
  },
  trd_finishing = {
    done = 0,
    wait = {
      lock = {
        raw_lock = { slock = 65537 }
      },
      task_list = { next = 0xffff88042b94dbc0, prev = 0xffff88042b94dbc0 }
    }
  }
},
obd_replayed_locks = 0,
obd_req_replay_clients = { counter = 24 },
obd_lock_replay_clients = { counter = 24 },

Comment by Mikhail Pershin [ 03/Mar/12 ]

After discussion with Alex, it is clear that the endless recovery cycle is caused by wrong counts of stale and connected clients, so the check below is never true:

check_for_clients(struct obd_device *obd)
{
        unsigned int clnts = cfs_atomic_read(&obd->obd_connected_clients);

        if (obd->obd_abort_recovery || obd->obd_recovery_expired)
                return 1;

        LASSERT(clnts <= obd->obd_max_recoverable_clients);
--->    return (clnts + obd->obd_stale_clients ==
                obd->obd_max_recoverable_clients);
}

I suspect this is caused by class_disconnect_stale_exports(), which removes stale clients from the obd_exports list while they are still in the hash, so a connecting client can find an export from the stale list and connect to it. That export will then be 'connected' and 'stale' at the same time. A solution could be to remove the export from the hash along with its removal from the obd_exports list, but closer investigation is needed to check that there are no other races.
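
To see why the equality can never hold, the counters from the crash dump in the earlier comment can be plugged into a simplified version of the check. This is only a sketch: the struct below is a hypothetical stand-in holding just the three counters, and the abort/expired shortcuts of check_for_clients() are left out.

```c
#include <assert.h>

/* Hypothetical stand-in for the three counters read by
 * check_for_clients(); the real struct obd_device has many more fields. */
struct recovery_counters {
        unsigned int max_recoverable_clients;
        unsigned int connected_clients;
        unsigned int stale_clients;
};

/* Equality part of the check only. */
static int all_clients_accounted(const struct recovery_counters *c)
{
        return c->connected_clients + c->stale_clients ==
               c->max_recoverable_clients;
}

/* Amount by which connected + stale exceeds the maximum (0 if it does not). */
static unsigned int overcount(const struct recovery_counters *c)
{
        unsigned int sum = c->connected_clients + c->stale_clients;

        return sum > c->max_recoverable_clients ?
               sum - c->max_recoverable_clients : 0;
}
```

With the dump values (4471 recoverable, 2095 connected, 2400 stale) the sum is 4495, so the check fails and the excess is 24. That happens to equal the obd_req_replay_clients and obd_lock_replay_clients counters in the same dump, which may or may not be a coincidence.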

Comment by Alexey Lyashkov [ 03/Mar/12 ]

After additional discussion with Mike, we have a verdict: this is a race between class_disconnect and target_handle_connect.

int class_disconnect(struct obd_export *export)
{
        int already_disconnected;
        ENTRY;

        if (export == NULL) {
                fixme();
                CDEBUG(D_IOCTL, "attempting to free NULL export %p\n", export);
                RETURN(-EINVAL);
        }

        cfs_spin_lock(&export->exp_lock);
        already_disconnected = export->exp_disconnected;
        export->exp_disconnected = 1;
        cfs_spin_unlock(&export->exp_lock);

        /* class_cleanup(), abort_recovery(), and class_fail_export()
         * all end up in here, and if any of them race we shouldn't
         * call extra class_export_puts(). */
        if (already_disconnected) {
                LASSERT(cfs_hlist_unhashed(&export->exp_nid_hash));
                GOTO(no_disconn, already_disconnected);
        }

        CDEBUG(D_IOCTL, "disconnect: cookie "LPX64"\n",
               export->exp_handle.h_cookie);

        if (!cfs_hlist_unhashed(&export->exp_nid_hash))
                cfs_hash_del(export->exp_obd->obd_nid_hash,
                             &export->exp_connection->c_peer.nid,
                             &export->exp_nid_hash);

        class_export_recovery_cleanup(export);
<<<
wait here
>>>
        class_unlink_export(export);
no_disconn:
        class_export_put(export);
        RETURN(0);
}

If target_handle_connect races with class_export_recovery_cleanup while waiting on
obd_recovery_task_lock, we will count one export twice.
We will leak obd_connected_clients and the other recovery counters...

 
  obd_req_replay_clients = {
    counter = 24
  }, 
  obd_lock_replay_clients = {
    counter = 24
  }, 
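
The double counting can be illustrated with a toy, single-threaded model that simply replays the problematic ordering. This is a heavily simplified sketch, not the Lustre code: the structs and both functions are hypothetical, and the real paths take obd_recovery_task_lock and update more counters.

```c
#include <assert.h>

/* Hypothetical, reduced views of the target and export state. */
struct target {
        unsigned int max_recoverable;
        unsigned int connected;
        unsigned int stale;
};

struct export {
        int in_recovery;
        int disconnected;
};

/* Eviction side: the export is marked disconnected and counted as stale. */
static void evict_stale(struct target *t, struct export *e)
{
        e->disconnected = 1;
        t->stale++;
}

/* Connect side, with no check of the disconnected flag: the same export
 * is found again and counted as connected. */
static void handle_connect_unguarded(struct target *t, struct export *e)
{
        if (!e->in_recovery) {
                e->in_recovery = 1;
                t->connected++;
        }
}
```

Running evict_stale() and then handle_connect_unguarded() on the same export of a target with one recoverable client leaves connected + stale at 2, so a check for connected + stale == max_recoverable can no longer succeed.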
Comment by Alexey Lyashkov [ 05/Mar/12 ]

remote: New Changes:
remote: http://review.whamcloud.com/2255

Comment by Build Master (Inactive) [ 29/Mar/12 ]

Integrated in lustre-master » i686,client,el5,ofa #531
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = FAILURE
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/obdclass/genops.c
  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Mar/12 ]

Integrated in lustre-master » i686,server,el6,inkernel #531
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = SUCCESS
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/obdclass/genops.c
  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Mar/12 ]

Integrated in lustre-master » x86_64,client,el6,ofa #531
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = FAILURE
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/obdclass/genops.c
  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Mar/12 ]

Integrated in lustre-master » i686,client,el6,inkernel #531
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = SUCCESS
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/ldlm/ldlm_lib.c
  • lustre/obdclass/genops.c
Comment by Build Master (Inactive) [ 29/Mar/12 ]

Integrated in lustre-master » i686,server,el5,ofa #531
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = FAILURE
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/ldlm/ldlm_lib.c
  • lustre/obdclass/genops.c
Comment by Build Master (Inactive) [ 29/Mar/12 ]

Integrated in lustre-master » i686,client,el6,ofa #531
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = FAILURE
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/ldlm/ldlm_lib.c
  • lustre/obdclass/genops.c
Comment by Build Master (Inactive) [ 29/Mar/12 ]

Integrated in lustre-master » i686,server,el6,ofa #531
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = FAILURE
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/ldlm/ldlm_lib.c
  • lustre/obdclass/genops.c
Comment by Build Master (Inactive) [ 29/Mar/12 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #531
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = SUCCESS
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/obdclass/genops.c
  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Mar/12 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #531
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = SUCCESS
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/obdclass/genops.c
  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Mar/12 ]

Integrated in lustre-master » i686,client,el5,inkernel #531
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = SUCCESS
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/ldlm/ldlm_lib.c
  • lustre/obdclass/genops.c
Comment by Build Master (Inactive) [ 29/Mar/12 ]

Integrated in lustre-master » i686,server,el5,inkernel #531
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = SUCCESS
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/obdclass/genops.c
  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Mar/12 ]

Integrated in lustre-master » x86_64,server,el6,ofa #531
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = FAILURE
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/obdclass/genops.c
  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Mar/12 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #531
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = SUCCESS
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/ldlm/ldlm_lib.c
  • lustre/obdclass/genops.c
Comment by Build Master (Inactive) [ 29/Mar/12 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #531
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = SUCCESS
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/obdclass/genops.c
  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Mar/12 ]

Integrated in lustre-master » x86_64,server,el5,ofa #531
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = FAILURE
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/obdclass/genops.c
  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Mar/12 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #531
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = SUCCESS
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/ldlm/ldlm_lib.c
  • lustre/obdclass/genops.c
Comment by Build Master (Inactive) [ 29/Mar/12 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #531
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = SUCCESS
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/obdclass/genops.c
  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 29/Mar/12 ]

Integrated in lustre-master » x86_64,client,el5,ofa #531
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = FAILURE
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/ldlm/ldlm_lib.c
  • lustre/obdclass/genops.c
Comment by Peter Jones [ 29/Mar/12 ]

Landed for 2.3

Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » x86_64,client,el5,inkernel #340
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = SUCCESS
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/obdclass/genops.c
  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » i686,client,el6,inkernel #340
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = SUCCESS
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/ldlm/ldlm_lib.c
  • lustre/obdclass/genops.c
Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » i686,server,el5,inkernel #340
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = SUCCESS
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/obdclass/genops.c
  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » x86_64,server,el6,inkernel #340
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = SUCCESS
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/obdclass/genops.c
  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » i686,client,el5,inkernel #340
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = SUCCESS
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/obdclass/genops.c
  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » x86_64,server,el5,inkernel #340
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = SUCCESS
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/ldlm/ldlm_lib.c
  • lustre/obdclass/genops.c
Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » x86_64,client,el6,inkernel #340
LU-1166 recovery: don't leak a connected client counter. (Revision 737da0331e8407a704cd11c04f18c2cd3d437800)

Result = SUCCESS
Oleg Drokin : 737da0331e8407a704cd11c04f18c2cd3d437800
Files :

  • lustre/obdclass/genops.c
  • lustre/ldlm/ldlm_lib.c
Comment by Bob Glossman (Inactive) [ 07/May/12 ]

http://review.whamcloud.com/#change,2665
back port to b2_1

Comment by Bob Glossman (Inactive) [ 22/May/12 ]

http://review.whamcloud.com/#change,2874
more changes missing from previous back port

Comment by Gregoire Pichon [ 12/Jul/12 ]

Hello Bob,

I don't fully understand the portions of code impacted by this ticket, but could you explain why the additional change http://review.whamcloud.com/#change,2874 you submitted in b2_1 is not present in the master release?

diff --git a/lustre/ldlm/ldlm_lib.c b/lustre/ldlm/ldlm_lib.c
index a8433fc..0798ba7 100644
--- a/lustre/ldlm/ldlm_lib.c
+++ b/lustre/ldlm/ldlm_lib.c
@@ -1069,7 +1069,8 @@ dont_check_exports:
           class_disconnect->class_export_recovery_cleanup() race
          */
         cfs_spin_lock(&target->obd_recovery_task_lock);
-        if (target->obd_recovering && !export->exp_in_recovery) {
+        if (target->obd_recovering && !export->exp_in_recovery &&
+            !export->exp_disconnected) {
                 cfs_spin_lock(&export->exp_lock);
                 export->exp_in_recovery = 1;
                 export->exp_req_replay_needed = 1;

thanks.
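
For readers comparing the two conditions in the diff above, they can be written as plain predicates over the three flags. This is a sketch only; the function names are hypothetical and the flags are passed as booleans rather than read from the obd_device and obd_export structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Condition on the removed ('-') line of the diff. */
static bool enters_recovery_old(bool recovering, bool in_recovery,
                                bool disconnected)
{
        (void)disconnected;     /* not consulted by the old condition */
        return recovering && !in_recovery;
}

/* Condition on the added ('+') lines of the diff. */
static bool enters_recovery_new(bool recovering, bool in_recovery,
                                bool disconnected)
{
        return recovering && !in_recovery && !disconnected;
}
```

The two predicates differ only for an export that is already disconnected but not yet marked as in recovery: the old condition lets it enter the recovery accounting, the new one does not.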

Comment by Nathan Rutman [ 21/Nov/12 ]

Xyratex-bug-id: MRP-451
