[LU-2368] OSTs stuck in perpetual recovery Created: 21/Nov/12  Updated: 22/May/13  Resolved: 22/May/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: Lustre 2.1.6

Type: Bug Priority: Major
Reporter: Alexander Boyko Assignee: Keith Mannthey (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Severity: 3
Rank (Obsolete): 5631

 Description   

An MDS failover happened during OST recovery, and the OST received two MDS connections from different IPs. The first was processed by the OST; the second connection caused class_fail_export() in target_handle_connect(), and we got perpetual recovery.

Oct 26 17:09:07 snx11001n008 kernel: [  838.638847] Lustre: 90700:0:(ldlm_lib.c:2007:target_recovery_init()) RECOVERY: service snx11001-OST0012, 3 recoverable clients, last_transno 54526017
Oct 26 17:09:07 snx11001n008 kernel: [  838.708354] Lustre: snx11001-OST0012: Now serving snx11001-OST0012/ on /dev/md4 with recovery enabled
Oct 26 17:09:07 snx11001n008 kernel: [  838.717732] Lustre: snx11001-OST0012: Will be in recovery for at least 15:00, or until 3 clients reconnect
Oct 26 17:11:05 snx11001n008 kernel: [  956.648093] LustreError: 88011:0:(ldlm_lib.c:927:target_handle_connect()) snx11001-OST0012: NID 10.10.101.3@o2ib1 (snx11001-MDT0000-mdtlov_UUID) reconnected with 1 conn_cnt; cookies not random?
Oct 26 17:15:10 snx11001n008 kernel: [ 1201.718217] Lustre: 88009:0:(ldlm_lib.c:941:target_handle_connect()) snx11001-OST0012: connection from snx11001-MDT0000-mdtlov_UUID@10.10.101.3@o2ib1 recovering/t0 exp ffff88072cb90400 cur 1351289710 last 1351289346
Oct 26 17:18:40 snx11001n008 kernel: [ 1410.931800] Lustre: 88010:0:(ldlm_lib.c:854:target_handle_connect()) snx11001-OST0012: received MDS connection from NID 10.10.101.4@o2ib1, removing former export from NID 10.10.101.3@o2ib1
Oct 26 17:18:40 snx11001n008 kernel: [ 1410.948937] Lustre: 88010:0:(ldlm_lib.c:941:target_handle_connect()) snx11001-OST0012: connection from snx11001-MDT0000-mdtlov_UUID@10.10.101.4@o2ib1 recovering/t0 exp (null) cur 1351289920 last 0
Oct 26 17:18:40 snx11001n008 kernel: [ 1410.976334] LustreError: 88010:0:(ldlm_lib.c:974:target_handle_connect()) snx11001-OST0012: denying connection for new client 10.10.101.4@o2ib1 (snx11001-MDT0000-mdtlov_UUID): 0 clients in recovery for 381s
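
For illustration, below is a minimal standalone sketch, not the actual ldlm_lib.c code, of the connection handling seen in the log: when the failed-over MDS reconnects from a new NID, target_handle_connect() drops the former export via class_fail_export(), and the retried connection then looks like a brand-new client and is denied while recovery is still running. The types and the handle_mds_connect() helper are invented for the example; only target_handle_connect() and class_fail_export() come from the ticket.

	/* Simplified standalone model (not the actual ldlm_lib.c code) of the
	 * connection handling seen in the log above. */
	#include <stdbool.h>
	#include <stdio.h>
	#include <string.h>

	struct export { char nid[32]; bool valid; };
	struct target {
		bool          recovering;
		struct export mds_export;   /* export for the MDS connection */
	};

	/* Roughly models the branches logged at ldlm_lib.c:854/974 above. */
	static const char *handle_mds_connect(struct target *tgt, const char *nid)
	{
		if (tgt->mds_export.valid && strcmp(tgt->mds_export.nid, nid) != 0) {
			/* "received MDS connection from NID ..., removing former
			 * export" -- this is the class_fail_export() path. */
			tgt->mds_export.valid = false;
			return "removed former export; MDS must reconnect";
		}
		if (!tgt->mds_export.valid && tgt->recovering) {
			/* "denying connection for new client ...: N clients in
			 * recovery" -- nothing ever marks the failed export stale,
			 * so recovery never finishes and this repeats forever. */
			return "denied: new client during recovery";
		}
		snprintf(tgt->mds_export.nid, sizeof(tgt->mds_export.nid), "%s", nid);
		tgt->mds_export.valid = true;
		return "connected";
	}

	int main(void)
	{
		struct target ost = { .recovering = true,
				      .mds_export = { "10.10.101.3@o2ib1", true } };

		puts(handle_mds_connect(&ost, "10.10.101.4@o2ib1")); /* removes old export */
		puts(handle_mds_connect(&ost, "10.10.101.4@o2ib1")); /* denied, forever    */
		return 0;
	}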


 Comments   
Comment by Alexander Boyko [ 21/Nov/12 ]

http://review.whamcloud.com/4641

Comment by Mikhail Pershin [ 21/Nov/12 ]

Alexander, could you give more info about how that causes perpetual recovery? Logs showing it would be good. I wonder, isn't this just the same as LU-2104?

Comment by Alexander Boyko [ 21/Nov/12 ]

I can see the recovery timer being repeatedly reset ("recovery is timed out, evict stale exports"), so this is not related to LU-2104.

Oct 26 17:18:40 snx11001n008 kernel: [ 1410.976334] LustreError: 88010:0:(ldlm_lib.c:974:target_handle_connect()) snx11001-OST0012: denying connection for new client 10.10.101.4@o2ib1 (snx11001-MDT0000-mdtlov_UUID): 0 clients in recovery for 381s
Oct 26 17:21:31 snx11001n008 kernel: [ 1582.137247] LustreError: 88010:0:(ldlm_lib.c:974:target_handle_connect()) snx11001-OST0012: denying connection for new client 10.10.101.4@o2ib1 (snx11001-MDT0000-mdtlov_UUID): 0 clients in recovery for 209s
Oct 26 17:24:01 snx11001n008 kernel: [ 1731.873364] Lustre: 88010:0:(ldlm_lib.c:941:target_handle_connect()) snx11001-OST0012: connection from snx11001-MDT0000-mdtlov_UUID@10.10.101.4@o2ib1 recovering/t0 exp (null) cur 1351290241 last 0
Oct 26 17:25:01 snx11001n008 kernel: [ 1791.526174] Lustre: snx11001-OST0012: disconnecting 1 stale clients
Oct 26 17:25:16 snx11001n008 kernel: [ 1806.790647] LustreError: 88010:0:(ldlm_lib.c:974:target_handle_connect()) snx11001-OST0012: denying connection for new client 10.10.101.4@o2ib1 (snx11001-MDT0000-mdtlov_UUID): 0 clients in recovery for 54s
Oct 26 17:26:11 snx11001n008 kernel: [ 1861.406988] Lustre: snx11001-OST0012: recovery is timed out, evict stale exports
Oct 26 17:27:21 snx11001n008 kernel: [ 1931.296723] Lustre: snx11001-OST0012: recovery is timed out, evict stale exports
Oct 26 17:28:31 snx11001n008 kernel: [ 2001.192849] Lustre: snx11001-OST0012: recovery is timed out, evict stale exports
Oct 26 17:30:16 snx11001n008 kernel: [ 2106.354440] LustreError: 88010:0:(ldlm_lib.c:974:target_handle_connect()) snx11001-OST0012: denying connection for new client 10.10.101.4@o2ib1 (snx11001-MDT0000-mdtlov_UUID): 0 clients in recovery for 34s
Oct 26 17:32:01 snx11001n008 kernel: [ 2210.862684] Lustre: snx11001-OST0012: recovery is timed out, evict stale exports
Oct 26 17:39:01 snx11001n008 kernel: [ 2630.201143] Lustre: snx11001-OST0012: recovery is timed out, evict stale exports
Oct 26 17:47:46 snx11001n008 kernel: [ 3154.655964] LustreError: 88010:0:(ldlm_lib.c:974:target_handle_connect()) snx11001-OST0012: denying connection for new client 10.10.101.4@o2ib1 (snx11001-MDT0000-mdtlov_UUID): 0 clients in recovery for 34s
Comment by Mikhail Pershin [ 21/Nov/12 ]

But does recovery still never end? Or does it just last too long?

Comment by Nathan Rutman [ 21/Nov/12 ]

Xyratex-bug-id: MRP-738

Comment by Mikhail Pershin [ 22/Nov/12 ]

I cannot access the Xyratex site to check the bug details there.

Comment by Alexander Boyko [ 23/Nov/12 ]

> But does recovery still never end? Or does it just last too long?
It never ends.
Above, you can see how the timer kept being restarted over 30 minutes.

Comment by Mikhail Pershin [ 23/Nov/12 ]

Strictly speaking, the situation with two MDS connections is nothing special: the old one is evicted through class_fail_export(), and the second is not allowed until recovery is finished. So the real problem is why recovery cannot finish, and the reason is that class_fail_export() call. During recovery all evicted/failed clients are counted in obd_stale_clients, and an inconsistent counter can leave recovery stuck. I think this patch should solve your problem:

	/* if called during recovery then should keep obd_stale_clients
	 * consistent */
	if (exp->exp_obd->obd_recovering)
		exp->exp_obd->obd_stale_clients++;

I'd prefer this solution because it fixes the source of the problem. Your patch is correct too, but it covers only the case with that one particular call to class_fail_export(). I tried to simulate a similar situation on the master branch and the patch above works; I'd appreciate it if you could check whether it helps in your case. If that is not easy to do, then I will agree with your patch for b2_1.

LU-2104 is actually the same problem, but there is also a miscalculation in the recovery timer reset which stops the timer entirely, so recovery exceeds the timeout without being woken up. The root cause is the same, and master will be fixed with the LU-2104 patch anyway.
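
For illustration, here is a minimal standalone sketch of the accounting described above, under a simplified completion check: the counter names mirror the obd_device fields mentioned in this ticket, but the fail_export()/recovery_complete() helpers are invented for the example. Recovery can end only once every recoverable client is accounted for as either reconnected or stale, so failing an export during recovery without bumping obd_stale_clients leaves the sum short forever.

	/* Minimal standalone sketch (not actual Lustre code) of the recovery
	 * accounting; completion check simplified for illustration. */
	#include <stdbool.h>
	#include <stdio.h>

	struct obd_device {
		bool obd_recovering;
		int  obd_max_recoverable_clients; /* "3 recoverable clients" in the log */
		int  obd_connected_clients;
		int  obd_stale_clients;
	};

	/* Stand-in for class_fail_export(); 'with_fix' applies the quoted hunk. */
	static void fail_export(struct obd_device *obd, bool with_fix)
	{
		obd->obd_connected_clients--;
		if (with_fix && obd->obd_recovering)
			obd->obd_stale_clients++;
	}

	static bool recovery_complete(const struct obd_device *obd)
	{
		return obd->obd_connected_clients + obd->obd_stale_clients >=
		       obd->obd_max_recoverable_clients;
	}

	int main(void)
	{
		struct obd_device ost = { .obd_recovering = true,
					  .obd_max_recoverable_clients = 3,
					  .obd_connected_clients = 3 };

		fail_export(&ost, false);                 /* second MDS connection arrives */
		printf("without fix: complete=%d\n", recovery_complete(&ost)); /* 0: stuck */

		ost = (struct obd_device){ .obd_recovering = true,
					   .obd_max_recoverable_clients = 3,
					   .obd_connected_clients = 3 };
		fail_export(&ost, true);
		printf("with fix:    complete=%d\n", recovery_complete(&ost)); /* 1 */
		return 0;
	}

With the quoted hunk applied, the failed MDS export is counted as stale, the completion condition is eventually met, and the recovery timer stops being re-armed.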

Comment by Alexander Boyko [ 25/Nov/12 ]

Thanks Mikhail, I will try it.

Comment by Alexander Boyko [ 25/Nov/12 ]

I confirm that the patch

	/* if called during recovery then should keep obd_stale_clients
	 * consistent */
	if (exp->exp_obd->obd_recovering)
		exp->exp_obd->obd_stale_clients++;

fixes this issue, and is better than http://review.whamcloud.com/4641.

Mikhail, do you plan to land the LU-2104 patch to the b2_1 branch, or just the small change?

Comment by Mikhail Pershin [ 25/Nov/12 ]

I had no such plan so far. I think the code you checked is enough for b2_1; I'd use just that fix, as it is sufficient. Can you push it for b2_1 in the context of this ticket?

Comment by Alexander Boyko [ 26/Nov/12 ]

Sure, I have replaced my previous patch with this one.

Comment by Keith Mannthey (Inactive) [ 22/May/13 ]

Both patches were landed for 2.1.
