[LU-14930] Server has no clients to recover after failover Created: 11/Aug/21  Updated: 22/Aug/22  Resolved: 25/Aug/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Major
Reporter: Mikhail Pershin Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-12546 add option to abort recovery between ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Several reports describe an INACTIVE recovery_status on MDTs and OSTs after failover. This status is reported when the server finds no clients to recover in the last_rcvd file, but in the reported cases clients were definitely connected and the servers were under load.



 Comments   
Comment by Gerrit Updater [ 11/Aug/21 ]

"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/44610
Subject: LU-14930 mdt: abort_recov_mdt shouldn't abort client recovery
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 905a73042510887855ae709d2a04c0b001bd2333

Comment by Mikhail Pershin [ 11/Aug/21 ]

I am not sure this is the real cause of the issue, but I found it during the investigation and think it might be related. At least it could explain why the server has no more clients on storage.

Comment by Etienne Aujames [ 25/Aug/21 ]

Hello,

I have observed that "umount -f <target>" removes clients from last_rcvd.
So at the next "mount", the status INACTIVE is displayed (recovery is disabled and errors will be reported to user applications).

Not sure if it is related, but it could be good to know.

Comment by Mikhail Pershin [ 25/Aug/21 ]

Yes, that is exactly the reason for the INACTIVE recovery_status. It does not mean that the target is 'inactive' but that recovery is 'inactive', which can happen in two cases: either recovery has not started yet, i.e. the server has not read the client data from storage, or it was never started because the server has no clients in last_rcvd.
In the reports mentioned in the description, the servers were unmounted with the '-f' option, so that explains what we have seen. It also means the patch is not quite related, though it still fixes an existing problem.
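
For anyone scripting around this, the reading of INACTIVE given above can be sketched as a small Python helper. This is only an illustration: it assumes the recovery_status text consists of "key: value" lines with a "status" field, the function names are hypothetical, and the sample input is made up rather than captured from a real server.

```python
def parse_recovery_status(text):
    """Parse 'key: value' lines into a dict (assumed layout, see note above)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no 'key: value' separator
            fields[key.strip()] = value.strip()
    return fields


def explain_status(fields):
    """Return a human-readable reading of the 'status' field."""
    status = fields.get("status", "")
    if status == "INACTIVE":
        # INACTIVE describes recovery, not the target itself.
        return ("recovery inactive: either not started yet, or the server "
                "found no clients in last_rcvd (for example after 'umount -f')")
    return "recovery status: " + (status or "unknown")


# Illustrative input only, not a captured log.
SAMPLE = "status: INACTIVE\n"
print(explain_status(parse_recovery_status(SAMPLE)))
```

The helper deliberately does not try to tell the two INACTIVE cases apart, since the status text alone does not say which one applies.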

Comment by Gerrit Updater [ 25/Aug/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44610/
Subject: LU-14930 mdt: abort_recov_mdt shouldn't abort client recovery
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6fd75f264c5f5c186bbfe559e1a98fb3769d8128

Comment by Peter Jones [ 25/Aug/21 ]

Landed for 2.15

Comment by Gerrit Updater [ 14/Oct/21 ]

"Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/45245
Subject: LU-14930 mdt: abort_recov_mdt shouldn't abort client recovery
Project: fs/lustre-release
Branch: b2_14
Current Patch Set: 1
Commit: b4f396a1efe57029160d13e2419d8730925f3438

Comment by Gerrit Updater [ 22/Aug/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48285
Subject: LU-14930 mdt: abort_recov_mdt shouldn't abort client recovery
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: f69cc05fca93db87d5c69cfe6b86aae0fe1ce6a2

Generated at Sat Feb 10 03:13:59 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.