[LU-8649] Print console message in recovery when waiting for first client Created: 29/Sep/16  Updated: 19/Mar/18  Resolved: 09/Jan/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0, Lustre 2.10.4

Type: Improvement Priority: Major
Reporter: Andreas Dilger Assignee: Emoly Liu
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related

 Description   

When the server is in recovery but waiting for the first client to connect, it would be useful to print a message on the console every 10 minutes or so, in case the admins are waiting for recovery to complete. This has been reported a few times in the past, and it isn't obvious that recovery will not start until the first client connects.

It would also be useful to print something similar in the recovery_status file in /proc.
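
The timing described above can be sketched as a small rate limiter. This is only an illustration of the "every 10 minutes or so" behaviour, not the server code: the class name, the message text, and the injectable clock are all hypothetical, and the real change would live in the kernel recovery path.

```python
import time

# "every 10 minutes or so" from the description; the exact value is an assumption.
WAIT_MESSAGE_INTERVAL = 600  # seconds


class FirstClientReminder:
    """Rate-limits a 'waiting for first client' console message (hypothetical helper)."""

    def __init__(self, interval=WAIT_MESSAGE_INTERVAL, clock=time.monotonic):
        self.interval = interval
        self.clock = clock          # injectable so the logic can be tested without sleeping
        self.last_printed = None    # no message has been printed yet

    def poll(self, connected_clients):
        """Return a message if one is due, otherwise None."""
        if connected_clients > 0:
            return None  # a client has connected, so there is nothing to remind about
        now = self.clock()
        if self.last_printed is not None and now - self.last_printed < self.interval:
            return None  # the last reminder is still recent enough
        self.last_printed = now
        return "in recovery but waiting for the first client to connect"
```

Calling `poll(0)` repeatedly returns the message on the first call and then at most once per interval, and returns None as soon as the connected client count is non-zero.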



 Comments   
Comment by Andreas Dilger [ 17/Feb/17 ]

It would also be useful to improve the recovery_status file by adding WAITING_FOR_CLIENTS when recovery hasn't started yet, which would make it clearer to the administrator that recovery won't start until the clients connect. The difficulty is that I don't think we can replace RECOVERING completely, since userspace scripts may be checking for it.
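
To illustrate the compatibility concern, here is a minimal sketch of the kind of userspace check that might exist today, extended so it keeps working if a new status is added. WAITING_FOR_CLIENTS is only the value proposed in this comment; the string the final patch prints is not stated in this ticket, and the function name is hypothetical.

```python
# RECOVERING is the status shown in recovery_status today.
# WAITING_FOR_CLIENTS is the value proposed above, not a confirmed status string.
IN_RECOVERY_STATUSES = {"RECOVERING", "WAITING_FOR_CLIENTS"}


def is_in_recovery(recovery_status_text):
    """Return True if the 'status:' line reports any in-recovery state."""
    for line in recovery_status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "status":
            return value.strip() in IN_RECOVERY_STATUSES
    return False  # no status line found
```

A script that compares the status only against the literal RECOVERING would treat a new value as "not recovering", which is why replacing the existing string outright is risky.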

Comment by Andreas Dilger [ 23/Feb/17 ]

jgarlough wrote:

Another confusing term in recovery_status is INACTIVE. It doesn't remind me that this is normal and that it's really a type of COMPLETE where nothing needed to be done.

Adding something like "NOT_NEEDED", "UNNECESSARY", or "CLEAN_STARTUP" would make this clearer.

Comment by Andreas Dilger [ 14/May/17 ]

Another user hit this problem (from lustre-discuss):

I've updated Lustre on a system with one MDT server and one OST server.
The OST server went quickly from the "recovery" state to "running".
However, the MDT server has stayed in the recovery state for more than 12
hours. I can't see any error message, so I think it is just waiting.
My question is: is there a way to know what the MDT is doing? And how
much time does it need to get to running?

The file /proc/fs/lustre/mdt/scratch-MDT0000/recovery_status shows me this information:

status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/1
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 0
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 115964117881
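
Until the server reports this state itself, an admin could spot it from the fields above. The following is a rough sketch: the function names are hypothetical, and the rule "RECOVERING with recovery_start 0 and no connected clients means waiting for the first client" is an assumption drawn from this report, not a documented guarantee.

```python
SAMPLE = """\
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/1
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 0
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 115964117881
"""


def parse_recovery_status(text):
    """Parse 'key: value' lines into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def waiting_for_first_client(fields):
    """Heuristic (assumption): recovery is pending but no client has connected yet."""
    connected = fields.get("connected_clients", "").split("/")[0]
    return (
        fields.get("status") == "RECOVERING"
        and fields.get("recovery_start") == "0"
        and connected == "0"
    )
```

With the output quoted above, `waiting_for_first_client(parse_recovery_status(SAMPLE))` is true, which matches the situation in the report: the MDT is not stuck, it is waiting for a client to connect.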

Comment by Peter Jones [ 14/Dec/17 ]

Emoly

Could you please look into this one?

Thanks

Peter

Comment by Gerrit Updater [ 25/Dec/17 ]

Emoly Liu (emoly.liu@intel.com) uploaded a new patch: https://review.whamcloud.com/30656
Subject: LU-8649 recovery: print some useful messages in recovery
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ac49caadd328cfd345b00e1553882c7fa3b1a404

Comment by Gerrit Updater [ 09/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30656/
Subject: LU-8649 recovery: print some useful messages in recovery
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 614e0def46be607cc53bbec9624d5b039068ba7c

Comment by Minh Diep [ 09/Jan/18 ]

Landed for 2.11

Comment by Gerrit Updater [ 09/Jan/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/30811
Subject: LU-8649 recovery: print some useful messages in recovery
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: c7b23159c96b33cca70abb7a2d8cb87c6765d5ff

Comment by Gerrit Updater [ 19/Mar/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/30811/
Subject: LU-8649 recovery: print some useful messages in recovery
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 5c5bcfdb6ea715039d32d72965c7415f5de3d2fc

Generated at Sat Feb 10 02:19:23 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.