[LU-2464] recovery can't be finished forever Created: 11/Dec/12  Updated: 08/Apr/13  Resolved: 08/Apr/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.x (1.8.0 - 1.8.5)
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Lustre-1.8.8, Infiniband


Attachments: Text File LFS01-MDS-ALPL105-msg.log    
Severity: 3
Rank (Obsolete): 5809

 Description   

We are seeing a "recovery never finishes" condition at a customer site. Even a couple of hours after the MDT started, recovery_status still showed RECOVERING. We tried unmounting and remounting, but the situation stayed the same and new client connections were denied. Finally, we added "-o abort_recovery" to the mount options to work around the problem. So, why can't recovery finish in a reasonable time?

# cat /proc/fs/lustre/mds/*/recovery_status
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/2
delayed_clients: 0/2
completed_clients: 0/2
replayed_requests: 0/??
queued_requests: 0
next_transno: 38672353440


 Comments   
Comment by Johann Lombardi (Inactive) [ 11/Dec/12 ]

Ihara, the recovery timer only starts once the first client reconnects. Since none of the clients have reconnected yet, recovery is still in progress.
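
As an illustration of the condition described above, here is a minimal Python sketch that parses the recovery_status output from the description and flags the "waiting for a first client" state. The function names are hypothetical. Treating "recovery_start: 0" together with zero connected clients as the sign that the timer has not started is an assumption drawn from this ticket's output, not documented Lustre behaviour.

```python
def parse_recovery_status(text):
    """Parse 'key: value' lines of recovery_status text into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a 'key: value' shape
            fields[key.strip()] = value.strip()
    return fields


def waiting_for_first_client(fields):
    """Return True if recovery is pending and no client has reconnected yet.

    Assumption: recovery_start == 0 means the recovery timer has not started.
    """
    connected, _, expected = fields.get("connected_clients", "0/0").partition("/")
    return (
        fields.get("status") == "RECOVERING"
        and fields.get("recovery_start") == "0"
        and int(connected) == 0
        and int(expected) > 0
    )


# Sample copied from the description of this ticket.
SAMPLE = """\
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/2
delayed_clients: 0/2
completed_clients: 0/2
replayed_requests: 0/??
queued_requests: 0
next_transno: 38672353440
"""

fields = parse_recovery_status(SAMPLE)
print(waiting_for_first_client(fields))  # prints True
```

On a live MDS the text would come from reading the recovery_status file instead of the SAMPLE string.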

Comment by Shuichi Ihara (Inactive) [ 11/Dec/12 ]

Which clients was the MDS waiting for during recovery?

Comment by Johann Lombardi (Inactive) [ 11/Dec/12 ]
connected_clients: 0/2

It is waiting for 2 clients, but I can't tell you which ones (we only store the UUID in the last_rcvd file).

Comment by Shuichi Ihara (Inactive) [ 11/Dec/12 ]

So, if we can't find these two clients, or the clients can't connect for some reason (e.g. h/w or s/w problems), and we don't want to keep waiting for recovery to finish, is abort_recovery the only way?

Comment by Andreas Dilger [ 11/Dec/12 ]

Ihara, to avoid the problem where the server is disconnected from the network, the recovery timer does not start until any client tries to connect to the server. If you know that no clients will connect then abort_recovery will speed this up. Otherwise, recovery will start when the first client tries to mount the filesystem.

Comment by Peter Jones [ 12/Dec/12 ]

Assigning to Bruno for any follow-on questions.

Comment by Bruno Faccini (Inactive) [ 22/Dec/12 ]

Ihara,
Is there anything more we can do on this ticket? If not, can we close it?

Comment by Shuichi Ihara (Inactive) [ 22/Dec/12 ]

Bruno,

We had the same problem at the same customer twice after you reviewed this, but we couldn't get a crash dump due to a hardware configuration problem. That should work now, so once we hit the same problem again, we should be able to give you a crash dump for deeper analysis.

Please keep this open and we will update you.

Comment by Bruno Faccini (Inactive) [ 25/Jan/13 ]

Ihara, any news?

Comment by Shuichi Ihara (Inactive) [ 25/Jan/13 ]

Bruno, the last problem was fixed by abort_recovery and we don't need additional investigation on this. Please close this ticket, and I will open a new ticket if we see the same problem at this customer.
Thanks!

Generated at Sat Feb 10 01:25:25 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.