[LU-2464] recovery can't be finished forever Created: 11/Dec/12 Updated: 08/Apr/13 Resolved: 08/Apr/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.x (1.8.0 - 1.8.5) |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre-1.8.8, Infiniband |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 5809 |
| Description |
|
We have a "recovery never finishes" condition at a customer site. Even a couple of hours after the MDT started, recovery_status still showed RECOVERING. We tried umount and remount, but the situation was the same and new client connections were denied. Finally, we added "-o abort_recovery" to the mount options to fix this problem.

So, why can't recovery finish in a reasonable time?

# cat /proc/fs/lustre/mds/*/recovery_status
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/2
delayed_clients: 0/2
completed_clients: 0/2
replayed_requests: 0/??
queued_requests: 0
next_transno: 38672353440 |
| Comments |
| Comment by Johann Lombardi (Inactive) [ 11/Dec/12 ] |
|
Ihara, the recovery timer only starts once the first client reconnects. Since none of the clients have reconnected yet, recovery is still in progress. |
| Comment by Shuichi Ihara (Inactive) [ 11/Dec/12 ] |
|
Which clients was the MDS waiting for during recovery? |
| Comment by Johann Lombardi (Inactive) [ 11/Dec/12 ] |
connected_clients: 0/2
It is waiting for 2 clients, but I can't tell you which ones (we only store the UUID in the last_rcvd file). |
| Comment by Shuichi Ihara (Inactive) [ 11/Dec/12 ] |
|
So, if we can't find these two clients, or the clients can't connect for some reason (e.g. h/w or s/w problems), and we don't want to wait for recovery to finish, is abort_recovery the only way? |
| Comment by Andreas Dilger [ 11/Dec/12 ] |
|
Ihara, to avoid the problem where the server is disconnected from the network, the recovery timer does not start until any client tries to connect to the server. If you know that no clients will connect then abort_recovery will speed this up. Otherwise, recovery will start when the first client tries to mount the filesystem. |
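
Editorial sketch, not part of the original comment: the condition described above can be checked from the recovery_status output quoted in the description. The helper names below are hypothetical, and reading "recovery_start: 0" together with "connected_clients: 0/N" as "the timer has not started yet" is an assumption drawn from this thread, not a documented interface.

```python
def parse_recovery_status(text):
    """Parse 'key: value' lines from a recovery_status dump into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def waiting_for_first_client(fields):
    """Return True when the dump suggests the recovery timer has not started.

    Assumption (from this ticket only): the server is still RECOVERING,
    recovery_start is 0, and no client has reconnected yet.
    """
    connected = fields.get("connected_clients", "0/0").split("/")[0]
    return (
        fields.get("status") == "RECOVERING"
        and fields.get("recovery_start") == "0"
        and connected == "0"
    )


# Sample taken from the description of this ticket.
SAMPLE = """\
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/2
delayed_clients: 0/2
completed_clients: 0/2
replayed_requests: 0/??
queued_requests: 0
next_transno: 38672353440
"""

if __name__ == "__main__":
    status = parse_recovery_status(SAMPLE)
    if waiting_for_first_client(status):
        print("No client has reconnected; recovery timer has likely not started.")
```

On a live server the text would come from reading the recovery_status file shown in the description rather than from the embedded sample.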
| Comment by Peter Jones [ 12/Dec/12 ] |
|
Assigning to Bruno for any follow on questions |
| Comment by Bruno Faccini (Inactive) [ 22/Dec/12 ] |
|
Ihara, |
| Comment by Shuichi Ihara (Inactive) [ 22/Dec/12 ] |
|
Bruno, we hit the same problem at the same customer twice after you reviewed this, but we couldn't get a crash dump because of a hardware configuration problem. That should work now, so if we hit the same problem again we should be able to give you a crash dump for deeper analysis. Please keep this ticket open and we will keep you updated. |
| Comment by Bruno Faccini (Inactive) [ 25/Jan/13 ] |
|
Ihara, any news? |
| Comment by Shuichi Ihara (Inactive) [ 25/Jan/13 ] |
|
Bruno, the last problem was fixed by abort_recovery and we don't need additional investigation on this. Please close this ticket, and I will open a new ticket if we see the same problem at this customer. |