[LU-2206] OSS stuck in recovery Created: 17/Oct/12 Updated: 01/Nov/12 Resolved: 01/Nov/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Christopher Morrone | Assignee: | Alex Zhuravlev |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | topsequoia | ||
| Environment: |
Older iteration of orion branch, lustre 2.3.49.54-68chaos |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 5251 | ||||||||
| Description |
|
We keep getting OSS nodes stuck in recovery on Sequoia's filesystem. The recovery_status file reports:

$ cat /proc/fs/lustre/obdfilter/ls1-OST0005/recovery_status
status: RECOVERING
recovery_start: 1350515102
time_remaining: 0
connected_clients: 357/787
req_replay_clients: 32
lock_repay_clients: 131
completed_clients: 226
evicted_clients: 430
replayed_requests: 23
queued_requests: 32
next_transno: 12885381895

On the console we're seeing clearly bad messages like:

Lustre: ls1-OST0005: Client 6307b568-c73d-a978-48ac-1fc11c345ba7 (at 172.20.11.30@o2ib500) reconnecting, waiting for 787 clients in recovery for -64:-32
Lustre: Skipped 2006 previous similar messages
Lustre: ls1-OST0005: Client 6307b568-c73d-a978-48ac-1fc11c345ba7 (at 172.20.11.30@o2ib500) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 1908 previous similar messages

Keep in mind, this is still the older orion branch code, our version 2.3.49.54-68chaos. |
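For anyone triaging the same symptom across many OSTs, the check described above can be sketched in a few lines of Python. This is only an illustration, not a Lustre tool: the function names parse_recovery_status and looks_stuck are hypothetical, and the rule "status is RECOVERING while time_remaining is 0 or less" is an assumption drawn from the output in this ticket, not a documented criterion. The sample text is the recovery_status output quoted in the description.

```python
import re

def parse_recovery_status(text):
    """Parse 'key: value' pairs from recovery_status text into a dict.

    Works whether the pairs are one per line or flattened onto one line.
    """
    return dict(re.findall(r"(\w+):\s*(\S+)", text))

def looks_stuck(fields):
    """Hypothetical heuristic (an assumption, not a Lustre rule):
    recovery is still in progress although the timer has run out."""
    return (fields.get("status") == "RECOVERING"
            and int(fields.get("time_remaining", "1")) <= 0)

# Output quoted in this ticket for ls1-OST0005.
sample = """status: RECOVERING
recovery_start: 1350515102
time_remaining: 0
connected_clients: 357/787
req_replay_clients: 32
lock_repay_clients: 131
completed_clients: 226
evicted_clients: 430
replayed_requests: 23
queued_requests: 32
next_transno: 12885381895"""

fields = parse_recovery_status(sample)
# connected_clients is reported as "connected/total".
connected, total = (int(n) for n in fields["connected_clients"].split("/"))

if looks_stuck(fields):
    print(f"recovery looks stuck: {connected}/{total} clients connected, "
          f"{fields['queued_requests']} requests queued")
```

On a live server the same functions could be fed the contents of each /proc/fs/lustre/obdfilter/*/recovery_status file instead of the sample string.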
| Comments |
| Comment by Peter Jones [ 18/Oct/12 ] |
|
Alex,

Another Sequoia issue to review.

Peter |
| Comment by Christopher Morrone [ 29/Oct/12 ] |
|
Similar negative recovery times in |
| Comment by Andreas Dilger [ 01/Nov/12 ] |
|
I'm closing this as a duplicate of |