[LU-2206] OSS stuck in recovery Created: 17/Oct/12  Updated: 01/Nov/12  Resolved: 01/Nov/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Christopher Morrone Assignee: Alex Zhuravlev
Resolution: Duplicate Votes: 0
Labels: topsequoia
Environment:

Older iteration of orion branch, lustre 2.3.49.54-68chaos


Issue Links:
Duplicate
duplicates LU-2104 conf-sanity test 47 never completes, ... Resolved
Severity: 3
Rank (Obsolete): 5251

 Description   

We keep getting OSS nodes stuck in recovery on Sequoia's filesystem. The recovery_status file reports:

$ cat /proc/fs/lustre/obdfilter/ls1-OST0005/recovery_status 
status: RECOVERING
recovery_start: 1350515102
time_remaining: 0
connected_clients: 357/787
req_replay_clients: 32
lock_repay_clients: 131
completed_clients: 226
evicted_clients: 430
replayed_requests: 23
queued_requests: 32
next_transno: 12885381895
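
A minimal sketch of how output like the above could be checked automatically for a stalled recovery. The helper names (parse_recovery_status, recovery_warnings) and the two heuristics are assumptions made for illustration, not part of Lustre; the field names are taken verbatim from the output above, including the "lock_repay_clients" spelling.

```python
SAMPLE = """\
status: RECOVERING
recovery_start: 1350515102
time_remaining: 0
connected_clients: 357/787
req_replay_clients: 32
lock_repay_clients: 131
completed_clients: 226
evicted_clients: 430
replayed_requests: 23
queued_requests: 32
next_transno: 12885381895
"""


def parse_recovery_status(text):
    """Parse 'key: value' lines from a recovery_status dump into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def recovery_warnings(fields):
    """Return warnings for a recovery that looks stuck (hypothetical heuristics)."""
    warnings = []
    # Heuristic 1: the recovery timer has run out but the target is still recovering.
    if fields["status"] == "RECOVERING" and int(fields["time_remaining"]) <= 0:
        warnings.append("time_remaining is 0 but status is still RECOVERING")
    # Heuristic 2: fewer clients are connected than the target expects.
    connected, total = (int(n) for n in fields["connected_clients"].split("/"))
    if connected < total:
        warnings.append("%d of %d clients not connected" % (total - connected, total))
    return warnings


if __name__ == "__main__":
    for warning in recovery_warnings(parse_recovery_status(SAMPLE)):
        print(warning)
```

On a live OSS the text would come from reading the recovery_status file shown above rather than from the embedded SAMPLE string.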

On the console we're seeing clearly bad messages, such as a negative remaining recovery time (-64:-32):

Lustre: ls1-OST0005: Client 6307b568-c73d-a978-48ac-1fc11c345ba7 (at 172.20.11.30@o2ib500) reconnecting, waiting for 787 clients in recovery for -64:-32
Lustre: Skipped 2006 previous similar messages
Lustre: ls1-OST0005: Client 6307b568-c73d-a978-48ac-1fc11c345ba7 (at 172.20.11.30@o2ib500) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 1908 previous similar messages
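
One plausible reading of the "-64:-32" above, sketched under an assumption: if the message renders a remaining time in seconds as minutes:seconds using C-style truncating division and remainder, then a negative remaining time produces a negative value in both positions. The function name format_min_sec and the 3872-second figure are inferred for illustration and do not come from the Lustre source or from this ticket.

```python
def format_min_sec(seconds):
    """Format seconds as minutes:seconds the way C integer math would.

    Assumption: minutes use truncation toward zero and the remainder keeps
    the sign of the input, as C's / and % do for ints.
    """
    minutes = int(seconds / 60)         # truncates toward zero
    remainder = seconds - minutes * 60  # same sign as `seconds`
    return "%d:%02d" % (minutes, remainder)


if __name__ == "__main__":
    print(format_min_sec(125))    # a normal, positive remaining time
    print(format_min_sec(-3872))  # a deadline that passed 3872 seconds ago
```

Under that assumption, "-64:-32" would correspond to a recovery deadline that expired roughly 64.5 minutes earlier, which fits the time_remaining: 0 reported in recovery_status.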

Keep in mind, this is still the older orion branch code, our version 2.3.49.54-68chaos.



 Comments   
Comment by Peter Jones [ 18/Oct/12 ]

Alex

Another Sequoia issue to review....

Peter

Comment by Christopher Morrone [ 29/Oct/12 ]

Similar negative recovery times in LU-2104.

Comment by Andreas Dilger [ 01/Nov/12 ]

I'm closing this as a duplicate of LU-2104, since that bug has more debug info, and this one has almost nothing.

Generated at Sat Feb 10 01:23:15 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.