Details
- Bug
- Resolution: Duplicate
- Critical
- None
- Lustre 2.4.0
- Older iteration of orion branch, lustre 2.3.49.54-68chaos
- 3
- 5251
Description
We keep getting OSS nodes stuck in recovery on Sequoia's filesystem. The recovery_status file reports:
$ cat /proc/fs/lustre/obdfilter/ls1-OST0005/recovery_status
status: RECOVERING
recovery_start: 1350515102
time_remaining: 0
connected_clients: 357/787
req_replay_clients: 32
lock_repay_clients: 131
completed_clients: 226
evicted_clients: 430
replayed_requests: 23
queued_requests: 32
next_transno: 12885381895
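For triage, the counters in this output can be cross-checked with a short script. The following is a minimal Python sketch, assuming the file keeps the `key: value` layout shown here; `parse_recovery_status` is a hypothetical helper name and is not part of Lustre.

```python
def parse_recovery_status(text):
    """Parse 'key: value' lines from a recovery_status dump into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no 'key: value' separator
            fields[key.strip()] = value.strip()
    return fields


# Sample copied from the report; field names are kept verbatim.
SAMPLE = """\
status: RECOVERING
recovery_start: 1350515102
time_remaining: 0
connected_clients: 357/787
req_replay_clients: 32
lock_repay_clients: 131
completed_clients: 226
evicted_clients: 430
replayed_requests: 23
queued_requests: 32
next_transno: 12885381895
"""

status = parse_recovery_status(SAMPLE)
connected, expected = (int(n) for n in status["connected_clients"].split("/"))
evicted = int(status["evicted_clients"])

# Every expected client is either connected or evicted, and no time remains,
# yet the target still reports RECOVERING.
print(status["status"], connected, expected, evicted,
      connected + evicted == expected)
```

For the values in this report, connected plus evicted clients (357 + 430) equals the 787 expected clients, while time_remaining is 0 and the status is still RECOVERING.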
On the console we're seeing clearly bad messages like:
Lustre: ls1-OST0005: Client 6307b568-c73d-a978-48ac-1fc11c345ba7 (at 172.20.11.30@o2ib500) reconnecting, waiting for 787 clients in recovery for -64:-32
Lustre: Skipped 2006 previous similar messages
Lustre: ls1-OST0005: Client 6307b568-c73d-a978-48ac-1fc11c345ba7 (at 172.20.11.30@o2ib500) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 1908 previous similar messages
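The `-64:-32` in the first message reads like a negative remaining time printed as minutes:seconds. Below is a small Python sketch of that arithmetic. It assumes, without confirmation from this ticket, that the message is built from a seconds count using C-style truncating division and remainder; `c_style_min_sec` is a hypothetical helper.

```python
import math


def c_style_min_sec(seconds):
    """Split seconds into (minutes, seconds) the way C's / and % do."""
    minutes = int(seconds / 60)        # truncates toward zero, like C integer division
    rem = int(math.fmod(seconds, 60))  # remainder keeps the dividend's sign, like C %
    return minutes, rem


remaining = -(64 * 60 + 32)  # -3872 s: a deadline that passed 64 min 32 s ago
minutes, rem = c_style_min_sec(remaining)
print(f"{minutes}:{rem:02d}")  # prints -64:-32
```

Under that assumption, the message would correspond to a recovery deadline that expired roughly 64 minutes earlier, which is consistent with time_remaining: 0 in recovery_status.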
Keep in mind, this is still the older orion branch code, our version 2.3.49.54-68chaos.
Issue Links
- duplicates: LU-2104 conf-sanity test 47 never completes, negative time to recovery (Resolved)