[LU-61] MDT can't connect to OST after hardware event: oscc recovery failed: -116 Created: 04/Feb/11 Updated: 28/Jun/11 Resolved: 04/Feb/11 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 1.8.6 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Kit Westneat (Inactive) | Assignee: | Cliff White (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 1 |
| Rank (Obsolete): | 10068 |
| Description |
|
Hi WC, There was a hardware failure at Purdue today that took out a 6620 controller. After fixing the issue, the MDT fails to connect to one OST and has intermittent connections with another. fid2dentry is getting passed an obd_id of 0, which causes it to return an ESTALE to the MDT. I looked in bz, but I couldn't find anything similar. Have you seen anything like this, or do you have any ideas on how to get it online? We've tried rebooting the MDS and OSS, but after recovery it still has this issue. Would aborting recovery help? How about the CATALOGS trick? Let me know if other logs would help. Thanks, Relevant MDT logs: lctl dl Relevant OST logs: |
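For readers unfamiliar with the numeric code in the title: the -116 in "oscc recovery failed: -116" is a negated Linux errno value, and on Linux errno 116 is ESTALE, which matches the ESTALE described above. The following is a minimal sketch of that mapping. The helper name `decode_rc` and the small lookup table are hypothetical and written for illustration only; they are not part of Lustre or its tooling.

```python
# Hypothetical helper (not part of Lustre): decode a negative kernel-style
# return code using a small hand-written table of Linux errno values.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    116: ("ESTALE", "Stale file handle"),
}


def decode_rc(rc: int) -> str:
    """Return a readable form of a kernel-style return code such as -116."""
    if rc >= 0:
        return f"{rc}: not an error code"
    # Codes missing from the illustrative table are reported as unknown.
    name, text = LINUX_ERRNO.get(-rc, ("UNKNOWN", "not in this table"))
    return f"{rc} = -{name} ({text})"


print(decode_rc(-116))
```

The table is hard-coded rather than taken from the `errno` module so that the result does not depend on the platform running the script, since errno numbering differs between operating systems.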
| Comments |
| Comment by Cliff White (Inactive) [ 04/Feb/11 ] |
|
Yes, I would try aborting recovery. It would be best to have the full log for a mount attempt. |
| Comment by Kit Westneat (Inactive) [ 04/Feb/11 ] |
|
logs from MDS and OSS before MDS and OSS reboot |
| Comment by Cliff White (Inactive) [ 04/Feb/11 ] |
|
There are 4 days of logs here. Can you tell me exactly when the issue started? When did you have the hardware failure? |
| Comment by Cliff White (Inactive) [ 04/Feb/11 ] |
|
I am consulting with engineering - have you run fsck on the OSTs? This error may indicate a problem there. |
| Comment by Kit Westneat (Inactive) [ 04/Feb/11 ] |
|
The OSSes see IO errors around Feb 3 13:37:10. After the first reboot at Feb 3 14:46:41, Lustre isn't started again until Feb 4 17:27:43. That's when you can first see the -ESTALE errors. I'll ask the customer whether they have run an fsck; I thought they had, but maybe not. |
| Comment by Cliff White (Inactive) [ 04/Feb/11 ] |
|
Engineering confirms - you should run fsck; the object 2283250 may be damaged. |
| Comment by Kit Westneat (Inactive) [ 04/Feb/11 ] |
|
That was it, oops! Sorry for not checking that earlier. Hopefully others can learn from my mistakes... Thanks for your help! |
| Comment by Cliff White (Inactive) [ 04/Feb/11 ] |
|
Great! Glad it got sorted out so easily. I will close this. |
| Comment by Cliff White (Inactive) [ 04/Feb/11 ] |
|
Customer fsck'd OST, problem solved |
| Comment by Peter Jones [ 04/Feb/11 ] |
|
As per Kit, OK to close