[LU-61] MDT can't connect to OST after hardware event: oscc recovery failed: -116 Created: 04/Feb/11  Updated: 28/Jun/11  Resolved: 04/Feb/11

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 1.8.6

Type: Bug Priority: Minor
Reporter: Kit Westneat (Inactive) Assignee: Cliff White (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File LU-61.tar.gz    
Severity: 1
Rank (Obsolete): 10068

 Description   

Hi WC,

There was a hardware failure at Purdue today that took out a 6620 controller. After the issue was fixed, the MDT fails to connect to one OST and has intermittent connections with another. fid2dentry is getting passed an obd_id of 0, which causes it to return an ESTALE to the MDT. I looked in bz, but I couldn't find anything similar. Have you seen anything like this, or do you have any ideas on how to get it back online?

We've tried rebooting the MDS and OSS, but after recovery it still has this issue. Would aborting recovery help? How about the CATALOGS trick? Let me know if other logs would help.

Thanks,
Kit

Relevant MDT logs:
Feb 4 17:50:53 mds-a01 kernel: [13485616.075071] LustreError: 12085:0:(osc_create.c:585:osc_create()) lustrefatal: invalid object id A-OST0001-osc: oscc recovery failed: -116
Feb 4 17:50:53 mds-a01 kernel: [13485616.075526] LustreError: 12085:0:(lov_obd.c:1131:lov_clear_orphans()) error in orphan recovery on OST idx 1/36: rc = -116
Feb 4 17:50:53 mds-a01 kernel: [13485616.076025] LustreError: 12085:0:(mds_lov.c:1062:__mds_lov_synchronize()) lustreA-OST0001_UUID failed at mds_lov_clear_orphans: -116
Feb 4 17:50:53 mds-a01 kernel: [13485616.076482] LustreError: 12085:0:(mds_lov.c:1071:__mds_lov_synchronize()) lustreA-OST0001_UUID sync failed -116, deactivating
Feb 4 17:51:39 mds-a01 kernel: [13485661.612612] LustreError: 12408:0:(osc_create.c:585:osc_create()) lustreA-OST0001-osc: oscc recovery failed: -116

lctl dl
...
28 UP osc lustreA-OST000b-osc lustreA-mdtlov_UUID 5
29 UP osc lustreA-OST0000-osc lustreA-mdtlov_UUID 5
30 IN osc lustreA-OST0001-osc lustreA-mdtlov_UUID 5
31 UP osc lustreA-OST0002-osc lustreA-mdtlov_UUID 5
32 UP osc lustreA-OST0003-osc lustreA-mdtlov_UUID 5
...
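
For anyone triaging a similar report, here is a minimal sketch of how the device list above could be scanned for devices that are not UP. It assumes the column layout seen in this excerpt (index, state, type, name, UUID, refcount); the function name inactive_devices is hypothetical and not part of any Lustre tool.

```python
def inactive_devices(lctl_dl_output):
    """Return (index, state, type, name) for every device whose state is not UP."""
    found = []
    for line in lctl_dl_output.splitlines():
        fields = line.split()
        # Assumed layout, taken from the excerpt in this ticket:
        # index, state, type, name, uuid, refcount
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # skip "..." and any other non-device lines
        index, state, dev_type, name = fields[:4]
        if state != "UP":
            found.append((int(index), state, dev_type, name))
    return found


sample = """\
28 UP osc lustreA-OST000b-osc lustreA-mdtlov_UUID 5
29 UP osc lustreA-OST0000-osc lustreA-mdtlov_UUID 5
30 IN osc lustreA-OST0001-osc lustreA-mdtlov_UUID 5
31 UP osc lustreA-OST0002-osc lustreA-mdtlov_UUID 5
"""
print(inactive_devices(sample))  # [(30, 'IN', 'osc', 'lustreA-OST0001-osc')]
```

On the sample it reports only device 30, the OST0001 osc that the MDS deactivated after the failed sync.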

Relevant OST logs:
Feb 4 17:43:53 oss-a01 kernel: [ 1333.618994] LustreError: 10635:0:(filter.c:1428:filter_fid2dentry()) lustreA-OST0001: object 2283250:0 lookup error: rc -116
Feb 4 17:43:53 oss-a01 kernel: [ 1333.619430] LustreError: 10635:0:(filter.c:1428:filter_fid2dentry()) Skipped 1 previous similar message
Feb 4 17:43:55 oss-a01 kernel: [ 1336.503075] LustreError: 9981:0:(filter_lvb.c:90:filter_lvbo_init()) lustreA-OST0001: bad object 2283250/0: rc -116
Feb 4 17:43:55 oss-a01 kernel: [ 1336.503630] LustreError: 9981:0:(ldlm_resource.c:860:ldlm_resource_add()) lvbo_init failed for resource 2283250: rc -116
Feb 4 17:43:55 oss-a01 kernel: [ 1336.504092] LustreError: 9981:0:(ldlm_resource.c:860:ldlm_resource_add()) Skipped 37 previous similar messages
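
When there are several days of logs to go through, a rough tally of LustreError lines by reporting function and return code can help narrow things down. The sketch below assumes the message shape seen in the excerpts in this ticket; the regular expressions and the name tally_errors are illustrative only, not part of Lustre.

```python
import errno
import re
from collections import Counter

# Assumed message shape, based on the log excerpts in this ticket:
# "LustreError: <pid>:<n>:(<file>:<line>:<function>()) <message>"
LUSTRE_ERROR = re.compile(
    r"LustreError: \d+:\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)"
)
# A negative return code: whitespace, a minus sign, then digits.
RETURN_CODE = re.compile(r"\s-(\d+)\b")


def tally_errors(log_text):
    """Count LustreError lines per (function, return code)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LUSTRE_ERROR.search(line)
        if not match:
            continue
        rc = RETURN_CODE.search(match.group("msg"))
        if rc is None:
            continue  # e.g. "Skipped N previous similar messages"
        counts[(match.group("func"), -int(rc.group(1)))] += 1
    return counts


sample = """\
Feb 4 17:43:53 oss-a01 kernel: [ 1333.618994] LustreError: 10635:0:(filter.c:1428:filter_fid2dentry()) lustreA-OST0001: object 2283250:0 lookup error: rc -116
Feb 4 17:43:53 oss-a01 kernel: [ 1333.619430] LustreError: 10635:0:(filter.c:1428:filter_fid2dentry()) Skipped 1 previous similar message
Feb 4 17:43:55 oss-a01 kernel: [ 1336.503075] LustreError: 9981:0:(filter_lvb.c:90:filter_lvbo_init()) lustreA-OST0001: bad object 2283250/0: rc -116
"""
for (func, rc), n in sorted(tally_errors(sample).items()):
    # errno numbering is platform-specific; the lookup reflects the host running this script.
    print(func, rc, errno.errorcode.get(-rc, "unknown"), n)
```

The "Skipped ... similar message" lines carry no return code and are ignored, so the counts are a lower bound on how often each error actually occurred.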



 Comments   
Comment by Cliff White (Inactive) [ 04/Feb/11 ]

Yes, I would try aborting recovery. It would be best to have the full log for a mount attempt.

Comment by Kit Westneat (Inactive) [ 04/Feb/11 ]

logs from MDS and OSS before MDS and OSS reboot

Comment by Cliff White (Inactive) [ 04/Feb/11 ]

There are 4 days of logs here. Can you tell me exactly when the issue started? When did you have the hardware failure?

Comment by Cliff White (Inactive) [ 04/Feb/11 ]

I am consulting with engineering. Have you run fsck on the OSTs? These errors may indicate a problem there.

Comment by Kit Westneat (Inactive) [ 04/Feb/11 ]

The OSSes see IO errors around Feb 3 13:37:10. After the first reboot at Feb 3 14:46:41, Lustre isn't started again until Feb 4 17:27:43. That's when you can first see the -ESTALE errors.

I'll ask the customer whether they have run an fsck. I thought they had, but maybe not.

Comment by Cliff White (Inactive) [ 04/Feb/11 ]

Engineering confirms: you should run fsck. Object 2283250 may be damaged.

Comment by Kit Westneat (Inactive) [ 04/Feb/11 ]

That was it, oops! Sorry for not checking that earlier. Hopefully others can learn from my mistakes... Thanks for your help!

Comment by Cliff White (Inactive) [ 04/Feb/11 ]

Great! Glad it got sorted out so easily. I will close this.

Comment by Cliff White (Inactive) [ 04/Feb/11 ]

Customer fsck'd OST, problem solved

Comment by Peter Jones [ 04/Feb/11 ]

As per Kit, OK to close.
