[LU-3380] Failure to activate deactivated OST Created: 22/May/13  Updated: 27/May/13  Resolved: 27/May/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.x (1.8.0 - 1.8.5)
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Chirstopher Beggio Assignee: Niu Yawei (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Red Hat Enterprise Linux Server release 5.5 (Tikanga), Kernel 2.6.18-108redsky_chaos, Lustre:
lustre: 1.8.5
kernel: patchless_client
build: 1.8.5-6chaos


Epic/Theme: Quota
Severity: 3
Rank (Obsolete): 8366

Description

We are observing a failure to activate one Lustre 1.8.5 target, which we need active in order to run 'lfs quota -u ...'. Five OSTs had been deactivated for performance reasons; four of the five activate correctly, but one (OST0008) repeatedly fails to activate despite fsck'ing both the OST and the MDT. The sequence and errors are shown below. We are aware that both the kernel and Lustre versions are very old; we are in the process of upgrading to 1.8.9 and then 2.1.x. Any assistance or suggestions are appreciated.

Deactivated targets:
7 IN osc scratch1-OST0002-osc scratch1-mdtlov_UUID 5
9 IN osc scratch1-OST0004-osc scratch1-mdtlov_UUID 5
13 IN osc scratch1-OST0008-osc scratch1-mdtlov_UUID 5
15 IN osc scratch1-OST000a-osc scratch1-mdtlov_UUID 5
28 IN osc scratch1-OST0017-osc scratch1-mdtlov_UUID 5

Attempting to activate the targets on the MDS results in OST0008 remaining inactive:
13 IN osc scratch1-OST0008-osc scratch1-mdtlov_UUID 5

Errors on OSS:
2013-05-22 12:11:45 xxxxxx [47215.258366] Lustre: scratch1-OST0008: received MDS connection from xx.x.xx.x@o2ib <kern.warning>
2013-05-22 12:11:45 xxxxx [47215.259576] LustreError: 8201:0:(filter.c:3173:filter_handle_precreate()) scratch1-OST0008: ignoring bogus orphan destroy request: obdid 15039753 last_id 15086581 <kern.err>

Errors on MDS:
2013-05-22 12:11:45 xxxxx [1792441.005047] LustreError: 10973:0:(osc_create.c:589:osc_create()) scratch1-OST0008-osc: oscc recovery failed: -22 <kern.err>
2013-05-22 12:11:45 xxxxx [1792441.005056] Lustre: scratch1-OST0008_UUID: Failed to clear orphan objects on OST: -22 <kern.warning>
2013-05-22 12:11:45 xxxxx [1792441.005058] Lustre: scratch1-OST0008_UUID: Sync failed deactivating: rc -22 <kern.warning>
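
For reference, a minimal sketch of the deactivate/reactivate cycle and quota query described above, assuming the MDS device numbers from the listing (e.g. 13 for scratch1-OST0008-osc), a hypothetical client mount point /mnt/scratch1, and a hypothetical user name:

# On the MDS: list the MDT's OSC devices and their state (UP / IN).
lctl dl | grep osc

# Deactivate an OST from the MDS so no new objects are allocated on it ...
lctl --device 13 deactivate      # 13 = scratch1-OST0008-osc in the listing above

# ... and reactivate it later.
lctl --device 13 activate

# On a client: per-user quota reporting needs all OSTs active for complete numbers.
lfs quota -u someuser /mnt/scratch1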



Comments
Comment by Peter Jones [ 22/May/13 ]

Niu

Could you please comment on this one?

Thanks

Peter

Comment by Chirstopher Beggio [ 22/May/13 ]

I should add that the target has also been remounted with recovery aborted (-o abort_recov), to eliminate that as a possible solution.
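
For completeness, a sketch of that remount on the OSS, with a hypothetical backing device and mount point:

# Unmount the OST, then remount it with recovery aborted so the target does not
# wait for, or replay, client recovery.
umount /mnt/lustre/ost0008
mount -t lustre -o abort_recov /dev/ost0008_dev /mnt/lustre/ost0008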

Comment by Chirstopher Beggio [ 22/May/13 ]

It appears this problem and solution are documented in LU-2018. I will review and implement that fix and report the result.

Comment by Niu Yawei (Inactive) [ 23/May/13 ]

Yes, it looks like the same problem as LU-2018. You need to determine whether the lov_objid on the MDS or the last_id on the OST is wrong, and correct it following the instructions described in LU-2018. Thanks.
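
A minimal sketch of how the relevant values can be compared (the lov_objid entry on the MDT, LAST_ID on the OST, and the highest object actually on disk), assuming hypothetical backing devices /dev/mdtdev and /dev/ost0008_dev and that the targets are unmounted; the authoritative steps are in LU-2018 and the Lustre 1.8 manual:

# MDT: dump the lov_objid file; od prints one 64-bit value per OST in OST index
# order, so the entry for OST0008 is the ninth value (index 8).
debugfs -c -R 'dump lov_objid /tmp/lov_objid' /dev/mdtdev
od -Ax -td8 /tmp/lov_objid

# OST: dump LAST_ID (the OST's record of the last precreated object ID).
debugfs -c -R 'dump /O/0/LAST_ID /tmp/LAST_ID' /dev/ost0008_dev
od -Ax -td8 /tmp/LAST_ID

# Highest object actually on disk: objects are hashed into /O/0/d0 .. /O/0/d31
# by object_id % 32; print the largest numeric file name across those directories.
for d in $(seq 0 31); do
    debugfs -c -R "ls -l /O/0/d$d" /dev/ost0008_dev 2>/dev/null
done | awk '{print $NF}' | grep -E '^[0-9]+$' | sort -n | tail -1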

Comment by Chirstopher Beggio [ 24/May/13 ]

The issue was similar to LU-2018. The difference was that the lov_objid on the MDT, the last on-disk object ID, and the LAST_ID on the OST were all different. The MDT lov_objid for that OST and the on-disk last object ID were very close, with the lov_objid slightly ahead of the on-disk value, while the LAST_ID on the OST was far off. I reset LAST_ID to the last on-disk object ID, and that corrected the issue. If this is the wrong solution to the problem, please let me know. It seems to me that the last on-disk object ID should be the gold standard, so I used that. Unless I have made an incorrect assumption, this may be closed.
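
A hedged sketch of that LAST_ID reset, with hypothetical device and mount point names; it assumes, as described in LU-2018, that the file holds a single little-endian 64-bit object ID, so verify against that ticket (and back everything up) before writing anything:

# Stop the OST and mount its backing filesystem as ldiskfs.
umount /mnt/lustre/ost0008
mount -t ldiskfs /dev/ost0008_dev /mnt/ost_ldiskfs
cp /mnt/ost_ldiskfs/O/0/LAST_ID /root/LAST_ID.backup

# Write the corrected ID back as a single little-endian 64-bit value
# (the assumed on-disk format of LAST_ID).
NEW_ID=$HIGHEST_ONDISK_ID        # substitute the highest on-disk object ID found above
python -c 'import struct,sys; open(sys.argv[1],"wb").write(struct.pack("<Q", int(sys.argv[2])))' \
    /mnt/ost_ldiskfs/O/0/LAST_ID "$NEW_ID"

# Remount the target as Lustre.
umount /mnt/ost_ldiskfs
mount -t lustre /dev/ost0008_dev /mnt/lustre/ost0008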

Comment by Niu Yawei (Inactive) [ 27/May/13 ]

Right, we should use the last on-disk object ID as the standard in this case. Thanks, Chris.

Comment by Niu Yawei (Inactive) [ 27/May/13 ]

Duplicate of LU-2018.
