[LU-5297] osp_sync_thread can't handle invalid record gracefully Created: 04/Jul/14  Updated: 22/Sep/15  Resolved: 19/Sep/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.8.0

Type: Bug
Priority: Critical
Reporter: Niu Yawei (Inactive)
Assignee: Emoly Liu
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-6269 Unable to mount /nobackupp8 Resolved
is related to LU-6687 ALL osp-sync in D state Resolved
Severity: 3
Rank (Obsolete): 14781

 Description   

osp_sync_process_queues() currently assumes that none of the records it processes are corrupted; it has no error handling for invalid records.

One solution would be to treat an invalid record as committed and skip processing it.
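
The proposal above can be illustrated with a minimal, self-contained sketch. This is not the Lustre code: `struct sync_rec`, `rec_is_valid()` and `process_queue()` are hypothetical names invented for illustration, and "sending" a record is reduced to incrementing a counter. It only shows the idea of marking an unrecognized record as committed and moving on instead of failing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record layout; NOT the real llog record header. */
enum rec_type { REC_UNLINK = 1, REC_SETATTR = 2 };

struct sync_rec {
    int  type;       /* one of enum rec_type when the record is valid */
    bool committed;  /* true once the record needs no further work */
};

/* Hypothetical validity check: only known record types are accepted. */
static bool rec_is_valid(const struct sync_rec *rec)
{
    return rec->type == REC_UNLINK || rec->type == REC_SETATTR;
}

/*
 * Walk the queue once. Valid records are counted in *sent (a stand-in for
 * issuing an RPC). Invalid records are marked committed and skipped.
 * Returns the number of records skipped as invalid.
 */
static size_t process_queue(struct sync_rec *recs, size_t n, size_t *sent)
{
    size_t skipped = 0;

    assert(sent != NULL);
    *sent = 0;
    for (size_t i = 0; i < n; i++) {
        if (!rec_is_valid(&recs[i])) {
            /* Treat as committed so later cleanup can drop the record. */
            recs[i].committed = true;
            skipped++;
            continue;
        }
        (*sent)++;
    }
    return skipped;
}
```

In this sketch a corrupted record never blocks the records behind it; whether the real sync thread can safely discard such a record is the question discussed in the comments.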



 Comments   
Comment by Andreas Dilger [ 04/Jul/14 ]

Is this related to LU-5188 "osp: return 1 if osp_sync_xxx_job issue RPC"? What is the severity of this problem (i.e. what impact does it have on normal operation)? Will it cause OST objects to be leaked if they are unlinked when the OST is offline, and still offline when the MDS processes the llog records?

Comment by Niu Yawei (Inactive) [ 07/Jul/14 ]

The LU-5188 patch tried to handle errors in the osp sync thread, but it looks incomplete. I don't think the severity is high, because corrupted records do not occur in normal usage.

Comment by Alex Zhuravlev [ 07/Jul/14 ]

Niu, could you explain in what way the patch is incomplete? Which specific cases doesn't it handle?

Comment by Niu Yawei (Inactive) [ 07/Jul/14 ]

That patch decrements opd_syn_rpc_in_flight and opd_syn_rpc_in_progress when skipping an invalid record, but that is not enough: opd_syn_changes isn't decremented, so the sync thread will break unexpectedly, and the invalid record isn't deleted at the end. I'm afraid this can cause further trouble.
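
The accounting problem described above can be shown with a simplified model. The struct below is an assumption made for illustration: its fields only mirror the names opd_syn_changes, opd_syn_rpc_in_flight and opd_syn_rpc_in_progress, and their exact semantics in the real osp device are not taken from the source. The helper functions are hypothetical as well.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the counters named in the comment (assumed semantics). */
struct sync_counters {
    int changes;          /* records still pending, cf. opd_syn_changes */
    int rpc_in_flight;    /* cf. opd_syn_rpc_in_flight */
    int rpc_in_progress;  /* cf. opd_syn_rpc_in_progress */
};

/* Account for a record that has been picked up for processing. */
static void rec_start(struct sync_counters *c)
{
    c->rpc_in_flight++;
    c->rpc_in_progress++;
}

/* Incomplete skip: only the two RPC counters are rolled back. */
static void rec_skip_partial(struct sync_counters *c)
{
    c->rpc_in_flight--;
    c->rpc_in_progress--;
}

/* Complete skip: the pending-change count is dropped as well. */
static void rec_skip_full(struct sync_counters *c)
{
    rec_skip_partial(c);
    assert(c->changes > 0);
    c->changes--;
}

/* The queue is idle only when every counter is back to zero. */
static bool queue_drained(const struct sync_counters *c)
{
    return c->changes == 0 && c->rpc_in_flight == 0 &&
           c->rpc_in_progress == 0;
}
```

With one pending record, `rec_start()` followed by `rec_skip_partial()` leaves `changes` at 1, so `queue_drained()` stays false even though nothing is left to send; `rec_skip_full()` brings all three counters back to zero.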

Comment by Andreas Dilger [ 09/Jul/14 ]

Niu, were you planning to make a patch for this, or should this bug be moved to 2.7.0?

Comment by Niu Yawei (Inactive) [ 10/Jul/14 ]

I don't have plans to make a patch yet; we should probably move it to 2.7.

Comment by Gerrit Updater [ 25/May/15 ]

Emoly Liu (emoly.liu@intel.com) uploaded a new patch: http://review.whamcloud.com/14925
Subject: LU-5297 osp: decrease opd_syn_changes for an invalid record
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 25ceff8e1de80f98e701f1ed5630a9b9308c2ddf

Comment by Gerrit Updater [ 19/Sep/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14925/
Subject: LU-5297 osp: process unsuccessful osp sync records properly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 591f8771df00e1c3279019281e5f7d2e7c7e4877

Comment by Peter Jones [ 19/Sep/15 ]

Landed for 2.8

Generated at Sat Feb 10 01:50:15 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.