HSM _not only_ small fixes and to do list goes here (LU-3647)

[LU-3685] some paths in ll_ioc_copy_{start,end} set hpk_errval non-zero but don't set HP_FLAG_COMPLETED
Created: 01/Aug/13
Updated: 19/Aug/13
Resolved: 19/Aug/13

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.5.0

Type: Technical task
Priority: Major
Reporter: John Hammond
Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed
Votes: 0
Labels: HSM

Rank (Obsolete): 9517

 Description   

Running racer with HSM operations I see messages of the form:

LustreError: 4158:0:(ldlm_resource.c:1188:ldlm_resource_get()) lustre-OST0001: lvbo_init failed for resource 0x1936:0x0: rc = -2
LustreError: 11-0: lustre-OST0001-osc-ffff8801f01ff000: Communicating with 0@lo, operation ost_getattr failed with -12.
LustreError: 4140:0:(mdt_coordinator.c:1500:mdt_hsm_update_request_state()) lustre-MDT0000: Progress on [0x200000401:0x972f:0x0] for cookie 0x51faf3c6 action=ARCHIVE is not coherent (err=12 and not completed (flags=2))

after which the coordinator just stops sending actions to the copytool.

The coordinator seems to just drop these incoherent progress kernels. Is there a use case for an HPK with hpk_errval != 0 that is not complete?
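
For reference, the condition the coordinator is complaining about can be sketched as below. This is a minimal, self-contained illustration, not the code from mdt_coordinator.c: the struct is a simplified stand-in with hypothetical names (hpk_sketch, hpk_is_coherent), and the flag values are assumptions chosen to agree with "flags=2" in the log above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed flag values; 0x02 matches "flags=2" in the log message. */
enum hp_flags_sketch {
	HP_FLAG_COMPLETED = 0x01, /* the request is finished */
	HP_FLAG_RETRY     = 0x02, /* the failure may be retried */
};

/* Simplified stand-in for a progress record; only the fields used here. */
struct hpk_sketch {
	uint16_t hpk_flags;
	uint16_t hpk_errval; /* positive errno, 0 when there is no error */
};

/*
 * A progress record is treated as coherent when it either reports no
 * error, or reports an error together with HP_FLAG_COMPLETED.
 */
static bool hpk_is_coherent(const struct hpk_sketch *hpk)
{
	assert(hpk != NULL);
	return hpk->hpk_errval == 0 ||
	       (hpk->hpk_flags & HP_FLAG_COMPLETED) != 0;
}
```

With these assumptions, the record from the log (err=12, flags=2) fails the check, while the same error with HP_FLAG_COMPLETED set would pass.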

Do not be distracted by the specific errno here. The node is not really OOM; somewhere in the OST code a NULL something is misinterpreted as meaning -ENOMEM, whereas really it means -ENOENT or something.



 Comments   
Comment by Aurelien Degremont (Inactive) [ 05/Aug/13 ]

Hi John,

I've looked at this. Indeed, HP_FLAG_COMPLETED is missing on error cases for copy_start(), but everything seems fine for copy_end(). Could you confirm?
By the way, do you have a way to reproduce this?
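
One possible shape for the error-path handling is sketched below. This is only an illustration of the idea (an error implies the request is over, so the completed flag is set alongside the error value); it is not the landed patch. The struct and the helper name hpk_set_result are hypothetical, and the flag values are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed flag values, for illustration only. */
#define HP_FLAG_COMPLETED 0x01
#define HP_FLAG_RETRY     0x02

/* Simplified stand-in for a progress record; only the fields used here. */
struct hpk_sketch {
	uint16_t hpk_flags;
	uint16_t hpk_errval; /* positive errno, 0 when there is no error */
};

/*
 * Record the outcome of a setup step in the progress record.
 * rc follows the kernel convention: 0 on success, negative errno on failure.
 */
static void hpk_set_result(struct hpk_sketch *hpk, int rc)
{
	assert(hpk != NULL);
	assert(rc <= 0);

	if (rc < 0) {
		hpk->hpk_errval = (uint16_t)-rc;
		/* A failed request will make no further progress, so it
		 * is reported as completed as well as failed. */
		hpk->hpk_flags |= HP_FLAG_COMPLETED;
	}
}
```

Routing every error exit through a single helper like this keeps hpk_errval and HP_FLAG_COMPLETED from drifting apart when new failure paths are added.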

Comment by Aurelien Degremont (Inactive) [ 05/Aug/13 ]

You could assign this ticket to me.

Comment by John Hammond [ 05/Aug/13 ]

Hi Aurelien,

You are correct about ll_ioc_copy_end(). My mistake.

I reproduced this by adding an HSM archive, release, restore loop to racer. But it can be done more specifically by racing unlink versus archive.

It seems that I cannot assign this issue to you since JIRA does not consider you to be a "developer." My condolences. I will see about adding you to that group.

Comment by jacques-charles lafoucriere [ 05/Aug/13 ]

"It seems that I cannot assign this issue to you since JIRA does not consider you to be a "developer." My condolences. I will and see about adding you to that group"
In the past it was restricted to Whamcloud/Intel employees (because this group may see things that non-corporate guys like us should not see). If possible, add Thomas, Henri, and myself.

Comment by Aurelien Degremont (Inactive) [ 07/Aug/13 ]

Patch for this: http://review.whamcloud.com/7265

Comment by John Hammond [ 19/Aug/13 ]

Patch landed to master.
