[LU-4804] Test failure sanity test_17n: create remote dir error Created: 24/Mar/14  Updated: 05/Nov/19  Resolved: 05/Nov/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Nathaniel Clark Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 13219

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/97cf2bae-b21d-11e3-9a4b-52540035b04c.

The sub-test test_17n failed with the following error:

create remote dir error 0

Info required for matching: sanity 17n



 Comments   
Comment by Jodi Levi (Inactive) [ 24/Mar/14 ]

Di,
Can you please comment on this ticket?
Thank you!

Comment by Di Wang [ 27/Mar/14 ]

From the debug log, the LWP connection (between the other MDTs and MDT0) was evicted, but I am not sure which RPC failed because of this. We already added failover support for seq lookup in LU-4571, so it has to be some other RPC. Unfortunately, I cannot find further information in the debug log. We should probably go with option 2 suggested in LU-4571, i.e. not evicting the LWP during reconnect after reboot, but that requires quota to set a special flag so that the quota lock is not replayed during recovery. Niu, could you please comment here?

Comment by Bruno Faccini (Inactive) [ 27/May/14 ]

Di,
Am I correct in thinking that the scenario you describe can also apply to other stages of sanity/test_17n, like the one ending with a "destroy remote dir" error? I ask because one of my test sessions triggered this error (https://maloo.whamcloud.com/test_sets/92c4a4ee-e4fa-11e3-a294-52540035b04c), like a few others I found in Maloo stats/search, and the related Lustre debug logs seem to indicate that -ESTALE/-116 was returned due to some communication failure between MDSs.

Comment by Di Wang [ 27/May/14 ]

Bruno,

According to the information in https://maloo.whamcloud.com/test_sets/92c4a4ee-e4fa-11e3-a294-52540035b04c, I do not see the connection between MDT0 and MDT1 being evicted, so this is probably a different problem. I did see this in the debug messages:

Lustre: lustre-MDT0001-osp-MDT0000: Connection to lustre-MDT0001 (at 10.1.5.234@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: lustre-MDT0001-osp-MDT0000: Connection restored to lustre-MDT0001 (at 10.1.5.234@tcp)
LustreError: 11613:0:(osp_object.c:355:osp_attr_get()) lustre-MDT0001-osp-MDT0000:osp_attr_get update error [0x440000400:0x22:0x0]: rc = -116

You probably need to create a new ticket, or is there one already? Thanks.
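
For anyone triaging similar failures, the rc = -116 in the LustreError line above corresponds to ESTALE under Linux errno numbering. The following is a minimal sketch, not part of any Lustre tooling, of how such a line could be split into its fields when scanning debug logs. The function name parse_lustre_error, the regular expression, and the one-entry errno table are all assumptions made for illustration, and the pattern only covers the message shape quoted in this comment.

```python
import re

# Assumption: Linux errno numbering, since these logs come from Linux servers.
# Only the value that appears in this ticket is listed.
LINUX_ERRNO_NAMES = {116: "ESTALE"}

# Hypothetical pattern for the shape seen in this ticket:
#   LustreError: <pid>:<n>:(<file>:<line>:<func>()) <message>: rc = <rc>
LUSTRE_ERROR_RE = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"(?P<msg>.*): rc = (?P<rc>-?\d+)$"
)


def parse_lustre_error(line):
    """Return the fields of a LustreError line as a dict, or None if it does not match."""
    m = LUSTRE_ERROR_RE.match(line.strip())
    if m is None:
        return None
    rc = int(m.group("rc"))
    return {
        "pid": int(m.group("pid")),
        "file": m.group("file"),
        "line": int(m.group("line")),
        "func": m.group("func"),
        "message": m.group("msg"),
        "rc": rc,
        # Kernel-style return codes are negated errno values.
        "errno_name": LINUX_ERRNO_NAMES.get(-rc, "unknown"),
    }


if __name__ == "__main__":
    sample = (
        "LustreError: 11613:0:(osp_object.c:355:osp_attr_get()) "
        "lustre-MDT0001-osp-MDT0000:osp_attr_get update error "
        "[0x440000400:0x22:0x0]: rc = -116"
    )
    print(parse_lustre_error(sample))
```

Lines that do not start with LustreError, such as the two "Connection ... lost/restored" messages, return None, so the same function can be used as a filter over a whole log.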

Comment by Nathaniel Clark [ 02/Jun/14 ]

Bruno,
I created LU-5130 for the "destroy remote dir" error.

Comment by Andreas Dilger [ 05/Nov/19 ]

Haven't seen this in a long time.

Generated at Sat Feb 10 01:45:59 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.