[LU-994] 1.8<->2.1.54 Test failure on test suite replay-single 62 Created: 15/Jan/12  Updated: 19/Mar/12  Resolved: 19/Mar/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-1036 1.8.7<->2.1.54 Test failure on test s... Resolved
Severity: 3
Rank (Obsolete): 6488

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/c460154a-3e73-11e1-b417-5254004bbbd3.



 Comments   
Comment by Peter Jones [ 15/Jan/12 ]

Bobi

Could you look into this one please?

Thanks

Peter

Comment by Zhenyu Xu [ 19/Jan/12 ]

Just for the record.

MDS recovery started at 1326480370.284875, with the recovery window set to 60 seconds.

Client replay req x1390912056905666/t304942678018 (LDLM_ENQUEUE) was sent at 1326480370.283516, and its timeout is 62 seconds.

At 1326480430.284992 the MDS closed its recovery window, while the replay req timed out at 1326480432.535461, so the client was evicted and the test failed.

The failure is due to the replay req's timeout extending past the end of the MDS recovery window.
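
For reference, the arithmetic behind the timestamps above can be checked with a small sketch. This is only an illustration: the function and variable names are hypothetical and are not Lustre code, and it assumes the window and the request timeout are simple fixed offsets from the logged start times.

```python
from decimal import Decimal

# Timestamps (seconds since the epoch) and durations quoted in this comment.
RECOVERY_START = Decimal("1326480370.284875")   # MDS recovery start
RECOVERY_WINDOW = Decimal("60")                 # recovery window, seconds
REPLAY_SENT = Decimal("1326480370.283516")      # client replay req sent
REPLAY_TIMEOUT = Decimal("62")                  # replay req timeout, seconds


def compare_deadlines(recovery_start, window, replay_sent, timeout):
    """Return (window_close, replay_deadline, gap).

    gap > 0 means the replay request can still be pending after the
    server has already closed its recovery window.
    """
    window_close = recovery_start + window
    replay_deadline = replay_sent + timeout
    return window_close, replay_deadline, replay_deadline - window_close


window_close, replay_deadline, gap = compare_deadlines(
    RECOVERY_START, RECOVERY_WINDOW, REPLAY_SENT, REPLAY_TIMEOUT
)
print("expected window close:   ", window_close)
print("expected replay deadline:", replay_deadline)
print("replay outlives window by", gap, "seconds")
```

The computed window close is within a millisecond of the logged 1326480430.284992, and the replay deadline lands about 2 seconds after it, which matches the eviction seen in the logs.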

Comment by Zhenyu Xu [ 30/Jan/12 ]

I cannot reproduce it on my VMs.

Sarah, would you mind checking what timeout values are used on the server and client, and whether they match? Is AT (adaptive timeout) enabled on the client side?

Comment by Sarah Liu [ 16/Feb/12 ]

I use the default values for both obd_timeout and ldlm_timeout, so I guess they match? Yes, AT is enabled on the client side. TIMEOUT in my config file is also set to 20.

Comment by Zhenyu Xu [ 16/Feb/12 ]

dup of LU-1036

Comment by Peter Jones [ 17/Mar/12 ]

Are you sure that this is a duplicate of LU-1036? It seems to relate to a different test number (62 rather than 52) and seems to still be occurring in RC1...

Comment by Zhenyu Xu [ 19/Mar/12 ]

It looks like a recovery timeout extension issue, and LU-1036 involves the same issue. LU-889 reworks the recovery timeout extension logic.

Does the RC1 build where this was hit again contain the LU-889 patch?

Comment by Peter Jones [ 19/Mar/12 ]

Yes. http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=8d4b77e5961c06847f9603ebc607118742ea1a51

Comment by Peter Jones [ 19/Mar/12 ]

As per Oleg, OK to close - this appears because the version of 1.8.x being tested against does not contain this fix.

Generated at Sat Feb 10 01:12:29 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.