[LU-3761] sanity-hsm 55: request on 0x2000013a3:0x2:0x0 is not FAILED Created: 15/Aug/13 Updated: 19/Mar/14 Resolved: 19/Mar/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | HSM | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9693 |
| Description |
|
This issue was created by maloo for Li Wei <liwei@whamcloud.com>. This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/b716e4c2-0523-11e3-925a-52540035b04c. The sub-test test_55 failed with the following error:
Info required for matching: sanity-hsm 55
|
| Comments |
| Comment by Jodi Levi (Inactive) [ 16/Aug/13 ] |
|
Jinshan, |
| Comment by Jinshan Xiong (Inactive) [ 20/Aug/13 ] |
|
It turns out the way wait_request_state() is implemented is unreliable. It relies on the HSM llog state to report progress, but if the operation has already finished by the time the client queries the state, the check fails because there is nothing left in agent_actions. We could probably change the HSM llog a little to delay deleting items; that should be good enough for testing purposes. |
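For context, a minimal sketch of the kind of poll wait_request_state() performs (this is not the actual sanity-hsm.sh code; the mdt.*.hsm.actions parameter path and the status= field parsing are assumptions). If the coordinator has already purged the finished entry from agent_actions, no line matches and the wait times out with "request on <fid> is not FAILED":

```sh
# Sketch of a wait_request_state()-style poll; paths and parsing are assumed.
wait_request_state_sketch() {
	local fid=$1 action=$2 expected=$3 i

	for i in $(seq 1 100); do
		# Grep the coordinator's action list for this FID/action and
		# pull out its status= field.
		local state=$(lctl get_param -n mdt.*.hsm.actions 2>/dev/null |
			awk -v fid="$fid" -v act="action=$action" '
				$0 ~ fid && $0 ~ act {
					for (n = 1; n <= NF; n++)
						if ($n ~ /^status=/) {
							sub("status=", "", $n)
							print $n
						}
				}')
		[ "$state" = "$expected" ] && return 0
		sleep 1
	done
	# If the entry was already removed from the llog, $state stays empty
	# and the wait fails even though the request did run.
	return 1
}
```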
| Comment by jacques-charles lafoucriere [ 22/Aug/13 ] |
|
There is already a tunable, cdt_delay, which sets the number of seconds before an llog entry in a finished state (SUCCESS or FAILURE) is removed. The default is 60s; sanity-hsm changes it to 10s, so this value is too small. |
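As a hedged illustration of adjusting that delay (the lctl parameter path below is an assumption; the comment refers to the tunable as cdt_delay, and the exposed name on a given MDT may differ):

```sh
# Assumed parameter path for the coordinator's purge delay for finished
# llog entries; check the actual tunable name on your MDT.
lctl get_param mdt.*.hsm.grace_delay      # query the current delay (seconds)
lctl set_param mdt.*.hsm.grace_delay=60   # restore the 60s default instead of 10s
```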
| Comment by Jinshan Xiong (Inactive) [ 22/Aug/13 ] |
|
Is this parameter for testing purposes only? I can't think of a reason why the coordinator needs to delay deleting an entry. |
| Comment by Oleg Drokin [ 23/Aug/13 ] |
|
So I wonder why this is a blocker; can somebody explain the nature of the problem in some detail, please? |
| Comment by Jinshan Xiong (Inactive) [ 23/Aug/13 ] |
|
I tend to think this is a timing issue: we expect the archive to fail when a truncate is ongoing, but the archive operation may have already finished before the truncate is started. In that case, this test case will fail. I will work out a patch. |
| Comment by jacques-charles lafoucriere [ 25/Aug/13 ] |
|
The POSIX copytool has a --bandwidth parameter which limits the visible bandwidth on reads and writes, so given a file size you have an idea of the archive/restore time. If the file is large enough, the race will disappear. |
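A hedged sketch of that approach (option spellings, paths, and sizes are illustrative; check lhsmtool_posix --help on your build):

```sh
# Cap copytool bandwidth at 1 MB/s so a 16 MB file takes roughly 16s to
# archive, making the truncate-during-archive race reproducible.
lhsmtool_posix --daemon --hsm-root /tmp/arc --archive=1 --bandwidth 1 /mnt/lustre

dd if=/dev/urandom of=/mnt/lustre/bigfile bs=1M count=16
lfs hsm_archive --archive 1 /mnt/lustre/bigfile
# The truncate now reliably lands while the archive is still in flight,
# so the request should end up FAILED as the test expects.
truncate -s 0 /mnt/lustre/bigfile
```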
| Comment by Jinshan Xiong (Inactive) [ 07/Oct/13 ] |
|
This issue should have been fixed at |