[LU-3761] sanity-hsm 55: request on 0x2000013a3:0x2:0x0 is not FAILED Created: 15/Aug/13  Updated: 19/Mar/14  Resolved: 19/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Jinshan Xiong (Inactive)
Resolution: Duplicate Votes: 0
Labels: HSM

Issue Links:
Related
is related to LU-3863 sanity-hsm test_111a: FAIL: request o... Resolved
is related to LU-3969 Test failure on test suite sanity-hsm... Closed
Severity: 3
Rank (Obsolete): 9693

 Description   

This issue was created by maloo for Li Wei <liwei@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/b716e4c2-0523-11e3-925a-52540035b04c.

The sub-test test_55 failed with the following error:

request on 0x2000013a3:0x2:0x0 is not FAILED

Info required for matching: sanity-hsm 55

== sanity-hsm test 55: Truncate during an archive cancels it == 08:22:27 (1376493747)
Purging archive
Starting copytool
lhsmtool_posix --hsm-root /tmp/arc --daemon --bandwidth 1 /mnt/lustre
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.464772 s, 4.5 MB/s
CMD: wtm-10vm7 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | grep 0x2000013a3:0x2:0x0 | grep action=ARCHIVE | cut -f 13 -d ' ' | cut -f 2 -d =
[...]
CMD: wtm-10vm7 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | grep 0x2000013a3:0x2:0x0 | grep action=ARCHIVE | cut -f 13 -d ' ' | cut -f 2 -d =
Update not seen after 100s: wanted 'FAILED' got ''
sanity-hsm test_55: @@@@@@ FAIL: request on 0x2000013a3:0x2:0x0 is not FAILED



 Comments   
Comment by Jodi Levi (Inactive) [ 16/Aug/13 ]

Jinshan,
Can you have a look at this one and upgrade to blocker if needed?
Thank you!

Comment by Jinshan Xiong (Inactive) [ 20/Aug/13 ]

It turns out that the way wait_request_state() is implemented is unreliable. It relies on the HSM llog state to track progress, but if the operation has already finished by the time the client reads the state, the check fails because there is nothing left in agent_actions.

We could probably change the HSM llog slightly to delay deleting items. That should be good enough for testing purposes.
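To illustrate the failure mode, here is a minimal sketch of a wait loop that distinguishes "wanted state seen" from "entry already purged". get_request_state is a stand-in for the real sanity-hsm pipeline (lctl get_param -n mdt.*.hsm.agent_actions | grep ...), simulated with a fixed sequence of llog states followed by the purged (empty) case; the state names and loop shape are assumptions, not the actual test code.

```shell
#!/bin/sh
# Simulated llog state sequence for one request; once past the list,
# the entry has been purged from agent_actions and lookups come back empty.
STATES="WAITING STARTED SUCCEED"

get_request_state() {
    # Field $1 of the simulated state sequence; empty once purged.
    echo "$STATES" | cut -d ' ' -f "$1"
}

wait_request_state() {
    want=$1
    tries=1
    while [ "$tries" -le 10 ]; do
        state=$(get_request_state "$tries")
        if [ "$state" = "$want" ]; then
            echo "reached $want"
            return 0
        fi
        if [ -z "$state" ]; then
            # Entry already removed from agent_actions: the request
            # finished (and was purged) before we saw the wanted state.
            echo "request purged before reaching $want"
            return 1
        fi
        tries=$((tries + 1))
    done
    echo "timed out waiting for $want"
    return 1
}

wait_request_state FAILED
echo "exit=$?"
```

With a 10s purge delay and a slow poll loop, the real test hits exactly the empty-state branch: it wanted 'FAILED' but got ''.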

Comment by jacques-charles lafoucriere [ 22/Aug/13 ]

There is already a tunable, cdt_delay, which sets the number of seconds before an llog entry in a finished state (SUCCESS or FAILURE) is removed. The default is 60s; sanity-hsm changes it to 10s, so that value is too small.
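For illustration, a test could restore a larger delay before polling. This is a sketch only: the exact lctl parameter path below is an assumption built from the cdt_delay name in the comment above (the log only shows the agent_actions path), not taken from the test suite.

```shell
# Sketch, not verified against the Lustre source: raise the retention
# time for finished llog entries back to the 60s default before polling.
lctl set_param mdt.lustre-MDT0000.hsm.cdt_delay=60
```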

Comment by Jinshan Xiong (Inactive) [ 22/Aug/13 ]

Is this parameter for testing purposes only? I can't think of a reason why the coordinator needs to delay deleting an entry.

Comment by Oleg Drokin [ 23/Aug/13 ]

So I wonder why this is a blocker; can somebody explain the nature of the problem in some detail, please?

Comment by Jinshan Xiong (Inactive) [ 23/Aug/13 ]

I tend to think this is a timing issue: we expect the archive to fail while a truncate is ongoing, but the archive operation may have already finished before the truncate starts. In that case, this test case fails.

I will work out a patch.

Comment by jacques-charles lafoucriere [ 25/Aug/13 ]

The POSIX CT has a --bandwidth parameter which limits the visible bandwidth on reads and writes, so given a file size you have an idea of the archive/restore time. If the file is large enough, the race disappears.
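As a rough worked example of that sizing argument (the file size below is a hypothetical choice, not the value used by test_55, and it assumes --bandwidth is in MB/s as suggested by the log's "--bandwidth 1"):

```shell
#!/bin/sh
# Illustrative arithmetic only: lower bound on archive time under the
# lhsmtool_posix --bandwidth cap.
size_bytes=16777216        # hypothetical 16 MiB test file
bandwidth_mbs=1            # --bandwidth 1, assumed to mean 1 MiB/s
min_secs=$((size_bytes / (bandwidth_mbs * 1048576)))
echo "archive takes at least ${min_secs}s"
```

A 16 MiB file at 1 MiB/s keeps the archive in flight for at least 16s, which leaves ample time to start the truncate and observe the cancellation.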

Comment by Jinshan Xiong (Inactive) [ 07/Oct/13 ]

This issue should have been fixed by LU-3815, which addressed a similar issue. Let's close it for now.

Generated at Sat Feb 10 01:36:41 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.