[LU-4093] sanity-hsm test_24d: wanted 'SUCCEED' got 'WAITING' Created: 11/Oct/13  Updated: 15/Sep/14  Resolved: 10/Dec/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: HSM, zfs

Issue Links:
Duplicate
is duplicated by LU-4235 Test failure on test suite sanity-hsm... Resolved
Related
is related to LU-4126 sanity-hsm test_15 failure: 'request... Resolved
Severity: 3
Rank (Obsolete): 10999

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/04c602fa-3258-11e3-9de6-52540035b04c.

The sub-test test_24d failed with the following error:

Cannot send HSM request (use of /mnt/lustre2/d0.sanity-hsm/d24/f.sanity-hsm.24d): Read-only file system
:
:
CMD: wtm-27vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200008101:0x1f:0x0'.*action='ARCHIVE'/ { print \$13 }' | cut -f2 -d=
Update not seen after 100s: wanted 'SUCCEED' got 'WAITING'
sanity-hsm test_24d: @@@@@@ FAIL: request on 0x200008101:0x1f:0x0 is not SUCCEED

Info required for matching: sanity-hsm 24d



 Comments   
Comment by Bruno Faccini (Inactive) [ 21/Oct/13 ]

In fact, it is not only sanity-hsm/test_24d that fails with this "FAIL: request on 0x200008101:0x1f:0x0 is not SUCCEED" error; sub-tests test_26/test_27b/test_28 fail the same way !!

This error always happens during a file archive, even though similar actions in earlier sub-tests were successful.

sanity-hsm/test_33 also failed during the same session, but for a different issue ("request on 0x200000401:0x28:0x0 is not CANCELED") that is addressed by LU-4086.

I tried to set up a test platform (2 clients, single MGS+MDS, single OSS, ZFS-only, running this particular build), but I am currently unable to reproduce the failure.

Comment by Bruno Faccini (Inactive) [ 22/Oct/13 ]

I have been able to reproduce the problem on the local platform described above, by running the "sanity-hsm.sh 24c 24d 26 27b 28" test/sub-tests in a loop. It appears the failure (i.e. requests/actions staying in the "WAITING" state from the moment they are sent and for the whole 100s) is caused by "hsm/max_requests" (default=3) orphan requests that still appear in the "STARTED" action state (and also in "/proc/fs/lustre/mdt/<MDT>/hsm/requests"), preventing new requests from starting.
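
For reference, a minimal sketch of how this blocked state can be inspected on the MDS (the MDT name is just the one from this run, and the exact output format may differ between versions):

lctl get_param mdt.lustre-MDT0000.hsm.max_requests                # coordinator concurrency, default 3
lctl get_param -n mdt.lustre-MDT0000.hsm.actions | grep STARTED   # orphan ARCHIVE actions stuck in STARTED
cat /proc/fs/lustre/mdt/lustre-MDT0000/hsm/requests               # orphan requests still registered with the CDT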

Doing more debugging, I found that the orphan requests come from test_24c: they are the last hsm_archive requests of that sub-test, which does not wait for the requests to complete and calls copytool_cleanup right away, which may cause these last requests to never finish nor be cleared from the log.

Thus every hsm_archive in the following sub-tests (24d/26/27b/28) fails with the same msg/error symptom as in this ticket, until test_40 clears all orphans by calling "cdt_purge" (i.e. "echo purge > /proc/fs/lustre/mdt/<MDT>/hsm_control") and/or the request_timeout expires ...

Will try to add a systematic cdt_shutdown/cdt_enable (i.e. "echo [shutdown,enable] > /proc/fs/lustre/mdt/<MDT>/hsm_control") in copytool_[setup,cleanup]() and see if it fixes the problem as I expect.
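
A rough sketch of that idea (simplified; node/variable names are illustrative and this is not the actual sanity-hsm.sh code; the echo keywords follow the hsm_control interface mentioned above):

copytool_cleanup() {
	# current behaviour: stop the posix copytool on the agent node
	do_node $AGENT_NODE "pkill -INT -x lhsmtool_posix" || true

	# proposed addition: bounce the coordinator so that requests left behind
	# by the killed copytool are dropped immediately instead of lingering in
	# STARTED state until request_timeout expires
	do_node $MDS_NODE "echo shutdown > /proc/fs/lustre/mdt/$MDT/hsm_control"
	do_node $MDS_NODE "echo enabled > /proc/fs/lustre/mdt/$MDT/hsm_control"
}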

Comment by Bruno Faccini (Inactive) [ 23/Oct/13 ]

Found more information that confirms or better explains the earlier findings:

_ in fact cdt_purge only clears actions, not requests ... Thus cdt_[shutdown+enable] (or better, cdt_restart!) is needed to clear orphan requests immediately (the corresponding hsm_control writes are shown for reference just after the log excerpt below).

_ orphan requests are also cleared after a timeout; this is confirmed by the timings in the test suite run that caused this ticket to be created, where I found that at some point (around 1 hour/3600s after test_24c ran and test_24d first hit the WAITING hsm_archive request state), during test_30, hsm_archive requests started to be handled again ... :

.......
 [/usr/bin/lfs] [hsm_archive] [/mnt/lustre/d0.sanity-hsm/d24/f.sanity-hsm.24c]
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 2.39885 s, 874 kB/s
0x200008101:0x1e:0x0
CMD: wtm-21vm3 /usr/sbin/lctl set_param -n mdt.lustre-MDT0000.hsm.other_request_mask
running as uid/gid/euid/egid 500/500/500/500, groups:
 [/usr/bin/lfs] [hsm_archive] [/mnt/lustre/d0.sanity-hsm/d24/f.sanity-hsm.24c]
Cannot send HSM request (use of /mnt/lustre/d0.sanity-hsm/d24/f.sanity-hsm.24c): Operation not permitted
CMD: wtm-21vm3 /usr/sbin/lctl set_param -n mdt.lustre-MDT0000.hsm.other_request_mask=archive
running as uid/gid/euid/egid 500/500/500/500, groups:
 [/usr/bin/lfs] [hsm_archive] [/mnt/lustre/d0.sanity-hsm/d24/f.sanity-hsm.24c]  <<<<<< last archive request in test_24c

CMD: wtm-21vm5 pkill -INT -x lhsmtool_posix  <<<<<< immediate copytool_cleanup
Copytool is stopped on wtm-21vm5
CMD: wtm-21vm3 /usr/sbin/lctl set_param -n mdt.lustre-MDT0000.hsm.user_request_mask=RESTORE
CMD: wtm-21vm3 /usr/sbin/lctl set_param -n mdt.lustre-MDT0000.hsm.group_request_mask=RESTORE
CMD: wtm-21vm3 /usr/sbin/lctl set_param -n mdt.lustre-MDT0000.hsm.other_request_mask=RESTORE
Resetting fail_loc on all nodes...CMD: wtm-21vm3,wtm-21vm4,wtm-21vm5,wtm-21vm6.rosso.whamcloud.com lctl set_param -n fail_loc=0 2>/dev/null || true
done.
CMD: wtm-21vm3,wtm-21vm4,wtm-21vm5 rc=\$([ -f /proc/sys/lnet/catastrophe ] &&
		echo \$(< /proc/sys/lnet/catastrophe) || echo 0);
		if [ \$rc -ne 0 ]; then echo \$(hostname): \$rc; fi
		exit \$rc
PASS 24c (11s)

== sanity-hsm test 24d: check that read-only mounts are respected == 23:38:25 (1381473505)
CMD: wtm-21vm5 pkill -CONT -x lhsmtool_posix
Wakeup copytool agt1 on wtm-21vm5
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 1.79625 s, 1.2 MB/s
Cannot send HSM request (use of /mnt/lustre2/d0.sanity-hsm/d24/f.sanity-hsm.24d): Read-only file system
CMD: wtm-21vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x1f:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Changed after 0s: from '' to 'WAITING'  <<<<< 1st valid archive request in test_24d is WAITING right away!!!
Waiting 100 secs for update
CMD: wtm-21vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x1f:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=

.......

Update not seen after 100s: wanted 'SUCCEED' got 'WAITING'
 sanity-hsm test_24d: @@@@@@ FAIL: request on 0x200008101:0x1f:0x0 is not SUCCEED  <<<<< test_24d fails at 1st archive

.......

Same for test_26/27b/28 (the only subsequent tests with archives ...) at their 1st archive request !!!

.......

== sanity-hsm test 30b: Restore at exec (release case) == 00:38:49 (1381477129)
CMD: wtm-21vm5 pkill -CONT -x lhsmtool_posix
Purging archive on wtm-21vm5
CMD: wtm-21vm5 rm -rf /home/cgearing/.autotest/shared_dir/2013-10-10/082939-69981897718180/arc1/*
Starting copytool agt1 on wtm-21vm5
CMD: wtm-21vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-10-10/082939-69981897718180/arc1
CMD: wtm-21vm5 lhsmtool_posix  --daemon --hsm-root /home/cgearing/.autotest/shared_dir/2013-10-10/082939-69981897718180/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2013-10-10/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18725__-69981897718180-082939/sanity-hsm.test_30b.copytool_log.wtm-21vm5.log 2>&1
CMD: wtm-21vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x24:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Changed after 0s: from '' to 'WAITING'
Waiting 100 secs for update

.......

CMD: wtm-21vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x24:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-21vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x24:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Changed after 96s: from 'WAITING' to 'STARTED'  <<<<<< archives only restarted during test_30b, about 1h/3600s later (the default request_timeout!)
CMD: wtm-21vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x24:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Updated after 97s: wanted 'SUCCEED' got 'SUCCEED'

........
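
For reference (and per the observation above that cdt_purge only clears the actions log), the corresponding hsm_control writes on the MDS look like this, using the MDT name from this run:

echo purge > /proc/fs/lustre/mdt/lustre-MDT0000/hsm_control       # clears actions only; orphan requests remain
echo shutdown > /proc/fs/lustre/mdt/lustre-MDT0000/hsm_control    # a full CDT restart also drops the orphan requests
echo enabled > /proc/fs/lustre/mdt/lustre-MDT0000/hsm_control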

_ what is the best way to fix this? As I already thought and commented, "add a systematic cdt_shutdown/cdt_enable (or cdt_restart) in copytool_cleanup()", or simply ensure the copytool gets enough time to handle its requests before cleanup (like in test_24c)? A sketch of this second option follows the console messages at the end of this comment.

_ also, I need to check whether it is allowed to kill/end the copytool while requests are being processed, and thus whether these requests are expected to stay orphaned until they reach their timeout or the CDT restarts. BTW, running tests in this configuration causes the CDT to complain with these (normal/expected?) msgs:

LustreError: 22856:0:(mdt_coordinator.c:338:mdt_coordinator_cb()) lustrez-MDT0000: Cannot cleanup timeouted request: [0x200000401:0x2590:0x0] for cookie 0x52678bbd action=ARCHIVE
LustreError: 22856:0:(mdt_coordinator.c:338:mdt_coordinator_cb()) Skipped 1 previous similar message
LustreError: 22856:0:(mdt_coordinator.c:1448:mdt_hsm_update_request_state()) lustrez-MDT0000: Cannot find running request for cookie 0x52678e26 on fid=[0x200000401:0x25a2:0x0]
LustreError: 22856:0:(mdt_coordinator.c:1448:mdt_hsm_update_request_state()) Skipped 1 previous similar message
LustreError: 22856:0:(mdt_coordinator.c:338:mdt_coordinator_cb()) lustrez-MDT0000: Cannot cleanup timeouted request: [0x200000401:0x25a2:0x0] for cookie 0x52678e26 action=ARCHIVE
LustreError: 22856:0:(mdt_coordinator.c:338:mdt_coordinator_cb()) Skipped 1 previous similar message
LustreError: 22856:0:(mdt_coordinator.c:1448:mdt_hsm_update_request_state()) lustrez-MDT0000: Cannot find running request for cookie 0x5267908a on fid=[0x200000401:0x25b4:0x0]
LustreError: 22856:0:(mdt_coordinator.c:1448:mdt_hsm_update_request_state()) Skipped 1 previous similar message
LustreError: 22856:0:(mdt_coordinator.c:338:mdt_coordinator_cb()) lustrez-MDT0000: Cannot cleanup timeouted request: [0x200000401:0x25b4:0x0] for cookie 0x5267908a action=ARCHIVE
LustreError: 22856:0:(mdt_coordinator.c:338:mdt_coordinator_cb()) Skipped 1 previous similar message
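
As mentioned in the first question above, a sketch of the second alternative, i.e. letting the coordinator drain its requests before the copytool is stopped (the helper name and 60s limit are illustrative, patterned on the existing wait_* helpers of sanity-hsm.sh):

wait_for_cdt_idle() {
	# poll the actions log on the MDS until no request is left in
	# WAITING or STARTED state, or give up after 60s
	local i
	for i in $(seq 60); do
		do_facet $SINGLEMDS "$LCTL get_param -n mdt.*.hsm.actions" |
			grep -qE 'status=(WAITING|STARTED)' || return 0
		sleep 1
	done
	return 1
}
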
Comment by Bruno Faccini (Inactive) [ 04/Nov/13 ]

The fix to prevent zombie requests during CT/copytool restart is at http://review.whamcloud.com/8157.

Comment by Andreas Dilger [ 18/Nov/13 ]

Patch 8157 has landed on 2013-11-13, but I still see tests failing with LU-4093/LU-4126 in the past few days. Is that just because the test queue is so long that the results we are seeing today are for patches that were based on code not including the fix?

If it can be verified that the parent commit of recent test failures does NOT include change 8157, then I guess this bug can be closed. From the results it looks like LU-4093 shows up as lustre-hsm passing only 73/91 tests. There are still cases with 90/91 tests passing, so those need to be a separate bug.

Comment by Bruno Faccini (Inactive) [ 19/Nov/13 ]

Humm, thanks Andreas for chasing this. It seems the original patch #8157 (patch-set #3) has a typo (MDT_HSMCTRL vs mdt_hsmctrl) and an inverted test ([echo $oldstate | grep stop || continue] vs [echo $oldstate | grep stop && continue]) that prevent it from working as expected and in fact give it reversed logic (only an already-stopped CDT will be stopped) …
The fact that the patch itself passed auto-tests could be because the condition it is intended to fix did not show up during its own testing !!
Just pushed http://review.whamcloud.com/8329 to fix this.
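
To make the inverted test concrete, a simplified version of the intended loop (single-MDT setup assumed; the names here are illustrative rather than the exact patch code):

for facet in $mdt_facets; do
	oldstate=$(do_facet $facet "cat /proc/fs/lustre/mdt/*/hsm_control")
	# if the CDT is already stopped there is nothing to shut down;
	# patch-set #3 had "|| continue" here, which reverses the test so
	# that only already-stopped coordinators were acted upon
	echo $oldstate | grep -q stop && continue
	do_facet $facet "echo shutdown > /proc/fs/lustre/mdt/*/hsm_control"
done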

Comment by Bruno Faccini (Inactive) [ 28/Nov/13 ]

http://review.whamcloud.com/8329 has just landed, so we need to wait a week or two to verify its effect and close.

Comment by Bruno Faccini (Inactive) [ 10/Dec/13 ]

No more related failures have been reported for master builds since Nov 18th. So, as expected, the 2 changes for LU-4093 did the trick.

Comment by James Nunez (Inactive) [ 23/Apr/14 ]

It looks like I just hit this issue with b2_5 in review-zfs at https://maloo.whamcloud.com/test_sets/e11b4944-c822-11e3-888b-52540035b04c

Comment by Bruno Faccini (Inactive) [ 13/Sep/14 ]

Hello Jian, humm, this looks different I think, at least because in all the cases you noticed, the copytool_log is not present for test_24d !

And looking closely at the logs/msgs, "Wakeup copytool agt1 on ..." should not be printed after a successful copytool_cleanup (from the previous sub-test) followed by copytool_setup (within the new sub-test), because the implied "pkill" command should have failed if no copytool thread from the previous sub-test could be found after its stop/kill in copytool_cleanup.
So a possible scenario is that some copytool thread from the previous test_24c sub-test is stuck and still not terminated at the time test_24d tries to restart the copytool ... but it finally dies, and we may in fact end up running test_24d with no copytool started at all !

So this should be addressed in a new ticket, and the fix should be to ensure the full death of ALL copytool threads in copytool_cleanup().
If you agree, I will create a ticket, assign it to myself, add you to the watchers list, and push a fix soon.
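
A minimal sketch of such a cleanup wait (node names and the 30s limit are illustrative, not the eventual LU-5622 patch):

copytool_cleanup() {
	do_node $AGENT_NODE "pkill -INT -x lhsmtool_posix" || true
	# poll until no copytool process is left on the agent node, so the
	# next sub-test cannot race with a copytool that is still dying
	local i
	for i in $(seq 30); do
		do_node $AGENT_NODE "pgrep -x lhsmtool_posix" > /dev/null || return 0
		sleep 1
	done
	error "copytool still running on $AGENT_NODE after 30s"
}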

Comment by Jian Yu [ 14/Sep/14 ]

If you agree, I will create a ticket, assign it to myself, add you to the watchers list, and push a fix soon.

Thank you, Bruno. Please do.

Comment by Bruno Faccini (Inactive) [ 15/Sep/14 ]

Just created LU-5622 to address this new problem.

Just want to add that the fact it seems to occur more frequently at the switch between the test_24c and test_24d sub-tests may come from the specific/enhanced cleanup actions in test_24c ...

Also, I think we need to change the link of the failures you listed from LU-4093 to LU-5622, what do you think ?

Comment by Jian Yu [ 15/Sep/14 ]

Also, I think we need to change the link of the failures you listed from LU-4093 to LU-5622, what do you think ?

Agreed. Just changed. Thank you, Bruno.
