[LU-4093] sanity-hsm test_24d: wanted 'SUCCEED' got 'WAITING' Created: 11/Oct/13 Updated: 15/Sep/14 Resolved: 10/Dec/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.1 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HSM, zfs | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 10999 |
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>. This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/04c602fa-3258-11e3-9de6-52540035b04c. The sub-test test_24d failed with the following error: wanted 'SUCCEED' got 'WAITING'.
Info required for matching: sanity-hsm 24d |
| Comments |
| Comment by Bruno Faccini (Inactive) [ 21/Oct/13 ] |
|
In fact it is not only sanity-hsm/test_24d that fails with this "FAIL: request on 0x200008101:0x1f:0x0 is not SUCCEED" error, but also sub-tests test_26/test_27b/test_28!! The error always happens during a file archive, even though similar actions in earlier sub-tests were successful. sanity-hsm/test_33 also failed during the same session, for an issue ("request on 0x200000401:0x28:0x0 is not CANCELED") that is addressed in a separate ticket. I tried to set up a test platform (2 clients, single MGS+MDS, single OSS, ZFS-only, running this particular build), but I am currently unable to reproduce. |
| Comment by Bruno Faccini (Inactive) [ 22/Oct/13 ] |
|
I have been able to reproduce the problem on the local platform I described before, by running the "sanity-hsm.sh 24c 24d 26 27b 28" test/sub-tests in a loop. It appears the failure (i.e. requests/actions staying in the "WAITING" state from the moment they are sent and for the whole 100s wait) is caused by "hsm/max_requests" (default=3) orphan requests which still appear in the "STARTED" state (and also in "/proc/fs/lustre/mdt/<MDT>/hsm/requests"), preventing new requests from starting. Doing more debugging, I found that the orphan requests come from test_24c: they are the last hsm_archive requests of that sub-test, which does not wait for the end of the request and calls copytool_cleanup right away, which may cause this last request to never end nor be cleared from the log. Thus every following hsm_archive in the next sub-tests (24d/26/27b/28) fails with the same error symptom as in this ticket, until test_40 clears all orphans by calling "cdt_purge" (i.e. "echo purge > /proc/fs/lustre/mdt/<MDT>/hsm_control") and/or request_timeout expires ... Will try to add a systematic cdt_shutdown/cdt_enable (i.e. "echo [shutdown,enable] > /proc/fs/lustre/mdt/<MDT>/hsm_control") in copytool_[setup,cleanup]() and see if it fixes the issue as I expect; a rough sketch of that idea follows. |
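To illustrate the idea, here is a minimal sketch of what the copytool_cleanup() change could look like. It assumes the standard test-framework do_facet/do_node and $LCTL helpers; the $AGT_HOST variable and the helper bodies are illustrative, not the actual sanity-hsm.sh code.

cdt_shutdown() {
    # stop the HSM coordinator on the MDT, same effect as
    # "echo shutdown > /proc/fs/lustre/mdt/<MDT>/hsm_control"
    do_facet $SINGLEMDS $LCTL set_param mdt.*.hsm_control=shutdown
}

cdt_enable() {
    # start it again with a clean request list
    do_facet $SINGLEMDS $LCTL set_param mdt.*.hsm_control=enabled
}

copytool_cleanup() {
    do_node $AGT_HOST "pkill -INT -x lhsmtool_posix"
    # restarting the coordinator drops the orphan STARTED/WAITING requests
    # left behind by the killed copytool
    cdt_shutdown
    cdt_enable
}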
| Comment by Bruno Faccini (Inactive) [ 23/Oct/13 ] |
|
Found more info that confirms or better explains the earlier findings:
_ in fact cdt_purge only clears actions but not requests ... Thus a cdt_[shutdown+enable] (or better, a cdt_restart!) is needed to clear orphan requests immediately.
_ orphan requests can also clear after a timeout; this is confirmed by the timings found in the test suite run that caused this ticket to be created, where I found that at some point (around 1 hour/3600s after test_24c ran and test_24d first triggered the WAITING hsm_archive request state), during test_30, hsm_archive requests started to be handled again ... :
.......
[/usr/bin/lfs] [hsm_archive] [/mnt/lustre/d0.sanity-hsm/d24/f.sanity-hsm.24c]
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 2.39885 s, 874 kB/s
0x200008101:0x1e:0x0
CMD: wtm-21vm3 /usr/sbin/lctl set_param -n mdt.lustre-MDT0000.hsm.other_request_mask
running as uid/gid/euid/egid 500/500/500/500, groups:
[/usr/bin/lfs] [hsm_archive] [/mnt/lustre/d0.sanity-hsm/d24/f.sanity-hsm.24c]
Cannot send HSM request (use of /mnt/lustre/d0.sanity-hsm/d24/f.sanity-hsm.24c): Operation not permitted
CMD: wtm-21vm3 /usr/sbin/lctl set_param -n mdt.lustre-MDT0000.hsm.other_request_mask=archive
running as uid/gid/euid/egid 500/500/500/500, groups:
[/usr/bin/lfs] [hsm_archive] [/mnt/lustre/d0.sanity-hsm/d24/f.sanity-hsm.24c] <<<<<< last archive request in test_24c
CMD: wtm-21vm5 pkill -INT -x lhsmtool_posix <<<<<< immediate copytool_cleanup
Copytool is stopped on wtm-21vm5
CMD: wtm-21vm3 /usr/sbin/lctl set_param -n mdt.lustre-MDT0000.hsm.user_request_mask=RESTORE
CMD: wtm-21vm3 /usr/sbin/lctl set_param -n mdt.lustre-MDT0000.hsm.group_request_mask=RESTORE
CMD: wtm-21vm3 /usr/sbin/lctl set_param -n mdt.lustre-MDT0000.hsm.other_request_mask=RESTORE
Resetting fail_loc on all nodes...CMD: wtm-21vm3,wtm-21vm4,wtm-21vm5,wtm-21vm6.rosso.whamcloud.com lctl set_param -n fail_loc=0 2>/dev/null || true
done.
CMD: wtm-21vm3,wtm-21vm4,wtm-21vm5 rc=\$([ -f /proc/sys/lnet/catastrophe ] &&
echo \$(< /proc/sys/lnet/catastrophe) || echo 0);
if [ \$rc -ne 0 ]; then echo \$(hostname): \$rc; fi
exit \$rc
PASS 24c (11s)
== sanity-hsm test 24d: check that read-only mounts are respected == 23:38:25 (1381473505)
CMD: wtm-21vm5 pkill -CONT -x lhsmtool_posix
Wakeup copytool agt1 on wtm-21vm5
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 1.79625 s, 1.2 MB/s
Cannot send HSM request (use of /mnt/lustre2/d0.sanity-hsm/d24/f.sanity-hsm.24d): Read-only file system
CMD: wtm-21vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x1f:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Changed after 0s: from '' to 'WAITING' <<<<< 1st valid archive request in test_24d goes straight to WAITING!!!
Waiting 100 secs for update
CMD: wtm-21vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x1f:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
.......
Update not seen after 100s: wanted 'SUCCEED' got 'WAITING'
sanity-hsm test_24d: @@@@@@ FAIL: request on 0x200008101:0x1f:0x0 is not SUCCEED <<<<< test_24d fails at 1st archive
.......
Same for test_26/27b/28 (the only following tests with archives...) at their 1st archive request!!!
.......
== sanity-hsm test 30b: Restore at exec (release case) == 00:38:49 (1381477129)
CMD: wtm-21vm5 pkill -CONT -x lhsmtool_posix
Purging archive on wtm-21vm5
CMD: wtm-21vm5 rm -rf /home/cgearing/.autotest/shared_dir/2013-10-10/082939-69981897718180/arc1/*
Starting copytool agt1 on wtm-21vm5
CMD: wtm-21vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-10-10/082939-69981897718180/arc1
CMD: wtm-21vm5 lhsmtool_posix --daemon --hsm-root /home/cgearing/.autotest/shared_dir/2013-10-10/082939-69981897718180/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2013-10-10/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18725__-69981897718180-082939/sanity-hsm.test_30b.copytool_log.wtm-21vm5.log 2>&1
CMD: wtm-21vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x24:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Changed after 0s: from '' to 'WAITING'
Waiting 100 secs for update
.......
CMD: wtm-21vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x24:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-21vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x24:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Changed after 96s: from 'WAITING' to 'STARTED' <<<<<< archives only restarted during test_30b, about 1h/3600s later (the default timeout!)
CMD: wtm-21vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x24:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Updated after 97s: wanted 'SUCCEED' got 'SUCCEED'
........
_ what is the best way to fix this? As I already thought and commented, "add a systematic cdt_shutdown/cdt_enable (or cdt_restart) in copytool_cleanup()", or only ensure the copytool gets enough time to handle its requests before cleanup (like in test_24c)? A sketch of this second option follows below.
_ also I need to check whether it is allowed to kill/end the copytool while requests are being processed, and thus whether these requests are expected to stay orphaned until they reach their timeout or the CDT restarts?
BTW, running tests in this configuration causes the CDT to complain with these (normal/expected?) msgs:
LustreError: 22856:0:(mdt_coordinator.c:338:mdt_coordinator_cb()) lustrez-MDT0000: Cannot cleanup timeouted request: [0x200000401:0x2590:0x0] for cookie 0x52678bbd action=ARCHIVE
LustreError: 22856:0:(mdt_coordinator.c:338:mdt_coordinator_cb()) Skipped 1 previous similar message
LustreError: 22856:0:(mdt_coordinator.c:1448:mdt_hsm_update_request_state()) lustrez-MDT0000: Cannot find running request for cookie 0x52678e26 on fid=[0x200000401:0x25a2:0x0]
LustreError: 22856:0:(mdt_coordinator.c:1448:mdt_hsm_update_request_state()) Skipped 1 previous similar message
LustreError: 22856:0:(mdt_coordinator.c:338:mdt_coordinator_cb()) lustrez-MDT0000: Cannot cleanup timeouted request: [0x200000401:0x25a2:0x0] for cookie 0x52678e26 action=ARCHIVE
LustreError: 22856:0:(mdt_coordinator.c:338:mdt_coordinator_cb()) Skipped 1 previous similar message
LustreError: 22856:0:(mdt_coordinator.c:1448:mdt_hsm_update_request_state()) lustrez-MDT0000: Cannot find running request for cookie 0x5267908a on fid=[0x200000401:0x25b4:0x0]
LustreError: 22856:0:(mdt_coordinator.c:1448:mdt_hsm_update_request_state()) Skipped 1 previous similar message
LustreError: 22856:0:(mdt_coordinator.c:338:mdt_coordinator_cb()) lustrez-MDT0000: Cannot cleanup timeouted request: [0x200000401:0x25b4:0x0] for cookie 0x5267908a action=ARCHIVE
LustreError: 22856:0:(mdt_coordinator.c:338:mdt_coordinator_cb()) Skipped 1 previous similar message |
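For the second option, here is a minimal sketch of waiting for the last archive request to reach a terminal state before calling copytool_cleanup. The wait_request_state helper shown here is illustrative (it just polls mdt.*.hsm.agent_actions the same way the test logs above do); it is not necessarily the helper actually used by sanity-hsm.sh.

wait_request_state() {
    # poll the coordinator action list until the request on $fid reaches
    # the expected state, or give up after 100s (like the test logs above)
    local fid=$1 request=$2 expected=$3
    local i state
    for i in $(seq 100); do
        state=$(do_facet $SINGLEMDS "$LCTL get_param -n mdt.*.hsm.agent_actions" |
            awk -v fid="$fid" -v req="$request" \
                '$0 ~ fid && $0 ~ "action="req {print $13}' | cut -f2 -d=)
        [ "$state" = "$expected" ] && return 0
        sleep 1
    done
    return 1
}

# e.g. at the end of test_24c, before copytool_cleanup:
wait_request_state $fid ARCHIVE SUCCEED ||
    error "archive request on $fid is not SUCCEED"
copytool_cleanup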
| Comment by Bruno Faccini (Inactive) [ 04/Nov/13 ] |
|
A fix to prevent zombie requests across CT/copytool restarts is at http://review.whamcloud.com/8157. |
| Comment by Andreas Dilger [ 18/Nov/13 ] |
|
Patch 8157 landed on 2013-11-13, but I still see tests failing with this error. If it can be verified that the parent commit of the recent test failures does NOT include change 8157, then I guess this bug can be closed. It looks from the results that ... |
| Comment by Bruno Faccini (Inactive) [ 19/Nov/13 ] |
|
Humm, thanks Andreas for chasing this. It seems the original patch #8157 (patch-set #3) has a typo (MDT_HSMCTRL vs mdt_hsmctrl) and an inverted test ([echo $oldstate | grep stop || continue] vs [echo $oldstate | grep stop && continue]) that prevent it from working as expected, and instead give it reverted logic (only an already-stopped CDT will be stopped), as sketched below …
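For clarity, here is the shape of the inverted test; the snippet is illustrative only (the $oldstate variable comes from the comment above, and the surrounding loop is assumed, not copied from the patch):

# buggy form from patch-set #3: skip the node unless the CDT is already
# stopped, so only an already-stopped CDT ever gets stopped
echo $oldstate | grep stop || continue

# intended form: skip the node when the CDT is already stopped,
# otherwise fall through and shut it down
echo $oldstate | grep stop && continue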
| Comment by Bruno Faccini (Inactive) [ 28/Nov/13 ] |
|
http://review.whamcloud.com/8329 has just landed, so we need to wait a week or two to verify its effect and then close. |
| Comment by Bruno Faccini (Inactive) [ 10/Dec/13 ] |
|
No more related failures for master builds have been reported since Nov 18th. So, as expected, the 2 changes for this ticket have fixed the problem. |
| Comment by James Nunez (Inactive) [ 23/Apr/14 ] |
|
It looks like I just hit this issue with b2_5 in review-zfs at https://maloo.whamcloud.com/test_sets/e11b4944-c822-11e3-888b-52540035b04c |
| Comment by Bruno Faccini (Inactive) [ 13/Sep/14 ] |
|
Hello Jian, hmm, this looks different I think, at least because in all the cases you noticed, the copytool_log is not present for test_24d! And if I look closely at the logs/msgs, "Wakeup copytool agt1 on ..." should not be printed after a successful copytool_cleanup (from the previous sub-test) followed by copytool_setup (within the new sub-test), because the implied "pkill" command should have failed if no copytool thread from the previous sub-test could still be found after its stop/kill in copytool_cleanup. So this should be addressed in a new ticket, and the fix should be to ensure the full death of ALL copytool threads in copytool_cleanup(); a rough sketch of that idea follows. |
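A minimal sketch of what that could look like, assuming the usual do_node helper and a hypothetical $AGT_HOST variable (this is not the actual fix from the follow-up ticket):

copytool_cleanup() {
    do_node $AGT_HOST "pkill -INT -x lhsmtool_posix"
    # wait until no copytool process is left, so the next copytool_setup
    # really starts from a clean state
    local i
    for i in $(seq 10); do
        do_node $AGT_HOST "pgrep -x lhsmtool_posix" || return 0
        sleep 1
    done
    # escalate if some copytool thread refuses to die
    do_node $AGT_HOST "pkill -KILL -x lhsmtool_posix"
}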
| Comment by Jian Yu [ 14/Sep/14 ] |
Thank you, Bruno. Please do. |
| Comment by Bruno Faccini (Inactive) [ 15/Sep/14 ] |
|
Just created the new ticket. I just want to add that the fact it seems to occur more frequently at the switch between the test_24c and test_24d sub-tests may come from the specific/enhanced cleanup actions in test_24c ... Also, I think we need to change the link of the failures you listed from this ticket to the new one. |
| Comment by Jian Yu [ 15/Sep/14 ] |
Agreed. Just changed. Thank you, Bruno. |