[LU-4086] Test failure on test suite sanity-hsm, subtest test_33 Created: 10/Oct/13 Updated: 08/Oct/14 Resolved: 21/Apr/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HSM | ||
| Severity: | 3 |
| Rank (Obsolete): | 10982 |
| Description |
|
This issue was created by maloo for John Hammond <john.hammond@intel.com> This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/78b59cb0-3123-11e3-8ec6-52540035b04c. The sub-test test_33 failed with the following error:
Info required for matching: sanity-hsm 33 |
| Comments |
| Comment by Jian Yu [ 31/Oct/13 ] |
|
Another instance: https://maloo.whamcloud.com/test_sets/14278fd4-40fe-11e3-a1e8-52540035b04c |
| Comment by Bruno Faccini (Inactive) [ 09/Nov/13 ] |
|
+1: https://maloo.whamcloud.com/test_sets/c4f891ea-4924-11e3-a79b-52540035b04c |
| Comment by Bruno Faccini (Inactive) [ 09/Nov/13 ] |
|
And according to the copytool log : lhsmtool_posix[18608]: '[0x200000401:0x29:0x0]' action CANCEL reclen 72, cookie=0x527df027 lhsmtool_posix[18608]: processing file 'd0.sanity-hsm/d33/f.sanity-hsm.33' lhsmtool_posix[18608]: cancel not implemented for file system '/mnt/lustre' which may be a good start to explain why the RESTORE request never reached the CANCELED state !!… Also interesting, the "md5sum" command intended to start the implicit RESTORE has not been killed as expected too : CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Updated after 34s: wanted 'SUCCEED' got 'SUCCEED'
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Changed after 0s: from '' to 'STARTED'
Waiting 100 secs for update
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 90 secs for update
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 80 secs for update
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 70 secs for update
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
/usr/lib64/lustre/tests/test-framework.sh: line 1904: 21375 Terminated md5sum $f > /dev/null
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Changed after 32s: from 'STARTED' to 'SUCCEED'
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 60 secs for update
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Changed after 49s: from 'SUCCEED' to ''
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 50 secs for update
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 40 secs for update
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 30 secs for update
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 20 secs for update
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 10 secs for update
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-26vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x29:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Update not seen after 100s: wanted 'CANCELED' got ''
sanity-hsm test_33: @@@@@@ FAIL: request on 0x200000401:0x29:0x0 is not CANCELED
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4289:error_noexit()
……..
But finally, since ps command output is trashed, is it really expected to be ?? For the record/history and to highlight my previous comments here is current sanity-hsm/test_33() sub-test code : test_33() {
# test needs a running copytool
copytool_setup
mkdir -p $DIR/$tdir
local f=$DIR/$tdir/$tfile
local fid=$(make_large_for_progress $f)
$LFS hsm_archive --archive $HSM_ARCHIVE_NUMBER $f
wait_request_state $fid ARCHIVE SUCCEED
$LFS hsm_release $f
md5sum $f >/dev/null &
local pid=$!
wait_request_state $fid RESTORE STARTED
kill -15 $pid
sleep 1
# Check restore trigger process was killed
local killed=$(ps -o pid,comm hp $pid >/dev/null)
$LFS hsm_cancel $f
wait_request_state $fid RESTORE CANCELED
wait_request_state $fid CANCEL SUCCEED
[ -z $killed ] ||
error "Cannot kill process waiting for restore ($killed)"
copytool_cleanup
}
run_test 33 "Kill a restore waiting process"
|
| Comment by jacques-charles lafoucriere [ 11/Nov/13 ] |
|
The logic of unsupported CANCEl is a little bit tricky: |
| Comment by Bruno Faccini (Inactive) [ 26/Nov/13 ] |
|
J-C, From https://maloo.whamcloud.com/test_sets/78b59cb0-3123-11e3-8ec6-52540035b04c : lhsmtool_posix[23890]: waiting for message from kernel
lhsmtool_posix[23893]: '[0x200000401:0x28:0x0]' action RESTORE reclen 72, cookie=0x52556a97 <<<< RESTORE request received
lhsmtool_posix[23893]: processing file 'd0.sanity-hsm/d33/f.sanity-hsm.33'
lhsmtool_posix[23893]: reading stripe rules from '/home/autotest/.autotest/shared_dir/2013-10-08/203349-69989071297660/arc1/0028/0000/0401/0000/0002/0000/0x200000401:0x28:0x0.lov' for '/home/autotest/.autotest/shared_dir/2013-10-08/203349-69989071297660/arc1/0028/0000/0401/0000/0002/0000/0x200000401:0x28:0x0'
lhsmtool_posix[23893]: restoring data from '/home/autotest/.autotest/shared_dir/2013-10-08/203349-69989071297660/arc1/0028/0000/0401/0000/0002/0000/0x200000401:0x28:0x0' to '{VOLATILE}=[0x200000402:0xf3:0x0]'
lhsmtool_posix[23893]: going to copy data from '/home/autotest/.autotest/shared_dir/2013-10-08/203349-69989071297660/arc1/0028/0000/0401/0000/0002/0000/0x200000401:0x28:0x0' to '{VOLATILE}=[0x200000402:0xf3:0x0]'
lhsmtool_posix[23893]: bandwith control: excess=1.251264E+06 sleep for 1193298us
lhsmtool_posix[23893]: bandwith control: excess=1.251264E+06 sleep for 1193298us
lhsmtool_posix[23893]: bandwith control: excess=1.251264E+06 sleep for 1193298us
lhsmtool_posix[23893]: %83
lhsmtool_posix[23893]: bandwith control: excess=1.251264E+06 sleep for 1193298us
lhsmtool_posix[23893]: data restore from '/home/autotest/.autotest/shared_dir/2013-10-08/203349-69989071297660/arc1/0028/0000/0401/0000/0002/0000/0x200000401:0x28:0x0' to '{VOLATILE}=[0x200000402:0xf3:0x0]' done
lhsmtool_posix[23893]: Action completed, notifying coordinator cookie=0x52556a97, FID=[0x200000401:0x28:0x0], hp_flags=0 err=0
lhsmtool_posix[23890]: copytool fs=lustre archive#=2 item_count=1
lhsmtool_posix[23890]: waiting for message from kernel
lhsmtool_posix[23893]: llapi_hsm_action_end() on '/mnt/lustre/.lustre/fid/0x200000401:0x28:0x0' ok (rc=0) <<<< RETORE done
lhsmtool_posix[23898]: '[0x200000401:0x28:0x0]' action CANCEL reclen 72, cookie=0x52556a97 <<<< CANCEL request received
lhsmtool_posix[23898]: processing file 'd0.sanity-hsm/d33/f.sanity-hsm.33'
lhsmtool_posix[23898]: cancel not implemented for file system '/mnt/lustre'
exiting: Interrupt
From https://maloo.whamcloud.com/test_sets/14278fd4-40fe-11e3-a1e8-52540035b04c: lhsmtool_posix[2462]: waiting for message from kernel
lhsmtool_posix[2466]: '[0x2000088d1:0x29:0x0]' action RESTORE reclen 72, cookie=0x527042b9 <<<< RESTORE request received
lhsmtool_posix[2466]: processing file 'd0.sanity-hsm/d33/f.sanity-hsm.33'
lhsmtool_posix[2466]: reading stripe rules from '/home/cgearing/.autotest/shared_dir/2013-10-28/222039-69843390548440/arc1/0029/0000/88d1/0000/0002/0000/0x2000088d1:0x29:0x0.lov' for '/home/cgearing/.autotest/shared_dir/2013-10-28/222039-69843390548440/arc1/0029/0000/88d1/0000/0002/0000/0x2000088d1:0x29:0x0'
lhsmtool_posix[2466]: restoring data from '/home/cgearing/.autotest/shared_dir/2013-10-28/222039-69843390548440/arc1/0029/0000/88d1/0000/0002/0000/0x2000088d1:0x29:0x0' to '{VOLATILE}=[0x2000088d2:0xf6:0x0]'
lhsmtool_posix[2466]: going to copy data from '/home/cgearing/.autotest/shared_dir/2013-10-28/222039-69843390548440/arc1/0029/0000/88d1/0000/0002/0000/0x2000088d1:0x29:0x0' to '{VOLATILE}=[0x2000088d2:0xf6:0x0]'
lhsmtool_posix[2466]: bandwith control: excess=1.251264E+06 sleep for 1193298us
lhsmtool_posix[2466]: bandwith control: excess=1.251264E+06 sleep for 1193298us
lhsmtool_posix[2466]: bandwith control: excess=1.251264E+06 sleep for 1193298us
lhsmtool_posix[2466]: %86
lhsmtool_posix[2466]: bandwith control: excess=1.251264E+06 sleep for 1193298us
lhsmtool_posix[2466]: data restore from '/home/cgearing/.autotest/shared_dir/2013-10-28/222039-69843390548440/arc1/0029/0000/88d1/0000/0002/0000/0x2000088d1:0x29:0x0' to '{VOLATILE}=[0x2000088d2:0xf6:0x0]' done
lhsmtool_posix[2466]: Action completed, notifying coordinator cookie=0x527042b9, FID=[0x2000088d1:0x29:0x0], hp_flags=0 err=0
lhsmtool_posix[2466]: llapi_hsm_action_end() on '/mnt/lustre/.lustre/fid/0x2000088d1:0x29:0x0' ok (rc=0) <<<< RESTORE done
lhsmtool_posix[2462]: copytool fs=lustre archive#=2 item_count=1
lhsmtool_posix[2462]: waiting for message from kernel
lhsmtool_posix[2469]: '[0x2000088d1:0x29:0x0]' action CANCEL reclen 72, cookie=0x527042b9 <<<< CANCEL request received
lhsmtool_posix[2469]: processing file 'd0.sanity-hsm/d33/f.sanity-hsm.33'
lhsmtool_posix[2469]: cancel not implemented for file system '/mnt/lustre'
exiting: Interrupt
From https://maloo.whamcloud.com/test_sets/c4f891ea-4924-11e3-a79b-52540035b04c : lhsmtool_posix[18601]: copytool fs=lustre archive#=2 item_count=1
lhsmtool_posix[18601]: waiting for message from kernel
lhsmtool_posix[18606]: '[0x200000401:0x29:0x0]' action RESTORE reclen 72, cookie=0x527df027 <<<< RESTORE request received
lhsmtool_posix[18606]: processing file 'd0.sanity-hsm/d33/f.sanity-hsm.33'
lhsmtool_posix[18606]: reading stripe rules from '/home/autotest/.autotest/shared_dir/2013-11-08/154450-70330946069980/arc1/0029/0000/0401/0000/0002/0000/0x200000401:0x29:0x0.lov' for '/home/autotest/.autotest/shared_dir/2013-11-08/154450-70330946069980/arc1/0029/0000/0401/0000/0002/0000/0x200000401:0x29:0x0'
lhsmtool_posix[18606]: restoring data from '/home/autotest/.autotest/shared_dir/2013-11-08/154450-70330946069980/arc1/0029/0000/0401/0000/0002/0000/0x200000401:0x29:0x0' to '{VOLATILE}=[0x200000402:0xf4:0x0]'
lhsmtool_posix[18606]: going to copy data from '/home/autotest/.autotest/shared_dir/2013-11-08/154450-70330946069980/arc1/0029/0000/0401/0000/0002/0000/0x200000401:0x29:0x0' to '{VOLATILE}=[0x200000402:0xf4:0x0]'
lhsmtool_posix[18606]: bandwith control: excess=2.026880E+05 sleep for 193298us
lhsmtool_posix[18606]: bandwith control: excess=1.251264E+06 sleep for 1193298us
lhsmtool_posix[18606]: bandwith control: excess=2.026880E+05 sleep for 193298us
lhsmtool_posix[18606]: bandwith control: excess=1.251264E+06 sleep for 1193298us
lhsmtool_posix[18606]: bandwith control: excess=2.026880E+05 sleep for 193298us
lhsmtool_posix[18606]: bandwith control: excess=1.251264E+06 sleep for 1193298us
lhsmtool_posix[18606]: %83
lhsmtool_posix[18606]: bandwith control: excess=2.026880E+05 sleep for 193298us
lhsmtool_posix[18606]: bandwith control: excess=1.251264E+06 sleep for 1193298us
lhsmtool_posix[18606]: data restore from '/home/autotest/.autotest/shared_dir/2013-11-08/154450-70330946069980/arc1/0029/0000/0401/0000/0002/0000/0x200000401:0x29:0x0' to '{VOLATILE}=[0x200000402:0xf4:0x0]' done
lhsmtool_posix[18606]: Action completed, notifying coordinator cookie=0x527df027, FID=[0x200000401:0x29:0x0], hp_flags=0 err=0
lhsmtool_posix[18601]: copytool fs=lustre archive#=2 item_count=1
lhsmtool_posix[18601]: waiting for message from kernel
lhsmtool_posix[18606]: llapi_hsm_action_end() on '/mnt/lustre/.lustre/fid/0x200000401:0x29:0x0' ok (rc=0) <<<< RESTORE done
lhsmtool_posix[18608]: '[0x200000401:0x29:0x0]' action CANCEL reclen 72, cookie=0x527df027 <<<< CANCEL request received
lhsmtool_posix[18608]: processing file 'd0.sanity-hsm/d33/f.sanity-hsm.33'
lhsmtool_posix[18608]: cancel not implemented for file system '/mnt/lustre'
exiting: Interrupt
When copytool log for a successful run of test_33 looks more like : lhsmtool_posix[21009]: waiting for message from kernel
lhsmtool_posix[21011]: '[0x200000401:0x29:0x0]' action RESTORE reclen 72, cookie=0x52904448 <<<< RESTORE request received
lhsmtool_posix[21011]: processing file 'd0.sanity-hsm/d33/f.sanity-hsm.33'
lhsmtool_posix[21011]: reading stripe rules from '/home/cgearing/.autotest/shared_dir/2013-11-22/125116-70304641718120/arc1/0029/0000/0401/0000/0002/0000/0x200000401:0x29:0x0.lov' for '/home/cgearing/.autotest/shared_dir/2013-11-22/125116-70304641718120/arc1/0029/0000/0401/0000/0002/0000/0x200000401:0x29:0x0'
lhsmtool_posix[21011]: restoring data from '/home/cgearing/.autotest/shared_dir/2013-11-22/125116-70304641718120/arc1/0029/0000/0401/0000/0002/0000/0x200000401:0x29:0x0' to '{VOLATILE}=[0x200000402:0xf4:0x0]'
lhsmtool_posix[21011]: going to copy data from '/home/cgearing/.autotest/shared_dir/2013-11-22/125116-70304641718120/arc1/0029/0000/0401/0000/0002/0000/0x200000401:0x29:0x0' to '{VOLATILE}=[0x200000402:0xf4:0x0]'
lhsmtool_posix[21009]: copytool fs=lustre archive#=2 item_count=1
lhsmtool_posix[21009]: waiting for message from kernel
lhsmtool_posix[21012]: '[0x200000401:0x29:0x0]' action CANCEL reclen 72, cookie=0x52904448 <<<< CANCEL request received
lhsmtool_posix[21012]: processing file 'd0.sanity-hsm/d33/f.sanity-hsm.33'
lhsmtool_posix[21012]: cancel not implemented for file system '/mnt/lustre'
lhsmtool_posix[21011]: bandwith control: excess=1.251264E+06 sleep for 1193298us
lhsmtool_posix[21011]: bandwith control: excess=1.251264E+06 sleep for 1193298us
lhsmtool_posix[21011]: bandwith control: excess=1.251264E+06 sleep for 1193298us
lhsmtool_posix[21011]: %86
lhsmtool_posix[21011]: progress ioctl for copy '/home/cgearing/.autotest/shared_dir/2013-11-22/125116-70304641718120/arc1/0029/0000/0401/0000/0002/0000/0x200000401:0x29:0x0'->'{VOLATILE}=[0x200000402:0xf4:0x0]' failed: Operation canceled (125)
lhsmtool_posix[21011]: cannot copy data from '/home/cgearing/.autotest/shared_dir/2013-11-22/125116-70304641718120/arc1/0029/0000/0401/0000/0002/0000/0x200000401:0x29:0x0' to '{VOLATILE}=[0x200000402:0xf4:0x0]': Operation canceled (125)
lhsmtool_posix[21011]: Action completed, notifying coordinator cookie=0x52904448, FID=[0x200000401:0x29:0x0], hp_flags=0 err=125 <<<< RESTORE request end with ECANCELED errno
lhsmtool_posix[21011]: llapi_hsm_action_end() on '/mnt/lustre/.lustre/fid/0x200000401:0x29:0x0' ok (rc=0)
exiting: Interrupt
|
| Comment by Bruno Faccini (Inactive) [ 29/Nov/13 ] |
|
I am puzzled with the "waiting for message from kernel" and "copytool fs=lustre archive#=2 item_count=1" copytool main-thread msgs sequence here, there should be some buffering (with CT_TRACE()/llapi_[error,printf][_callback]() usage) causing some out-of-order log recording of msgs from copytool main and dynamically/async created threads to serve requests. But even with that in mind, how can we explain that copytool main thread/loop could be blocked during more than 30s allowing the RESTORE to finish and before to receive/handle the CANCEL request ?? Or may be this delay comes from the Client requesting the CANCEL and or the MDT/CDT side ?? BTW, the failed auto-tests sessions Lustre debug-logs do not help very much since they do not trace HSM activity, and particularly the main copytool thread. As far as I can see CANCEL request leave Client on-time, but then I am a bit lost with the sequence of llog record creation/update and communucations between CDT and CT … Anyway will try to find a way to retrieve where the CANCEL request handling delay comes from. |
| Comment by Bruno Faccini (Inactive) [ 12/Dec/13 ] |
|
Now I see 2 possible ways to fix this issue : _ in case there are some "orphan" active_requests from previous sub-test, add +1 to max_requests before running the hsm_cancel command. This may handle the case where there is only one "real" slot left that will cause the Restore and Cancel operations to be serialized. _ Use "wait_all_done()" instead of "wait_request_state $fid RESTORE CANCELED" + ""wait_request_state $fid CANCEL SUCCEED", and then check/handle case of RESTORE is SUCCEED/CANCELED and CANCEL is SUCCEED/FAILED with get_request_state(). This will allow to detect if Cancel was too late thus not fail. Will require a cdt_purge at beginning of test to ensure wait_all_done success. |
| Comment by Bruno Faccini (Inactive) [ 12/Mar/14 ] |
|
Seem that all failures of sanity-hsm/test_33 subtest, since mid-december, are not single errors but part of bunch of subtests errors caused by some other issue. But anyway, in order to strengthen test_33 subtest, I pushed a patch, to implement the 2nd way I described before, at http://review.whamcloud.com/9600. |
| Comment by Bruno Faccini (Inactive) [ 21/Apr/14 ] |
|
Patch has landed. No more Maloo/auto-tests failures. Marking ticket as solved. To be re-opened if problem re-occurs. |
| Comment by James Nunez (Inactive) [ 13/Jun/14 ] |
|
Patch for b2_5 at http://review.whamcloud.com/#/c/10712/ |