HSM _not only_ small fixes and to do list goes here
(LU-3647)
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Technical task |
| Priority: | Major |
| Reporter: | Maloo |
| Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed |
| Votes: | 0 |
| Labels: | HSM |
| Rank (Obsolete): | 10416 |
| Description |
|
This issue was created by maloo for John Hammond <john.hammond@intel.com>.
This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/b8769b2e-1b6b-11e3-a00a-52540035b04c.
The sub-test test_40 failed with the following error:
Info required for matching: sanity-hsm 40 |
| Comments |
| Comment by Bruno Faccini (Inactive) [ 12/Sep/13 ] |
|
John, I had a look at the very verbose output for this test, and it seems that the list of archive requests is draining, but so slowly that it easily reaches the 100s time limit for the test !! |
| Comment by Jinshan Xiong (Inactive) [ 12/Sep/13 ] |
|
Indeed. I can see that from https://maloo.whamcloud.com/test_logs/27a32fc6-1b6c-11e3-a00a-52540035b04c/show_text by the change of FID. Should we fix this by getting rid of `--bandwidth 1` from the copytool, at least for tests where lots of files have to be archived? |
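For reference, the flag in question throttles the copytool's archive throughput (to 1 MB/s, assuming the usual unit of lhsmtool_posix's --bandwidth option). A minimal sketch of the two invocations, based on the command shown in the log excerpt later in this ticket; $HSM_ARCHIVE stands in for the archive path:

    # Invocation as used by the tests: archive bandwidth throttled
    lhsmtool_posix --daemon --hsm-root "$HSM_ARCHIVE" --bandwidth 1 /mnt/lustre

    # Proposed variant for tests that archive many files: no throttle
    lhsmtool_posix --daemon --hsm-root "$HSM_ARCHIVE" /mnt/lustre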
| Comment by Keith Mannthey (Inactive) [ 12/Sep/13 ] |
|
Another one: https://maloo.whamcloud.com/test_sets/41ff0516-1b8a-11e3-bede-52540035b04c
Are the 52-57 and 90 errors related, or do we need a new LU? |
| Comment by Bruno Faccini (Inactive) [ 13/Sep/13 ] |
|
No, I strongly suspect the failures in the following tests are due to the archive-request backlog still being processed. |
| Comment by Keith Mannthey (Inactive) [ 13/Sep/13 ] |
|
Can test 40 clean up after its error state, such that when it fails the damage is mostly contained? |
| Comment by Jinshan Xiong (Inactive) [ 13/Sep/13 ] |
|
We can cancel the requests, but from my understanding it's not necessary to do that, because nothing is wrong: access to the NFS share is simply too slow. |
| Comment by Bruno Faccini (Inactive) [ 19/Sep/13 ] |
|
Pushed http://review.whamcloud.com/7703 to force sanity-hsm/test_40 to use a local/tmp hsm-root/HSM_ARCHIVE anyway. |
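A minimal sketch of what forcing a local archive root might look like (the mktemp-based helper below is illustrative only; the actual change is in the Gerrit patch above):

    # Illustrative: keep the archive on the copytool node's local /tmp rather
    # than the NFS-backed shared directory, so archiving is not NFS-bound.
    HSM_ARCHIVE=$(mktemp -d /tmp/sanity-hsm-arc.XXXXXX)
    lhsmtool_posix --daemon --hsm-root "$HSM_ARCHIVE" --bandwidth 1 /mnt/lustre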
| Comment by Bruno Faccini (Inactive) [ 20/Sep/13 ] |
|
Humm, it seems that auto-tests/Maloo currently forces sanity-hsm/test_40 to be skipped (via the $SANITY_HSM_EXCEPT env var?) due to these failures. But then, even though my local testing was successful, this prevents my patch from being exercised by auto-tests ... |
| Comment by John Hammond [ 24/Sep/13 ] |
|
Bruno, I have a suggestion here. It may seem like a farce, but I think it will work.
|
| Comment by Keith Mannthey (Inactive) [ 24/Sep/13 ] |
|
If it is a fortestonly type of patch, you could also just make it test_1000 (or whatever makes sense). |
| Comment by Bruno Faccini (Inactive) [ 26/Sep/13 ] |
|
Just submitted http://review.whamcloud.com/7772 to insert sanity-hsm/test_40 into the ALWAYS_EXCEPT list. When it lands, it will allow us to ask the Tools team to revert their changes in TEI-570, and then to remove test_40 from the exception list on demand ... On the other hand, I am waiting for the current auto-test results of http://review.whamcloud.com/7703, to address comments and any other issues before pushing a new, definitive patch-set. |
| Comment by Bruno Faccini (Inactive) [ 01/Oct/13 ] |
|
Submitted patch-set #4 of http://review.whamcloud.com/7703, including a hack to avoid the effect of the TEI-570 changes and thus allow my changes to sanity-hsm/test_40 to be exercised in the auto-tests environment. On the other hand, I am trying to get test_40 included in the sanity-hsm sub-tests exception list, by means of http://review.whamcloud.com/7772 or, better, within Jinshan's change http://review.whamcloud.com/7374. |
| Comment by Bruno Faccini (Inactive) [ 02/Oct/13 ] |
|
Jinshan just added test_40 to the sanity-hsm exclusion list in his change http://review.whamcloud.com/7374. Now we need to wait for Jinshan's patch to land (+ a few days to give everybody time to rebase !!) before asking the Tools team to revert their changes in TEI-570. |
| Comment by Bruno Faccini (Inactive) [ 04/Nov/13 ] |
|
Since change #7374 for |
| Comment by Bruno Faccini (Inactive) [ 25/Nov/13 ] |
|
Submitted the definitive version/patch-set #6 of http://review.whamcloud.com/7703, which additionally re-enables test_40. |
| Comment by Bruno Faccini (Inactive) [ 26/Nov/13 ] |
|
Humm, even with my change to re-enable test_40 in patch-set #6, the auto-tests report still shows it as "skipping excluded test 40" … Could it be that the TEI-570 setting was not reverted by TEI-1041 as expected ?? |
| Comment by Bruno Faccini (Inactive) [ 10/Dec/13 ] |
|
Now that TEI-1041 has definitively reverted ALL the auto-tests settings causing the sanity-hsm/test_40 sub-test to be disabled, I have re-triggered a build of Change #7703/patch-set #6 (I cannot simply re-trigger auto-tests, since the build was removed in the interval). |
| Comment by Bruno Faccini (Inactive) [ 12/Dec/13 ] |
|
BTW, the new auto-test session for Change #7703/patch-set #6 now shows a successful run of test_40 with a local /tmp hsm-root/HSM_ARCHIVE. |
| Comment by John Hammond [ 16/Dec/13 ] |
|
Hi Bruno, I have a suggestion here. Can we determine which tests actually require that the HSM archive be on shared storage? I was thinking that perhaps only those tests should use shared storage, and the other (normal) tests should use local /tmp on the copytool client. Based on a sanity-hsm run on 4 of my own VMs, it seems that most tests are fine with a local HSM archive. |
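A rough sketch of how such a split could look, with a hypothetical helper and an illustrative list of shared-storage tests (none of these names come from the actual patches):

    # Hypothetical: sub-tests that genuinely need the archive on shared storage.
    SHARED_ARCHIVE_TESTS="12a 12b 26"

    select_hsm_archive() {
        local test_num=$1
        if [[ " $SHARED_ARCHIVE_TESTS " == *" $test_num "* ]]; then
            HSM_ARCHIVE=$SHARED_DIR/arc1                   # NFS-backed shared dir
        else
            HSM_ARCHIVE=$(mktemp -d /tmp/hsm-arc.XXXXXX)   # local /tmp, fast
        fi
    }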
| Comment by Jian Yu [ 08/Jan/14 ] |
|
While validating patches for the Lustre b2_5 branch, this failure occurred frequently. Back-ported patch http://review.whamcloud.com/7703/ to the Lustre b2_5 branch: http://review.whamcloud.com/8771 |
| Comment by Bob Glossman (Inactive) [ 13/Feb/14 ] |
|
seen in master: |
| Comment by Bruno Faccini (Inactive) [ 21/Feb/14 ] |
|
Bob, this failure, even if it looks like the same kind of problem, is against test_90, not test_40. |
| Comment by Peter Jones [ 21/Feb/14 ] |
|
Fixed for 2.5.1 and 2.6. Similar fixes for other tests can be tracked under a separate ticket. |
| Comment by Nathaniel Clark [ 27/May/14 ] |
|
This issue is still occurring on master: review-zfs: full (ldiskfs): |
| Comment by Bruno Faccini (Inactive) [ 06/Jun/14 ] |
|
The logs of these auto-test failures clearly show that hsm_root is in an NFS-mounted file-system:

    == sanity-hsm test 40: Parallel archive requests == 20:25:46 (1394594746)
    CMD: shadow-7vm6 pkill -CONT -x lhsmtool_posix
    Purging archive on shadow-7vm6
    CMD: shadow-7vm6 rm -rf /home/autotest/.autotest/shared_dir/2014-03-11/095905-70358207241380/arc1/*
    Starting copytool agt1 on shadow-7vm6
    CMD: shadow-7vm6 mkdir -p /home/autotest/.autotest/shared_dir/2014-03-11/095905-70358207241380/arc1
    CMD: shadow-7vm6 lhsmtool_posix --daemon --hsm-root /home/autotest/.autotest/shared_dir/2014-03-11/095905-70358207241380/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2014-03-11/lustre-master-el6-x86_64-vs-lustre-b2_5-el6-x86_64--full--2_9_1__1937__-70358207241380-095905/sanity-hsm.test_40.copytool_log.shadow-7vm6.log 2>&1
    ...........

and again a too-slow draining of the 100 archive requests for test_40 ... So could it be that something has changed in the auto-tests/Maloo env-var settings, causing the #7703 patch to need more work to force hsm_root into a local file-system? |
| Comment by Bruno Faccini (Inactive) [ 07/Nov/14 ] |
|
Maloo reports show no new occurrence of this particular sanity-hsm/test_40 failure since 2014-07-19 04:36:22 UTC. That last failure shows slow but constant progress through the 400 archive requests, even with a local /tmp filesystem. So I think we can assume that another issue (the last 4 failures between 2014-05-24 01:41:13 UTC and 2014-07-19 04:36:22 UTC occurred during review-zfs sessions, so perhaps some ZFS-related slowness?) was causing slow archives even on a non-NFS/local archive area, and that it has been fixed. Does anybody agree we can close this issue as fixed? |
| Comment by James Nunez (Inactive) [ 17/Nov/14 ] |
|
I've just noticed that the patch for b2_5 does not enable test 40; sanity-hsm.sh still has:

    # bug number for skipped test: 3815 3939
    ALWAYS_EXCEPT="$SANITY_HSM_EXCEPT 34 35 36 40"

Is there any reason why test 40 is not enabled for b2_5? |
| Comment by Jian Yu [ 17/Nov/14 ] |
|
On the master branch, test 40 was enabled by patch http://review.whamcloud.com/7703. While back-porting the patch to the Lustre b2_5 branch in http://review.whamcloud.com/8771 (patch set 1), test 40 was not disabled at that time. After that, patch http://review.whamcloud.com/7374 was cherry-picked to the Lustre b2_5 branch, which disabled test 40. So this was caused by the order of patch landing. Here is the patch to enable test 40 on the Lustre b2_5 branch: http://review.whamcloud.com/12754 |
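For illustration, re-enabling test 40 on b2_5 presumably amounts to dropping it from that exception list (a sketch; the actual content of patch 12754 may differ):

    # Sketch of the b2_5 sanity-hsm.sh change: test 40 removed from the list.
    # bug number for skipped test: 3815 3939
    ALWAYS_EXCEPT="$SANITY_HSM_EXCEPT 34 35 36"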
| Comment by Gerrit Updater [ 17/Nov/14 ] |
|
Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/12754 |
| Comment by Peter Jones [ 16/Jan/15 ] |
|
It sounds like everything has landed to master, and just the landings to maintenance branches for interop testing are still in flight. |