HSM _not only_ small fixes and to do list goes here (LU-3647)

[LU-3939] Test failure on test suite sanity-hsm, subtest test_40 Created: 12/Sep/13  Updated: 19/Mar/15  Resolved: 16/Jan/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.7.0

Type: Technical task Priority: Major
Reporter: Maloo Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: HSM

Issue Links:
Related
is related to LU-5474 Test failure sanity-hsm test_90: requ... Resolved
is related to LU-4126 sanity-hsm test_15 failure: 'request... Resolved
Rank (Obsolete): 10416

 Description   

This issue was created by maloo for John Hammond <john.hammond@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/b8769b2e-1b6b-11e3-a00a-52540035b04c.

The sub-test test_40 failed with the following error:

requests did not complete

Info required for matching: sanity-hsm 40



 Comments   
Comment by Bruno Faccini (Inactive) [ 12/Sep/13 ]

John, I had a look at the very verbose output for this test, and it seems that the list of archive requests is draining, but so slowly that it easily reaches the 100s limit for the test!
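To illustrate the failure mode described above, here is a minimal sketch of a drain-with-timeout loop. It is not the code used by sanity-hsm.sh; `wait_for_drain` and its arguments are hypothetical names, and the count command is assumed to print the number of still-pending requests.

```shell
# Hypothetical helper (not from sanity-hsm.sh): poll a command that prints
# the number of pending requests until it prints 0 or the timeout expires.
#   $1 = command printing the pending count, $2 = timeout in seconds,
#   $3 = polling interval in seconds (defaults to 1)
wait_for_drain() {
    local count_cmd="$1" timeout="$2" interval="${3:-1}" waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if [ "$($count_cmd)" -eq 0 ]; then
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    # The backlog was still shrinking, just not fast enough for the limit.
    echo "requests did not complete" >&2
    return 1
}

# Usage (assumed): wait_for_drain my_pending_count 100
```

With a slow archive target the count keeps decreasing, but the loop gives up once the limit is reached, which matches a backlog that is draining rather than stuck.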

Comment by Jinshan Xiong (Inactive) [ 12/Sep/13 ]

Indeed. I can see that from https://maloo.whamcloud.com/test_logs/27a32fc6-1b6c-11e3-a00a-52540035b04c/show_text by the change of FID.

Should we fix this by getting rid of `--bandwidth 1' from the copytool, at least for tests where lots of files have to be archived?

Comment by Keith Mannthey (Inactive) [ 12/Sep/13 ]

Another one https://maloo.whamcloud.com/test_sets/41ff0516-1b8a-11e3-bede-52540035b04c
Also later tests are failing in sanity-hsm

Are the 52-57 and 90 errors related or do we need a new LU?

Comment by Bruno Faccini (Inactive) [ 13/Sep/13 ]

No, I strongly suspect the failures in the following tests are due to the archive request backlog still being processed.

Comment by Keith Mannthey (Inactive) [ 13/Sep/13 ]

Can test 40 clean up in its error state, so that when it fails its damage is mostly contained?

Comment by Jinshan Xiong (Inactive) [ 13/Sep/13 ]

We can cancel the requests, but from my understanding it's not necessary because nothing is wrong; access to the NFS share is simply too slow.

Comment by Bruno Faccini (Inactive) [ 19/Sep/13 ]

Pushed http://review.whamcloud.com/7703 to force sanity-hsm/test_40 to use a local/tmp hsm-root/HSM_ARCHIVE anyway.

Comment by Bruno Faccini (Inactive) [ 20/Sep/13 ]

Hmm, it seems that auto-tests/maloo actually forces sanity-hsm/test_40 to be skipped (via the $SANITY_HSM_EXCEPT env var?) because of these failures. So even though my local testing was successful, this prevents my patch from being exercised by auto-tests ...

Comment by John Hammond [ 24/Sep/13 ]

Bruno I have a suggestion here. It may seem like a farce but I think it will work.

  1. Submit a patch to add 40 and 251 to ALWAYS_EXCEPT in sanity-hsm.sh.
  2. Wait for it to land.
  3. Wait for one week after that.
  4. Ask Mike to revert the changes in TEI-570.
  5. Wait for that to go into effect.
  6. Update http://review.whamcloud.com/7703 to remove 40 from ALWAYS_EXCEPT and resubmit.

Comment by Keith Mannthey (Inactive) [ 24/Sep/13 ]

If it is a fortestonly type of patch you could also just make it test_1000 (or whatever makes sense).

Comment by Bruno Faccini (Inactive) [ 26/Sep/13 ]

http://review.whamcloud.com/7772 just submitted to insert sanity-hsm/test_40 in ALWAYS_EXCEPT list.

Once it lands, it will allow us to ask the Tools team to revert their changes in TEI-570, and then to remove test_40 from the exception list on demand ...

Meanwhile, I am waiting for the current auto-tests results of http://review.whamcloud.com/7703, so that I can address comments and any other issues before pushing a new, definitive patch-set.

Comment by Bruno Faccini (Inactive) [ 01/Oct/13 ]

Submitted patch-set #4 of http://review.whamcloud.com/7703, including a hack to work around the effect of the TEI-570 changes, so that my changes to sanity-hsm/test_40 are exercised in the auto-tests environment.

Meanwhile, I am trying to get test_40 included in the sanity-hsm sub-tests exception list, by means of http://review.whamcloud.com/7772 or, better, within Jinshan's change http://review.whamcloud.com/7374 for LU-3815, in order to be able to ask the Tools team to revert the TEI-570 changes.

Comment by Bruno Faccini (Inactive) [ 02/Oct/13 ]

Jinshan just added test_40 to the sanity-hsm exclusion list in his change http://review.whamcloud.com/7374 for LU-3815, so I will abandon http://review.whamcloud.com/7772.

We now need to wait for Jinshan's patch to land (plus a few days to give everybody time to rebase!) before asking the Tools team to revert their changes in TEI-570.

Comment by Bruno Faccini (Inactive) [ 04/Nov/13 ]

Since change #7374 for LU-3815, which excludes the test_40 sub-test in sanity-hsm, landed about 10 days ago, it sounds reasonable to ask the Tools team to revert their changes in TEI-570. So I have re-opened TEI-570.

Comment by Bruno Faccini (Inactive) [ 25/Nov/13 ]

Submitted the definitive version, patch-set #6, of http://review.whamcloud.com/7703, which additionally re-enables test_40.

Comment by Bruno Faccini (Inactive) [ 26/Nov/13 ]

Hmm, even with the line in patch-set #6 that re-enables test_40, the auto-tests report still shows "skipping excluded test 40" … Could it be that the TEI-570 setting was not reverted by TEI-1041 as expected?

Comment by Bruno Faccini (Inactive) [ 10/Dec/13 ]

Now that TEI-1041 has definitely reverted ALL the auto-tests settings that caused the sanity-hsm/test_40 sub-test to be disabled, I have re-triggered the build of Change #7703/patch-set #6 (I could not re-trigger only the auto-tests, since the build was removed in the meantime).

Comment by Bruno Faccini (Inactive) [ 12/Dec/13 ]

BTW, the new auto-test session for Change #7703/patch-set #6 now shows a successful run of test_40 with a local/tmp hsm-root/HSM_ARCHIVE.

Comment by John Hammond [ 16/Dec/13 ]

Hi Bruno, I have a suggestion here. Can we determine which tests actually require that the HSM archive be on shared storage? I was thinking that perhaps only those tests should use shared storage and the other (normal) tests should use local /tmp on the copytool client. Based on a sanity-hsm run on 4 of my own VMs it seems that most tests are fine with a local HSM archive.
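A minimal sketch of that idea, assuming a per-test choice of archive root. The function and variable names are hypothetical, and the list of tests that need shared storage is deliberately left empty, since identifying them is the open question here.

```shell
# Placeholder: test numbers that genuinely need a shared archive.
# Which tests belong here is not established in this ticket.
SHARED_ARCHIVE_TESTS="${SHARED_ARCHIVE_TESTS:-}"

# Hypothetical helper: print the archive root to use for test number $1.
#   $2 = shared (for example NFS) directory, $3 = local directory
select_hsm_root() {
    local num="$1" shared_dir="$2" local_dir="$3" t
    for t in $SHARED_ARCHIVE_TESTS; do
        if [ "$t" = "$num" ]; then
            echo "$shared_dir"
            return 0
        fi
    done
    # Default: keep the archive on local storage of the copytool client.
    echo "$local_dir"
}

# Usage (assumed paths): select_hsm_root 40 /path/to/shared/arc1 /tmp/arc1
```

Defaulting to the local directory means a test only pays the cost of shared storage when it is explicitly listed.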

Comment by Jian Yu [ 08/Jan/14 ]

While validating patches for Lustre b2_5 branch, this failure occurred frequently:
https://maloo.whamcloud.com/test_sets/631324a2-76bc-11e3-9ce8-52540035b04c
https://maloo.whamcloud.com/test_sets/a365b8f0-7737-11e3-9ce8-52540035b04c

Back-port of patch http://review.whamcloud.com/7703/ to the Lustre b2_5 branch: http://review.whamcloud.com/8771

Comment by Bob Glossman (Inactive) [ 13/Feb/14 ]

seen in master:
https://maloo.whamcloud.com/test_sessions/bcd51864-9459-11e3-b8a9-52540035b04c

Comment by Bruno Faccini (Inactive) [ 21/Feb/14 ]

Bob, this failure, even if it looks like the same kind of problem, is against test_90, not test_40.
BTW, it supports John's remark that other tests that do not require shared storage should use the same fix I applied for test_40.

Comment by Peter Jones [ 21/Feb/14 ]

Fixed for 2.5.1 and 2.6. Similar fixes for other tests can be tracked under a separate ticket.

Comment by Nathaniel Clark [ 27/May/14 ]

This issue is still occurring on master:

review-zfs:
https://maloo.whamcloud.com/test_sets/0aa1078c-e2f3-11e3-8561-52540035b04c
https://maloo.whamcloud.com/test_sets/47ce19b2-dff4-11e3-9854-52540035b04c
https://maloo.whamcloud.com/test_sets/75f2680a-c5f7-11e3-a760-52540035b04c

full (ldiskfs):
https://maloo.whamcloud.com/test_sets/6637dbbc-aa80-11e3-bd80-52540035b04c
https://maloo.whamcloud.com/test_sets/be4e5838-a5e1-11e3-aac5-52540035b04c

Comment by Bruno Faccini (Inactive) [ 06/Jun/14 ]

The logs of these auto-tests failures clearly show that hsm_root is on an NFS-mounted file system:

== sanity-hsm test 40: Parallel archive requests == 20:25:46 (1394594746)
CMD: shadow-7vm6 pkill -CONT -x lhsmtool_posix
Purging archive on shadow-7vm6
CMD: shadow-7vm6 rm -rf /home/autotest/.autotest/shared_dir/2014-03-11/095905-70358207241380/arc1/*
Starting copytool agt1 on shadow-7vm6
CMD: shadow-7vm6 mkdir -p /home/autotest/.autotest/shared_dir/2014-03-11/095905-70358207241380/arc1
CMD: shadow-7vm6 lhsmtool_posix  --daemon --hsm-root /home/autotest/.autotest/shared_dir/2014-03-11/095905-70358207241380/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2014-03-11/lustre-master-el6-x86_64-vs-lustre-b2_5-el6-x86_64--full--2_9_1__1937__-70358207241380-095905/sanity-hsm.test_40.copytool_log.shadow-7vm6.log 2>&1
...........

and again the 100 archive requests for test_40 drain too slowly ...

So could it be that something has changed in the env vars set by the auto-tests/Maloo tools, so that patch #7703 needs to do more to force hsm_root onto a local file system?
How can I check the environment/configuration of the current auto-tests master sessions?
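One way to check where hsm_root actually lives is to look at the file-system type of the directory. The sketch below assumes GNU coreutils `stat` (its `-f -c %T` options print the file-system type name); the function names are hypothetical and this is not part of the test framework.

```shell
# Hypothetical classifier: succeeds if the file-system type name in $1
# is an NFS variant, which is what we want to avoid for the archive.
is_network_fs() {
    case "$1" in
        nfs|nfs3|nfs4) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the file-system type of directory $1 (requires GNU stat).
hsm_root_fstype() {
    stat -f -c %T "$1" 2>/dev/null
}

# Report whether directory $1 is suitable as a local archive root.
check_hsm_root() {
    local root="$1" fstype
    fstype=$(hsm_root_fstype "$root") || return 2
    if is_network_fs "$fstype"; then
        echo "$root is on a network file system ($fstype)"
        return 1
    fi
    echo "$root is on a local file system ($fstype)"
}

# Usage (assumed): check_hsm_root "$hsm_root" on the copytool node
```

Running such a check on the copytool node before starting lhsmtool_posix would show whether the --hsm-root directory is still on the shared NFS area.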

Comment by Bruno Faccini (Inactive) [ 07/Nov/14 ]

Maloo reports show no new occurrence of this particular sanity-hsm/test_40 failure since 2014-07-19 04:36:22 UTC. That last failure shows slow but steady progress through the 400 archive requests, even with a local/tmp file system.

So I think we can assume that another issue was causing slow archives even on a local, non-NFS archive area, and that it has been fixed. (The last 4 failures, between 2014-05-24 01:41:13 UTC and 2014-07-19 04:36:22 UTC, occurred during review-zfs sessions, so perhaps some ZFS-related slowness?)

Does anybody agree that we can close this issue as fixed?

Comment by James Nunez (Inactive) [ 17/Nov/14 ]

I've just noticed that the patch for b2_5 does not enable test 40; sanity-hsm.sh still has:

# bug number for skipped test:    3815     3939
ALWAYS_EXCEPT="$SANITY_HSM_EXCEPT 34 35 36 40"

Is there any reason why test 40 is not enabled for b2_5?
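For readers unfamiliar with the mechanism, here is a simplified sketch of how an exclusion list like the one quoted above leads to a test being skipped. The helper names are hypothetical and the real test framework logic is more involved; only the ALWAYS_EXCEPT value and the skip message come from this ticket.

```shell
# Hypothetical helper: succeeds if test number $1 appears in the
# whitespace-separated exclusion list $2.
is_excluded() {
    local num="$1" list="$2" t
    for t in $list; do
        [ "$t" = "$num" ] && return 0
    done
    return 1
}

# Same value as on b2_5; SANITY_HSM_EXCEPT normally comes from the environment.
ALWAYS_EXCEPT="${SANITY_HSM_EXCEPT:-} 34 35 36 40"

# Hypothetical wrapper: report what would happen to test number $1.
run_or_skip() {
    if is_excluded "$1" "$ALWAYS_EXCEPT"; then
        echo "skipping excluded test $1"
    else
        echo "running test $1"
    fi
}

run_or_skip 40   # prints "skipping excluded test 40"
```

As long as 40 stays in ALWAYS_EXCEPT on a branch, the test is skipped there regardless of whether the fix itself has landed.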

Comment by Jian Yu [ 17/Nov/14 ]

Is there any reason why test 40 is not enabled for b2_5?

On the master branch, test 40 was enabled by patch http://review.whamcloud.com/7703. When that patch was back-ported to the Lustre b2_5 branch in http://review.whamcloud.com/8771 (patch set 1), test 40 had not yet been disabled there. Afterwards, patch http://review.whamcloud.com/7374 was cherry-picked to the b2_5 branch, which disabled test 40. So this is caused by the order in which the patches landed.

Here is the patch to enable test 40 on Lustre b2_5 branch: http://review.whamcloud.com/12754

Comment by Gerrit Updater [ 17/Nov/14 ]

Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/12754
Subject: LU-3939 tests: enable sanity-hsm test 40
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: f82e4e9449211c5be5c60006fe9c7b7c442cd58f

Comment by Peter Jones [ 16/Jan/15 ]

It sounds like everything has landed to master, and only the landings to maintenance branches for interop testing are still in flight.

Generated at Sat Feb 10 01:38:14 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.