Details

    • Type: Technical task
    • Resolution: Fixed
    • Priority: Major
    • Lustre 2.7.0
    • Lustre 2.5.0
    • 10416

    Description

      This issue was created by maloo for John Hammond <john.hammond@intel.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/b8769b2e-1b6b-11e3-a00a-52540035b04c.

      The sub-test test_40 failed with the following error:

      requests did not complete

      Info required for matching: sanity-hsm 40

    Activity

            [LU-3939] Test failure on test suite sanity-hsm, subtest test_40
            pjones Peter Jones added a comment -

            It sounds like everything has landed on master, and it is just the landings to maintenance branches for interop testing that are still in flight.

            gerrit Gerrit Updater added a comment -

            Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/12754
            Subject: LU-3939 tests: enable sanity-hsm test 40
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: f82e4e9449211c5be5c60006fe9c7b7c442cd58f

            yujian Jian Yu added a comment - edited

            Is there any reason why test 40 is not enabled for b2_5?

            On the master branch, test 40 was enabled by patch http://review.whamcloud.com/7703. When that patch was back-ported to the Lustre b2_5 branch in http://review.whamcloud.com/8771 (patch set 1), test 40 was not disabled at that time. After that, patch http://review.whamcloud.com/7374 was cherry-picked to the Lustre b2_5 branch, which disabled test 40. So this was caused by the order in which the patches landed.

            Here is the patch to enable test 40 on Lustre b2_5 branch: http://review.whamcloud.com/12754

            jamesanunez James Nunez (Inactive) added a comment -

            I've just noticed that the patch for b2_5 does not enable test 40; sanity-hsm.sh still has:

            # bug number for skipped test:    3815     3939
            ALWAYS_EXCEPT="$SANITY_HSM_EXCEPT 34 35 36 40"
            

            Is there any reason why test 40 is not enabled for b2_5?

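            For reference, enabling the test on b2_5 amounts to dropping test 40 (and the corresponding 3939 bug number) from the exception list, roughly as below. This is only a sketch of the expected change; the exact hunk in http://review.whamcloud.com/12754 may differ.

            # bug number for skipped test:    3815
            ALWAYS_EXCEPT="$SANITY_HSM_EXCEPT 34 35 36"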

            bfaccini Bruno Faccini (Inactive) added a comment -

            Maloo reports show no new occurrence of this particular failure of sanity-hsm/test_40 since 2014-07-19 04:36:22 UTC. That last failure shows slow but steady progress of the 400 archive requests, even with local/tmp file system usage.

            So I think we can assume that another issue (the last 4 failures, between 2014-05-24 01:41:13 UTC and 2014-07-19 04:36:22 UTC, occurred during review-zfs sessions, so perhaps some ZFS-related slowness?) was causing slow archives even on a non-NFS/local archive area, and that it has since been fixed.

            Does anybody agree that we can close this issue as fixed?

            bfaccini Bruno Faccini (Inactive) added a comment -

            The logs of these auto-test failures clearly show that hsm_root is in an NFS-mounted file system:

            == sanity-hsm test 40: Parallel archive requests == 20:25:46 (1394594746)
            CMD: shadow-7vm6 pkill -CONT -x lhsmtool_posix
            Purging archive on shadow-7vm6
            CMD: shadow-7vm6 rm -rf /home/autotest/.autotest/shared_dir/2014-03-11/095905-70358207241380/arc1/*
            Starting copytool agt1 on shadow-7vm6
            CMD: shadow-7vm6 mkdir -p /home/autotest/.autotest/shared_dir/2014-03-11/095905-70358207241380/arc1
            CMD: shadow-7vm6 lhsmtool_posix  --daemon --hsm-root /home/autotest/.autotest/shared_dir/2014-03-11/095905-70358207241380/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2014-03-11/lustre-master-el6-x86_64-vs-lustre-b2_5-el6-x86_64--full--2_9_1__1937__-70358207241380-095905/sanity-hsm.test_40.copytool_log.shadow-7vm6.log 2>&1
            ...........
            

            and, again, too slow a draining of the 100 archive requests for test_40 ...

            So could it be that something has changed in the auto-test/Maloo tools environment variable settings, so that the #7703 patch needs to do more to force hsm_root onto a local file system?
            How can I check the current auto-test master sessions' environment/configuration?

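            For what it is worth, here is a minimal sketch of the kind of change http://review.whamcloud.com/7703 aims at: pointing the copytool's hsm_root at a node-local directory instead of the NFS-shared autotest directory. The variable and helper names below (HSM_ARCHIVE, copytool_setup, copytool_cleanup) are assumptions about sanity-hsm.sh conventions, not a quote of the landed patch.

            # sketch only: hypothetical test prologue forcing a node-local archive area
            local_arc=$TMP/sanity-hsm.arc.$$      # $TMP is node-local, not on NFS
            mkdir -p $local_arc
            HSM_ARCHIVE=$local_arc                # assumed variable read by copytool_setup
            copytool_setup                        # starts lhsmtool_posix with --hsm-root $HSM_ARCHIVE
            # ... issue the archive requests and wait for them to drain ...
            copytool_cleanup
            rm -rf $local_arc
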
            utopiabound Nathaniel Clark added a comment -

            This issue is still occurring on master:

            review-zfs:
            https://maloo.whamcloud.com/test_sets/0aa1078c-e2f3-11e3-8561-52540035b04c
            https://maloo.whamcloud.com/test_sets/47ce19b2-dff4-11e3-9854-52540035b04c
            https://maloo.whamcloud.com/test_sets/75f2680a-c5f7-11e3-a760-52540035b04c

            full (ldiskfs):
            https://maloo.whamcloud.com/test_sets/6637dbbc-aa80-11e3-bd80-52540035b04c
            https://maloo.whamcloud.com/test_sets/be4e5838-a5e1-11e3-aac5-52540035b04c

            pjones Peter Jones added a comment -

            Fixed for 2.5.1 and 2.6. Similar fixes for other tests can be tracked under a separate ticket.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Bob, this failure, even if it looks like the same kind of problem, is against test_90, not test_40.
            BTW, John's remark is a good point: other tests that do not require shared storage should use the same fix I applied for test_40.

            bogl Bob Glossman (Inactive) added a comment -

            Seen in master: https://maloo.whamcloud.com/test_sessions/bcd51864-9459-11e3-b8a9-52540035b04c

            People

      Assignee: bfaccini Bruno Faccini (Inactive)
      Reporter: maloo Maloo
      Votes: 0
      Watchers: 13

              Dates

                Created:
                Updated:
                Resolved: