
LU-797: Test failure on test suite ost-pools, subtest test_14, test_18, test_23

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Lustre 2.2.0, Lustre 2.1.2, Lustre 2.1.3, Lustre 2.1.4, Lustre 1.8.8
    • None
    • 3
    • 4734

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/f6c809ee-00ec-11e1-bb4f-52540025f9af.

      The sub-test test_14 failed with the following error:

      test failed to respond and timed out

      Info required for matching: ost-pools 14

    Activity

            [LU-797] Test failure on test suite ost-pools, subtest test_14, test_18, test_23
            yujian Jian Yu added a comment -

            Lustre Branch: b1_8
            Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/236/
            Distro/Arch: RHEL5.8/x86_64
            Network: TCP (1GigE)
            ENABLE_QUOTA=yes
            OSTSIZE=149723171

            The ost-pools test 14 timed out: https://maloo.whamcloud.com/test_sets/f8ded3ce-5187-11e2-bbc3-52540035b04c

            emoly.liu Emoly Liu added a comment -

            The related patches for b1_8 are merged into http://review.whamcloud.com/4898

            pjones Peter Jones added a comment -

            Fix landed for 2.1.4 and 2.4

            emoly.liu Emoly Liu added a comment -

            http://review.whamcloud.com/#change,4474; the b2_1 port is at http://review.whamcloud.com/4831
            yujian Jian Yu added a comment -

            Lustre Branch: b2_1
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/148
            Distro/Arch: RHEL6.3/x86_64 (kernel version: 2.6.32-279.14.1.el6)
            Network: TCP (1GigE)

            The same issue occurred:
            https://maloo.whamcloud.com/test_sets/9b37618e-41db-11e2-adcf-52540035b04c

            emoly.liu Emoly Liu added a comment -

            Remove the space of the "lfs df" output internally: http://review.whamcloud.com/4705

            b2_1 port is at http://review.whamcloud.com/#change,4784
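            For reference, the spacing presumably matters because the tests parse the "lfs df" output. Below is a rough sketch of that kind of parsing, assuming the standard "lfs df" column layout; the pool name and mount point are only examples, not the actual ost-pools.sh code.

            #!/bin/bash
            # Illustrative only: total the available space that "lfs df" reports
            # for the OSTs of a pool. Splitting on whitespace with awk keeps the
            # parsing robust against extra or trailing spaces in the output.
            POOL=lustre.testpool      # example pool name
            MNT=/mnt/lustre           # example client mount point

            # "lfs df -p <fsname>.<pool>" prints one line per OST in the pool;
            # the "Available" 1K-blocks value is the fourth column.
            avail_kb=$(lfs df -p $POOL $MNT | awk '/OST/ { sum += $4 } END { print sum }')
            echo "available space in $POOL: ${avail_kb} KB"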


            niu Niu Yawei (Inactive) added a comment -

            Remove the space of the "lfs df" output internally: http://review.whamcloud.com/4705
            niu Niu Yawei (Inactive) added a comment -

            http://review.whamcloud.com/#change,4474

            Looking at the recent failure:

            == ost-pools test 23b: OST pools and OOS ============================================================= 21:49:32 (1345438172)
            running as uid/gid/euid/egid 500/500/500/500, groups:
             [touch] [/mnt/lustre/d0_runas_test/f13763]
            CMD: client-30vm3 lctl pool_new lustre.testpool
            client-30vm3: Pool lustre.testpool created
            CMD: client-30vm6.lab.whamcloud.com lctl get_param -n lov.lustre-*.pools.testpool         2>/dev/null || echo foo
            CMD: client-30vm3 lctl pool_add lustre.testpool lustre-OST[0000-0006/3]
            client-30vm3: OST lustre-OST0000_UUID added to pool lustre.testpool
            client-30vm3: OST lustre-OST0003_UUID added to pool lustre.testpool
            client-30vm3: OST lustre-OST0006_UUID added to pool lustre.testpool
            CMD: client-30vm6.lab.whamcloud.com lctl get_param -n lov.lustre-*.pools.testpool |
                                       sort -u | tr '\n' ' ' 
            1: 32768+0 records in
            32768+0 records out
            34359738368 bytes (34 GB) copied, 1656.82 seconds, 20.7 MB/s
            2: 32768+0 records in
            32768+0 records out
            34359738368 bytes (34 GB) copied, 1761.27 seconds, 19.5 MB/s
            3: 32768+0 records in
            32768+0 records out
            34359738368 bytes (34 GB) copied, 1921.64 seconds, 17.9 MB/s
            4: 32768+0 records in
            32768+0 records out
            34359738368 bytes (34 GB) copied, 1941.35 seconds, 17.7 MB/s
            5: 32768+0 records in
            32768+0 records out
            34359738368 bytes (34 GB) copied, 2098.63 seconds, 16.4 MB/s
            6: 32768+0 records in
            32768+0 records out
            34359738368 bytes (34 GB) copied, 2030.07 seconds, 16.9 MB/s
            

            The MAXFREE should be (2000000(KB) * 7(OSTs)) / (1024 * 1024) ≈ 13GB, which means the available space in this pool should be less than 13GB; however, the test writes much more than that and does not fail with ENOSPC. I think we should add more debug information to the tests, and also break out of the write loop whenever the number of bytes written exceeds the available space.
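            A minimal sketch of the guard suggested above, in the same spirit as the test's dd loop; MAXFREE_KB, the file path, and the 1GB chunk size are illustrative values, not the actual ost-pools.sh code.

            #!/bin/bash
            # Illustrative only: stop writing as soon as the bytes written exceed
            # the pool's available space, instead of letting the test run until it
            # times out. MAXFREE follows the arithmetic above:
            # 2000000 KB per OST * 7 OSTs = 14000000 KB, i.e. roughly 13 GB.
            MAXFREE_KB=$((2000000 * 7))
            FILE=/mnt/lustre/testpool_dir/file    # hypothetical file on the pool
            written_kb=0
            i=1
            while dd if=/dev/zero of=${FILE}.$i bs=1M count=1024; do
                # 1 GB written per iteration; dd is expected to fail with ENOSPC
                # once the pool really is full.
                written_kb=$((written_kb + 1024 * 1024))
                echo "iteration $i: ${written_kb} KB written, limit ${MAXFREE_KB} KB"
                if [ $written_kb -gt $MAXFREE_KB ]; then
                    echo "error: wrote past the pool's available space without ENOSPC"
                    exit 1
                fi
                i=$((i + 1))
            done

            Failing immediately once the written bytes pass the computed limit turns the silent over-write into a clear, debuggable error instead of a suite timeout.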

            yujian Jian Yu added a comment -

            More instances occurred during Lustre 2.1.3 RC2 testing:
            https://maloo.whamcloud.com/test_sets/9c6295ce-eb32-11e1-ba73-52540035b04c
            https://maloo.whamcloud.com/test_sets/68c13076-eb3d-11e1-ba73-52540035b04c

            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 8
