
LU-3230: conf-sanity fails to start run: umount of OST fails

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
    • Affects Version/s: Lustre 2.4.0, Lustre 2.4.1, Lustre 2.5.0, Lustre 2.4.2, Lustre 2.5.1
    • 3
    • 7893

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite runs:
      http://maloo.whamcloud.com/test_sets/bbe080da-ad17-11e2-bd7c-52540035b04c
      http://maloo.whamcloud.com/test_sets/51e42416-ad76-11e2-b72d-52540035b04c
      http://maloo.whamcloud.com/test_sets/842709fa-ad73-11e2-b72d-52540035b04c

      The sub-test conf-sanity failed with the following error:

      test failed to respond and timed out

      Info required for matching: conf-sanity conf-sanity
      Info required for matching: replay-single test_90

      Activity

            yujian Jian Yu added a comment -

            Lustre build: http://build.whamcloud.com/job/lustre-b2_4/69/ (2.4.2 RC1)
            Distro/Arch: RHEL6.4/x86_64
            FSTYPE=zfs

            obdfilter-survey hit this failure again:
            https://maloo.whamcloud.com/test_sets/f0db9456-6981-11e3-aabe-52540035b04c


            adilger Andreas Dilger added a comment -

            Typically, if a patch can be cherry-picked cleanly to the older branches, there is no need for a separate patch. No harm in doing this, but it is also possible to ask Oleg to do the cherry-pick into the maintenance branch(es).
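
            For reference, porting a landed master patch to a maintenance branch is normally a plain git cherry-pick of the landed commit. A rough sketch of that flow, where the branch names come from this ticket but the commit hash and remote name are placeholders, not values from the ticket:

                # fetch the current branch heads from the upstream tree
                git fetch origin
                # start a backport branch from the b2_4 maintenance head
                git checkout -b backport-b2_4 origin/b2_4
                # replay the landed master commit; <commit> is a placeholder
                git cherry-pick -x <commit>
                # resolve any conflicts, then push the result to Gerrit for review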
            utopiabound Nathaniel Clark added a comment - back-port to b2_4 http://review.whamcloud.com/8591
            utopiabound Nathaniel Clark added a comment - edited

            It looks like this bug is fixed with the landing of #7995. Should I create a gerrit patch to port it to b2_4 and b2_5? It will cherry-pick cleanly to the current heads of both b2_4 and b2_5.

            yujian Jian Yu added a comment - edited

            More instances on Lustre b2_4 branch:
            https://maloo.whamcloud.com/test_sets/dcb5daa6-6579-11e3-8518-52540035b04c
            https://maloo.whamcloud.com/test_sets/6c3ab5e4-6358-11e3-8c76-52540035b04c
            https://maloo.whamcloud.com/test_sets/d4b0f714-6281-11e3-a8fd-52540035b04c
            yujian Jian Yu added a comment -

            Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/63/
            Distro/Arch: RHEL6.4/x86_64 (server), SLES11SP2/x86_64 (client)

            replay-dual test 3 hit this failure:
            https://maloo.whamcloud.com/test_sets/20b3d072-5c98-11e3-956b-52540035b04c

            yujian Jian Yu added a comment -

            Lustre build: http://build.whamcloud.com/job/lustre-b2_4/58/
            Distro/Arch: RHEL6.4/x86_64

            FSTYPE=zfs
            MDSCOUNT=1
            MDSSIZE=2097152
            OSTCOUNT=2
            OSTSIZE=2097152

            obdfilter-survey test 3a hit the same failure:
            https://maloo.whamcloud.com/test_sets/19556f3e-5608-11e3-8e94-52540035b04c
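
            To reproduce this configuration locally, the variables above are the standard lustre/tests environment settings. A minimal sketch, assuming a stock lustre/tests installation and the auster test driver (the install path and invocation are assumptions, not taken from this ticket):

                cd /usr/lib64/lustre/tests
                # small ZFS-backed targets matching the reported configuration
                FSTYPE=zfs MDSCOUNT=1 MDSSIZE=2097152 \
                OSTCOUNT=2 OSTSIZE=2097152 \
                    ./auster -v obdfilter-survey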

            utopiabound Nathaniel Clark added a comment - http://review.whamcloud.com/7995
            yujian Jian Yu added a comment -

            Lustre build: http://build.whamcloud.com/job/lustre-b2_4/47/
            Distro/Arch: RHEL6.4/x86_64

            FSTYPE=zfs
            MDSCOUNT=1
            MDSSIZE=2097152
            OSTCOUNT=2
            OSTSIZE=2097152

            obdfilter-survey test 3a hit the same failure:
            https://maloo.whamcloud.com/test_sets/a488f632-4453-11e3-8472-52540035b04c


            utopiabound Nathaniel Clark added a comment -

            Debugging patch to see whether 6988 was on the right track but not broad enough:

            http://review.whamcloud.com/7995

            utopiabound Nathaniel Clark added a comment -

            There have been two "recent" (Sept 2013) failures outside of conf-sanity (both in replay-single):

            replay-single/74 https://maloo.whamcloud.com/test_sets/f441c460-227f-11e3-af6a-52540035b04c
            A review-dne-zfs failure on OST0000

            21:28:53:Lustre: DEBUG MARKER: umount -d /mnt/ost1
            21:28:53:Lustre: Failing over lustre-OST0000
            21:28:53:LustreError: 15640:0:(ost_handler.c:1782:ost_blocking_ast()) Error -2 syncing data on lock cancel
            21:28:53:Lustre: 15640:0:(service.c:2030:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (50:74s); client may timeout.  req@ffff880046d72c00 x1446662193136696/t0(0) o103->cea0ffc2-1873-4321-a1a2-348391764373@10.10.16.253@tcp:0/0 lens 328/192 e 0 to 0 dl 1379651120 ref 1 fl Complete:H/0/0 rc -19/-19
            21:28:53:LustreError: 7671:0:(ost_handler.c:1782:ost_blocking_ast()) Error -2 syncing data on lock cancel
            21:28:53:Lustre: lustre-OST0000: Not available for connect from 10.10.17.1@tcp (stopping)
            21:28:53:Lustre: Skipped 5 previous similar messages
            21:28:53:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 7. Is it stuck?
            21:28:53:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 7. Is it stuck?
            21:28:53:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 7. Is it stuck?
            21:40:22:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 7. Is it stuck?
            

            The other is review run replay-single/53e https://maloo.whamcloud.com/test_sets/ddb85db2-208b-11e3-b9bc-52540035b04c (NOT ZFS)
            The MGS fails:

            03:55:06:Lustre: DEBUG MARKER: umount -d /mnt/mds1
            03:55:06:LustreError: 166-1: MGC10.10.4.154@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
            03:55:07:Lustre: MGS is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
            03:55:31:Lustre: MGS is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
            03:56:05:Lustre: MGS is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
            
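            When an unmount hangs like this, it helps to capture the obd device state and the blocked task's stack on the affected server while the hang is in progress. A rough sketch of the kind of data worth collecting (standard Lustre/Linux commands, not steps taken from this ticket):

                # list obd devices; the trailing number on each line is the device
                # reference count reported as "obd refcount = N" in the messages above
                lctl dl
                # dump all kernel task stacks to the console/dmesg to see where umount is blocked
                echo t > /proc/sysrq-trigger
                dmesg | tail -n 200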

            People

              Assignee: utopiabound Nathaniel Clark
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 13
