
conf-sanity fails to start run: umount of OST fails

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
    • Affects Version/s: Lustre 2.4.0, Lustre 2.4.1, Lustre 2.5.0, Lustre 2.4.2, Lustre 2.5.1
    • Severity: 3
    • 7893

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite runs:
      http://maloo.whamcloud.com/test_sets/bbe080da-ad17-11e2-bd7c-52540035b04c
      http://maloo.whamcloud.com/test_sets/51e42416-ad76-11e2-b72d-52540035b04c
      http://maloo.whamcloud.com/test_sets/842709fa-ad73-11e2-b72d-52540035b04c

      The sub-test conf-sanity failed with the following error:

      test failed to respond and timed out

      Info required for matching: conf-sanity conf-sanity
      Info required for matching: replay-single test_90


          Activity


            utopiabound Nathaniel Clark added a comment - The b2_5 issues are different from the original issue and are better handled by LU-3665.

            utopiabound Nathaniel Clark added a comment -

            All the b2_5 TIMEOUTs happened in obdfilter-survey test 3a, but in each of those runs there were errors in test 1c or 2a that I believe left an echo-client attached to the OST, which then caused the umount to TIMEOUT.
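            If an echo_client really is left behind, it shows up in the device list and can be torn down by hand before retrying the umount. The following is a minimal sketch, not taken from the test suite: the device-type matching and the mount point /mnt/ost1 are assumptions, so check the actual "lctl dl" output on the OSS first.

                # Sketch: find and clean up leftover echo_client devices on the OSS.
                # In "lctl dl" output, column 3 is the device type, column 4 the name.
                lctl dl | awk '$3 == "echo_client" { print $1, $4 }' |
                while read devno name; do
                    echo "cleaning up leftover echo_client $name (device $devno)"
                    lctl --device "$devno" cleanup force   # undo the device setup
                    lctl --device "$devno" detach          # drop it from the device list
                done

                # Only once no echo_client remains should the OST unmount cleanly:
                umount -d -f /mnt/ost1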
            yujian Jian Yu added a comment - - edited

            More instances on Lustre b2_5 branch:
            https://maloo.whamcloud.com/test_sets/91c9c6da-861a-11e3-a2cb-52540035b04c
            https://maloo.whamcloud.com/test_sets/09ebb164-8477-11e3-bab5-52540035b04c
            https://maloo.whamcloud.com/test_sets/2f51a8fa-8477-11e3-bab5-52540035b04c
            https://maloo.whamcloud.com/test_sets/2cbbedf4-8ecb-11e3-b036-52540035b04c
            yujian Jian Yu added a comment -

            One more instance on Lustre b2_5 build #5 with FSTYPE=zfs:
            https://maloo.whamcloud.com/test_sets/a1f6e73a-7671-11e3-a7a8-52540035b04c

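            For reference, the failing configuration can be requested directly from the in-tree test framework. A minimal sketch, assuming a cluster that is already configured for lustre/tests (the usual test-framework.sh node and device settings):

                # Sketch: run only subtest 3a of obdfilter-survey on a ZFS backend.
                cd lustre/tests
                FSTYPE=zfs ONLY=3a bash obdfilter-survey.sh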
            yujian Jian Yu added a comment -

            I have to reopen the ticket because the failure still occurs.

            yujian Jian Yu added a comment -

            Although patch http://review.whamcloud.com/7995 was cherry-picked to the Lustre b2_5 branch, the failure still occurred while running obdfilter-survey test 3a with FSTYPE=zfs on Lustre b2_5 build #5:

            01:44:19:Lustre: DEBUG MARKER: umount -d -f /mnt/ost1
            01:44:19:Lustre: lustre-OST0000: Not available for connect from 10.10.4.37@tcp (stopping)
            01:44:19:Lustre: lustre-OST0000: Not available for connect from 10.10.4.37@tcp (stopping)
            01:44:19:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 4. Is it stuck?
            01:44:19:Lustre: 4092:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1388655826/real 1388655826]  req@ffff88006090dc00 x1456104749935752/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.39@tcp:12/10 lens 400/544 e 0 to 1 dl 1388655842 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            01:44:19:Lustre: lustre-OST0000: Not available for connect from 10.10.4.37@tcp (stopping)
            01:44:19:Lustre: lustre-OST0000: Not available for connect from 10.10.4.37@tcp (stopping)
            01:46:34:Lustre: lustre-OST0000: Not available for connect from 10.10.4.37@tcp (stopping)
            01:46:34:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 4. Is it stuck?
            <~snip~>
            01:50:58:INFO: task umount:11612 blocked for more than 120 seconds.
            01:50:58:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            01:50:58:umount        D 0000000000000001     0 11612  11611 0x00000080
            01:50:58: ffff88002be63ab8 0000000000000082 0000000000000000 000000006e5e0af6
            01:50:58: ffffffffa07ca7f0 ffff88006e68236f ffff88005a64d184 ffffffffa0788975
            01:50:58: ffff8800638d7098 ffff88002be63fd8 000000000000fb88 ffff8800638d7098
            01:50:58:Call Trace:
            01:50:58: [<ffffffff8150f362>] schedule_timeout+0x192/0x2e0
            01:50:58: [<ffffffff810811e0>] ? process_timeout+0x0/0x10
            01:50:58: [<ffffffffa070aeab>] obd_exports_barrier+0xab/0x180 [obdclass]
            01:50:58: [<ffffffffa0e8194f>] ofd_device_fini+0x5f/0x240 [ofd]
            01:50:58: [<ffffffffa0736493>] class_cleanup+0x573/0xd30 [obdclass]
            01:50:58: [<ffffffffa070d046>] ? class_name2dev+0x56/0xe0 [obdclass]
            01:50:58: [<ffffffffa07381ba>] class_process_config+0x156a/0x1ad0 [obdclass]
            01:50:58: [<ffffffffa0731313>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
            01:50:58: [<ffffffffa0738899>] class_manual_cleanup+0x179/0x6f0 [obdclass]
            01:50:58: [<ffffffffa070d046>] ? class_name2dev+0x56/0xe0 [obdclass]
            01:50:59: [<ffffffffa0773edc>] server_put_super+0x5ec/0xf60 [obdclass]
            01:50:59: [<ffffffff8118363b>] generic_shutdown_super+0x5b/0xe0
            01:50:59: [<ffffffff81183726>] kill_anon_super+0x16/0x60
            01:50:59: [<ffffffffa073a746>] lustre_kill_super+0x36/0x60 [obdclass]
            01:50:59: [<ffffffff81183ec7>] deactivate_super+0x57/0x80
            01:50:59: [<ffffffff811a21bf>] mntput_no_expire+0xbf/0x110
            01:50:59: [<ffffffff811a2c2b>] sys_umount+0x7b/0x3a0
            01:50:59: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
            

            Maloo report: https://maloo.whamcloud.com/test_sets/8b620634-73c5-11e3-b4ff-52540035b04c

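            The stack trace above shows umount parked in obd_exports_barrier(), waiting for the OST's unlinked exports to drain before ofd_device_fini() can complete. While the umount is stuck, the drain can be watched from userspace with something like the sketch below; the target name lustre-OST0000 is taken from the console log, and parameter names can vary between releases, so treat this as an illustration only.

                # Sketch: watch the stuck OST's export count and device refcount.
                while true; do
                    date
                    # num_exports should fall to zero as exports are released
                    lctl get_param -n obdfilter.lustre-OST0000.num_exports
                    # the last column of "lctl dl" is the obd refcount from the log
                    lctl dl | grep lustre-OST0000
                    sleep 5
                done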
            yujian Jian Yu added a comment -

            Lustre build: http://build.whamcloud.com/job/lustre-b2_4/69/ (2.4.2 RC1)
            Distro/Arch: RHEL6.4/x86_64
            FSTYPE=zfs

            obdfilter-survey hit this failure again:
            https://maloo.whamcloud.com/test_sets/f0db9456-6981-11e3-aabe-52540035b04c


            adilger Andreas Dilger added a comment -

            Typically, if a patch can be cherry-picked cleanly to the older branches, there is no need for a separate patch. There is no harm in doing so, but it is also possible to ask Oleg to do the cherry-pick into the maintenance branch(es).
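            For reference, the cherry-pick flow described above looks roughly like the following; the branch name b2_4 is real, but the commit hash is a placeholder and the Gerrit push convention is assumed from the standard review workflow.

                # Sketch: back-porting a landed master fix onto a maintenance branch.
                git fetch origin
                git checkout -b b2_4-backport origin/b2_4
                git cherry-pick -x <master-commit>   # -x records the original hash
                # resolve any conflicts, then push to Gerrit for review:
                git push origin HEAD:refs/for/b2_4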
            utopiabound Nathaniel Clark added a comment - Back-port to b2_4: http://review.whamcloud.com/8591
            utopiabound Nathaniel Clark added a comment - - edited

            It looks like this bug is fixed with the landing of #7995. Should I create a Gerrit patch to port it to b2_4 and b2_5? It should cherry-pick cleanly to the current heads of both branches.

            People

              Assignee: utopiabound Nathaniel Clark
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 13
