[LU-5539] MGS is waiting for obd_unlinked_exports more than 1024 seconds

Details


    Description

      While testing patch http://review.whamcloud.com/11539 based on Lustre b2_5 build #84, unmounting the MGS in sanity-lfsck test 0 hung:

      20:00:58:Lustre: DEBUG MARKER: umount -d -f /mnt/mds1
      20:00:58:LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.1.4.57@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      20:00:58:LustreError: Skipped 7 previous similar messages
      20:00:58:LustreError: 166-1: MGC10.1.4.66@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
      20:00:58:Lustre: MGS is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
      20:00:58:Lustre: MGS is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
      20:00:58:Lustre: 20326:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1408672403/real 1408672403]  req@ffff88007b36dc00 x1477087144042844/t0(0) o250->MGC10.1.4.66@tcp@0@lo:26/25 lens 400/544 e 0 to 1 dl 1408672419 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      20:00:58:Lustre: 20326:0:(client.c:1918:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
      20:00:58:Lustre: MGS is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
      20:00:58:Lustre: MGS is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
      20:00:58:LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.1.4.57@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      20:00:58:LustreError: Skipped 213 previous similar messages
      20:00:58:INFO: task umount:16206 blocked for more than 120 seconds.
      20:00:58:      Not tainted 2.6.32-431.23.3.el6_lustre.g6035153.x86_64 #1
      20:00:58:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      20:00:58:umount        D 0000000000000001     0 16206  16205 0x00000080
      20:00:58: ffff880059079aa8 0000000000000082 0000000000000000 ffff88007b874400
      20:00:58: ffffffffa0c34294 0000000000000000 ffff88006c2120c4 ffffffffa0c34294
      20:00:58: ffff880060233af8 ffff880059079fd8 000000000000fbc8 ffff880060233af8
      20:00:58:Call Trace:
      20:00:58: [<ffffffff81529e92>] schedule_timeout+0x192/0x2e0
      20:00:58: [<ffffffff81083f30>] ? process_timeout+0x0/0x10
      20:00:58: [<ffffffffa0bb5e9b>] obd_exports_barrier+0xab/0x180 [obdclass]
      20:00:58: [<ffffffffa16e152e>] mgs_device_fini+0xfe/0x580 [mgs]
      20:00:58: [<ffffffffa0be19f3>] class_cleanup+0x573/0xd30 [obdclass]
      20:00:58: [<ffffffffa0bb8036>] ? class_name2dev+0x56/0xe0 [obdclass]
      20:00:58: [<ffffffffa0be371a>] class_process_config+0x156a/0x1ad0 [obdclass]
      20:00:58: [<ffffffffa0bdc873>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
      20:00:58: [<ffffffffa0be3df9>] class_manual_cleanup+0x179/0x6f0 [obdclass]
      20:00:58: [<ffffffffa0bb8036>] ? class_name2dev+0x56/0xe0 [obdclass]
      20:00:58: [<ffffffffa0c1f2dd>] server_put_super+0x45d/0xf60 [obdclass]
      20:00:58: [<ffffffff8118b23b>] generic_shutdown_super+0x5b/0xe0
      20:00:58: [<ffffffff8118b326>] kill_anon_super+0x16/0x60
      20:00:58: [<ffffffffa0be5ca6>] lustre_kill_super+0x36/0x60 [obdclass]
      20:00:58: [<ffffffff8118bac7>] deactivate_super+0x57/0x80
      20:00:58: [<ffffffff811ab4cf>] mntput_no_expire+0xbf/0x110
      20:00:58: [<ffffffff811ac01b>] sys_umount+0x7b/0x3a0
      20:00:58: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      20:00:58:Lustre: MGS is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
      

      Maloo report: https://testing.hpdd.intel.com/test_sets/37948628-29b7-11e4-8657-5254006e85c2
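
      For reference, the console messages above come from the export-drain barrier that umount runs during MGS cleanup (obd_exports_barrier in the stack trace): it waits for the unlinked exports to go away and doubles its warning interval each time the wait expires, which is why the reported intervals go 8, 16, 32, ... and eventually reach the 1024 seconds in the summary while the obd refcount stays at 5. Below is a minimal userspace sketch of that back-off pattern only, not Lustre source; exports_drained() and simulated_obd_refcount are hypothetical stand-ins for the real state.

      /*
       * Illustrative sketch (not Lustre code): reproduces the doubling
       * warning cadence seen in the console log above.
       */
      #include <stdbool.h>
      #include <stdio.h>

      static int simulated_obd_refcount = 5;   /* stuck at 5, as in the log */

      static bool exports_drained(void)
      {
              /* Hypothetical check; the real barrier waits for the
               * obd_unlinked_exports list to drain.  In the reported
               * hang that never happens, so this stays false. */
              return simulated_obd_refcount == 0;
      }

      int main(void)
      {
              int waited = 8;   /* first warning after 8 seconds */

              while (!exports_drained() && waited <= 1024) {
                      /* the kernel code sleeps here; omitted so the demo is instant */
                      printf("MGS is waiting for obd_unlinked_exports more than "
                             "%d seconds. The obd refcount = %d. Is it stuck?\n",
                             waited, simulated_obd_refcount);
                      waited *= 2;   /* 8 -> 16 -> 32 -> ... -> 1024 */
              }
              return 0;
      }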

      Attachments

        Issue Links

          Activity


            adilger Andreas Dilger added a comment -

            Closing as a duplicate of LU-4772.
            yujian Jian Yu added a comment -

            Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/106/
            Distro/Arch: RHEL6.5/x86_64 (server), SLES11SP3/x86_64 (client)

            replay-single test 35 hit the same failure:
            https://testing.hpdd.intel.com/test_sets/1d491550-81ea-11e4-a79f-5254006e85c2

            yujian Jian Yu added a comment -

            While verifying patch http://review.whamcloud.com/13046 on Lustre b2_5 branch, recovery-small test 107 hit the same failure:

            Lustre: DEBUG MARKER: umount -d /mnt/mds1
            LustreError: 166-1: MGC10.1.4.101@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
            Lustre: MGS is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
            Lustre: MGS is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
            Lustre: MGS is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
            Lustre: MGS is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
            Lustre: MGS is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
            Lustre: 24854:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1418409787/real 1418409787]  req@ffff880057dc1c00 x1487306634660884/t0(0) o250->MGC10.1.4.101@tcp@0@lo:26/25 lens 400/544 e 0 to 1 dl 1418409812 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Lustre: 24854:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 17 previous similar messages 
            INFO: task umount:12953 blocked for more than 120 seconds. 
                  Not tainted 2.6.32-431.29.2.el6_lustre.g6b22a20.x86_64 #1
            "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
            umount        D 0000000000000001     0 12953  12952 0x00000080
             ffff880079a09aa8 0000000000000082 0000000000000000 ffff88006c40b800
             ffffffffa0b39471 0000000000000000 ffff88006c349184 ffffffffa0b39471
             ffff88006cd4b098 ffff880079a09fd8 000000000000fbc8 ffff88006cd4b098
            Call Trace:
             [<ffffffff8152a532>] schedule_timeout+0x192/0x2e0
             [<ffffffff81083f30>] ? process_timeout+0x0/0x10
             [<ffffffffa0ab9efb>] obd_exports_barrier+0xab/0x180 [obdclass]
             [<ffffffffa170755e>] mgs_device_fini+0xfe/0x580 [mgs]
             [<ffffffffa0ae6833>] class_cleanup+0x573/0xd30 [obdclass]
             [<ffffffffa0abc096>] ? class_name2dev+0x56/0xe0 [obdclass]
             [<ffffffffa0ae855a>] class_process_config+0x156a/0x1ad0 [obdclass]
             [<ffffffffa09942f8>] ? libcfs_log_return+0x28/0x40 [libcfs] 
             [<ffffffffa0ae16f2>] ? lustre_cfg_new+0x312/0x6e0 [obdclass]
             [<ffffffffa0ae8c39>] class_manual_cleanup+0x179/0x6f0 [obdclass]
             [<ffffffffa0abc096>] ? class_name2dev+0x56/0xe0 [obdclass]
             [<ffffffffa0b241ed>] server_put_super+0x45d/0xf60 [obdclass]
             [<ffffffff8118b63b>] generic_shutdown_super+0x5b/0xe0
             [<ffffffff8118b726>] kill_anon_super+0x16/0x60
             [<ffffffffa0aeaae6>] lustre_kill_super+0x36/0x60 [obdclass]
             [<ffffffff8118bec7>] deactivate_super+0x57/0x80
             [<ffffffff811ab8cf>] mntput_no_expire+0xbf/0x110
             [<ffffffff811ac41b>] sys_umount+0x7b/0x3a0
             [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
            

            Maloo report: https://testing.hpdd.intel.com/test_sets/a1d5447c-82e1-11e4-9195-5254006e85c2

            yujian Jian Yu added a comment -

            Lustre client build: https://build.hpdd.intel.com/job/lustre-b2_5/84/
            Lustre server build: https://build.hpdd.intel.com/job/lustre-b2_4/73/ (2.4.3)
            Distro/Arch: RHEL6.5/x86_64
            FSTYPE=ldiskfs

            sanity-scrub test 1c also hit the same failure:
            https://testing.hpdd.intel.com/test_sets/34be42e2-2aa5-11e4-b21b-5254006e85c2


            People

              Assignee: WC Triage (wc-triage)
              Reporter: Jian Yu (yujian)
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: