Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4772

MGS is waiting for obd_unlinked_exports

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.8.0
    • Lustre 2.5.1, Lustre 2.7.0, Lustre 2.8.0
    • None
    • OpenSFS cluster with one MGS/MDS (MDT is a partition of /dev/sda), one OSS with two OSTs, a node running robinhood, one client and agent node and one client
    • 3
    • 13125

    Description

      Running acceptance-small for HSM testing on 2.5.1-RC4, replay-single hangs in test 53g. The test results are at https://maloo.whamcloud.com/test_sets/23724c76-abad-11e3-a696-52540035b04c , but there are no logs for test 53g, except at the end of the suite_log:

      == replay-single test 53g: |X| drop open reply and close request while close and open are both in flight == 18:37:54 (1394761074)
      fail_loc=0x119
      fail_loc=0x80000115
      fail_loc=0
      Replay barrier on lscratch-MDT0000
      Failing mds1 on c16
      Stopping /lustre/lscratch/mdt0 (opts:) on c16
      

      Looking at dmesg on the MDT:

      Lustre: DEBUG MARKER: == replay-single test 53g: |X| drop open reply and close request while close and open are both in flight == 18:37:54 (1394761074)
      Lustre: *** cfs_fail_loc=119, val=2147483648***
      Lustre: Skipped 1 previous similar message
      LustreError: 18010:0:(ldlm_lib.c:2415:target_send_reply_msg()) @@@ dropping reply  req@ffff88081e682000 x1462504360037684/t274877906958(0) o36->97aaf730-d78d-08d9-43ce-9e768c9c685f@192.168.2.120@o2ib:0/0 lens 488/448 e 0 to 0 dl 1394761114 ref 1 fl Interpret:/0/0 rc 0/0
      LustreError: 18010:0:(ldlm_lib.c:2415:target_send_reply_msg()) Skipped 1 previous similar message
      Lustre: *** cfs_fail_loc=115, val=2147483648***
      Turning device sda (0x800003) read-only
      Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lscratch-MDT0000
      Lustre: MGS is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
      Lustre: MGS is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
      LustreError: 166-1: MGC192.168.2.116@o2ib: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
      Lustre: MGS is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
      Lustre: MGS is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
      Lustre: MGS is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
      Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394761310/real 1394761310]  req@ffff8808227c9400 x1462503945016504/t0(0) o250->MGC192.168.2.116@o2ib@0@lo:26/25 lens 400/544 e 0 to 1 dl 1394761346 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 14 previous similar messages
      LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.119@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 66 previous similar messages
      INFO: task umount:18492 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      umount        D 0000000000000000     0 18492  18491 0x00000080
       ffff880803de3aa8 0000000000000082 ffff880803de3a08 ffff88082104b000
       ffffffffa05a6985 0000000000000000 ffff8808102d6084 ffffffffa05a6985
       ffff88082a385af8 ffff880803de3fd8 000000000000fbc8 ffff88082a385af8
      Call Trace:
       [<ffffffff81528eb2>] schedule_timeout+0x192/0x2e0
       [<ffffffff81084220>] ? process_timeout+0x0/0x10
       [<ffffffffa0528eeb>] obd_exports_barrier+0xab/0x180 [obdclass]
       [<ffffffffa0d4052e>] mgs_device_fini+0xfe/0x580 [mgs]
       [<ffffffffa0554523>] class_cleanup+0x573/0xd30 [obdclass]
       [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa055624a>] class_process_config+0x156a/0x1ad0 [obdclass]
       [<ffffffffa054f3a3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
       [<ffffffffa0556929>] class_manual_cleanup+0x179/0x6f0 [obdclass]
       [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa0591dfd>] server_put_super+0x45d/0xf60 [obdclass]
       [<ffffffff8118b23b>] generic_shutdown_super+0x5b/0xe0
       [<ffffffff8118b326>] kill_anon_super+0x16/0x60
       [<ffffffffa05587d6>] lustre_kill_super+0x36/0x60 [obdclass]
       [<ffffffff8118bac7>] deactivate_super+0x57/0x80
       [<ffffffff811aaaff>] mntput_no_expire+0xbf/0x110
       [<ffffffff811ab64b>] sys_umount+0x7b/0x3a0
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      Lustre: MGS is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 5. Is it stuck?
      INFO: task umount:18492 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      umount        D 0000000000000000     0 18492  18491 0x00000080
       ffff880803de3aa8 0000000000000082 ffff880803de3a08 ffff88082104b000
       ffffffffa05a6985 0000000000000000 ffff8808102d6084 ffffffffa05a6985
       ffff88082a385af8 ffff880803de3fd8 000000000000fbc8 ffff88082a385af8
      Call Trace:
       [<ffffffff81528eb2>] schedule_timeout+0x192/0x2e0
       [<ffffffff81084220>] ? process_timeout+0x0/0x10
       [<ffffffffa0528eeb>] obd_exports_barrier+0xab/0x180 [obdclass]
       [<ffffffffa0d4052e>] mgs_device_fini+0xfe/0x580 [mgs]
       [<ffffffffa0554523>] class_cleanup+0x573/0xd30 [obdclass]
       [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa055624a>] class_process_config+0x156a/0x1ad0 [obdclass]
       [<ffffffffa054f3a3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
       [<ffffffffa0556929>] class_manual_cleanup+0x179/0x6f0 [obdclass]
       [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa0591dfd>] server_put_super+0x45d/0xf60 [obdclass]
       [<ffffffff8118b23b>] generic_shutdown_super+0x5b/0xe0
       [<ffffffff8118b326>] kill_anon_super+0x16/0x60
       [<ffffffffa05587d6>] lustre_kill_super+0x36/0x60 [obdclass]
       [<ffffffff8118bac7>] deactivate_super+0x57/0x80
       [<ffffffff811aaaff>] mntput_no_expire+0xbf/0x110
       [<ffffffff811ab64b>] sys_umount+0x7b/0x3a0
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      INFO: task umount:18492 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      umount        D 0000000000000000     0 18492  18491 0x00000080
       ffff880803de3aa8 0000000000000082 ffff880803de3a08 ffff88082104b000
       ffffffffa05a6985 0000000000000000 ffff8808102d6084 ffffffffa05a6985
       ffff88082a385af8 ffff880803de3fd8 000000000000fbc8 ffff88082a385af8
      Call Trace:
       [<ffffffff81528eb2>] schedule_timeout+0x192/0x2e0
       [<ffffffff81084220>] ? process_timeout+0x0/0x10
       [<ffffffffa0528eeb>] obd_exports_barrier+0xab/0x180 [obdclass]
       [<ffffffffa0d4052e>] mgs_device_fini+0xfe/0x580 [mgs]
       [<ffffffffa0554523>] class_cleanup+0x573/0xd30 [obdclass]
       [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa055624a>] class_process_config+0x156a/0x1ad0 [obdclass]
       [<ffffffffa054f3a3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
       [<ffffffffa0556929>] class_manual_cleanup+0x179/0x6f0 [obdclass]
       [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa0591dfd>] server_put_super+0x45d/0xf60 [obdclass]
       [<ffffffff8118b23b>] generic_shutdown_super+0x5b/0xe0
       [<ffffffff8118b326>] kill_anon_super+0x16/0x60
       [<ffffffffa05587d6>] lustre_kill_super+0x36/0x60 [obdclass]
       [<ffffffff8118bac7>] deactivate_super+0x57/0x80
       [<ffffffff811aaaff>] mntput_no_expire+0xbf/0x110
       [<ffffffff811ab64b>] sys_umount+0x7b/0x3a0
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394761910/real 1394761910]  req@ffff8808049d6000 x1462503945016540/t0(0) o250->MGC192.168.2.116@o2ib@0@lo:26/25 lens 400/544 e 0 to 1 dl 1394761965 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
      LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.118@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 120 previous similar messages
      INFO: task umount:18492 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      umount        D 0000000000000000     0 18492  18491 0x00000080
       ffff880803de3aa8 0000000000000082 ffff880803de3a08 ffff88082104b000
       ffffffffa05a6985 0000000000000000 ffff8808102d6084 ffffffffa05a6985
       ffff88082a385af8 ffff880803de3fd8 000000000000fbc8 ffff88082a385af8
      Call Trace:
       [<ffffffff81528eb2>] schedule_timeout+0x192/0x2e0
       [<ffffffff81084220>] ? process_timeout+0x0/0x10
       [<ffffffffa0528eeb>] obd_exports_barrier+0xab/0x180 [obdclass]
       [<ffffffffa0d4052e>] mgs_device_fini+0xfe/0x580 [mgs]
       [<ffffffffa0554523>] class_cleanup+0x573/0xd30 [obdclass]
       [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa055624a>] class_process_config+0x156a/0x1ad0 [obdclass]
       [<ffffffffa054f3a3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
       [<ffffffffa0556929>] class_manual_cleanup+0x179/0x6f0 [obdclass]
       [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa0591dfd>] server_put_super+0x45d/0xf60 [obdclass]
       [<ffffffff8118b23b>] generic_shutdown_super+0x5b/0xe0
       [<ffffffff8118b326>] kill_anon_super+0x16/0x60
       [<ffffffffa05587d6>] lustre_kill_super+0x36/0x60 [obdclass]
       [<ffffffff8118bac7>] deactivate_super+0x57/0x80
       [<ffffffff811aaaff>] mntput_no_expire+0xbf/0x110
       [<ffffffff811ab64b>] sys_umount+0x7b/0x3a0
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      Lustre: MGS is waiting for obd_unlinked_exports more than 512 seconds. The obd refcount = 5. Is it stuck?
      INFO: task umount:18492 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      umount        D 0000000000000000     0 18492  18491 0x00000080
       ffff880803de3aa8 0000000000000082 ffff880803de3a08 ffff88082104b000
       ffffffffa05a6985 0000000000000000 ffff8808102d6084 ffffffffa05a6985
       ffff88082a385af8 ffff880803de3fd8 000000000000fbc8 ffff88082a385af8
      Call Trace:
       [<ffffffff81528eb2>] schedule_timeout+0x192/0x2e0
       [<ffffffff81084220>] ? process_timeout+0x0/0x10
       [<ffffffffa0528eeb>] obd_exports_barrier+0xab/0x180 [obdclass]
       [<ffffffffa0d4052e>] mgs_device_fini+0xfe/0x580 [mgs]
       [<ffffffffa0554523>] class_cleanup+0x573/0xd30 [obdclass]
       [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa055624a>] class_process_config+0x156a/0x1ad0 [obdclass]
       [<ffffffffa054f3a3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
       [<ffffffffa0556929>] class_manual_cleanup+0x179/0x6f0 [obdclass]
       [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa0591dfd>] server_put_super+0x45d/0xf60 [obdclass]
       [<ffffffff8118b23b>] generic_shutdown_super+0x5b/0xe0
       [<ffffffff8118b326>] kill_anon_super+0x16/0x60
       [<ffffffffa05587d6>] lustre_kill_super+0x36/0x60 [obdclass]
       [<ffffffff8118bac7>] deactivate_super+0x57/0x80
       [<ffffffff811aaaff>] mntput_no_expire+0xbf/0x110
       [<ffffffff811ab64b>] sys_umount+0x7b/0x3a0
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      INFO: task umount:18492 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      umount        D 0000000000000000     0 18492  18491 0x00000080
       ffff880803de3aa8 0000000000000082 ffff880803de3a08 ffff88082104b000
       ffffffffa05a6985 0000000000000000 ffff8808102d6084 ffffffffa05a6985
       ffff88082a385af8 ffff880803de3fd8 000000000000fbc8 ffff88082a385af8
      Call Trace:
       [<ffffffff81528eb2>] schedule_timeout+0x192/0x2e0
       [<ffffffff81084220>] ? process_timeout+0x0/0x10
       [<ffffffffa0528eeb>] obd_exports_barrier+0xab/0x180 [obdclass]
       [<ffffffffa0d4052e>] mgs_device_fini+0xfe/0x580 [mgs]
       [<ffffffffa0554523>] class_cleanup+0x573/0xd30 [obdclass]
       [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa055624a>] class_process_config+0x156a/0x1ad0 [obdclass]
       [<ffffffffa054f3a3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
       [<ffffffffa0556929>] class_manual_cleanup+0x179/0x6f0 [obdclass]
       [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa0591dfd>] server_put_super+0x45d/0xf60 [obdclass]
       [<ffffffff8118b23b>] generic_shutdown_super+0x5b/0xe0
       [<ffffffff8118b326>] kill_anon_super+0x16/0x60
       [<ffffffffa05587d6>] lustre_kill_super+0x36/0x60 [obdclass]
       [<ffffffff8118bac7>] deactivate_super+0x57/0x80
       [<ffffffff811aaaff>] mntput_no_expire+0xbf/0x110
       [<ffffffff811ab64b>] sys_umount+0x7b/0x3a0
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.120@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 113 previous similar messages
      Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394762585/real 1394762585]  req@ffff880819252800 x1462503945016576/t0(0) o250->MGC192.168.2.116@o2ib@0@lo:26/25 lens 400/544 e 0 to 1 dl 1394762640 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
      Lustre: MGS is waiting for obd_unlinked_exports more than 1024 seconds. The obd refcount = 5. Is it stuck?
      LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.117@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 120 previous similar messages
      Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394763260/real 1394763260]  req@ffff8808202e3800 x1462503945016612/t0(0) o250->MGC192.168.2.116@o2ib@0@lo:26/25 lens 400/544 e 0 to 1 dl 1394763315 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
      LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.119@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 116 previous similar messages
      Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394763935/real 1394763935]  req@ffff8808040e9800 x1462503945016648/t0(0) o250->MGC192.168.2.116@o2ib@0@lo:26/25 lens 400/544 e 0 to 1 dl 1394763990 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
      LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.119@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 113 previous similar messages
      

      On the client, we see

      Lustre: DEBUG MARKER: == replay-single test 53g: |X| drop open reply and close request while close and open are both in flight == 18:37:54 (1394761074)
      Lustre: DEBUG MARKER: cancel_lru_locks mdc start
      Lustre: DEBUG MARKER: cancel_lru_locks mdc stop
      Lustre: DEBUG MARKER: cancel_lru_locks mdc start
      Lustre: DEBUG MARKER: cancel_lru_locks mdc stop
      Lustre: DEBUG MARKER: local REPLAY BARRIER on lscratch-MDT0000
      Lustre: 7328:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394761618/real 1394761618]  req@ffff88080f538800 x1462504360038020/t0(0) o250->MGC192.168.2.116@o2ib@192.168.2.116@o2ib:26/25 lens 400/544 e 0 to 1 dl 1394761673 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 7328:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 15 previous similar messages
      Lustre: 7328:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394762293/real 1394762293]  req@ffff880820549400 x1462504360038380/t0(0) o250->MGC192.168.2.116@o2ib@192.168.2.116@o2ib:26/25 lens 400/544 e 0 to 1 dl 1394762348 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      

      Attachments

        Issue Links

          Activity

            People

              tappro Mikhail Pershin
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: