  LU-4637

Deactivating an OST causes the MDS system load to continually increase and the fs to hang

Details

    • Type: Bug
    • Resolution: Won't Do
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.0
    • Labels: None
    • Environment: CentOS 6.3, software RAID
    • Severity: 3
    • Rank (Obsolete): 12686

    Description

      When we deactivate an OST on the MDS, the MDS system load skyrockets and the file system hangs.
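
      For reference, the deactivation described here is the standard administrative one done from the MDS; the commands below are only a sketch of that procedure (the kl2 device names are taken from the logs, and either form may have been used):

      # on the MDS, list the OSC devices the MDT uses to reach the OSTs
      lctl dl | grep osc

      # temporarily stop allocating new objects on one OST
      lctl --device kl2-OST0000-osc-MDT0000 deactivate

      # or, persistently, from the MGS node
      lctl conf_param kl2-OST0000.osc.active=0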

      This is what we see in the logs:

      Feb 17 11:50:41 kmet0002 kernel: Lustre: setting import kl2-OST0000_UUID INACTIVE by administrator request
      Feb 17 11:50:48 kmet0002 kernel: Lustre: setting import kl2-OST0001_UUID INACTIVE by administrator request
      Feb 17 11:50:50 kmet0002 kernel: Lustre: setting import kl2-OST0002_UUID INACTIVE by administrator request
      Feb 17 11:50:56 kmet0002 kernel: Lustre: setting import kl2-OST0004_UUID INACTIVE by administrator request
      Feb 17 11:51:04 kmet0002 kernel: Lustre: setting import kl2-OST0006_UUID INACTIVE by administrator request
      Feb 17 11:52:40 kmet0002 kernel: Lustre: kl2-OST0005-osc-MDT0000: slow creates, last=[0x0:0x1:0x0], next=[0x0:0x1:0x0], reserved=0, syn_changes=0, syn_rpc_in_progress=0, status=-19
      Feb 17 11:54:20 kmet0002 kernel: Lustre: kl2-OST0005-osc-MDT0000: slow creates, last=[0x0:0x1:0x0], next=[0x0:0x1:0x0], reserved=0, syn_changes=0, syn_rpc_in_progress=0, status=-19
      Feb 17 11:54:44 kmet0002 kernel: LNet: Service thread pid 11465 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Feb 17 11:54:44 kmet0002 kernel: Pid: 11465, comm: mdt00_118
      Feb 17 11:54:44 kmet0002 kernel:
      Feb 17 11:54:44 kmet0002 kernel: Call Trace:
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffff8150f362>] schedule_timeout+0x192/0x2e0
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffff810811e0>] ? process_timeout+0x0/0x10
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0faf14c>] osp_precreate_reserve+0x5dc/0x1ef0 [osp]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0fa8b75>] osp_declare_object_create+0x155/0x4f0 [osp]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0ef02dd>] lod_qos_declare_object_on+0xed/0x480 [lod]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0ef1169>] lod_alloc_qos.clone.0+0xaf9/0x1100 [lod]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0ef2ccf>] lod_qos_prep_create+0x77f/0x1aa0 [lod]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa097adfa>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa097eb32>] ? fld_server_lookup+0x72/0x430 [fld]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0eecb2b>] lod_declare_striped_object+0x14b/0x880 [lod]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0d05916>] ? osd_xattr_get+0x226/0x2e0 [osd_ldiskfs]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0eed721>] lod_declare_object_create+0x4c1/0x790 [lod]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0f4a8ef>] mdd_declare_object_create_internal+0xbf/0x1f0 [mdd]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0f59eae>] mdd_declare_create+0x4e/0x870 [mdd]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0f583cf>] ? mdd_linkea_prepare+0x24f/0x4e0 [mdd]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0f5ae91>] mdd_create+0x7c1/0x1730 [mdd]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0d05787>] ? osd_xattr_get+0x97/0x2e0 [osd_ldiskfs]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0ee9560>] ? lod_index_lookup+0x0/0x30 [lod]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0e26dc8>] mdo_create+0x18/0x50 [mdt]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0e31031>] mdt_reint_open+0x1351/0x20a0 [mdt]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa050ae16>] ? upcall_cache_get_entry+0x296/0x880 [libcfs]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa062b600>] ? lu_ucred_global_init+0x0/0x30 [obdclass]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0e19eb1>] mdt_reint_rec+0x41/0xe0 [mdt]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0e01c93>] mdt_reint_internal+0x4c3/0x780 [mdt]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0e0221d>] mdt_intent_reint+0x1ed/0x520 [mdt]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0dfd8ce>] mdt_intent_policy+0x3ae/0x770 [mdt]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0747461>] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa077017f>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0dfdd96>] mdt_enqueue+0x46/0xe0 [mdt]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0e04a8a>] mdt_handle_common+0x52a/0x1470 [mdt]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0e3ec55>] mds_regular_handle+0x15/0x20 [mdt]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa079fe25>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa04ef4ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa050027f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa07974c9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffff81051439>] ? __wake_up_common+0x59/0x90
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa07a118d>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa07a06a0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffff81096a36>] kthread+0x96/0xa0
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffff810969a0>] ? kthread+0x0/0xa0
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      Feb 17 11:54:44 kmet0002 kernel:
      Feb 17 11:54:44 kmet0002 kernel: LustreError: dumping log to /tmp/lustre-log.1392609284.11465
      Feb 17 11:54:45 kmet0002 kernel: LNet: Service thread pid 11417 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:

      We reactivated the OST when the system load hit 80, and the load came down to almost 0 within a few minutes.

      iostat showed no I/O.
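
      For completeness, the reactivation mentioned above is the inverse step; something like the following, with the device name again taken from the logs:

      lctl --device kl2-OST0000-osc-MDT0000 activate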


        Activity


          adilger Andreas Dilger added a comment -

          The reason the patch hasn't landed yet is that it is causing failures in our testing. The problem is likely related to the test script itself and not a reflection of the patch, but we can't land a patch that is causing test failures, or it will interfere with all the other patches trying to land.
          tomtervo Tommi Tervo added a comment -

          Hi,

          I think we hit this bug yesterday and I had to use pools as a workaround. Is there any reason why this one-liner is not committed to master?
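
          For context, the pool workaround mentioned above generally amounts to restricting new file creation to the OSTs that are still in service; a minimal sketch, where the pool name, OST list, and mount point are purely illustrative:

          # create a pool containing only the OSTs that should receive new objects
          lctl pool_new kl2.active_osts
          lctl pool_add kl2.active_osts kl2-OST0003 kl2-OST0005

          # direct new files under the file system root to that pool
          lfs setstripe -p active_osts /mnt/kl2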

          gerrit Gerrit Updater added a comment -

          Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/12886
          Subject: LU-4637 osp: Report disconnected OSTs as unhealthy
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 9fa724db631e0c064ac249f4ef315cbc97a91acc
          adilger Andreas Dilger added a comment -

          http://review.whamcloud.com/11552
          http://review.whamcloud.com/11553

          People

            Assignee: wc-triage WC Triage
            Reporter: sdm900 Stuart Midgley (Inactive)
            Votes: 0
            Watchers: 6
