Lustre / LU-4637

Deactivating an OST causes the MDS system load to continually increase and the fs to hang

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Lustre 2.5.0
    • Fix Version/s: None
    • Labels:
      None
    • Environment:
      CentOS 6.3, software RAID
    • Severity:
      3
    • Rank (Obsolete):
      12686

      Description

      When we deactivate an OST on the MDS, the MDS system load skyrockets and the file system hangs.

      This is what we see in the logs:

      Feb 17 11:50:41 kmet0002 kernel: Lustre: setting import kl2-OST0000_UUID INACTIVE by administrator request
      Feb 17 11:50:48 kmet0002 kernel: Lustre: setting import kl2-OST0001_UUID INACTIVE by administrator request
      Feb 17 11:50:50 kmet0002 kernel: Lustre: setting import kl2-OST0002_UUID INACTIVE by administrator request
      Feb 17 11:50:56 kmet0002 kernel: Lustre: setting import kl2-OST0004_UUID INACTIVE by administrator request
      Feb 17 11:51:04 kmet0002 kernel: Lustre: setting import kl2-OST0006_UUID INACTIVE by administrator request
      Feb 17 11:52:40 kmet0002 kernel: Lustre: kl2-OST0005-osc-MDT0000: slow creates, last=[0x0:0x1:0x0], next=[0x0:0x1:0x0], reserved=0, syn_changes=0, syn_rpc_in_progress=0, status=-19
      Feb 17 11:54:20 kmet0002 kernel: Lustre: kl2-OST0005-osc-MDT0000: slow creates, last=[0x0:0x1:0x0], next=[0x0:0x1:0x0], reserved=0, syn_changes=0, syn_rpc_in_progress=0, status=-19
      Feb 17 11:54:44 kmet0002 kernel: LNet: Service thread pid 11465 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Feb 17 11:54:44 kmet0002 kernel: Pid: 11465, comm: mdt00_118
      Feb 17 11:54:44 kmet0002 kernel:
      Feb 17 11:54:44 kmet0002 kernel: Call Trace:
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffff8150f362>] schedule_timeout+0x192/0x2e0
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffff810811e0>] ? process_timeout+0x0/0x10
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0faf14c>] osp_precreate_reserve+0x5dc/0x1ef0 [osp]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0fa8b75>] osp_declare_object_create+0x155/0x4f0 [osp]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0ef02dd>] lod_qos_declare_object_on+0xed/0x480 [lod]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0ef1169>] lod_alloc_qos.clone.0+0xaf9/0x1100 [lod]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0ef2ccf>] lod_qos_prep_create+0x77f/0x1aa0 [lod]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa097adfa>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa097eb32>] ? fld_server_lookup+0x72/0x430 [fld]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0eecb2b>] lod_declare_striped_object+0x14b/0x880 [lod]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0d05916>] ? osd_xattr_get+0x226/0x2e0 [osd_ldiskfs]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0eed721>] lod_declare_object_create+0x4c1/0x790 [lod]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0f4a8ef>] mdd_declare_object_create_internal+0xbf/0x1f0 [mdd]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0f59eae>] mdd_declare_create+0x4e/0x870 [mdd]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0f583cf>] ? mdd_linkea_prepare+0x24f/0x4e0 [mdd]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0f5ae91>] mdd_create+0x7c1/0x1730 [mdd]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0d05787>] ? osd_xattr_get+0x97/0x2e0 [osd_ldiskfs]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0ee9560>] ? lod_index_lookup+0x0/0x30 [lod]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0e26dc8>] mdo_create+0x18/0x50 [mdt]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0e31031>] mdt_reint_open+0x1351/0x20a0 [mdt]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa050ae16>] ? upcall_cache_get_entry+0x296/0x880 [libcfs]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa062b600>] ? lu_ucred_global_init+0x0/0x30 [obdclass]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0e19eb1>] mdt_reint_rec+0x41/0xe0 [mdt]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0e01c93>] mdt_reint_internal+0x4c3/0x780 [mdt]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0e0221d>] mdt_intent_reint+0x1ed/0x520 [mdt]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0dfd8ce>] mdt_intent_policy+0x3ae/0x770 [mdt]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0747461>] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa077017f>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0dfdd96>] mdt_enqueue+0x46/0xe0 [mdt]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0e04a8a>] mdt_handle_common+0x52a/0x1470 [mdt]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0e3ec55>] mds_regular_handle+0x15/0x20 [mdt]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa079fe25>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa04ef4ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa050027f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa07974c9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffff81051439>] ? __wake_up_common+0x59/0x90
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa07a118d>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa07a06a0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffff81096a36>] kthread+0x96/0xa0
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffff810969a0>] ? kthread+0x0/0xa0
      Feb 17 11:54:44 kmet0002 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      Feb 17 11:54:44 kmet0002 kernel:
      Feb 17 11:54:44 kmet0002 kernel: LustreError: dumping log to /tmp/lustre-log.1392609284.11465
      Feb 17 11:54:45 kmet0002 kernel: LNet: Service thread pid 11417 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:

      We reactivated the OST when the system load hit 80, and the load came down to almost 0 within a few minutes.

      iostat showed no I/O during the hang.
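
For anyone triaging from the excerpt above, here is a minimal Python sketch (not part of the original report) that summarizes the relevant syslog lines. It lists the imports that were set INACTIVE and the targets that reported "slow creates", and maps the status value to an errno name. It also splits the debug dump file name into a timestamp and a PID. The function names (`summarize`, `errno_name`, `parse_dump_name`) are hypothetical helpers, the regular expressions are written only against the message formats visible in this ticket, and reading the dump file suffix as epoch seconds followed by the thread PID is an assumption based on the PID matching the one in the watchdog message.

```python
import errno
import re
from datetime import datetime, timezone

# Lines copied verbatim from the syslog excerpt in this ticket.
LOG = """\
Feb 17 11:50:41 kmet0002 kernel: Lustre: setting import kl2-OST0000_UUID INACTIVE by administrator request
Feb 17 11:50:48 kmet0002 kernel: Lustre: setting import kl2-OST0001_UUID INACTIVE by administrator request
Feb 17 11:50:50 kmet0002 kernel: Lustre: setting import kl2-OST0002_UUID INACTIVE by administrator request
Feb 17 11:50:56 kmet0002 kernel: Lustre: setting import kl2-OST0004_UUID INACTIVE by administrator request
Feb 17 11:51:04 kmet0002 kernel: Lustre: setting import kl2-OST0006_UUID INACTIVE by administrator request
Feb 17 11:52:40 kmet0002 kernel: Lustre: kl2-OST0005-osc-MDT0000: slow creates, last=[0x0:0x1:0x0], next=[0x0:0x1:0x0], reserved=0, syn_changes=0, syn_rpc_in_progress=0, status=-19
Feb 17 11:54:44 kmet0002 kernel: LustreError: dumping log to /tmp/lustre-log.1392609284.11465
"""

# Patterns assumed from the message formats shown in this ticket only.
INACTIVE_RE = re.compile(r"setting import (\S+)_UUID INACTIVE")
SLOW_RE = re.compile(r"Lustre: (\S+)-osc-MDT\d+: slow creates,.*status=(-?\d+)")
DUMP_RE = re.compile(r"lustre-log\.(\d+)\.(\d+)")


def summarize(log_text):
    """Return (deactivated targets in order, {slow target: set of statuses})."""
    deactivated = []
    slow = {}
    for line in log_text.splitlines():
        m = INACTIVE_RE.search(line)
        if m:
            deactivated.append(m.group(1))
            continue
        m = SLOW_RE.search(line)
        if m:
            slow.setdefault(m.group(1), set()).add(int(m.group(2)))
    return deactivated, slow


def errno_name(status):
    """Map a negative kernel-style status to its errno name, if known."""
    return errno.errorcode.get(abs(status), "UNKNOWN")


def parse_dump_name(path):
    """Split a dump path into (UTC datetime, pid); assumes <epoch>.<pid> suffix."""
    m = DUMP_RE.search(path)
    if m is None:
        raise ValueError(f"not a lustre-log dump path: {path!r}")
    when = datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)
    return when, int(m.group(2))


if __name__ == "__main__":
    deactivated, slow = summarize(LOG)
    print("deactivated:", ", ".join(deactivated))
    for target, statuses in sorted(slow.items()):
        names = ", ".join(f"{s} ({errno_name(s)})" for s in sorted(statuses))
        flag = "" if target in deactivated else " [not in deactivated list above]"
        print(f"slow creates on {target}: status {names}{flag}")
    when, pid = parse_dump_name("/tmp/lustre-log.1392609284.11465")
    print("dump written at", when.isoformat(), "by pid", pid)
```

Run against the excerpt, this reports status -19 as ENODEV and shows that the target with slow creates, kl2-OST0005, is not among the five imports whose deactivation appears in the lines pasted here. The dump timestamp decodes to 2014-02-17 03:54:44 UTC, which would put the syslog clock at UTC+8 if it is local time.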

            People

            • Assignee: WC Triage (wc-triage)
            • Reporter: Stuart Midgley (sdm900)