Details
Type: Bug
Resolution: Won't Do
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.5.0
Labels: None
Environment: CentOS 6.3, software RAID
Severity: 3
Rank: 12686
Description
When we deactivate an OST on the MDS, the MDS system load skyrockets and the file system hangs.
This is what we see in the logs:
Feb 17 11:50:41 kmet0002 kernel: Lustre: setting import kl2-OST0000_UUID INACTIVE by administrator request
Feb 17 11:50:48 kmet0002 kernel: Lustre: setting import kl2-OST0001_UUID INACTIVE by administrator request
Feb 17 11:50:50 kmet0002 kernel: Lustre: setting import kl2-OST0002_UUID INACTIVE by administrator request
Feb 17 11:50:56 kmet0002 kernel: Lustre: setting import kl2-OST0004_UUID INACTIVE by administrator request
Feb 17 11:51:04 kmet0002 kernel: Lustre: setting import kl2-OST0006_UUID INACTIVE by administrator request
Feb 17 11:52:40 kmet0002 kernel: Lustre: kl2-OST0005-osc-MDT0000: slow creates, last=[0x0:0x1:0x0], next=[0x0:0x1:0x0], reserved=0, syn_changes=0, syn_rpc_in_progress=0, status=-19
Feb 17 11:54:20 kmet0002 kernel: Lustre: kl2-OST0005-osc-MDT0000: slow creates, last=[0x0:0x1:0x0], next=[0x0:0x1:0x0], reserved=0, syn_changes=0, syn_rpc_in_progress=0, status=-19
Feb 17 11:54:44 kmet0002 kernel: LNet: Service thread pid 11465 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Feb 17 11:54:44 kmet0002 kernel: Pid: 11465, comm: mdt00_118
Feb 17 11:54:44 kmet0002 kernel:
Feb 17 11:54:44 kmet0002 kernel: Call Trace:
Feb 17 11:54:44 kmet0002 kernel: [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
Feb 17 11:54:44 kmet0002 kernel: [<ffffffff8150f362>] schedule_timeout+0x192/0x2e0
Feb 17 11:54:44 kmet0002 kernel: [<ffffffff810811e0>] ? process_timeout+0x0/0x10
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0faf14c>] osp_precreate_reserve+0x5dc/0x1ef0 [osp]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0fa8b75>] osp_declare_object_create+0x155/0x4f0 [osp]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0ef02dd>] lod_qos_declare_object_on+0xed/0x480 [lod]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0ef1169>] lod_alloc_qos.clone.0+0xaf9/0x1100 [lod]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0ef2ccf>] lod_qos_prep_create+0x77f/0x1aa0 [lod]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa097adfa>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa097eb32>] ? fld_server_lookup+0x72/0x430 [fld]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0eecb2b>] lod_declare_striped_object+0x14b/0x880 [lod]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0d05916>] ? osd_xattr_get+0x226/0x2e0 [osd_ldiskfs]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0eed721>] lod_declare_object_create+0x4c1/0x790 [lod]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0f4a8ef>] mdd_declare_object_create_internal+0xbf/0x1f0 [mdd]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0f59eae>] mdd_declare_create+0x4e/0x870 [mdd]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0f583cf>] ? mdd_linkea_prepare+0x24f/0x4e0 [mdd]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0f5ae91>] mdd_create+0x7c1/0x1730 [mdd]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0d05787>] ? osd_xattr_get+0x97/0x2e0 [osd_ldiskfs]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0ee9560>] ? lod_index_lookup+0x0/0x30 [lod]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0e26dc8>] mdo_create+0x18/0x50 [mdt]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0e31031>] mdt_reint_open+0x1351/0x20a0 [mdt]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa050ae16>] ? upcall_cache_get_entry+0x296/0x880 [libcfs]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa062b600>] ? lu_ucred_global_init+0x0/0x30 [obdclass]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0e19eb1>] mdt_reint_rec+0x41/0xe0 [mdt]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0e01c93>] mdt_reint_internal+0x4c3/0x780 [mdt]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0e0221d>] mdt_intent_reint+0x1ed/0x520 [mdt]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0dfd8ce>] mdt_intent_policy+0x3ae/0x770 [mdt]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0747461>] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa077017f>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0dfdd96>] mdt_enqueue+0x46/0xe0 [mdt]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0e04a8a>] mdt_handle_common+0x52a/0x1470 [mdt]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa0e3ec55>] mds_regular_handle+0x15/0x20 [mdt]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa079fe25>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa04ef4ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa050027f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa07974c9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffff81051439>] ? __wake_up_common+0x59/0x90
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa07a118d>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffffa07a06a0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
Feb 17 11:54:44 kmet0002 kernel: [<ffffffff81096a36>] kthread+0x96/0xa0
Feb 17 11:54:44 kmet0002 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Feb 17 11:54:44 kmet0002 kernel: [<ffffffff810969a0>] ? kthread+0x0/0xa0
Feb 17 11:54:44 kmet0002 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Feb 17 11:54:44 kmet0002 kernel:
Feb 17 11:54:44 kmet0002 kernel: LustreError: dumping log to /tmp/lustre-log.1392609284.11465
Feb 17 11:54:45 kmet0002 kernel: LNet: Service thread pid 11417 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
We reactivated the OST when the system load hit 80, and the load came down to almost 0 within a few minutes.
iostat showed no I/O.
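For reference, a minimal sketch of the standard lctl procedure for deactivating and reactivating an OST from the MDS, which is presumably what was run here; the exact commands used on kmet0002 are not recorded in this ticket, and the device name below is taken from the log messages above.

# On the MDS, list the OSC/OSP devices that represent the OSTs to the MDT.
lctl dl | grep osc

# Deactivate one OST so the MDT stops allocating new objects on it.
# This is what logs "setting import kl2-OST0000_UUID INACTIVE by
# administrator request" on the MDS.
lctl --device kl2-OST0000-osc-MDT0000 deactivate

# Reactivate it once the OST should be used again.
lctl --device kl2-OST0000-osc-MDT0000 activate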
The reason the patch hasn't landed yet is that it is causing failures in our testing. The problem is likely in the test script itself rather than the patch, but we can't land a patch that causes test failures, or it will interfere with all the other patches trying to land.