Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.4.0
-
None
-
3
-
5205
Description
Got this happening twice already.
Racer would run for some time, then one of the MDT threads gets stuck in lod_qos code.
The most recent example happened on a crashdump enabled node, so I dumped kernel core for subsequent analysis:
[ 3000.572084] INFO: task mdt00_003:3541 blocked for more than 120 seconds. [ 3000.572364] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 3000.572831] mdt00_003 D 0000000000000001 2416 3541 2 0x00000080 [ 3000.573180] ffff8800af919630 0000000000000046 ffff8800af9195d0 ffffffff810385d8 [ 3000.573620] ffff8800ffffffff 000005900aedb3e6 ffff8800b0995df0 ffff8800af919630 [ 3000.574062] ffff8800b48908b8 ffff8800af919fd8 000000000000fba8 ffff8800b48908b8 [ 3000.574500] Call Trace: [ 3000.574696] [<ffffffff810385d8>] ? pvclock_clocksource_read+0x58/0xd0 [ 3000.574959] [<ffffffff814fabbd>] rwsem_down_failed_common+0x8d/0x1d0 [ 3000.575230] [<ffffffff814fad23>] rwsem_down_write_failed+0x23/0x30 [ 3000.575489] [<ffffffff8127c133>] call_rwsem_down_write_failed+0x13/0x20 [ 3000.575751] [<ffffffff814f9e85>] ? down_write+0x45/0x50 [ 3000.576002] [<ffffffffa0e06d82>] lod_alloc_qos.clone.0+0x182/0x1210 [lod] [ 3000.576271] [<ffffffff814f970c>] ? __mutex_lock_slowpath+0x21c/0x2c0 [ 3000.576558] [<ffffffffa0e08578>] lod_qos_prep_create+0x768/0x1af4 [lod] [ 3000.576906] [<ffffffffa0cc4ff2>] ? osd_declare_inode_qid+0x1a2/0x270 [osd_ldiskfs] [ 3000.577364] [<ffffffffa0e03a8b>] lod_declare_striped_object+0x14b/0x920 [lod] [ 3000.577782] [<ffffffffa0e04568>] lod_declare_object_create+0x308/0x4f0 [lod] [ 3000.578062] [<ffffffffa0beac4f>] mdd_declare_object_create_internal+0xaf/0x1d0 [mdd] [ 3000.578519] [<ffffffffa0bfbaca>] mdd_create+0x39a/0x1550 [mdd] [ 3000.578790] [<ffffffffa0d54bc9>] mdt_reint_open+0x1079/0x1860 [mdt] [ 3000.579077] [<ffffffffa0c0186e>] ? md_ucred+0x1e/0x60 [mdd] [ 3000.579330] [<ffffffffa0d1f655>] ? mdt_ucred+0x15/0x20 [mdt] [ 3000.579582] [<ffffffffa0d3f0a1>] mdt_reint_rec+0x41/0xe0 [mdt] [ 3000.579839] [<ffffffffa0d38483>] mdt_reint_internal+0x4e3/0x7e0 [mdt] [ 3000.580107] [<ffffffffa0d38a4d>] mdt_intent_reint+0x1ed/0x500 [mdt] [ 3000.580369] [<ffffffffa0d343fe>] mdt_intent_policy+0x38e/0x770 [mdt] [ 3000.580694] [<ffffffffa07c8dda>] ldlm_lock_enqueue+0x2ea/0x890 [ptlrpc] [ 3000.581100] [<ffffffffa07f00d7>] ldlm_handle_enqueue0+0x4e7/0x1010 [ptlrpc] [ 3000.581411] [<ffffffffa0d34936>] mdt_enqueue+0x46/0x130 [mdt] [ 3000.581661] [<ffffffffa0d281f2>] mdt_handle_common+0x932/0x1740 [mdt] [ 3000.581926] [<ffffffffa0d290d5>] mdt_regular_handle+0x15/0x20 [mdt] [ 3000.582712] [<ffffffffa081e483>] ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc] [ 3000.583164] [<ffffffffa049b66e>] ? cfs_timer_arm+0xe/0x10 [libcfs] [ 3000.583440] [<ffffffffa0817171>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [ 3000.583707] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 [ 3000.583965] [<ffffffffa082101a>] ptlrpc_main+0xb9a/0x1960 [ptlrpc] [ 3000.584247] [<ffffffffa0820480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc] [ 3000.584542] [<ffffffff8100c14a>] child_rip+0xa/0x20 [ 3000.584864] [<ffffffffa0820480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc] [ 3000.585226] [<ffffffffa0820480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc] [ 3000.585503] [<ffffffff8100c140>] ? child_rip+0x0/0x20
.
The core and modules are in /exports/crashdumps/192.168.10.218-2012-10-14-02:15:42
Another occurence happened a bit before that on a different node with no crashdump ability, so I only have these traces:
[ 2044.498003] Pid: 11763, comm: mdt01_013 [ 2044.498606] [ 2044.498607] Call Trace:[ 2044.499226] [<ffffffff810a9ee2>] ? __module_text_address+0x12/0x60 [ 2044.500232] [<ffffffffa01e8b22>] ? cfs_mem_cache_alloc+0x22/0x30 [libcfs] [ 2044.501324] [<ffffffff81044f4e>] ? kernel_map_pages+0xfe/0x110 [ 2044.502229] [<ffffffff81161d3f>] ? kmem_cache_free+0x7f/0x280 [ 2044.503134] [<ffffffff81161792>] ? cache_alloc_debugcheck_after+0x152/0x290 [ 2044.504245] [<ffffffffa01e8b22>] ? cfs_mem_cache_alloc+0x22/0x30 [libcfs] [ 2044.505351] [<ffffffffa01e8afe>] ? cfs_mem_cache_free+0xe/0x10 [libcfs] [ 2044.506395] [<ffffffffa0bb04c1>] ? osp_object_free+0x181/0x190 [osp] [ 2044.507418] [<ffffffffa03abcb3>] ? lu_object_free+0x123/0x1b0 [obdclass] [ 2044.508561] [<ffffffffa03ac4ed>] ? lu_object_put+0xad/0x350 [obdclass] [ 2044.509607] [<ffffffffa0bb1903>] ? osp_declare_object_create+0x163/0x540 [os p] [ 2044.510737] [<ffffffffa0b7f6fc>] ? lod_qos_declare_object_on+0x16c/0x4c0 [lod] [ 2044.511904] [<ffffffffa0b819f7>] ? lod_alloc_qos.clone.0+0xdf7/0x1210 [lod] [ 2044.513025] [<ffffffffa0b82578>] ? lod_qos_prep_create+0x768/0x1af4 [lod] [ 2044.514130] [<ffffffffa0a59ff2>] ? osd_declare_inode_qid+0x1a2/0x270 [osd_ldiskfs] [ 2044.515330] [<ffffffffa0b7da8b>] ? lod_declare_striped_object+0x14b/0x920 [lod] [ 2044.516492] [<ffffffffa0b7e568>] ? lod_declare_object_create+0x308/0x4f0 [lo [ 2044.517638] [<ffffffffa0498c4f>] ? mdd_declare_object_create_internal+0xaf/0x1d0 [mdd] [ 2044.518870] [<ffffffffa04a9aca>] ? mdd_create+0x39a/0x1550 [mdd] [ 2044.519822] [<ffffffffa0acebc9>] ? mdt_reint_open+0x1079/0x1860 [mdt] [ 2044.520851] [<ffffffffa04af86e>] ? md_ucred+0x1e/0x60 [mdd] [ 2044.521751] [<ffffffffa0a99655>] ? mdt_ucred+0x15/0x20 [mdt] [ 2044.522649] [<ffffffffa0ab90a1>] ? mdt_reint_rec+0x41/0xe0 [mdt] [ 2044.523601] [<ffffffffa0ab2483>] ? mdt_reint_internal+0x4e3/0x7e0 [mdt] [ 2044.524665] [<ffffffffa0ab2a4d>] ? mdt_intent_reint+0x1ed/0x500 [mdt] [ 2044.525708] [<ffffffffa0aae3fe>] ? mdt_intent_policy+0x38e/0x770 [mdt] [ 2044.526761] [<ffffffffa04fadda>] ? ldlm_lock_enqueue+0x2ea/0x890 [ptlrpc] [ 2044.527846] [<ffffffffa05220d7>] ? ldlm_handle_enqueue0+0x4e7/0x1010 [ptlrpc] [ 2044.529031] [<ffffffffa0aae936>] ? mdt_enqueue+0x46/0x130 [mdt] [ 2044.529986] [<ffffffffa0aa21f2>] ? mdt_handle_common+0x932/0x1740 [mdt] [ 2044.531032] [<ffffffffa0aa30d5>] ? mdt_regular_handle+0x15/0x20 [mdt] [ 2044.532070] [<ffffffffa0550483>] ? ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc] [ 2044.533340] [<ffffffffa01e866e>] ? cfs_timer_arm+0xe/0x10 [libcfs] [ 2044.534339] [<ffffffffa0549171>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [ 2044.535394] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 [ 2044.536212] [<ffffffffa055301a>] ? ptlrpc_main+0xb9a/0x1960 [ptlrpc] [ 2044.537257] [<ffffffffa0552480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc] [ 2044.538225] [<ffffffff8100c14a>] ? child_rip+0xa/0x20