Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
None
-
None
-
lustre-2.8.0_5.chaos-2.ch6.x86_64
kernel 3.10.0-514.0.0.1chaos.ch6.x86_64
-
3
-
9223372036854775807
Description
The console first reports this:
LustreError: 7526:0:(cl_object.c:735:cl_env_attach()) ASSERTION( rc == 0 ) failed:
LustreError: 7526:0:(cl_object.c:735:cl_env_attach()) LBUG
Pid: 7526, comm: ldlm_bl_02
Call Trace:
 libcfs_debug_dumpstack+0x53/0x80 [libcfs]
 lbug_with_loc+0x45/0xc0 [libcfs]
 cl_env_percpu_get+0xc2/0xd0 [obdclass]
 ll_invalidatepage+0x41/0x170 [lustre]
 vvp_page_discard+0xbd/0x160 [lustre]
 cl_page_invoid+0x68/0x170 [obdclass]
 cl_page_discard+0x13/0x20 [obdclass]
 discard_cb+0x67/0x190 [osc]
 osc_page_gang_lookup+0x1e0/0x320 [osc]
 ? discard_cb+0x0/0x190 [osc]
 osc_lock_discard_pages+0x119/0x22d [osc]
 ? discard_cb+0x0/0x190 [osc]
 osc_lock_flush+0x89/0x280 [osc]
 osc_ldlm_blocking_ast+0x2e3/0x3a0 [osc]
 ldlm_cancel_callback+0x8a/0x2e0 [ptlrpc]
 ? dequeue_entity+0x11c/0x5d0
 ldlm_cli_cancel_local+0xa0/0x420 [ptlrpc]
 ldlm_cli_cancel+0xab/0x3d0 [ptlrpc]
 osc_ldlm_blocking_ast+0x17a/0x3a0 [osc]
 ? __schedule+0x3b8/0x9c0
 ldlm_handle_bl_callback+0xcf/0x410 [ptlrpc]
 ldlm_bl_thread_main+0x531/0x700 [ptlrpc]
 ? default_wake_function+0x0/0x20
 ? ldlm_bl_thread_main+0x0/0x700 [ptlrpc]
 kthread+0xcf/0xe0
 ? kthread+0x0/0xe0
 ret_from_fork+0x58/0x90
 ? kthread+0x0/0xe0
This is followed by:
BUG: sleeping function called from invalid context at mm/slub.c:941
in_atomic(): 1, irqs_disabled(): 0, pid: 7526, name: ldlm_bl_02
CPU: 23 PID: 7526 Comm: ldlm_bl_02 Tainted: G OE ------------ 3.10.0-514.0.0.1chaos.ch6.x86_64 #1
Call Trace:
 dump_stack+0x19/0x1b
 __might_sleep+0xd9/0x100
 kmem_cache_alloc_trace+0x4a/0x250
 ? call_usermodehelper_setup+0x3f/0xa0
 call_usermodehelper_setup+0x3f/0xa0
 call_usermodehelper+0x31/0x60
 libcfs_run_upcall+0x9e/0x3b0 [libcfs]
 ? snprintf+0x49/0x70
 libcfs_run_lbug_upcall+0x7d/0x100 [libcfs]
 lbug_with_loc+0x57/0xc0 [libcfs]
 cl_env_percpu_get+0xc2/0xd0 [obdclass]
 ll_invalidatepage+0x41/0x170 [lustre]
 vvp_page_discard+0xbd/0x160 [lustre]
 cl_page_invoid+0x68/0x170 [obdclass]
 cl_page_discard+0x13/0x20 [obdclass]
 discard_cb+0x67/0x190 [osc]
 osc_page_gang_lookup+0x1e0/0x320 [osc]
 ? check_and_discard_cb+0x150/0x150 [osc]
 osc_lock_discard_pages+0x119/0x22d [osc]
 ? check_and_discard_cb+0x150/0x150 [osc]
 osc_lock_flush+0x89/0x280 [osc]
 osc_ldlm_blocking_ast+0x2e3/0x3a0 [osc]
 ldlm_cancel_callback+0x8a/0x2e0 [ptlrpc]
 ? dequeue_entity+0x11c/0x5d0
 ldlm_cli_cancel_local+0xa0/0x420 [ptlrpc]
 ldlm_cli_cancel+0xab/0x3d0 [ptlrpc]
 osc_ldlm_blocking_ast+0x17a/0x3a0 [osc]
 ? __schedule+0x3b8/0x9c0
 ldlm_handle_bl_callback+0xcf/0x410 [ptlrpc]
 ldlm_bl_thread_main+0x531/0x700 [ptlrpc]
 ? wake_up_state+0x20/0x20
 ? ldlm_handle_bl_callback+0x410/0x410 [ptlrpc]
 kthread+0xcf/0xe0
 ? kthread_create_on_node+0x140/0x140
 ret_from_fork+0x58/0x90
 ? kthread_create_on_node+0x140/0x140
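The two traces above share most of their frames: the second one appears to be raised from inside the LBUG handler itself (lbug_with_loc calling libcfs_run_lbug_upcall) rather than from a separate code path. A minimal sketch for checking that mechanically is below. It is not part of Lustre or the kernel; the helper names (parse_frames, reliable_symbols, callers_of) are hypothetical, and it assumes frames have the form symbol+0xOFFSET/0xSIZE optionally followed by [module], with a leading "?" marking entries the unwinder is not sure about. The trace strings are shortened excerpts of the two traces in this ticket.

```python
import re

# One frame: optional "? " (unreliable entry), symbol+0xOFFSET/0xSIZE,
# optionally followed by the owning module in square brackets.
FRAME_RE = re.compile(
    r"(?P<guess>\?\s+)?"
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(trace):
    """Return (symbol, module, reliable) tuples in trace order (innermost first)."""
    return [
        (m.group("symbol"), m.group("module"), m.group("guess") is None)
        for m in FRAME_RE.finditer(trace)
    ]


def reliable_symbols(trace):
    """Symbols of frames that are not prefixed with '?'."""
    return [symbol for symbol, _module, reliable in parse_frames(trace) if reliable]


def callers_of(trace, symbol):
    """Reliable frames listed after the first occurrence of `symbol`, i.e. its callers."""
    symbols = reliable_symbols(trace)
    return symbols[symbols.index(symbol) + 1:]


# Shortened excerpts of the two traces from this ticket (line breaks lost, as pasted).
LBUG_TRACE = (
    "Call Trace: libcfs_debug_dumpstack+0x53/0x80 [libcfs] "
    "lbug_with_loc+0x45/0xc0 [libcfs] cl_env_percpu_get+0xc2/0xd0 [obdclass] "
    "ll_invalidatepage+0x41/0x170 [lustre] ? discard_cb+0x0/0x190 [osc]"
)
SLEEP_TRACE = (
    "Call Trace: dump_stack+0x19/0x1b __might_sleep+0xd9/0x100 "
    "kmem_cache_alloc_trace+0x4a/0x250 ? call_usermodehelper_setup+0x3f/0xa0 "
    "call_usermodehelper_setup+0x3f/0xa0 call_usermodehelper+0x31/0x60 "
    "libcfs_run_upcall+0x9e/0x3b0 [libcfs] ? snprintf+0x49/0x70 "
    "libcfs_run_lbug_upcall+0x7d/0x100 [libcfs] lbug_with_loc+0x57/0xc0 [libcfs] "
    "cl_env_percpu_get+0xc2/0xd0 [obdclass] ll_invalidatepage+0x41/0x170 [lustre]"
)

# Print one frame per line for the second trace, flagging unreliable entries.
for symbol, module, reliable in parse_frames(SLEEP_TRACE):
    prefix = "  " if reliable else "? "
    suffix = f" [{module}]" if module else ""
    print(prefix + symbol + suffix)
```

Comparing callers_of(LBUG_TRACE, "lbug_with_loc") with callers_of(SLEEP_TRACE, "lbug_with_loc") shows whether both reports come from the same call chain; for the excerpts here the two lists are identical.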
I'm not certain whether there is a particular workload that triggers this. We've been running concurrent mdtest and ior jobs, using remote directories but not striped directories.
The frequency is high; running on 300 clients for about 2 hours triggered this bug in 1/3 of the nodes.
Issue Links
- duplicates LU-8509 drop_caches hangs in cl_inode_fini() (Resolved)
-
Jinshan,
Thanks for the explanation. We landed patch 24351 in our patch stack and the high-frequency LBUGs have stopped.
Is there any reason for us not to just revert the change from LU-8509 in its entirety? It looks to me like there were a few bits not reverted by your patch, but that they have no effect - things like initializing a variable that gets set before it is read anyway.
I looked at LU-4257. For the benefit of anyone else watching this ticket, LU-4257 has 5 changes that collectively alter several hundred lines of code, but produced very large performance improvements, particularly for small file IO.