Details
Bug - Resolution: Fixed - Blocker - Lustre 2.0.0 - None - 3 - 4973
Description
This LBUG has been triggered on a node running a parallel application where at least 2 tasks were reading the same Lustre file.
Crash-dump analysis shows that:
_ the assertion failed (LBUG) in osc_set_lock_data_with_check() because (lock->l_ast_data != data), although the same condition had just been checked and found OK in osc_enqueue_base().
_ 2 tasks, including the panicking one, are working with the same ldlm_lock struct; a reference to it is found in each of their stacks.
_ the task that triggered the LBUG has the following stack:
PID: 24833 TASK: ffff88086c05f340 CPU: 8 COMMAND: "gonel_Bordelman"
#0 [ffff88086c2eb580] machine_kexec at ffffffff8102e77b
#1 [ffff88086c2eb5e0] crash_kexec at ffffffff810a6c78
#2 [ffff88086c2eb6b0] panic at ffffffff81466791
#3 [ffff88086c2eb730] lbug_with_loc at ffffffffa03dceeb
#4 [ffff88086c2eb780] libcfs_assertion_failed at ffffffffa03e8826
#5 [ffff88086c2eb7d0] osc_set_lock_data_with_check at ffffffffa0719102
#6 [ffff88086c2eb800] osc_enqueue_base at ffffffffa071958e
#7 [ffff88086c2eb8b0] osc_lock_enqueue at ffffffffa0734e8f
#8 [ffff88086c2eb960] cl_enqueue_try at ffffffffa04ab1bb
#9 [ffff88086c2eb9e0] lov_lock_enqueue at ffffffffa0790005
#10 [ffff88086c2ebab0] cl_enqueue_try at ffffffffa04ab1bb
#11 [ffff88086c2ebb30] cl_enqueue_locked at ffffffffa04ab4af
#12 [ffff88086c2ebba0] cl_lock_request at ffffffffa04ac0be
#13 [ffff88086c2ebc30] cl_io_lock at ffffffffa04b143a
#14 [ffff88086c2ebcc0] cl_io_loop at ffffffffa04b176a
#15 [ffff88086c2ebd30] ll_file_io_generic at ffffffffa07d1e82
#16 [ffff88086c2ebdd0] ll_file_aio_read at ffffffffa07d213c
#17 [ffff88086c2ebe60] ll_file_read at ffffffffa07d8721
#18 [ffff88086c2ebef0] vfs_read at ffffffff81158a05
#19 [ffff88086c2ebf30] sys_read at ffffffff81158b41
#20 [ffff88086c2ebf80] system_call_fastpath at ffffffff8100c172
_ the 2nd task has nearly the same stack:
PID: 24834 TASK: ffff88086c05eb20 CPU: 24 COMMAND: "gonel_Bordelman"
--- <NMI exception stack> ---
#6 [ffff8807d188d6c8] _spin_lock at ffffffff81469618
#7 [ffff8807d188d6d0] lock_res_and_lock at ffffffffa053f128
#8 [ffff8807d188d700] __ldlm_handle2lock at ffffffffa0541fc8
#9 [ffff8807d188d760] osc_lock_upcall at ffffffffa0734996
#10 [ffff8807d188d800] osc_enqueue_base at ffffffffa0719597
#11 [ffff8807d188d8b0] osc_lock_enqueue at ffffffffa0734e8f
#12 [ffff8807d188d960] cl_enqueue_try at ffffffffa04ab1bb
#13 [ffff8807d188d9e0] lov_lock_enqueue at ffffffffa0790005
#14 [ffff8807d188dab0] cl_enqueue_try at ffffffffa04ab1bb
#15 [ffff8807d188db30] cl_enqueue_locked at ffffffffa04ab4af
#16 [ffff8807d188dba0] cl_lock_request at ffffffffa04ac0be
#17 [ffff8807d188dc30] cl_io_lock at ffffffffa04b143a
#18 [ffff8807d188dcc0] cl_io_loop at ffffffffa04b176a
#19 [ffff8807d188dd30] ll_file_io_generic at ffffffffa07d1e82
#20 [ffff8807d188ddd0] ll_file_aio_read at ffffffffa07d213c
#21 [ffff8807d188de60] ll_file_read at ffffffffa07d8721
#22 [ffff8807d188def0] vfs_read at ffffffff81158a05
#23 [ffff8807d188df30] sys_read at ffffffff81158b41
#24 [ffff8807d188df80] system_call_fastpath at ffffffff8100c172
_ this indicates that the 2 tasks are executing the same code path in parallel: both may have found/selected the same ldlm_lock struct with (l_ast_data == NULL), then called osc_set_lock_data_with_check() to set l_ast_data (under the "late" protection of the l_lock/osc_ast_guard spin-locks !). Since the (l_ast_data == data) condition is asserted again there, the 2nd task to arrive finds the value set by the 1st and hits the LBUG.
_ so it seems that the l_lock/osc_ast_guard spin-lock protection has to be taken at the osc_enqueue_base() level, around the "if (matched->l_ast_data == NULL || matched->l_ast_data == einfo->ei_cbdata)" statement, instead of in osc_set_lock_data_with_check().
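The proposed change amounts to doing the check and the assignment in a single critical section. The following is only a minimal user-space sketch of that idea, not Lustre code: struct demo_lock, demo_match_and_set() and demo_race() are hypothetical names, a pthread mutex stands in for the l_lock/osc_ast_guard spin-locks, and ast_data stands in for l_ast_data. What the losing caller should do next (for example, enqueue a new lock) is outside this sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ldlm_lock; only what the sketch needs. */
struct demo_lock {
    pthread_mutex_t guard;  /* plays the role of l_lock/osc_ast_guard */
    void *ast_data;         /* plays the role of l_ast_data */
};

/*
 * Check and set under the same guard, so no other task can slip in
 * between the "is it NULL or mine?" test and the assignment.
 * Returns true if the caller may use the lock.
 */
static bool demo_match_and_set(struct demo_lock *lock, void *data)
{
    bool usable;

    pthread_mutex_lock(&lock->guard);
    if (lock->ast_data == NULL)
        lock->ast_data = data;          /* first caller claims the lock */
    usable = (lock->ast_data == data);  /* later callers only re-check */
    if (usable)
        assert(lock->ast_data == data); /* holds by construction here */
    pthread_mutex_unlock(&lock->guard);

    return usable;
}

/* One simulated reader task with its own callback data. */
struct demo_task {
    struct demo_lock *lock;
    void *data;
    bool won;
};

static void *demo_task_run(void *arg)
{
    struct demo_task *task = arg;

    task->won = demo_match_and_set(task->lock, task->data);
    return NULL;
}

/* Two tasks with different data race for one lock; returns how many won. */
static int demo_race(void)
{
    struct demo_lock lock;
    int data_a = 0, data_b = 0;  /* only their addresses matter */
    struct demo_task t1 = { &lock, &data_a, false };
    struct demo_task t2 = { &lock, &data_b, false };
    pthread_t th1, th2;

    pthread_mutex_init(&lock.guard, NULL);
    lock.ast_data = NULL;

    pthread_create(&th1, NULL, demo_task_run, &t1);
    pthread_create(&th2, NULL, demo_task_run, &t2);
    pthread_join(th1, NULL);
    pthread_join(th2, NULL);
    pthread_mutex_destroy(&lock.guard);

    return (int)t1.won + (int)t2.won;
}
```

Calling demo_race() from a test program (built with -pthread) should report exactly one winner regardless of scheduling: the second task sees a non-NULL ast_data that is not its own and gets a clean "no match" result instead of tripping an assertion.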