LU-442: Client LBUG - (osc_request.c:3087:osc_set_lock_data_with_check()) ASSERTION(lock->l_ast_data == NULL || lock->l_ast_data == data) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.1.0
    • Affects Version/s: Lustre 2.0.0
    • None
    • 3
    • 4973

    Description

      This LBUG was triggered on a node running a parallel application in which at least two tasks were reading the same Lustre file.
      Crash-dump analysis shows that:

      _ the assertion failed because (lock->l_ast_data != data) in osc_set_lock_data_with_check(), even though the same condition had just been checked and found OK in osc_enqueue_base().
      _ two tasks, including the panicking one, are working on the same ldlm_lock struct; a reference to it is found in each of their stacks.
      _ the task that triggered the LBUG has the following stack:

      PID: 24833 TASK: ffff88086c05f340 CPU: 8 COMMAND: "gonel_Bordelman"
      #0 [ffff88086c2eb580] machine_kexec at ffffffff8102e77b
      #1 [ffff88086c2eb5e0] crash_kexec at ffffffff810a6c78
      #2 [ffff88086c2eb6b0] panic at ffffffff81466791
      #3 [ffff88086c2eb730] lbug_with_loc at ffffffffa03dceeb
      #4 [ffff88086c2eb780] libcfs_assertion_failed at ffffffffa03e8826
      #5 [ffff88086c2eb7d0] osc_set_lock_data_with_check at ffffffffa0719102
      #6 [ffff88086c2eb800] osc_enqueue_base at ffffffffa071958e
      #7 [ffff88086c2eb8b0] osc_lock_enqueue at ffffffffa0734e8f
      #8 [ffff88086c2eb960] cl_enqueue_try at ffffffffa04ab1bb
      #9 [ffff88086c2eb9e0] lov_lock_enqueue at ffffffffa0790005
      #10 [ffff88086c2ebab0] cl_enqueue_try at ffffffffa04ab1bb
      #11 [ffff88086c2ebb30] cl_enqueue_locked at ffffffffa04ab4af
      #12 [ffff88086c2ebba0] cl_lock_request at ffffffffa04ac0be
      #13 [ffff88086c2ebc30] cl_io_lock at ffffffffa04b143a
      #14 [ffff88086c2ebcc0] cl_io_loop at ffffffffa04b176a
      #15 [ffff88086c2ebd30] ll_file_io_generic at ffffffffa07d1e82
      #16 [ffff88086c2ebdd0] ll_file_aio_read at ffffffffa07d213c
      #17 [ffff88086c2ebe60] ll_file_read at ffffffffa07d8721
      #18 [ffff88086c2ebef0] vfs_read at ffffffff81158a05
      #19 [ffff88086c2ebf30] sys_read at ffffffff81158b41
      #20 [ffff88086c2ebf80] system_call_fastpath at ffffffff8100c172

      _ the second task has nearly the same stack, but is spinning in lock_res_and_lock(), called from osc_lock_upcall():

      PID: 24834 TASK: ffff88086c05eb20 CPU: 24 COMMAND: "gonel_Bordelman"
      --- <NMI exception stack> ---
      #6 [ffff8807d188d6c8] _spin_lock at ffffffff81469618
      #7 [ffff8807d188d6d0] lock_res_and_lock at ffffffffa053f128
      #8 [ffff8807d188d700] __ldlm_handle2lock at ffffffffa0541fc8
      #9 [ffff8807d188d760] osc_lock_upcall at ffffffffa0734996
      #10 [ffff8807d188d800] osc_enqueue_base at ffffffffa0719597
      #11 [ffff8807d188d8b0] osc_lock_enqueue at ffffffffa0734e8f
      #12 [ffff8807d188d960] cl_enqueue_try at ffffffffa04ab1bb
      #13 [ffff8807d188d9e0] lov_lock_enqueue at ffffffffa0790005
      #14 [ffff8807d188dab0] cl_enqueue_try at ffffffffa04ab1bb
      #15 [ffff8807d188db30] cl_enqueue_locked at ffffffffa04ab4af
      #16 [ffff8807d188dba0] cl_lock_request at ffffffffa04ac0be
      #17 [ffff8807d188dc30] cl_io_lock at ffffffffa04b143a
      #18 [ffff8807d188dcc0] cl_io_loop at ffffffffa04b176a
      #19 [ffff8807d188dd30] ll_file_io_generic at ffffffffa07d1e82
      #20 [ffff8807d188ddd0] ll_file_aio_read at ffffffffa07d213c
      #21 [ffff8807d188de60] ll_file_read at ffffffffa07d8721
      #22 [ffff8807d188def0] vfs_read at ffffffff81158a05
      #23 [ffff8807d188df30] sys_read at ffffffff81158b41
      #24 [ffff8807d188df80] system_call_fastpath at ffffffff8100c172

      _ this indicates that the two tasks were executing the same code path in parallel. Both may have matched the same ldlm_lock struct while (l_ast_data == NULL), and both then called osc_set_lock_data_with_check() to set l_ast_data. That function takes the l_lock/osc_ast_guard spin-locks only at that late point, after the unlocked check in osc_enqueue_base(), and it asserts the same (l_ast_data == NULL || l_ast_data == data) condition, so whichever task gets there second is bound to LBUG.
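
      The suspected interleaving can be replayed deterministically in user space. The sketch below is only a toy model, not Lustre code: struct toy_lock and every toy_* function are hypothetical stand-ins, the two tasks are assumed to carry different data pointers, and the model returns false at the point where the real code would LBUG.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ldlm_lock; only the field at issue. */
struct toy_lock {
    void *l_ast_data;
};

/* Models the unlocked check described for osc_enqueue_base(). */
static bool toy_match_ok(const struct toy_lock *lock, const void *data)
{
    assert(lock != NULL);
    return lock->l_ast_data == NULL || lock->l_ast_data == data;
}

/*
 * Models osc_set_lock_data_with_check(): re-checks the same condition,
 * then sets l_ast_data. Returns false where the real code would assert.
 */
static bool toy_set_data_with_check(struct toy_lock *lock, void *data)
{
    assert(lock != NULL);
    if (!(lock->l_ast_data == NULL || lock->l_ast_data == data))
        return false;
    lock->l_ast_data = data;
    return true;
}

/*
 * Replays the interleaving: both tasks pass the check before either one
 * sets l_ast_data. Returns how many tasks would hit the assertion.
 */
static int toy_replay_race(void)
{
    struct toy_lock lock = { .l_ast_data = NULL };
    int data_a, data_b;          /* only their distinct addresses matter */
    int failures = 0;

    bool a_ok = toy_match_ok(&lock, &data_a);   /* task A: check passes */
    bool b_ok = toy_match_ok(&lock, &data_b);   /* task B: check passes too */

    if (a_ok && !toy_set_data_with_check(&lock, &data_a))
        failures++;                             /* task A sets its data */
    if (b_ok && !toy_set_data_with_check(&lock, &data_b))
        failures++;                             /* task B finds A's data */

    return failures;
}
```

      In this model the task that sets its data second always fails the re-check, which matches the single panicking task seen in the crash dump.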

      _ so it seems that the l_lock/osc_ast_guard spin-lock protection has to be taken at the osc_enqueue_base() level, around the "if (matched->l_ast_data == NULL || matched->l_ast_data == einfo->ei_cbdata)" statement, instead of inside osc_set_lock_data_with_check().
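
      As a rough user-space illustration of that proposal, the sketch below does the check and the assignment inside one critical section. It is not the actual fix: a pthread mutex stands in for the l_lock/osc_ast_guard spin-locks, and struct guarded_lock, struct guarded_task and the guarded_* functions are hypothetical names invented for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ldlm_lock plus its guard. */
struct guarded_lock {
    pthread_mutex_t guard;   /* plays the role of l_lock/osc_ast_guard */
    void *l_ast_data;
};

/* Check and set as one atomic step; returns true if data now owns the lock. */
static bool guarded_match_and_set(struct guarded_lock *lock, void *data)
{
    bool ok;

    pthread_mutex_lock(&lock->guard);
    ok = lock->l_ast_data == NULL || lock->l_ast_data == data;
    if (ok)
        lock->l_ast_data = data;
    pthread_mutex_unlock(&lock->guard);
    return ok;
}

struct guarded_task {
    struct guarded_lock *lock;
    int cookie;              /* its address serves as this task's data */
    bool matched;
};

static void *guarded_task_main(void *arg)
{
    struct guarded_task *t = arg;

    t->matched = guarded_match_and_set(t->lock, &t->cookie);
    return NULL;
}

/* Runs two competing tasks; returns how many of them matched the lock. */
static int guarded_run_two_tasks(void)
{
    struct guarded_lock lock = { .l_ast_data = NULL };
    struct guarded_task a = { &lock, 0, false };
    struct guarded_task b = { &lock, 0, false };
    pthread_t ta, tb;
    int rc;

    rc = pthread_mutex_init(&lock.guard, NULL);
    assert(rc == 0);
    rc = pthread_create(&ta, NULL, guarded_task_main, &a);
    assert(rc == 0);
    rc = pthread_create(&tb, NULL, guarded_task_main, &b);
    assert(rc == 0);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    pthread_mutex_destroy(&lock.guard);

    return (int)a.matched + (int)b.matched;
}
```

      Whichever thread runs first, exactly one task matches. The other sees a failed match rather than a failed assertion; what it should do next is outside the scope of this sketch.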


          People

            niu Niu Yawei (Inactive)
            louveta Alexandre Louvet (Inactive)