  1. Lustre
  2. LU-5994

DT transaction start and object lock ordering

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 16718

    Description

      lfsck_layout_slave_conditional_destroy() and lfsck_layout_slave_repair_pfid() both call dt_trans_start_local() while holding a DT write lock on an object. This was found by code inspection.

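      I also ran with the following debugging patch, which asserts in osd_trans_start() that the calling thread holds no DT object read or write locks when a transaction starts: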

      diff --git a/lustre/osd-ldiskfs/osd_handler.c b/lustre/osd-ldiskfs/osd_handler.c
      index 80d1e67..ed15b0c 100644
      --- a/lustre/osd-ldiskfs/osd_handler.c
      +++ b/lustre/osd-ldiskfs/osd_handler.c
      @@ -1058,6 +1058,8 @@ int osd_trans_start(const struct lu_env *env, struct dt_device *d,
               if (!IS_ERR(jh)) {
                       oh->ot_handle = jh;
                       LASSERT(oti->oti_txns == 0);
      +               LASSERT(oti->oti_w_locks == 0);
      +               LASSERT(oti->oti_r_locks == 0);
                       lu_context_init(&th->th_ctx, th->th_tags);
                       lu_context_enter(&th->th_ctx);
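
The check added by the patch above can be modeled in user space. The following is a minimal sketch, not Lustre code: `struct thread_info` and all function names are hypothetical stand-ins for the per-thread counters (`oti_w_locks`, `oti_r_locks`, `oti_txns`) and for the DT lock and transaction calls. Where the patch uses LASSERT, the model returns `false` so that both orderings can be exercised without crashing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-thread state, modeled on the counters the patch checks. */
struct thread_info {
    int w_locks; /* write locks currently held by this thread */
    int r_locks; /* read locks currently held by this thread */
    int txns;    /* transactions currently open on this thread */
};

void write_lock(struct thread_info *ti)
{
    ti->w_locks++;
}

void write_unlock(struct thread_info *ti)
{
    assert(ti->w_locks > 0);
    ti->w_locks--;
}

/*
 * Model of the patched transaction start: refuse to start if a
 * transaction is already open or any object lock is held.
 * The real patch asserts instead of returning false.
 */
bool trans_start(struct thread_info *ti)
{
    if (ti->txns != 0 || ti->w_locks != 0 || ti->r_locks != 0)
        return false;
    ti->txns++;
    return true;
}

void trans_stop(struct thread_info *ti)
{
    assert(ti->txns > 0);
    ti->txns--;
}

/* The reported order: take the write lock, then start the transaction. */
bool lock_then_start(struct thread_info *ti)
{
    bool ok;

    write_lock(ti);
    ok = trans_start(ti); /* fails the check: w_locks == 1 */
    if (ok)
        trans_stop(ti);
    write_unlock(ti);
    return ok;
}

/* The order the assertions imply: start the transaction, then lock. */
bool start_then_lock(struct thread_info *ti)
{
    if (!trans_start(ti))
        return false;
    write_lock(ti);
    /* ... modify the object under the lock ... */
    write_unlock(ti);
    trans_stop(ti);
    return true;
}
```

In this model, `lock_then_start()` is rejected by the check while `start_then_lock()` passes, and both leave the counters at zero.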
      

      Running sanity-lfsck.sh, I was able to trigger a crash from lfsck_layout_slave_repair_pfid():

      [ 1827.711965] Lustre: DEBUG MARKER: == sanity-lfsck test 19b: OST-object inconsistency self repair == 11:01:32 (1417798892)
      [ 1827.813194] Lustre: DEBUG MARKER: cancel_lru_locks osc start
      [ 1827.838214] Lustre: *** cfs_fail_loc=1611, val=0***
      [ 1827.839821] Lustre: Skipped 3 previous similar messages
      [ 1827.877070] Lustre: DEBUG MARKER: cancel_lru_locks osc stop
      [ 1827.957214] LustreError: 21862:0:(osd_handler.c:1061:osd_trans_start()) ASSERTION( oti->oti_w_locks == 0 ) failed:
      [ 1827.960100] LustreError: 21862:0:(osd_handler.c:1061:osd_trans_start()) LBUG
      [ 1827.961698] Pid: 21862, comm: inconsistency_v
      [ 1827.962700]
      [ 1827.962702] Call Trace:
      [ 1827.963692]  [<ffffffffa052f8c5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [ 1827.965393]  [<ffffffffa052fec7>] lbug_with_loc+0x47/0xb0 [libcfs]
      [ 1827.966913]  [<ffffffffa0d98809>] osd_trans_start+0x389/0x730 [osd_ldiskfs]
      [ 1827.968553]  [<ffffffffa0c44ed2>] lfsck_layout_slave_in_notify+0x982/0xcf0 [lfsck]
      [ 1827.969963]  [<ffffffffa0c0cd37>] lfsck_in_notify+0xf7/0x5a0 [lfsck]
      [ 1827.971155]  [<ffffffffa0fe6037>] ofd_inconsistency_verification_main+0x367/0xdb0 [ofd]
      [ 1827.972639]  [<ffffffff8105e59d>] ? finish_task_switch+0x7d/0x110
      [ 1827.973774]  [<ffffffff8105e568>] ? finish_task_switch+0x48/0x110
      [ 1827.974896]  [<ffffffff81061d90>] ? default_wake_function+0x0/0x20
      [ 1827.976039]  [<ffffffffa0fe5cd0>] ? ofd_inconsistency_verification_main+0x0/0xdb0 [ofd]
      [ 1827.977539]  [<ffffffff8109e856>] kthread+0x96/0xa0
      [ 1827.978448]  [<ffffffff8100c30a>] child_rip+0xa/0x20
      [ 1827.979368]  [<ffffffff815562e0>] ? _spin_unlock_irq+0x30/0x40
      [ 1827.980507]  [<ffffffff8100bb10>] ? restore_args+0x0/0x30
      [ 1827.981704]  [<ffffffff8109e7c0>] ? kthread+0x0/0xa0
      [ 1827.982787]  [<ffffffff8100c300>] ? child_rip+0x0/0x20
      [ 1827.983912]
      

      OFD also does this, in enough places that I added

       || strstr(current->comm, "ll_ost") != NULL

      to each of the assertions above so that they do not fire in OST service threads. Should OFD be following the same order?
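
The relaxed condition can be sketched as a standalone predicate. This is a user-space illustration only: `lock_check_passes()` is a hypothetical helper, `comm` stands in for `current->comm`, and the thread names in the usage note below are examples rather than values taken from this report (apart from `inconsistency_v`, which appears in the log above).

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical model of the relaxed assertion: the check passes when
 * no locks are held, or when the thread name contains "ll_ost".
 */
bool lock_check_passes(int held_locks, const char *comm)
{
    return held_locks == 0 || strstr(comm, "ll_ost") != NULL;
}
```

With this predicate, a thread named `inconsistency_v` that holds one write lock still fails the check, while a thread whose name contains `ll_ost` (for example `ll_ost00_001`) passes regardless of the locks it holds.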

            People

              Assignee: nasf (yong.fan) (Inactive)
              Reporter: John Hammond (jhammond)