  1. Lustre
  2. LU-4197

deadlock in recovery

Details

    • Bug
    • Resolution: Won't Fix
    • Blocker
    • None
    • Lustre 2.1.6
    • RHEL6
    • 3
    • 11384

    Description

      This bug was originally hit during Xyratex testing.

      BUG: spinlock lockup on CPU#3, tgt_recov/8159, ffff880099c1ca90 (Tainted: G        W  ----------------  )
      Pid: 8159, comm: tgt_recov Tainted: G        W  ----------------   2.6.32-131.17.1-lustre #0
      Call Trace:
       [<ffffffff8128c2da>] ? _raw_spin_lock+0x16a/0x180
       [<ffffffff81500ff6>] ? _spin_lock+0x56/0x70
       [<ffffffffa056668a>] ? class_export_recovery_cleanup+0x3a/0x230 [obdclass]
       [<ffffffffa03ea572>] ? cfs_hash_del+0xa2/0x1d0 [libcfs]
       [<ffffffffa056668a>] ? class_export_recovery_cleanup+0x3a/0x230 [obdclass]
       [<ffffffffa056875d>] ? class_disconnect+0x15d/0x3d0 [obdclass]
       [<ffffffffa06bfd17>] ? server_disconnect_export+0x37/0x1a0 [ptlrpc]
       [<ffffffffa0c9630f>] ? filter_disconnect+0xbf/0x380 [obdfilter]
       [<ffffffffa056db97>] ? class_disconnect_export_list+0x347/0x680 [obdclass]
       [<ffffffffa056e027>] ? class_disconnect_stale_exports+0x157/0x380 [obdclass]
       [<ffffffffa06bc180>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
       [<ffffffffa06bc490>] ? check_for_clients+0x0/0x80 [ptlrpc]
       [<ffffffffa06bf04b>] ? target_recovery_overseer+0x15b/0x2d0 [ptlrpc]
       [<ffffffffa06bc180>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
       [<ffffffff81091a80>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa06c4b90>] ? target_recovery_thread+0x460/0x15d0 [ptlrpc]
       [<ffffffff810563bd>] ? finish_task_switch+0x7d/0x110
       [<ffffffffa06c4730>] ? target_recovery_thread+0x0/0x15d0 [ptlrpc]
       [<ffffffff8100c2ca>] ? child_rip+0xa/0x20
       [<ffffffff81500d50>] ? _spin_unlock_irq+0x30/0x40
       [<ffffffff8100bc10>] ? restore_args+0x0/0x30
       [<ffffffffa06c4730>] ? target_recovery_thread+0x0/0x15d0 [ptlrpc]
       [<ffffffff8100c2c0>] ? child_rip+0x0/0x20
      

      Investigating the bug led to a commit that backports LU-1522.
      A quick look suggests the same bug also exists in the target_handle_connect() function on b2_1:

              cfs_spin_lock(&target->obd_recovery_task_lock);
              if (target->obd_recovering && !export->exp_in_recovery &&
                  !export->exp_disconnected) {
                      cfs_spin_lock(&export->exp_lock);
                      /* possible race with class_disconnect_stale_exports,
                       * export may be already in the eviction process */
                      if (export->exp_failed) {
                              cfs_spin_unlock(&export->exp_lock);
                              GOTO(out, rc = -ENODEV);
                      }
      

      So if we race with class_disconnect_stale_exports(), we take the error exit with obd_recovery_task_lock still held. The next thread that tries to take that lock spins forever, which kills recovery and eventually the whole node.
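
      The lock imbalance described above can be reproduced in user space. The following is only a minimal sketch, not the Lustre code: target_sketch, export_sketch, connect_leaky(), connect_balanced() and lock_is_free() are hypothetical names, and pthread mutexes stand in for the kernel spinlocks. It shows the error exit leaving the outer lock held, and one possible way to balance it by unlocking in reverse order on every path.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the Lustre structures: only the fields
 * touched by the quoted snippet are modeled, and pthread mutexes
 * replace the kernel spinlocks. */
struct target_sketch {
    pthread_mutex_t obd_recovery_task_lock;
    bool obd_recovering;
};

struct export_sketch {
    pthread_mutex_t exp_lock;
    bool exp_failed;
    bool exp_in_recovery;
    bool exp_disconnected;
};

void target_sketch_init(struct target_sketch *t, bool recovering)
{
    int rc = pthread_mutex_init(&t->obd_recovery_task_lock, NULL);
    assert(rc == 0);
    (void)rc;
    t->obd_recovering = recovering;
}

void export_sketch_init(struct export_sketch *e, bool failed)
{
    int rc = pthread_mutex_init(&e->exp_lock, NULL);
    assert(rc == 0);
    (void)rc;
    e->exp_failed = failed;
    e->exp_in_recovery = false;
    e->exp_disconnected = false;
}

/* Same shape as the quoted snippet: the early return models GOTO(out, ...)
 * and drops exp_lock but not obd_recovery_task_lock. */
int connect_leaky(struct target_sketch *t, struct export_sketch *e)
{
    pthread_mutex_lock(&t->obd_recovery_task_lock);
    if (t->obd_recovering && !e->exp_in_recovery && !e->exp_disconnected) {
        pthread_mutex_lock(&e->exp_lock);
        if (e->exp_failed) {
            pthread_mutex_unlock(&e->exp_lock);
            return -ENODEV; /* outer lock is still held here */
        }
        e->exp_in_recovery = true;
        pthread_mutex_unlock(&e->exp_lock);
    }
    pthread_mutex_unlock(&t->obd_recovery_task_lock);
    return 0;
}

/* One possible balanced variant: record the error and fall through so
 * both locks are released in reverse order on every path. */
int connect_balanced(struct target_sketch *t, struct export_sketch *e)
{
    int rc = 0;

    pthread_mutex_lock(&t->obd_recovery_task_lock);
    if (t->obd_recovering && !e->exp_in_recovery && !e->exp_disconnected) {
        pthread_mutex_lock(&e->exp_lock);
        if (e->exp_failed)
            rc = -ENODEV;
        else
            e->exp_in_recovery = true;
        pthread_mutex_unlock(&e->exp_lock);
    }
    pthread_mutex_unlock(&t->obd_recovery_task_lock);
    return rc;
}

/* Probe helper: true if nobody currently holds the mutex. */
bool lock_is_free(pthread_mutex_t *m)
{
    if (pthread_mutex_trylock(m) != 0)
        return false;
    pthread_mutex_unlock(m);
    return true;
}
```

      With an export already marked exp_failed, both variants return -ENODEV, but only connect_leaky() leaves obd_recovery_task_lock held afterwards. With a real spinlock, the next locker then spins until the lockup detector fires, as in the trace above.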

          People

            Assignee: dmiter Dmitry Eremin (Inactive)
            Reporter: shadow Alexey Lyashkov