  Lustre / LU-1522

ASSERTION(cfs_atomic_read(&obd->obd_req_replay_clients) == 0) failed


    Description

      Our Lustre server has been crashing multiple times a day. This is one of the failures:

      <3>LustreError: 7606:0:(ldlm_lib.c:1259:abort_lock_replay_queue()) @@@ aborted: req@ffff8806f8a9a000 x1404476495656183/t0(0) o-1->da67355c-78b9-3337-cb94-359b564bc4aa@NET_0x500000a972885_UUID:0/0 lens 296/0 e 26 to 0 dl 1339654050 ref 1 fl Complete:/ffffffff/ffffffff rc 0/-1
      <3>LustreError: 7606:0:(ldlm_lib.c:1259:abort_lock_replay_queue()) @@@ aborted: req@ffff8806f3996000 x1404476930188631/t0(0) o-1->736da151-8a99-44ed-0646-bb0e3daa974e@NET_0x500000a970f63_UUID:0/0 lens 296/0 e 26 to 0 dl 1339654056 ref 1 fl Complete:/ffffffff/ffffffff rc 0/-1
      <3>LustreError: 7606:0:(ldlm_lib.c:1259:abort_lock_replay_queue()) Skipped 147 previous similar messages
      <4>Lustre: 7606:0:(ldlm_lib.c:1562:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      <0>LustreError: 7606:0:(ldlm_lib.c:1612:target_next_replay_req()) ASSERTION(cfs_atomic_read(&obd->obd_req_replay_clients) == 0) failed
      <0>LustreError: 7606:0:(ldlm_lib.c:1612:target_next_replay_req()) LBUG
      <4>Pid: 7606, comm: tgt_recov
      <4>
      <4>Call Trace:
      <4> [<ffffffffa0578855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa0578e95>] lbug_with_loc+0x75/0xe0 [libcfs]
      <4> [<ffffffffa0583da6>] libcfs_assertion_failed+0x66/0x70 [libcfs]
      <4> [<ffffffffa0732d53>] target_recovery_thread+0xed3/0xf50 [ptlrpc]
      <4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
      <4> [<ffffffff8100c14a>] child_rip+0xa/0x20
      <4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
      <4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
      <4> [<ffffffff8100c140>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 7606, comm: tgt_recov Not tainted 2.6.32-220.4.1.el6.20120130.x86_64.lustre211 #1
      <4>Call Trace:
      <4> [<ffffffff81520c76>] ? panic+0x78/0x164
      <4> [<ffffffffa0578eeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
      <4> [<ffffffffa0583da6>] ? libcfs_assertion_failed+0x66/0x70 [libcfs]
      <4> [<ffffffffa0732d53>] ? target_recovery_thread+0xed3/0xf50 [ptlrpc]
      <4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
      <4> [<ffffffff8100c14a>] ? child_rip+0xa/0x20
      <4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
      <4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
      <4> [<ffffffff8100c140>] ? child_rip+0x0/0x20
      [22]kdb>

      Here is the line that LBUG'ed, in target_next_replay_req():

          LASSERT(cfs_atomic_read(&obd->obd_req_replay_clients) == 0);

      The full function:

      static struct ptlrpc_request *target_next_replay_req(struct obd_device *obd)
      {
              struct ptlrpc_request *req = NULL;
              ENTRY;

              CDEBUG(D_HA, "Waiting for transno "LPD64"\n",
                     obd->obd_next_recovery_transno);

              if (target_recovery_overseer(obd, check_for_next_transno,
                                           exp_req_replay_healthy)) {
                      abort_req_replay_queue(obd);
                      abort_lock_replay_queue(obd);
              }

              cfs_spin_lock(&obd->obd_recovery_task_lock);
              if (!cfs_list_empty(&obd->obd_req_replay_queue)) {
                      req = cfs_list_entry(obd->obd_req_replay_queue.next,
                                           struct ptlrpc_request, rq_list);
                      cfs_list_del_init(&req->rq_list);
                      obd->obd_requests_queued_for_recovery--;
                      cfs_spin_unlock(&obd->obd_recovery_task_lock);
              } else {
                      cfs_spin_unlock(&obd->obd_recovery_task_lock);
                      LASSERT(cfs_list_empty(&obd->obd_req_replay_queue));
                      LASSERT(cfs_atomic_read(&obd->obd_req_replay_clients) == 0); /* <======= */
                      /** evict exports failed VBR */
                      class_disconnect_stale_exports(obd, exp_vbr_healthy);
              }

              RETURN(req);
      }
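      For reference, below is a minimal, self-contained sketch of the invariant this LASSERT encodes. It is a simplified model, not Lustre code, and every name in it is made up for illustration: once the request-replay queue is empty, every client still counted as a replay client should already have been aborted or evicted, so the counter must be zero. If aborting recovery empties the queue but a stale export is not evicted, the counter stays non-zero and the assertion fires, which is consistent with the "recovery is aborted, evict exports in recovery" message above.

      #include <assert.h>
      #include <stdio.h>

      struct recovery_state {
              int req_replay_clients;   /* clients still expected to replay requests */
              int queued_requests;      /* requests waiting in the replay queue */
      };

      /* A client joins recovery: it is counted and queues some requests. */
      static void client_joins_replay(struct recovery_state *rs, int nreqs)
      {
              rs->req_replay_clients++;
              rs->queued_requests += nreqs;
      }

      /* Aborting recovery drops all queued requests (analogue of
       * abort_req_replay_queue()); each stale client must also be evicted so
       * the counter returns to zero.  If eviction fails, it stays non-zero. */
      static void abort_recovery(struct recovery_state *rs, int eviction_succeeded)
      {
              rs->queued_requests = 0;
              if (eviction_succeeded)
                      rs->req_replay_clients = 0;
      }

      /* Analogue of target_next_replay_req(): hand out a queued request, or,
       * if the queue is empty, assert that no replay clients remain. */
      static void next_replay_req(struct recovery_state *rs)
      {
              if (rs->queued_requests > 0) {
                      rs->queued_requests--;
                      return;
              }
              assert(rs->req_replay_clients == 0);   /* the invariant that LBUG'ed */
      }

      int main(void)
      {
              struct recovery_state rs = { 0, 0 };

              client_joins_replay(&rs, 2);
              /* Recovery is aborted but the stale export is not evicted ... */
              abort_recovery(&rs, 0);
              /* ... so the queue is empty while a client is still counted:
               * the assert in next_replay_req() fires. */
              next_replay_req(&rs);
              printf("not reached\n");
              return 0;
      }

      Compiling and running this sketch trips the assert() in next_replay_req(), mirroring the reported path where the replay and lock-replay queues were aborted while obd_req_replay_clients was still non-zero.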

            People

              Assignee: Mikhail Pershin (tappro)
              Reporter: Jay Lan (Inactive) (jaylan)