
[LU-1522] ASSERTION(cfs_atomic_read(&obd->obd_req_replay_clients) == 0) failed


    Description

Our Lustre server crashed multiple times a day. This is one of the failures:

      <3>LustreError: 7606:0:(ldlm_lib.c:1259:abort_lock_replay_queue()) @@@ aborted: req@ffff8806f8a9a000 x1404476495656183/t0(0) o-1->da67355c-78b9-3337-cb94-359b564bc4aa@NET_0x500000a972885_UUID:0/0 lens 296/0 e 26 to 0 dl 1339654050 ref 1 fl Complete:/ffffffff/ffffffff rc 0/-1
      <3>LustreError: 7606:0:(ldlm_lib.c:1259:abort_lock_replay_queue()) @@@ aborted: req@ffff8806f3996000 x1404476930188631/t0(0) o-1->736da151-8a99-44ed-0646-bb0e3daa974e@NET_0x500000a970f63_UUID:0/0 lens 296/0 e 26 to 0 dl 1339654056 ref 1 fl Complete:/ffffffff/ffffffff rc 0/-1
      <3>LustreError: 7606:0:(ldlm_lib.c:1259:abort_lock_replay_queue()) Skipped 147 previous similar messages
      <4>Lustre: 7606:0:(ldlm_lib.c:1562:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      <0>LustreError: 7606:0:(ldlm_lib.c:1612:target_next_replay_req()) ASSERTION(cfs_atomic_read(&obd->obd_req_replay_clients) == 0) failed
      <0>LustreError: 7606:0:(ldlm_lib.c:1612:target_next_replay_req()) LBUG
      <4>Pid: 7606, comm: tgt_recov
      <4>
      <4>Call Trace:
      <4> [<ffffffffa0578855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa0578e95>] lbug_with_loc+0x75/0xe0 [libcfs]
      <4> [<ffffffffa0583da6>] libcfs_assertion_failed+0x66/0x70 [libcfs]
      <4> [<ffffffffa0732d53>] target_recovery_thread+0xed3/0xf50 [ptlrpc]
      <4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
      <4> [<ffffffff8100c14a>] child_rip+0xa/0x20
      <4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
      <4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
      <4> [<ffffffff8100c140>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 7606, comm: tgt_recov Not tainted 2.6.32-220.4.1.el6.20120130.x86_64.lustre211 #1
      <4>Call Trace:
      <4> [<ffffffff81520c76>] ? panic+0x78/0x164
      <4> [<ffffffffa0578eeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
      <4> [<ffffffffa0583da6>] ? libcfs_assertion_failed+0x66/0x70 [libcfs]
      <4> [<ffffffffa0732d53>] ? target_recovery_thread+0xed3/0xf50 [ptlrpc]
      <4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
      <4> [<ffffffff8100c14a>] ? child_rip+0xa/0x20
      <4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
      <4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
      <4> [<ffffffff8100c140>] ? child_rip+0x0/0x20
      [22]kdb>

Here is the line that LBUG'ed, in target_next_replay_req():

LASSERT(cfs_atomic_read(&obd->obd_req_replay_clients) == 0);

static struct ptlrpc_request *target_next_replay_req(struct obd_device *obd)
{
        struct ptlrpc_request *req = NULL;
        ENTRY;

        CDEBUG(D_HA, "Waiting for transno "LPD64"\n",
               obd->obd_next_recovery_transno);

        if (target_recovery_overseer(obd, check_for_next_transno,
                                     exp_req_replay_healthy)) {
                abort_req_replay_queue(obd);
                abort_lock_replay_queue(obd);
        }

        cfs_spin_lock(&obd->obd_recovery_task_lock);
        if (!cfs_list_empty(&obd->obd_req_replay_queue)) {
                req = cfs_list_entry(obd->obd_req_replay_queue.next,
                                     struct ptlrpc_request, rq_list);
                cfs_list_del_init(&req->rq_list);
                obd->obd_requests_queued_for_recovery--;
                cfs_spin_unlock(&obd->obd_recovery_task_lock);
        } else {
                cfs_spin_unlock(&obd->obd_recovery_task_lock);
                LASSERT(cfs_list_empty(&obd->obd_req_replay_queue));
                LASSERT(cfs_atomic_read(&obd->obd_req_replay_clients) == 0); <=======
                /** evict exports failed VBR */
                class_disconnect_stale_exports(obd, exp_vbr_healthy);
        }

        RETURN(req);
}
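For readers unfamiliar with the recovery counters, here is a minimal, self-contained model of the invariant the failed LASSERT encodes (hypothetical names and plain C11 atomics; not the actual Lustre code). obd_req_replay_clients counts clients still expected to replay requests, and the assertion expects it to have reached zero by the time the replay queue is empty. If an aborted recovery drains the queues without the per-client counter ever being decremented for clients that never replayed, which is one plausible reading of this crash, the assertion fires:

/*
 * Minimal model of the failed invariant (hypothetical names, C11 atomics;
 * not the actual Lustre code). Compile with: cc -std=c11 model.c
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int req_replay_clients;   /* clients still expected to replay */
static int        req_replay_queue_len; /* queued replay requests           */

/* Normal recovery: a client replays its requests, the queue drains, and
 * the per-client counter is decremented on that same path.                */
static void client_completes_replay(void)
{
        req_replay_queue_len = 0;
        atomic_fetch_sub(&req_replay_clients, 1);
}

/* Aborted recovery, as in the crash: the replay queues are drained
 * (abort_req_replay_queue()/abort_lock_replay_queue() in the real code),
 * but unless the stale exports are also cleaned up on this path, nothing
 * ever decrements the per-client counter.                                 */
static void abort_recovery(bool cleanup_exports)
{
        req_replay_queue_len = 0;
        if (cleanup_exports)
                atomic_fetch_sub(&req_replay_clients, 1);
}

int main(void)
{
        atomic_store(&req_replay_clients, 1); /* one client never replays */
        req_replay_queue_len = 1;

        abort_recovery(false);                /* the problematic ordering */

        /* Mirrors LASSERT(cfs_atomic_read(&obd->obd_req_replay_clients) == 0)
         * in target_next_replay_req(): the queue is empty but the counter
         * is still 1, so this fires.                                       */
        assert(req_replay_queue_len == 0);
        assert(atomic_load(&req_replay_clients) == 0); /* fails */
        return 0;
}

Run as written, the final assert aborts, which is the user-space analogue of the LBUG above; calling abort_recovery(true) models an abort path that also cleans up the exports, and the assertion then holds.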

Activity

            pjones Peter Jones added a comment -

            Landed for 2.1.3 and 2.3


jaylan Jay Lan (Inactive) added a comment -

Patch set 2 of review #3145 was landed on b2_1, but not master.
The LU-1432 patch was landed on master, but not b2_1.

We had an MDS crash after applying review #3122, which is essentially the same as patch set 1 of #3145. After the crash, I cherry-picked the LU-1432 patch onto our b2_1, and it has been running in our production systems without a crash for several weeks now.

So, please comment on whether I should have both LU-1432 and patch set 2 of #3145. Thanks!


jaylan Jay Lan (Inactive) added a comment -

We installed the 2.1.1-2.1nasS build on service160 (an MDS). It crashed on boot. Since it is a production machine, the control room put the 2.1.1-2nasS version back in and brought service160 up again.

The difference between 2nasS and 2.1nasS was that I replaced Di Wang's #3115 with #3122.


jaylan Jay Lan (Inactive) added a comment -

No, I do not remember seeing that. Not on ASSERTION(cfs_list_empty(&top->loh_lru)).


tappro Mikhail Pershin added a comment -

Jay, that LBUG doesn't look related. Do you see it every time?


tappro Mikhail Pershin added a comment -

Bob, you are right; that lock doesn't exist in master, and I missed it for b2_1. I will update the patch.


jaylan Jay Lan (Inactive) added a comment -

I compared my patch (adjusted from review #3122) with #3145. They are essentially identical, except that my patch also moves class_export_recovery_cleanup() to its new location, as #3122 does.


jaylan Jay Lan (Inactive) added a comment -

After applying http://review.whamcloud.com/3122, the MDS LBUG'ed:

LustreError: 10878:0:(mdt_handler.c:5529:mdt_iocontrol()) Aborting recovery for device nbp2-MDT0000
LustreError: 11533:0:(lu_object.c:113:lu_object_put()) ASSERTION(cfs_list_empty(&top->loh_lru)) failed
LustreError: 11533:0:(lu_object.c:113:lu_object_put()) LBUG
Pid: 11533, comm: mdt_rdpg_07

Call Trace:
[<ffffffffa056b855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa056be95>] lbug_with_loc+0x75/0xe0 [libcfs]
[<ffffffffa0576da6>] libcfs_assertion_failed+0x66/0x70 [libcfs]


bogl Bob Glossman (Inactive) added a comment -

Mikhail,
Maybe I'm wrong, but it looks to me like your mod to ldlm_lib.c in http://review.whamcloud.com/3145 now allows an error exit from the routine that leaves &target->obd_recovery_task_lock still held. Did you mean to do that?

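To illustrate the pattern Bob is describing, here is a minimal sketch (hypothetical names and plain pthreads; not the patch itself): once a path acquires the recovery task lock, every exit, including a newly added error exit, must drop it, or the lock is leaked and the next taker deadlocks.

/*
 * Sketch of a lock-leaking error exit (hypothetical names, pthreads;
 * not the actual ldlm_lib.c code). Compile with: cc leak.c -pthread
 */
#include <pthread.h>
#include <stdio.h>

static pthread_spinlock_t recovery_task_lock;

static int next_recovery_item(int queue_empty)
{
        pthread_spin_lock(&recovery_task_lock);

        if (queue_empty) {
                /* Buggy variant: a bare "return -1;" here would leave
                 * recovery_task_lock held forever, which is the leak Bob
                 * suspects in the modified routine. Unlock before bailing: */
                pthread_spin_unlock(&recovery_task_lock);
                return -1;
        }

        /* ... dequeue the next item under the lock ... */
        pthread_spin_unlock(&recovery_task_lock);
        return 0;
}

int main(void)
{
        pthread_spin_init(&recovery_task_lock, PTHREAD_PROCESS_PRIVATE);
        printf("empty queue -> %d\n", next_recovery_item(1));
        printf("work queued -> %d\n", next_recovery_item(0));
        pthread_spin_destroy(&recovery_task_lock);
        return 0;
}

The usual fix for such a leak is to unlock on every exit path, or to funnel all exits through a single unlock site.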
tappro Mikhail Pershin added a comment -

Jay, check this one: http://review.whamcloud.com/3145

People

  tappro Mikhail Pershin
  jaylan Jay Lan (Inactive)