[LU-5651] ASSERTION( req->rq_export->exp_lock_replay_needed ) failed


    Description

      The client does not restore its import state correctly when it
      reconnects during replay. It resends lock replay requests after the
      server has already queued the final ping, and the server then fails with
      "target_queue_recovery_request())
      ASSERTION( req->rq_export->exp_lock_replay_needed ) failed".

      The solution is to add imp_replay_state, which stores the last replay
      state. On reconnect, imp_state is restored from imp_replay_state.
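
      The idea of remembering the last replay stage can be sketched as a toy
      model. This is not the landed patch: the enum values, struct, and every
      function name below are hypothetical stand-ins. Only the field names
      imp_state and imp_replay_state come from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for client import states (not the real Lustre enum). */
enum imp_state {
    IMP_DISCON,
    IMP_CONNECTING,
    IMP_REPLAY,
    IMP_REPLAY_LOCKS,
    IMP_REPLAY_WAIT,   /* final ping sent, waiting for recovery to finish */
    IMP_FULL,
};

/* Toy import: only the two fields named in the ticket description. */
struct import {
    enum imp_state imp_state;        /* current state */
    enum imp_state imp_replay_state; /* last replay stage reached */
};

static bool is_replay_stage(enum imp_state s)
{
    return s == IMP_REPLAY || s == IMP_REPLAY_LOCKS || s == IMP_REPLAY_WAIT;
}

/* Hypothetical setter: remember each replay stage so it survives a disconnect. */
static void import_set_state(struct import *imp, enum imp_state s)
{
    imp->imp_state = s;
    if (is_replay_stage(s))
        imp->imp_replay_state = s;
}

/* Hypothetical reconnect hook: resume from the saved stage instead of
 * restarting replay, so lock replay is not sent again after the final ping. */
static void import_reconnected(struct import *imp)
{
    assert(is_replay_stage(imp->imp_replay_state));
    imp->imp_state = imp->imp_replay_state;
}

/* Walk through the scenario from the description. */
enum imp_state demo_reconnect_during_replay(void)
{
    struct import imp = { IMP_DISCON, IMP_REPLAY };

    import_set_state(&imp, IMP_REPLAY);
    import_set_state(&imp, IMP_REPLAY_LOCKS);
    import_set_state(&imp, IMP_REPLAY_WAIT);  /* final ping queued by server */
    import_set_state(&imp, IMP_DISCON);       /* reply timed out */
    import_set_state(&imp, IMP_CONNECTING);
    import_reconnected(&imp);

    return imp.imp_state;
}
```

      In this model the import comes back in the wait stage after the
      reconnect, so the client would not fall back to the lock replay stage
      that triggers the server-side assertion.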

          Activity

            yujian Jian Yu added a comment - Patches landed to master and b2_5 branches.

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12163/
            Subject: LU-5651: ptlrpc: fix import state during replay
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set:
            Commit: 5748379d28846c672f793ba1f8e143e63531dd05

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12942/
            Subject: LU-5651 test: run replay-single test 93 only when supported.
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: afde9f17260650d0cb80d53613fb5afda0a39384

            gerrit Gerrit Updater added a comment -

            James Simmons (uja.ornl@gmail.com) uploaded a new patch: http://review.whamcloud.com/12942
            Subject: LU-5651 test: run replay-single test 93 only when supported.
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7bc6d9b8c2e09027444b31c40dd320e244c2412d

            simmonsja James A Simmons added a comment - I believe the replay-single test needs to be updated to check the Lustre version so that interop testing passes.

            jamesanunez James Nunez (Inactive) added a comment -

            Patch for b2_5 at http://review.whamcloud.com/#/c/12163/

            The master patch has landed. Is there more work needed to complete this ticket or should it be closed?

            hilljjornl Jason Hill (Inactive) added a comment - ORNL hit this issue today in production after upgrading to Lustre 2.5.2 on both server and client.

            niu Niu Yawei (Inactive) added a comment - Ah, my mistake: the code has been changed by LU-793, so the client can reconnect now. I'll review the patch soon. Thanks for your explanation.

            askulysh Andriy Skulysh added a comment -

            I don't understand why the reconnect should be rejected. The only check for in-flight RPCs on an export is

            no_export:
                            OBD_FAIL_TIMEOUT(OBD_FAIL_TGT_DELAY_CONNECT, 2 * obd_timeout);
                    } else if (req->rq_export == NULL &&
                               atomic_read(&export->exp_rpc_count) > 0) {
                            LCONSOLE_WARN("%s: Client %s (at %s) refused connection, "
                                          "still busy with %d references\n",
                                          target->obd_name, cluuid.uuid,
                                          libcfs_nid2str(req->rq_peer.nid),
                                          atomic_read(&export->exp_refcount));
                            GOTO(out, rc = -EBUSY);

            but a reconnect request has a valid export handle.
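
            The condition in the quoted branch can be restated as a small standalone check. This is only a sketch: connect_busy_check and its parameters are hypothetical names, and it models nothing beyond the two tests visible in the quoted snippet (no export handle on the request, and a positive exp_rpc_count).

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper mirroring only the quoted branch: a connect is
 * refused with -EBUSY when the request carries no export handle and the
 * export still has RPCs in flight. A reconnect that carries a valid
 * export handle does not take this branch.
 */
int connect_busy_check(bool req_has_export, int exp_rpc_count)
{
    if (!req_has_export && exp_rpc_count > 0)
        return -EBUSY;
    return 0;
}
```

            Under this reading, a queued final ping (one in-flight RPC) blocks only a connect that arrives without an export handle, which matches the point made in the comment above.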

            niu Niu Yawei (Inactive) added a comment -

            no. request reaches timeout on reply, client reconnects

            Could you explain how client1 can reconnect while the final ping is in the queue (the server would reject the reconnect because the export has an in-flight RPC)? Is there a defect in the reconnect path?

            People

              niu Niu Yawei (Inactive)
              askulysh Andriy Skulysh
              Votes: 0
              Watchers: 12
