ASSERTION( req->rq_export->exp_lock_replay_needed ) failed


    Description

      The client does not restore its import state correctly when it
      reconnects during replay: it resends lock replay even though the
      server has already queued its final ping.
      The server then fails in target_queue_recovery_request() with
      "ASSERTION( req->rq_export->exp_lock_replay_needed ) failed".

      The fix adds imp_replay_state to record the last replay state reached.
      On reconnect, imp_state is restored from imp_replay_state.
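
The fix described above can be pictured with a small C model. This is only a sketch under stated assumptions, not the landed patch: the enum values, `import_init`, `import_set_state` and `import_resume_replay` are hypothetical stand-ins, and only the field names `imp_state` and `imp_replay_state` come from the ticket. Resetting the saved stage once recovery completes is also an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for client import states (not the real Lustre enum). */
enum imp_state {
    IMP_DISCON,        /* connection lost */
    IMP_REPLAY,        /* replaying requests */
    IMP_REPLAY_LOCKS,  /* replaying locks */
    IMP_REPLAY_WAIT,   /* final ping sent, waiting for recovery to end */
    IMP_FULL           /* recovery complete */
};

struct import {
    enum imp_state imp_state;         /* current state */
    enum imp_state imp_replay_state;  /* last replay stage reached */
};

int is_replay_stage(enum imp_state s)
{
    return s >= IMP_REPLAY && s <= IMP_REPLAY_WAIT;
}

void import_init(struct import *imp)
{
    assert(imp != NULL);
    imp->imp_state = IMP_DISCON;
    imp->imp_replay_state = IMP_DISCON;  /* no replay stage recorded yet */
}

/* Hypothetical helper: change state and remember the latest replay stage. */
void import_set_state(struct import *imp, enum imp_state s)
{
    assert(imp != NULL);
    imp->imp_state = s;
    if (is_replay_stage(s))
        imp->imp_replay_state = s;
    else if (s == IMP_FULL)
        imp->imp_replay_state = IMP_DISCON;  /* assumption: forget it once recovered */
}

/* Hypothetical helper: called after reconnecting to a server still in recovery.
 * Resume from the saved stage instead of restarting at IMP_REPLAY. */
void import_resume_replay(struct import *imp)
{
    assert(imp != NULL);
    imp->imp_state = is_replay_stage(imp->imp_replay_state)
                         ? imp->imp_replay_state
                         : IMP_REPLAY;
}
```

In this model, a client that had already reached `IMP_REPLAY_WAIT` (final ping sent) returns to that stage after a reconnect, so it has no reason to send lock replay again.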

          Activity

            [LU-5651] ASSERTION( req->rq_export->exp_lock_replay_needed ) failed

            gerrit Gerrit Updater added a comment:

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12942/
            Subject: LU-5651 test: run replay-single test 93 only when supported.
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: afde9f17260650d0cb80d53613fb5afda0a39384

            gerrit Gerrit Updater added a comment:

            James Simmons (uja.ornl@gmail.com) uploaded a new patch: http://review.whamcloud.com/12942
            Subject: LU-5651 test: run replay-single test 93 only when supported.
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7bc6d9b8c2e09027444b31c40dd320e244c2412d

            simmonsja James A Simmons added a comment:

            I believe the replay-single test needs to be updated to check the Lustre version so that interop testing passes.

            jamesanunez James Nunez (Inactive) added a comment:

            Patch for b2_5 at http://review.whamcloud.com/#/c/12163/

            The master patch has landed. Is there more work needed to complete this ticket or should it be closed?

            hilljjornl Jason Hill (Inactive) added a comment:

            ORNL hit this issue today in production after upgrading to Lustre 2.5.2 on both server and client.

            niu Niu Yawei (Inactive) added a comment:

            Ah, my mistake: the code was changed by LU-793, so the client can reconnect now. I'll review the patch soon. Thanks for your explanation.

            askulysh Andriy Skulysh added a comment:

            I don't understand why the reconnect should be rejected. The only check for in-flight RPCs on an export is

            no_export:
                            OBD_FAIL_TIMEOUT(OBD_FAIL_TGT_DELAY_CONNECT, 2 * obd_timeout);
                    } else if (req->rq_export == NULL &&
                               atomic_read(&export->exp_rpc_count) > 0) {
                            LCONSOLE_WARN("%s: Client %s (at %s) refused connection, "
                                          "still busy with %d references\n",
                                          target->obd_name, cluuid.uuid,
                                          libcfs_nid2str(req->rq_peer.nid),
                                          atomic_read(&export->exp_refcount));
                            GOTO(out, rc = -EBUSY);

            but a reconnect request has a valid export handle.
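
The condition in the quoted fragment can be isolated in a small standalone model. This is a hedged sketch: `struct export_model`, `struct request_model` and `check_connect` are hypothetical stand-ins for the real Lustre structures and connect handler, and only the tested condition (`rq_export == NULL` together with `exp_rpc_count > 0`) is taken from the snippet above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the server-side export. */
struct export_model {
    int exp_rpc_count;   /* RPCs currently in flight on this export */
};

/* Hypothetical stand-in for an incoming connect request. */
struct request_model {
    const struct export_model *rq_export;  /* NULL when no valid handle was sent */
};

/* Mirrors the quoted check: refuse with -EBUSY only when the request carries
 * no export handle AND the existing export still has RPCs in flight. */
int check_connect(const struct request_model *req,
                  const struct export_model *exp)
{
    assert(req != NULL && exp != NULL);
    if (req->rq_export == NULL && exp->exp_rpc_count > 0)
        return -EBUSY;
    return 0;
}
```

Under this model, a reconnect that presents a valid export handle passes the check even while the final ping is still queued, which is the point made in the comment above.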

            niu Niu Yawei (Inactive) added a comment:

            Quoting Andriy Skulysh: "No. The request times out waiting for its reply, and the client reconnects."

            Could you explain how client1 can reconnect while the final ping is still in the queue (the server would reject the reconnect because the export has an in-flight RPC)? Is there a defect in the reconnect path?

            askulysh Andriy Skulysh added a comment:

            No. The request times out waiting for its reply, and the client reconnects.

            niu Niu Yawei (Inactive) added a comment:

            Quoting Andriy Skulysh:
            1) The server enters recovery stage 2 (lock replay) with at least 2 clients.
            2) client1 and client2 send lock replay.
            3) client1 sends its final ping.
            4) The server queues the final ping from client1 and sets exp_lock_replay_needed to 0.
            5) client2 is still in the lock replay stage (waiting for lock replies from the server).
            6) client1 reconnects.

            Is the final ping from client1 still in the queue? If it is still in the queue, client1 can't reconnect because there is an in-flight RPC. If the final ping has been processed, exp_in_recovery should have been cleared.

            askulysh Andriy Skulysh added a comment:

            Step-by-step explanation:
            1) The server enters recovery stage 2 (lock replay) with at least 2 clients.
            2) client1 and client2 send lock replay.
            3) client1 sends its final ping.
            4) The server queues the final ping from client1 and sets exp_lock_replay_needed to 0.
            5) client2 is still in the lock replay stage (waiting for lock replies from the server).
            6) client1 reconnects.
            7) client1 replays locks from the beginning.
            8) The assertion fails.
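
The sequence above can be replayed against a toy model of the server-side flag. This is a minimal sketch, not Lustre code: `struct export_model`, `queue_recovery_request` and `run_scenario` are hypothetical, and the real server asserts rather than returning `false`. Only `exp_lock_replay_needed` and the ordering of events come from the steps in the comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum rpc_kind { RPC_LOCK_REPLAY, RPC_FINAL_PING };

/* Hypothetical stand-in for the server-side export of client1. */
struct export_model {
    bool exp_lock_replay_needed;
};

/* Returns false where the real server's assertion would fire:
 * a lock replay arriving after the final ping was queued. */
bool queue_recovery_request(struct export_model *exp, enum rpc_kind kind)
{
    assert(exp != NULL);
    if (kind == RPC_FINAL_PING) {
        exp->exp_lock_replay_needed = false;   /* step 4 */
        return true;
    }
    /* Lock replay is only valid while the flag is still set. */
    return exp->exp_lock_replay_needed;
}

/* Runs the steps for client1. If restart_lock_replay is true, the client
 * replays locks from the beginning after reconnecting (steps 7-8). */
bool run_scenario(bool restart_lock_replay)
{
    struct export_model client1 = { true };                          /* step 1 */
    bool ok = queue_recovery_request(&client1, RPC_LOCK_REPLAY);     /* step 2 */
    ok = ok && queue_recovery_request(&client1, RPC_FINAL_PING);     /* steps 3-4 */
    /* Steps 5-6: client2 is still replaying and client1 reconnects.
     * The flag on client1's export stays cleared across the reconnect. */
    if (restart_lock_replay)
        ok = ok && queue_recovery_request(&client1, RPC_LOCK_REPLAY); /* steps 7-8 */
    return ok;
}
```

Calling `run_scenario(true)` reproduces the failing path, while `run_scenario(false)` models a client that does not restart lock replay after the reconnect and therefore never trips the check.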

            People

              niu Niu Yawei (Inactive)
              askulysh Andriy Skulysh