
ASSERTION( req->rq_export->exp_lock_replay_needed ) failed


    Description

      The client does not restore its import state correctly
      when it reconnects during replay. It resends lock replays
      after the server has already queued the client's final ping.
      The server then fails with "target_queue_recovery_request())
      ASSERTION( req->rq_export->exp_lock_replay_needed ) failed".

      The solution is to add imp_replay_state to store the last replay state.
      On reconnect, imp_state is restored from imp_replay_state.
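
The idea in the description can be illustrated with a minimal C sketch. This is not the Lustre patch: the struct, the enum values, and the helper names (import_sketch, IMP_*, import_reconnect and so on) are hypothetical stand-ins, and only the field names imp_state and imp_replay_state come from the description. It assumes the client records each replay stage as it reaches it and resumes from that stage after a reconnect.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced set of client import stages (not the real enum). */
enum imp_stage {
        IMP_DISCONNECTED,
        IMP_REPLAY_REQUESTS,
        IMP_REPLAY_LOCKS,
        IMP_REPLAY_WAIT,        /* final ping sent, waiting for recovery to end */
};

/* Hypothetical import with only the two fields named in the description. */
struct import_sketch {
        enum imp_stage imp_state;         /* current state */
        enum imp_stage imp_replay_state;  /* last replay stage reached */
};

static void import_init(struct import_sketch *imp)
{
        assert(imp != NULL);
        imp->imp_state = IMP_DISCONNECTED;
        imp->imp_replay_state = IMP_REPLAY_REQUESTS;
}

/* Move to a replay stage and remember it for a later reconnect. */
static void import_enter_replay_stage(struct import_sketch *imp,
                                      enum imp_stage stage)
{
        assert(imp != NULL);
        imp->imp_state = stage;
        imp->imp_replay_state = stage;
}

/* Losing the connection changes imp_state but keeps imp_replay_state. */
static void import_disconnect(struct import_sketch *imp)
{
        assert(imp != NULL);
        imp->imp_state = IMP_DISCONNECTED;
}

/* On reconnect during recovery, resume from the saved stage
 * instead of restarting lock replay from the beginning. */
static void import_reconnect(struct import_sketch *imp)
{
        assert(imp != NULL);
        imp->imp_state = imp->imp_replay_state;
}
```

In this model, a client that had already sent its final ping comes back in IMP_REPLAY_WAIT after a reconnect, so it does not resend lock replays.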

          Activity

            [LU-5651] ASSERTION( req->rq_export->exp_lock_replay_needed ) failed

            hilljjornl Jason Hill (Inactive) added a comment - ORNL hit this issue today in production after upgrading to Lustre 2.5.2 on both server and client.

            niu Niu Yawei (Inactive) added a comment - Ah, my mistake, the code has been changed by LU-793, so the client can reconnect now. I'll review the patch soon, thanks for your explanation.

            askulysh Andriy Skulysh added a comment - I don't understand why the reconnect should be rejected. The only check for in-flight RPCs on an export is

            no_export:
                            OBD_FAIL_TIMEOUT(OBD_FAIL_TGT_DELAY_CONNECT, 2 * obd_timeout);
                    } else if (req->rq_export == NULL &&
                               atomic_read(&export->exp_rpc_count) > 0) {
                            LCONSOLE_WARN("%s: Client %s (at %s) refused connection, "
                                          "still busy with %d references\n",
                                          target->obd_name, cluuid.uuid,
                                          libcfs_nid2str(req->rq_peer.nid),
                                          atomic_read(&export->exp_refcount));
                            GOTO(out, rc = -EBUSY);

            but a reconnect request has a valid export handle.
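
The condition quoted in this comment can be isolated in a small C sketch. The types and the function name here (conn_request, conn_export, connect_busy_check) are hypothetical reductions for illustration and are not Lustre definitions; only rq_export, exp_rpc_count, and the -EBUSY result are taken from the excerpt. It assumes, as the comment argues, that a reconnect arrives with rq_export already set.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical reduced export: only the in-flight RPC counter. */
struct conn_export {
        int exp_rpc_count;
};

/* Hypothetical reduced request: only the export handle it carries. */
struct conn_request {
        struct conn_export *rq_export;  /* NULL for a brand-new connection */
};

/* Mirrors the quoted condition: refuse only when the request carries no
 * export handle and the existing export still has RPCs in flight. */
static int connect_busy_check(const struct conn_request *req,
                              const struct conn_export *export)
{
        assert(req != NULL && export != NULL);

        if (req->rq_export == NULL && export->exp_rpc_count > 0)
                return -EBUSY;
        return 0;
}
```

Under these assumptions, a request that carries a valid export handle passes the check even while a final ping is still queued on that export.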

            niu Niu Yawei (Inactive) added a comment -
            > no. request reaches timeout on reply, client reconnects
            Could you explain how client1 can reconnect while the final ping is in the queue (the server would reject the reconnect because the export has an in-flight RPC)? Is there a defect in the reconnect path?

            askulysh Andriy Skulysh added a comment - No. The request times out waiting for a reply, and the client reconnects.

            niu Niu Yawei (Inactive) added a comment -
            > step by step explanation:
            > 1) The server enters recovery stage 2 (lock replay) with at least 2 clients.
            > 2) client1 and client2 send lock replays.
            > 3) client1 sends its final ping.
            > 4) The server queues the final ping from client1 and sets exp_lock_replay_needed to 0.
            > 5) client2 is still in the lock replay stage (waiting for lock replies from the server).
            > 6) client1 reconnects.
            Is the final ping from client1 still in the queue? If it is still in the queue, client1 can't reconnect because there is an in-flight RPC; if the final ping has been processed, exp_in_recovery should have been cleared.

            askulysh Andriy Skulysh added a comment - step by step explanation:
            1) The server enters recovery stage 2 (lock replay) with at least 2 clients.
            2) client1 and client2 send lock replays.
            3) client1 sends its final ping.
            4) The server queues the final ping from client1 and sets exp_lock_replay_needed to 0.
            5) client2 is still in the lock replay stage (waiting for lock replies from the server).
            6) client1 reconnects.
            7) client1 replays locks from the beginning.
            8) The assertion fails.
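
The server-side half of these steps can be modeled with a short C sketch. It is a simplified illustration, not Lustre code: the struct recov_export and the three helper functions are hypothetical, and only the flag names exp_in_recovery and exp_lock_replay_needed come from the ticket. It assumes that queueing the final ping clears exp_lock_replay_needed while the export remains in recovery until the final stage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduced export holding the two flags from the discussion. */
struct recov_export {
        bool exp_in_recovery;
        bool exp_lock_replay_needed;
};

/* Step 1: the server enters the lock replay stage for this export. */
static void export_enter_lock_replay(struct recov_export *exp)
{
        assert(exp != NULL);
        exp->exp_in_recovery = true;
        exp->exp_lock_replay_needed = true;
}

/* Step 4: the final ping is queued, not yet processed. The export is
 * still in recovery because other clients have not finished. */
static void server_queue_final_ping(struct recov_export *exp)
{
        assert(exp != NULL);
        exp->exp_lock_replay_needed = false;
}

/* Steps 7-8: the condition a queued lock replay is checked against.
 * Returns false where the real server would hit the assertion. */
static bool lock_replay_would_pass(const struct recov_export *exp)
{
        assert(exp != NULL);
        return exp->exp_lock_replay_needed;
}
```

In this model, a lock replay resent after step 4 finds exp_in_recovery still set but exp_lock_replay_needed cleared, which is the state in which the assertion from the title fails.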

            askulysh Andriy Skulysh added a comment - exp_in_recovery is zeroed during the final recovery stage (stage 3); client1 reconnects during the lock replay stage (stage 2).

            niu Niu Yawei (Inactive) added a comment -
            > The recovery ends only on receiving and processing final pings from all clients.
            I was saying recovery on this export is done (exp_in_recovery == 0).

            askulysh Andriy Skulysh added a comment - The recovery ends only on receiving and processing final pings from all clients. It can happen that the server has accepted the final ping from client1 but is still waiting for requests from client2 when client1 reconnects. This situation was simulated in a test by adding a timeout before processing final pings.

            niu Niu Yawei (Inactive) added a comment - This looks like a duplicate of LU-5287, and I posted a fix at: http://review.whamcloud.com/#/c/11871/
            > Client doesn't restore import state correctly
            > on reconnect during replay. It resends lock replay
            > when final ping was queued by server.
            > Server fails with "target_queue_recovery_request())
            > ASSERTION( req->rq_export->exp_lock_replay_needed ) failed"
            If the final ping has been queued on the server, recovery on this export should be finished (exp_in_recovery == 0) by the time the resent lock replay reaches the server. I don't see why that could trigger the assertion; could you explain it in detail?

            People

              niu Niu Yawei (Inactive)
              askulysh Andriy Skulysh