[LU-5651] ASSERTION( req->rq_export->exp_lock_replay_needed ) failed Created: 23/Sep/14  Updated: 29/May/15  Resolved: 15/Jan/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.7.0, Lustre 2.5.4

Type: Bug Priority: Minor
Reporter: Andriy Skulysh Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Duplicate
is duplicated by LU-5719 target_queue_recovery_request() ASSER... Closed
Related
is related to LU-5287 (ldlm_lib.c:2253:target_queue_recover... Resolved
Severity: 3
Rank (Obsolete): 15837

 Description   

The client doesn't restore the import state correctly
on reconnect during replay. It resends lock replay
after the final ping has been queued by the server.
The server fails with "target_queue_recovery_request())
ASSERTION( req->rq_export->exp_lock_replay_needed ) failed"

The solution is to add imp_replay_state to store the last replay state.
During reconnect, imp_state is restored from imp_replay_state.
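
A minimal sketch of the idea, not the actual patch: the toy_* names and the reduced state enum are hypothetical, and only the field names imp_state and imp_replay_state come from this ticket. It assumes the import remembers the last replay stage it reached and resumes from it when it reconnects while the server is still in recovery.

```c
#include <assert.h>

/* Toy subset of client import states; illustrative names, not the Lustre enum. */
enum toy_imp_state {
	TOY_IMP_CONNECTING,   /* connection lost, (re)connect in progress */
	TOY_IMP_REPLAY,       /* replaying requests */
	TOY_IMP_REPLAY_LOCKS, /* replaying locks */
	TOY_IMP_REPLAY_WAIT,  /* final ping sent, waiting for recovery to end */
	TOY_IMP_FULL          /* recovery finished */
};

struct toy_import {
	enum toy_imp_state imp_state;        /* current state */
	enum toy_imp_state imp_replay_state; /* last replay stage reached */
};

/* True for the stages that belong to replay. */
int toy_is_replay_state(enum toy_imp_state s)
{
	return s == TOY_IMP_REPLAY || s == TOY_IMP_REPLAY_LOCKS ||
	       s == TOY_IMP_REPLAY_WAIT;
}

/* Change state and remember the stage if it is a replay stage. */
void toy_set_state(struct toy_import *imp, enum toy_imp_state s)
{
	imp->imp_state = s;
	if (toy_is_replay_state(s))
		imp->imp_replay_state = s;
}

/* The connection drops; the saved replay stage is left untouched. */
void toy_disconnect(struct toy_import *imp)
{
	imp->imp_state = TOY_IMP_CONNECTING;
}

/* Reconnect while the server is still recovering: resume from the saved
 * stage instead of starting replay from the beginning. */
void toy_reconnect_during_replay(struct toy_import *imp)
{
	assert(imp->imp_state == TOY_IMP_CONNECTING);
	imp->imp_state = imp->imp_replay_state;
}
```

With this model, an import that had already reached the final-ping stage returns to that stage after a reconnect rather than to lock replay.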



 Comments   
Comment by Andriy Skulysh [ 23/Sep/14 ]

patch: http://review.whamcloud.com/12015

Comment by Niu Yawei (Inactive) [ 23/Sep/14 ]

This looks like a dup of LU-5287, and I posted a fix at: http://review.whamcloud.com/#/c/11871/

The client doesn't restore the import state correctly
on reconnect during replay. It resends lock replay
after the final ping has been queued by the server.
The server fails with "target_queue_recovery_request())
ASSERTION( req->rq_export->exp_lock_replay_needed ) failed"

If the final ping has been queued on the server, the recovery on this export should be finished (exp_in_recovery == 0) when the resent lock replay reaches the server. I don't see why that could trigger the assertion; could you explain it in detail?

Comment by Andriy Skulysh [ 23/Sep/14 ]

The recovery ends only on receiving and processing final pings from all clients. It can happen that the server has accepted the final ping from client1 but is still waiting for requests from client2 when client1 reconnects. This situation was simulated in the test by adding a timeout before processing final pings.

Comment by Niu Yawei (Inactive) [ 23/Sep/14 ]

The recovery ends only on receiving and processing final pings from all clients.

I was saying recovery on this export is done (exp_in_recovery == 0).

Comment by Andriy Skulysh [ 23/Sep/14 ]

exp_in_recovery is zeroed during the final recovery stage 3; client1 reconnects during lock replay stage 2.

Comment by Andriy Skulysh [ 23/Sep/14 ]

Step-by-step explanation:
1) The server goes to recovery stage 2 (lock replay) with at least 2 clients.
2) client1 and client2 send lock replay.
3) client1 sends the final ping.
4) The server queues the final ping from client1 and sets exp_lock_replay_needed to 0.
5) client2 is still in the lock replay stage (waits for lock replies from the server).
6) client1 reconnects.
7) client1 replays locks from the beginning.
8) The assertion fails.
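
The sequence above can be modelled with a small sketch. This is a simplified model, not the server code: the toy_* names are hypothetical, the function returns false where the real server would hit the assertion, and the "resume" branch is an assumption about what a client that kept its replay stage would send after reconnecting.

```c
#include <stdbool.h>

/* Request kinds a client sends during recovery stage 2 (hypothetical names). */
enum toy_recovery_req { TOY_LOCK_REPLAY, TOY_FINAL_PING };

/* Per-client server state; only the flag named in the assertion is modelled. */
struct toy_export {
	bool exp_lock_replay_needed;
};

/* Returns true if the request is accepted, false where the real server
 * would fail ASSERTION( exp_lock_replay_needed ). */
bool toy_queue_recovery_request(struct toy_export *exp,
				enum toy_recovery_req req)
{
	switch (req) {
	case TOY_LOCK_REPLAY:
		/* A lock replay is only expected before the final ping. */
		return exp->exp_lock_replay_needed;
	case TOY_FINAL_PING:
		/* Step 4: queue the final ping and clear the flag. */
		exp->exp_lock_replay_needed = false;
		return true;
	}
	return false;
}

/* Walk client1 through the steps; returns true if no assertion would fire. */
bool toy_client1_scenario(bool restart_lock_replay_on_reconnect)
{
	struct toy_export exp = { .exp_lock_replay_needed = true };
	bool ok;

	ok = toy_queue_recovery_request(&exp, TOY_LOCK_REPLAY);      /* step 2 */
	ok = ok && toy_queue_recovery_request(&exp, TOY_FINAL_PING); /* steps 3-4 */

	/* Steps 5-6: client2 is still replaying; client1 times out and reconnects. */
	if (restart_lock_replay_on_reconnect)
		/* Step 7: locks are replayed from the beginning. */
		ok = ok && toy_queue_recovery_request(&exp, TOY_LOCK_REPLAY);
	else
		/* Assumed behaviour when the saved stage is restored. */
		ok = ok && toy_queue_recovery_request(&exp, TOY_FINAL_PING);

	return ok;
}
```

Calling toy_client1_scenario(true) reproduces step 8 in this model, while toy_client1_scenario(false) completes without tripping the check.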

Comment by Niu Yawei (Inactive) [ 23/Sep/14 ]

Step-by-step explanation:
1) The server goes to recovery stage 2 (lock replay) with at least 2 clients.
2) client1 and client2 send lock replay.
3) client1 sends the final ping.
4) The server queues the final ping from client1 and sets exp_lock_replay_needed to 0.
5) client2 is still in the lock replay stage (waits for lock replies from the server).
6) client1 reconnects.

Is the final ping from client1 still in the queue? If it is, client1 can't reconnect because there is an inflight RPC. If the final ping has been processed, exp_in_recovery should have been cleared.

Comment by Andriy Skulysh [ 23/Sep/14 ]

No. The request times out waiting for a reply, and the client reconnects.

Comment by Niu Yawei (Inactive) [ 24/Sep/14 ]

No. The request times out waiting for a reply, and the client reconnects.

Could you explain how client1 can reconnect while the final ping is in the queue (the server would reject the reconnect because the export has an inflight RPC)? Is there a defect in the reconnect path?

Comment by Andriy Skulysh [ 26/Sep/14 ]

I don't understand why the reconnect should be rejected. The only check for an inflight RPC on an export is

no_export:
                OBD_FAIL_TIMEOUT(OBD_FAIL_TGT_DELAY_CONNECT, 2 * obd_timeout);
        } else if (req->rq_export == NULL &&
                   atomic_read(&export->exp_rpc_count) > 0) {
                LCONSOLE_WARN("%s: Client %s (at %s) refused connection, "
                              "still busy with %d references\n",
                              target->obd_name, cluuid.uuid,
                              libcfs_nid2str(req->rq_peer.nid),
                              atomic_read(&export->exp_refcount));
                GOTO(out, rc = -EBUSY);

but a reconnect request has a valid export handle.
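
To make the point concrete, here is a reduced model of that condition. The toy_* types and function are hypothetical stand-ins and keep only the two fields the check reads; it is a sketch of the predicate, not the connect handler.

```c
#include <errno.h>
#include <stddef.h>

/* Stand-in for the export: only the inflight RPC counter is modelled. */
struct toy_conn_export {
	int exp_rpc_count;
};

/* Stand-in for the connect request: rq_export is NULL when the request
 * carries no valid export handle. */
struct toy_conn_request {
	struct toy_conn_export *rq_export;
};

/* Mirrors the quoted condition: refuse with -EBUSY only when the request
 * has no export handle AND the export still has RPCs in flight. */
int toy_check_busy(const struct toy_conn_request *req,
		   const struct toy_conn_export *export)
{
	if (req->rq_export == NULL && export->exp_rpc_count > 0)
		return -EBUSY;
	return 0;
}
```

In this model a request without a handle is refused while an RPC is in flight, but a reconnect that carries the handle passes the check even though the final ping is still queued.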

Comment by Niu Yawei (Inactive) [ 26/Sep/14 ]

Ah, my mistake, the code has been changed by LU-793, so the client can reconnect now. I'll review the patch soon, thanks for your explanation.

Comment by Jason Hill (Inactive) [ 08/Oct/14 ]

ORNL hit this issue today in production after upgrading to Lustre 2.5.2 on both server and client.

Comment by James Nunez (Inactive) [ 04/Dec/14 ]

Patch for b2_5 at http://review.whamcloud.com/#/c/12163/

The master patch has landed. Is there more work needed to complete this ticket or should it be closed?

Comment by James A Simmons [ 04/Dec/14 ]

I believe the replay-single test needs to be updated to check Lustre versions so that interop testing passes.

Comment by Gerrit Updater [ 05/Dec/14 ]

James Simmons (uja.ornl@gmail.com) uploaded a new patch: http://review.whamcloud.com/12942
Subject: LU-5651 test: run replay-single test 93 only when supported.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7bc6d9b8c2e09027444b31c40dd320e244c2412d

Comment by Gerrit Updater [ 18/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12942/
Subject: LU-5651 test: run replay-single test 93 only when supported.
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: afde9f17260650d0cb80d53613fb5afda0a39384

Comment by Gerrit Updater [ 15/Jan/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12163/
Subject: LU-5651: ptlrpc: fix import state during replay
Project: fs/lustre-release
Branch: b2_5
Current Patch Set:
Commit: 5748379d28846c672f793ba1f8e143e63531dd05

Comment by Jian Yu [ 15/Jan/15 ]

Patches landed to master and b2_5 branches.

Generated at Sat Feb 10 01:53:20 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.