[LU-5651] ASSERTION( req->rq_export->exp_lock_replay_needed ) failed Created: 23/Sep/14 Updated: 29/May/15 Resolved: 15/Jan/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.7.0, Lustre 2.5.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Andriy Skulysh | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 15837 |
| Description |
|
The client doesn't restore the import state correctly. The solution is to add imp_replay_state to store the last replay state. |
| Comments |
| Comment by Andriy Skulysh [ 23/Sep/14 ] |
| Comment by Niu Yawei (Inactive) [ 23/Sep/14 ] |
|
This looks dup of
If the final ping has been queued on the server, recovery on this export should already be finished (exp_in_recovery == 0) when the resent lock replay reaches the server. I don't see how that could trigger the assertion; could you explain it in detail? |
| Comment by Andriy Skulysh [ 23/Sep/14 ] |
|
Recovery ends only after the final pings from all clients have been received and processed. It can happen that the server has accepted the final ping from client1 but is still waiting for requests from client2 when client1 reconnects. This situation was simulated in the test by adding a timeout before processing final pings. |
| Comment by Niu Yawei (Inactive) [ 23/Sep/14 ] |
I was saying that recovery on this export is done (exp_in_recovery == 0). |
| Comment by Andriy Skulysh [ 23/Sep/14 ] |
|
exp_in_recovery is zeroed during the final recovery stage (stage 3), but client1 reconnects during the lock replay stage (stage 2). |
| Comment by Andriy Skulysh [ 23/Sep/14 ] |
|
step by step explanation: |
| Comment by Niu Yawei (Inactive) [ 23/Sep/14 ] |
Is the final ping from client1 still in the queue? If it is, client1 can't reconnect because there is an inflight RPC; if the final ping has been processed, exp_in_recovery should have been cleared. |
| Comment by Andriy Skulysh [ 23/Sep/14 ] |
|
No. The request times out waiting for the reply, and the client reconnects. |
| Comment by Niu Yawei (Inactive) [ 24/Sep/14 ] |
Could you explain how client1 can reconnect while the final ping is in the queue (the server would reject the reconnect because the export has an inflight RPC)? Is there a defect in the reconnect path? |
| Comment by Andriy Skulysh [ 26/Sep/14 ] |
|
I don't understand why the reconnect should be rejected. The only check for inflight RPCs on an export is:
no_export:
        OBD_FAIL_TIMEOUT(OBD_FAIL_TGT_DELAY_CONNECT, 2 * obd_timeout);
} else if (req->rq_export == NULL &&
           atomic_read(&export->exp_rpc_count) > 0) {
        LCONSOLE_WARN("%s: Client %s (at %s) refused connection, "
                      "still busy with %d references\n",
                      target->obd_name, cluuid.uuid,
                      libcfs_nid2str(req->rq_peer.nid),
                      atomic_read(&export->exp_refcount));
        GOTO(out, rc = -EBUSY);
but a reconnect request has a valid export handle. |
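To make the point about the busy check concrete, here is a minimal standalone sketch of just that one condition. It is not Lustre code: `toy_export`, `toy_connect_req`, and `toy_check_reconnect` are hypothetical stand-ins, and the real connect path does much more. It only illustrates that a connect request is refused with -EBUSY when it carries no export handle and the export still has RPCs in flight, so a reconnect that carries a valid handle passes this check.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a server-side export. */
struct toy_export {
	int exp_rpc_count;	/* RPCs still being processed for this export */
};

/* Hypothetical, simplified stand-in for an incoming connect request. */
struct toy_connect_req {
	/* Non-NULL when the request carries a valid export handle. */
	struct toy_export *rq_export;
};

/*
 * Models only the quoted branch: refuse with -EBUSY when the request has
 * no export handle and the existing export still has RPCs in flight.
 * Any other combination passes this particular check.
 */
static int toy_check_reconnect(const struct toy_connect_req *req,
			       const struct toy_export *export)
{
	assert(req != NULL && export != NULL);

	if (req->rq_export == NULL && export->exp_rpc_count > 0)
		return -EBUSY;

	return 0;
}
```

In this model, a request with `rq_export == NULL` against an export with one queued RPC is refused, while the same export accepts a request that carries its handle, which is the case described above for client1.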
| Comment by Niu Yawei (Inactive) [ 26/Sep/14 ] |
|
Ah, my mistake, the code has been changed by |
| Comment by Jason Hill (Inactive) [ 08/Oct/14 ] |
|
ORNL hit this issue today in production after upgrading to Lustre 2.5.2 on both server and client. |
| Comment by James Nunez (Inactive) [ 04/Dec/14 ] |
|
Patch for b2_5 at http://review.whamcloud.com/#/c/12163/ The master patch has landed. Is there more work needed to complete this ticket or should it be closed? |
| Comment by James A Simmons [ 04/Dec/14 ] |
|
I believe the replay-single test needs to be updated to check Lustre versions so that interop testing passes. |
| Comment by Gerrit Updater [ 05/Dec/14 ] |
|
James Simmons (uja.ornl@gmail.com) uploaded a new patch: http://review.whamcloud.com/12942 |
| Comment by Gerrit Updater [ 18/Dec/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12942/ |
| Comment by Gerrit Updater [ 15/Jan/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12163/ |
| Comment by Jian Yu [ 15/Jan/15 ] |
|
Patches landed to master and b2_5 branches. |