[LU-6928] Version mismatch during DNE replay Created: 30/Jul/15 Updated: 27/Aug/15 Resolved: 27/Aug/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Di Wang | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | | |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
During a 24-hour DNE failover test, one of the clients fails because of a version mismatch during replay:

Lustre: 7977:0:(client.c:2828:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
  req@ffff8806698209c0 x1508081941826144/t21475879788(21475879788) o36->lustre-MDT0001-mdc-ffff880821cfdc00@192.168.2.126@o2ib:12/10 lens 608/424 e 1 to 0 dl 1438262329 ref 2 fl Interpret:R/4/0 rc -75/-75
Lustre: 7977:0:(import.c:1301:completed_replay_interpret()) lustre-MDT0001-mdc-ffff880821cfdc00: version recovery fails, reconnecting
LustreError: 167-0: lustre-MDT0001-mdc-ffff880821cfdc00: This client was evicted by lustre-MDT0001; in progress operations using this service will fail.
LustreError: 9213:0:(vvp_io.c:1475:vvp_io_init()) lustre: refresh file layout [0x240002341:0x1d74:0x0] error -5.
LustreError: 29913:0:(lmv_obd.c:1332:lmv_fid_alloc()) Can't alloc new fid, rc -19
Lustre: lustre-MDT0001-mdc-ffff880821cfdc00: Connection restored to lustre-MDT0001 (at 192.168.2.126@o2ib)
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=43433 DURATION=86400 PERIOD=1800
Lustre: DEBUG MARKER: Client load failed on node c01, rc=1
Lustre: DEBUG MARKER: Duration: 86400
Lustre: Unmounted lustre-client |
| Comments |
| Comment by James Nunez (Inactive) [ 30/Jul/15 ] |
|
Di - I'm seeing a very similar error in replay-single test 48 in review-dne-part-2. Do you think this is the same issue? The logs are at: https://testing.hpdd.intel.com/test_sets/8281843c-365f-11e5-830b-5254006e85c2

The client console shows:

18:10:31:Lustre: DEBUG MARKER: == replay-single test 48: MDS->OSC failure during precreate cleanup (2824) == 18:08:48 (1438193328)
18:10:31:Lustre: DEBUG MARKER: mcreate /mnt/lustre/fsa-$(hostname); rm /mnt/lustre/fsa-$(hostname)
18:10:31:Lustre: DEBUG MARKER: if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-$(hostname); rm /mnt/lustre2/fsa-$(hostname); fi
18:10:31:Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
18:10:31:LustreError: 7733:0:(client.c:2816:ptlrpc_replay_interpret()) request replay timed out, restarting recovery
18:10:31:Lustre: 7733:0:(client.c:2828:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
18:10:31: req@ffff8800379e8c80 x1508053244555028/t197568495717(197568495717) o36->lustre-MDT0000-mdc-ffff88007b65b000@10.1.4.62@tcp:12/10 lens 640/416 e 2 to 0 dl 1438193365 ref 1 fl Interpret:R/6/0 rc -75/-75
18:10:31:Lustre: 7733:0:(import.c:1306:completed_replay_interpret()) lustre-MDT0000-mdc-ffff88007b65b000: version recovery fails, reconnecting
18:10:31:LustreError: 167-0: lustre-MDT0000-mdc-ffff88007b65b000: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
18:10:31:LustreError: Skipped 1 previous similar message
18:10:31:LustreError: 13949:0:(lmv_obd.c:1473:lmv_statfs()) can't stat MDS #0 (lustre-MDT0000-mdc-ffff88007b65b000), error -5
18:10:31:LustreError: 13949:0:(llite_lib.c:1707:ll_statfs_internal()) md_statfs fails: rc = -5
18:10:31:Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-single test_48: @@@@@@ FAIL: client_up failed
18:10:31:Lustre: DEBUG MARKER: replay-single test_48: @@@@@@ FAIL: client_up failed |
| Comment by Di Wang [ 30/Jul/15 ] |
|
Yes, it looks like the same issue. I believe it is a bug caused by the multiple-slot patch. I will cook a patch. |
| Comment by Di Wang [ 30/Jul/15 ] |
|
Hmm, I think this is what happened during replay:

1. The client sends a replay request to MDS01, and MDS01 does not handle it in time.

2. The expired request is not counted when the client looks for the lowest unreplied XID, because that scan only walks the delayed and sending lists:

        /* find the lowest unreplied XID */
        list_for_each(tmp, &imp->imp_delayed_list) {
                struct ptlrpc_request *r;

                r = list_entry(tmp, struct ptlrpc_request, rq_list);
                if (r->rq_xid < min_xid)
                        min_xid = r->rq_xid;
        }
        list_for_each(tmp, &imp->imp_sending_list) {
                struct ptlrpc_request *r;

                r = list_entry(tmp, struct ptlrpc_request, rq_list);
                if (r->rq_xid < min_xid)
                        min_xid = r->rq_xid;
        }

   while the req is deleted from the sending list after the req failure:

        /* Request already may be not on sending or delaying list. This
         * may happen in the case of marking it erroneous for the case
         * ptlrpc_import_delay_req(req, status) find it impossible to
         * allow sending this rpc and returns *status != 0. */
        if (!list_empty(&req->rq_list)) {
                list_del_init(&req->rq_list);
                atomic_dec(&imp->imp_inflight);
        }
        spin_unlock(&imp->imp_lock);

3. And the MDT handles that replay request (mentioned in 1) at this time, which adds the lrd of this req into the reply_list.

So the fix might be in step 2, i.e. we need to consider the expired request when finding this min XID, or we just do not pack this min XID in the CONNECT request.
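A minimal sketch of the first option (also counting the expired replay request when computing the lowest unreplied XID), assuming the expired request is still linked on the import's replay list via rq_replay_list; the helper name ptlrpc_lowest_unreplied_xid is made up for illustration and is not the code in http://review.whamcloud.com/15812:

        /* Sketch only: walk the replay list in addition to the delayed
         * and sending lists, so a replay request that was dropped from
         * the sending list after a timeout is still counted. */
        static __u64 ptlrpc_lowest_unreplied_xid(struct obd_import *imp)
        {
                struct list_head *tmp;
                __u64 min_xid = ULLONG_MAX;

                assert_spin_locked(&imp->imp_lock);

                /* requests queued but not yet sent */
                list_for_each(tmp, &imp->imp_delayed_list) {
                        struct ptlrpc_request *r;

                        r = list_entry(tmp, struct ptlrpc_request, rq_list);
                        if (r->rq_xid < min_xid)
                                min_xid = r->rq_xid;
                }

                /* requests currently in flight */
                list_for_each(tmp, &imp->imp_sending_list) {
                        struct ptlrpc_request *r;

                        r = list_entry(tmp, struct ptlrpc_request, rq_list);
                        if (r->rq_xid < min_xid)
                                min_xid = r->rq_xid;
                }

                /* Also consider unreplied replay requests: an expired
                 * replay request was removed from the sending list by
                 * list_del_init(&req->rq_list), but it is still linked
                 * here and has not been replied to yet. */
                list_for_each(tmp, &imp->imp_replay_list) {
                        struct ptlrpc_request *r;

                        r = list_entry(tmp, struct ptlrpc_request,
                                       rq_replay_list);
                        if (!r->rq_replied && r->rq_xid < min_xid)
                                min_xid = r->rq_xid;
                }

                return min_xid;
        }

This is only meant to illustrate the "consider the expired request" idea; the actual fix is in the patches referenced below. |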
| Comment by Gerrit Updater [ 30/Jul/15 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15812 |
| Comment by Di Wang [ 14/Aug/15 ] |
|
Niu: could you please confirm if your patch can fix this problem? Thanks |
| Comment by Niu Yawei (Inactive) [ 27/Aug/15 ] |
|
Yes, the patch for |
| Comment by Di Wang [ 27/Aug/15 ] |
|
Duplicate with |