[LU-1717] mdt_recovery.c:611:mdt_steal_ack_locks()) Resent req xid XXX has mismatched opc: new 101 old 0 Created: 07/Aug/12 Updated: 12/Jun/14 Resolved: 15/Oct/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Cliff White (Inactive) | Assignee: | Li Wei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | topsequoia | ||
| Environment: |
LLNL Hyperion, CHAOS 5 servers/clients, Lustre 2.2.92 |
||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 6355 | ||||||||||||
| Description |
|
Running SWL tests with a mix of workloads (IOR, mdtest, simul, mir, fdtree), the MDS logged:
Aug 7 09:51:29 ehyperion-rst6 kernel: LustreError: 29701:0:(mdt_recovery.c:611:mdt_steal_ack_locks()) Resent req xid 1409327760945603 has mismatched opc: new 101 old 0 |
| Comments |
| Comment by Oleg Drokin [ 07/Aug/12 ] |
|
I think this might be a case of improper initialization. In target_send_reply() we have:
    rs->rs_opc = lustre_msg_get_opc(rs->rs_msg);
At that point rs->rs_msg does not seem to be properly initialized yet (it points to a reply buffer that has not been filled in), so this should probably be:
    lustre_msg_get_opc(req->rq_reqmsg);
The Stealing.../Stolen... messages should be silenced. Technically, these messages should only be seen when a reply is lost en route to the client.
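The diagnosis above can be illustrated with a minimal, self-contained sketch. Everything in it is a hypothetical stand-in (struct msg, struct request, struct reply_state and the helper functions are not the real Lustre types or APIs), and it assumes the unpacked reply buffer is zero-filled, which is consistent with the "old 0" in the log but is not confirmed by the ticket.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; not the real Lustre structures. */
struct msg {
    uint32_t opc;                /* operation code carried by the message */
};

struct request {
    uint64_t   xid;              /* transfer id, reused when the client resends */
    struct msg reqmsg;           /* request message, filled in by the client */
    struct msg repmsg;           /* reply buffer, not packed yet in this sketch */
};

struct reply_state {
    uint64_t xid;
    uint32_t opc;                /* opcode remembered for later resend checks */
};

static uint32_t msg_get_opc(const struct msg *m)
{
    assert(m != NULL);
    return m->opc;
}

/* Suspect pattern: record the opcode from the (still empty) reply buffer. */
static void save_opc_from_reply(struct reply_state *rs, const struct request *req)
{
    rs->xid = req->xid;
    rs->opc = msg_get_opc(&req->repmsg);
}

/* Suggested pattern: record the opcode from the request message. */
static void save_opc_from_request(struct reply_state *rs, const struct request *req)
{
    rs->xid = req->xid;
    rs->opc = msg_get_opc(&req->reqmsg);
}

/* Returns 1 when a resent request matches the saved state, 0 otherwise. */
static int resend_opc_matches(const struct reply_state *rs,
                              const struct request *resent)
{
    return rs->xid == resent->xid &&
           rs->opc == msg_get_opc(&resent->reqmsg);
}

/* Builds a request whose reply buffer is zero-filled (assumption). */
static struct request make_request(uint64_t xid, uint32_t opc)
{
    struct request req;

    memset(&req, 0, sizeof(req));
    req.xid = xid;
    req.reqmsg.opc = opc;
    return req;
}
```

In this sketch, a request built with make_request(xid, 101) and saved through save_opc_from_reply() records opcode 0, so a later resend carrying 101 fails resend_opc_matches(), mirroring "new 101 old 0". Saving through save_opc_from_request() records 101 and the resend matches.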
| Comment by Christopher Morrone [ 09/Oct/12 ] |
|
We are also seeing this repeatedly. |
| Comment by Christopher Morrone [ 09/Oct/12 ] |
|
In particular, we are still seeing this in newer master code at 2.3.53-3chaos. |
| Comment by Jay Lan (Inactive) [ 09/Oct/12 ] |
|
We are seeing this repeatedly on our 2.1.2 and 2.1.3 servers, but I cannot pin it to a particular reported problem. |
| Comment by Oleg Drokin [ 12/Oct/12 ] |
|
The situation in which you would see this is when a reply from the server to the client was lost and the client resent the request. |
| Comment by Christopher Morrone [ 12/Oct/12 ] |
|
Oleg, if your assumption about lost replies is correct, then I think we have a bigger problem here. We do not have lnet routers on Sequoia so we should have a reliable communication fabric. How are we losing messages so often?? |
| Comment by Oleg Drokin [ 12/Oct/12 ] |
|
I don't really know how you are losing the messages. A resent xid can only occur if the client did not see a reply and decided to resend the request (there should probably be a client-side message about that too). |
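The resend condition described here can be sketched as a toy model: the server only sees the same xid twice when the client never received the first reply. This is an illustration under stated assumptions, not Lustre's actual resend detection. The names struct server, server_handle() and client_send() and the fixed-size table are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SEEN_MAX 16              /* arbitrary table size for this sketch */

/* Hypothetical server state: xids already processed, plus a resend counter. */
struct server {
    uint64_t seen[SEEN_MAX];
    int      nseen;
    int      resends;
};

enum handled { HANDLED_NEW, HANDLED_RESEND };

/* Processes one incoming request and classifies it by xid. */
static enum handled server_handle(struct server *s, uint64_t xid)
{
    int i;

    assert(s != NULL);
    for (i = 0; i < s->nseen; i++) {
        if (s->seen[i] == xid) {
            s->resends++;        /* same xid again: the client resent it */
            return HANDLED_RESEND;
        }
    }
    if (s->nseen < SEEN_MAX)
        s->seen[s->nseen++] = xid;
    return HANDLED_NEW;
}

/*
 * Client side: send once; if the reply is lost, resend with the same xid.
 * Returns the number of transmissions.
 */
static int client_send(struct server *s, uint64_t xid, int reply_lost)
{
    int sends = 1;

    server_handle(s, xid);
    if (reply_lost) {
        server_handle(s, xid);   /* no reply seen, so retry the same xid */
        sends++;
    }
    return sends;
}
```

In this model a delivered reply never increments the resend counter, so on a fabric that does not drop messages the counter would be expected to stay at zero. That is the crux of the question raised in the thread.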
| Comment by Li Wei (Inactive) [ 15/Oct/12 ] |
|
http://review.whamcloud.com/4271 This does what Oleg suggested. |
| Comment by Ian Colle (Inactive) [ 15/Oct/12 ] |
|
Patch landed |