[LU-1717] mdt_recovery.c:611:mdt_steal_ack_locks()) Resent req xid XXX has mismatched opc: new 101 old 0
Created: 07/Aug/12  Updated: 12/Jun/14  Resolved: 15/Oct/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: Lustre 2.4.0

Type: Bug
Priority: Minor
Reporter: Cliff White (Inactive)
Assignee: Li Wei (Inactive)
Resolution: Fixed
Votes: 0
Labels: topsequoia
Environment:

LLNL Hyperion, CHAOS 5 servers/clients, Lustre 2.2.92


Issue Links:
Duplicate
Related
is related to LU-2187 Why are we losing messages? Resolved
Severity: 3
Rank (Obsolete): 6355

 Description   

Running SWL tests with a mix of workloads (IOR, mdtest, simul, mir, fdtree).
Seeing this sequence repeatedly. A Lustre dump was sent to the FTP site. File Name: lu-1442.dump.gz

Aug 7 09:51:29 ehyperion-rst6 kernel: LustreError: 29701:0:(mdt_recovery.c:611:mdt_steal_ack_locks()) Resent req xid 1409327760945603 has mismatched opc: new 101 old 0
Aug 7 09:51:29 ehyperion-rst6 kernel: LustreError: 29701:0:(mdt_recovery.c:611:mdt_steal_ack_locks()) Skipped 5 previous similar messages
Aug 7 09:51:29 ehyperion-rst6 kernel: Lustre: 29701:0:(mdt_recovery.c:622:mdt_steal_ack_locks()) Stealing 1 locks from rs ffff8802d1a96000 x1409327760945603.t537972417766 o0 NID 192.168.117.9@o2ib1
Aug 7 09:51:29 ehyperion-rst6 kernel: Lustre: 29701:0:(mdt_recovery.c:622:mdt_steal_ack_locks()) Skipped 5 previous similar messages
Aug 7 09:51:29 ehyperion-rst6 kernel: Lustre: 4710:0:(service.c:2095:ptlrpc_handle_rs()) All locks stolen from rs ffff8802d1a96000 x1409327760945603.t537972417766 o0 NID 192.168.117.9@o2ib1



 Comments   
Comment by Oleg Drokin [ 07/Aug/12 ]

I think this might be a case of improper init:

in target_send_reply() we have:

rs->rs_opc = lustre_msg_get_opc(rs->rs_msg);

rs->rs_msg does not seem to be properly initialized at that point (it points to a reply buffer that has not been filled in yet).

So it should probably be lustre_msg_get_opc(req->rq_reqmsg);

The Stealing..../Stolen... messages should be silenced.

Technically, these messages should only be seen when a reply is lost en route to the client.
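
The initialization problem described in this comment can be sketched with mock types. This is a minimal illustration, not the Lustre source: struct mock_msg, struct mock_reply_state, struct mock_request, and the mock_* functions are hypothetical stand-ins, and the zeroed reply buffer is an assumption chosen to match the "old 0" in the log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the real Lustre structures. */
struct mock_msg {
    uint32_t opc;                 /* operation code carried in the message */
};

struct mock_reply_state {
    struct mock_msg *rs_msg;      /* reply buffer, possibly not filled in yet */
    uint32_t         rs_opc;      /* opcode remembered for resend checks */
};

struct mock_request {
    struct mock_msg         *rq_reqmsg;      /* request as sent by the client */
    struct mock_reply_state *rq_reply_state; /* saved state for the reply */
};

static uint32_t mock_msg_get_opc(const struct mock_msg *msg)
{
    assert(msg != NULL);
    return msg->opc;
}

/* Pattern questioned in the comment: the opcode is read from the reply
 * buffer, which may not have been filled in at this point. */
static void mock_save_opc_from_reply(struct mock_request *req)
{
    struct mock_reply_state *rs = req->rq_reply_state;

    rs->rs_opc = mock_msg_get_opc(rs->rs_msg);
}

/* Pattern suggested in the comment: the opcode is read from the request
 * message, which is already complete when the reply is sent. */
static void mock_save_opc_from_request(struct mock_request *req)
{
    struct mock_reply_state *rs = req->rq_reply_state;

    rs->rs_opc = mock_msg_get_opc(req->rq_reqmsg);
}
```

With a request opcode of 101 and a zero-filled reply buffer, the first variant stores 0 and the second stores 101, which would be consistent with the "new 101 old 0" pair in the reported error.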

Comment by Christopher Morrone [ 09/Oct/12 ]

We are also seeing this repeatedly.

Comment by Christopher Morrone [ 09/Oct/12 ]

In particular, we are still seeing this in newer master code at 2.3.53-3chaos.

Comment by Jay Lan (Inactive) [ 09/Oct/12 ]

We are seeing this repeatedly on our 2.1.2 and 2.1.3 servers, but I cannot pin this to a particular reported problem.

Comment by Oleg Drokin [ 12/Oct/12 ]

You would see this specifically when a reply from the server to the client was lost and the client resent the request.
The message is harmless (and wrong, and will be fixed).

Comment by Christopher Morrone [ 12/Oct/12 ]

Oleg, if your assumption about lost replies is correct, then I think we have a bigger problem here. We do not have lnet routers on Sequoia so we should have a reliable communication fabric.

How are we losing messages so often??

Comment by Oleg Drokin [ 12/Oct/12 ]

I don't really know how you are losing the messages.

The resent xid can only occur if a reply was not seen by the client and the client decided to resend the request (there probably should be a client-side message about that too).
The specific message you see can only happen when that lost reply was a so-called "difficult" reply, one where a lock is being returned to the client.
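
The resend path described above can be sketched as follows. This is a hedged illustration built only from the log lines in the description, not the actual mdt_steal_ack_locks() code: struct mock_saved_reply, struct mock_steal_result, and mock_steal_ack_locks() are hypothetical names, and the flat array of saved replies is an assumption made to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record of a "difficult" reply the server is still holding. */
struct mock_saved_reply {
    uint64_t xid;     /* transfer id of the original request */
    uint32_t opc;     /* opcode saved when the reply was sent */
    int      nlocks;  /* locks held until the client acknowledges the reply */
};

struct mock_steal_result {
    int stolen;        /* number of locks taken over by the resent request */
    int opc_mismatch;  /* 1 if the saved opcode differs from the resent one */
};

/* On a resend, find the saved reply with the same xid, warn if the opcodes
 * disagree, and move its locks to the resent request. */
static struct mock_steal_result
mock_steal_ack_locks(struct mock_saved_reply *saved, size_t count,
                     uint64_t xid, uint32_t new_opc)
{
    struct mock_steal_result res = { 0, 0 };
    size_t i;

    assert(saved != NULL || count == 0);

    for (i = 0; i < count; i++) {
        struct mock_saved_reply *rs = &saved[i];

        if (rs->xid != xid)
            continue;

        if (rs->opc != new_opc) {
            fprintf(stderr,
                    "Resent req xid %llu has mismatched opc: new %u old %u\n",
                    (unsigned long long)xid, (unsigned)new_opc,
                    (unsigned)rs->opc);
            res.opc_mismatch = 1;
        }

        /* The locks are stolen even on a mismatch, as in the reported log. */
        res.stolen = rs->nlocks;
        rs->nlocks = 0;
        break;
    }

    return res;
}
```

In this sketch, a saved reply whose opcode was recorded as 0 triggers the warning when the request is resent with opcode 101, while the lock handover itself still proceeds. That matches the ticket's observation that the message is harmless.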

Comment by Li Wei (Inactive) [ 15/Oct/12 ]

http://review.whamcloud.com/4271

This does what Oleg suggested.

Comment by Ian Colle (Inactive) [ 15/Oct/12 ]

Patch landed
