[LU-5629] osp_sync_interpret() ASSERTION( rc || req->rq_transno ) failed Created: 16/Sep/14 Updated: 12/Jan/18 Resolved: 12/Jan/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Christopher Morrone | Assignee: | Dmitry Eremin (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | llnl | ||
| Environment: | Lustre 2.4.2-14chaos (see github.com/chaos/lustre) |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 15744 |
| Description |
|
One of our MDS nodes crashed today with the following assertion:

client.c:304:ptlrpc_at_adj_net_latency()) Reported service time 548 > total measured time 165
osp_sync.c:355:osp_sync_interpret()) ASSERTION( rc || req->rq_transno ) failed

Note that the two messages above were printed in the same second (as reported by syslog) and by the same kernel thread. I don't know whether the ptlrpc_at_adj_net_latency() message is actually related to the assertion, but the proximity makes it worth noting. There were a few OSTs to which the MDS lost and reestablished a connection a couple of minutes earlier in the log.

The backtrace was:

panic
lbug_with_loc
osp_sync_interpret
ptlrpc_check_set
ptlrpcd_check
ptlrpcd
kernel_thread

It was running Lustre version 2.4.2-14chaos (see github.com/chaos/lustre). We cannot provide logs or crash dumps for this machine. |
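To make the failed invariant concrete, here is a minimal standalone sketch (not the actual Lustre source; the struct and function names are simplified assumptions based on the messages above): osp_sync_interpret() expects a completed sync RPC to carry either a non-zero return code or a server-assigned transno, and the LBUG fires when a reply reports success without a transno.

```c
/*
 * Illustrative only -- a simplified stand-in for the invariant checked at
 * osp_sync.c:355 in osp_sync_interpret().  Field and function names here
 * are assumptions for the sketch, not the real Lustre definitions.
 */
#include <assert.h>
#include <stdio.h>

struct fake_request {
        int       rq_status;   /* rc returned for the RPC */
        long long rq_transno;  /* transno assigned by the target, 0 if none */
};

/* Completion callback: a successful reply must carry a transno. */
static void sync_interpret(const struct fake_request *req, int rc)
{
        /* The LBUG fires when rc == 0 and rq_transno == 0 at the same time. */
        assert(rc || req->rq_transno);
        printf("rc=%d transno=%lld\n", rc, req->rq_transno);
}

int main(void)
{
        struct fake_request ok  = { .rq_status = 0, .rq_transno = 42 };
        struct fake_request bad = { .rq_status = 0, .rq_transno = 0 };

        sync_interpret(&ok, ok.rq_status);    /* fine */
        sync_interpret(&bad, bad.rq_status);  /* reproduces the assertion */
        return 0;
}
```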
| Comments |
| Comment by Christopher Morrone [ 16/Sep/14 ] |
|
In |
| Comment by Liang Zhen (Inactive) [ 16/Sep/14 ] |
|
FYI, I think a possible cause of the ptlrpc_at_adj_net_latency() warning is that the early reply was lost, so the RPC expired and the client (the OSP on the MDS in this case) resent the request. Because the reply to the original request (on the server) still uses the same rq_xid as the reply match bits, it can fit into the reposted reply buffer. If this happened, the service time returned by the original reply can be longer than the execution time of the resent RPC. I'm not sure whether this is relevant to the assertion, but we should at least remove this warning and only report it as debug info. |
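To make the mismatch concrete, here is a minimal standalone sketch (not Lustre code; the timeline values are invented, chosen only so they reproduce the 548 vs. 165 numbers from the description): when the server's reply to the original request is matched into the resent request's reply buffer, the service time it reports can exceed the round-trip time the client measured for the resend.

```c
/*
 * Illustrative timeline for the resend scenario described above; all
 * timestamps are made-up assumptions, not values taken from Lustre.
 */
#include <stdio.h>

int main(void)
{
        /* Original RPC sent at t=0; the early reply is lost, so the client
         * times the request out and resends it at t=400. */
        long resend_sent   = 400;
        long reply_arrived = 565;  /* reply to the ORIGINAL request, matched
                                    * into the resend's buffer via rq_xid */

        long measured = reply_arrived - resend_sent;  /* 165 on the client */
        long reported = 548;  /* service time carried by the original reply */

        if (reported > measured)
                printf("Reported service time %ld > total measured time %ld\n",
                       reported, measured);
        return 0;
}
```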
| Comment by Peter Jones [ 16/Sep/14 ] |
|
Dmitry is looking into this one |
| Comment by Dmitry Eremin (Inactive) [ 08/Oct/14 ] |
|
The fix was landed in 2.4.2 as commit e12b89a9e7d8409c2b624162760c2e7e3481d7be. |
| Comment by Peter Jones [ 08/Oct/14 ] |
|
Duplicate of LU-3892 |
| Comment by Christopher Morrone [ 17/Apr/15 ] |
|
I am reopening this ticket because it does not appear that the issue was resolved as previously believed. We are still seeing the same assertion with Lustre 2.5.3, which contains the patch from |
| Comment by James A Simmons [ 19/Nov/15 ] |
|
Looks like we just hit this on our 2.5.3+ production file system. |
| Comment by Lance Weems [ 23/May/16 ] |
|
Wanted to report we hit this over the weekend on our 2.5.5 production file system here at LLNL. |
| Comment by Ruth Klundt (Inactive) [ 07/Jul/16 ] |
|
Add one more at SNL last night, also running 2.5.5. |
| Comment by Alex Zhuravlev [ 08/Jul/16 ] |
|
any logs/dumps? |
| Comment by Ruth Klundt (Inactive) [ 08/Jul/16 ] |
|
Attached server syslogs. The stack dumps were not captured, unfortunately, but the location for that collection is mounted now in case it happens again. There had been network issues earlier in the day, reportedly resolved by 4pm. FYI, the number of clients on the fs is currently 6395. And the exact version of the software is |
| Comment by Cameron Harr [ 04/Aug/16 ] |
|
Saw the same crash on 8/03/16. We have a 16GB dump available if need be. Not sure how related it is, but an OSS node suffered major hardware problems (MCEs) throughout the 30 minutes before the LBUG on the MDS. The MDS console log messages directly (~2 min) before the assertion were an evict/reconnect to that OST. |
| Comment by Scott Kirvan (Inactive) [ 18/Oct/16 ] |
|
Exact same issue @ LANL, Lustre 2.5.5. |
| Comment by Dmitry Eremin (Inactive) [ 12/Jan/18 ] |
|
The patch https://review.whamcloud.com/30129/ should probably resolve this. |
| Comment by Peter Jones [ 12/Jan/18 ] |
|
Closing as a duplicate of |