Details
Type: Bug
Resolution: Unresolved
Priority: Critical
Description
While testing LU-10428 I hit an assertion, but Amir tells me this issue is not limited to LU-10428 and has also been seen on a real OPA system. The collected logs point to a long-standing bug related to the 4MB I/O landing. That change replaced bd_success with bd_failure and introduced a second bug in this area (the first one is the client part of LU-11169).
The problem is simple.
The server-side bulk PUT is issued with LNET_ACK_REQ, so each transfer generates two events:
int ptlrpc_start_bulk_transfer(struct ptlrpc_bulk_desc *desc)
..
	/* Network is about to get at the memory */
	if (ptlrpc_is_bulk_put_source(desc->bd_type))
		rc = LNetPut(self_nid, desc->bd_mds[posted_md],
			     LNET_ACK_REQ, peer_id,
			     desc->bd_portal, mbits, 0, 0);
So there are two LNet events per MD (a SEND and an ACK), but the callback does this:
void server_bulk_callback(struct lnet_event *ev)
{
	...
	if (ev->unlinked) {
		desc->bd_md_count--;
		/* This is the last callback no matter what... */
		if (desc->bd_md_count == 0)
			wake_up(&desc->bd_waitq);
	}
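Oops: bd_md_count is decremented twice for the same MD, once for the SEND event and once for the ACK event. In the log below, event type 5 is the SEND and event type 4 is the ACK, both for the same desc: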
00000100:00000010:0.0:1533747855.090799:0:24663:0:(client.c:130:ptlrpc_new_bulk()) kmalloced 'desc': 416 at ffff88006080c800.
00000100:00000200:0.0:1533747855.091779:0:21701:0:(events.c:449:server_bulk_callback()) event type 5, status 0, desc ffff88006080c800
00000100:00000200:1.0:1533747855.091788:0:21700:0:(events.c:453:server_bulk_callback()) event type 4, status 0, desc ffff88006080c800
00000100:00040000:0.0:1533747855.091796:0:21701:0:(events.c:453:server_bulk_callback()) ASSERTION( desc->bd_md_count > 0 ) failed:
So it looks like we should not trust ev->unlinked on its own (the buffer is unlinked after the send), but should instead wait for the ACK if one is still expected.
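To make the double decrement and the proposed direction concrete, here is a minimal user-space sketch. It is a toy model, not Lustre or LNet code: every `toy_*` name, the `want_ack`/`send_done`/`ack_done` flags, and the assumption that both events arrive with `unlinked` set (as this ticket reads the log) are illustrative only. The naive callback mirrors the pattern in the excerpt above; the second variant releases the MD exactly once, after the last expected event.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy event kinds; illustrative names, not the real LNet enum. */
enum toy_event_type { TOY_EVENT_SEND, TOY_EVENT_ACK };

struct toy_event {
	enum toy_event_type type;
	bool unlinked;		/* assumed set on both events, per the ticket */
};

/* Toy descriptor tracking a single MD (hypothetical fields). */
struct toy_desc {
	int md_count;		/* MDs still outstanding */
	int underflows;		/* decrements attempted at zero */
	bool want_ack;		/* the PUT asked for an ACK */
	bool send_done;		/* SEND event seen */
	bool ack_done;		/* ACK event seen */
};

/* Pattern from the excerpt: decrement on every unlinked event. */
static void toy_callback_naive(struct toy_desc *d, const struct toy_event *ev)
{
	if (ev->unlinked) {
		if (d->md_count == 0)
			d->underflows++; /* the real code would assert here */
		else
			d->md_count--;
	}
}

/* Sketch of the proposal: ignore ev->unlinked, wait for the ACK if needed. */
static void toy_callback_wait_ack(struct toy_desc *d, const struct toy_event *ev)
{
	if (ev->type == TOY_EVENT_SEND)
		d->send_done = true;
	else
		d->ack_done = true;

	/* Release the MD only after the final expected event. */
	if (d->send_done && (d->ack_done || !d->want_ack)) {
		assert(d->md_count > 0);
		d->md_count--;
	}
}

/* Deliver SEND then ACK to the naive callback; return the underflow count. */
static int toy_run_naive(void)
{
	struct toy_desc d = { .md_count = 1, .want_ack = true };
	struct toy_event send = { TOY_EVENT_SEND, true };
	struct toy_event ack = { TOY_EVENT_ACK, true };

	toy_callback_naive(&d, &send);
	toy_callback_naive(&d, &ack);
	return d.underflows;
}

/* Deliver both events in either order; return the final md_count. */
static int toy_run_wait_ack(bool ack_first)
{
	struct toy_desc d = { .md_count = 1, .want_ack = true };
	struct toy_event send = { TOY_EVENT_SEND, true };
	struct toy_event ack = { TOY_EVENT_ACK, true };

	toy_callback_wait_ack(&d, ack_first ? &ack : &send);
	toy_callback_wait_ack(&d, ack_first ? &send : &ack);
	return d.md_count;
}
```

In this model `toy_run_naive()` records one underflow, which corresponds to the failed `bd_md_count > 0` assertion, while `toy_run_wait_ack()` ends with a count of zero regardless of event order. A real fix would also have to handle error status and the case where no ACK ever arrives, which this sketch does not attempt.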