[LU-11229] server_bulk_callback()) ASSERTION( desc->bd_md_count > 0 ) failed Created: 09/Aug/18  Updated: 09/Aug/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Alexey Lyashkov Assignee: Alexey Lyashkov
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While testing LU-10428 I hit an assertion failure, but Amir tells me this issue is not limited to LU-10428 and has also been seen on a real OPA system. The collected logs point to a long-standing bug related to the 4MB I/O landing. That change replaced bd_success with bd_failure and introduced a second bug in this area (the first one is the client part of LU-11169).
The problem is simple.
The server bulk put requests an ACK (LNET_ACK_REQ), so it generates two events per transfer:

int ptlrpc_start_bulk_transfer(struct ptlrpc_bulk_desc *desc)
...
                /* Network is about to get at the memory */
                if (ptlrpc_is_bulk_put_source(desc->bd_type))
                        rc = LNetPut(self_nid, desc->bd_mds[posted_md],
                                     LNET_ACK_REQ, peer_id,
                                     desc->bd_portal, mbits, 0, 0);

So there are two LNet events per MD, but...

void server_bulk_callback(struct lnet_event *ev)
{
...
        if (ev->unlinked) {
                desc->bd_md_count--;
                /* This is the last callback no matter what... */
                if (desc->bd_md_count == 0)
                        wake_up(&desc->bd_waitq);
        }
...
}

Oops: bd_md_count is decremented twice, once for the SEND event (LNET_EVENT_SEND) and once for the ACK event (LNET_EVENT_ACK). Both events arrive for the same desc:

 00000100:00000010:0.0:1533747855.090799:0:24663:0:(client.c:130:ptlrpc_new_bulk()) kmalloced 'desc': 416 at ffff88006080c800.
 00000100:00000200:0.0:1533747855.091779:0:21701:0:(events.c:449:server_bulk_callback()) event type 5, status 0, desc ffff88006080c800
 00000100:00000200:1.0:1533747855.091788:0:21700:0:(events.c:449:server_bulk_callback()) event type 4, status 0, desc ffff88006080c800
 00000100:00040000:0.0:1533747855.091796:0:21701:0:(events.c:453:server_bulk_callback()) ASSERTION( desc->bd_md_count > 0 ) failed:
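
The double decrement described above can be reproduced in a toy model. This is only a sketch: every `toy_*` name is hypothetical, the numeric event values simply mirror the types printed in the log, and where the `bd_md_count > 0` check sits relative to the `unlinked` test is an assumption, not a copy of events.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; not the real Lustre/LNet types. */
enum toy_event_kind { TOY_EVENT_ACK = 4, TOY_EVENT_SEND = 5 };

struct toy_event {
	enum toy_event_kind type;
	bool unlinked;		/* models ev->unlinked */
};

struct toy_desc {
	int md_count;		/* models desc->bd_md_count */
	int assert_hits;	/* times the "md_count > 0" check would fire */
};

/* Same shape as the quoted callback: one decrement per unlinked event. */
static void toy_bulk_callback(struct toy_desc *desc, const struct toy_event *ev)
{
	assert(desc != NULL && ev != NULL);

	if (!ev->unlinked)
		return;

	if (desc->md_count <= 0) {
		/* The real code asserts here; the model only counts it. */
		desc->assert_hits++;
		return;
	}
	desc->md_count--;
}

/* Replay SEND then ACK against a descriptor that owns a single MD. */
int toy_replay(bool send_unlinked, bool ack_unlinked)
{
	struct toy_desc desc = { .md_count = 1, .assert_hits = 0 };
	struct toy_event send = { TOY_EVENT_SEND, send_unlinked };
	struct toy_event ack = { TOY_EVENT_ACK, ack_unlinked };

	toy_bulk_callback(&desc, &send);
	toy_bulk_callback(&desc, &ack);
	return desc.assert_hits;
}
```

In this model `toy_replay(true, true)` returns 1, because the second unlinked event finds the count already at zero, while `toy_replay(false, true)` returns 0, because only the final event is treated as the unlink.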

So it looks like we should not trust ev->unlinked alone (the buffer is unlinked after the send), but should instead wait for the ACK if one is still expected.
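
One possible reading of that suggestion, as a sketch only: count an MD as finished exactly once, when the SEND has been seen and the ACK (if one was requested) has also arrived, in either order. All `sketch_*` names and the per-MD state are hypothetical; this is not an actual Lustre patch, and error handling is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; values mirror the types seen in the log. */
enum sketch_event_kind { SKETCH_EVENT_ACK = 4, SKETCH_EVENT_SEND = 5 };

struct sketch_md {
	bool ack_requested;	/* the put was issued with an ACK request */
	bool send_seen;
	bool ack_seen;
	bool completed;		/* already counted against md_count */
};

struct sketch_desc {
	int md_count;		/* models desc->bd_md_count */
};

/* Returns true only for the event that completes the MD. */
static bool sketch_md_event(struct sketch_desc *desc, struct sketch_md *md,
			    enum sketch_event_kind type)
{
	assert(desc != NULL && md != NULL);

	if (type == SKETCH_EVENT_SEND)
		md->send_seen = true;
	else if (type == SKETCH_EVENT_ACK)
		md->ack_seen = true;

	if (md->completed)
		return false;
	if (!md->send_seen || (md->ack_requested && !md->ack_seen))
		return false;	/* still waiting for the other event */

	md->completed = true;
	desc->md_count--;
	return true;
}

/* Feed n events to a one-MD descriptor and return the remaining count. */
int sketch_md_count_after(bool ack_requested,
			  const enum sketch_event_kind *events, size_t n)
{
	struct sketch_desc desc = { .md_count = 1 };
	struct sketch_md md = { .ack_requested = ack_requested };
	size_t i;

	for (i = 0; i < n; i++)
		sketch_md_event(&desc, &md, events[i]);
	return desc.md_count;
}
```

With an ACK requested, the count drops to zero only after both events, regardless of order. Without an ACK, the SEND alone completes the MD. A real change would also have to handle a failed send, after which no ACK will ever arrive; the sketch does not model that.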



 Comments   
Comment by Alexey Lyashkov [ 09/Aug/18 ]

I am not sure why the ACK is needed in this case. I think it is just additional overhead, if enabled correctly, and the server is able to handle a partial transfer for now.
