Lustre / LU-11229

server_bulk_callback()) ASSERTION( desc->bd_md_count > 0 ) failed

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical

    Description

      While testing LU-10428 I hit an assertion failure, but Amir tells me the issue is not limited to LU-10428 and has also been seen on a real OPA system. The collected logs point to a long-standing bug related to the landing of 4MB I/O. That change replaced bd_success with bd_failure and introduced a second bug in this area (the first one is the client part of LU-11169).
      The problem is simple.
      The server-side bulk is set up so that a transfer generates two events, because the put requests an ACK:

      int ptlrpc_start_bulk_transfer(struct ptlrpc_bulk_desc *desc)
      ...
              /* Network is about to get at the memory */
              if (ptlrpc_is_bulk_put_source(desc->bd_type))
                      rc = LNetPut(self_nid, desc->bd_mds[posted_md],
                                   LNET_ACK_REQ, peer_id,
                                   desc->bd_portal, mbits, 0, 0);
      

      So there are two LNet events per MD (a SEND and an ACK), but...

      void server_bulk_callback(struct lnet_event *ev)
      {
      ...
              if (ev->unlinked) {
                      desc->bd_md_count--;
                      /* This is the last callback no matter what... */
                      if (desc->bd_md_count == 0)
                              wake_up(&desc->bd_waitq);
              }
      ...
      }
      

      Oops: bd_md_count gets decremented twice for a single MD, once for the SEND event and once for the ACK event.

       00000100:00000010:0.0:1533747855.090799:0:24663:0:(client.c:130:ptlrpc_new_bulk()) kmalloced 'desc': 416 at ffff88006080c800.
       00000100:00000200:0.0:1533747855.091779:0:21701:0:(events.c:449:server_bulk_callback()) event type 5, status 0, desc ffff88006080c800
       00000100:00000200:1.0:1533747855.091788:0:21700:0:(events.c:449:server_bulk_callback()) event type 4, status 0, desc ffff88006080c800
       00000100:00040000:0.0:1533747855.091796:0:21701:0:(events.c:453:server_bulk_callback()) ASSERTION( desc->bd_md_count > 0 ) failed:
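
      The double decrement can be reproduced in a small standalone model. This is only a sketch under stated assumptions, not Lustre code: every toy_* name is hypothetical, and the model assumes, as the log above suggests, that both the SEND and the ACK event reach the callback with the unlinked flag set. Instead of crashing, it records the point where the assertion on bd_md_count would fire.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; these are NOT the real Lustre/LNet types. */
enum toy_event_kind { TOY_EVENT_SEND, TOY_EVENT_ACK };

struct toy_event {
	enum toy_event_kind kind;
	bool unlinked;     /* assumption: set on both events, as in the log */
};

struct toy_bulk_desc {
	int md_count;      /* plays the role of bd_md_count */
	bool assert_hit;   /* set where the bd_md_count > 0 check would fail */
};

/* Mirrors the quoted callback logic: one decrement per unlinked event. */
void toy_server_bulk_callback(struct toy_bulk_desc *desc,
			      const struct toy_event *ev)
{
	assert(desc != NULL && ev != NULL);

	if (desc->md_count <= 0) {
		desc->assert_hit = true;   /* the real server asserts here */
		return;
	}
	if (ev->unlinked)
		desc->md_count--;
}

/* Posts one MD, delivers SEND then ACK; returns true if the check trips. */
bool toy_one_md_transfer_trips_assert(void)
{
	struct toy_bulk_desc desc = { .md_count = 1, .assert_hit = false };
	struct toy_event send = { TOY_EVENT_SEND, true };
	struct toy_event ack = { TOY_EVENT_ACK, true };

	toy_server_bulk_callback(&desc, &send);  /* md_count: 1 -> 0 */
	toy_server_bulk_callback(&desc, &ack);   /* md_count is already 0 */
	return desc.assert_hit;
}
```

      With a single posted MD, the first event drops the counter to zero and the second one finds nothing left to account for, which matches the sequence of the two callbacks and the failed assertion in the log.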
      

      So it looks like we should not rely on ev->unlinked alone (the buffer is unlinked after the send), but should wait for the ACK if one is still expected.
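
      One possible shape of such a check, as a rough sketch only: this is not the actual patch, all fix_* names are hypothetical, and it assumes the descriptor knows whether an ACK was requested for its MDs. An MD is counted as finished exactly once: on the ACK when one was requested, on a failed SEND (after which no ACK will arrive), or on the unlinked event when no ACK was requested. Descriptor lifetime and locking are ignored here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; these are NOT the real Lustre/LNet types. */
enum fix_event_kind { FIX_EVENT_SEND, FIX_EVENT_ACK };

struct fix_event {
	enum fix_event_kind kind;
	bool unlinked;
	int status;          /* 0 on success, non-zero on failure */
};

struct fix_bulk_desc {
	int md_count;        /* MDs still in flight, like bd_md_count */
	bool ack_requested;  /* hypothetical: the put asked for an ACK */
	bool done;           /* stands in for wake_up(&desc->bd_waitq) */
};

/* Returns true if no further event is expected for this MD. */
bool fix_is_final_event(const struct fix_bulk_desc *desc,
			const struct fix_event *ev)
{
	if (!desc->ack_requested)
		return ev->unlinked;       /* no ACK will follow */
	if (ev->kind == FIX_EVENT_ACK)
		return true;               /* the ACK closes the transfer */
	return ev->status != 0;            /* failed SEND: no ACK to wait for */
}

void fix_server_bulk_callback(struct fix_bulk_desc *desc,
			      const struct fix_event *ev)
{
	if (!fix_is_final_event(desc, ev))
		return;                    /* ACK still pending, keep the count */

	assert(desc->md_count > 0);
	desc->md_count--;
	if (desc->md_count == 0)
		desc->done = true;         /* last callback for this descriptor */
}
```

      In this model a successful SEND leaves the counter untouched even though the buffer is already unlinked, and the later ACK performs the only decrement.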

          People

            shadow Alexey Lyashkov