  1. Lustre
  2. LU-17525

Unaligned DIO interop with different page sizes fails



    Description

      Unaligned DIO interop with different page sizes fails

      Unaligned DIO between systems with 4k and 64k pages fails in the brw/ptlrpc bulk ops because the two sides can add a differing number of pages to the initial MD while still fitting within the LNET_MTU limit.

      One solution is to restrict the initial unaligned MD based on the maximum page size of all the interoperable machines. In this case aarch64 and a few other arches have 64k pages.
      Limiting the first I/O to what can fit in (LNET_MTU - 64k bytes) ensures that MDs are sized to the same maximum that all architectures can support. This is only done when the initial MD is unaligned; the vectors are nominally aligned thereafter.

      When client and server page sizes differ, the client and server prepare MDs differently, each based on its local page size.
      When the system with the larger page size writes at an offset greater than the smaller page size, and the byte count of the resulting (first) MD plus the larger-page-size offset is greater than the LNET MTU (1M), the number of bytes that can fit is greater for the system with the smaller page size. In this case the MD (sending or receiving) will match on xid/match_bits and fail on the message length check:

      lnet_try_match_md()
      {
      ....
      	} else if ((md->md_options & LNET_MD_TRUNCATE) == 0) {
      		/* this packet _really_ is too big */
      		CERROR("Matching packet from %s, match %llu"
      		       " length %d too big: %d left, %d allowed\n",
      		       libcfs_idstr(&info->mi_id), info->mi_mbits,
      		       info->mi_rlength, md->md_length - offset, mlength);
      
      		return LNET_MATCHMD_DROP;
      ...
      }
      

      This then triggers a resend. Both sides recompute and resend, but the lengths are still wrong, so the I/O never completes.
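
      The mismatch can be illustrated with a small model. This is only a sketch, not the Lustre code: first_md_bytes() is a hypothetical helper, and it assumes an MD is filled one page fragment at a time until the next fragment would push it past LNET_MTU (the fragment-count limit is ignored).

```c
#include <assert.h>

#define LNET_MTU (1UL << 20) /* 1M, as stated in the description */

/*
 * Hypothetical model: number of bytes that land in the first MD when a
 * transfer of 'total' bytes starting at 'offset' is split into fragments
 * that end on local page boundaries.
 */
static unsigned long first_md_bytes(unsigned long offset,
				    unsigned long page_size,
				    unsigned long total)
{
	unsigned long bytes = 0;
	unsigned long pos = offset;

	assert(page_size != 0);
	while (bytes < total) {
		/* fragment runs to the end of the current local page */
		unsigned long frag = page_size - pos % page_size;

		if (frag > total - bytes)
			frag = total - bytes;
		/* stop as soon as the next fragment would exceed the MTU */
		if (bytes + frag > LNET_MTU)
			break;
		bytes += frag;
		pos += frag;
	}
	return bytes;
}
```

      Under these assumptions, a 4M transfer at an 8k offset puts 1048576 bytes in the first MD with 4k pages but only 1040384 bytes with 64k pages, so the two sides disagree on the length of the first MD.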

      So adjust the fitting logic in __ptlrpc_prep_bulk_page() for the first MD when all of the following are true:

      • I/O is direct-io
      • write is not aligned on the largest allowed page_size (64k) boundary
      • offset is > smallest page size (MD_MIN_INTEROP_PAGE_SIZE)

      For interop the first page is assumed to be 64k, which causes the system with smaller pages
      to stop adding pages/bytes to the MD at the same point as the system with larger pages, except when:

      • number of bytes + 64k offset <= LNET_MTU
        because the last page's byte count falls short of the MTU limit. In this case the extraneous MD is
        collapsed back, as only a single MD is needed / used for this bulk I/O.
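
      The conditions and the adjusted fitting can be sketched as follows. This is a simplified model under stated assumptions, not the actual change to __ptlrpc_prep_bulk_page(): both function names are hypothetical, the values of the two MD_*_INTEROP_PAGE_SIZE constants are taken from the 4k and 64k figures above, and "offset" is read as the offset within a 64k page.

```c
#include <assert.h>
#include <stdbool.h>

#define LNET_MTU                 (1UL << 20)  /* 1M */
#define MD_MIN_INTEROP_PAGE_SIZE (4UL << 10)  /* assumed value: 4k */
#define MD_MAX_INTEROP_PAGE_SIZE (64UL << 10) /* assumed value: 64k */

/* Hypothetical predicate mirroring the three conditions listed above. */
static bool md_needs_interop_fit(bool is_dio, unsigned long offset)
{
	unsigned long in_page = offset % MD_MAX_INTEROP_PAGE_SIZE;

	return is_dio &&                            /* I/O is direct-io */
	       in_page != 0 &&                      /* not 64k aligned */
	       in_page > MD_MIN_INTEROP_PAGE_SIZE;  /* offset > smallest page */
}

/*
 * Hypothetical model: fill the first MD with local-page fragments, but
 * take the stop decision as a 64k-page system would, so that every page
 * size stops at the same byte count.
 */
static unsigned long first_md_bytes_interop(unsigned long offset,
					    unsigned long page_size,
					    unsigned long total)
{
	unsigned long bytes = 0;
	unsigned long pos = offset;

	assert(page_size >= MD_MIN_INTEROP_PAGE_SIZE);
	assert(page_size <= MD_MAX_INTEROP_PAGE_SIZE);
	assert(MD_MAX_INTEROP_PAGE_SIZE % page_size == 0);

	while (bytes < total) {
		unsigned long left = total - bytes;
		/* what a 64k-page system would still add from this point */
		unsigned long big = MD_MAX_INTEROP_PAGE_SIZE -
				    pos % MD_MAX_INTEROP_PAGE_SIZE;
		unsigned long frag = page_size - pos % page_size;

		if (big > left)
			big = left;
		if (bytes + big > LNET_MTU)
			break;
		if (frag > left)
			frag = left;
		bytes += frag;
		pos += frag;
	}
	return bytes;
}
```

      In this model a 4M transfer at an 8k offset stops at 1040384 bytes for both 4k and 64k pages. A 1M transfer at the same offset fits entirely in one MD, which corresponds to the exception above where the extra MD is not needed.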

      A quick survey of the page sizes (or configurable PAGE_SIZE values) that Linux supports shows a few uncommon architectures that support page sizes > 64k; however, those systems can also be configured for 64k (or smaller) page sizes.
      In addition, no currently supported platform appears to allow a page size of less than 4k. Therefore restricting Lustre to page sizes from 4k to 64k (along with the MD_MAX_INTEROP_PAGE_SIZE) should not be controversial.

      To accommodate the possible additional MD needed for a full bulk I/O that is also restricted due to offset and page alignment, increase the maximum to PTLRPC_BULK_OPS_COUNT + 1.
      To do this we have to double the theoretical maximum, going from 6 bits to 7, to correctly deal with the mbits/xid logic.
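
      The need for the extra bit can be shown with a little arithmetic. The sketch below assumes that the MD index is carried in the low bits of the match bits on top of an xid aligned to the bulk-ops count; the helper names are hypothetical and only illustrate why an index of 64 (the "+ 1" MD) does not fit in 6 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Mask that clears the low 'bits' bits reserved for the MD index. */
static uint64_t bulk_ops_mask(unsigned int bits)
{
	assert(bits < 64);
	return ~((UINT64_C(1) << bits) - 1);
}

/* Hypothetical: match bits for MD number 'md_index' of a bulk with 'xid'. */
static uint64_t md_mbits(uint64_t xid, unsigned int bits,
			 unsigned int md_index)
{
	/* xid is assumed to be aligned so its low 'bits' bits are free */
	assert((xid & ~bulk_ops_mask(bits)) == 0);
	return xid + md_index;
}

/* Does 'mbits' still map back to 'xid' once the index bits are masked? */
static int mbits_belongs_to_xid(uint64_t xid, unsigned int bits,
				uint64_t mbits)
{
	return (mbits & bulk_ops_mask(bits)) == xid;
}
```

      With 6 bits, indices 0 to 63 map back to their xid, but index 64 spills into the next xid; with 7 bits it still maps back correctly.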

      Finally, to indicate to the server that a client has in fact adjusted the MD size(s) for 64k alignment, the unused lower 16 bits of struct obd_ioobj.ioo_max_brw can be used for flags, one bit of which can indicate that OBD_IOOBJ_INTEROP_PAGE_ALIGNMENT is needed.
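
      A minimal sketch of packing such a flag into the lower 16 bits follows. The struct here is a stand-in with only the one field, the helper names are hypothetical, and the flag value 0x0001 is an assumption; the description only says that one of the lower 16 bits can be used.

```c
#include <assert.h>
#include <stdint.h>

#define IOO_FLAG_BITS 16
#define IOO_FLAG_MASK ((UINT32_C(1) << IOO_FLAG_BITS) - 1)

/* Assumed bit value, for illustration only. */
#define SKETCH_INTEROP_PAGE_ALIGNMENT UINT32_C(0x0001)

/* Stand-in for struct obd_ioobj; only the field discussed above. */
struct ioobj_sketch {
	uint32_t ioo_max_brw; /* upper 16 bits in use, lower 16 bits free */
};

/* Set a flag in the lower 16 bits without touching the upper half. */
static void ioobj_set_flag(struct ioobj_sketch *ioo, uint32_t flag)
{
	assert((flag & ~IOO_FLAG_MASK) == 0);
	ioo->ioo_max_brw |= flag;
}

/* Test a flag in the lower 16 bits. */
static int ioobj_has_flag(const struct ioobj_sketch *ioo, uint32_t flag)
{
	return (ioo->ioo_max_brw & IOO_FLAG_MASK & flag) != 0;
}

/* Value stored in the upper 16 bits, unaffected by the flags. */
static uint32_t ioobj_upper_half(const struct ioobj_sketch *ioo)
{
	return ioo->ioo_max_brw >> IOO_FLAG_BITS;
}
```

      A server that sees the flag can then apply the same 64k-based fitting as the client; a server that does not know the flag would see only the unchanged upper half.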

      Without this patch, a client and server with different native page sizes cannot agree on how big the MD is for a 64k unaligned I/O: one side will abort the MD as 'too big' for the allocated space [see: lnet_try_match_md() => LNET_MATCHMD_DROP] and trigger a retry, but the MD math never changes, so the effect is a hang.


            People

              stancheff Shaun Tancheff
              stancheff Shaun Tancheff