  1. Lustre
  2. LU-15189

GDS support improvements and fixes.


Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.16.0
    • None
    • 3

    Description

      The Whamcloud GDS code has several oddities.
      1. It lacks autoconf support, so the wrong structures and macros are used for the GDS<->Lustre interaction
      (these changed in the latest nvidia-fs code).

      2. GDS has a bug in its page state testing function, which causes the force_rdma flag to be set on the ptlrpc message buffer. The reason is this code:

      bool nvfs_is_gpu_page(struct page *page)
      {
              nvfs_mgroup_ptr_t nvfs_mgroup;
      
              nvfs_mgroup = __nvfs_mgroup_from_page(page, false);
              if (nvfs_mgroup == NULL) {
                      return false;
              } else if (unlikely(IS_ERR(nvfs_mgroup))) {
                      // This is a GPU page but we did not take reference as we are in shutdown path
                      // But, we will return true to the caller so that caller doesn't think it is a
                      // CPU page and fall back to CPU path 
                      return true; /* <<< also returns true if the page has no magic */
      
      

      This is very easy to detect: pass the force_rdma flag into the memory mapping function and check whether a force_rdma buffer can actually be mapped by the GDS code.
      Instead of investigating it, Whamcloud added an odd workaround that tries to execute both mapping functions, which slows LNet down.

      Let's fix it.
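
      The detection idea described above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions, not the actual patch: every name in it (model_page, model_is_gpu_page, model_gds_map, model_map and the enums) is hypothetical and does not come from Lustre or nvidia-fs. It assumes the group lookup has three outcomes (NULL, an error pointer, a valid group) and that the mapping function runs only one mapping path, chosen by force_rdma.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; no name below exists in Lustre or nvidia-fs. */

/* Possible outcomes of the group lookup done by the page check. */
enum lookup_result { LOOKUP_NULL, LOOKUP_ERR, LOOKUP_OK };

struct model_page {
	enum lookup_result lookup;	/* what the lookup would return */
	bool gds_mappable;		/* can GDS really map this page? */
};

/* Which path the mapping function ended up taking. */
enum map_path { MAP_CPU, MAP_GDS, MAP_MISMATCH };

/*
 * Mirrors the quoted check: only a NULL lookup means "CPU page".
 * The error case is reported as a GPU page too, which is how a
 * plain buffer can end up flagged with force_rdma.
 */
static bool model_is_gpu_page(const struct model_page *pg)
{
	assert(pg != NULL);
	return pg->lookup != LOOKUP_NULL;
}

/* Stand-in for the GDS mapping call: succeeds only for real GPU pages. */
static bool model_gds_map(const struct model_page *pg)
{
	return pg->gds_mappable;
}

/*
 * Mapping function that receives force_rdma. Only one mapping path is
 * tried per buffer; a force_rdma buffer that GDS cannot map is reported
 * as a mismatch instead of silently trying the other path.
 */
static enum map_path model_map(const struct model_page *pg, bool force_rdma)
{
	assert(pg != NULL);
	if (!force_rdma)
		return MAP_CPU;
	return model_gds_map(pg) ? MAP_GDS : MAP_MISMATCH;
}
```

      In this model a page whose lookup yields an error pointer is flagged as a GPU page, and model_map() then returns MAP_MISMATCH for it, which is the false positive this issue describes. A real fix would still have to decide what to do with such a buffer.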

          People

            shadow Alexey Lyashkov
            shadow Alexey Lyashkov