Lustre / LU-12419

ppc64le: "LNetError: RDMA has too many fragments for peer_ni" when reading two files

Details

    • Bug
    • Resolution: Won't Fix
    • Minor
    • None
    • Lustre 2.10.6, Lustre 2.12.1
    • None
    • 3

    Description

      I'm trying to install and configure a Lustre client on an IBM POWER9 (ppc64le) machine to mount an existing Lustre file system through a router. The POWER9 host is similar to an ORNL Summit worker node, and we are going to use it for debugging software before running it on leadership facilities, so this case may be of interest to others.

      At first there were issues connecting over LNet (lctl ping); these were resolved after explicitly setting options in /etc/modprobe.d/ko2iblnd.conf as per LU-3322:

        map_on_demand=16 - on ibmpower9

        map_on_demand=256 - on all x86_64 servers and the router.

      Lustre was restarted and modules were reloaded after these changes.
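
      For reference, a setting like the one above is normally written as a modprobe "options" line. The exact lines used on these hosts are not shown in this report, so the fragment below is an assumed form for the POWER9 client, not a copy of the actual file:

```
# /etc/modprobe.d/ko2iblnd.conf on the POWER9 client (assumed form)
# The x86_64 servers and router would use map_on_demand=256 instead.
options ko2iblnd map_on_demand=16
```

      The value only takes effect once the ko2iblnd module is reloaded, which is why the modules were reloaded after the change.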

      Now I can mount Lustre on the POWER9 client and read a SINGLE file with dd. This works for files located on any of the six OSTs, but only with one read transfer at a time. I did not try writes.

      When I start two transfers in parallel, or start one transfer and then start the other 10 seconds later, I get an LNet error when the transfer starts for the second file. I can kill -9 the dd processes, but not always all of them; sometimes one of the processes cannot be killed with signal 9. Even when all I/O processes ("dd") are killed on the client, the router and servers continue to report errors in their logs, and I still observe I/O on both OSTs where the files reside. I cannot unmount Lustre on the POWER9 client or remove the modules.

      "lctl net unconfigure" reports "LNET busy". I have to reboot the POWER9 client. Only after the client reboot do the errors stop being reported on the servers and router.
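
      The two-reader scenario described above can be sketched roughly as follows. This is only an illustration: the function name parallel_read, the 1 MiB block size, and the file paths in the usage line are assumptions, not the exact commands that were run.

```shell
#!/bin/sh
# Sketch of the two-reader reproduction (hypothetical helper).
# Usage: parallel_read FILE1 FILE2 [DELAY_SECONDS]
parallel_read() {
    file1=$1
    file2=$2
    delay=${3:-10}   # 10 s stagger by default, as in the description

    # First reader: stream the file to /dev/null (block size is an assumption).
    dd if="$file1" of=/dev/null bs=1M 2>/dev/null &
    pid1=$!

    # Start the second reader after the delay; 0 starts both together.
    sleep "$delay"
    dd if="$file2" of=/dev/null bs=1M 2>/dev/null &
    pid2=$!

    # Wait for both readers; return non-zero if either read failed.
    rc=0
    wait "$pid1" || rc=1
    wait "$pid2" || rc=1
    return "$rc"
}
```

      Example call, with placeholder paths on the Lustre mount: parallel_read /mnt/lustre/file_a /mnt/lustre/file_b 10. On an affected client the readers may hang rather than fail, in which case the function never returns.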

      Attachments

        1. client.tgz
          2.66 MB
        2. router.tgz
          414 kB
        3. server.tgz
          616 kB

            People

              wc-triage WC Triage
              alex.ku Alex Kulyavtsev