Lustre Documentation / LUDOC-479

Need LNet clarifications


Details

    • Type: Question/Request
    • Resolution: Fixed
    • Priority: Minor
    • None
    • None
    • Environment: Lustre 2.12.5
      Infiniband (MLX4, MLX5)
      TCP
      OPA

    Description

      We recently experienced major issues switching Lustre routing clusters to 2.12 and ended up reverting them to 2.10. In trying to better understand LNet, I read through various documentation pages, but was left with several questions. Can you help answer the following questions and perhaps update the LNet docs as well?

      Questions:

      1. What is the data flow, or order of operations, for sending LNet messages from a client to a server through routers? For example, a mock (and possibly incorrect) model might be:
        1. Client determines next hop router
        2. Client checks available routing buffer credits (rtr) on router
        3. Client checks its own available send credits (tx) and peer_credits toward that router
        4. Client sends <= #peer_credits messages, decrementing tx for each message
        5. Router receives messages in routing buffers, depending on message size, and decrements # routing buffer credits (rtr) for each message.
        6. The router then acts as the client, repeating steps 1-5 toward the next hop, as well as back toward the original client (as data is received)
      2. Is there any need to add peers manually in a non-MR (non-Multi-Rail) configuration? My understanding is that there is not.
      3. Should a router have a peer entry for every node in the expanded SAN, including nodes on other networks that it must route traffic to?
      4. The manual states: "The number of credits currently in flight (number of transmit credits) is shown in the tx column.... Therefore, rtr – tx is the number of transmits in flight." It seems that "in flight" in the tx description should read "available", so that rtr - tx would be the number of transmits in flight. Is that right?
      5. Should a NID ever show up for the wrong interface (e.g. tcp instead of o2ibXX)? We sometimes see log messages from <addr>@tcp when they should come from <addr>@o2ibX.
      6. Do the older mlx4 LNet settings (https://wiki.lustre.org/LNet_Router_Config_Guide#Configure_Lustre_Servers) still apply to mlx5, or do they need to be updated?
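The mock model in question 1 can be made concrete with a toy simulation of its credit bookkeeping. This is a minimal Python sketch under stated assumptions: `CreditPool` and `send_to_router` are hypothetical names, it covers only steps 3 to 5 for a single client-to-router hop, and it is not how LNet is implemented. The toy counter tracks available credits and derives in-flight as `maximum - available`; whether the manual's tx and rtr columns map onto it this way is exactly what question 4 asks.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CreditPool:
    """Toy credit counter (hypothetical, not an LNet structure).

    `available` starts at `maximum` and drops by one per message sent,
    so in this model `maximum - available` is the number in flight.
    """
    maximum: int
    available: int = field(init=False)

    def __post_init__(self) -> None:
        self.available = self.maximum

    def take(self) -> None:
        if self.available <= 0:
            raise RuntimeError("no credits available")
        self.available -= 1

    def give_back(self) -> None:
        if self.available >= self.maximum:
            raise ValueError("returned more credits than were taken")
        self.available += 1

    @property
    def in_flight(self) -> int:
        return self.maximum - self.available


def send_to_router(n_msgs: int, peer_credits: int, router_buffers: int):
    """Steps 3-5 of the mock model for one client -> router hop.

    A message goes out only if the client holds a send credit (tx) toward
    the router AND the router has a buffer credit (rtr) for it; all other
    messages stay queued on the client.
    """
    tx = CreditPool(peer_credits)      # client's send credits toward the router
    rtr = CreditPool(router_buffers)   # router buffers this client may consume
    queued = deque(range(n_msgs))
    sent = []
    while queued and tx.available > 0 and rtr.available > 0:
        msg = queued.popleft()
        tx.take()   # step 4: one send credit per message
        rtr.take()  # step 5: the router consumes one buffer per message
        sent.append(msg)
    return tx, rtr, sent, list(queued)


if __name__ == "__main__":
    tx, rtr, sent, waiting = send_to_router(10, peer_credits=8, router_buffers=4)
    print(f"sent={sent} waiting={len(waiting)}")
    print(f"tx: available={tx.available} in_flight={tx.in_flight}")
    print(f"rtr: available={rtr.available} in_flight={rtr.in_flight}")
```

Running it prints `sent=[0, 1, 2, 3] waiting=6`: the router's 4 buffers run out before the client's 8 send credits do, so in this toy model the remaining 6 messages wait on rtr rather than tx.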


People

  Assignee: ashehata Amir Shehata (Inactive)
  Reporter: charr Cameron Harr
  Votes: 0
  Watchers: 6
