Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • None
    • None
    • None
    • 15533

    Description

      LNet chooses routers based on the number of bytes queued on each router. Meanwhile, it normally takes tens of seconds to detect a dead router (we see a failed completion event for an outstanding tx/rx, then close the connection and notify LNet that the peer is dead). This means it is still possible to queue more messages to a potentially dead router if all the other live routers have long message queues.

      We may need to check the aliveness timestamp as part of router evaluation and avoid choosing routers that have been inactive for a certain number of seconds, as long as other active routers are available. Because it takes a long time to mark a router as dead, we may prefer not to choose it before it is marked dead.
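
      The proposal could be sketched roughly as follows. This is a minimal Python illustration, not LNet code: the `Router` fields, the `select_router` function, and the 10-second `inactive_secs` window are all hypothetical names and values chosen for the example. It keeps the existing "fewest queued bytes" rule, but applies it only to routers that have shown signs of life recently, and falls back to the inactive ones only when no active router is left.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Router:
    # Hypothetical fields for illustration; these are not LNet's real structures.
    name: str
    queued_bytes: int          # bytes currently queued towards this router
    last_alive: float          # timestamp (seconds) of the last sign of life
    marked_dead: bool = False  # set once the router has been declared dead


def select_router(routers: Sequence[Router], now: float,
                  inactive_secs: float = 10.0) -> Optional[Router]:
    """Pick the least-loaded router, preferring recently active ones.

    inactive_secs is an assumed threshold, not a value from the ticket.
    """
    # Routers already marked dead are never candidates.
    candidates = [r for r in routers if not r.marked_dead]
    if not candidates:
        return None
    # Routers seen alive within the window are preferred.
    active = [r for r in candidates if now - r.last_alive <= inactive_secs]
    # Fall back to inactive routers only when no active router exists.
    pool = active or candidates
    # Within the chosen pool, keep the existing rule: fewest queued bytes wins.
    return min(pool, key=lambda r: r.queued_bytes)


routers = [
    Router("rtr-a", queued_bytes=0, last_alive=70.0),     # silent for 30 s
    Router("rtr-b", queued_bytes=4096, last_alive=99.0),  # busy but alive
]
chosen = select_router(routers, now=100.0)  # picks rtr-b
```

      Selecting on queued bytes alone would pick `rtr-a` here, because its queue is empty, even though it has been silent for 30 seconds and may already be dead. Keeping the fallback means traffic still flows when every router looks inactive.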

          Activity

            [LU-5570] Better router selection in LNet
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-7734 [ LU-7734 ]
            adilger Andreas Dilger made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Reopened [ 4 ] New: Resolved [ 5 ]
            johann Johann Lombardi (Inactive) made changes -
            Labels Original: lu_st
            liang Liang Zhen (Inactive) made changes -
            Resolution Original: Fixed [ 1 ]
            Status Original: Resolved [ 5 ] New: Reopened [ 4 ]
            jlevi Jodi Levi (Inactive) made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            jlevi Jodi Levi (Inactive) made changes -
            Remote Link New: This issue links to "Page (HPDD Community Wiki)" [ 12820 ]
            liang Liang Zhen (Inactive) created issue -

            People

              liang Liang Zhen (Inactive)
              liang Liang Zhen (Inactive)
              Votes: 0
              Watchers: 12