Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • 15533

    Description

      LNet chooses a router based on the number of bytes queued on each router. Meanwhile, it normally takes tens of seconds to detect a dead router: we see a failed completion event for an outstanding tx/rx, then close the connection and notify LNet that the peer is dead. This means it is still possible to queue more messages to a potentially dead router if all the other live routers have long message queues.

      We may need to check the aliveness timestamp as part of router evaluation and avoid choosing routers that have been inactive for a certain number of seconds, as long as other active routers are available. Because it takes quite long to mark a router as dead, we might prefer not to choose it even before it is marked dead.
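
      The selection rule proposed above can be sketched as follows. This is only an illustration of the idea, not the LNet implementation or the patch itself: the `Router` class, the `select_router` function, and the `inactive_after` threshold are all hypothetical names, and the 10-second default is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Router:
    """Hypothetical stand-in for a router peer; not an LNet structure."""
    name: str
    queued_bytes: int   # bytes currently queued towards this router
    last_alive: float   # timestamp (seconds) of the last sign of life


def select_router(routers: Iterable[Router], now: float,
                  inactive_after: float = 10.0) -> Optional[Router]:
    """Pick the router with the shortest queue, preferring active routers.

    A router counts as active if it showed signs of life within the last
    `inactive_after` seconds (an assumed threshold). Inactive routers are
    considered only when no active router is available.
    """
    routers = list(routers)
    if not routers:
        return None

    # Skip routers that have been silent for too long, even though they
    # have not been marked dead yet.
    active = [r for r in routers if now - r.last_alive <= inactive_after]

    # Fall back to the full list if every router looks inactive.
    candidates = active or routers

    # Among the candidates, keep the original criterion: fewest queued bytes.
    return min(candidates, key=lambda r: r.queued_bytes)
```

      With this rule, a router with an empty queue that has been silent for longer than the threshold loses to a busier router that was recently seen alive, but it is still used when it is the only router left.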

          Activity

            [LU-5570] Better router selection in LNet
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-7734 [ LU-7734 ]
            adilger Andreas Dilger made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Reopened [ 4 ] New: Resolved [ 5 ]

            ashehata Amir Shehata (Inactive) added a comment - This won't be needed with the router rework I have done. LNet Health currently uses the legacy routing code. I reworked the routing code to bring it in line with Multi-Rail. These changes should resolve the issue described in this ticket.
            johann Johann Lombardi (Inactive) made changes -
            Labels Original: lu_st

            liang Liang Zhen (Inactive) added a comment - James, probably not. I still need your help to sample the NI status on the routers (see my last comment on LU-5758).

            simmonsja James A Simmons added a comment - Do you think this could help with the ARF problems we have been having? Earlier comments seem to point to that.

            gerrit Gerrit Updater added a comment - Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/13342
            Subject: LU-5570 lnet: check router aliveness timestamp
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8650e97e5bede98f3bae16cf64a687ae1c07ef4b

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13302/
            Subject: Revert "LU-5570 lnet: check router aliveness timestamp"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: bfaadd73b74da2aca82007ca78a6baf15ea2790c
            liang Liang Zhen (Inactive) made changes -
            Resolution Original: Fixed [ 1 ]
            Status Original: Resolved [ 5 ] New: Reopened [ 4 ]

            liang Liang Zhen (Inactive) added a comment - I just asked Oleg to revert it because the patch was not cleanly rebased; we also need Isaac to review it.

            People

              liang Liang Zhen (Inactive)
              liang Liang Zhen (Inactive)
              Votes: 0
              Watchers: 12