Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • None
    • None
    • None
    • 15533

    Description

      LNet chooses routers based on the bytes queued on each router. Meanwhile, it normally takes tens of seconds to detect a dead router (we see a failed completion event for an outstanding tx/rx, then close the connection and notify LNet that the peer is dead), which means it is still possible to queue more messages to a potentially dead router if all other live routers have long message queues.

      We may need to check the aliveness timestamp as part of router evaluation, and avoid choosing routers that have been inactive for a certain number of seconds as long as there are other active routers (it takes quite long to mark a router as dead, so we might prefer not to choose it even before it is marked dead).
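      The proposal above can be illustrated with a minimal sketch. This is not the LNet code or the patch under review: the `struct router` fields, the `INACTIVE_SECS` threshold, and the function names are all hypothetical and chosen only for illustration. It prefers any recently active router over an inactive one, and falls back to the shortest queue among routers of equal standing.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified router record; not the real LNet structures. */
struct router {
    int     dead;          /* already marked dead by the existing detection */
    int64_t queued_bytes;  /* bytes currently queued towards this router */
    int64_t last_alive;    /* time (seconds) of the last sign of life */
};

/* Assumed threshold: a router silent for this long counts as inactive. */
#define INACTIVE_SECS 10

static int router_is_active(const struct router *r, int64_t now)
{
    return !r->dead && (now - r->last_alive) < INACTIVE_SECS;
}

/*
 * Return the index of the preferred router, or -1 if every router is dead.
 * An active router always beats an inactive one; among routers with the
 * same activity state, the one with fewer queued bytes wins.
 */
int select_router(const struct router *rtrs, size_t n, int64_t now)
{
    int best = -1;
    int best_active = 0;

    for (size_t i = 0; i < n; i++) {
        const struct router *r = &rtrs[i];
        int active;

        if (r->dead)
            continue;

        active = router_is_active(r, now);
        if (best < 0 ||
            (active && !best_active) ||
            (active == best_active &&
             r->queued_bytes < rtrs[best].queued_bytes)) {
            best = (int)i;
            best_active = active;
        }
    }
    return best;
}
```

      With this ordering, an inactive router with an empty queue is only chosen when no active router exists, which is the behaviour the description asks for.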

          Activity

            [LU-5570] Better router selection in LNet

            Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/13342
            Subject: LU-5570 lnet: check router aliveness timestamp
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8650e97e5bede98f3bae16cf64a687ae1c07ef4b

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13302/
            Subject: Revert "LU-5570 lnet: check router aliveness timestamp"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: bfaadd73b74da2aca82007ca78a6baf15ea2790c

            liang Liang Zhen (Inactive) added a comment - I just asked Oleg to revert it, because this patch was not cleanly rebased; we also need Isaac to review it.

            Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: http://review.whamcloud.com/13302
            Subject: Revert "LU-5570 lnet: check router aliveness timestamp"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2d8b8c9e0149b0fe860983cd2020d9781bd2e548

            jlevi Jodi Levi (Inactive) added a comment - Patch landed to Master. If there is more work to be done in this ticket, please reopen the ticket.

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/11748/
            Subject: LU-5570 lnet: check router aliveness timestamp
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 339c7b2b784a528f41c432e9b90285d3445b7536

            shadow Alexey Lyashkov added a comment - That's not true; I have lots of crash dumps with negative credits per destination when a router is dead.

            liang Liang Zhen (Inactive) added a comment - In current LNet, if any router has positive credits, we will queue the message to it rather than to a router with negative credits.
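
            The credit preference described in this comment can be sketched as follows. This is an illustrative assumption, not the actual LNet route comparison: `struct rtr_credits` and `pick_by_credits` are hypothetical names. Picking the router with the most credits means a router with positive credits always wins over one with negative credits.

```c
#include <stddef.h>

/* Hypothetical, simplified router record; not the real LNet structure. */
struct rtr_credits {
    int credits;  /* > 0: can send now; < 0: messages already waiting */
};

/*
 * Return the index of the router with the most credits, or -1 if the
 * list is empty. A router with negative credits is chosen only when no
 * router has more credits than it.
 */
int pick_by_credits(const struct rtr_credits *rtrs, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (best < 0 || rtrs[i].credits > rtrs[best].credits)
            best = (int)i;
    }
    return best;
}
```

            Note that when every router has negative credits this still queues to one of them, which is the case the following comment objects to.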

            shadow Alexey Lyashkov added a comment - I think we shouldn't queue any messages when credits are negative. In that case we would queue only the data we are able to send, and it would be easy to direct the other data to different routers.
            liang Liang Zhen (Inactive) added a comment - edited

            I would think we should always avoid configuration changes when possible; we already have too many tunables, which is overkill and makes it very hard for users to get them all correct.

            I totally agree it's better to fix LU-5485 in this patch (sorry I missed your comment there), and RC ping reduction is also a very good idea. I will have a follow-on patch to implement ping reduction, as it may require a few more changes to several different timestamps, so it would be clearer to have a separate patch. Thanks.

            isaac Isaac Huang (Inactive) added a comment - Also, with aliveness for routers, it'd be possible to fix LU-5485 as well. Better plan them all together.

            People

              liang Liang Zhen (Inactive)
              liang Liang Zhen (Inactive)
              Votes: 0
              Watchers: 12
