LNet Health: increase transaction timeout

    Description

      On larger clusters, the current default transaction timeout of 10 seconds appears to be too short and causes RDMA timeouts.

      The proposal is to increase the timeout to 150s. With a retry count of 3, that would bring the LND timeout to 50s, which was the initial value before the health feature was introduced.
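The arithmetic behind the 50s figure can be sketched as below. This reproduces only the division stated in the description (150 / 3 = 50). The helper name `lnd_timeout` is hypothetical, and the ticket does not state whether LNet divides by the retry count or by the total number of attempts, so the divisor here is an assumption that may not match a given release.

```shell
#!/bin/sh
# Hypothetical helper (not part of LNet): split a transaction timeout
# evenly across a number of attempts, as the description's arithmetic does.
lnd_timeout() {
    transaction_timeout=$1
    attempts=$2
    if [ "$attempts" -le 0 ]; then
        # Assumption: with nothing to divide by, use the full timeout.
        echo "$transaction_timeout"
        return 0
    fi
    # Integer division, in seconds.
    echo $(( transaction_timeout / attempts ))
}
```

Calling `lnd_timeout 150 3` prints `50`, matching the value quoted in the description.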

          Activity

            [LU-13145] LNet Health: increase transaction timeout
            adilger Andreas Dilger added a comment - edited

            Per LU-13020, the workaround to get equivalent behavior on systems without this patch is to run the following commands, in the order shown, on all of the 2.12.3 nodes:

            echo 150 > /sys/module/lnet/parameters/lnet_transaction_timeout
            echo 2 > /sys/module/lnet/parameters/lnet_retry_count
            

            These changes last only until the lnet module is reloaded or the node is rebooted. To make them permanent, add the following line to /etc/modprobe.d/lnet.conf:

            options lnet lnet_retry_count=2 lnet_transaction_timeout=150
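When rolling this setting out to many nodes, the line can be added idempotently so that repeated runs do not duplicate it. A minimal sketch, assuming a hypothetical helper named `ensure_line`; it only checks for an exact match and does not merge with any other `options lnet` line that may already be present in the file.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already present.
ensure_line() {
    file=$1
    line=$2
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}
```

As root, this could be invoked as `ensure_line /etc/modprobe.d/lnet.conf 'options lnet lnet_retry_count=2 lnet_transaction_timeout=150'`. The module options are read when the lnet module is next loaded.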
            

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37390/
            Subject: LU-13145 lnet: use conservative health timeouts
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 3c86a9361862d34a4efde73c4f1cb7603ec6b2f9

            pjones Peter Jones added a comment -

            Landed for 2.14


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37430/
            Subject: LU-13145 lnet: use conservative health timeouts
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 361e9eaef13c0f472ad45388d3e147dabc32b737

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37430
            Subject: LU-13145 lnet: use conservative health timeouts
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 188b4cb2f782e348cfa4225745341ac8e3e267c3

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37390
            Subject: LU-13145 lnet: use conservative health timeouts
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: f308fc8fc393d939f34e4d7587c0848667084a8c

            gerrit Gerrit Updater added a comment - edited

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37253
            Subject: LU-13145 lnet: increase transaction timeout
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 07ef6f9f0c142240a4586674fe9c556482f323c4


            People

              ashehata Amir Shehata (Inactive)