  1. Lustre
  2. LU-5183

If adaptive timeouts are set with at_max = 600, does ldlm_timeout take effect or is it overruled?

Details

    • Question/Request
    • Resolution: Duplicate
    • Minor
    • None
    • None
    • None
    • Lustre Server 2.1.6
      Lustre Client 1.8.9
    • 14384

    Description

      Hi

      I'd like an explanation of which timeout is being exceeded and causing these evictions. What does the "227 seconds" in the messages below refer to, i.e. which timeout is being considered? Is it ldlm_timeout, obd_timeout, /proc/sys/lustre/timeout, at_min, or at_max?

      May 14 05:37:59 dc2oss15 kernel: : Lustre: dc2-OST009c: haven't heard from client ac9ef944-83f6-a453-c821-f0067101d2ca (at 149.165.229.28@tcp) in 227 seconds. I think it's dead, and I am evicting it. exp ffff880bb8ff2800, cur 1400060279 expire 1400060129 last 1400060052
      May 14 05:37:59 dc2oss15 kernel: : Lustre: Skipped 9 previous similar messages
      May 14 05:38:02 dc2oss12 kernel: : Lustre: dc2-OST007b: haven't heard from client ac9ef944-83f6-a453-c821-f0067101d2ca (at 149.165.229.28@tcp) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88169eecb400, cur 1400060282 expire 1400060132 last 1400060055
      May 14 05:38:02 dc2oss12 kernel: : Lustre: Skipped 8 previous similar messages
      May 14 05:37:53 dc2oss04 kernel: : Lustre: dc2-OST0021: haven't heard from client ac9ef944-83f6-a453-c821-f0067101d2ca (at 149.165.229.28@tcp) in 227 seconds. I think it's dead, and I am evicting it. exp ffff880bd683c400, cur 1400060273 expire 1400060123 last 1400060046
      May 14 05:37:53 dc2oss04 kernel: : Lustre: Skipped 8 previous similar messages
      May 14 05:37:58 dc2oss05 kernel: : Lustre: dc2-OST002c: haven't heard from client ac9ef944-83f6-a453-c821-f0067101d2ca (at 149.165.229.28@tcp) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88154cc64000, cur 1400060278 expire 1400060128 last 1400060051
      May 14 05:37:58 dc2oss05 kernel: : Lustre: Skipped 9 previous similar messages
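
For reference, the numeric fields in these messages can be compared directly. The following is a minimal sketch in Python, assuming the message layout is exactly as in the log lines above and that `cur`, `expire` and `last` are epoch seconds; the function name `parse_eviction` and the derived field names are hypothetical, chosen only for illustration.

```python
import re

# Matches the reported silence interval and the "cur", "expire" and "last"
# fields, assuming the message layout shown in the log excerpt above.
EVICT_RE = re.compile(
    r"in (?P<silent>\d+) seconds\..*"
    r"cur (?P<cur>\d+) expire (?P<expire>\d+) last (?P<last>\d+)"
)


def parse_eviction(line):
    """Return the logged fields plus two differences, or None if no match."""
    m = EVICT_RE.search(line)
    if m is None:
        return None
    fields = {name: int(value) for name, value in m.groupdict().items()}
    # Differences between the logged timestamps (assumed to be in seconds).
    fields["cur_minus_last"] = fields["cur"] - fields["last"]
    fields["cur_minus_expire"] = fields["cur"] - fields["expire"]
    return fields


# First message from the excerpt above, copied verbatim.
sample = (
    "May 14 05:37:59 dc2oss15 kernel: : Lustre: dc2-OST009c: haven't heard "
    "from client ac9ef944-83f6-a453-c821-f0067101d2ca (at 149.165.229.28@tcp) "
    "in 227 seconds. I think it's dead, and I am evicting it. "
    "exp ffff880bb8ff2800, cur 1400060279 expire 1400060129 last 1400060052"
)

info = parse_eviction(sample)
print(info["silent"], info["cur_minus_last"], info["cur_minus_expire"])
```

For the first message this gives `cur - last = 227`, which matches the reported "227 seconds", and `cur - expire = 150`. The sketch only restates the logged numbers; by itself it does not show which tunable determines the threshold.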
      

      In particular, I'd like to know whether the ldlm_timeout values, which are 20s for OSTs and 6s for the MDT, are in play given that we have adaptive timeouts enabled (at_max = 600) and /proc/sys/lustre/timeout = 100.

      Should we consider increasing the ldlm_timeouts if they are in fact being used? Should we consider setting at_min to 60-70s to allow time for slow client responses?

      If so, how would those settings help, and what difference would they make?

      See sections 2.2.2 and 2.2.8 in Cory Spitz's paper here:
      https://cug.org/5-publications/proceedings_attendee_lists/CUG11CD/pages/1-program/final_program/Wednesday/12A-Spitz-Paper.pdf

      Thank You,
      Manish

            People

              emoly.liu Emoly Liu
              manish Manish Patel (Inactive)