[LU-5183] If Adaptive Timeout is set with at_max = 600, does ldlm_timeout take effect or is it overruled? - Whamcloud Community JIRA

Details

Type: Question/Request
Resolution: Duplicate
Priority: Minor
Fix Version/s: None
Affects Version/s: None
Labels: None
Environment:
Lustre Server 2.1.6
Lustre Client 1.8.9

Rank (Obsolete): 14384

Description

I'd like an explanation of which timeout value is being exceeded and causing these evictions. What does the "227 seconds" refer to, that is, which timeout is being considered: ldlm_timeout, obd_timeout, /proc/sys/lustre/timeout, at_min, or at_max?

May 14 05:37:59 dc2oss15 kernel: : Lustre: dc2-OST009c: haven't heard from client ac9ef944-83f6-a453-c821-f0067101d2ca (at 149.165.229.28@tcp) in 227 seconds. I think it's dead, and I am evicting it. exp ffff880bb8ff2800, cur 1400060279 expire 1400060129 last 1400060052
May 14 05:37:59 dc2oss15 kernel: : Lustre: Skipped 9 previous similar messages
May 14 05:38:02 dc2oss12 kernel: : Lustre: dc2-OST007b: haven't heard from client ac9ef944-83f6-a453-c821-f0067101d2ca (at 149.165.229.28@tcp) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88169eecb400, cur 1400060282 expire 1400060132 last 1400060055
May 14 05:38:02 dc2oss12 kernel: : Lustre: Skipped 8 previous similar messages
May 14 05:37:53 dc2oss04 kernel: : Lustre: dc2-OST0021: haven't heard from client ac9ef944-83f6-a453-c821-f0067101d2ca (at 149.165.229.28@tcp) in 227 seconds. I think it's dead, and I am evicting it. exp ffff880bd683c400, cur 1400060273 expire 1400060123 last 1400060046
May 14 05:37:53 dc2oss04 kernel: : Lustre: Skipped 8 previous similar messages
May 14 05:37:58 dc2oss05 kernel: : Lustre: dc2-OST002c: haven't heard from client ac9ef944-83f6-a453-c821-f0067101d2ca (at 149.165.229.28@tcp) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88154cc64000, cur 1400060278 expire 1400060128 last 1400060051
May 14 05:37:58 dc2oss05 kernel: : Lustre: Skipped 9 previous similar messages
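
One way to read these messages is to compare the three timestamps each one prints. The sketch below parses the first line above and computes the differences; the function name `eviction_gaps` and the regular expression are illustrative only and are not part of Lustre. It only does arithmetic on the logged fields and makes no claim about which tunable produces them.

```python
import re

# Hypothetical helper: pulls the numeric fields out of one eviction message.
EVICT_RE = re.compile(
    r"in (?P<reported>\d+) seconds\..*"
    r"cur (?P<cur>\d+) expire (?P<expire>\d+) last (?P<last>\d+)"
)

def eviction_gaps(line):
    """Return the reported silence and the gaps between the logged timestamps."""
    m = EVICT_RE.search(line)
    if m is None:
        return None
    reported, cur, expire, last = (
        int(m.group(k)) for k in ("reported", "cur", "expire", "last")
    )
    return {
        "reported": reported,             # the "in N seconds" figure
        "cur_minus_last": cur - last,     # time since the client was last heard
        "cur_minus_expire": cur - expire, # width of the eviction window
    }

# First log line from this ticket, copied verbatim.
sample = (
    "May 14 05:37:59 dc2oss15 kernel: : Lustre: dc2-OST009c: haven't heard "
    "from client ac9ef944-83f6-a453-c821-f0067101d2ca (at 149.165.229.28@tcp) "
    "in 227 seconds. I think it's dead, and I am evicting it. "
    "exp ffff880bb8ff2800, cur 1400060279 expire 1400060129 last 1400060052"
)

print(eviction_gaps(sample))
```

For this line, `cur - last` is 227, which matches the reported figure, and `cur - expire` is 150. The same 150 second gap appears in the other three messages. Whether that window derives from /proc/sys/lustre/timeout (100 here) rather than from ldlm_timeout or the adaptive timeout settings is the question this ticket asks; the arithmetic alone does not settle it.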

In particular, I'm interested in knowing whether the ldlm_timeout values, which are 20s for OSTs and 6s for the MDT, are in play given that we have adaptive timeouts enabled (at_max = 600) and /proc/sys/lustre/timeout = 100.

Should we consider increasing the ldlm_timeout values if they are in fact being used? Should we consider setting at_min to 60-70s to allow time for slow client responses?

If so, how do those settings help, and what difference do they make?

See sections 2.2.2 and 2.2.8 in Cory Spitz's paper here:
https://cug.org/5-publications/proceedings_attendee_lists/CUG11CD/pages/1-program/final_program/Wednesday/12A-Spitz-Paper.pdf

Thank You,
Manish

Issue Links

is related to: LUDOC-250 More explanation of several kinds of timeout setting (Open)

People

Assignee: Emoly Liu
Reporter: Manish Patel (Inactive)
Votes: 0
Watchers: 3

Dates

Created: 12/Jun/14 3:53 PM
Updated: 21/Jul/14 1:48 PM
Resolved: 21/Jul/14 1:48 PM