Details
-
Question/Request
-
Resolution: Duplicate
-
Minor
-
None
-
None
-
None
-
Lustre Server 2.1.6
Lustre Client 1.8.9
-
14384
Description
Hi
I'd like an explanation of which timeout values are being exceeded and resulting in these evictions. What does the "227 seconds" refer to, i.e. which timeout is it being measured against: ldlm_timeout, obd_timeout, /proc/sys/lustre/timeout, at_min or at_max?
May 14 05:37:59 dc2oss15 kernel: : Lustre: dc2-OST009c: haven't heard from client ac9ef944-83f6-a453-c821-f0067101d2ca (at 149.165.229.28@tcp) in 227 seconds. I think it's dead, and I am evicting it. exp ffff880bb8ff2800, cur 1400060279 expire 1400060129 last 1400060052
May 14 05:37:59 dc2oss15 kernel: : Lustre: Skipped 9 previous similar messages
May 14 05:38:02 dc2oss12 kernel: : Lustre: dc2-OST007b: haven't heard from client ac9ef944-83f6-a453-c821-f0067101d2ca (at 149.165.229.28@tcp) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88169eecb400, cur 1400060282 expire 1400060132 last 1400060055
May 14 05:38:02 dc2oss12 kernel: : Lustre: Skipped 8 previous similar messages
May 14 05:37:53 dc2oss04 kernel: : Lustre: dc2-OST0021: haven't heard from client ac9ef944-83f6-a453-c821-f0067101d2ca (at 149.165.229.28@tcp) in 227 seconds. I think it's dead, and I am evicting it. exp ffff880bd683c400, cur 1400060273 expire 1400060123 last 1400060046
May 14 05:37:53 dc2oss04 kernel: : Lustre: Skipped 8 previous similar messages
May 14 05:37:58 dc2oss05 kernel: : Lustre: dc2-OST002c: haven't heard from client ac9ef944-83f6-a453-c821-f0067101d2ca (at 149.165.229.28@tcp) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88154cc64000, cur 1400060278 expire 1400060128 last 1400060051
May 14 05:37:58 dc2oss05 kernel: : Lustre: Skipped 9 previous similar messages
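As a starting point for reading these messages, the arithmetic on the cur, expire and last fields can be checked directly. The sketch below is only an illustration: the regular expression and the function name eviction_intervals are my own (hypothetical), written against the message layout quoted above, and the sample is the first dc2oss15 line. It shows that the reported "227 seconds" equals cur - last in that message, and that cur - expire is 150. The 150 happens to be 1.5 times the /proc/sys/lustre/timeout value of 100 mentioned in this ticket, but whether that is the window the server actually applies is an assumption to be confirmed against the Lustre documentation or source, not something the log itself proves.

```python
import re

# Field layout copied from the eviction messages quoted in this ticket.
# The pattern and names here are illustrative, not part of any Lustre tool.
EVICT_RE = re.compile(
    r"(?P<target>\S+): haven't heard from client (?P<uuid>\S+) "
    r"\(at (?P<nid>[^)]+)\) in (?P<secs>\d+) seconds\..*?"
    r"cur (?P<cur>\d+) expire (?P<expire>\d+) last (?P<last>\d+)"
)


def eviction_intervals(line):
    """Return reported and derived intervals for one eviction message, or None."""
    m = EVICT_RE.search(line)
    if m is None:
        return None
    cur, expire, last = (int(m.group(k)) for k in ("cur", "expire", "last"))
    return {
        "target": m.group("target"),
        "reported": int(m.group("secs")),   # the "in N seconds" value
        "cur_minus_last": cur - last,       # time since the client was last heard
        "cur_minus_expire": cur - expire,   # distance from now to the expiry cutoff
        "expire_minus_last": expire - last, # how far past the cutoff the client was
    }


sample = (
    "May 14 05:37:59 dc2oss15 kernel: : Lustre: dc2-OST009c: haven't heard from "
    "client ac9ef944-83f6-a453-c821-f0067101d2ca (at 149.165.229.28@tcp) in 227 "
    "seconds. I think it's dead, and I am evicting it. exp ffff880bb8ff2800, "
    "cur 1400060279 expire 1400060129 last 1400060052"
)

print(eviction_intervals(sample))
```

The same arithmetic on the other three OSS messages gives the same 227 and 150, so the numbers are at least consistent across servers.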
In particular, I'm interested in knowing whether the ldlm_timeout values, which are 20s for the OSTs and 6s for the MDT, are in play, given that we have adaptive timeouts enabled (at_max = 600) and /proc/sys/lustre/timeout = 100.
Should we consider increasing ldlm_timeout if it is in fact being used? Should we consider setting at_min to 60-70s to allow time for slow client responses?
If so, how would those settings help, and what difference would they make?
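For anyone comparing settings across servers and clients before changing anything, a small script can collect the four tunables named above in one place. This is a minimal sketch under stated assumptions: only /proc/sys/lustre/timeout is confirmed by this ticket, and the sibling paths for ldlm_timeout, at_min and at_max are assumed to live in the same directory on these Lustre versions. The function name read_tunables is hypothetical. A value of None means the file was not found or could not be parsed.

```python
from pathlib import Path

# Tunable names taken from the question. The directory is assumed from the
# /proc/sys/lustre/timeout path mentioned in the ticket; the other three
# file locations are an assumption and may differ between Lustre versions.
TUNABLES = ("timeout", "ldlm_timeout", "at_min", "at_max")


def read_tunables(base="/proc/sys/lustre", names=TUNABLES):
    """Return {name: int value, or None if the file is missing or unparsable}."""
    values = {}
    for name in names:
        try:
            text = (Path(base) / name).read_text()
            values[name] = int(text.split()[0])
        except (OSError, ValueError, IndexError):
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_tunables().items():
        print(f"{name} = {value}")
```

Running it on each OSS, the MDS and a client gives a quick side-by-side view of which values are actually set on each node.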
See sections 2.2.2 and 2.2.8 in Cory Spitz's paper here:
https://cug.org/5-publications/proceedings_attendee_lists/CUG11CD/pages/1-program/final_program/Wednesday/12A-Spitz-Paper.pdf
Thank You,
Manish
Attachments
Issue Links
- is related to
-
LUDOC-250 More explanation of several kinds of timeout setting
-
- Open
-
Closing ticket as the remaining doc work will be handled under LUDOC-250