aboyko, could you please provide some more background on why a tunable ping_interval is needed? I'm concerned that allowing ping_interval to be tuned separately from obd_timeout can lead to random client eviction when clients are not sending RPCs or OBD_PING in a timely manner. This might be hard to notice if it works out like e.g. ping_interval = obd_timeout - 10 and this is OK while an import is active and sending RPCs or OBD_PING, but fails intermittently if the import becomes idle and a ping is also lost.
I'd much prefer to have a per-device obd_timeout value as implemented in patch https://review.whamcloud.com/50519 "LU-9912 ptlrpc: make obd timeout a per-device param", and then the ping_interval for each import is controlled by obd->obd_timeout / 4. This would work properly for clients that mount multiple filesystems, unlike having a global obd_timeout (and now global ping_interval).
However, before we change anything with the global ping_interval that was aadded in 2.16, I'd like to understand why it was added and what problem it was solving. I'd prefer to avoid having a tunable ping_interval entirely, just because it can go badly. If this was needed to solve some specific problem, would a per-device obd_timeout also solve this same issue? Also, hornc landed patch https://review.whamcloud.com/49807 "LU-16483 ptlrpc: Track highest reply XID" that also solves a longstanding problem where clients reconnect on a ping failure, even though they have successfully sent other pings in the meantime.
I'm thinking we should remove the global ping_interval tunable completely (so that pings are always tied to obd_timeout), and use something like:
#define PING_INTERVAL(obd) (obd_timeout(obd) / 4)
It would still be possible to keep evict_multiplier if that is important, something like:
#define PING_INTERVAL(obd) (obd_timeout(obd) * 3 / (evict_multiplier * 2))
but before we add complexity I'd like to understand what this was needed for.
As per discussion on the LWG call today, moving tickets that do not appear to be essential to fix version 2.17. If the fix lands before code freeze we will update the fix version to reflect that but we want to focus on activities on the critical path. Please speak up if you think that this issue definitely needs to be fixed before we could issue a 2.16 release.