[LU-15980] Aggregation messages should say when the event was last seen Created: 28/Jun/22  Updated: 28/Jun/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Improvement Priority: Minor
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Rank (Obsolete): 9223372036854775807

 Description   

Aggregated messages such as "Lustre: Skipped 5292 previous similar messages" can be delayed by up to libcfs_console_max_delay seconds (default 10 minutes). For people who rely on closely following the logs to tell various errors apart, this adds a significant fudge factor to the timing.

We should print some sort of offset, such as "last seen X seconds ago", when the last occurrence was more than some (configurable?) number of seconds ago, to help correlate aggregated events with the actual time at which they ended.
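A minimal sketch of what the proposed output could look like, not taken from libcfs. The function name format_skipped_msg, its parameters, and the exact wording of the suffix are all assumptions made for illustration, and times are plain seconds here for simplicity.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not a libcfs function): format the aggregation
 * message, appending a "last seen" offset only when the last occurrence
 * is at least min_age_sec seconds old. All times are in seconds.
 */
static int format_skipped_msg(char *buf, size_t len, int skipped,
                              long now_sec, long last_seen_sec,
                              long min_age_sec)
{
    long age = now_sec - last_seen_sec;

    if (age >= min_age_sec)
        return snprintf(buf, len,
                        "Skipped %d previous similar messages, "
                        "last seen %ld seconds ago",
                        skipped, age);

    /* Recent enough: keep the existing short form. */
    return snprintf(buf, len, "Skipped %d previous similar messages",
                    skipped);
}
```

The threshold (min_age_sec) stands in for the "(configurable?)" number of seconds mentioned above; whether it would be a module parameter or a constant is left open.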



 Comments   
Comment by Andreas Dilger [ 28/Jun/22 ]

We need to be careful to get properly useful info out of this. The "last seen" time may or may not be representative: if the message is printed right at the max timeout, it does not say whether there was a steady stream of those messages, or a huge number followed by nothing until now. The only info available today is cdls_next, the time when the next message should be printed, so if that is not close to "now" then some time has passed since the last burst of this message. It would also be useful to print the start of that range, which is somewhere between (cdls_next - cdls_delay) and (cdls_next - cdls_delay/2) (the exact time could be found by digging in the logs, but that may be a hassle).
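The bounds described in this comment can be sketched as follows. This is an illustration only: struct limit_state_sketch is a hypothetical stand-in that copies just the two field names mentioned above (cdls_next, cdls_delay), and it assumes both are expressed in the same time unit; the real structure may use different types and units.

```c
/* Hypothetical stand-in carrying only the two fields named in the comment. */
struct limit_state_sketch {
    long cdls_next;   /* time at which the next message may be printed */
    long cdls_delay;  /* current delay between printed messages */
};

struct time_range {
    long earliest;
    long latest;
};

/*
 * Estimate when the suppressed range started: somewhere between
 * (cdls_next - cdls_delay) and (cdls_next - cdls_delay/2).
 */
static struct time_range burst_start_range(const struct limit_state_sketch *s)
{
    struct time_range r;

    r.earliest = s->cdls_next - s->cdls_delay;
    r.latest = s->cdls_next - s->cdls_delay / 2;
    return r;
}

/*
 * How far "now" is past cdls_next; a large value suggests the last burst
 * ended some time ago. Returns 0 if cdls_next has not been reached yet.
 */
static long time_past_next(const struct limit_state_sketch *s, long now)
{
    return now > s->cdls_next ? now - s->cdls_next : 0;
}
```

For example, with cdls_next = 1000 and cdls_delay = 600, the start of the range is estimated to lie between 400 and 700.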

I was thinking it would be prudent to avoid growing the cdls struct, but there are at most about 4000 such structures in the code (only for CERROR, CWARN, CNETERR, CDEBUG_LIMIT, and LCONSOLE_INFO/WARN/ERROR), so adding a few bytes is unlikely to be problematic.
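A rough way to estimate that cost, under stated assumptions: the two layouts below are hypothetical (they are not the actual cdls definition), and the added last_seen field is an invented name for a single extra timestamp.

```c
#include <stddef.h>

/* Hypothetical "before" layout, for size comparison only. */
struct cdls_before_sketch {
    unsigned long next;
    unsigned int  count;
    int           delay;
};

/* Same layout with one extra timestamp (field name is an assumption). */
struct cdls_after_sketch {
    unsigned long next;
    unsigned long last_seen;
    unsigned int  count;
    int           delay;
};

/* Total extra memory if n_structs such structures each grow by one field. */
static size_t extra_bytes_total(size_t n_structs)
{
    return n_structs * (sizeof(struct cdls_after_sketch) -
                        sizeof(struct cdls_before_sketch));
}
```

On a typical 64-bit Linux build, where unsigned long is 8 bytes, extra_bytes_total(4000) comes to 32000 bytes, which supports the point that a few extra bytes per structure is a small cost.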
