[LU-15980] Aggregation messages should say when the event was last seen Created: 28/Jun/22 Updated: 28/Jun/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Oleg Drokin | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Aggregated messages such as "Lustre: Skipped 5292 previous similar messages" can be delayed by up to libcfs_console_max_delay seconds (default 10 minutes). For people who rely on closely following the logs to tell various errors apart, this adds a significant fudge factor. We should print some sort of offset, such as "last seen X seconds ago", when the delay was longer than some (configurable?) number of seconds, to help correlate aggregated events with the actual time at which they ended. |
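As a rough illustration of the proposed output, the sketch below appends a "last seen" offset only when the last occurrence is older than a threshold. The function name `format_skipped_msg`, its parameters, and the exact wording of the suffix are assumptions made for this example and are not taken from libcfs. Times are plain seconds rather than kernel jiffies.

```c
#include <stdio.h>

/*
 * Hypothetical helper, not part of libcfs: build the aggregation line and
 * append the age of the last occurrence when it exceeds threshold_sec.
 * All times are in seconds. Returns the value returned by snprintf().
 */
static int format_skipped_msg(char *buf, size_t len, int skipped,
                              long now_sec, long last_seen_sec,
                              long threshold_sec)
{
    long age = now_sec - last_seen_sec;
    const char *plural = (skipped == 1) ? "" : "s";

    if (age > threshold_sec)
        return snprintf(buf, len,
                        "Skipped %d previous similar message%s, "
                        "last seen %ld seconds ago",
                        skipped, plural, age);

    /* Recent enough: keep the existing message format unchanged. */
    return snprintf(buf, len, "Skipped %d previous similar message%s",
                    skipped, plural);
}
```

Keeping the short form when the age is below the threshold means messages printed promptly would look the same as they do today.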
| Comments |
| Comment by Andreas Dilger [ 28/Jun/22 ] |
|
We need to be careful to get properly useful info out of this. The "last seen" time may or may not be representative: if the message is printed right at the max timeout, it does not say whether there was a steady stream of those messages, or a huge number and then nothing again until now. The only info available today is cdls_next, the time when the next message should be printed, so if that is not close to "now" then some time has passed since the last burst of this message. It would also be useful to print the start of that range, which is somewhere between (cdls_next - cdls_delay) and (cdls_next - cdls_delay/2) (the exact time could be found by digging in the logs, but that may be a hassle). I was thinking it would be prudent to avoid growing the cdls struct, but there are at most about 4000 such structures in the code (only for CERROR, CWARN, CNETERR, CDEBUG_LIMIT, and LCONSOLE_INFO/WARN/ERROR), so adding a few bytes is unlikely to be problematic. |
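The bounds described above can be sketched as follows. This is only an illustration under stated assumptions: `struct cdls_sketch` is a simplified stand-in that models just the two fields named in the comment, `burst_start_window` and `idle_since_next` are hypothetical helpers, and times are signed seconds rather than jiffies, so no wraparound handling is shown.

```c
/*
 * Simplified stand-in for the rate-limit state (assumption: only the two
 * fields mentioned in the comment are modeled, in seconds, not jiffies).
 */
struct cdls_sketch {
    long cdls_next;   /* time when the next message may be printed */
    long cdls_delay;  /* current delay between printed messages */
};

/* Earliest and latest possible start of the aggregated range. */
struct burst_window {
    long earliest;
    long latest;
};

/* Hypothetical helper: bounds from (next - delay) to (next - delay/2). */
static struct burst_window burst_start_window(const struct cdls_sketch *s)
{
    struct burst_window w = {
        .earliest = s->cdls_next - s->cdls_delay,
        .latest   = s->cdls_next - s->cdls_delay / 2,
    };
    return w;
}

/*
 * Hypothetical helper: how far "now" is past cdls_next. A large value
 * suggests that time has passed since the last burst of this message.
 */
static long idle_since_next(const struct cdls_sketch *s, long now)
{
    return now > s->cdls_next ? now - s->cdls_next : 0;
}
```

For example, with cdls_next = 1000 and cdls_delay = 600, the range would have started somewhere between 400 and 700, and a message printed at time 1500 would be 500 seconds past cdls_next. Both values come from existing fields, so this much could be reported without growing the struct.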