[LU-13143] detect console spew during (interop) testing Created: 15/Jan/20  Updated: 15/Jan/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-1095 Console message cleanup Reopened
is related to LU-8294 Noisy gss_svc_upcall_handle_init Resolved
is related to LU-11579 cl_file_inode_init()) ASSERTION(inode... Resolved
is related to LU-12712 sanity-pfl tests triggering “not SEL ... Resolved
is related to LU-13136 (layout.c:2121:__req_capsule_get()) @... Resolved
Rank (Obsolete): 9223372036854775807

 Description   

I'm wondering whether we could add a check in test-framework.sh::cleanup_check() that looks for excessive error messages on the console, like we already do for LBUG, Busy inode, and memory leaks. We've had several situations like this (LU-13136, LU-12712, LU-11579, LU-8294, LU-1095, ...) that are not detected during normal testing because they do not actually cause any tests to fail, but are annoying to end users.

One option would be to scan the whole dmesg log for Lustre: and LustreError: messages, possibly excluding D_CONSOLE, MARKER, and similar messages. Rather than comparing the full message text, the check would look for duplicate output from the same source line, like mdt_lvb.c:163:mdt_lvbo_fill() in this case, so that differences in the details of the message do not matter:

LustreError: 2456:0:(mdt_lvb.c:163:mdt_lvbo_fill()) myth-MDT0000: expected 336 actual 240.

then sort and count the messages from each source line, and trigger an error when a count exceeds a certain threshold.
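
The scan-and-count step described above could look roughly like the sketch below. It is written in Python only to illustrate the logic; a real check would presumably be shell code in test-framework.sh. The function names, the regular expression, and the default threshold of 50 are assumptions made for illustration, not existing Lustre code. Lines without a (file:line:function()) tag, such as MARKER lines, are simply skipped in this sketch.

```python
import re
from collections import Counter

# Matches the "(file.c:line:function())" call-site tag that follows the
# "Lustre:" or "LustreError:" prefix, e.g. "(mdt_lvb.c:163:mdt_lvbo_fill())".
CALLSITE_RE = re.compile(
    r"Lustre(?:Error)?: .*?\((?P<file>[\w.-]+):(?P<line>\d+):(?P<func>\w+)\(\)\)"
)


def count_console_messages(dmesg_lines):
    """Count console messages per (file, line, function) call site.

    Messages are grouped by call site rather than by full text, so that
    differences in the message details (PIDs, target names, sizes) do not
    split one noisy source line into many distinct messages.
    """
    counts = Counter()
    for line in dmesg_lines:
        m = CALLSITE_RE.search(line)
        if m:  # lines without a call-site tag are ignored
            counts[(m["file"], int(m["line"]), m["func"])] += 1
    return counts


def find_spew(dmesg_lines, threshold=50):
    """Return the call sites whose message count exceeds the threshold.

    The default threshold is an arbitrary placeholder value.
    """
    counts = count_console_messages(dmesg_lines)
    return {site: n for site, n in counts.items() if n > threshold}
```

A caller could feed it the output of dmesg split into lines and fail the test run whenever find_spew() returns a non-empty result.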

We might have to make a 'whitelist' of errors that are generated during a specific test and are not necessarily a sign of problems (e.g. the llog-test runs in sanity test-60a). Each entry should be confined to a specific test script and carry an approximate allowed message count (e.g. SANITY_CONSOLE_MDS_EXCEPT="mdt_lvbo_fill:100 ...", SANITY_CONSOLE_CLIENT_EXCEPT="ptlrpc_expire_one_request:250 ...", etc.).
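
As a rough illustration of how such an exception list could be applied, the Python sketch below parses a "function:count" specification in the format suggested above and reports only the functions that exceed their allowance. The helper names parse_except() and over_limit(), and the default_limit parameter for functions that are not listed, are hypothetical; nothing like them exists in test-framework.sh today.

```python
def parse_except(spec):
    """Parse a 'func:count func:count' string into a {func: count} dict."""
    allowed = {}
    for item in spec.split():
        func, sep, limit = item.partition(":")
        if not sep or not func or not limit:
            raise ValueError(f"expected 'function:count', got {item!r}")
        allowed[func] = int(limit)
    return allowed


def over_limit(counts_by_func, except_spec, default_limit=0):
    """Return the functions whose message count exceeds their allowance.

    counts_by_func maps a function name to the number of console messages
    it printed. Functions missing from except_spec fall back to
    default_limit (an assumed knob, not part of the original proposal).
    """
    allowed = parse_except(except_spec)
    return {
        func: n
        for func, n in counts_by_func.items()
        if n > allowed.get(func, default_limit)
    }
```

For example, with the specification "mdt_lvbo_fill:100", a script that printed 120 messages from mdt_lvbo_fill() would be reported, while one that printed 80 would pass.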

While this may cause some spurious test failures as new subtests are added to a specific script, that would be an exception rather than the rule, and would at least give us a chance to detect unusual errors being printed to the console during testing.

