[LU-11373] CERROR/CWARN messages are not throttled Created: 12/Sep/18  Updated: 31/Oct/18  Resolved: 30/Oct/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Critical
Reporter: Andreas Dilger Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-11384 wrong check in libcfs_debug_vmsg2() Resolved
is duplicated by LU-11383 A lot of cfs_fail_loc error message f... Resolved
Related
is related to LU-11566 sanity test_60aa: llog_print_cb()) no... Resolved

 Description   

It appears that CDEBUG_LIMIT is not working properly, since test logs in Maloo are full of repeated error messages. This would be a nightmare on a large cluster, with many clients and servers spewing repeated messages.

For example, sanity test_60a on the MDS console:

[ 2786.154265] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[ 2786.155427] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[ 2786.156482] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[ 2786.157628] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[ 2786.158671] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[ 2786.159789] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[ 2786.160824] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[ 2786.161934] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[ 2786.162977] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[ 2786.164074] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[ 2786.165111] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[repeats hundreds of times]

It might relate to the ktime_t patches that James landed previously, but that is just speculation, as I haven't investigated it yet.

The sanity test_60b subtest should be catching this CDEBUG_LIMIT() failure, but it checks the logs on the client while the test itself runs on the MGS.
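For reference, the intended behaviour of CDEBUG_LIMIT() is to suppress a call site that keeps emitting the same message, backing off so that only an occasional copy (plus a count of skipped messages) reaches the console. The following is a minimal userspace sketch of that kind of throttling, assuming a simple per-call-site exponential backoff; the structure and function names (ratelimit_state, console_limited) are illustrative only and are not the actual libcfs API.

#include <stdio.h>
#include <time.h>

/* Illustrative per-call-site throttling state (not the libcfs structure). */
struct ratelimit_state {
	time_t next;    /* earliest time the next message may be printed */
	time_t delay;   /* current backoff interval, doubled on each burst */
	long   skipped; /* messages suppressed since the last one printed */
};

#define DELAY_MIN 1	/* seconds */
#define DELAY_MAX 600	/* seconds */

/* Return 1 if the message should be printed now, 0 if it should be dropped. */
static int console_limited(struct ratelimit_state *rs)
{
	time_t now = time(NULL);

	if (now < rs->next) {
		rs->skipped++;
		return 0;
	}

	/* If messages kept arriving during the backoff window, lengthen it;
	 * if the call site went quiet, drop back to the minimum delay. */
	if (rs->skipped > 0)
		rs->delay = rs->delay * 2 > DELAY_MAX ? DELAY_MAX : rs->delay * 2;
	else
		rs->delay = DELAY_MIN;

	rs->next = now + rs->delay;
	return 1;
}

int main(void)
{
	struct ratelimit_state rs = { 0, DELAY_MIN, 0 };
	int i;

	/* Simulate a call site that fires many times in a tight loop, as in
	 * the llog_cat_new_log() flood above: only one copy per window. */
	for (i = 0; i < 1000; i++) {
		if (console_limited(&rs)) {
			if (rs.skipped)
				printf("(skipped %ld similar messages)\n",
				       rs.skipped);
			printf("no more free slots in catalog (iteration %d)\n", i);
			rs.skipped = 0;
		}
	}
	printf("suppressed %ld further copies of the message\n", rs.skipped);
	return 0;
}

With working throttling of this kind, a tight loop like the one above prints the first copy and a suppressed-message count rather than hundreds of identical console lines.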



 Comments   
Comment by Gerrit Updater [ 12/Sep/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33154
Subject: LU-11373 tests: fix sanity test_60b to run on the MGS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1c37077c2d5340a56914908e5e3f469b3d1d8394

Comment by Gerrit Updater [ 01/Oct/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33154/
Subject: LU-11373 libcfs: fix wrong check in libcfs_debug_vmsg2()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4037c1462730c120af00bd0e95b08fe0669e0271

Comment by Peter Jones [ 01/Oct/18 ]

Landed for 2.12

Comment by Bruno Faccini (Inactive) [ 15/Oct/18 ]

Just in case: I have just experienced an Oops due to a soft-lockup (i.e. with /proc/sys/kernel/softlockup_panic set to 1). My crash-dump analysis showed all cores but one stuck in stop_machine()/stop_cpu() during module unload, while the last core had been printing console messages for too long, right after sanity/test_60a/llog_test_10() and its catalog-full simulation. This allowed me to identify that the root cause could also be this console message rate-limiting failure/regression.

In case working rate limiting alone is not enough to fix this, I also wonder whether the number of llog record creation/write attempts in sanity/test_60a/llog_test_10() could be slightly reduced. I have also found a report of a specific module-unload/stop_machine() scenario and its fix at https://lists.openwall.net/linux-kernel/2015/08/19/367 which could be of interest.

Comment by James Nunez (Inactive) [ 23/Oct/18 ]

Reopened this ticket because we have at least two recent sanity test_60b failures after the patch for this ticket landed: https://testing.whamcloud.com/test_sets/e51fa37e-d6db-11e8-ad90-52540065bddc and https://testing.whamcloud.com/test_sets/45b22ad4-d6d4-11e8-ad90-52540065bddc

Comment by Andreas Dilger [ 24/Oct/18 ]

It looks like the failure is 103 debug messages vs. the expected 100. I expect DNE to be more verbose (it looks like the other MDTs are still mounted and complaining), so bumping this limit up a bit is fine.

Comment by James Nunez (Inactive) [ 25/Oct/18 ]

Looking at the failures so far, ZFS appears to be the common factor. Of all the sanity test_60b runs from the past 3 days, none of the ldiskfs test sessions, including DNE test sessions, had any failures.

All failures are from ZFS test sessions.

All of the failures report 102 or 103 debug lines. Although two of the extra lines in these failures come from test_60aa (LU-11566), I'll submit a patch to increase the check from 100 to 103.
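For what it's worth, the check in question boils down to counting how many copies of the rate-limited message made it into the log and failing if the count exceeds the allowed maximum. Below is a minimal sketch of that kind of check in C, assuming the log text is piped in on stdin; the MARKER string and the MAX_LINES value of 103 are taken from the discussion above but are otherwise illustrative (the real check is implemented in the test_60b shell code in sanity.sh).

#include <stdio.h>
#include <string.h>

/* Count log lines containing the throttled message and compare the count
 * against the allowed maximum, analogous to the test_60b check. */
#define MARKER    "no more free slots in catalog"
#define MAX_LINES 103

int main(void)
{
	char line[4096];
	long count = 0;

	while (fgets(line, sizeof(line), stdin) != NULL)
		if (strstr(line, MARKER) != NULL)
			count++;

	if (count > MAX_LINES) {
		fprintf(stderr, "FAIL: %ld messages, rate limiting looks "
			"broken (limit %d)\n", count, MAX_LINES);
		return 1;
	}
	printf("PASS: %ld messages within limit %d\n", count, MAX_LINES);
	return 0;
}

Piping a saved console log into the compiled sketch would exercise the same pass/fail decision the test makes.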

Comment by Gerrit Updater [ 25/Oct/18 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33474
Subject: LU-11373 tests: increase debug limit sanity 60b
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8a6f5af4b66815f97e7437a1d4d98b49f88a43e2

Comment by Gerrit Updater [ 30/Oct/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33474/
Subject: LU-11373 tests: increase debug limit sanity 60b
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4afee32188d38d2969ea160ddec64cd8359e12ca

Comment by Peter Jones [ 30/Oct/18 ]

Landed for 2.12

Comment by Andreas Dilger [ 31/Oct/18 ]

Any patches failing due to this issue need to be rebased to pick up the fix.
