[LU-11373] CERROR/CWARN messages are not throttled Created: 12/Sep/18 Updated: 31/Oct/18 Resolved: 30/Oct/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.12.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Andreas Dilger | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: | None |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
It appears that CDEBUG_LIMIT is not working properly, since test logs in Maloo are full of repeated error messages. This would be a nightmare on a large cluster with many clients and servers spewing repeated messages. For example, sanity test_60a on the MDS console:

[ 2786.154265] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[ 2786.155427] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[ 2786.156482] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[ 2786.157628] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[ 2786.158671] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[ 2786.159789] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[ 2786.160824] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[ 2786.161934] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[ 2786.162977] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[ 2786.164074] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[ 2786.165111] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
[repeats hundreds of times]

It might relate to the ktime_t patches that James landed previously, but that is just speculation, as I haven't investigated it yet. The sanity test_60b should be catching the failure of CDEBUG_LIMIT(), but it is checking the logs on the client, while the test is being run on the MGS. |
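As context for the ktime_t speculation above, the throttling in question is time-based. Below is a minimal sketch of how a CDEBUG_LIMIT()-style console limiter works; the struct, field, and macro names are illustrative assumptions rather than the actual libcfs definitions, and the inline comment marks where a broken time conversion would defeat the backoff entirely:

/*
 * Hedged sketch of a time-based console throttle in the spirit of
 * CDEBUG_LIMIT().  All names here are illustrative, not libcfs code.
 */
#include <linux/ktime.h>
#include <linux/printk.h>

struct msg_limit_state {
	ktime_t mls_next;   /* earliest time the next message may print */
	int     mls_delay;  /* current backoff interval, in seconds */
};

#define MSG_LIMIT(mls, fmt, ...)					\
do {									\
	ktime_t __now = ktime_get();					\
									\
	if (ktime_after(__now, (mls)->mls_next)) {			\
		/* Suspected failure mode: if a time conversion drops	\
		 * the nanosecond scaling (e.g. stores plain seconds	\
		 * where nanoseconds are compared), mls_next is always	\
		 * in the past and every message prints, matching the	\
		 * flood shown above.					\
		 */							\
		(mls)->mls_next = ktime_add(__now,			\
				ktime_set((mls)->mls_delay, 0));	\
		pr_warn(fmt, ##__VA_ARGS__);				\
	}								\
} while (0)

Each call site would keep one static struct msg_limit_state. The real implementation additionally escalates the delay and reports how many messages were skipped, but the core gate is the single time comparison above, so one broken conversion disables throttling everywhere.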
| Comments |
| Comment by Gerrit Updater [ 12/Sep/18 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33154 |
| Comment by Gerrit Updater [ 01/Oct/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33154/ |
| Comment by Peter Jones [ 01/Oct/18 ] |
|
Landed for 2.12 |
| Comment by Bruno Faccini (Inactive) [ 15/Oct/18 ] |
|
For the record, I have just experienced an Oops due to a soft lockup (i.e. with /proc/sys/kernel/softlockup_panic set to 1). My crash-dump analysis showed all cores but one stuck in stop_machine()/stop_cpu() during module unload, while the last core had been printing console messages for too long, right after sanity/test_60a/llog_test_10() and its catalog-full simulation. This allowed me to identify that the root cause could also be this console message rate-limiting failure/regression. In case working rate limiting is not enough of a fix on its own, I also wonder if the number of llog record creation/write attempts could be slightly reduced in sanity/test_60a/llog_test_10(). I have also found a report of a similar module-unload/stop_machine() scenario and its fix at https://lists.openwall.net/linux-kernel/2015/08/19/367 which could be of interest.
|
| Comment by James Nunez (Inactive) [ 23/Oct/18 ] |
|
Reopened this ticket because we have at least two recent sanity test_60b failures after the patch for this ticket landed: https://testing.whamcloud.com/test_sets/e51fa37e-d6db-11e8-ad90-52540065bddc and https://testing.whamcloud.com/test_sets/45b22ad4-d6d4-11e8-ad90-52540065bddc . |
| Comment by Andreas Dilger [ 24/Oct/18 ] |
|
It looks like the failure is 103 debug messages vs. the expected 100. I can see DNE being more verbose (it looks like the other MDTs are still mounted and complaining), so bumping this limit up a bit is fine. |
| Comment by James Nunez (Inactive) [ 25/Oct/18 ] |
|
Looking at the failures so far, ZFS appears to be the common factor. Across all sanity test_60b runs from the past 3 days, none of the ldiskfs test sessions, including DNE test sessions, had any failures; all failures are from ZFS test sessions. All of the failures have 102 or 103 debug lines, although each of these counts includes two extra lines from test_60aa ( |
| Comment by Gerrit Updater [ 25/Oct/18 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33474 |
| Comment by Gerrit Updater [ 30/Oct/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33474/ |
| Comment by Peter Jones [ 30/Oct/18 ] |
|
Landed for 2.12 |
| Comment by Andreas Dilger [ 31/Oct/18 ] |
|
Any patches failing due to this issue need to rebase to get the fix. |