
[LU-11373] CERROR/CWARN messages are not throttled

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version/s: Lustre 2.12.0
    • Fix Version/s: Lustre 2.12.0
Description

It appears that CDEBUG_LIMIT() is not working properly, since test logs in Maloo are full of repeated error messages. This would be a nightmare on a large cluster with many clients and servers spewing repeated messages.

For example, sanity test_60a on the MDS console:

      [ 2786.154265] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [ 2786.155427] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [ 2786.156482] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [ 2786.157628] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [ 2786.158671] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [ 2786.159789] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [ 2786.160824] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [ 2786.161934] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [ 2786.162977] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [ 2786.164074] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [ 2786.165111] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [repeats hundreds of times]
      

It might be related to the ktime_t patches that James landed previously, but that is just speculation; I haven't investigated it yet.

The sanity test_60b subtest should be catching this CDEBUG_LIMIT() failure, but it checks the logs on the client while the test is being run on the MGS.
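
As background, the following is a minimal userspace sketch (not the Lustre implementation) of the kind of time-based throttling CDEBUG_LIMIT() is expected to provide: print the first occurrence of a message, suppress rapid repeats, back off the interval between prints, and report how many repeats were skipped. The struct and function names (rate_state, warn_limited) are invented for this example.

#include <stdio.h>
#include <time.h>

struct rate_state {
    time_t next;     /* earliest time the next message may be printed */
    time_t delay;    /* current back-off delay, in seconds */
    long   skipped;  /* messages suppressed since the last print */
};

static void warn_limited(struct rate_state *rs, const char *msg)
{
    time_t now = time(NULL);

    if (rs->next != 0 && now < rs->next) {
        rs->skipped++;                    /* too soon: suppress the repeat */
        return;
    }

    if (rs->skipped)
        printf("... skipped %ld similar messages\n", rs->skipped);
    printf("%s\n", msg);

    /* exponential back-off, capped at 10 minutes */
    rs->delay = rs->delay ? rs->delay * 2 : 1;
    if (rs->delay > 600)
        rs->delay = 600;
    rs->next = now + rs->delay;
    rs->skipped = 0;
}

int main(void)
{
    struct rate_state rs = { 0, 0, 0 };

    /* a burst of identical messages: only a handful should reach the console */
    for (int i = 0; i < 1000; i++)
        warn_limited(&rs, "MGS: there are no more free slots in catalog");
    return 0;
}

With a working limiter, a burst like the one in the console excerpt above would collapse to a few lines plus a "skipped N similar messages" summary instead of hundreds of repeats.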

Attachments

Issue Links

Activity

[LU-11373] CERROR/CWARN messages are not throttled

adilger Andreas Dilger added a comment -

Any patches failing due to this issue need to be rebased to pick up the fix.
pjones Peter Jones added a comment -

Landed for 2.12

gerrit Gerrit Updater added a comment -

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33474/
Subject: LU-11373 tests: increase debug limit sanity 60b
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4afee32188d38d2969ea160ddec64cd8359e12ca

gerrit Gerrit Updater added a comment -

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33474
Subject: LU-11373 tests: increase debug limit sanity 60b
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8a6f5af4b66815f97e7437a1d4d98b49f88a43e2

jamesanunez James Nunez (Inactive) added a comment -

Looking at the failures so far, ZFS appears to be the common factor. Looking at all sanity test_60b results from the past three days, none of the ldiskfs test sessions, including DNE sessions, had any failures; all of the failures are from ZFS test sessions.

All of the failures report 102 or 103 debug lines. Although these failures include two extra lines from test_60aa (LU-11566), I'll submit a patch to increase the check from 100 to 103.

adilger Andreas Dilger added a comment -

It looks like the failure is 103 debug messages vs. the expected 100. I expect DNE is more verbose (it looks like the other MDTs are still mounted and complaining), so bumping this limit up a bit is fine.

jamesanunez James Nunez (Inactive) added a comment -

Reopened this ticket because we have at least two recent sanity test_60b failures after the patch for this ticket landed: https://testing.whamcloud.com/test_sets/e51fa37e-d6db-11e8-ad90-52540065bddc and https://testing.whamcloud.com/test_sets/45b22ad4-d6d4-11e8-ad90-52540065bddc

bruno Bruno Faccini (Inactive) added a comment -

Just in case: I have just experienced an Oops due to a soft lockup (i.e. with /proc/sys/kernel/softlockup_panic set to 1). My crash-dump analysis shows all cores but one stuck in stop_machine()/stop_cpu() during module unload, while the last core had been printing console messages for too long, right after sanity/test_60a/llog_test_10() and its catalog-full simulation. This suggests that the root cause could also be this console message rate-limiting failure/regression.

In case working rate limiting is not enough to fix this, I also wonder whether the number of llog record creation/write attempts could be slightly reduced in sanity/test_60a/llog_test_10(). I have also found a report of a specific module-unload/stop_machine() scenario and fix at https://lists.openwall.net/linux-kernel/2015/08/19/367 which could be of interest.
pjones Peter Jones added a comment -

Landed for 2.12

gerrit Gerrit Updater added a comment -

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33154/
Subject: LU-11373 libcfs: fix wrong check in libcfs_debug_vmsg2()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4037c1462730c120af00bd0e95b08fe0669e0271
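
As context for the patch above: its subject says the problem was a wrong check in libcfs_debug_vmsg2(). Purely as an illustration (not the actual Lustre code or the actual defect), the fragment below shows how a single inverted time comparison in a rate-limit check defeats the limiter entirely, so every repeated message reaches the console. The names (limiter, should_print) and the specific form of the bug are assumptions made for this example.

#include <stdio.h>
#include <time.h>

struct limiter {
    time_t next;   /* earliest time the next message may be printed */
};

static int should_print(struct limiter *l, time_t now)
{
    /* ILLUSTRATIVE BUG: the comparison is inverted, so the limiter only
     * suppresses messages that arrive after the delay has expired and
     * lets every early repeat straight through to the console. */
    if (l->next != 0 && now >= l->next)
        return 0;                 /* intended check: suppress when now < next */

    l->next = now + 60;           /* re-arm a 60-second delay */
    return 1;
}

int main(void)
{
    struct limiter l = { 0 };
    time_t now = time(NULL);
    int printed = 0;

    /* 1000 identical messages in the same second: a correct limiter
     * would print one; this buggy check prints all of them. */
    for (int i = 0; i < 1000; i++)
        if (should_print(&l, now))
            printed++;

    printf("printed %d of 1000 messages\n", printed);
    return 0;
}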

People

Assignee: adilger Andreas Dilger
Reporter: adilger Andreas Dilger
Votes: 0
Watchers: 5

Dates

Created:
Updated:
Resolved: