LU-11373: CERROR/CWARN messages are not throttled

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version/s: Lustre 2.12.0
    • Fix Version/s: Lustre 2.12.0

    Description

      It appears that CDEBUG_LIMIT is not working properly, since test logs in Maloo are full of repeated error messages. On a large cluster, many clients and servers spewing repeated messages like this would be a nightmare.

      For example, sanity test_60a on the MDS console:

      [ 2786.154265] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [ 2786.155427] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [ 2786.156482] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [ 2786.157628] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [ 2786.158671] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [ 2786.159789] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [ 2786.160824] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [ 2786.161934] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [ 2786.162977] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [ 2786.164074] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [ 2786.165111] Lustre: 30107:0:(llog_cat.c:98:llog_cat_new_log()) MGS: there are no more free slots in catalog e973af35
      [repeats hundreds of times]
      

      It might be related to the ktime_t patches that James landed previously, but that is just speculation; I haven't investigated it yet.
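
      To make the suspected failure mode concrete, below is a minimal userspace sketch of a CDEBUG_LIMIT-style throttle. This is not the libcfs implementation; the names (cdebug_limit_ok(), THROTTLE_INTERVAL_SEC) are hypothetical, and the bug shown (adding a seconds value to a nanosecond clock after a time-unit conversion) is just one way such a throttle could silently stop throttling:

      #include <stdio.h>
      #include <time.h>

      #define THROTTLE_INTERVAL_SEC 300   /* print at most once per 5 minutes */

      /* Monotonic clock in nanoseconds. */
      static long long now_ns(void)
      {
              struct timespec ts;

              clock_gettime(CLOCK_MONOTONIC, &ts);
              return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
      }

      /* Return 1 if the message should be emitted, 0 if it should be dropped. */
      static int cdebug_limit_ok(long long *next_ns)
      {
              long long now = now_ns();

              if (now < *next_ns)
                      return 0;       /* still inside the quiet period */

              /*
               * The hypothetical bug: adding seconds to a nanosecond clock.
               * The quiet period becomes 300 ns instead of 300 s, so
               * virtually every message passes the check.  The correct
               * line would be:
               *     *next_ns = now + (long long)THROTTLE_INTERVAL_SEC * 1000000000LL;
               */
              *next_ns = now + THROTTLE_INTERVAL_SEC;
              return 1;
      }

      int main(void)
      {
              static long long next_ns;
              int printed = 0, i;

              for (i = 0; i < 1000; i++)
                      if (cdebug_limit_ok(&next_ns))
                              printed++;

              /* Buggy: nearly all 1000 print.  Fixed: only the first prints. */
              printf("emitted %d of 1000 messages\n", printed);
              return 0;
      }

      With a mix-up like that, the quiet period shrinks from 300 s to 300 ns, so effectively every message passes the check, which would match the flood in the console log above.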

      Sanity test_60b should be catching the failure of CDEBUG_LIMIT(), but it is checking the logs on the client, while the test is being run on the MGS.
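
      For reference, the check in test_60b amounts to counting matching console lines and comparing the count against a fixed limit. The real test is shell code in sanity.sh; the sketch below is a hypothetical standalone illustration of that kind of check, with the pattern string and LIMIT value taken from this ticket. The important part is that the count has to be taken on the node that actually generates the messages (the MGS here), not on the client:

      #include <stdio.h>
      #include <string.h>

      #define LIMIT 100   /* the patch discussed in the comments raises this to 103 */

      int main(void)
      {
              char line[1024];
              int count = 0;

              /* Feed the console log of the message-generating node on stdin. */
              while (fgets(line, sizeof(line), stdin) != NULL)
                      if (strstr(line, "there are no more free slots in catalog"))
                              count++;

              if (count > LIMIT) {
                      fprintf(stderr, "FAIL: %d messages, expected at most %d\n",
                              count, LIMIT);
                      return 1;
              }
              printf("PASS: %d messages within limit %d\n", count, LIMIT);
              return 0;
      }

      Run against the MDS console excerpt above, a working throttle keeps the count at or under the limit; the flood shown would fail it immediately.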

Activity

    adilger Andreas Dilger added a comment -

    Any patches failing due to this issue need to rebase to get the fix.

    pjones Peter Jones added a comment -

    Landed for 2.12

    gerrit Gerrit Updater added a comment -

    Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33474/
    Subject: LU-11373 tests: increase debug limit sanity 60b
    Project: fs/lustre-release
    Branch: master
    Current Patch Set:
    Commit: 4afee32188d38d2969ea160ddec64cd8359e12ca

    gerrit Gerrit Updater added a comment -

    James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33474
    Subject: LU-11373 tests: increase debug limit sanity 60b
    Project: fs/lustre-release
    Branch: master
    Current Patch Set: 1
    Commit: 8a6f5af4b66815f97e7437a1d4d98b49f88a43e2

    jamesanunez James Nunez (Inactive) added a comment -

    Looking at the failures so far, ZFS appears to be the common factor: of all the sanity test_60b sessions from the past three days, none of the ldiskfs test sessions, including the DNE sessions, had any failures, while all of the failures come from ZFS test sessions.

    All of the failing runs report 102 or 103 debug lines. Although these counts include two extra lines from test_60aa (LU-11566), I'll submit a patch to increase the check from 100 to 103.

    adilger Andreas Dilger added a comment -

    The failure is 103 debug messages vs. the expected 100. DNE can be expected to be more verbose (it looks like the other MDTs are still mounted and complaining), so bumping this limit up a bit is fine.

People

    Assignee: Andreas Dilger
    Reporter: Andreas Dilger
    Votes: 0
    Watchers: 5
