
changelog catalog corruption if all possible records is define

    Description

      After our last Lustre upgrade, several Lustre file systems at the
      tera100 and tgcc sites hit the same corruption of the changelog_catalog.
      The robinhood node panics as in LU-6471, and the crash analysis
      shows that the changelog_catalog file is corrupted.
      The file is larger than the maximum size for this type of file, and
      the record that triggers the panic is not in the right place.

          Activity

            [LU-6556] changelog catalog corruption if all possible records is define

            adilger Andreas Dilger added a comment - Backports of the http://review.whamcloud.com/14912 patch also need the http://review.whamcloud.com/17052 patch, "LU-7329 obdclass: sync device to flush journal callbacks", to avoid introducing test failures in sanity test_60a.

            jgmitter Joseph Gmitter (Inactive) added a comment - Landed to 2.8

            bfaccini Bruno Faccini (Inactive) added a comment - LU-7340 has been created to address the previous ChangeLogs-related concerns and more graceful handling of ENOSPC conditions.

            adilger Andreas Dilger added a comment - Bruno, that is fine. Please file a separate bug and copy over relevant comments before closing this one, so that they are not forgotten.

            bfaccini Bruno Faccini (Inactive) added a comment - Andreas, Robert, I also think that your concerns are good points for further ChangeLogs-related enhancements, but they should be addressed in a separate ticket, and this ticket could now be closed. Do you agree?

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14912/
            Subject: LU-6556 obdclass: re-allow catalog to wrap around
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 4691290f6d39bffaa3e463697fbc3ac351015e76
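
The patch subject refers to letting the catalog wrap around. As a rough illustration of that idea only, here is a toy model in Python. It is not Lustre code: `ToyCatalog`, its fields, and the rule "refuse when the next slot is still in use" are assumptions made for this sketch, and the real on-disk catalog format is not modeled.

```python
import errno


class ToyCatalog:
    """Toy fixed-size catalog whose slots are reused after wrap-around."""

    def __init__(self, nslots):
        self.slots = [None] * nslots   # None marks a free slot
        self.last_idx = -1             # index of the most recently used slot

    def add(self, record):
        """Store a record in the next slot, wrapping to slot 0 past the end."""
        idx = (self.last_idx + 1) % len(self.slots)
        if self.slots[idx] is not None:
            # The next slot is still referenced: refuse instead of overwriting.
            raise OSError(errno.ENOSPC, "catalog full")
        self.slots[idx] = record
        self.last_idx = idx
        return idx

    def cancel(self, idx):
        """Free a slot once its record is no longer needed."""
        self.slots[idx] = None
```

In this model a full catalog reports ENOSPC rather than growing past its maximum size or overwriting a live record, and a slot becomes usable again only after it has been cancelled.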

            rread Robert Read added a comment - I suggest going a step further and proactively removing stale watchers after a configurable period, or when hitting a max watermark, to try to avoid running out of space. Also, being unregistered is a reasonable notification to the application that it has lost its changelog feed and needs to resync.

            adilger Andreas Dilger added a comment - I think the other thing that is needed here is to automatically unregister ChangeLog watcher(s) if the changelog is full or the MDS runs out of space (by default), or block all MDS operations until the ChangeLog can be written (if /proc tunable is set to make ChangeLog updates mandatory). It should unregister starting with the oldest watcher on the assumption that the older watcher was forgotten and newer ones are still running, and that this will release the most space. The unregistration should cancel records up to the next watcher, or all remaining records if no other watchers are left.
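
The unregistration policy proposed above can be sketched as a small Python function. This is a hedged illustration, not an existing Lustre interface: `unregister_oldest`, the `watchers` mapping, and the reading of "oldest watcher" as "the watcher whose last cleared record index is lowest" are all assumptions made for the example.

```python
def unregister_oldest(watchers, last_record):
    """Drop the watcher that is furthest behind.

    watchers: dict mapping a watcher id to the index of the last record
              that watcher has cleared (hypothetical bookkeeping).
    last_record: index of the newest record in the log.

    Returns (watcher_id, purge_to); records with index <= purge_to can
    then be cancelled.
    """
    if not watchers:
        raise ValueError("no watcher registered")
    # Assumption: the watcher with the lowest cleared index is the stale one.
    victim = min(watchers, key=watchers.get)
    del watchers[victim]
    # Cancel up to the next watcher, or everything if none is left.
    purge_to = min(watchers.values()) if watchers else last_record
    return victim, purge_to
```

For example, with watchers at record indexes 10, 480, and 500 and the newest record at 1000, the first call drops the watcher at 10 and allows cancelling up to record 480. A caller could repeat the call until enough space has been released.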

            bfaccini Bruno Faccini (Inactive) added a comment - I have created LU-6634 because, while testing my patch for this ticket (LU-6556), I wanted to check what happens (an ENOSPC return is expected!) after the Catalog has looped back and filled up, but I got a "(osd_handler.c:901:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed:" LBUG. I have identified the cause: in the error path of llog_cat_new_log(), llog_destroy() is called to destroy the new plain LLOG whose reference can't be recorded in the Catalog because no slot is available. This triggers the assertion because llog_cat_add() has already started a transaction when llog_destroy() wants to start its own.
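
The failure pattern described in this comment, a cleanup routine starting its own transaction while the caller's transaction is still open, can be reproduced in a toy Python model. Nothing here is Lustre code: `ToyTransaction`, `destroy_log`, and both `add_with_*` functions are hypothetical names, and the deferred ordering is only one possible way to avoid the nesting, not necessarily what LU-6634 does.

```python
import threading

_current = threading.local()   # stands in for the per-thread journal_info


class ToyTransaction:
    """Context manager that refuses to nest, like the assertion in the report."""

    def __enter__(self):
        if getattr(_current, "journal_info", None) is not None:
            # A transaction is already open on this thread.
            raise RuntimeError("nested transaction")
        _current.journal_info = self
        return self

    def __exit__(self, *exc):
        _current.journal_info = None
        return False   # never swallow exceptions


def destroy_log(name, destroyed):
    """Cleanup that needs its own transaction (toy stand-in for a destroy call)."""
    with ToyTransaction():
        destroyed.append(name)


def add_with_inline_cleanup(destroyed):
    """Failure pattern: cleanup runs while the caller's transaction is open."""
    with ToyTransaction():
        destroy_log("new-plain-log", destroyed)   # raises RuntimeError


def add_with_deferred_cleanup(destroyed):
    """One possible ordering: record the cleanup, run it after the transaction."""
    pending = []
    with ToyTransaction():
        pending.append("new-plain-log")
    for name in pending:
        destroy_log(name, destroyed)
```

Calling `add_with_inline_cleanup([])` raises `RuntimeError`, the toy counterpart of the LBUG, while `add_with_deferred_cleanup` completes because each transaction is closed before the next one starts.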

            gerrit Gerrit Updater added a comment - Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/14912
            Subject: LU-6556 obdclass: re-allow catalog loopback
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8112eb1a2d3dd5beed00c75d76295ae1362ab19a

            People

              bfaccini Bruno Faccini (Inactive)
              apercher Antoine Percher