
LU-6556: changelog catalog corruption if all possible records are defined


    Description

      After our last Lustre upgrade, on the Tera100 and TGCC sites, some
      Lustre filesystems have hit the same corruption of the changelog_catalog.
      The Robinhood node panics as in LU-6471, and the crash analysis
      shows that it is the changelog_catalog file that is corrupted:
      the file is bigger than the maximum size for this type of file,
      and the record that produces the panic is not in the right place.

    Activity

            rread Robert Read added a comment -

            I suggest going a step further and proactively removing stale watchers after a configurable period, or when hitting a max watermark, to try to avoid running out of space. Also, being unregistered is a reasonable notification to the application that it has lost its changelog feed and needs to resync.

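            A minimal user-space sketch of such a policy follows; every name and constant in it is hypothetical, not taken from the Lustre code:

```c
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical watcher record: the real ChangeLog user state lives in
 * the MDD layer and does not look like this. */
struct cl_watcher {
	int    cw_id;        /* registered user id */
	time_t cw_last_read; /* time of last changelog read/clear */
	bool   cw_active;
};

#define MAX_WATCHERS     2               /* illustrative watermark */
#define STALE_PERIOD_SEC (7 * 24 * 3600) /* configurable staleness period */

/* Deregister watchers idle longer than STALE_PERIOD_SEC; if still over
 * the watermark, evict the oldest remaining reader so its records can
 * be cancelled and space reclaimed before the log fills up. */
static void gc_stale_watchers(struct cl_watcher *w, int n, time_t now)
{
	int active = 0;
	struct cl_watcher *oldest = NULL;

	for (int i = 0; i < n; i++) {
		if (!w[i].cw_active)
			continue;
		if (now - w[i].cw_last_read > STALE_PERIOD_SEC) {
			w[i].cw_active = false; /* stale: feed is lost */
			continue;
		}
		active++;
		if (oldest == NULL || w[i].cw_last_read < oldest->cw_last_read)
			oldest = &w[i];
	}
	if (active > MAX_WATCHERS && oldest != NULL)
		oldest->cw_active = false; /* watermark eviction */
}

int main(void)
{
	time_t now = time(NULL);
	struct cl_watcher w[] = {
		{ 1, now - 30 * 24 * 3600, true }, /* idle a month: stale */
		{ 2, now - 60,             true },
		{ 3, now - 10,             true },
	};

	gc_stale_watchers(w, 3, now);
	for (int i = 0; i < 3; i++)
		printf("watcher %d active=%d\n", w[i].cw_id, w[i].cw_active);
	return 0;
}
```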

            adilger Andreas Dilger added a comment -

            I think the other thing that is needed here is to automatically unregister ChangeLog watcher(s) if the changelog is full or the MDS runs out of space (by default), or to block all MDS operations until the ChangeLog can be written (if a /proc tunable is set to make ChangeLog updates mandatory). It should unregister starting with the oldest watcher, on the assumption that the oldest watcher was forgotten while newer ones are still running, and that this will release the most space. The unregistration should cancel records up to the next watcher, or all remaining records if no other watchers are left.
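            A sketch of that eviction rule, again in hypothetical standalone C rather than the actual MDD ChangeLog-user code:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-user bookmark into the ChangeLog: each registered
 * user remembers the highest record index it has consumed/cleared. */
struct cl_user {
	int      cu_id;
	uint64_t cu_endrec;
	bool     cu_active;
};

/* On changelog-full/ENOSPC: auto-unregister the user with the lowest
 * bookmark (the oldest, presumably forgotten, watcher), then return
 * the index up to which records may be cancelled: the bookmark of the
 * next-oldest watcher, or UINT64_MAX if no watcher remains (meaning
 * all remaining records can go). */
static uint64_t evict_oldest_user(struct cl_user *u, int n)
{
	struct cl_user *oldest = NULL;
	uint64_t next_min = UINT64_MAX;

	for (int i = 0; i < n; i++)
		if (u[i].cu_active &&
		    (oldest == NULL || u[i].cu_endrec < oldest->cu_endrec))
			oldest = &u[i];
	if (oldest == NULL)
		return 0; /* nobody registered, nothing to release */
	oldest->cu_active = false; /* the auto-unregistration step */

	for (int i = 0; i < n; i++) /* bookmark of the next-oldest watcher */
		if (u[i].cu_active && u[i].cu_endrec < next_min)
			next_min = u[i].cu_endrec;
	return next_min;
}

int main(void)
{
	struct cl_user users[] = {
		{ 1,   100, true }, /* oldest bookmark: gets unregistered */
		{ 2, 90000, true },
		{ 3, 95000, true },
	};

	/* Records up to index 90000 can now be cancelled. */
	printf("purge up to %llu\n",
	       (unsigned long long)evict_oldest_user(users, 3));
	return 0;
}
```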

            bfaccini Bruno Faccini (Inactive) added a comment -

            I have created LU-6634 because, during my testing of my patch for this LU-6556 ticket, I wanted to check what happens (an ENOSPC return is expected!) after the Catalog has looped back and fills up, but I got a "(osd_handler.c:901:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed:" LBUG. I have identified the cause: in the error path of llog_cat_new_log(), llog_destroy() is called to destroy the new plain LLOG whose reference cannot be recorded in the Catalog because no slot is available, and this triggers the assertion because the transaction started by llog_cat_add() is still open when llog_destroy() wants to start its own.
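            A toy standalone model of why that assertion fires (purely illustrative; not the osd/llog code itself):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the osd_trans_start() assertion: the per-thread
 * journal_info must be NULL when a transaction starts, i.e.
 * transactions must not nest. */
static void *journal_info; /* stand-in for get_current()->journal_info */

static void trans_start(void)
{
	assert(journal_info == NULL); /* the LBUG fires here */
	journal_info = &journal_info; /* mark a transaction as running */
}

static void trans_stop(void)
{
	journal_info = NULL;
}

/* llog_destroy() starts its own transaction... */
static void toy_llog_destroy(void)
{
	trans_start();
	trans_stop();
}

int main(void)
{
	trans_start();      /* transaction opened for llog_cat_add() */
	/* error path: no Catalog slot is free to record the new plain
	 * llog, so the caller tries to destroy it again... */
	toy_llog_destroy(); /* ...and the nested trans_start() asserts */
	trans_stop();       /* never reached */
	return 0;
}
```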

            gerrit Gerrit Updater added a comment -

            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/14912
            Subject: LU-6556 obdclass: re-allow catalog loopback
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8112eb1a2d3dd5beed00c75d76295ae1362ab19a

            apercher Antoine Percher added a comment -

            Please find attached the customer trace analysis file (log_lu-6556_b.txt),
            which explains the changelog_catalog corruption.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Peter, no: in this particular case the LBUG, although similar to the one in LU-6471, is valid, since it is triggered by the ChangeLog Catalog corruption I explained in my previous comment.
            bfaccini Bruno Faccini (Inactive) added a comment - edited

            Assigning this to me, since I have been working on the issue while on-site with Antoine.

            In fact, after our joint analysis, it seems the crash occurred a short time after the upgrade+reboot of all nodes because of two combined factors:
            1) a ChangeLog Catalog that had already looped back;
            2) running with the LU-4528 patch, which seems to have introduced a regression where looped-back Catalogs are not handled correctly: they are expected to only grow, which is no longer the case after loop-back.

            I will push a patch to master soon, where the problem also seems to be present (but is still undetected, since triggering it requires a looped-back Catalog, which should not occur very frequently...).

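            A hypothetical sketch of wrap-aware index handling in a fixed-size catalog; the constant and helper below are illustrative, not the actual llog implementation:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative catalog capacity; the real bound comes from the llog
 * header bitmap, not from this constant. */
#define CAT_MAX_IDX 64767u

/* Once the catalog has looped back, indices wrap: the slot after
 * CAT_MAX_IDX is 1 again (slot 0 conventionally holds the header), so
 * any code assuming "next index > current index" breaks exactly in
 * the looped-back situation described above. */
static uint32_t cat_next_idx(uint32_t cur)
{
	return cur < CAT_MAX_IDX ? cur + 1 : 1; /* wrap, skip header slot */
}

int main(void)
{
	uint32_t cur = CAT_MAX_IDX;

	/* Prints "next after 64767 is 1": a monotonic comparison would
	 * wrongly treat this as corruption or never reuse freed slots. */
	printf("next after %u is %u\n", cur, cat_next_idx(cur));
	return 0;
}
```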
            pjones Peter Jones added a comment -

            Bruno

            Can you please confirm whether this does indeed meet the profile of LU-6471?

            Thanks

            Peter


            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: apercher Antoine Percher
              Votes: 0
              Watchers: 18
