
LustreError: 12934:0:(mdd_device.c:305:mdd_changelog_llog_init()) fsrzb-MDD0000: changelog init failed: rc = -5

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.5.4
    • lustre-2.5.4-4chaos_2.6.32_504.16.2.1chaos.ch5.3.x86_64.x86_64
    • 3

    Description

      lustre-2.5.4-4chaos_2.6.32_504.16.2.1chaos.ch5.3.x86_64.x86_64

      The MDS service on both porter and stout fails to start. We are able to import the ZFS pool on both systems with no problem. The MGS device mounts with no problem, but the MDT on both systems fails to mount. Doing a "writeconf" on the stout MDS did not help. The following console messages were reported on the stout-mds1 console:

      2015-08-02 16:38:26 Lustre: Lustre: Build Version: 2.5.4-4chaos-4chaos--PRISTINE-2.6.32-504.16.2.1chaos.ch5.3.x86_64
      2015-08-02 16:38:27 Lustre: MGC172.21.1.99@o2ib200: Connection restored to MGS (at 0@lo)
      2015-08-02 16:38:28 Lustre: MGS: Logs for fs fsrzb were removed by user request.  All servers must be restarted in order to regenerate the logs.
      2015-08-02 16:38:30 LustreError: 11-0: fsrzb-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
      2015-08-02 16:38:31 Lustre: 12934:0:(llog_cat.c:718:llog_cat_reverse_process()) catalog 0x2:10 crosses index zero
      2015-08-02 16:38:31 LustreError: 12934:0:(mdd_device.c:305:mdd_changelog_llog_init()) fsrzb-MDD0000: changelog init failed: rc = -5
      2015-08-02 16:38:31 LustreError: 12934:0:(mdd_device.c:380:mdd_changelog_init()) fsrzb-MDD0000: changelog setup during init failed: rc = -5
      2015-08-02 16:38:31 LustreError: 12934:0:(mdd_device.c:963:mdd_prepare()) fsrzb-MDD0000: failed to initialize changelog: rc = -5
      2015-08-02 16:38:31 Lustre: fsrzb-MDT0000: Unable to start target: -5
      2015-08-02 16:38:31 Lustre: Failing over fsrzb-MDT0000
      2015-08-02 16:38:32 Lustre: server umount fsrzb-MDT0000 complete
      2015-08-02 16:38:32 LustreError: 12934:0:(obd_mount.c:1331:lustre_fill_super()) Unable to mount  (-5)
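
When comparing failures across servers, it can help to reduce console output like the above to the failing functions and return codes. The following is a minimal sketch, assuming the "pid:0:(file.c:line:function()) message" layout seen in the lines quoted above. The helper names `parse_console_line` and `error_signature` are hypothetical and are not part of any Lustre tool.

```python
import re

# Matches the "pid:0:(file.c:line:function()) message" form seen in the
# console lines quoted in this ticket. Lines without a source location
# (for example "LustreError: 11-0: ...") are intentionally not matched.
LOG_RE = re.compile(
    r"(?P<level>LustreError|Lustre): (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)"
)
RC_RE = re.compile(r"rc = (-?\d+)")


def parse_console_line(line):
    """Return a dict of fields for one console line, or None if it does not match."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    rc = RC_RE.search(rec["msg"])
    # rc is None when the message carries no "rc = N" suffix.
    rec["rc"] = int(rc.group(1)) if rc else None
    return rec


def error_signature(lines):
    """Ordered (function, rc) pairs, ignoring timestamps, PIDs and device names."""
    parsed = (parse_console_line(l) for l in lines)
    return [(p["func"], p["rc"]) for p in parsed if p is not None]
```

Two console captures that yield the same `error_signature` list failed in the same functions with the same return codes, regardless of filesystem name or PID.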
      

      A workaround was found that allows the MDT to mount:

        1. Mount the MDT via ZPL.
        2. Delete the changelog_catalog and changelog_users files.
        3. Unmount.
        4. Mount the MDT via Lustre in the normal manner.
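
The file-removal step of the workaround above can be sketched as follows. This is a hedged illustration, not the procedure actually used on porter or stout. It assumes the MDT dataset is already mounted via ZPL and that the two files sit at the top of that mount point, which the ticket does not state. Instead of deleting outright, it moves the files into a backup directory so that copies remain available for analysis. The function name and both path arguments are hypothetical.

```python
import shutil
from pathlib import Path

# File names taken from the workaround described in this ticket.
CHANGELOG_FILES = ("changelog_catalog", "changelog_users")


def set_aside_changelog_files(zpl_root, backup_dir):
    """Move the changelog files out of a ZPL-mounted MDT into backup_dir.

    zpl_root is assumed to be the directory where the MDT dataset is
    mounted via ZPL; the files are assumed to sit directly under it.
    Returns the names of the files that were actually moved.
    """
    root = Path(zpl_root)
    backup = Path(backup_dir)
    backup.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in CHANGELOG_FILES:
        src = root / name
        if src.is_file():
            shutil.move(str(src), str(backup / name))
            moved.append(name)
    return moved
```

Mounting via ZPL beforehand, and unmounting and remounting via Lustre afterwards, are site-specific and are left out of the sketch.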

      Attachments

        1. changelog_catalog
          3.97 MB
        2. changelog_catalog.stout
          4.11 MB
        3. changelog_users
          8 kB

          Activity

            [LU-6954] LustreError: 12934:0:(mdd_device.c:305:mdd_changelog_llog_init()) fsrzb-MDD0000: changelog init failed: rc = -5

            Bruno,

            I confirmed that both filesystems produced the same sequence of error messages when attempting to start the MDT, including the "crosses index zero" and "changelog init failed" messages, with the same rc values.

            We do have the LU-4528 patch in our build.

            I'll attach the second changelog_catalog. The one you've already seen is from porter.

            ofaaland Olaf Faaland added the comment above.
            bfaccini Bruno Faccini (Inactive) added a comment - edited

            Hello Chris, thanks for warning me that two filesystems are affected. I should have read the description of this ticket more carefully, sorry about that.

            But if the same symptoms and messages occurred for both filesystem failures, I can already confirm that the "crosses index zero" message indicates a catalog loop-back, and I will also need the changelog_catalog file from the second filesystem to analyze it.
            We may have ended up in a situation where both catalogs have looped back, when only the first one was just doing so ... And this since the last filesystem restarts.

            By the way, I have double-checked the first catalog you provided, and I can confirm that it shows the same corruption (catalog records written past the normal end, leading to a catalog size > header+bitmap+records) as was found in LU-6556.

            As for my use of the phrase "just reached", it may come from the fact that, for a reason unexplained at the moment, bits at the beginning of the bitmap have been cleared (or maybe were never set).
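
The corruption criterion mentioned above (catalog size > header+bitmap+records) amounts to a simple size comparison, sketched below. All three layout parameters are caller-supplied assumptions. The ticket does not give the header size, record size or record capacity of a catalog, so none are hard-coded here. `header_bytes` stands for the header and bitmap together, and `bytes_past_layout` is a hypothetical helper name.

```python
def bytes_past_layout(file_size, header_bytes, record_bytes, max_records):
    """Bytes by which a catalog file exceeds its nominal layout.

    The nominal layout is header_bytes (header plus bitmap) followed by
    at most max_records fixed-size records of record_bytes each.
    Returns 0 when the file fits within that layout.
    """
    layout_limit = header_bytes + record_bytes * max_records
    return max(0, file_size - layout_limit)


# Toy numbers for illustration only, not real Lustre values:
# a 100-byte header and room for five 10-byte records.
overrun = bytes_past_layout(file_size=170, header_bytes=100,
                            record_bytes=10, max_records=5)
```

A non-zero result means that data was written beyond the last record slot the layout allows.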

            I find it exceptionally unlikely that two different filesystems had both independently "just reached" the end of the changelog_catalog and were just about to loop back at exactly the same time. More explanation is needed.

            morrone Christopher Morrone (Inactive) added the comment above.

            Thanks Olaf, here is what I can tell after analyzing the changelog_catalog file you provided.
            I am not able to confirm whether the Lustre v2.5.4 you are running contains the patch from LU-4528 (http://review.whamcloud.com/#/c/10108/, commit 7c243a561ffe8503a6abf5c4cafef0c3566192bc). Can you check this for me?
            If it does, and since your changelog_catalog had just reached its end and was about to loop back, I think you likely encountered the same kind of regression described in LU-6556.

            bfaccini Bruno Faccini (Inactive) added the comment above.
            ofaaland Olaf Faaland added a comment -

            Bruno,
            Sorry, yes, attached now.

            Hello Olaf,
            Did you keep copies of the changelog_catalog and changelog_users files that you can provide?

            bfaccini Bruno Faccini (Inactive) added the comment above.

            People

              bfaccini Bruno Faccini (Inactive)
              ofaaland Olaf Faaland