LustreError: 12934:0:(mdd_device.c:305:mdd_changelog_llog_init()) fsrzb-MDD0000: changelog init failed: rc = -5

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • None
    • Lustre 2.5.4
    • lustre-2.5.4-4chaos_2.6.32_504.16.2.1chaos.ch5.3.x86_64.x86_64
    • 3
    • 9223372036854775807

    Description

      lustre-2.5.4-4chaos_2.6.32_504.16.2.1chaos.ch5.3.x86_64.x86_64

      The MDS service fails to start on both porter and stout. The ZFS pool imports without problems on both systems, and the MGS device mounts without problems, but the MDT fails to mount on both systems. Running "writeconf" on the stout MDS did not help. The following messages were reported on the stout-mds1 console:

      2015-08-02 16:38:26 Lustre: Lustre: Build Version: 2.5.4-4chaos-4chaos--PRISTINE-2.6.32-504.16.2.1chaos.ch5.3.x86_64
      2015-08-02 16:38:27 Lustre: MGC172.21.1.99@o2ib200: Connection restored to MGS (at 0@lo)
      2015-08-02 16:38:28 Lustre: MGS: Logs for fs fsrzb were removed by user request.  All servers must be restarted in order to regenerate the logs.
      2015-08-02 16:38:30 LustreError: 11-0: fsrzb-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
      2015-08-02 16:38:31 Lustre: 12934:0:(llog_cat.c:718:llog_cat_reverse_process()) catalog 0x2:10 crosses index zero
      2015-08-02 16:38:31 LustreError: 12934:0:(mdd_device.c:305:mdd_changelog_llog_init()) fsrzb-MDD0000: changelog init failed: rc = -5
      2015-08-02 16:38:31 LustreError: 12934:0:(mdd_device.c:380:mdd_changelog_init()) fsrzb-MDD0000: changelog setup during init failed: rc = -5
      2015-08-02 16:38:31 LustreError: 12934:0:(mdd_device.c:963:mdd_prepare()) fsrzb-MDD0000: failed to initialize changelog: rc = -5
      2015-08-02 16:38:31 Lustre: fsrzb-MDT0000: Unable to start target: -5
      2015-08-02 16:38:31 Lustre: Failing over fsrzb-MDT0000
      2015-08-02 16:38:32 Lustre: server umount fsrzb-MDT0000 complete
      2015-08-02 16:38:32 LustreError: 12934:0:(obd_mount.c:1331:lustre_fill_super()) Unable to mount  (-5)
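
The return codes in the log above are negated errno values. A minimal sketch for decoding them, assuming Linux errno numbering (where 5 is EIO and 11 is EAGAIN); `rc_to_text` and `print_rc` are hypothetical helpers for reading logs, not part of Lustre:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Convert a kernel-style negative return code (e.g. -5) to its errno text. */
static const char *rc_to_text(int rc)
{
    return strerror(rc < 0 ? -rc : rc);
}

/* Print one "rc = N (text)" line for a return code copied from the console. */
static void print_rc(int rc)
{
    printf("rc = %d (%s)\n", rc, rc_to_text(rc));
}
```

Calling `print_rc(-5)` and `print_rc(-11)` would print the descriptions for the "rc = -5" changelog failures and the "-11" mds_connect failure respectively.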
      

      A workaround was found that allows the MDT to mount:
        1. Mount the MDT via ZPL.
        2. Delete the changelog_catalog and changelog_users files.
        3. Unmount.
        4. Mount the MDT via Lustre in the normal manner.

      Attachments

        1. changelog_catalog
          3.97 MB
        2. changelog_catalog.stout
          4.11 MB
        3. changelog_users
          8 kB

        Issue Links

          Activity


            jfc John Fuchs-Chesney (Inactive) added a comment -

            Thanks Bruno and Olaf.
            ~ jfc.

            ofaaland Olaf Faaland added a comment -

            Hi Bruno,
            Yes, I agree this should be closed as a dup of LU-6556. Thank you.
            -Olaf


            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Olaf,
            Do you agree that this ticket can be closed as a dup of LU-6556?
            Thanks again and in advance for your help and answer.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Olaf,
            In short, catalog wrap-around worked before the LU-4528 patch and no longer works after it. In addition, a catalog that had already wrapped around will be corrupted, because new records are written past the catalog's expected end-of-file size.

            ofaaland Olaf Faaland added a comment -

            Bruno,

            Looks to me like this code in llog_cat_new_log() implemented wrap-around before LU-4528. Please confirm I'm not misreading.

                    bitmap_size = LLOG_BITMAP_SIZE(llh);

                    index = (cathandle->lgh_last_idx + 1) % bitmap_size;
                    ...
                    cathandle->lgh_last_idx = index;
                    llh->llh_tail.lrt_index = index;

            thanks,
            Olaf
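
The modulo step in the snippet quoted above can be illustrated in isolation. This is a minimal sketch, not the Lustre implementation: `next_catalog_index` is a hypothetical helper, and the bitmap size of 64768 used in the usage note is an assumed value that does not appear in this ticket.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: advance a catalog index by one slot, wrapping
 * back to the start once the end of the bitmap is reached.
 */
static uint32_t next_catalog_index(uint32_t last_idx, uint32_t bitmap_size)
{
    assert(bitmap_size > 0);          /* avoid division by zero */
    return (last_idx + 1) % bitmap_size;
}
```

With the assumed bitmap size, `next_catalog_index(12196, 64768)` gives 12197, while `next_catalog_index(64767, 64768)` wraps to 0. How the real code treats index 0 after wrapping is outside the quoted snippet.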

            ofaaland Olaf Faaland added a comment -

            I guess what I'm really asking is, did changelog_catalogs wrap around prior to the LU-4528 patch? Some comment made me think so, maybe I misunderstood.

            thanks,
            Olaf

            ofaaland Olaf Faaland added a comment -

            Bruno,

            I can see how we could end up with the changelog_catalog file corruption if, before we upgraded to LU-4528 code, our changelog_catalog was already wrapped around, so that lgh_last_idx == 12196 and changelog_catalog size == 4,153,280. I think this is what you are saying happened.

            However, in the LU-4528 patch, and in the previous code it applied to, I don't see anything implementing changelog_catalog wrap-around, that is, setting lgh_last_idx in some way other than incrementing it or setting it to 0 when creating changelog_catalog for the first time. Do you?

            thanks,
            Olaf

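
The two numbers in the comment above (lgh_last_idx == 12196 and a file size of 4,153,280) are consistent with a wrapped catalog under some assumed on-disk constants. The sketch below assumes an 8192-byte header chunk, 96 header bytes not used by the bitmap, and 64-byte catalog records; none of these values appear in this ticket, and all helper names are hypothetical. Under those assumptions a completely filled catalog ends at byte 4,153,280, whereas a catalog that had only ever reached index 12196 would end at byte 788,736.

```c
#include <assert.h>
#include <stdint.h>

/* All three constants are assumptions for illustration, not taken from the ticket. */
#define ASSUMED_CHUNK_SIZE 8192u /* header chunk at the start of the file */
#define ASSUMED_HDR_FIELDS   96u /* header bytes not used by the bitmap */
#define ASSUMED_REC_SIZE     64u /* one catalog record */

/* Number of slots the header bitmap can track (index 0 is assumed to be the header). */
static uint32_t assumed_bitmap_bits(void)
{
    return (ASSUMED_CHUNK_SIZE - ASSUMED_HDR_FIELDS) * 8u;
}

/* File offset just past the record stored at catalog index idx (idx >= 1). */
static uint64_t catalog_end_offset(uint32_t idx)
{
    assert(idx >= 1 && idx < assumed_bitmap_bits());
    return ASSUMED_CHUNK_SIZE + (uint64_t)idx * ASSUMED_REC_SIZE;
}

/* A file longer than the end of the last-written index suggests wrap-around. */
static int size_suggests_wrap(uint64_t file_size, uint32_t last_idx)
{
    return file_size > catalog_end_offset(last_idx);
}
```

For example, `size_suggests_wrap(4153280, 12196)` is true under these assumptions, because the file extends well past the end of record 12196.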

            People

              bfaccini Bruno Faccini (Inactive)
              ofaaland Olaf Faaland