Lustre / LU-4481

Impossible to start changelogs after corruption

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • None
    • Lustre 2.1.6
    • None
    • 3
    • 12272

    Description

      Hi,

      On a customer cluster, changelogs refuse to start, probably because of internal data corruption.
      Here are the messages we can see when mounting the MDT:

      1373184833 2013 Jul 7 10:13:53 bcluster111 kern err kernel LustreError: 14143:0:(llog_lvfs.c:199:llog_lvfs_read_header()) bad log header magic: 0x10670000 (expected 0x10645539)
      1373184833 2013 Jul 7 10:13:53 bcluster111 kern err kernel LustreError: 14143:0:(llog_obd.c:320:cat_cancel_cb()) Cannot find handle for log 0x1490186b: -5
      1373184833 2013 Jul 7 10:13:53 bcluster111 kern err kernel LustreError: 14133:0:(llog_obd.c:393:llog_obd_origin_setup()) llog_process() with cat_cancel_cb failed: -5
      1373184833 2013 Jul 7 10:13:53 bcluster111 kern err kernel LustreError: 14133:0:(llog_obd.c:220:llog_setup_named()) obd mdd_obd-scratch3-MDT0000 ctxt 14 lop_setup=ffffffffa0501cc0 failed -5
      1373184833 2013 Jul 7 10:13:53 bcluster111 kern err kernel LustreError: 14133:0:(mds_log.c:218:mds_changelog_llog_init()) changelog users llog setup failed -5
      1373184835 2013 Jul 7 10:13:55 bcluster111 kern err kernel LustreError: 14133:0:(mdd_device.c:216:mdd_changelog_llog_init()) no changelog user context
      1373184835 2013 Jul 7 10:13:55 bcluster111 kern err kernel LustreError: 14133:0:(mdd_device.c:254:mdd_changelog_init()) Changelog setup during init failed -22
      1373184835 2013 Jul 7 10:13:55 bcluster111 kern warning kernel Lustre: scratch3-MDT0000: used disk, loading
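
For reference, the negative return codes in these messages are standard Linux errno values: -5 is EIO and -22 is EINVAL. Below is a minimal Python sketch, not part of Lustre, that extracts the return code and the header magic values from such log lines. The helper names `decode_rc` and `magic_mismatch` are hypothetical and exist only for this illustration.

```python
import errno
import re

# Matches a trailing negative return code, as in "... failed: -5",
# "... failed -22" or "... 0x1490186b: -5".
RC_PATTERN = re.compile(r"(?:failed:?|:)\s*(-\d+)\s*$")

# Matches the llog header check message and captures both magic values.
MAGIC_PATTERN = re.compile(
    r"bad log header magic: (0x[0-9a-fA-F]+) \(expected (0x[0-9a-fA-F]+)\)"
)


def decode_rc(line):
    """Return (rc, errno name) for a trailing negative code, or None."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    rc = int(match.group(1))
    return rc, errno.errorcode.get(-rc, "UNKNOWN")


def magic_mismatch(line):
    """Return (found, expected) magic values as integers, or None."""
    match = MAGIC_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)
```

Applied to the excerpt above, this kind of parsing shows that the first failure is the header magic check in llog_lvfs_read_header(), and that the later -5 and -22 codes are returned by the functions that call into it.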
      

      So the MDT is started, but without changelogs.

      And if we try to look at changelog_users with lctl:

      # lctl get_param mdd.scratch3-MDT0000.changelog_users
      error: get_param: read('/proc/fs/lustre/mdd/scratch3-MDT0000/changelog_users') failed: No such device or address
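
The message "No such device or address" is the Linux text for ENXIO. A monitoring script could detect this state by trying to read the same file. The sketch below is only an illustration under that assumption: `changelog_users_status` is a hypothetical helper, and the path is the one shown in the error above.

```python
import errno


def changelog_users_status(path):
    """Try to read the changelog_users file.

    Returns ("ok", contents) on success, or (errno name, None) when the
    read fails, for example "ENXIO" in the situation described here.
    """
    try:
        with open(path) as handle:
            return "ok", handle.read()
    except OSError as exc:
        return errno.errorcode.get(exc.errno, "UNKNOWN"), None


if __name__ == "__main__":
    # Path taken from the lctl error message; adjust for another MDT.
    status, _ = changelog_users_status(
        "/proc/fs/lustre/mdd/scratch3-MDT0000/changelog_users"
    )
    print(status)
```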
      

      The problem is the customer needs Lustre changelogs because they are consumed by Robinhood to monitor activity on the file system.

      So the first thing we need is a way to restart changelogs as soon as possible. We have already tried every administrative Lustre command (lfs or lctl) to clean things up, but none of them worked because the feature did not start. We have not tried manually cleaning the OBJECTS files, for fear of making the situation even worse.

      Once changelogs have been restarted on site, we will need a fix so that changelogs can cope with corrupted data and start afresh in that case.

      But again, the very first thing we need is a helping hand to clean things up on site and restart changelogs ASAP.

      TIA,
      Sebastien.

          People

            Assignee: bfaccini Bruno Faccini (Inactive)
            Reporter: sebastien.buisson Sebastien Buisson (Inactive)