Lustre / LU-16182

llog_osd_prev_block() ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.10, Lustre 2.15.2
    • Labels: None
    • Severity: 3

    Description

      There is an LBUG triggered during MDT mount due to corruption in the Changelog catalog/log file:

      LustreError: 29516:0:(llog_osd.c:1075:llog_osd_prev_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed:
      LustreError: 29516:0:(llog_osd.c:1075:llog_osd_prev_block()) LBUG
      Pid: 29516, comm: mount.lustre
      Call Trace:
      libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      lbug_with_loc+0x45/0xc0 [libcfs]
      llog_osd_prev_block+0x9f7/0xaf0 [obdclass]
      llog_reverse_process+0x147/0xac0 [obdclass]
      ? changelog_init_cb+0x0/0x1f0 [mdd]
      llog_cat_reverse_process_cb+0x157/0x540 [obdclass]
      llog_reverse_process+0x269/0xac0 [obdclass]
      llog_cat_reverse_process+0x199/0x2d0 [obdclass]
      mdd_prepare+0x1269/0x1a00 [mdd]
      mdt_prepare+0x51/0x3b0 [mdt]
      server_start_targets+0x2574/0x2e10 [obdclass]
      server_fill_super+0x108d/0x184c [obdclass]
      lustre_fill_super+0x328/0x950 [obdclass]
      mount_nodev+0x4d/0xb0
      lustre_mount+0x38/0x60 [obdclass]
      mount_fs+0x39/0x1b0
      vfs_kern_mount+0x5f/0xf0
      do_mount+0x24e/0xa40
      

      The kernel code should not have an LASSERT() check for data that is read from disk, so this should be removed and replaced with error handling:

                      LASSERT(last_rec->lrh_index == tail->lrt_index);
      

      The llog_osd_prev_block() function already returns errors to the caller in many other places, so it looks (at first glance) like the LASSERT() can be replaced with an error message that prints the llog record number and FID, followed by an error return that stops changelog processing. The caller could then either clear this record or delete the whole changelog.
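A minimal userspace sketch of the proposed check, not the actual patch. The structs are simplified stand-ins for the on-disk record header and tail, the helper name llog_check_last_rec() is hypothetical, and -EINVAL is an assumed error code. It prints only the two indices; a real fix would also report the llog FID as described above.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified userspace stand-ins for the on-disk llog record header/tail. */
struct llog_rec_hdr {
	uint32_t lrh_len;
	uint32_t lrh_index;
	uint32_t lrh_type;
	uint32_t lrh_id;
};

struct llog_rec_tail {
	uint32_t lrt_len;
	uint32_t lrt_index;
};

/*
 * Hypothetical helper: return 0 when the last record header and the block
 * tail agree on the index, or -EINVAL (an assumed error code) when they do
 * not. This replaces a crash with an error the caller can act on.
 */
static int llog_check_last_rec(const struct llog_rec_hdr *last_rec,
			       const struct llog_rec_tail *tail)
{
	if (last_rec->lrh_index != tail->lrt_index) {
		fprintf(stderr,
			"llog: corrupt block: lrh_index %u != lrt_index %u\n",
			(unsigned int)last_rec->lrh_index,
			(unsigned int)tail->lrt_index);
		return -EINVAL;
	}
	return 0;
}
```

The caller would propagate the non-zero return instead of hitting an LBUG, leaving the decision to skip the record or drop the changelog to a higher layer.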

      In addition, changelog_init_cb() has LASSERT() checks on the llog records that could fail if the records are corrupted:

              LASSERT(llh->lgh_hdr->llh_flags & LLOG_F_IS_PLAIN);
              LASSERT(rec->cr_hdr.lrh_type == CHANGELOG_REC);
      

      These should also be replaced with error handling.
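The same idea applied to the changelog_init_cb() checks, as a hedged sketch. The function name changelog_rec_sanity() is hypothetical, the SKETCH_ constants are illustrative stand-ins for LLOG_F_IS_PLAIN and CHANGELOG_REC (the authoritative values are in the Lustre headers), and -EINVAL is an assumed error code.

```c
#include <errno.h>
#include <stdint.h>

/* Illustrative stand-in values; the real definitions live in Lustre headers. */
#define SKETCH_LLOG_F_IS_PLAIN 0x4u
#define SKETCH_CHANGELOG_REC   0x10660000u

/*
 * Hypothetical helper: validate the two conditions that changelog_init_cb()
 * currently asserts, returning an error instead of crashing.
 *   llh_flags: flags from the llog header (llh->lgh_hdr->llh_flags)
 *   lrh_type:  record type from the record header (rec->cr_hdr.lrh_type)
 */
static int changelog_rec_sanity(uint32_t llh_flags, uint32_t lrh_type)
{
	if (!(llh_flags & SKETCH_LLOG_F_IS_PLAIN))
		return -EINVAL;	/* not a plain llog */
	if (lrh_type != SKETCH_CHANGELOG_REC)
		return -EINVAL;	/* not a changelog record */
	return 0;
}
```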

      The llog_cat_reverse_process(changelog_init_cb) handling appears to be finding the highest changelog index currently in use. If this reverse llog processing fails, the last changelog index may be lost. Options include doing "forward" llog processing, though this may suffer from the same problem, or using the changelog_users file to at least start with a changelog index higher than what the users have processed (e.g. current_index = max(user_index) + 10M or similar).
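The fallback arithmetic from the last option can be sketched as follows. This is only an illustration of current_index = max(user_index) + 10M; the function name, the array-based interface, and the margin constant are assumptions, and reading the changelog_users file is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* The "10M" safety margin suggested in the text; the exact value is open. */
#define SKETCH_INDEX_MARGIN 10000000ULL

/*
 * Hypothetical helper: given the last index processed by each registered
 * changelog user, pick a starting index safely above all of them.
 * With no users, the result is just the margin. Overflow near UINT64_MAX
 * is not handled in this sketch.
 */
static uint64_t changelog_fallback_index(const uint64_t *user_index,
					 size_t count)
{
	uint64_t max = 0;

	for (size_t i = 0; i < count; i++) {
		if (user_index[i] > max)
			max = user_index[i];
	}
	return max + SKETCH_INDEX_MARGIN;
}
```

The margin trades wasted index space for safety: it only needs to exceed the number of records that could have been written after the slowest user's last processed index.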

            People

              Assignee: wc-triage WC Triage
              Reporter: adilger Andreas Dilger