  1. Lustre
  2. LU-12971

changelog_deregister: fail to cancel 0 of 1 llog-records

Details

    • Bug
    • Resolution: Won't Fix
    • Major
    • None
    • Lustre 2.12.3
    • None
    • CentOS 7.6 - Lustre 2.12.3 clients and servers
    • 3

    Description

      Hello,

      We have four MDTs on Fir, and Robinhood kept being blocked on one of them. So we decided to clear/unregister all four changelog readers and start over with a fresh filesystem scan and new readers. This went fine on three MDTs, but on the last one, fir-MDT0003, I am not able to run changelog_deregister successfully because of the following errors:

      Nov 14 11:39:23 fir-md1-s4 kernel: LustreError: 13794:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2
      Nov 14 11:39:23 fir-md1-s4 kernel: LustreError: 13794:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11438 of catalog [0x5:0xa:0x0]: rc = -2
      Nov 14 11:39:24 fir-md1-s4 kernel: Lustre: 14109:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero 

      A single changelog_deregister just hangs. If I launch a second one, the first eventually returns with the following error:

      [root@fir-md1-s4 ~]#  lctl --device fir-MDT0003 changelog_deregister cl1
      error: changelog_deregister: No such file or directory
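The "No such file or directory" message from lctl and the "rc = -2" in the kernel log look like the same error seen from two sides, since kernel return codes are negative errno values. A minimal sketch to check that mapping; the helper name `describe_rc` is made up for illustration:

```python
import errno
import os


def describe_rc(rc):
    """Map a negative kernel-style return code to (errno name, message).

    describe_rc is a hypothetical helper, not part of Lustre or lctl.
    """
    code = -rc  # kernel code returns -errno, so flip the sign
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # -2 is the value reported in the "rc = -2" log lines
    name, message = describe_rc(-2)
    print(f"rc = -2 -> {name}: {message}")
```

On Linux this reports ENOENT, which is the errno behind the lctl error text.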
      

      If I launch two changelog_deregister commands at the same time, each in an endless loop, the cancel idx does advance, but slowly (from 11428 to 11438 in about two hours):

      Nov 14 09:34:18 fir-md1-s4 kernel: LustreError: 11745:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11428 of catalog [0x5:0xa:0x0]: rc = -2
      Nov 14 09:34:19 fir-md1-s4 kernel: Lustre: 12314:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero
      Nov 14 10:02:41 fir-md1-s4 kernel: LustreError: 12314:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2
      Nov 14 10:02:41 fir-md1-s4 kernel: LustreError: 12314:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11431 of catalog [0x5:0xa:0x0]: rc = -2
      Nov 14 10:02:42 fir-md1-s4 kernel: Lustre: 12741:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero
      Nov 14 10:36:51 fir-md1-s4 kernel: LustreError: 11916:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2
      Nov 14 10:36:51 fir-md1-s4 kernel: LustreError: 11916:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11433 of catalog [0x5:0xa:0x0]: rc = -2
      Nov 14 10:36:52 fir-md1-s4 kernel: Lustre: 13213:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero
      Nov 14 10:53:58 fir-md1-s4 kernel: LustreError: 12741:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2
      Nov 14 10:53:58 fir-md1-s4 kernel: LustreError: 12741:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11435 of catalog [0x5:0xa:0x0]: rc = -2
      Nov 14 10:53:59 fir-md1-s4 kernel: Lustre: 13452:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero
      Nov 14 11:16:38 fir-md1-s4 kernel: LustreError: 13452:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2
      Nov 14 11:16:38 fir-md1-s4 kernel: LustreError: 13452:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11436 of catalog [0x5:0xa:0x0]: rc = -2
      Nov 14 11:16:39 fir-md1-s4 kernel: Lustre: 13794:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero
      Nov 14 11:39:23 fir-md1-s4 kernel: LustreError: 13794:0:(llog_cat.c:762:llog_cat_cancel_records()) fir-MDD0003: fail to cancel 0 of 1 llog-records: rc = -2
      Nov 14 11:39:23 fir-md1-s4 kernel: LustreError: 13794:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11438 of catalog [0x5:0xa:0x0]: rc = -2
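To put a number on how slowly the idx advances, the "cancel idx" lines can be parsed for their timestamp and index. This is only a rough sketch: the function names are made up for illustration, the regular expression is fitted to the log lines quoted above, and the year is an assumption because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Fitted to lines such as:
#   Nov 14 09:34:18 fir-md1-s4 kernel: LustreError: ... cancel idx 11428 of catalog [0x5:0xa:0x0]: rc = -2
CANCEL_RE = re.compile(
    r"^\s*(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s.*"
    r"cancel idx (?P<idx>\d+) of catalog"
)


def cancel_progress(lines, year=2019):
    """Return (timestamp, idx) pairs for every 'cancel idx' line.

    The year is an assumption; syslog lines do not include it.
    """
    points = []
    for line in lines:
        m = CANCEL_RE.search(line)
        if m:
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            points.append((ts, int(m["idx"])))
    return points


def records_per_hour(points):
    """Average idx advance per hour between the first and last point."""
    if len(points) < 2:
        return None
    (t0, i0), (t1, i1) = points[0], points[-1]
    hours = (t1 - t0).total_seconds() / 3600
    return (i1 - i0) / hours if hours > 0 else None


SAMPLE = [
    "Nov 14 09:34:18 fir-md1-s4 kernel: LustreError: 11745:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11428 of catalog [0x5:0xa:0x0]: rc = -2",
    "Nov 14 09:34:19 fir-md1-s4 kernel: Lustre: 12314:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0003: catlog [0x5:0xa:0x0] crosses index zero",
    "Nov 14 11:39:23 fir-md1-s4 kernel: LustreError: 13794:0:(mdd_device.c:371:llog_changelog_cancel()) fir-MDD0003: cancel idx 11438 of catalog [0x5:0xa:0x0]: rc = -2",
]

if __name__ == "__main__":
    rate = records_per_hour(cancel_progress(SAMPLE))
    print(f"idx advance: {rate:.1f} per hour")
```

For the first and last "cancel idx" lines above, this works out to a little under five index steps per hour.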
      

      We would like to clear all changelogs and start fresh on fir-MDT0003, because we have been running 2.12 since the beginning and might have corrupt changelogs there. How can we do that properly? Thanks!

      Attachments

        1. fir-md1-s4_chglog_clear_notes_20191122.txt
          9 kB
          Stephane Thiell
        2. fir-md1-s4_chglog.tar.gz
          2.10 MB
          Stephane Thiell

          People

            Assignee: tappro Mikhail Pershin
            Reporter: sthiell Stephane Thiell