
llog_osd.c:778:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed:

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.8.0
    • Labels: None
    • Environment: Hyperion SWL test
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      Running tip of master with SWL and DNE.

      2015-08-25 12:19:21 Lustre: lustre-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
      2015-08-25 12:19:21 LustreError: 6378:0:(llog_osd.c:778:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed:
      2015-08-25 12:19:21 LustreError: 6384:0:(llog_osd.c:788:llog_osd_next_block()) lustre-MDT000b-osp-MDT0001: invalid llog tail at log id 0x3:2147484674/0 offset 3407872
      2015-08-25 12:19:21 LustreError: 6384:0:(lod_dev.c:392:lod_sub_recovery_thread()) lustre-MDT000b-osp-MDT0001 getting update log failed: rc = -22
      2015-08-25 12:19:21 LustreError: 6378:0:(llog_osd.c:778:llog_osd_next_block()) LBUG
      2015-08-25 12:19:21 Pid: 6378, comm: lod0001_rec0005
      2015-08-25 12:19:21 Aug 25 12:19:21
      2015-08-25 12:19:21 iws12 kernel: LuCall Trace:
      2015-08-25 12:19:21 streError: 6378: [<ffffffffa04a2875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2015-08-25 12:19:21 0:(llog_osd.c:77 [<ffffffffa04a2e77>] lbug_with_loc+0x47/0xb0 [libcfs]
      2015-08-25 12:19:21 8:llog_osd_next_ [<ffffffffa08dad15>] llog_osd_next_block+0xb75/0xbf0 [obdclass]
      2015-08-25 12:19:21 block()) ASSERTI [<ffffffffa08ccb4e>] llog_process_thread+0xInitializing cgroup subsys cpuset
      2015-08-25 12:19:21 Initializing cgroup subsys cpu
      

      Attempting to recreate and get a dump
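
The failed assertion compares the index in the header of the last record of a chunk (`lrh_index`) with the index in the tail at the end of that chunk (`lrt_index`). The following is a minimal offline sketch of that consistency check, written in Python for use against a dumped chunk. It makes several assumptions that do not come from this ticket: a little-endian 16-byte record header (`lrh_len`, `lrh_index`, `lrh_type`, `lrh_id`), an 8-byte record tail (`lrt_len`, `lrt_index`), and the hypothetical helper names `make_record` and `check_chunk_tail`. It reports a mismatch as an error result rather than asserting, and it is not the code in `llog_osd.c`.

```python
import struct

# Assumed on-disk layouts (little-endian), for illustration only.
HDR = struct.Struct("<IIII")   # lrh_len, lrh_index, lrh_type, lrh_id
TAIL = struct.Struct("<II")    # lrt_len, lrt_index


def make_record(index, payload=b"", rec_type=0):
    """Build a synthetic record: header + payload + tail (hypothetical helper)."""
    length = HDR.size + len(payload) + TAIL.size
    return HDR.pack(length, index, rec_type, 0) + payload + TAIL.pack(length, index)


def check_chunk_tail(chunk):
    """Check that the last record's header agrees with the chunk's tail.

    Returns (ok, reason) instead of raising, so a caller can skip a bad
    chunk or abort cleanly.
    """
    if len(chunk) < HDR.size + TAIL.size:
        return False, "chunk too short"

    # The tail occupies the final bytes of the chunk.
    lrt_len, lrt_index = TAIL.unpack_from(chunk, len(chunk) - TAIL.size)
    if lrt_len < HDR.size + TAIL.size or lrt_len > len(chunk):
        return False, "invalid tail length"

    # The last record starts lrt_len bytes before the end of the chunk.
    lrh_len, lrh_index, _type, _id = HDR.unpack_from(chunk, len(chunk) - lrt_len)
    if lrh_len != lrt_len:
        return False, "length mismatch"
    if lrh_index != lrt_index:
        return False, "index mismatch"
    return True, "ok"
```

For example, `check_chunk_tail(make_record(1, b"x" * 8) + make_record(2, b"y" * 8))` passes, while overwriting the final four bytes of that chunk (the assumed `lrt_index` field) with a different value yields an "index mismatch" result.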

      Attachments

        1. LU-7039.llog.txt.gz
          3.53 MB
        2. lustre-log.1443755187.9078
          712 kB
        3. lola-10-lustre-log.1444148492.4548-dm-minus-one.log.bz2
          1021 kB
        4. console.log.bz2
          190 kB
        5. memory-counter-lola-11.dat.bz2
          25 kB
        6. messages-lola-11.log.bz2
          302 kB
        7. slab-details-lola-11.dat.bz2
          873 kB
        8. slab-details-one-file-per-slab.tar.bz2
          617 kB
        9. slab-total-lola-11.dat.bz2
          28 kB
        10. vmcore-dmesg.txt.bz2
          28 kB

          Activity

            [LU-7039] llog_osd.c:778:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed:

            jgmitter Joseph Gmitter (Inactive) added a comment -

            Landed for 2.8.0

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16969/
            Subject: LU-7039 llog: update llog header and size
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e1745ed18d8e28f3cf3d72df3b7ef50d83f36601
            sarah Sarah Liu added a comment -

            Another instance on master in DNE mode:
            https://testing.hpdd.intel.com/test_sets/b57ce146-bbfd-11e5-8506-5254006e85c2
            client and server: lustre-master build #3305
            di.wang Di Wang added a comment -

            This OOM is not caused by llog corruption, so it is a new problem (LU-7517).


            heckes Frank Heckes (Inactive) added a comment -

            It turned out that the collectl raw files are too big to be uploaded to Jira. I saved them to lola-1:/scratch/crashdumps/lu-7039.
            heckes Frank Heckes (Inactive) added a comment - - edited

            The error below occurred during soak testing of change 16838 patch set #31 (no Wiki entry for the build exists yet) on cluster lola. DNE is enabled and the MDSes are configured in an active-active HA failover configuration.

            • Primary resources of MDT lola-11 were failed back on Dec 3 at 20:18.
            • Slab allocation increased continuously, reaching ~31 GB by the time of the crash.
            • MDS node lola-11 crashed with oom-killer on Dec 4 at 00:21 (local time). (See also LU-7432.)
            • ptlrpc_cache seems to be the biggest consumer.

            Attached are lola-11's messages, console log, vmcore-dmesg file and collectl (version V4.0.2-1) files for the time interval specified above, along with files containing extracted counters for memory, slab totals and per-slab allocation.

            The crash dump has been saved to lola-1:/scratch/crashdumps/lu-7039/127.0.0.1-2015-12-04-00\:22\:36.
            di.wang Di Wang added a comment -

            Just a reminder: http://review.whamcloud.com/16969 and http://review.whamcloud.com/17199 are the key fixes for this problem.
            di.wang Di Wang added a comment -

            Just an update: the corruption seems to have disappeared in the 20151120 build, though we need to run more tests to confirm this. Currently the soak test is blocked by LU-7456; we will continue soak testing to check this problem once LU-7456 is fixed.

            gerrit Gerrit Updater added a comment -

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/17199
            Subject: LU-7039 recovery: abort update recovery once fails
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 70905df1d7ea16d50c927b6af9957bced89a0f3b

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16740/
            Subject: LU-7039 llog: skip to next chunk for corrupt record
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 04f4023cf59b6e5a1634ba492cd813dcb1af0c7c

            People

              Assignee: di.wang Di Wang
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 13
