Lustre / LU-7039

llog_osd.c:778:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed:

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: Lustre 2.8.0
    • Fix Version/s: Lustre 2.8.0
    • Labels: None
    • Environment: Hyperion SWL test
    • Severity: 3

    Description

      Running tip of master with SWL and DNE.

      2015-08-25 12:19:21 Lustre: lustre-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
      2015-08-25 12:19:21 LustreError: 6378:0:(llog_osd.c:778:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed:
      2015-08-25 12:19:21 LustreError: 6384:0:(llog_osd.c:788:llog_osd_next_block()) lustre-MDT000b-osp-MDT0001: invalid llog tail at log id 0x3:2147484674/0 offset 3407872
      2015-08-25 12:19:21 LustreError: 6384:0:(lod_dev.c:392:lod_sub_recovery_thread()) lustre-MDT000b-osp-MDT0001 getting update log failed: rc = -22
      2015-08-25 12:19:21 LustreError: 6378:0:(llog_osd.c:778:llog_osd_next_block()) LBUG
      2015-08-25 12:19:21 Pid: 6378, comm: lod0001_rec0005
      2015-08-25 12:19:21 Call Trace:
      2015-08-25 12:19:21  [<ffffffffa04a2875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2015-08-25 12:19:21  [<ffffffffa04a2e77>] lbug_with_loc+0x47/0xb0 [libcfs]
      2015-08-25 12:19:21  [<ffffffffa08dad15>] llog_osd_next_block+0xb75/0xbf0 [obdclass]
      2015-08-25 12:19:21  [<ffffffffa08ccb4e>] llog_process_thread+0x
      2015-08-25 12:19:21 Initializing cgroup subsys cpuset
      2015-08-25 12:19:21 Initializing cgroup subsys cpu
      

      Attempting to recreate and get a dump
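
      For context, the failing assertion checks that the last record in the llog block just read is self-consistent: the index in its header (lrh_index) must match the index recorded in its tail (lrt_index). Below is a minimal, self-contained sketch of that consistency check; the structures are simplified stand-ins for Lustre's struct llog_rec_hdr and struct llog_rec_tail (only the fields named in the assertion are kept) and the chunk layout is illustrative, not the actual on-disk format.

      /*
       * Simplified stand-ins for struct llog_rec_hdr / struct llog_rec_tail;
       * only lrh_len, lrh_index, lrt_len and lrt_index are modelled.
       */
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      struct rec_hdr {
              uint32_t lrh_len;       /* record length, header + payload + tail */
              uint32_t lrh_index;     /* record index within the log */
      };

      struct rec_tail {
              uint32_t lrt_len;
              uint32_t lrt_index;     /* must match the owning record's lrh_index */
      };

      /*
       * Locate the tail at the very end of the chunk, walk back to the record
       * it belongs to, and compare indices the way llog_osd_next_block()
       * asserts last_rec->lrh_index == tail->lrt_index.
       */
      static int chunk_tail_is_consistent(const char *chunk, size_t chunk_size)
      {
              const struct rec_tail *tail =
                      (const struct rec_tail *)(chunk + chunk_size - sizeof(*tail));
              const struct rec_hdr *last_rec =
                      (const struct rec_hdr *)(chunk + chunk_size - tail->lrt_len);

              return last_rec->lrh_index == tail->lrt_index;
      }

      int main(void)
      {
              /* Build a toy chunk holding one record whose tail disagrees. */
              char chunk[64];
              struct rec_hdr hdr = { .lrh_len = sizeof(chunk), .lrh_index = 9420 };
              struct rec_tail tail = { .lrt_len = sizeof(chunk), .lrt_index = 9421 };

              memset(chunk, 0, sizeof(chunk));
              memcpy(chunk, &hdr, sizeof(hdr));
              memcpy(chunk + sizeof(chunk) - sizeof(tail), &tail, sizeof(tail));

              printf("tail consistent: %s\n",
                     chunk_tail_is_consistent(chunk, sizeof(chunk)) ? "yes" : "no");
              return 0;
      }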

      Attachments

        1. console.log.bz2
          190 kB
        2. lola-10-lustre-log.1444148492.4548-dm-minus-one.log.bz2
          1021 kB
        3. LU-7039.llog.txt.gz
          3.53 MB
        4. lustre-log.1443755187.9078
          712 kB
        5. memory-counter-lola-11.dat.bz2
          25 kB
        6. messages-lola-11.log.bz2
          302 kB
        7. slab-details-lola-11.dat.bz2
          873 kB
        8. slab-details-one-file-per-slab.tar.bz2
          617 kB
        9. slab-total-lola-11.dat.bz2
          28 kB
        10. vmcore-dmesg.txt.bz2
          28 kB


          Activity


            Frank Heckes (Inactive) added a comment:

            It turned out that the collectl raw files are too big to be uploaded to Jira. I saved them to lola-1:/scratch/crashdumps/lu-7039.
            Frank Heckes (Inactive) added a comment (edited):

            The error below happened during soak testing of change 16838 patch set #31 (no wiki entry for the build exists yet) on cluster lola. DNE is enabled and the MDSes are configured in an active-active HA failover configuration.

            • Primary resources of MDT lola-11 were failed back on Dec 3 at 20:18.
              Slab allocation then increased continuously, reaching ~31 GB by the time of the crash.
            • MDS node lola-11 crashed with the oom-killer on Dec 4 at 00:21 (local time); see also LU-7432.
            • ptlrpc_cache seems to be the biggest consumer.

            Attached are lola-11's messages, console log, vmcore-dmesg file, and collectl (version V4.0.2-1) files for the time interval specified above, along with files containing extracted counters for memory, slab totals, and per-slab allocation.

            The crash dump has been saved to lola-1:/scratch/crashdumps/lu-7039/127.0.0.1-2015-12-04-00\:22\:36.
            Di Wang added a comment:

            Just a reminder: http://review.whamcloud.com/16969 and http://review.whamcloud.com/17199 are the key fixes for this problem.
            Di Wang added a comment:

            Just an update: it seems the corruption disappears in the 20151120 build, though we need to run more tests to confirm this. Currently the soak test is blocked by LU-7456, and we will continue soak testing to check this problem once LU-7456 is fixed.

            Gerrit Updater added a comment:

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/17199
            Subject: LU-7039 recovery: abort update recovery once fails
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 70905df1d7ea16d50c927b6af9957bced89a0f3b
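
            Going only by the patch subject ("abort update recovery once fails"), the intent appears to be that the per-MDT recovery thread should give up update recovery when the update llog cannot be processed (the "getting update log failed: rc = ..." path seen in the console logs above), rather than continuing or asserting. A conceptual sketch under that assumption follows; recovery_ctx, process_update_llog() and sub_recovery_thread() are hypothetical names, not the code in the patch.

            #include <stdio.h>

            struct recovery_ctx {
                    int rc;                 /* result of processing the update llog */
                    int abort_recovery;     /* set when update recovery is given up */
            };

            /* Hypothetical llog processing step; returns 0 or a negative errno. */
            static int process_update_llog(void)
            {
                    return -22;             /* simulate the -EINVAL seen in the logs */
            }

            static void sub_recovery_thread(struct recovery_ctx *ctx)
            {
                    ctx->rc = process_update_llog();
                    if (ctx->rc < 0) {
                            fprintf(stderr, "getting update log failed: rc = %d\n",
                                    ctx->rc);
                            /* abort update recovery instead of retrying forever */
                            ctx->abort_recovery = 1;
                    }
            }

            int main(void)
            {
                    struct recovery_ctx ctx = { 0 };

                    sub_recovery_thread(&ctx);
                    printf("abort_recovery = %d\n", ctx.abort_recovery);
                    return 0;
            }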

            Gerrit Updater added a comment:

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16740/
            Subject: LU-7039 llog: skip to next chunk for corrupt record
            Project: fs/lustre-release
            Branch: master
            Commit: 04f4023cf59b6e5a1634ba492cd813dcb1af0c7c
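
            Based only on the subject line ("skip to next chunk for corrupt record"), the idea seems to be that a corrupt record header should cause the log scanner to advance to the next chunk boundary and continue rather than hit the LBUG. A rough sketch of that strategy is below; the chunk size, the corruption test and the scan_log() helper are illustrative assumptions, not the code merged here.

            #include <stddef.h>
            #include <stdint.h>
            #include <stdio.h>

            #define CHUNK_SIZE 8192u                /* assumed llog chunk size */

            struct rec_hdr {                        /* simplified stand-in for llog_rec_hdr */
                    uint32_t lrh_len;
                    uint32_t lrh_index;
            };

            /* Round the current offset up to the start of the next chunk. */
            static size_t next_chunk_offset(size_t offset)
            {
                    return (offset / CHUNK_SIZE + 1) * CHUNK_SIZE;
            }

            static void scan_log(const char *buf, size_t size)
            {
                    size_t off = 0;

                    while (off + sizeof(struct rec_hdr) <= size) {
                            const struct rec_hdr *rec =
                                    (const struct rec_hdr *)(buf + off);

                            if (rec->lrh_len == 0 || rec->lrh_len > CHUNK_SIZE) {
                                    /* corrupt record: skip to the next chunk */
                                    fprintf(stderr,
                                            "corrupt record at offset %zu, skipping to next chunk\n",
                                            off);
                                    off = next_chunk_offset(off);
                                    continue;
                            }
                            /* otherwise process the record and advance past it */
                            off += rec->lrh_len;
                    }
            }

            int main(void)
            {
                    static char buf[2 * CHUNK_SIZE];    /* zero-filled, so the headers look corrupt */

                    scan_log(buf, sizeof(buf));
                    return 0;
            }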
            Di Wang added a comment:

            Frank: I updated http://review.whamcloud.com/16838 and added more debug information there. Could you please retry with the patch? Thanks.

            Frank Heckes (Inactive) added a comment:

            Di: For build '20151027' (https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151027),
            which doesn't include change 16797, the problem is still present as far as I can see:

            Oct 28 07:10:59 lola-8 kernel: LustreError: 6954:0:(llog_osd.c:874:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed:
            Oct 28 07:10:59 lola-8 kernel: LustreError: 6954:0:(llog_osd.c:874:llog_osd_next_block()) LBUG
            Oct 28 07:10:59 lola-8 kernel: Pid: 6954, comm: lod0003_rec0007
            Oct 28 07:10:59 lola-8 kernel: 
            Oct 28 07:10:59 lola-8 kernel: Call Trace:
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa07fc875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa07fce77>] lbug_with_loc+0x47/0xb0 [libcfs]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0917a7b>] llog_osd_next_block+0xa4b/0xc90 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffff81174450>] ? cache_alloc_refill+0x1c0/0x240
            Oct 28 07:10:59 lola-8 kernel: LustreError: 6948:0:(llog.c:534:llog_process_thread()) soaked-MDT0000-osp-MDT0003: Invalid record: index 9421 but expected 9420
            Oct 28 07:10:59 lola-8 kernel: LustreError: 6948:0:(lod_dev.c:402:lod_sub_recovery_thread()) soaked-MDT0000-osp-MDT0003 getting update log failed: rc = -34
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0906d3e>] llog_process_thread+0x2de/0xfc0 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0904d5c>] ? llog_init_handle+0x11c/0x950 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0907add>] llog_process_or_fork+0xbd/0x5d0 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa1374990>] ? lod_process_recovery_updates+0x0/0x420 [lod]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa090b108>] llog_cat_process_cb+0x458/0x600 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa09073aa>] llog_process_thread+0x94a/0xfc0 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0907add>] llog_process_or_fork+0xbd/0x5d0 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa090acb0>] ? llog_cat_process_cb+0x0/0x600 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa090996d>] llog_cat_process_or_fork+0x1ad/0x300 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa13a0589>] ? lod_sub_prep_llog+0x4f9/0x7a0 [lod]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa1374990>] ? lod_process_recovery_updates+0x0/0x420 [lod]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0909ad9>] llog_cat_process+0x19/0x20 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa1375c8e>] lod_sub_recovery_thread+0x26e/0xb90 [lod]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa1375a20>] ? lod_sub_recovery_thread+0x0/0xb90 [lod]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
            Oct 28 07:10:59 lola-8 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
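
            For reference, the "Invalid record: index 9421 but expected 9420" message corresponds to a simple index-continuity check while walking the log: each record's lrh_index should follow the previous one, and a gap is reported as an error (rc = -34, i.e. -ERANGE, matching the log above). A minimal sketch of such a check, with simplified types rather than the actual llog_process_thread() code:

            #include <stdint.h>
            #include <stdio.h>

            struct rec {
                    uint32_t lrh_index;     /* simplified: only the index field is modelled */
            };

            /* Return 0 if indices are consecutive, a negative error otherwise. */
            static int check_indices(const struct rec *recs, int count, uint32_t expect)
            {
                    for (int i = 0; i < count; i++, expect++) {
                            if (recs[i].lrh_index != expect) {
                                    fprintf(stderr,
                                            "Invalid record: index %u but expected %u\n",
                                            recs[i].lrh_index, expect);
                                    return -34;     /* -ERANGE, as in the log above */
                            }
                    }
                    return 0;
            }

            int main(void)
            {
                    struct rec recs[] = { { 9418 }, { 9419 }, { 9421 } };

                    return check_indices(recs, 3, 9418) ? 1 : 0;
            }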

            Gerrit Updater added a comment:

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16797/
            Subject: LU-7039 tgt: Delete txn_callback correctly in tgt_init()
            Project: fs/lustre-release
            Branch: master
            Commit: 44e9ec0b46fc46cc72bebbdc35e4a59a0397a81c

            Gerrit Updater added a comment:

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/16969
            Subject: LU-7039 llog: update llog header and size
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9191d9a7f79c739d71dc7652333b9f07456218ad
            Di Wang added a comment:

            Yes, I believe so. I added this fix to 16838 (along with a few other changes), and the soak test will show whether it resolves the corruption issue.

            People

              Assignee: Di Wang
              Reporter: Cliff White (Inactive)
              Votes: 0
              Watchers: 13
