LU-7039: llog_osd.c:778:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed:

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.8.0
    • Labels: None
    • Environment: Hyperion SWL test
    • Severity: 3

    Description

      Running tip of master with SWL and DNE.

      2015-08-25 12:19:21 Lustre: lustre-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
      2015-08-25 12:19:21 LustreError: 6378:0:(llog_osd.c:778:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed:
      2015-08-25 12:19:21 LustreError: 6384:0:(llog_osd.c:788:llog_osd_next_block()) lustre-MDT000b-osp-MDT0001: invalid llog tail at log id 0x3:2147484674/0 offset 3407872
      2015-08-25 12:19:21 LustreError: 6384:0:(lod_dev.c:392:lod_sub_recovery_thread()) lustre-MDT000b-osp-MDT0001 getting update log failed: rc = -22
      2015-08-25 12:19:21 LustreError: 6378:0:(llog_osd.c:778:llog_osd_next_block()) LBUG
      2015-08-25 12:19:21 Pid: 6378, comm: lod0001_rec0005
      2015-08-25 12:19:21 iws12 kernel: Call Trace:
      2015-08-25 12:19:21 [<ffffffffa04a2875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2015-08-25 12:19:21 [<ffffffffa04a2e77>] lbug_with_loc+0x47/0xb0 [libcfs]
      2015-08-25 12:19:21 [<ffffffffa08dad15>] llog_osd_next_block+0xb75/0xbf0 [obdclass]
      2015-08-25 12:19:21 [<ffffffffa08ccb4e>] llog_process_thread+0x... (remainder of the trace lost; the console output was interleaved with a repeat of the assertion message and then shows boot messages: "Initializing cgroup subsys cpuset", "Initializing cgroup subsys cpu")
      

      Attempting to recreate and get a dump
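
      For context on the check that is tripping: every llog chunk ends in a struct llog_rec_tail that duplicates the length and index of the record it terminates, and llog_osd_next_block() asserts that the last record's header index matches that tail index. Below is a minimal userspace sketch of the same consistency check that returns an error instead of LBUGging (the direction taken by the "skip to next chunk for corrupt record" patch later in this ticket). The structure definitions are simplified stand-ins and llog_chunk_tail_check() is an invented helper, not the kernel code.

      #include <errno.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      /* Simplified stand-ins for the on-disk llog record header and tail. */
      struct llog_rec_hdr {
          uint32_t lrh_len;    /* full record length, tail included */
          uint32_t lrh_index;  /* index of this record within the log */
      };

      struct llog_rec_tail {
          uint32_t lrt_len;    /* copy of lrh_len */
          uint32_t lrt_index;  /* copy of lrh_index */
      };

      /*
       * Check that the last record in a chunk is internally consistent: the
       * tail stored at the end of the used data must carry the same index as
       * the header of the record it terminates.  Report and return -EIO on
       * corruption rather than asserting.
       */
      static int llog_chunk_tail_check(const char *chunk, size_t used)
      {
          const struct llog_rec_tail *tail;
          const struct llog_rec_hdr *last_rec;

          if (used < sizeof(*last_rec) + sizeof(*tail))
              return -EIO;

          tail = (const struct llog_rec_tail *)(chunk + used - sizeof(*tail));
          if (tail->lrt_len < sizeof(*last_rec) + sizeof(*tail) ||
              tail->lrt_len > used)
              return -EIO;

          /* A record of length lrt_len ends exactly at the end of the data. */
          last_rec = (const struct llog_rec_hdr *)(chunk + used - tail->lrt_len);
          if (last_rec->lrh_index != tail->lrt_index) {
              fprintf(stderr, "invalid llog tail: header index %u != tail index %u\n",
                      last_rec->lrh_index, tail->lrt_index);
              return -EIO;
          }
          return 0;
      }

      int main(void)
      {
          char chunk[64] = { 0 };
          struct llog_rec_hdr hdr = { .lrh_len = 64, .lrh_index = 9420 };
          struct llog_rec_tail tail = { .lrt_len = 64, .lrt_index = 9420 };

          memcpy(chunk, &hdr, sizeof(hdr));
          memcpy(chunk + sizeof(chunk) - sizeof(tail), &tail, sizeof(tail));
          printf("consistent chunk: rc = %d\n", llog_chunk_tail_check(chunk, sizeof(chunk)));

          tail.lrt_index = 9421;      /* simulate the kind of index mismatch seen here */
          memcpy(chunk + sizeof(chunk) - sizeof(tail), &tail, sizeof(tail));
          printf("corrupt chunk:    rc = %d\n", llog_chunk_tail_check(chunk, sizeof(chunk)));
          return 0;
      }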

      Attachments

        1. console.log.bz2
          190 kB
        2. lola-10-lustre-log.1444148492.4548-dm-minus-one.log.bz2
          1021 kB
        3. LU-7039.llog.txt.gz
          3.53 MB
        4. lustre-log.1443755187.9078
          712 kB
        5. memory-counter-lola-11.dat.bz2
          25 kB
        6. messages-lola-11.log.bz2
          302 kB
        7. slab-details-lola-11.dat.bz2
          873 kB
        8. slab-details-one-file-per-slab.tar.bz2
          617 kB
        9. slab-total-lola-11.dat.bz2
          28 kB
        10. vmcore-dmesg.txt.bz2
          28 kB

        Issue Links

          Activity

            [LU-7039] llog_osd.c:778:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed:

            gerrit Gerrit Updater added a comment -

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/17199
            Subject: LU-7039 recovery: abort update recovery once fails
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 70905df1d7ea16d50c927b6af9957bced89a0f3b

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16740/
            Subject: LU-7039 llog: skip to next chunk for corrupt record
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 04f4023cf59b6e5a1634ba492cd813dcb1af0c7c
            di.wang Di Wang added a comment -

            Frank: I updated http://review.whamcloud.com/16838 and added more debug information there. Could you please retry with the patch? Thanks.

            heckes Frank Heckes (Inactive) added a comment -

            Di: For build '20151027' (https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151027),
            which doesn't include change 16797, the problem is still present as far as I can see:

            Oct 28 07:10:59 lola-8 kernel: LustreError: 6954:0:(llog_osd.c:874:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed:
            Oct 28 07:10:59 lola-8 kernel: LustreError: 6954:0:(llog_osd.c:874:llog_osd_next_block()) LBUG
            Oct 28 07:10:59 lola-8 kernel: Pid: 6954, comm: lod0003_rec0007
            Oct 28 07:10:59 lola-8 kernel: 
            Oct 28 07:10:59 lola-8 kernel: Call Trace:
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa07fc875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa07fce77>] lbug_with_loc+0x47/0xb0 [libcfs]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0917a7b>] llog_osd_next_block+0xa4b/0xc90 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffff81174450>] ? cache_alloc_refill+0x1c0/0x240
            Oct 28 07:10:59 lola-8 kernel: LustreError: 6948:0:(llog.c:534:llog_process_thread()) soaked-MDT0000-osp-MDT0003: Invalid record: index 9421 but expected 9420
            Oct 28 07:10:59 lola-8 kernel: LustreError: 6948:0:(lod_dev.c:402:lod_sub_recovery_thread()) soaked-MDT0000-osp-MDT0003 getting update log failed: rc = -34
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0906d3e>] llog_process_thread+0x2de/0xfc0 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0904d5c>] ? llog_init_handle+0x11c/0x950 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0907add>] llog_process_or_fork+0xbd/0x5d0 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa1374990>] ? lod_process_recovery_updates+0x0/0x420 [lod]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa090b108>] llog_cat_process_cb+0x458/0x600 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa09073aa>] llog_process_thread+0x94a/0xfc0 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0907add>] llog_process_or_fork+0xbd/0x5d0 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa090acb0>] ? llog_cat_process_cb+0x0/0x600 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa090996d>] llog_cat_process_or_fork+0x1ad/0x300 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa13a0589>] ? lod_sub_prep_llog+0x4f9/0x7a0 [lod]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa1374990>] ? lod_process_recovery_updates+0x0/0x420 [lod]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0909ad9>] llog_cat_process+0x19/0x20 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa1375c8e>] lod_sub_recovery_thread+0x26e/0xb90 [lod]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa1375a20>] ? lod_sub_recovery_thread+0x0/0xb90 [lod]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
            Oct 28 07:10:59 lola-8 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
            

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16797/
            Subject: LU-7039 tgt: Delete txn_callback correctly in tgt_init()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 44e9ec0b46fc46cc72bebbdc35e4a59a0397a81c

            gerrit Gerrit Updater added a comment -

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/16969
            Subject: LU-7039 llog: update llog header and size
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9191d9a7f79c739d71dc7652333b9f07456218ad
            di.wang Di Wang added a comment -

            Yes, I believe so. I added this fix to 16838 (along with a few other changes), and soak testing will try to confirm whether this resolves the corruption issue.

            tappro Mikhail Pershin added a comment -

            Do you mean to make opo_ooa exist all the time? That would solve this problem, I think.
            di.wang Di Wang added a comment -

            Hmm, Mike, I think you are right. We need to update the size for the OSP object after the write.
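
            To make the idea concrete: keep a size attribute cached on the OSP object (kept allocated for the object's whole lifetime, as the opo_ooa question below suggests) and bump it as soon as a write is buffered, so a later size lookup can be answered locally instead of racing with an in-flight write RPC. The following is only a hedged sketch under that assumption; struct osp_attr_cache and both helpers are invented names, not the actual OSP data structures.

            #include <pthread.h>
            #include <stdbool.h>
            #include <stdint.h>
            #include <stdio.h>

            /* Hypothetical attribute cache kept on the OSP object for its whole
             * lifetime; all names here are invented for illustration. */
            struct osp_attr_cache {
                pthread_mutex_t ac_lock;
                bool            ac_size_valid;
                uint64_t        ac_size;
            };

            /* After buffering a write of reclen bytes at 'offset', make sure the
             * cached size covers it, so later callers need no remote attr_get. */
            static void osp_cache_size_after_write(struct osp_attr_cache *ac,
                                                   uint64_t offset, uint32_t reclen)
            {
                pthread_mutex_lock(&ac->ac_lock);
                if (!ac->ac_size_valid || ac->ac_size < offset + reclen) {
                    ac->ac_size = offset + reclen;
                    ac->ac_size_valid = true;
                }
                pthread_mutex_unlock(&ac->ac_lock);
            }

            /* Size lookup: serve it from the cache when valid instead of asking
             * the remote server, which may not have applied the write yet. */
            static bool osp_cached_size(struct osp_attr_cache *ac, uint64_t *size)
            {
                bool valid;

                pthread_mutex_lock(&ac->ac_lock);
                valid = ac->ac_size_valid;
                if (valid)
                    *size = ac->ac_size;
                pthread_mutex_unlock(&ac->ac_lock);
                return valid;
            }

            int main(void)
            {
                struct osp_attr_cache ac = {
                    .ac_lock = PTHREAD_MUTEX_INITIALIZER,
                    .ac_size_valid = false,
                };
                uint64_t size;

                osp_cache_size_after_write(&ac, 8192, 256);   /* record written at offset 8192 */
                if (osp_cached_size(&ac, &size))
                    printf("cached size after write: %llu\n", (unsigned long long)size);
                return 0;
            }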

            tappro Mikhail Pershin added a comment -

            I don't see how lgh_lock helps here. Consider the following scenario (a standalone simulation follows this comment):

            • llog_osd_write() issues a new record; the write RPC is scheduled for another server
            • a new llog_osd_write() is called immediately and issues an attr_get() RPC for the same llog to the remote server
            • since these two RPCs are not serialized and were issued almost simultaneously, the attr_get() RPC might return the size value before the write is applied on the remote server

            This would not be a problem if the llog write RPC were synchronous, but it is not, is it?
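
            A small standalone simulation of the scenario above (not Lustre code: the write_rpc() and attr_get() helpers and the 100 ms delay merely stand in for the real RPCs), showing how a size query issued right after dispatching an asynchronous write can observe the pre-write size. Compile with -pthread.

            #include <pthread.h>
            #include <stdint.h>
            #include <stdio.h>
            #include <unistd.h>

            /* Pretend "remote" object state: its size changes only once the write
             * RPC has been applied on the remote server. */
            static uint64_t remote_size = 8192;
            static pthread_mutex_t remote_lock = PTHREAD_MUTEX_INITIALIZER;

            /* Asynchronous write RPC: applied remotely after some latency. */
            static void *write_rpc(void *arg)
            {
                uint32_t reclen = *(uint32_t *)arg;

                usleep(100 * 1000);              /* network + processing latency */
                pthread_mutex_lock(&remote_lock);
                remote_size += reclen;           /* write finally applied remotely */
                pthread_mutex_unlock(&remote_lock);
                return NULL;
            }

            /* attr_get RPC: reports whatever the remote server currently knows. */
            static uint64_t attr_get(void)
            {
                uint64_t size;

                pthread_mutex_lock(&remote_lock);
                size = remote_size;
                pthread_mutex_unlock(&remote_lock);
                return size;
            }

            int main(void)
            {
                pthread_t writer;
                uint32_t reclen = 256;

                /* First llog_osd_write(): schedule the write RPC and return. */
                pthread_create(&writer, NULL, write_rpc, &reclen);

                /* Second llog_osd_write() immediately asks for the size; with the
                 * write still in flight it sees the old value (8192, not 8448). */
                printf("attr_get right after dispatch: size = %llu\n",
                       (unsigned long long)attr_get());

                pthread_join(writer, NULL);
                printf("attr_get after write applied:  size = %llu\n",
                       (unsigned long long)attr_get());
                return 0;
            }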
            di.wang Di Wang added a comment -

            They are serialized by lgh_lock in llog_cat_add_rec().
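
            For reference, the serialization being referred to looks roughly like the sketch below: the handle's lock is held across the whole append, so two callers cannot interleave their index and offset bookkeeping. struct llog_handle_sketch and llog_append_sketch() are invented stand-ins, not llog_cat_add_rec() itself.

            #include <pthread.h>
            #include <stdint.h>
            #include <stdio.h>

            /* Invented, simplified log handle: only what the sketch needs. */
            struct llog_handle_sketch {
                pthread_mutex_t lgh_lock;       /* serializes appends to this log */
                uint32_t        lgh_last_idx;   /* index of the last record written */
                uint64_t        lgh_cur_offset; /* where the next record will go */
            };

            /* Append one record with lgh_lock held, so index assignment and offset
             * bookkeeping from two callers can never interleave. */
            static void llog_append_sketch(struct llog_handle_sketch *lgh, uint32_t reclen)
            {
                pthread_mutex_lock(&lgh->lgh_lock);
                lgh->lgh_last_idx++;
                printf("record %u goes at offset %llu\n", lgh->lgh_last_idx,
                       (unsigned long long)lgh->lgh_cur_offset);
                /* ... build the record and issue the write RPC here ... */
                lgh->lgh_cur_offset += reclen;
                pthread_mutex_unlock(&lgh->lgh_lock);
            }

            int main(void)
            {
                struct llog_handle_sketch lgh = {
                    .lgh_lock = PTHREAD_MUTEX_INITIALIZER,
                    .lgh_last_idx = 0,
                    .lgh_cur_offset = 8192,
                };

                llog_append_sketch(&lgh, 256);
                llog_append_sketch(&lgh, 256);
                return 0;
            }

            As the comment above argues, though, this only serializes local appenders; it does not by itself guarantee that a size fetched from the remote server reflects a write RPC that is still in flight, which is what the cached-size update discussed earlier in the ticket addresses.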

            People

              Assignee: di.wang Di Wang
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 13