Lustre / LU-7039

llog_osd.c:778:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed:

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: Lustre 2.8.0
    • Fix Version/s: Lustre 2.8.0
    • Labels: None
    • Environment: Hyperion SWL test
    • Severity: 3

    Description

      Running tip of master with SWL and DNE.

      2015-08-25 12:19:21 Lustre: lustre-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
      2015-08-25 12:19:21 LustreError: 6378:0:(llog_osd.c:778:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed:
      2015-08-25 12:19:21 LustreError: 6384:0:(llog_osd.c:788:llog_osd_next_block()) lustre-MDT000b-osp-MDT0001: invalid llog tail at log id 0x3:2147484674/0 offset 3407872
      2015-08-25 12:19:21 LustreError: 6384:0:(lod_dev.c:392:lod_sub_recovery_thread()) lustre-MDT000b-osp-MDT0001 getting update log failed: rc = -22
      2015-08-25 12:19:21 LustreError: 6378:0:(llog_osd.c:778:llog_osd_next_block()) LBUG
      2015-08-25 12:19:21 Pid: 6378, comm: lod0001_rec0005
      2015-08-25 12:19:21 Call Trace:
      2015-08-25 12:19:21  [<ffffffffa04a2875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2015-08-25 12:19:21  [<ffffffffa04a2e77>] lbug_with_loc+0x47/0xb0 [libcfs]
      2015-08-25 12:19:21  [<ffffffffa08dad15>] llog_osd_next_block+0xb75/0xbf0 [obdclass]
      2015-08-25 12:19:21  [<ffffffffa08ccb4e>] llog_process_thread+0x
      2015-08-25 12:19:21 Initializing cgroup subsys cpuset
      2015-08-25 12:19:21 Initializing cgroup subsys cpu
      

      Attempting to recreate and get a dump
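
      For context, the failing assertion checks that the last record in the llog block just read is self-consistent: the index in its header (lrh_index) must match the index recorded in its tail (lrt_index). Below is a minimal, self-contained sketch of that consistency check; the structures are simplified stand-ins for Lustre's struct llog_rec_hdr and struct llog_rec_tail (only the fields named in the assertion are kept) and the chunk layout is illustrative, not the actual on-disk format.

      /*
       * Simplified stand-ins for struct llog_rec_hdr / struct llog_rec_tail;
       * only lrh_len, lrh_index, lrt_len and lrt_index are modelled.
       */
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      struct rec_hdr {
              uint32_t lrh_len;       /* record length, header + payload + tail */
              uint32_t lrh_index;     /* record index within the log */
      };

      struct rec_tail {
              uint32_t lrt_len;
              uint32_t lrt_index;     /* must match the owning record's lrh_index */
      };

      /*
       * Locate the tail at the very end of the chunk, walk back to the record
       * it belongs to, and compare indices the way llog_osd_next_block()
       * asserts last_rec->lrh_index == tail->lrt_index.
       */
      static int chunk_tail_is_consistent(const char *chunk, size_t chunk_size)
      {
              const struct rec_tail *tail =
                      (const struct rec_tail *)(chunk + chunk_size - sizeof(*tail));
              const struct rec_hdr *last_rec =
                      (const struct rec_hdr *)(chunk + chunk_size - tail->lrt_len);

              return last_rec->lrh_index == tail->lrt_index;
      }

      int main(void)
      {
              /* Build a toy chunk holding one record whose tail disagrees. */
              char chunk[64];
              struct rec_hdr hdr = { .lrh_len = sizeof(chunk), .lrh_index = 9420 };
              struct rec_tail tail = { .lrt_len = sizeof(chunk), .lrt_index = 9421 };

              memset(chunk, 0, sizeof(chunk));
              memcpy(chunk, &hdr, sizeof(hdr));
              memcpy(chunk + sizeof(chunk) - sizeof(tail), &tail, sizeof(tail));

              printf("tail consistent: %s\n",
                     chunk_tail_is_consistent(chunk, sizeof(chunk)) ? "yes" : "no");
              return 0;
      }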

      Attachments

        1. console.log.bz2
          190 kB
        2. lola-10-lustre-log.1444148492.4548-dm-minus-one.log.bz2
          1021 kB
        3. LU-7039.llog.txt.gz
          3.53 MB
        4. lustre-log.1443755187.9078
          712 kB
        5. memory-counter-lola-11.dat.bz2
          25 kB
        6. messages-lola-11.log.bz2
          302 kB
        7. slab-details-lola-11.dat.bz2
          873 kB
        8. slab-details-one-file-per-slab.tar.bz2
          617 kB
        9. slab-total-lola-11.dat.bz2
          28 kB
        10. vmcore-dmesg.txt.bz2
          28 kB


          Activity


            Frank Heckes (Inactive) added a comment:

            It turned out that the collectl raw files are too big to be uploaded to Jira. I saved them to lola-1:/scratch/crashdumps/lu-7039.
            Frank Heckes (Inactive) added a comment (edited):

            The error below happened during soak testing of change 16838 patch set #31 (no wiki entry for the build exists yet) on cluster lola. DNE is enabled and the MDSes are configured in an active-active HA failover configuration.

            • Primary resources of MDT lola-11 were failed back on Dec 3 at 20:18.
              Slab allocation then increased continuously, reaching ~31 GB by the time of the crash.
            • MDS node lola-11 crashed with the oom-killer on Dec 4 at 00:21 (local time); see also LU-7432.
            • ptlrpc_cache seems to be the biggest consumer.

            Attached are lola-11's messages, console log, vmcore-dmesg file, and collectl (version V4.0.2-1) files for the time interval specified above, along with files containing extracted counters for memory, slab totals, and per-slab allocation.

            The crash dump has been saved to lola-1:/scratch/crashdumps/lu-7039/127.0.0.1-2015-12-04-00\:22\:36.
            Di Wang added a comment:

            Just a reminder: http://review.whamcloud.com/16969 and http://review.whamcloud.com/17199 are the key fixes for this problem.
            Di Wang added a comment:

            Just an update: it seems the corruption disappears in the 20151120 build, though we need to run more tests to confirm this. Currently the soak test is blocked by LU-7456, and we will continue soak testing to check this problem once LU-7456 is fixed.

            Gerrit Updater added a comment:

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/17199
            Subject: LU-7039 recovery: abort update recovery once fails
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 70905df1d7ea16d50c927b6af9957bced89a0f3b
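
            Going only by the patch subject ("abort update recovery once fails"), the intent appears to be that the per-MDT recovery thread should give up update recovery when the update llog cannot be processed (the "getting update log failed: rc = ..." path seen in the console logs above), rather than continuing or asserting. A conceptual sketch under that assumption follows; recovery_ctx, process_update_llog() and sub_recovery_thread() are hypothetical names, not the code in the patch.

            #include <stdio.h>

            struct recovery_ctx {
                    int rc;                 /* result of processing the update llog */
                    int abort_recovery;     /* set when update recovery is given up */
            };

            /* Hypothetical llog processing step; returns 0 or a negative errno. */
            static int process_update_llog(void)
            {
                    return -22;             /* simulate the -EINVAL seen in the logs */
            }

            static void sub_recovery_thread(struct recovery_ctx *ctx)
            {
                    ctx->rc = process_update_llog();
                    if (ctx->rc < 0) {
                            fprintf(stderr, "getting update log failed: rc = %d\n",
                                    ctx->rc);
                            /* abort update recovery instead of retrying forever */
                            ctx->abort_recovery = 1;
                    }
            }

            int main(void)
            {
                    struct recovery_ctx ctx = { 0 };

                    sub_recovery_thread(&ctx);
                    printf("abort_recovery = %d\n", ctx.abort_recovery);
                    return 0;
            }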

            Gerrit Updater added a comment:

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16740/
            Subject: LU-7039 llog: skip to next chunk for corrupt record
            Project: fs/lustre-release
            Branch: master
            Commit: 04f4023cf59b6e5a1634ba492cd813dcb1af0c7c
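
            Based only on the subject line ("skip to next chunk for corrupt record"), the idea seems to be that a corrupt record header should cause the log scanner to advance to the next chunk boundary and continue rather than hit the LBUG. A rough sketch of that strategy is below; the chunk size, the corruption test and the scan_log() helper are illustrative assumptions, not the code merged here.

            #include <stddef.h>
            #include <stdint.h>
            #include <stdio.h>

            #define CHUNK_SIZE 8192u                /* assumed llog chunk size */

            struct rec_hdr {                        /* simplified stand-in for llog_rec_hdr */
                    uint32_t lrh_len;
                    uint32_t lrh_index;
            };

            /* Round the current offset up to the start of the next chunk. */
            static size_t next_chunk_offset(size_t offset)
            {
                    return (offset / CHUNK_SIZE + 1) * CHUNK_SIZE;
            }

            static void scan_log(const char *buf, size_t size)
            {
                    size_t off = 0;

                    while (off + sizeof(struct rec_hdr) <= size) {
                            const struct rec_hdr *rec =
                                    (const struct rec_hdr *)(buf + off);

                            if (rec->lrh_len == 0 || rec->lrh_len > CHUNK_SIZE) {
                                    /* corrupt record: skip to the next chunk */
                                    fprintf(stderr,
                                            "corrupt record at offset %zu, skipping to next chunk\n",
                                            off);
                                    off = next_chunk_offset(off);
                                    continue;
                            }
                            /* otherwise process the record and advance past it */
                            off += rec->lrh_len;
                    }
            }

            int main(void)
            {
                    static char buf[2 * CHUNK_SIZE];    /* zero-filled, so the headers look corrupt */

                    scan_log(buf, sizeof(buf));
                    return 0;
            }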
            Di Wang added a comment:

            Frank: I updated http://review.whamcloud.com/16838 and added more debug information there. Could you please retry with the patch? Thanks.

            Frank Heckes (Inactive) added a comment:

            Di: For build '20151027' (https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151027),
            which doesn't include change 16797, the problem is still present as far as I can see:

            Oct 28 07:10:59 lola-8 kernel: LustreError: 6954:0:(llog_osd.c:874:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed:
            Oct 28 07:10:59 lola-8 kernel: LustreError: 6954:0:(llog_osd.c:874:llog_osd_next_block()) LBUG
            Oct 28 07:10:59 lola-8 kernel: Pid: 6954, comm: lod0003_rec0007
            Oct 28 07:10:59 lola-8 kernel: 
            Oct 28 07:10:59 lola-8 kernel: Call Trace:
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa07fc875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa07fce77>] lbug_with_loc+0x47/0xb0 [libcfs]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0917a7b>] llog_osd_next_block+0xa4b/0xc90 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffff81174450>] ? cache_alloc_refill+0x1c0/0x240
            Oct 28 07:10:59 lola-8 kernel: LustreError: 6948:0:(llog.c:534:llog_process_thread()) soaked-MDT0000-osp-MDT0003: Invalid record: index 9421 but expected 9420
            Oct 28 07:10:59 lola-8 kernel: LustreError: 6948:0:(lod_dev.c:402:lod_sub_recovery_thread()) soaked-MDT0000-osp-MDT0003 getting update log failed: rc = -34
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0906d3e>] llog_process_thread+0x2de/0xfc0 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0904d5c>] ? llog_init_handle+0x11c/0x950 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0907add>] llog_process_or_fork+0xbd/0x5d0 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa1374990>] ? lod_process_recovery_updates+0x0/0x420 [lod]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa090b108>] llog_cat_process_cb+0x458/0x600 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa09073aa>] llog_process_thread+0x94a/0xfc0 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0907add>] llog_process_or_fork+0xbd/0x5d0 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa090acb0>] ? llog_cat_process_cb+0x0/0x600 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa090996d>] llog_cat_process_or_fork+0x1ad/0x300 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa13a0589>] ? lod_sub_prep_llog+0x4f9/0x7a0 [lod]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa1374990>] ? lod_process_recovery_updates+0x0/0x420 [lod]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0909ad9>] llog_cat_process+0x19/0x20 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa1375c8e>] lod_sub_recovery_thread+0x26e/0xb90 [lod]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa1375a20>] ? lod_sub_recovery_thread+0x0/0xb90 [lod]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
            Oct 28 07:10:59 lola-8 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
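
            For reference, the "Invalid record: index 9421 but expected 9420" message corresponds to a simple index-continuity check while walking the log: each record's lrh_index should follow the previous one, and a gap is reported as an error (rc = -34, i.e. -ERANGE, matching the log above). A minimal sketch of such a check, with simplified types rather than the actual llog_process_thread() code:

            #include <stdint.h>
            #include <stdio.h>

            struct rec {
                    uint32_t lrh_index;     /* simplified: only the index field is modelled */
            };

            /* Return 0 if indices are consecutive, a negative error otherwise. */
            static int check_indices(const struct rec *recs, int count, uint32_t expect)
            {
                    for (int i = 0; i < count; i++, expect++) {
                            if (recs[i].lrh_index != expect) {
                                    fprintf(stderr,
                                            "Invalid record: index %u but expected %u\n",
                                            recs[i].lrh_index, expect);
                                    return -34;     /* -ERANGE, as in the log above */
                            }
                    }
                    return 0;
            }

            int main(void)
            {
                    struct rec recs[] = { { 9418 }, { 9419 }, { 9421 } };

                    return check_indices(recs, 3, 9418) ? 1 : 0;
            }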

            Gerrit Updater added a comment:

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16797/
            Subject: LU-7039 tgt: Delete txn_callback correctly in tgt_init()
            Project: fs/lustre-release
            Branch: master
            Commit: 44e9ec0b46fc46cc72bebbdc35e4a59a0397a81c

            Gerrit Updater added a comment:

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/16969
            Subject: LU-7039 llog: update llog header and size
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9191d9a7f79c739d71dc7652333b9f07456218ad
            Di Wang added a comment:

            Yes, I believe so. I added this fix to 16838 (along with a few other changes), and the soak test will show whether it resolves the corruption issue.

            People

              Assignee: Di Wang
              Reporter: Cliff White (Inactive)
              Votes: 0
              Watchers: 13
