LU-7039: llog_osd.c:778:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed:

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.8.0
    • Labels: None
    • Environment: Hyperion SWL test
    • Severity: 3

    Description

      Running tip of master with SWL and DNE.

      2015-08-25 12:19:21 Lustre: lustre-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
      2015-08-25 12:19:21 LustreError: 6378:0:(llog_osd.c:778:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed:
      2015-08-25 12:19:21 LustreError: 6384:0:(llog_osd.c:788:llog_osd_next_block()) lustre-MDT000b-osp-MDT0001: invalid llog tail at log id 0x3:2147484674/0 offset 3407872
      2015-08-25 12:19:21 LustreError: 6384:0:(lod_dev.c:392:lod_sub_recovery_thread()) lustre-MDT000b-osp-MDT0001 getting update log failed: rc = -22
      2015-08-25 12:19:21 LustreError: 6378:0:(llog_osd.c:778:llog_osd_next_block()) LBUG
      2015-08-25 12:19:21 Pid: 6378, comm: lod0001_rec0005
      2015-08-25 12:19:21 iws12 kernel: Call Trace:
      2015-08-25 12:19:21 [<ffffffffa04a2875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2015-08-25 12:19:21 [<ffffffffa04a2e77>] lbug_with_loc+0x47/0xb0 [libcfs]
      2015-08-25 12:19:21 [<ffffffffa08dad15>] llog_osd_next_block+0xb75/0xbf0 [obdclass]
      2015-08-25 12:19:21 [<ffffffffa08ccb4e>] llog_process_thread+0x... (remainder of the trace lost; the console output was interleaved with a repeat of the assertion message and then shows boot messages: "Initializing cgroup subsys cpuset", "Initializing cgroup subsys cpu")
      

      Attempting to recreate and get a dump
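
      For context on the check that is tripping: every llog chunk ends in a struct llog_rec_tail that duplicates the length and index of the record it terminates, and llog_osd_next_block() asserts that the last record's header index matches that tail index. Below is a minimal userspace sketch of the same consistency check that returns an error instead of LBUGging (the direction taken by the "skip to next chunk for corrupt record" patch later in this ticket). The structure definitions are simplified stand-ins and llog_chunk_tail_check() is an invented helper, not the kernel code.

      #include <errno.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      /* Simplified stand-ins for the on-disk llog record header and tail. */
      struct llog_rec_hdr {
          uint32_t lrh_len;    /* full record length, tail included */
          uint32_t lrh_index;  /* index of this record within the log */
      };

      struct llog_rec_tail {
          uint32_t lrt_len;    /* copy of lrh_len */
          uint32_t lrt_index;  /* copy of lrh_index */
      };

      /*
       * Check that the last record in a chunk is internally consistent: the
       * tail stored at the end of the used data must carry the same index as
       * the header of the record it terminates.  Report and return -EIO on
       * corruption rather than asserting.
       */
      static int llog_chunk_tail_check(const char *chunk, size_t used)
      {
          const struct llog_rec_tail *tail;
          const struct llog_rec_hdr *last_rec;

          if (used < sizeof(*last_rec) + sizeof(*tail))
              return -EIO;

          tail = (const struct llog_rec_tail *)(chunk + used - sizeof(*tail));
          if (tail->lrt_len < sizeof(*last_rec) + sizeof(*tail) ||
              tail->lrt_len > used)
              return -EIO;

          /* A record of length lrt_len ends exactly at the end of the data. */
          last_rec = (const struct llog_rec_hdr *)(chunk + used - tail->lrt_len);
          if (last_rec->lrh_index != tail->lrt_index) {
              fprintf(stderr, "invalid llog tail: header index %u != tail index %u\n",
                      last_rec->lrh_index, tail->lrt_index);
              return -EIO;
          }
          return 0;
      }

      int main(void)
      {
          char chunk[64] = { 0 };
          struct llog_rec_hdr hdr = { .lrh_len = 64, .lrh_index = 9420 };
          struct llog_rec_tail tail = { .lrt_len = 64, .lrt_index = 9420 };

          memcpy(chunk, &hdr, sizeof(hdr));
          memcpy(chunk + sizeof(chunk) - sizeof(tail), &tail, sizeof(tail));
          printf("consistent chunk: rc = %d\n", llog_chunk_tail_check(chunk, sizeof(chunk)));

          tail.lrt_index = 9421;      /* simulate the kind of index mismatch seen here */
          memcpy(chunk + sizeof(chunk) - sizeof(tail), &tail, sizeof(tail));
          printf("corrupt chunk:    rc = %d\n", llog_chunk_tail_check(chunk, sizeof(chunk)));
          return 0;
      }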

      Attachments

        1. console.log.bz2
          190 kB
        2. lola-10-lustre-log.1444148492.4548-dm-minus-one.log.bz2
          1021 kB
        3. LU-7039.llog.txt.gz
          3.53 MB
        4. lustre-log.1443755187.9078
          712 kB
        5. memory-counter-lola-11.dat.bz2
          25 kB
        6. messages-lola-11.log.bz2
          302 kB
        7. slab-details-lola-11.dat.bz2
          873 kB
        8. slab-details-one-file-per-slab.tar.bz2
          617 kB
        9. slab-total-lola-11.dat.bz2
          28 kB
        10. vmcore-dmesg.txt.bz2
          28 kB

        Issue Links

          Activity

            [LU-7039] llog_osd.c:778:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed:

            gerrit Gerrit Updater added a comment -

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/17199
            Subject: LU-7039 recovery: abort update recovery once fails
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 70905df1d7ea16d50c927b6af9957bced89a0f3b

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16740/
            Subject: LU-7039 llog: skip to next chunk for corrupt record
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 04f4023cf59b6e5a1634ba492cd813dcb1af0c7c
            di.wang Di Wang added a comment -

            Frank: I updated http://review.whamcloud.com/16838 and added more debug information there. Could you please retry with the patch? Thanks.

            heckes Frank Heckes (Inactive) added a comment -

            Di: For build '20151027' (https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151027),
            which doesn't include change 16797, the problem is still present as far as I can see:

            Oct 28 07:10:59 lola-8 kernel: LustreError: 6954:0:(llog_osd.c:874:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed:
            Oct 28 07:10:59 lola-8 kernel: LustreError: 6954:0:(llog_osd.c:874:llog_osd_next_block()) LBUG
            Oct 28 07:10:59 lola-8 kernel: Pid: 6954, comm: lod0003_rec0007
            Oct 28 07:10:59 lola-8 kernel: 
            Oct 28 07:10:59 lola-8 kernel: Call Trace:
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa07fc875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa07fce77>] lbug_with_loc+0x47/0xb0 [libcfs]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0917a7b>] llog_osd_next_block+0xa4b/0xc90 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffff81174450>] ? cache_alloc_refill+0x1c0/0x240
            Oct 28 07:10:59 lola-8 kernel: LustreError: 6948:0:(llog.c:534:llog_process_thread()) soaked-MDT0000-osp-MDT0003: Invalid record: index 9421 but expected 9420
            Oct 28 07:10:59 lola-8 kernel: LustreError: 6948:0:(lod_dev.c:402:lod_sub_recovery_thread()) soaked-MDT0000-osp-MDT0003 getting update log failed: rc = -34
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0906d3e>] llog_process_thread+0x2de/0xfc0 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0904d5c>] ? llog_init_handle+0x11c/0x950 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0907add>] llog_process_or_fork+0xbd/0x5d0 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa1374990>] ? lod_process_recovery_updates+0x0/0x420 [lod]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa090b108>] llog_cat_process_cb+0x458/0x600 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa09073aa>] llog_process_thread+0x94a/0xfc0 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0907add>] llog_process_or_fork+0xbd/0x5d0 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa090acb0>] ? llog_cat_process_cb+0x0/0x600 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa090996d>] llog_cat_process_or_fork+0x1ad/0x300 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa13a0589>] ? lod_sub_prep_llog+0x4f9/0x7a0 [lod]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa1374990>] ? lod_process_recovery_updates+0x0/0x420 [lod]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa0909ad9>] llog_cat_process+0x19/0x20 [obdclass]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa1375c8e>] lod_sub_recovery_thread+0x26e/0xb90 [lod]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffffa1375a20>] ? lod_sub_recovery_thread+0x0/0xb90 [lod]
            Oct 28 07:10:59 lola-8 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
            Oct 28 07:10:59 lola-8 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
            

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16797/
            Subject: LU-7039 tgt: Delete txn_callback correctly in tgt_init()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 44e9ec0b46fc46cc72bebbdc35e4a59a0397a81c

            gerrit Gerrit Updater added a comment -

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/16969
            Subject: LU-7039 llog: update llog header and size
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9191d9a7f79c739d71dc7652333b9f07456218ad
            di.wang Di Wang added a comment -

            Yes, I believe so. I added this fix to 16838 (along with a few other changes), and soak testing will try to confirm whether this resolves the corruption issue.

            tappro Mikhail Pershin added a comment -

            Do you mean to make opo_ooa exist all the time? That would solve this problem, I think.
            di.wang Di Wang added a comment -

            Hmm, Mike, I think you are right. We need to update the size for the OSP object after the write.
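
            To make the idea concrete: keep a size attribute cached on the OSP object (kept allocated for the object's whole lifetime, as the opo_ooa question below suggests) and bump it as soon as a write is buffered, so a later size lookup can be answered locally instead of racing with an in-flight write RPC. The following is only a hedged sketch under that assumption; struct osp_attr_cache and both helpers are invented names, not the actual OSP data structures.

            #include <pthread.h>
            #include <stdbool.h>
            #include <stdint.h>
            #include <stdio.h>

            /* Hypothetical attribute cache kept on the OSP object for its whole
             * lifetime; all names here are invented for illustration. */
            struct osp_attr_cache {
                pthread_mutex_t ac_lock;
                bool            ac_size_valid;
                uint64_t        ac_size;
            };

            /* After buffering a write of reclen bytes at 'offset', make sure the
             * cached size covers it, so later callers need no remote attr_get. */
            static void osp_cache_size_after_write(struct osp_attr_cache *ac,
                                                   uint64_t offset, uint32_t reclen)
            {
                pthread_mutex_lock(&ac->ac_lock);
                if (!ac->ac_size_valid || ac->ac_size < offset + reclen) {
                    ac->ac_size = offset + reclen;
                    ac->ac_size_valid = true;
                }
                pthread_mutex_unlock(&ac->ac_lock);
            }

            /* Size lookup: serve it from the cache when valid instead of asking
             * the remote server, which may not have applied the write yet. */
            static bool osp_cached_size(struct osp_attr_cache *ac, uint64_t *size)
            {
                bool valid;

                pthread_mutex_lock(&ac->ac_lock);
                valid = ac->ac_size_valid;
                if (valid)
                    *size = ac->ac_size;
                pthread_mutex_unlock(&ac->ac_lock);
                return valid;
            }

            int main(void)
            {
                struct osp_attr_cache ac = {
                    .ac_lock = PTHREAD_MUTEX_INITIALIZER,
                    .ac_size_valid = false,
                };
                uint64_t size;

                osp_cache_size_after_write(&ac, 8192, 256);   /* record written at offset 8192 */
                if (osp_cached_size(&ac, &size))
                    printf("cached size after write: %llu\n", (unsigned long long)size);
                return 0;
            }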

            tappro Mikhail Pershin added a comment -

            I don't see how lgh_lock helps here. Consider the following scenario (a standalone simulation follows this comment):

            • llog_osd_write() issues a new record; the write RPC is scheduled for another server
            • a new llog_osd_write() is called immediately and issues an attr_get() RPC for the same llog to the remote server
            • since these two RPCs are not serialized and were issued almost simultaneously, the attr_get() RPC might return the size value before the write is applied on the remote server

            This would not be a problem if the llog write RPC were synchronous, but it is not, is it?
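
            A small standalone simulation of the scenario above (not Lustre code: the write_rpc() and attr_get() helpers and the 100 ms delay merely stand in for the real RPCs), showing how a size query issued right after dispatching an asynchronous write can observe the pre-write size. Compile with -pthread.

            #include <pthread.h>
            #include <stdint.h>
            #include <stdio.h>
            #include <unistd.h>

            /* Pretend "remote" object state: its size changes only once the write
             * RPC has been applied on the remote server. */
            static uint64_t remote_size = 8192;
            static pthread_mutex_t remote_lock = PTHREAD_MUTEX_INITIALIZER;

            /* Asynchronous write RPC: applied remotely after some latency. */
            static void *write_rpc(void *arg)
            {
                uint32_t reclen = *(uint32_t *)arg;

                usleep(100 * 1000);              /* network + processing latency */
                pthread_mutex_lock(&remote_lock);
                remote_size += reclen;           /* write finally applied remotely */
                pthread_mutex_unlock(&remote_lock);
                return NULL;
            }

            /* attr_get RPC: reports whatever the remote server currently knows. */
            static uint64_t attr_get(void)
            {
                uint64_t size;

                pthread_mutex_lock(&remote_lock);
                size = remote_size;
                pthread_mutex_unlock(&remote_lock);
                return size;
            }

            int main(void)
            {
                pthread_t writer;
                uint32_t reclen = 256;

                /* First llog_osd_write(): schedule the write RPC and return. */
                pthread_create(&writer, NULL, write_rpc, &reclen);

                /* Second llog_osd_write() immediately asks for the size; with the
                 * write still in flight it sees the old value (8192, not 8448). */
                printf("attr_get right after dispatch: size = %llu\n",
                       (unsigned long long)attr_get());

                pthread_join(writer, NULL);
                printf("attr_get after write applied:  size = %llu\n",
                       (unsigned long long)attr_get());
                return 0;
            }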
            di.wang Di Wang added a comment -

            They are serialized by lgh_lock in llog_cat_add_rec().
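
            For reference, the serialization being referred to looks roughly like the sketch below: the handle's lock is held across the whole append, so two callers cannot interleave their index and offset bookkeeping. struct llog_handle_sketch and llog_append_sketch() are invented stand-ins, not llog_cat_add_rec() itself.

            #include <pthread.h>
            #include <stdint.h>
            #include <stdio.h>

            /* Invented, simplified log handle: only what the sketch needs. */
            struct llog_handle_sketch {
                pthread_mutex_t lgh_lock;       /* serializes appends to this log */
                uint32_t        lgh_last_idx;   /* index of the last record written */
                uint64_t        lgh_cur_offset; /* where the next record will go */
            };

            /* Append one record with lgh_lock held, so index assignment and offset
             * bookkeeping from two callers can never interleave. */
            static void llog_append_sketch(struct llog_handle_sketch *lgh, uint32_t reclen)
            {
                pthread_mutex_lock(&lgh->lgh_lock);
                lgh->lgh_last_idx++;
                printf("record %u goes at offset %llu\n", lgh->lgh_last_idx,
                       (unsigned long long)lgh->lgh_cur_offset);
                /* ... build the record and issue the write RPC here ... */
                lgh->lgh_cur_offset += reclen;
                pthread_mutex_unlock(&lgh->lgh_lock);
            }

            int main(void)
            {
                struct llog_handle_sketch lgh = {
                    .lgh_lock = PTHREAD_MUTEX_INITIALIZER,
                    .lgh_last_idx = 0,
                    .lgh_cur_offset = 8192,
                };

                llog_append_sketch(&lgh, 256);
                llog_append_sketch(&lgh, 256);
                return 0;
            }

            As the comment above argues, though, this only serializes local appenders; it does not by itself guarantee that a size fetched from the remote server reflects a write RPC that is still in flight, which is what the cached-size update discussed earlier in the ticket addresses.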

            People

              Assignee: di.wang Di Wang
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 13