[LU-7100] conf-sanity test_84 LBUGs with “(llog_osd.c:811:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index )”
Created: 03/Sep/15  Updated: 19/Apr/17  Resolved: 19/Apr/17

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.10.0

Type: Bug
Priority: Minor
Reporter: James Nunez (Inactive)
Assignee: Mikhail Pershin
Resolution: Cannot Reproduce
Votes: 0
Labels: None
Environment:

Tests run in the autotest environment


Issue Links:
Duplicate
duplicates LU-7039 llog_osd.c:778:llog_osd_next_block())... Resolved
Related
is related to LU-7222 conf-sanity test_84: invalid llog tai... Resolved
is related to LU-7428 conf-sanity test_84, replay-dual 0a: ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

conf-sanity test 84 hangs at mount. We’ve seen this test LBUG with the stack trace below three times in the past month. Logs for an interop occurrence are at https://testing.hpdd.intel.com/test_sets/9145fb1a-51a8-11e5-9249-5254006e85c2

From the MDS log:

00:44:38:LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts: 
00:44:38:LustreError: 18100:0:(llog_osd.c:811:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed: 
00:44:38:LustreError: 18100:0:(llog_osd.c:811:llog_osd_next_block()) LBUG
00:44:38:Pid: 18100, comm: llog_process_th
00:44:38:
00:44:38:Call Trace:
00:44:38: [<ffffffffa046c875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
00:44:38: [<ffffffffa046ce77>] lbug_with_loc+0x47/0xb0 [libcfs]
00:44:38: [<ffffffffa058ed25>] llog_osd_next_block+0xb75/0xbf0 [obdclass]
00:44:38: [<ffffffffa0580bae>] llog_process_thread+0x2de/0xfc0 [obdclass]
00:44:38: [<ffffffffa05cc3a5>] ? keys_fill+0xd5/0x1b0 [obdclass]
00:44:38: [<ffffffffa0581ed5>] llog_process_thread_daemonize+0x45/0x70 [obdclass]
00:44:38: [<ffffffffa0581e90>] ? llog_process_thread_daemonize+0x0/0x70 [obdclass]
00:44:38: [<ffffffff8109e78e>] kthread+0x9e/0xc0
00:44:38: [<ffffffff8100c28a>] child_rip+0xa/0x20
00:44:38: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
00:44:38: [<ffffffff8100c280>] ? child_rip+0x0/0x20
00:44:38:
00:44:38:Kernel panic - not syncing: LBUG
00:44:38:Pid: 18100, comm: llog_process_th Not tainted 2.6.32-504.30.3.el6_lustre.g339e9ad.x86_64 #1

In a different occurrence and in a DNE setup, with logs at https://testing.hpdd.intel.com/test_sets/2eae8eae-4f7d-11e5-bc53-5254006e85c2, the MDS console has a few more errors before the LBUG:

22:49:38:LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts: 
22:49:38:LustreError: 14217:0:(llog_osd.c:788:llog_osd_next_block()) lustre-MDT0000-osd: invalid llog tail at log id 0x4:10/0 offset 16384
22:49:38:LustreError: 14198:0:(mgs_llog.c:457:mgs_find_or_make_fsdb()) Can't get db from client log -22
22:49:38:LustreError: 14198:0:(mgs_llog.c:496:mgs_check_index()) Can't get db for lustre
22:49:38:LustreError: 14219:0:(llog_osd.c:778:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed: 
22:49:38:LustreError: 14219:0:(llog_osd.c:778:llog_osd_next_block()) LBUG
22:49:38:Pid: 14219, comm: llog_process_th
22:49:38:
22:49:38:Call Trace:
22:49:38: [<ffffffffa046c875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
22:49:38: [<ffffffffa046ce77>] lbug_with_loc+0x47/0xb0 [libcfs]
22:49:38: [<ffffffffa058ed15>] llog_osd_next_block+0xb75/0xbf0 [obdclass]
22:49:38: [<ffffffffa0580b4e>] llog_process_thread+0x2de/0xfc0 [obdclass]
22:49:38: [<ffffffffa05cc0e5>] ? keys_fill+0xd5/0x1b0 [obdclass]
22:49:38: [<ffffffffa0581e75>] llog_process_thread_daemonize+0x45/0x70 [obdclass]
22:49:38: [<ffffffffa0581e30>] ? llog_process_thread_daemonize+0x0/0x70 [obdclass]
22:49:38: [<ffffffff8109e78e>] kthread+0x9e/0xc0
22:49:38: [<ffffffff8100c28a>] child_rip+0xa/0x20
22:49:38: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
22:49:38: [<ffffffff8100c280>] ? child_rip+0x0/0x20
22:49:38:
22:49:38:Kernel panic - not syncing: LBUG
22:49:38:Pid: 14219, comm: llog_process_th Not tainted 2.6.32-504.30.3.el6_lustre.gc67434c.x86_64 #1

Another set of logs, from review-dne-part-1, is at https://testing.hpdd.intel.com/test_sets/189b85b6-38a5-11e5-9f03-5254006e85c2
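
For readers less familiar with the assertion in the traces above, here is a minimal user-space sketch of the kind of consistency check it expresses: the index stored in the tail at the end of a block should match the index in the header of the last record in that block. The structures `rec_hdr` and `rec_tail` and the helpers `put_record` and `check_last_record` are hypothetical, simplified stand-ins written for this illustration; they are not the Lustre definitions and not the code in llog_osd.c. The sketch returns -EINVAL (the -22 seen in the mgs_llog.c messages above) on a mismatch rather than asserting.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified record header: only the two fields the
 * illustration needs (total record length and record index). */
struct rec_hdr {
    uint32_t lrh_len;   /* header + payload + tail, in bytes */
    uint32_t lrh_index; /* index of this record */
};

/* Hypothetical, simplified record tail, stored at the end of each record. */
struct rec_tail {
    uint32_t lrt_len;   /* should equal the header's lrh_len */
    uint32_t lrt_index; /* should equal the header's lrh_index */
};

/* Append one record with a zero-filled payload at offset off.
 * Returns the offset just past the new record. The caller must
 * make sure the buffer is large enough. */
static size_t put_record(unsigned char *buf, size_t off,
                         uint32_t index, uint32_t payload_len)
{
    struct rec_hdr hdr;
    struct rec_tail tail;

    hdr.lrh_len = (uint32_t)(sizeof(hdr) + payload_len + sizeof(tail));
    hdr.lrh_index = index;
    tail.lrt_len = hdr.lrh_len;
    tail.lrt_index = index;

    memcpy(buf + off, &hdr, sizeof(hdr));
    memset(buf + off + sizeof(hdr), 0, payload_len);
    memcpy(buf + off + hdr.lrh_len - sizeof(tail), &tail, sizeof(tail));
    return off + hdr.lrh_len;
}

/* Read the tail at the end of buf[0..used), use its length to locate
 * the header of the last record, and compare the two indices.
 * Returns 0 if they agree, -EINVAL if the block looks malformed. */
static int check_last_record(const unsigned char *buf, size_t used)
{
    struct rec_hdr last;
    struct rec_tail tail;

    if (used < sizeof(last) + sizeof(tail))
        return -EINVAL;

    memcpy(&tail, buf + used - sizeof(tail), sizeof(tail));

    /* The tail's length must describe a record that fits in the block. */
    if (tail.lrt_len < sizeof(last) + sizeof(tail) || tail.lrt_len > used)
        return -EINVAL;

    memcpy(&last, buf + used - tail.lrt_len, sizeof(last));

    /* The condition from the assertion, reported as an error instead. */
    if (last.lrh_index != tail.lrt_index)
        return -EINVAL;

    return 0;
}
```

In this sketch, a buffer filled by two `put_record` calls passes `check_last_record`, while overwriting the final tail with a different index makes it return -EINVAL. Whether the real failure came from a stale offset, a partially written block, or something else is not established in this ticket.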



 Comments   
Comment by Joseph Gmitter (Inactive) [ 04/Sep/15 ]

Hi Mike,
Can you have a look at this issue?
Thanks.
Joe

Comment by Andreas Dilger [ 18/Sep/15 ]

Hit this a few times:
https://testing.hpdd.intel.com/test_sets/72457c70-542f-11e5-bfaa-5254006e85c2
https://testing.hpdd.intel.com/test_sets/4784c444-5a3f-11e5-9147-5254006e85c2

Comment by Gerrit Updater [ 08/Oct/15 ]

Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/16771
Subject: LU-7100 llog: remember the latest offset in local variable
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 512775d4f53bb25e92756e1a639317c3a5a2a8a4

Comment by James Nunez (Inactive) [ 19/Apr/17 ]

conf-sanity test 84 has neither timed out nor failed in the past 3 months. Let's close this ticket; we can reopen it if this test or this issue comes up again.

Generated at Sat Feb 10 02:05:59 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.