[LU-5003] sanity-hsm test_302 llog_write()) ASSERTION( loghandle->lgh_obj != ((void *)0) Created: 03/May/14  Updated: 06/Oct/14  Resolved: 06/Oct/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: James Nunez (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Duplicate
duplicates LU-5019 Test failure on test suite conf-sanit... Closed
Severity: 3
Rank (Obsolete): 13852

 Description   

This issue was created by maloo for wangdi <di.wang@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/d021bd8e-d2a0-11e3-a102-52540035b04c.

The sub-test test_302 failed with the following error:

test failed to respond and timed out

23:43:18:LustreError: 11-0: lustre-MDT0000-osp-MDT0001: Communicating with 10.1.4.101@tcp, operation obd_ping failed with -107.
23:43:18:LustreError: Skipped 6 previous similar messages
23:43:18:Lustre: lustre-MDT0000-osp-MDT0001: Connection to lustre-MDT0000 (at 10.1.4.101@tcp) was lost; in progress operations using this service will wait for recovery to complete
23:43:18:Lustre: Skipped 5 previous similar messages
23:43:18:LustreError: 166-1: MGC10.1.4.101@tcp: Connection to MGS (at 10.1.4.101@tcp) was lost; in progress operations using this service will fail
23:43:18:LustreError: 31202:0:(llog.c:407:llog_process_thread()) llog found corrupted
23:43:18:LustreError: 31202:0:(llog.c:870:llog_write()) ASSERTION( loghandle->lgh_obj != ((void *)0) ) failed:
23:43:18:LustreError: 31202:0:(llog.c:870:llog_write()) LBUG
23:43:18:Pid: 31202, comm: llog_process_th
23:43:18:
23:43:18:Call Trace:
23:43:18: [<ffffffffa0483895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
23:43:18: [<ffffffffa0483e97>] lbug_with_loc+0x47/0xb0 [libcfs]
23:43:18: [<ffffffffa059dd42>] llog_write+0x352/0x420 [obdclass]
23:43:18: [<ffffffffa04945a1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
23:43:18: [<ffffffffa059defc>] llog_cancel_rec+0xbc/0x6a0 [obdclass]
23:43:18: [<ffffffffa059f073>] llog_process_thread+0xb93/0xdc0 [obdclass]
23:43:18: [<ffffffffa05ed9cf>] ? keys_fill+0x6f/0x190 [obdclass]
23:43:18: [<ffffffffa059f2e5>] llog_process_thread_daemonize+0x45/0x70 [obdclass]
23:43:18: [<ffffffffa059f2a0>] ? llog_process_thread_daemonize+0x0/0x70 [obdclass]
23:43:18: [<ffffffff8109aee6>] kthread+0x96/0xa0
23:43:18: [<ffffffff8100c20a>] child_rip+0xa/0x20
23:43:18: [<ffffffff8109ae50>] ? kthread+0x0/0xa0
23:43:18: [<ffffffff8100c200>] ? child_rip+0x0/0x20
23:43:18:
23:43:18:Kernel panic - not syncing: LBUG

Info required for matching: sanity-hsm 302



 Comments   
Comment by Mikhail Pershin [ 05/May/14 ]

I saw the same issue while testing another patch, it happens in test_301. I tend to think that the reason might be commit http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=6760f383f6cd0964e70250723b9d24bce13b5318
it changed the plan llog deletion recognition logic so I suppose llog was deleted during cancel and is being updated after that. This is just quick idea and area to check first.

Comment by Jian Yu [ 28/May/14 ]

Lustre build: http://build.whamcloud.com/job/lustre-master/2065/

conf-sanity test 57a also hit the same failure:
https://maloo.whamcloud.com/test_sets/6b7ba1ac-e5e1-11e3-87f3-52540035b04c

Comment by nasf (Inactive) [ 11/Jul/14 ]

Another failure instance:

https://testing.hpdd.intel.com/test_sets/8c42ce96-085b-11e4-8b2d-5254006e85c2

Comment by John Hammond [ 19/Aug/14 ]

Another:

https://testing.hpdd.intel.com/test_logs/ef9a8e1e-27e8-11e4-8e75-5254006e85c2

Comment by Alex Zhuravlev [ 17/Sep/14 ]

https://testing.hpdd.intel.com/test_sets/6d49dfa6-3e26-11e4-a0a3-5254006e85c2

Comment by Alexander Zarochentsev [ 17/Sep/14 ]

I have a crash looking exactly the same at replay-vbr test_4k , the llog being processed is a config llog.
From a crashdump. the loghandle appears to be a client one:

  lgh_hdr = 0xffff88005a8d0000,
  lgh_obj = 0x0 <per_cpu__irq_stack_union>,
  lgh_last_idx = 249,
  lgh_cur_idx = 0,
  lgh_cur_offset = 0,
  lgh_ctxt = 0xffff88005b433e40,
...
  lgh_name = 0x0 <per_cpu__irq_stack_union>, 
  private_data = 0x0 <per_cpu__irq_stack_union>, 
  lgh_logops = 0xffffffffa078f260 <llog_client_ops>, 
  lgh_refcount = {
    counter = 1
  }

lgh_obj = 0x0 is what caused llog_write() to LBUG and lgh_logops = 0xffffffffa078f260 <llog_client_ops> indicates that the llog_handle is a client one.

Comment by Alexander Zarochentsev [ 17/Sep/14 ]

patch http://review.whamcloud.com/#/c/11955/

Comment by Alexander Zarochentsev [ 18/Sep/14 ]

Seems it is not right to compare error code with -EIO, llog block read functions returns -EINVAL if llog fle format check fail:

static int llog_osd_next_block(const struct lu_env *env,
			       struct llog_handle *loghandle, int *cur_idx,
			       int next_idx, __u64 *cur_offset, void *buf,
			       int len)
{
	struct llog_thread_info	*lgi = llog_info(env);
...
		if (rc < sizeof(*tail)) {
			CERROR("%s: invalid llog block at log id "DOSTID"/%u "
			       "offset "LPU64"\n",
			       o->do_lu.lo_dev->ld_obd->obd_name,
			       POSTID(&loghandle->lgh_id.lgl_oi),
			       loghandle->lgh_id.lgl_ogen, *cur_offset);
			GOTO(out, rc = -EINVAL);
		}
...
		/* this shouldn't happen */
		if (tail->lrt_index == 0) {
			CERROR("%s: invalid llog tail at log id "DOSTID"/%u "
			       "offset "LPU64"\n",
			       o->do_lu.lo_dev->ld_obd->obd_name,
			       POSTID(&loghandle->lgh_id.lgl_oi),
			       loghandle->lgh_id.lgl_ogen, *cur_offset);
			GOTO(out, rc = -EINVAL);
		}
...

EIO is likely a temporary error (network, disk access, ...), EINVAL – points to a llog corruption.
would it be better to change rc == -EIO check to rc == -EINVAL && ... ?

Comment by Peter Jones [ 18/Sep/14 ]

Thanks Zam!

James

Could you please review this patch?

Thanks

Peter

Comment by Peter Jones [ 06/Oct/14 ]

Landed for 2.7

Generated at Sat Feb 10 01:47:42 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.