[LU-4949] Master to 2.5 server - LBUG: lustre_msg_buf()) ASSERTION( 0 ) failed: incorrect message magic: 0000000d(msg:ffff88020fe3c0f8) Created: 23/Apr/14  Updated: 24/Apr/14  Resolved: 24/Apr/14

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Patrick Farrell (Inactive)
Assignee: WC Triage
Resolution: Duplicate
Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-3333 lustre_msg_get_opc()) incorrect messa... Resolved
Severity: 3
Rank (Obsolete): 13689

 Description   

While testing master clients (2.5.57) against 2.5 servers, we encountered this LBUG:

2014-04-18T19:19:05.626354-05:00 c1-0c0s1n0 LustreError: 5456:0:(pack_generic.c:487:lustre_msg_buf()) ASSERTION( 0 ) failed: incorrect message magic: 0000000d(msg:ffff88020fe3c0f8)
2014-04-18T19:19:05.626401-05:00 c1-0c0s1n0 LustreError: 5456:0:(pack_generic.c:487:lustre_msg_buf()) LBUG
2014-04-18T19:19:05.626409-05:00 c1-0c0s1n0 Pid: 5456, comm: IOR
2014-04-18T19:19:05.626417-05:00 c1-0c0s1n0 Call Trace:
2014-04-18T19:19:05.626424-05:00 c1-0c0s1n0 [<ffffffff810065b1>] try_stack_unwind+0x161/0x1a0
2014-04-18T19:19:05.626431-05:00 c1-0c0s1n0 [<ffffffff81004dd9>] dump_trace+0x89/0x440
2014-04-18T19:19:05.626437-05:00 c1-0c0s1n0 [<ffffffffa0168897>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
2014-04-18T19:19:05.626444-05:00 c1-0c0s1n0 [<ffffffffa0168de7>] lbug_with_loc+0x47/0xc0 [libcfs]
2014-04-18T19:19:05.626449-05:00 c1-0c0s1n0 [<ffffffffa03e63c9>] lustre_msg_buf+0x49/0x60 [ptlrpc]
2014-04-18T19:19:05.626455-05:00 c1-0c0s1n0 [<ffffffffa04128a0>] _sptlrpc_enlarge_msg_inplace+0x60/0x1c0 [ptlrpc]
2014-04-18T19:19:05.626462-05:00 c1-0c0s1n0 [<ffffffffa0422c81>] null_enlarge_reqbuf+0xd1/0x200 [ptlrpc]
2014-04-18T19:19:05.626468-05:00 c1-0c0s1n0 [<ffffffffa040fe9c>] sptlrpc_cli_enlarge_reqbuf+0x5c/0x160 [ptlrpc]
2014-04-18T19:19:05.626479-05:00 c1-0c0s1n0 [<ffffffffa065d790>] mdc_finish_enqueue+0xa60/0x1090 [mdc]
2014-04-18T19:19:05.626485-05:00 c1-0c0s1n0 [<ffffffffa065f6e6>] mdc_enqueue+0x13d6/0x1ce0 [mdc]
2014-04-18T19:19:05.626495-05:00 c1-0c0s1n0 [<ffffffffa0660280>] mdc_intent_lock+0x290/0x55f [mdc]
2014-04-18T19:19:05.626511-05:00 c1-0c0s1n0 [<ffffffffa06090bd>] lmv_intent_open+0x33d/0x9a0 [lmv]
2014-04-18T19:19:05.626519-05:00 c1-0c0s1n0 [<ffffffffa06099ca>] lmv_intent_lock+0x2aa/0x370 [lmv]
2014-04-18T19:19:05.626526-05:00 c1-0c0s1n0 [<ffffffffa0754fa2>] ll_lookup_it+0x4b2/0x1840 [lustre]
2014-04-18T19:19:05.626535-05:00 c1-0c0s1n0 [<ffffffffa07563ba>] ll_lookup_nd+0x8a/0x550 [lustre]
2014-04-18T19:19:05.626542-05:00 c1-0c0s1n0 [<ffffffff8114647c>] d_alloc_and_lookup+0x4c/0x80
2014-04-18T19:19:05.626550-05:00 c1-0c0s1n0 [<ffffffff811465ae>] __lookup_hash+0xfe/0x180
2014-04-18T19:19:05.626581-05:00 c1-0c0s1n0 [<ffffffff81148c6f>] do_last+0x2cf/0x7a0
2014-04-18T19:19:05.626592-05:00 c1-0c0s1n0 [<ffffffff81149e08>] path_openat+0xc8/0x3c0
2014-04-18T19:19:05.626619-05:00 c1-0c0s1n0 [<ffffffff8114a228>] do_filp_open+0x48/0xa0
2014-04-18T19:19:05.626626-05:00 c1-0c0s1n0 [<ffffffff8113b0ee>] do_sys_open+0x16e/0x240
2014-04-18T19:19:05.626641-05:00 c1-0c0s1n0 [<ffffffff8113b200>] sys_open+0x20/0x30
2014-04-18T19:19:05.626666-05:00 c1-0c0s1n0 [<ffffffff81362e6b>] system_call_fastpath+0x16/0x1b
2014-04-18T19:19:05.626684-05:00 c1-0c0s1n0 [<00000000005b8b31>] 0x5b8b31

This was encountered during an IO performance test run consisting mostly of IOR runs of various sizes. It occurred on a number of different nodes: at least 10% of the nodes used in the run hit this bug, but by no means all of them. In all cases, the stack trace appeared to be the same.
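
For anyone checking how widely this assertion fired across a run, a minimal Python sketch is shown below. It assumes console logs in the same syslog layout as the lines above (timestamp, node name, then the LustreError text); the names `LBUG_RE` and `scan_lbug_lines` are hypothetical helpers, not part of any Lustre tool.

```python
import re
from collections import Counter

# Matches the assertion line as it appears in the console log above.
# Assumption: every log line starts with "<timestamp> <node> ".
LBUG_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<node>\S+)\s+LustreError:.*"
    r"lustre_msg_buf\(\)\) ASSERTION\( 0 \) failed: incorrect message magic: "
    r"(?P<magic>[0-9a-fA-F]{8})\(msg:(?P<msg>[0-9a-fA-F]+)\)"
)

def scan_lbug_lines(lines):
    """Return (node, magic, msg_pointer) for each matching assertion line."""
    hits = []
    for line in lines:
        m = LBUG_RE.match(line)
        if m:
            hits.append((m.group("node"),
                         int(m.group("magic"), 16),
                         int(m.group("msg"), 16)))
    return hits

# Two lines copied from the log in this report; only the first carries the magic.
sample = [
    "2014-04-18T19:19:05.626354-05:00 c1-0c0s1n0 LustreError: 5456:0:"
    "(pack_generic.c:487:lustre_msg_buf()) ASSERTION( 0 ) failed: "
    "incorrect message magic: 0000000d(msg:ffff88020fe3c0f8)",
    "2014-04-18T19:19:05.626401-05:00 c1-0c0s1n0 LustreError: 5456:0:"
    "(pack_generic.c:487:lustre_msg_buf()) LBUG",
]

hits = scan_lbug_lines(sample)
per_node = Counter(node for node, _, _ in hits)   # assertion count per node
magics = Counter(magic for _, magic, _ in hits)   # distribution of bad magic values
```

Grouping by magic value as well as by node may help show whether every failing node saw the same bogus value or whether the field held varying data.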

The given magic value suggests it was either never set or was overwritten somehow. Here are the contents of the lustre_msg_v2 struct:

crash> struct lustre_msg_v2 0xffff88060f3c40f8
struct lustre_msg_v2 {
lm_bufcount = 4608,
lm_secflvr = 2,
lm_magic = 13,
lm_repsize = 0,
lm_cksum = 132270,
lm_flags = 2,
lm_padding_2 = 19,
lm_padding_3 = 0,
lm_buflens = 0xffff88060f3c4118
}
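
To illustrate why this header trips the assertion, here is a minimal Python sketch that applies the same kind of magic check to the values in the crash output above. It assumes `LUSTRE_MSG_MAGIC_V2` is `0x0BD00BD3` (believed to match the Lustre headers, but worth verifying against your tree), that the eight fields shown are consecutive little-endian 32-bit integers, and it uses a hypothetical `max_bufcount` bound purely as a sanity heuristic.

```python
import struct

# Assumed value of LUSTRE_MSG_MAGIC_V2; verify against the headers in your tree.
LUSTRE_MSG_MAGIC_V2 = 0x0BD00BD3
# The same magic as it would appear from a peer of opposite endianness.
LUSTRE_MSG_MAGIC_V2_SWABBED = 0xD30BD00B

# Field order taken from the crash output above; lm_buflens follows these fields.
FIELDS = ("lm_bufcount", "lm_secflvr", "lm_magic", "lm_repsize",
          "lm_cksum", "lm_flags", "lm_padding_2", "lm_padding_3")

def parse_header(raw):
    """Unpack the fixed 32-byte header from little-endian bytes."""
    return dict(zip(FIELDS, struct.unpack("<8I", raw[:32])))

def check_header(hdr, max_bufcount=64):
    """Return a list of problems found; max_bufcount is a hypothetical bound."""
    problems = []
    if hdr["lm_magic"] not in (LUSTRE_MSG_MAGIC_V2, LUSTRE_MSG_MAGIC_V2_SWABBED):
        problems.append("bad magic 0x%08x" % hdr["lm_magic"])
    if hdr["lm_bufcount"] > max_bufcount:
        problems.append("implausible lm_bufcount %d" % hdr["lm_bufcount"])
    return problems

# Values copied from the struct dump in this report.
dumped = struct.pack("<8I", 4608, 2, 13, 0, 132270, 2, 19, 0)
hdr = parse_header(dumped)
problems = check_header(hdr)
for p in problems:
    print(p)
```

Both checks fail for the dumped header: the magic is neither the assumed V2 value nor its byte-swapped form, and `lm_bufcount` is far larger than the handful of buffers a request normally carries. That pattern fits the whole header holding unrelated data better than a single corrupted field, though the dump would be needed to confirm it.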

I'll be uploading a dump shortly. I do not currently have a clear reproducer, but if needed, I could probably reproduce it with a large test run.



 Comments   
Comment by Patrick Farrell (Inactive) [ 23/Apr/14 ]

Dump at:
ftp.whamcloud.com

uploads/LU-4949/140423_compute_LBUG.tar.gz
