Details
Bug
Resolution: Duplicate
Major
None
None
None
3
13689
Description
While testing master clients (2.5.57) against 2.5 servers, we encountered this LBUG:
2014-04-18T19:19:05.626354-05:00 c1-0c0s1n0 LustreError: 5456:0:(pack_generic.c:487:lustre_msg_buf()) ASSERTION( 0 ) failed: incorrect message magic: 0000000d(msg:ffff88020fe3c0f8)
2014-04-18T19:19:05.626401-05:00 c1-0c0s1n0 LustreError: 5456:0:(pack_generic.c:487:lustre_msg_buf()) LBUG
2014-04-18T19:19:05.626409-05:00 c1-0c0s1n0 Pid: 5456, comm: IOR
2014-04-18T19:19:05.626417-05:00 c1-0c0s1n0 Call Trace:
2014-04-18T19:19:05.626424-05:00 c1-0c0s1n0 [<ffffffff810065b1>] try_stack_unwind+0x161/0x1a0
2014-04-18T19:19:05.626431-05:00 c1-0c0s1n0 [<ffffffff81004dd9>] dump_trace+0x89/0x440
2014-04-18T19:19:05.626437-05:00 c1-0c0s1n0 [<ffffffffa0168897>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
2014-04-18T19:19:05.626444-05:00 c1-0c0s1n0 [<ffffffffa0168de7>] lbug_with_loc+0x47/0xc0 [libcfs]
2014-04-18T19:19:05.626449-05:00 c1-0c0s1n0 [<ffffffffa03e63c9>] lustre_msg_buf+0x49/0x60 [ptlrpc]
2014-04-18T19:19:05.626455-05:00 c1-0c0s1n0 [<ffffffffa04128a0>] _sptlrpc_enlarge_msg_inplace+0x60/0x1c0 [ptlrpc]
2014-04-18T19:19:05.626462-05:00 c1-0c0s1n0 [<ffffffffa0422c81>] null_enlarge_reqbuf+0xd1/0x200 [ptlrpc]
2014-04-18T19:19:05.626468-05:00 c1-0c0s1n0 [<ffffffffa040fe9c>] sptlrpc_cli_enlarge_reqbuf+0x5c/0x160 [ptlrpc]
2014-04-18T19:19:05.626479-05:00 c1-0c0s1n0 [<ffffffffa065d790>] mdc_finish_enqueue+0xa60/0x1090 [mdc]
2014-04-18T19:19:05.626485-05:00 c1-0c0s1n0 [<ffffffffa065f6e6>] mdc_enqueue+0x13d6/0x1ce0 [mdc]
2014-04-18T19:19:05.626495-05:00 c1-0c0s1n0 [<ffffffffa0660280>] mdc_intent_lock+0x290/0x55f [mdc]
2014-04-18T19:19:05.626511-05:00 c1-0c0s1n0 [<ffffffffa06090bd>] lmv_intent_open+0x33d/0x9a0 [lmv]
2014-04-18T19:19:05.626519-05:00 c1-0c0s1n0 [<ffffffffa06099ca>] lmv_intent_lock+0x2aa/0x370 [lmv]
2014-04-18T19:19:05.626526-05:00 c1-0c0s1n0 [<ffffffffa0754fa2>] ll_lookup_it+0x4b2/0x1840 [lustre]
2014-04-18T19:19:05.626535-05:00 c1-0c0s1n0 [<ffffffffa07563ba>] ll_lookup_nd+0x8a/0x550 [lustre]
2014-04-18T19:19:05.626542-05:00 c1-0c0s1n0 [<ffffffff8114647c>] d_alloc_and_lookup+0x4c/0x80
2014-04-18T19:19:05.626550-05:00 c1-0c0s1n0 [<ffffffff811465ae>] __lookup_hash+0xfe/0x180
2014-04-18T19:19:05.626581-05:00 c1-0c0s1n0 [<ffffffff81148c6f>] do_last+0x2cf/0x7a0
2014-04-18T19:19:05.626592-05:00 c1-0c0s1n0 [<ffffffff81149e08>] path_openat+0xc8/0x3c0
2014-04-18T19:19:05.626619-05:00 c1-0c0s1n0 [<ffffffff8114a228>] do_filp_open+0x48/0xa0
2014-04-18T19:19:05.626626-05:00 c1-0c0s1n0 [<ffffffff8113b0ee>] do_sys_open+0x16e/0x240
2014-04-18T19:19:05.626641-05:00 c1-0c0s1n0 [<ffffffff8113b200>] sys_open+0x20/0x30
2014-04-18T19:19:05.626666-05:00 c1-0c0s1n0 [<ffffffff81362e6b>] system_call_fastpath+0x16/0x1b
2014-04-18T19:19:05.626684-05:00 c1-0c0s1n0 [<00000000005b8b31>] 0x5b8b31
This was encountered during an I/O performance test run, which consists mostly of IOR runs of various sizes. It occurred on a number of different nodes: at least 10% of the nodes used in the run hit this bug, but by no means all of them. In all cases, the stack trace appeared to be the same.
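To check whether every affected node really hit the same call path, the traces can be reduced to their function names and grouped. The following is a rough sketch, not a tool used for this report: it assumes console log lines in the format shown above, and the helper names `trace_signature` and `group_by_signature` are made up for illustration.

```python
import re

# Matches kernel call-trace frames such as
#   [<ffffffffa03e63c9>] lustre_msg_buf+0x49/0x60 [ptlrpc]
# and captures only the symbol name. Frames without a symbol
# (e.g. "[<00000000005b8b31>] 0x5b8b31") are skipped.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def trace_signature(log_lines):
    """Return the tuple of function names found in the call-trace frames."""
    names = []
    for line in log_lines:
        match = FRAME_RE.search(line)
        if match:
            names.append(match.group(1))
    return tuple(names)


def group_by_signature(traces_by_node):
    """Map each distinct signature to the sorted list of nodes that produced it."""
    groups = {}
    for node, lines in traces_by_node.items():
        groups.setdefault(trace_signature(lines), []).append(node)
    return {sig: sorted(nodes) for sig, nodes in groups.items()}
```

If `group_by_signature` returns a single key for all crashed nodes, the traces are identical up to addresses and offsets, which is only a textual comparison and says nothing about the state of the message buffer on each node.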
The magic value reported in the assertion suggests the field was either never set or was overwritten somehow. Here are the contents of the lustre_msg_v2 struct:
crash> struct lustre_msg_v2 0xffff88060f3c40f8
struct lustre_msg_v2 {
lm_bufcount = 4608,
lm_secflvr = 2,
lm_magic = 13,
lm_repsize = 0,
lm_cksum = 132270,
lm_flags = 2,
lm_padding_2 = 19,
lm_padding_3 = 0,
lm_buflens = 0xffff88060f3c4118
}
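For reference, crash prints lm_magic in decimal, while the assertion message prints it as eight hex digits. A small sketch of the comparison is below. The two magic constants are quoted from memory of lustre_idl.h and should be treated as assumptions to verify against the tree in use; `classify_magic` and `format_magic` are hypothetical helper names, not Lustre functions.

```python
# Assumed values, believed to match lustre_idl.h; verify against your source tree.
LUSTRE_MSG_MAGIC_V2 = 0x0BD00BD3
LUSTRE_MSG_MAGIC_V2_SWABBED = 0xD30BD00B


def format_magic(lm_magic):
    """Render the magic as eight hex digits, as in the assertion message."""
    return f"{lm_magic:08x}"


def classify_magic(lm_magic):
    """Classify a dumped lm_magic value against the expected v2 magics."""
    if lm_magic == LUSTRE_MSG_MAGIC_V2:
        return "v2"
    if lm_magic == LUSTRE_MSG_MAGIC_V2_SWABBED:
        return "v2-swabbed"
    return "invalid"


# lm_magic = 13 from the crash output above.
print(format_magic(13), classify_magic(13))
```

The dumped value 13 formats as 0000000d, matching the "incorrect message magic" in the log, and it is neither the native nor the byte-swapped v2 magic under the assumed constants.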
I'll be uploading a dump shortly. I do not have a clear reproducer for this currently, but if needed, I could probably manage it with a large test run.
Issue Links
- is duplicated by
  LU-3333 lustre_msg_get_opc()) incorrect message magic: a0b03b5 LBUG (Resolved)
Dump at:
ftp.whamcloud.com
uploads/LU-4949/140423_compute_LBUG.tar.gz