[LU-3333] lustre_msg_get_opc()) incorrect message magic: a0b03b5 LBUG Created: 14/May/13  Updated: 06/Mar/15  Resolved: 01/May/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.4.0, Lustre 2.1.5, Lustre 1.8.9
Fix Version/s: Lustre 2.6.0, Lustre 2.5.2

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: None

Attachments: File lustre-log.1398105923.9374.txt.gz    
Issue Links:
Duplicate
duplicates LU-4949 Master to 2.5 server - LBUG: lustre_m... Closed
duplicates LU-4956 Test failure on test suite recovery-s... Closed
is duplicated by LU-3309 sanity test 182 crash Resolved
Related
is related to LU-6172 Client LBUG in null_free_reqbuf Resolved
Severity: 3
Rank (Obsolete): 8241

 Description   

I was suspicious when I hit this for the first time, but I hit it once more, so I am filing a bug.

First time was in sanity test 182

[78415.059119] LustreError: 17692:0:(mdt_recovery.c:390:mdt_last_rcvd_update()) Trying to overwrite bigger transno:on-disk: 25769825268, new: 25769825267 replay: 0. see LU-617.
[78415.061562] LustreError: 17692:0:(mdt_recovery.c:390:mdt_last_rcvd_update()) Skipped 8 previous similar messages
[78417.103412] LustreError: 31938:0:(pack_generic.c:1032:lustre_msg_get_opc()) incorrect message magic: a0b03b5e(msg:ffff8800aba18df0)
[78417.103916] LustreError: 31938:0:(pack_generic.c:1033:lustre_msg_get_opc()) LBUG
[78417.104332] Pid: 31938, comm: createmany
[78417.104542] 
[78417.104543] Call Trace:
[78417.104925]  [<ffffffffa0b028a5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[78417.105213]  [<ffffffffa0b02ea7>] lbug_with_loc+0x47/0xb0 [libcfs]
[78417.105490]  [<ffffffffa0d5a4ba>] lustre_msg_get_opc+0x8a/0x110 [ptlrpc]
[78417.105778]  [<ffffffffa0d5d97b>] _debug_req+0x57b/0x680 [ptlrpc]
[78417.106087]  [<ffffffff81047fde>] ? kernel_map_pages+0xfe/0x110
[78417.106345]  [<ffffffffa0d4965e>] ptlrpc_free_committed+0x20e/0x620 [ptlrpc]
[78417.106619]  [<ffffffffa0d4b313>] after_reply+0x7a3/0xd90 [ptlrpc]
[78417.106878]  [<ffffffffa0d502c3>] ptlrpc_check_set+0x1093/0x1da0 [ptlrpc]
[78417.107137]  [<ffffffff8107eca0>] ? process_timeout+0x0/0x10
[78417.107389]  [<ffffffffa0d512fa>] ptlrpc_set_wait+0x32a/0x880 [ptlrpc]
[78417.107639]  [<ffffffff8105ad10>] ? default_wake_function+0x0/0x20
[78417.107907]  [<ffffffffa0d5b4c6>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
[78417.108187]  [<ffffffffa0d518cf>] ptlrpc_queue_wait+0x7f/0x220 [ptlrpc]
[78417.108453]  [<ffffffffa0d2d9c5>] ldlm_cli_enqueue+0x365/0x770 [ptlrpc]
[78417.119499]  [<ffffffff814fc8bc>] ? __mutex_lock_slowpath+0x21c/0x2c0
[78417.119779]  [<ffffffffa0d326e0>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc]
[78417.120071]  [<ffffffffa0f389e0>] ? ll_md_blocking_ast+0x0/0x750 [lustre]
[78417.120334]  [<ffffffffa0415f55>] mdc_enqueue+0x795/0x18c0 [mdc]
[78417.120583]  [<ffffffffa041727d>] mdc_intent_lock+0x1fd/0x5b4 [mdc]
[78417.120847]  [<ffffffffa0f389e0>] ? ll_md_blocking_ast+0x0/0x750 [lustre]
[78417.121427]  [<ffffffffa0d326e0>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc]
[78417.121688]  [<ffffffffa03c8667>] ? lmv_fid_alloc+0x117/0x4b0 [lmv]
[78417.121952]  [<ffffffffa03e02d6>] lmv_intent_open+0x1f6/0x8c0 [lmv]
[78417.122206]  [<ffffffffa0f389e0>] ? ll_md_blocking_ast+0x0/0x750 [lustre]
[78417.122464]  [<ffffffff811661ca>] ? cache_alloc_debugcheck_after+0x14a/0x210
[78417.122720]  [<ffffffff81168671>] ? __kmalloc+0x191/0x2a0
[78417.122958]  [<ffffffffa03e0c4b>] lmv_intent_lock+0x2ab/0x370 [lmv]
[78417.123211]  [<ffffffffa0f389e0>] ? ll_md_blocking_ast+0x0/0x750 [lustre]
[78417.123471]  [<ffffffffa0f37a9e>] ? ll_i2gids+0x2e/0xd0 [lustre]
[78417.123718]  [<ffffffffa0f1e4ba>] ? ll_prep_md_op_data+0xfa/0x3a0 [lustre]
[78417.123983]  [<ffffffffa0f3d014>] ll_lookup_it+0x3a4/0xbf0 [lustre]
[78417.124273]  [<ffffffffa0f389e0>] ? ll_md_blocking_ast+0x0/0x750 [lustre]
[78417.124539]  [<ffffffffa0f3d8e9>] ll_lookup_nd+0x89/0x3b0 [lustre]
[78417.124784]  [<ffffffff8119b456>] ? d_alloc+0x166/0x1d0
[78417.125087]  [<ffffffff8118e8e2>] __lookup_hash+0x102/0x160
[78417.125322]  [<ffffffff8118ef8a>] lookup_hash+0x3a/0x50
[78417.125550]  [<ffffffff81192edb>] do_filp_open+0x2eb/0xe00
[78417.125790]  [<ffffffff814fae88>] ? thread_return+0x4e/0x776
[78417.126034]  [<ffffffff8118f37b>] ? getname+0x3b/0x250
[78417.126259]  [<ffffffff8119fd7b>] ? alloc_fd+0xab/0x160
[78417.126490]  [<ffffffff8117e399>] do_sys_open+0x69/0x140
[78417.126717]  [<ffffffff8100c575>] ? math_state_restore+0x45/0x60
[78417.126965]  [<ffffffff8117e4b0>] sys_open+0x20/0x30
[78417.127190]  [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
[78417.127428] 
[78417.169930] Kernel panic - not syncing: LBUG

Crashdumps and modules for this one are in:
/exports/crashdumps/192.168.10.224-2013-05-09-20\:43\:05

The other crash was like this:

[79955.565718] Lustre: DEBUG MARKER: == sanity test 71: Running dbench on lustre (don't segment fault) ====== 21:01:15 (1368493275)
[80393.197212] LustreError: 25536:0:(pack_generic.c:1032:lustre_msg_get_opc()) incorrect message magic: a0b03b5e(msg:ffff8800464ccdf0)
[80393.197711] LustreError: 25536:0:(pack_generic.c:1033:lustre_msg_get_opc()) LBUG
[80393.209355] Pid: 25536, comm: dbench
[80393.209559] 
[80393.209559] Call Trace:
[80393.209964]  [<ffffffffa0b028a5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[80393.210228]  [<ffffffffa0b02ea7>] lbug_with_loc+0x47/0xb0 [libcfs]
[80393.210502]  [<ffffffffa0d5a4ba>] lustre_msg_get_opc+0x8a/0x110 [ptlrpc]
[80393.212853]  [<ffffffffa0d5d97b>] _debug_req+0x57b/0x680 [ptlrpc]
[80393.213149]  [<ffffffff81047fde>] ? kernel_map_pages+0xfe/0x110
[80393.213412]  [<ffffffffa0d4965e>] ptlrpc_free_committed+0x20e/0x620 [ptlrpc]
[80393.213689]  [<ffffffffa0d4b313>] after_reply+0x7a3/0xd90 [ptlrpc]
[80393.213967]  [<ffffffffa0d502c3>] ptlrpc_check_set+0x1093/0x1da0 [ptlrpc]
[80393.214225]  [<ffffffff8107eca0>] ? process_timeout+0x0/0x10
[80393.214477]  [<ffffffffa0d512fa>] ptlrpc_set_wait+0x32a/0x880 [ptlrpc]
[80393.214733]  [<ffffffff8105ad10>] ? default_wake_function+0x0/0x20
[80393.214995]  [<ffffffffa0d5b4c6>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
[80393.215271]  [<ffffffffa0d518cf>] ptlrpc_queue_wait+0x7f/0x220 [ptlrpc]
[80393.215526]  [<ffffffffa0401f59>] mdc_sync+0xb9/0x230 [mdc]
[80393.215773]  [<ffffffffa03d576e>] lmv_sync+0x2ae/0x550 [lmv]
[80393.216022]  [<ffffffffa0f0a6ab>] ll_fsync+0x1ab/0x860 [lustre]
[80393.216290]  [<ffffffff811b2401>] vfs_fsync_range+0xa1/0xe0
[80393.216525]  [<ffffffff811b24ad>] vfs_fsync+0x1d/0x20
[80393.216777]  [<ffffffff811b24ee>] do_fsync+0x3e/0x60
[80393.217003]  [<ffffffff811b2540>] sys_fsync+0x10/0x20
[80393.217230]  [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
[80393.217471] 
[80393.228801] Kernel panic - not syncing: LBUG

/exports/crashdumps/192.168.10.224-2013-05-13-21\:08\:35/vmcore, same modules as before.



 Comments   
Comment by Andreas Dilger [ 14/May/13 ]

Separate from fixing the root cause of this problem, it doesn't make sense to LBUG() on anything coming off of the network. At worst this should be turned into an error.
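For illustration, a minimal sketch of that suggestion (hypothetical helper name, not the actual pack_generic.c code; LUSTRE_MSG_MAGIC_V2 and CERROR() are real Lustre names) could look like this:

/*
 * Sketch only: report an unexpected magic instead of asserting, and let
 * the caller handle the failure.  lustre_msg_get_opc_v2() stands in for
 * the real V2 accessor.
 */
static __u32 msg_get_opc_sketch(struct lustre_msg *msg)
{
        switch (msg->lm_magic) {
        case LUSTRE_MSG_MAGIC_V2:
                return lustre_msg_get_opc_v2(msg);
        default:
                /* Bad data off the wire (or a stale pointer, as in this
                 * ticket) becomes a logged error instead of a panic. */
                CERROR("incorrect message magic: %08x (msg: %p)\n",
                       msg->lm_magic, msg);
                return 0;
        }
}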

Comment by Oleg Drokin [ 23/May/13 ]

I still hit this fairly frequently and caught one instance live today.

I found the request that caused the LBUG, but in the request the rq_reqmsg does NOT match the lustre_msg pointer printed in the assertion.
The lustre_msg pointer seems to be pointing at some sort of structure that has an mfd handle embedded in it later on (reply buffer content? certainly sounds plausible); how that happened is totally unclear to me.
The "lustre_msg" magic obviously does not match a proper one, but it also does not match what is in the assertion message: instead of the long number printed there (which varies from recompile to recompile, but is the same across crashes of the same module version), the in-memory value is always 0x2341.

I wonder if it's an artifact of rq_msg replacement when enlarging.
Added some debug and going to rerun.

Comment by Oleg Drokin [ 23/May/13 ]

My hunch was right, it seems.

00000002:00100000:0.0:1369322005.401613:0:7254:0:(mdc_locks.c:635:mdc_finish_enqueue()) Attempting to enlarge req ffff880010aa67f0 with body ffff8800647f79f8
02000000:00100000:0.0:1369322005.401615:0:7254:0:(sec_null.c:266:null_enlarge_reqbuf()) Moved message ffff88002415ddf0 to ffff88004c1efbf0
...
00000100:00020000:4.0:1369322005.402571:0:7256:0:(pack_generic.c:1032:lustre_msg_get_opc()) incorrect message magic: a0b04b5e(msg:ffff88002415ddf0)

Unfortunately, such replacement happens with no locking whatsoever, and the open msg is already on the replay list. It is only a matter of (bad) luck for the race to happen: while one thread is trying to print req->rq_reqmsg, the message gets moved to another location and the old buffer is first overwritten and then freed, leading to an assertion like the one we got here.

It seems that we need to do this reallocation BEFORE we add the request to the replay list, though the comment at the point where the request is added to the list (which I remember adding some time ago, before the enlargement code existed) advises adding it as early as possible for recovery reasons.
I think it is still much easier to move the list addition past the request extension and widen that race a tiny bit (it would probably still be very unrealistic to hit) than to add locking around every req->rq_reqmsg access.
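A rough sketch of that ordering (field and function names approximate to the 2.4-era client code; this is not the actual patch from change 6467) would be along these lines:

/*
 * Sketch of the proposed ordering: any reallocation that can replace
 * req->rq_reqmsg (here via sptlrpc_cli_enlarge_reqbuf()) happens before
 * the request becomes visible on the import's replay list, since list
 * walkers such as ptlrpc_free_committed() dereference rq_reqmsg without
 * additional locking.
 */
static int finish_enqueue_sketch(struct obd_import *imp,
                                 struct ptlrpc_request *req,
                                 int segment, int newsize)
{
        int rc;

        /* 1. Enlarge first: rq_reqmsg may move to a new buffer here. */
        rc = sptlrpc_cli_enlarge_reqbuf(req, segment, newsize);
        if (rc != 0)
                return rc;

        /* 2. Only now publish the request for replay; anyone scanning the
         *    list from this point on sees the final rq_reqmsg pointer. */
        spin_lock(&imp->imp_lock);
        list_add_tail(&req->rq_replay_list, &imp->imp_replay_list);
        spin_unlock(&imp->imp_lock);

        return 0;
}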

Comment by Oleg Drokin [ 27/May/13 ]

Tentative patch in http://review.whamcloud.com/6467

Comment by Cliff White (Inactive) [ 21/Apr/14 ]

I seem to have hit this on Hyperion:

LustreError: 9374:0:(pack_generic.c:487:lustre_msg_buf()) ASSERTION( 0 ) failed: incorrect message magic: 0000000d(msg:ffff881054aab0f8)
LustreError: 9374:0:(pack_generic.c:487:lustre_msg_buf()) LBUG
Pid: 9374, comm: ior

Call Trace:
 [<ffffffffa043d895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa043de97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa07cc5aa>] lustre_msg_buf+0x4a/0x60 [ptlrpc]
 [<ffffffffa07f894d>] _sptlrpc_enlarge_msg_inplace+0x5d/0x1b0 [ptlrpc]
 [<ffffffffa08086d5>] null_enlarge_reqbuf+0xd5/0x200 [ptlrpc]
 [<ffffffffa07f607c>] sptlrpc_cli_enlarge_reqbuf+0x5c/0x160 [ptlrpc]
 [<ffffffffa098ece6>] mdc_finish_enqueue+0xb26/0x1190 [mdc]
 [<ffffffffa07d18a0>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
 [<ffffffffa0990860>] mdc_enqueue+0xfc0/0x1b00 [mdc]
 [<ffffffffa099159e>] mdc_intent_lock+0x1fe/0x63f [mdc]
 [<ffffffffa0a76020>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
 [<ffffffffa07aaac0>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
 [<ffffffffa093b6aa>] ? lmv_fid_alloc+0x24a/0x3c0 [lmv]
 [<ffffffffa0956165>] lmv_intent_open+0x325/0x9c0 [lmv]
 [<ffffffffa0a76020>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
 [<ffffffffa0956abb>] lmv_intent_lock+0x2bb/0x380 [lmv]
 [<ffffffffa0a76020>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
 [<ffffffffa0a57b18>] ? ll_prep_md_op_data+0x1a8/0x490 [lustre]
 [<ffffffffa0a7789a>] ll_lookup_it+0x26a/0xad0 [lustre]
 [<ffffffffa0a76020>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
 [<ffffffffa0a7818e>] ll_lookup_nd+0x8e/0x5d0 [lustre]
 [<ffffffff811965c2>] __lookup_hash+0x102/0x160
 [<ffffffff81196cea>] lookup_hash+0x3a/0x50
 [<ffffffff8119b7fe>] do_filp_open+0x2de/0xd20
 [<ffffffff81196a76>] ? final_putname+0x26/0x50
 [<ffffffff8114fd1b>] ? __vm_enough_memory+0x3b/0x190
 [<ffffffff811a82e2>] ? alloc_fd+0x92/0x160
 [<ffffffff81185df9>] do_sys_open+0x69/0x140
 [<ffffffff81185f10>] sys_open+0x20/0x30
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Comment by Cliff White (Inactive) [ 21/Apr/14 ]

Log from the LBUG

Comment by Oleg Drokin [ 24/Apr/14 ]

The landed patch from change 6467 has been reverted due to a bug it introduced with sec_null types.
Instead, I resurrected patch set #1 from there as http://review.whamcloud.com/10074, which does not have the bug.

Comment by Jodi Levi (Inactive) [ 29/Apr/14 ]

Cliff,
Can you confirm this fix and close this ticket if appropriate?
Thank you!

Comment by Jodi Levi (Inactive) [ 01/May/14 ]

Reversal of patch completed.
