  LU-3333

lustre_msg_get_opc()) incorrect message magic: a0b03b5 LBUG

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.2
    • Affects Version/s: Lustre 2.3.0, Lustre 2.4.0, Lustre 2.1.5, Lustre 1.8.9
    • None
    • 3
    • 8241

    Description

      I was suspicious when I hit this for the first time, but I hit it once more, so I am filing a bug.

      The first time was in sanity test 182:

      [78415.059119] LustreError: 17692:0:(mdt_recovery.c:390:mdt_last_rcvd_update()) Trying to overwrite bigger transno:on-disk: 25769825268, new: 25769825267 replay: 0. see LU-617.
      [78415.061562] LustreError: 17692:0:(mdt_recovery.c:390:mdt_last_rcvd_update()) Skipped 8 previous similar messages
      [78417.103412] LustreError: 31938:0:(pack_generic.c:1032:lustre_msg_get_opc()) incorrect message magic: a0b03b5e(msg:ffff8800aba18df0)
      [78417.103916] LustreError: 31938:0:(pack_generic.c:1033:lustre_msg_get_opc()) LBUG
      [78417.104332] Pid: 31938, comm: createmany
      [78417.104542] 
      [78417.104543] Call Trace:
      [78417.104925]  [<ffffffffa0b028a5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [78417.105213]  [<ffffffffa0b02ea7>] lbug_with_loc+0x47/0xb0 [libcfs]
      [78417.105490]  [<ffffffffa0d5a4ba>] lustre_msg_get_opc+0x8a/0x110 [ptlrpc]
      [78417.105778]  [<ffffffffa0d5d97b>] _debug_req+0x57b/0x680 [ptlrpc]
      [78417.106087]  [<ffffffff81047fde>] ? kernel_map_pages+0xfe/0x110
      [78417.106345]  [<ffffffffa0d4965e>] ptlrpc_free_committed+0x20e/0x620 [ptlrpc]
      [78417.106619]  [<ffffffffa0d4b313>] after_reply+0x7a3/0xd90 [ptlrpc]
      [78417.106878]  [<ffffffffa0d502c3>] ptlrpc_check_set+0x1093/0x1da0 [ptlrpc]
      [78417.107137]  [<ffffffff8107eca0>] ? process_timeout+0x0/0x10
      [78417.107389]  [<ffffffffa0d512fa>] ptlrpc_set_wait+0x32a/0x880 [ptlrpc]
      [78417.107639]  [<ffffffff8105ad10>] ? default_wake_function+0x0/0x20
      [78417.107907]  [<ffffffffa0d5b4c6>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
      [78417.108187]  [<ffffffffa0d518cf>] ptlrpc_queue_wait+0x7f/0x220 [ptlrpc]
      [78417.108453]  [<ffffffffa0d2d9c5>] ldlm_cli_enqueue+0x365/0x770 [ptlrpc]
      [78417.119499]  [<ffffffff814fc8bc>] ? __mutex_lock_slowpath+0x21c/0x2c0
      [78417.119779]  [<ffffffffa0d326e0>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc]
      [78417.120071]  [<ffffffffa0f389e0>] ? ll_md_blocking_ast+0x0/0x750 [lustre]
      [78417.120334]  [<ffffffffa0415f55>] mdc_enqueue+0x795/0x18c0 [mdc]
      [78417.120583]  [<ffffffffa041727d>] mdc_intent_lock+0x1fd/0x5b4 [mdc]
      [78417.120847]  [<ffffffffa0f389e0>] ? ll_md_blocking_ast+0x0/0x750 [lustre]
      [78417.121427]  [<ffffffffa0d326e0>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc]
      [78417.121688]  [<ffffffffa03c8667>] ? lmv_fid_alloc+0x117/0x4b0 [lmv]
      [78417.121952]  [<ffffffffa03e02d6>] lmv_intent_open+0x1f6/0x8c0 [lmv]
      [78417.122206]  [<ffffffffa0f389e0>] ? ll_md_blocking_ast+0x0/0x750 [lustre]
      [78417.122464]  [<ffffffff811661ca>] ? cache_alloc_debugcheck_after+0x14a/0x210
      [78417.122720]  [<ffffffff81168671>] ? __kmalloc+0x191/0x2a0
      [78417.122958]  [<ffffffffa03e0c4b>] lmv_intent_lock+0x2ab/0x370 [lmv]
      [78417.123211]  [<ffffffffa0f389e0>] ? ll_md_blocking_ast+0x0/0x750 [lustre]
      [78417.123471]  [<ffffffffa0f37a9e>] ? ll_i2gids+0x2e/0xd0 [lustre]
      [78417.123718]  [<ffffffffa0f1e4ba>] ? ll_prep_md_op_data+0xfa/0x3a0 [lustre]
      [78417.123983]  [<ffffffffa0f3d014>] ll_lookup_it+0x3a4/0xbf0 [lustre]
      [78417.124273]  [<ffffffffa0f389e0>] ? ll_md_blocking_ast+0x0/0x750 [lustre]
      [78417.124539]  [<ffffffffa0f3d8e9>] ll_lookup_nd+0x89/0x3b0 [lustre]
      [78417.124784]  [<ffffffff8119b456>] ? d_alloc+0x166/0x1d0
      [78417.125087]  [<ffffffff8118e8e2>] __lookup_hash+0x102/0x160
      [78417.125322]  [<ffffffff8118ef8a>] lookup_hash+0x3a/0x50
      [78417.125550]  [<ffffffff81192edb>] do_filp_open+0x2eb/0xe00
      [78417.125790]  [<ffffffff814fae88>] ? thread_return+0x4e/0x776
      [78417.126034]  [<ffffffff8118f37b>] ? getname+0x3b/0x250
      [78417.126259]  [<ffffffff8119fd7b>] ? alloc_fd+0xab/0x160
      [78417.126490]  [<ffffffff8117e399>] do_sys_open+0x69/0x140
      [78417.126717]  [<ffffffff8100c575>] ? math_state_restore+0x45/0x60
      [78417.126965]  [<ffffffff8117e4b0>] sys_open+0x20/0x30
      [78417.127190]  [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
      [78417.127428] 
      [78417.169930] Kernel panic - not syncing: LBUG
      

      Crashdumps and modules for this one are in:
      /exports/crashdumps/192.168.10.224-2013-05-09-20\:43\:05

      The other crash was like this:

      [79955.565718] Lustre: DEBUG MARKER: == sanity test 71: Running dbench on lustre (don't segment fault) ====== 21:01:15 (1368493275)
      [80393.197212] LustreError: 25536:0:(pack_generic.c:1032:lustre_msg_get_opc()) incorrect message magic: a0b03b5e(msg:ffff8800464ccdf0)
      [80393.197711] LustreError: 25536:0:(pack_generic.c:1033:lustre_msg_get_opc()) LBUG
      [80393.209355] Pid: 25536, comm: dbench
      [80393.209559] 
      [80393.209559] Call Trace:
      [80393.209964]  [<ffffffffa0b028a5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [80393.210228]  [<ffffffffa0b02ea7>] lbug_with_loc+0x47/0xb0 [libcfs]
      [80393.210502]  [<ffffffffa0d5a4ba>] lustre_msg_get_opc+0x8a/0x110 [ptlrpc]
      [80393.212853]  [<ffffffffa0d5d97b>] _debug_req+0x57b/0x680 [ptlrpc]
      [80393.213149]  [<ffffffff81047fde>] ? kernel_map_pages+0xfe/0x110
      [80393.213412]  [<ffffffffa0d4965e>] ptlrpc_free_committed+0x20e/0x620 [ptlrpc]
      [80393.213689]  [<ffffffffa0d4b313>] after_reply+0x7a3/0xd90 [ptlrpc]
      [80393.213967]  [<ffffffffa0d502c3>] ptlrpc_check_set+0x1093/0x1da0 [ptlrpc]
      [80393.214225]  [<ffffffff8107eca0>] ? process_timeout+0x0/0x10
      [80393.214477]  [<ffffffffa0d512fa>] ptlrpc_set_wait+0x32a/0x880 [ptlrpc]
      [80393.214733]  [<ffffffff8105ad10>] ? default_wake_function+0x0/0x20
      [80393.214995]  [<ffffffffa0d5b4c6>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
      [80393.215271]  [<ffffffffa0d518cf>] ptlrpc_queue_wait+0x7f/0x220 [ptlrpc]
      [80393.215526]  [<ffffffffa0401f59>] mdc_sync+0xb9/0x230 [mdc]
      [80393.215773]  [<ffffffffa03d576e>] lmv_sync+0x2ae/0x550 [lmv]
      [80393.216022]  [<ffffffffa0f0a6ab>] ll_fsync+0x1ab/0x860 [lustre]
      [80393.216290]  [<ffffffff811b2401>] vfs_fsync_range+0xa1/0xe0
      [80393.216525]  [<ffffffff811b24ad>] vfs_fsync+0x1d/0x20
      [80393.216777]  [<ffffffff811b24ee>] do_fsync+0x3e/0x60
      [80393.217003]  [<ffffffff811b2540>] sys_fsync+0x10/0x20
      [80393.217230]  [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
      [80393.217471] 
      [80393.228801] Kernel panic - not syncing: LBUG
      

      /exports/crashdumps/192.168.10.224-2013-05-13-21\:08\:35/vmcore, same modules as before.

      Attachments

      Issue Links

      Activity

            [LU-3333] lustre_msg_get_opc()) incorrect message magic: a0b03b5 LBUG

            jlevi Jodi Levi (Inactive) added a comment -

            Reversal of patch completed.

            jlevi Jodi Levi (Inactive) added a comment -

            Cliff,
            Can you confirm this fix and close this ticket if appropriate?
            Thank you!
            green Oleg Drokin added a comment -

            The landed patch from change 6467 has been reverted due to a bug it introduced with sec_null types.
            Instead, I resurrected patchset #1 from there as http://review.whamcloud.com/10074, which does not have the bug.


            cliffw Cliff White (Inactive) added a comment -

            Log from the LBUG

            cliffw Cliff White (Inactive) added a comment -

            I seem to have hit this on Hyperion:

            LustreError: 9374:0:(pack_generic.c:487:lustre_msg_buf()) ASSERTION( 0 ) failed: incorrect message magic: 0000000d(msg:ffff881054aab0f8)
            LustreError: 9374:0:(pack_generic.c:487:lustre_msg_buf()) LBUG
            Pid: 9374, comm: ior
            
            Call Trace:
             [<ffffffffa043d895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
             [<ffffffffa043de97>] lbug_with_loc+0x47/0xb0 [libcfs]
             [<ffffffffa07cc5aa>] lustre_msg_buf+0x4a/0x60 [ptlrpc]
             [<ffffffffa07f894d>] _sptlrpc_enlarge_msg_inplace+0x5d/0x1b0 [ptlrpc]
             [<ffffffffa08086d5>] null_enlarge_reqbuf+0xd5/0x200 [ptlrpc]
             [<ffffffffa07f607c>] sptlrpc_cli_enlarge_reqbuf+0x5c/0x160 [ptlrpc]
             [<ffffffffa098ece6>] mdc_finish_enqueue+0xb26/0x1190 [mdc]
             [<ffffffffa07d18a0>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
             [<ffffffffa0990860>] mdc_enqueue+0xfc0/0x1b00 [mdc]
             [<ffffffffa099159e>] mdc_intent_lock+0x1fe/0x63f [mdc]
             [<ffffffffa0a76020>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
             [<ffffffffa07aaac0>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
             [<ffffffffa093b6aa>] ? lmv_fid_alloc+0x24a/0x3c0 [lmv]
             [<ffffffffa0956165>] lmv_intent_open+0x325/0x9c0 [lmv]
             [<ffffffffa0a76020>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
             [<ffffffffa0956abb>] lmv_intent_lock+0x2bb/0x380 [lmv]
             [<ffffffffa0a76020>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
             [<ffffffffa0a57b18>] ? ll_prep_md_op_data+0x1a8/0x490 [lustre]
             [<ffffffffa0a7789a>] ll_lookup_it+0x26a/0xad0 [lustre]
             [<ffffffffa0a76020>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
             [<ffffffffa0a7818e>] ll_lookup_nd+0x8e/0x5d0 [lustre]
             [<ffffffff811965c2>] __lookup_hash+0x102/0x160
             [<ffffffff81196cea>] lookup_hash+0x3a/0x50
             [<ffffffff8119b7fe>] do_filp_open+0x2de/0xd20
             [<ffffffff81196a76>] ? final_putname+0x26/0x50
             [<ffffffff8114fd1b>] ? __vm_enough_memory+0x3b/0x190
             [<ffffffff811a82e2>] ? alloc_fd+0x92/0x160
             [<ffffffff81185df9>] do_sys_open+0x69/0x140
             [<ffffffff81185f10>] sys_open+0x20/0x30
             [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
            
            green Oleg Drokin added a comment -

            Tentative patch in http://review.whamcloud.com/6467
            green Oleg Drokin added a comment -

            My hunch was right, it seems.

            00000002:00100000:0.0:1369322005.401613:0:7254:0:(mdc_locks.c:635:mdc_finish_enqueue()) Attempting to enlarge req ffff880010aa67f0 with body ffff8800647f79f8
            02000000:00100000:0.0:1369322005.401615:0:7254:0:(sec_null.c:266:null_enlarge_reqbuf()) Moved message ffff88002415ddf0 to ffff88004c1efbf0
            ...
            00000100:00020000:4.0:1369322005.402571:0:7256:0:(pack_generic.c:1032:lustre_msg_get_opc()) incorrect message magic: a0b04b5e(msg:ffff88002415ddf0)
            

            Unfortunately this replacement happens with no locking whatsoever, and the open message is already on the replay list. It is only a matter of (bad) luck for the race to hit: while one thread is trying to print req->rq_reqmsg, the message gets moved to another place and the old buffer is first overwritten and then freed, leading to the assertion we see here.

            It seems we need to do this reallocation BEFORE we add the request to the replay list, though the comment at the list-addition site (which I remember adding some time ago, before this enlargement code existed) stresses the urgency of adding the request early for recovery reasons.
            I think it is still much easier to move the list addition past the request extension and widen that window a tiny bit (the race would probably still be very unrealistic to hit) than to add locking around every req->rq_reqmsg access.
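            A minimal sketch of that reordering in plain C follows. It is not the actual Lustre code; all names are hypothetical stand-ins (enlarge_reqbuf() for the sec layer's sptlrpc_cli_enlarge_reqbuf()/null_enlarge_reqbuf() seen in the traces above, add_to_replay_list() for the replay-list insertion), and it only illustrates why enlarging before publishing closes the race.

            /*
             * Sketch only: reallocate the request buffer while the request is
             * still private to the enqueuing thread, and only then publish it
             * where other threads may dereference rq_reqmsg.
             */
            #include <stdlib.h>
            #include <string.h>

            struct request {
                    void *rq_reqmsg;    /* may be reallocated when enlarged */
            };

            /* Stand-in for null_enlarge_reqbuf(): moves the message into a
             * bigger buffer and frees the old one -- exactly the step that
             * invalidates a stale pointer held by another thread. */
            static int enlarge_reqbuf(struct request *req, size_t oldsize,
                                      size_t newsize)
            {
                    void *newmsg = malloc(newsize);

                    if (newmsg == NULL)
                            return -1;
                    memcpy(newmsg, req->rq_reqmsg, oldsize);
                    free(req->rq_reqmsg);
                    req->rq_reqmsg = newmsg;
                    return 0;
            }

            /* Stand-in for adding the open request to the replay list, i.e.
             * the point where other threads (ptlrpc_free_committed() in the
             * traces) start looking at req->rq_reqmsg. */
            static void add_to_replay_list(struct request *req)
            {
                    (void)req;
            }

            /* Racy order (the bug): add_to_replay_list(); enlarge_reqbuf();
             * Proposed order: enlarge first, publish second. */
            int finish_enqueue_sketch(struct request *req, size_t oldsize,
                                      size_t newsize)
            {
                    int rc = enlarge_reqbuf(req, oldsize, newsize);

                    if (rc != 0)
                            return rc;
                    add_to_replay_list(req);
                    return 0;
            }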

            green Oleg Drokin added a comment -

            I still hit this fairly frequently and caught one instance live today.

            I found the request that caused the LBUG, but in the request the rq_reqmsg does NOT match the lustre_msg pointer printed in the assertion.
            The lustre_msg pointer seems to point at some sort of structure that has an mfd handle embedded in it further on (reply buffer content? certainly sounds plausible); how that happened is totally unclear to me.
            The "lustre_msg" magic obviously does not match a proper one, but it also does not match what is in the assertion message: instead of the long number printed there (which varies from recompile to recompile, but is the same across crashes of the same module build), the in-memory value is always 0x2341.

            I wonder if it's an artifact of rq_msg replacement when enlarging.
            Added some debug and going to rerun.


            adilger Andreas Dilger added a comment -

            Separate from fixing the root cause of this issue, it doesn't make sense to LBUG() on anything coming off the network. At worst this should be turned into an error.
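            A minimal sketch of that suggestion, again not the real pack_generic.c code: the struct layout, constant name, and flat opcode field below are simplified assumptions. Only the default case matters -- log an error and return instead of calling LBUG(), leaving the caller to reject the request.

            #include <stdio.h>
            #include <stdint.h>

            /* Simplified stand-in for struct lustre_msg; in the real code the
             * opcode lives in the ptlrpc body inside the message buffer. */
            #define MSG_MAGIC_SKETCH 0x0BD00BD3

            struct lustre_msg_sketch {
                    uint32_t lm_magic;
                    uint32_t lm_opc;
            };

            uint32_t msg_get_opc_sketch(const struct lustre_msg_sketch *msg)
            {
                    switch (msg->lm_magic) {
                    case MSG_MAGIC_SKETCH:
                            return msg->lm_opc;
                    default:
                            /* Was: CERROR(...); LBUG();  A corrupted or stale
                             * message off the wire now only produces an error
                             * and a zero opcode the caller can reject. */
                            fprintf(stderr,
                                    "incorrect message magic: %x (msg: %p)\n",
                                    msg->lm_magic, (const void *)msg);
                            return 0;
                    }
            }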

            People

              Assignee: green Oleg Drokin
              Reporter: green Oleg Drokin
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:
                Resolved: