[LU-3333] lustre_msg_get_opc()) incorrect message magic: a0b03b5 LBUG Created: 14/May/13 Updated: 06/Mar/15 Resolved: 01/May/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0, Lustre 2.4.0, Lustre 2.1.5, Lustre 1.8.9 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.2 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Oleg Drokin | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 8241 |
| Description |
|
I was suspicious when I hit this for the first time, but I hit it once more, so I am filing a bug.

First time was in sanity test 182:

[78415.059119] LustreError: 17692:0:(mdt_recovery.c:390:mdt_last_rcvd_update()) Trying to overwrite bigger transno:on-disk: 25769825268, new: 25769825267 replay: 0. see LU-617.
[78415.061562] LustreError: 17692:0:(mdt_recovery.c:390:mdt_last_rcvd_update()) Skipped 8 previous similar messages
[78417.103412] LustreError: 31938:0:(pack_generic.c:1032:lustre_msg_get_opc()) incorrect message magic: a0b03b5e(msg:ffff8800aba18df0)
[78417.103916] LustreError: 31938:0:(pack_generic.c:1033:lustre_msg_get_opc()) LBUG
[78417.104332] Pid: 31938, comm: createmany
[78417.104542]
[78417.104543] Call Trace:
[78417.104925] [<ffffffffa0b028a5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[78417.105213] [<ffffffffa0b02ea7>] lbug_with_loc+0x47/0xb0 [libcfs]
[78417.105490] [<ffffffffa0d5a4ba>] lustre_msg_get_opc+0x8a/0x110 [ptlrpc]
[78417.105778] [<ffffffffa0d5d97b>] _debug_req+0x57b/0x680 [ptlrpc]
[78417.106087] [<ffffffff81047fde>] ? kernel_map_pages+0xfe/0x110
[78417.106345] [<ffffffffa0d4965e>] ptlrpc_free_committed+0x20e/0x620 [ptlrpc]
[78417.106619] [<ffffffffa0d4b313>] after_reply+0x7a3/0xd90 [ptlrpc]
[78417.106878] [<ffffffffa0d502c3>] ptlrpc_check_set+0x1093/0x1da0 [ptlrpc]
[78417.107137] [<ffffffff8107eca0>] ? process_timeout+0x0/0x10
[78417.107389] [<ffffffffa0d512fa>] ptlrpc_set_wait+0x32a/0x880 [ptlrpc]
[78417.107639] [<ffffffff8105ad10>] ? default_wake_function+0x0/0x20
[78417.107907] [<ffffffffa0d5b4c6>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
[78417.108187] [<ffffffffa0d518cf>] ptlrpc_queue_wait+0x7f/0x220 [ptlrpc]
[78417.108453] [<ffffffffa0d2d9c5>] ldlm_cli_enqueue+0x365/0x770 [ptlrpc]
[78417.119499] [<ffffffff814fc8bc>] ? __mutex_lock_slowpath+0x21c/0x2c0
[78417.119779] [<ffffffffa0d326e0>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc]
[78417.120071] [<ffffffffa0f389e0>] ? ll_md_blocking_ast+0x0/0x750 [lustre]
[78417.120334] [<ffffffffa0415f55>] mdc_enqueue+0x795/0x18c0 [mdc]
[78417.120583] [<ffffffffa041727d>] mdc_intent_lock+0x1fd/0x5b4 [mdc]
[78417.120847] [<ffffffffa0f389e0>] ? ll_md_blocking_ast+0x0/0x750 [lustre]
[78417.121427] [<ffffffffa0d326e0>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc]
[78417.121688] [<ffffffffa03c8667>] ? lmv_fid_alloc+0x117/0x4b0 [lmv]
[78417.121952] [<ffffffffa03e02d6>] lmv_intent_open+0x1f6/0x8c0 [lmv]
[78417.122206] [<ffffffffa0f389e0>] ? ll_md_blocking_ast+0x0/0x750 [lustre]
[78417.122464] [<ffffffff811661ca>] ? cache_alloc_debugcheck_after+0x14a/0x210
[78417.122720] [<ffffffff81168671>] ? __kmalloc+0x191/0x2a0
[78417.122958] [<ffffffffa03e0c4b>] lmv_intent_lock+0x2ab/0x370 [lmv]
[78417.123211] [<ffffffffa0f389e0>] ? ll_md_blocking_ast+0x0/0x750 [lustre]
[78417.123471] [<ffffffffa0f37a9e>] ? ll_i2gids+0x2e/0xd0 [lustre]
[78417.123718] [<ffffffffa0f1e4ba>] ? ll_prep_md_op_data+0xfa/0x3a0 [lustre]
[78417.123983] [<ffffffffa0f3d014>] ll_lookup_it+0x3a4/0xbf0 [lustre]
[78417.124273] [<ffffffffa0f389e0>] ? ll_md_blocking_ast+0x0/0x750 [lustre]
[78417.124539] [<ffffffffa0f3d8e9>] ll_lookup_nd+0x89/0x3b0 [lustre]
[78417.124784] [<ffffffff8119b456>] ? d_alloc+0x166/0x1d0
[78417.125087] [<ffffffff8118e8e2>] __lookup_hash+0x102/0x160
[78417.125322] [<ffffffff8118ef8a>] lookup_hash+0x3a/0x50
[78417.125550] [<ffffffff81192edb>] do_filp_open+0x2eb/0xe00
[78417.125790] [<ffffffff814fae88>] ? thread_return+0x4e/0x776
[78417.126034] [<ffffffff8118f37b>] ? getname+0x3b/0x250
[78417.126259] [<ffffffff8119fd7b>] ? alloc_fd+0xab/0x160
[78417.126490] [<ffffffff8117e399>] do_sys_open+0x69/0x140
[78417.126717] [<ffffffff8100c575>] ? math_state_restore+0x45/0x60
[78417.126965] [<ffffffff8117e4b0>] sys_open+0x20/0x30
[78417.127190] [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
[78417.127428]
[78417.169930] Kernel panic - not syncing: LBUG

Crashdumps and modules for this one are in:

The other crash was like this:

[79955.565718] Lustre: DEBUG MARKER: == sanity test 71: Running dbench on lustre (don't segment fault) ====== 21:01:15 (1368493275)
[80393.197212] LustreError: 25536:0:(pack_generic.c:1032:lustre_msg_get_opc()) incorrect message magic: a0b03b5e(msg:ffff8800464ccdf0)
[80393.197711] LustreError: 25536:0:(pack_generic.c:1033:lustre_msg_get_opc()) LBUG
[80393.209355] Pid: 25536, comm: dbench
[80393.209559]
[80393.209559] Call Trace:
[80393.209964] [<ffffffffa0b028a5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[80393.210228] [<ffffffffa0b02ea7>] lbug_with_loc+0x47/0xb0 [libcfs]
[80393.210502] [<ffffffffa0d5a4ba>] lustre_msg_get_opc+0x8a/0x110 [ptlrpc]
[80393.212853] [<ffffffffa0d5d97b>] _debug_req+0x57b/0x680 [ptlrpc]
[80393.213149] [<ffffffff81047fde>] ? kernel_map_pages+0xfe/0x110
[80393.213412] [<ffffffffa0d4965e>] ptlrpc_free_committed+0x20e/0x620 [ptlrpc]
[80393.213689] [<ffffffffa0d4b313>] after_reply+0x7a3/0xd90 [ptlrpc]
[80393.213967] [<ffffffffa0d502c3>] ptlrpc_check_set+0x1093/0x1da0 [ptlrpc]
[80393.214225] [<ffffffff8107eca0>] ? process_timeout+0x0/0x10
[80393.214477] [<ffffffffa0d512fa>] ptlrpc_set_wait+0x32a/0x880 [ptlrpc]
[80393.214733] [<ffffffff8105ad10>] ? default_wake_function+0x0/0x20
[80393.214995] [<ffffffffa0d5b4c6>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
[80393.215271] [<ffffffffa0d518cf>] ptlrpc_queue_wait+0x7f/0x220 [ptlrpc]
[80393.215526] [<ffffffffa0401f59>] mdc_sync+0xb9/0x230 [mdc]
[80393.215773] [<ffffffffa03d576e>] lmv_sync+0x2ae/0x550 [lmv]
[80393.216022] [<ffffffffa0f0a6ab>] ll_fsync+0x1ab/0x860 [lustre]
[80393.216290] [<ffffffff811b2401>] vfs_fsync_range+0xa1/0xe0
[80393.216525] [<ffffffff811b24ad>] vfs_fsync+0x1d/0x20
[80393.216777] [<ffffffff811b24ee>] do_fsync+0x3e/0x60
[80393.217003] [<ffffffff811b2540>] sys_fsync+0x10/0x20
[80393.217230] [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
[80393.217471]
[80393.228801] Kernel panic - not syncing: LBUG

/exports/crashdumps/192.168.10.224-2013-05-13-21\:08\:35/vmcore, same modules as before. |
| Comments |
| Comment by Andreas Dilger [ 14/May/13 ] |
|
Separate from fixing the root cause of this issue, it doesn't make sense to LBUG() on anything coming off of the network. At worst this should be turned into an error. |
| Comment by Oleg Drokin [ 23/May/13 ] |
|
I still hit this fairly frequently and caught one instance live today. I found the request that caused the LBUG, but in the request the rq_reqmsg does NOT match the lustre_msg pointer printed in the assertion. I wonder if it's an artifact of rq_msg replacement when enlarging. |
| Comment by Oleg Drokin [ 23/May/13 ] |
|
My hunch was right, it seems:

00000002:00100000:0.0:1369322005.401613:0:7254:0:(mdc_locks.c:635:mdc_finish_enqueue()) Attempting to enlarge req ffff880010aa67f0 with body ffff8800647f79f8
02000000:00100000:0.0:1369322005.401615:0:7254:0:(sec_null.c:266:null_enlarge_reqbuf()) Moved message ffff88002415ddf0 to ffff88004c1efbf0
...
00000100:00020000:4.0:1369322005.402571:0:7256:0:(pack_generic.c:1032:lustre_msg_get_opc()) incorrect message magic: a0b04b5e(msg:ffff88002415ddf0)

Unfortunately this replacement happens with no locking whatsoever, while the open message is already on the replay list. It only takes (bad) luck for the race to hit: while one thread is trying to print req->rq_reqmsg, the message gets moved to another place and the old buffer is first overwritten and then freed, leading to the assertion we got here. It seems we need to do this reallocation BEFORE we add the request to the replay list, though the comment at the point where the request is added to the list (which I remember adding some time ago, before this enlargement code existed) stresses the urgency of adding it promptly for recovery reasons. |
| Comment by Oleg Drokin [ 27/May/13 ] |
|
Tentative patch in http://review.whamcloud.com/6467 |
| Comment by Cliff White (Inactive) [ 21/Apr/14 ] |
|
I seem to have hit this on Hyperion:

LustreError: 9374:0:(pack_generic.c:487:lustre_msg_buf()) ASSERTION( 0 ) failed: incorrect message magic: 0000000d(msg:ffff881054aab0f8)
LustreError: 9374:0:(pack_generic.c:487:lustre_msg_buf()) LBUG
Pid: 9374, comm: ior

Call Trace:
[<ffffffffa043d895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa043de97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa07cc5aa>] lustre_msg_buf+0x4a/0x60 [ptlrpc]
[<ffffffffa07f894d>] _sptlrpc_enlarge_msg_inplace+0x5d/0x1b0 [ptlrpc]
[<ffffffffa08086d5>] null_enlarge_reqbuf+0xd5/0x200 [ptlrpc]
[<ffffffffa07f607c>] sptlrpc_cli_enlarge_reqbuf+0x5c/0x160 [ptlrpc]
[<ffffffffa098ece6>] mdc_finish_enqueue+0xb26/0x1190 [mdc]
[<ffffffffa07d18a0>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
[<ffffffffa0990860>] mdc_enqueue+0xfc0/0x1b00 [mdc]
[<ffffffffa099159e>] mdc_intent_lock+0x1fe/0x63f [mdc]
[<ffffffffa0a76020>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
[<ffffffffa07aaac0>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
[<ffffffffa093b6aa>] ? lmv_fid_alloc+0x24a/0x3c0 [lmv]
[<ffffffffa0956165>] lmv_intent_open+0x325/0x9c0 [lmv]
[<ffffffffa0a76020>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
[<ffffffffa0956abb>] lmv_intent_lock+0x2bb/0x380 [lmv]
[<ffffffffa0a76020>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
[<ffffffffa0a57b18>] ? ll_prep_md_op_data+0x1a8/0x490 [lustre]
[<ffffffffa0a7789a>] ll_lookup_it+0x26a/0xad0 [lustre]
[<ffffffffa0a76020>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
[<ffffffffa0a7818e>] ll_lookup_nd+0x8e/0x5d0 [lustre]
[<ffffffff811965c2>] __lookup_hash+0x102/0x160
[<ffffffff81196cea>] lookup_hash+0x3a/0x50
[<ffffffff8119b7fe>] do_filp_open+0x2de/0xd20
[<ffffffff81196a76>] ? final_putname+0x26/0x50
[<ffffffff8114fd1b>] ? __vm_enough_memory+0x3b/0x190
[<ffffffff811a82e2>] ? alloc_fd+0x92/0x160
[<ffffffff81185df9>] do_sys_open+0x69/0x140
[<ffffffff81185f10>] sys_open+0x20/0x30
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b |
| Comment by Cliff White (Inactive) [ 21/Apr/14 ] |
|
Log from the LBUG |
| Comment by Oleg Drokin [ 24/Apr/14 ] |
|
The landed patch from change 6467 has been reverted due to a bug it introduced with sec_null types. |
| Comment by Jodi Levi (Inactive) [ 29/Apr/14 ] |
|
Cliff, |
| Comment by Jodi Levi (Inactive) [ 01/May/14 ] |
|
Reversal of patch completed. |