  LU-3333

lustre_msg_get_opc()) incorrect message magic: a0b03b5 LBUG

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.2
    • Affects Version/s: Lustre 2.3.0, Lustre 2.4.0, Lustre 2.1.5, Lustre 1.8.9
    • None
    • 3
    • 8241

    Description

      I was suspicious when I hit this for the first time, but I hit it once more, so I am filing a bug.

      The first time was in sanity test 182:

      [78415.059119] LustreError: 17692:0:(mdt_recovery.c:390:mdt_last_rcvd_update()) Trying to overwrite bigger transno:on-disk: 25769825268, new: 25769825267 replay: 0. see LU-617.
      [78415.061562] LustreError: 17692:0:(mdt_recovery.c:390:mdt_last_rcvd_update()) Skipped 8 previous similar messages
      [78417.103412] LustreError: 31938:0:(pack_generic.c:1032:lustre_msg_get_opc()) incorrect message magic: a0b03b5e(msg:ffff8800aba18df0)
      [78417.103916] LustreError: 31938:0:(pack_generic.c:1033:lustre_msg_get_opc()) LBUG
      [78417.104332] Pid: 31938, comm: createmany
      [78417.104542] 
      [78417.104543] Call Trace:
      [78417.104925]  [<ffffffffa0b028a5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [78417.105213]  [<ffffffffa0b02ea7>] lbug_with_loc+0x47/0xb0 [libcfs]
      [78417.105490]  [<ffffffffa0d5a4ba>] lustre_msg_get_opc+0x8a/0x110 [ptlrpc]
      [78417.105778]  [<ffffffffa0d5d97b>] _debug_req+0x57b/0x680 [ptlrpc]
      [78417.106087]  [<ffffffff81047fde>] ? kernel_map_pages+0xfe/0x110
      [78417.106345]  [<ffffffffa0d4965e>] ptlrpc_free_committed+0x20e/0x620 [ptlrpc]
      [78417.106619]  [<ffffffffa0d4b313>] after_reply+0x7a3/0xd90 [ptlrpc]
      [78417.106878]  [<ffffffffa0d502c3>] ptlrpc_check_set+0x1093/0x1da0 [ptlrpc]
      [78417.107137]  [<ffffffff8107eca0>] ? process_timeout+0x0/0x10
      [78417.107389]  [<ffffffffa0d512fa>] ptlrpc_set_wait+0x32a/0x880 [ptlrpc]
      [78417.107639]  [<ffffffff8105ad10>] ? default_wake_function+0x0/0x20
      [78417.107907]  [<ffffffffa0d5b4c6>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
      [78417.108187]  [<ffffffffa0d518cf>] ptlrpc_queue_wait+0x7f/0x220 [ptlrpc]
      [78417.108453]  [<ffffffffa0d2d9c5>] ldlm_cli_enqueue+0x365/0x770 [ptlrpc]
      [78417.119499]  [<ffffffff814fc8bc>] ? __mutex_lock_slowpath+0x21c/0x2c0
      [78417.119779]  [<ffffffffa0d326e0>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc]
      [78417.120071]  [<ffffffffa0f389e0>] ? ll_md_blocking_ast+0x0/0x750 [lustre]
      [78417.120334]  [<ffffffffa0415f55>] mdc_enqueue+0x795/0x18c0 [mdc]
      [78417.120583]  [<ffffffffa041727d>] mdc_intent_lock+0x1fd/0x5b4 [mdc]
      [78417.120847]  [<ffffffffa0f389e0>] ? ll_md_blocking_ast+0x0/0x750 [lustre]
      [78417.121427]  [<ffffffffa0d326e0>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc]
      [78417.121688]  [<ffffffffa03c8667>] ? lmv_fid_alloc+0x117/0x4b0 [lmv]
      [78417.121952]  [<ffffffffa03e02d6>] lmv_intent_open+0x1f6/0x8c0 [lmv]
      [78417.122206]  [<ffffffffa0f389e0>] ? ll_md_blocking_ast+0x0/0x750 [lustre]
      [78417.122464]  [<ffffffff811661ca>] ? cache_alloc_debugcheck_after+0x14a/0x210
      [78417.122720]  [<ffffffff81168671>] ? __kmalloc+0x191/0x2a0
      [78417.122958]  [<ffffffffa03e0c4b>] lmv_intent_lock+0x2ab/0x370 [lmv]
      [78417.123211]  [<ffffffffa0f389e0>] ? ll_md_blocking_ast+0x0/0x750 [lustre]
      [78417.123471]  [<ffffffffa0f37a9e>] ? ll_i2gids+0x2e/0xd0 [lustre]
      [78417.123718]  [<ffffffffa0f1e4ba>] ? ll_prep_md_op_data+0xfa/0x3a0 [lustre]
      [78417.123983]  [<ffffffffa0f3d014>] ll_lookup_it+0x3a4/0xbf0 [lustre]
      [78417.124273]  [<ffffffffa0f389e0>] ? ll_md_blocking_ast+0x0/0x750 [lustre]
      [78417.124539]  [<ffffffffa0f3d8e9>] ll_lookup_nd+0x89/0x3b0 [lustre]
      [78417.124784]  [<ffffffff8119b456>] ? d_alloc+0x166/0x1d0
      [78417.125087]  [<ffffffff8118e8e2>] __lookup_hash+0x102/0x160
      [78417.125322]  [<ffffffff8118ef8a>] lookup_hash+0x3a/0x50
      [78417.125550]  [<ffffffff81192edb>] do_filp_open+0x2eb/0xe00
      [78417.125790]  [<ffffffff814fae88>] ? thread_return+0x4e/0x776
      [78417.126034]  [<ffffffff8118f37b>] ? getname+0x3b/0x250
      [78417.126259]  [<ffffffff8119fd7b>] ? alloc_fd+0xab/0x160
      [78417.126490]  [<ffffffff8117e399>] do_sys_open+0x69/0x140
      [78417.126717]  [<ffffffff8100c575>] ? math_state_restore+0x45/0x60
      [78417.126965]  [<ffffffff8117e4b0>] sys_open+0x20/0x30
      [78417.127190]  [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
      [78417.127428] 
      [78417.169930] Kernel panic - not syncing: LBUG
      

      Crashdumps and modules for this one are in:
      /exports/crashdumps/192.168.10.224-2013-05-09-20\:43\:05

      The other crash was like this:

      [79955.565718] Lustre: DEBUG MARKER: == sanity test 71: Running dbench on lustre (don't segment fault) ====== 21:01:15 (1368493275)
      [80393.197212] LustreError: 25536:0:(pack_generic.c:1032:lustre_msg_get_opc()) incorrect message magic: a0b03b5e(msg:ffff8800464ccdf0)
      [80393.197711] LustreError: 25536:0:(pack_generic.c:1033:lustre_msg_get_opc()) LBUG
      [80393.209355] Pid: 25536, comm: dbench
      [80393.209559] 
      [80393.209559] Call Trace:
      [80393.209964]  [<ffffffffa0b028a5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [80393.210228]  [<ffffffffa0b02ea7>] lbug_with_loc+0x47/0xb0 [libcfs]
      [80393.210502]  [<ffffffffa0d5a4ba>] lustre_msg_get_opc+0x8a/0x110 [ptlrpc]
      [80393.212853]  [<ffffffffa0d5d97b>] _debug_req+0x57b/0x680 [ptlrpc]
      [80393.213149]  [<ffffffff81047fde>] ? kernel_map_pages+0xfe/0x110
      [80393.213412]  [<ffffffffa0d4965e>] ptlrpc_free_committed+0x20e/0x620 [ptlrpc]
      [80393.213689]  [<ffffffffa0d4b313>] after_reply+0x7a3/0xd90 [ptlrpc]
      [80393.213967]  [<ffffffffa0d502c3>] ptlrpc_check_set+0x1093/0x1da0 [ptlrpc]
      [80393.214225]  [<ffffffff8107eca0>] ? process_timeout+0x0/0x10
      [80393.214477]  [<ffffffffa0d512fa>] ptlrpc_set_wait+0x32a/0x880 [ptlrpc]
      [80393.214733]  [<ffffffff8105ad10>] ? default_wake_function+0x0/0x20
      [80393.214995]  [<ffffffffa0d5b4c6>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
      [80393.215271]  [<ffffffffa0d518cf>] ptlrpc_queue_wait+0x7f/0x220 [ptlrpc]
      [80393.215526]  [<ffffffffa0401f59>] mdc_sync+0xb9/0x230 [mdc]
      [80393.215773]  [<ffffffffa03d576e>] lmv_sync+0x2ae/0x550 [lmv]
      [80393.216022]  [<ffffffffa0f0a6ab>] ll_fsync+0x1ab/0x860 [lustre]
      [80393.216290]  [<ffffffff811b2401>] vfs_fsync_range+0xa1/0xe0
      [80393.216525]  [<ffffffff811b24ad>] vfs_fsync+0x1d/0x20
      [80393.216777]  [<ffffffff811b24ee>] do_fsync+0x3e/0x60
      [80393.217003]  [<ffffffff811b2540>] sys_fsync+0x10/0x20
      [80393.217230]  [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
      [80393.217471] 
      [80393.228801] Kernel panic - not syncing: LBUG
      

      /exports/crashdumps/192.168.10.224-2013-05-13-21\:08\:35/vmcore, same modules as before.

      Attachments

      Issue Links

      Activity

            [LU-3333] lustre_msg_get_opc()) incorrect message magic: a0b03b5 LBUG

            jlevi Jodi Levi (Inactive) added a comment -

            Reversal of patch completed.

            jlevi Jodi Levi (Inactive) added a comment -

            Cliff,
            Can you confirm this fix and close this ticket if appropriate?
            Thank you!
            green Oleg Drokin added a comment -

            The landed patch from change 6467 has been reverted due to a bug it introduced with sec_null types.
            Instead, I resurrected patchset #1 from there as http://review.whamcloud.com/10074, which does not have the bug.


            cliffw Cliff White (Inactive) added a comment -

            Log from the LBUG

            cliffw Cliff White (Inactive) added a comment -

            I seem to have hit this on Hyperion:

            LustreError: 9374:0:(pack_generic.c:487:lustre_msg_buf()) ASSERTION( 0 ) failed: incorrect message magic: 0000000d(msg:ffff881054aab0f8)
            LustreError: 9374:0:(pack_generic.c:487:lustre_msg_buf()) LBUG
            Pid: 9374, comm: ior
            
            Call Trace:
             [<ffffffffa043d895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
             [<ffffffffa043de97>] lbug_with_loc+0x47/0xb0 [libcfs]
             [<ffffffffa07cc5aa>] lustre_msg_buf+0x4a/0x60 [ptlrpc]
             [<ffffffffa07f894d>] _sptlrpc_enlarge_msg_inplace+0x5d/0x1b0 [ptlrpc]
             [<ffffffffa08086d5>] null_enlarge_reqbuf+0xd5/0x200 [ptlrpc]
             [<ffffffffa07f607c>] sptlrpc_cli_enlarge_reqbuf+0x5c/0x160 [ptlrpc]
             [<ffffffffa098ece6>] mdc_finish_enqueue+0xb26/0x1190 [mdc]
             [<ffffffffa07d18a0>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
             [<ffffffffa0990860>] mdc_enqueue+0xfc0/0x1b00 [mdc]
             [<ffffffffa099159e>] mdc_intent_lock+0x1fe/0x63f [mdc]
             [<ffffffffa0a76020>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
             [<ffffffffa07aaac0>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
             [<ffffffffa093b6aa>] ? lmv_fid_alloc+0x24a/0x3c0 [lmv]
             [<ffffffffa0956165>] lmv_intent_open+0x325/0x9c0 [lmv]
             [<ffffffffa0a76020>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
             [<ffffffffa0956abb>] lmv_intent_lock+0x2bb/0x380 [lmv]
             [<ffffffffa0a76020>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
             [<ffffffffa0a57b18>] ? ll_prep_md_op_data+0x1a8/0x490 [lustre]
             [<ffffffffa0a7789a>] ll_lookup_it+0x26a/0xad0 [lustre]
             [<ffffffffa0a76020>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
             [<ffffffffa0a7818e>] ll_lookup_nd+0x8e/0x5d0 [lustre]
             [<ffffffff811965c2>] __lookup_hash+0x102/0x160
             [<ffffffff81196cea>] lookup_hash+0x3a/0x50
             [<ffffffff8119b7fe>] do_filp_open+0x2de/0xd20
             [<ffffffff81196a76>] ? final_putname+0x26/0x50
             [<ffffffff8114fd1b>] ? __vm_enough_memory+0x3b/0x190
             [<ffffffff811a82e2>] ? alloc_fd+0x92/0x160
             [<ffffffff81185df9>] do_sys_open+0x69/0x140
             [<ffffffff81185f10>] sys_open+0x20/0x30
             [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
            
            green Oleg Drokin added a comment -

            Tentative patch in http://review.whamcloud.com/6467
            green Oleg Drokin added a comment -

            My hunch was right, it seems.

            00000002:00100000:0.0:1369322005.401613:0:7254:0:(mdc_locks.c:635:mdc_finish_enqueue()) Attempting to enlarge req ffff880010aa67f0 with body ffff8800647f79f8
            02000000:00100000:0.0:1369322005.401615:0:7254:0:(sec_null.c:266:null_enlarge_reqbuf()) Moved message ffff88002415ddf0 to ffff88004c1efbf0
            ...
            00000100:00020000:4.0:1369322005.402571:0:7256:0:(pack_generic.c:1032:lustre_msg_get_opc()) incorrect message magic: a0b04b5e(msg:ffff88002415ddf0)
            

            Unfortunately this replacement happens with no locking whatsoever, and the open message is already on the replay list. It is only a matter of (bad) luck for the race to hit: while one thread is trying to print req->rq_reqmsg, the message gets moved to another place and the old buffer is first overwritten and then freed, leading to the assertion we see here.

            It seems we need to do this reallocation BEFORE we add the request to the replay list, though the comment at the list-addition site (which I remember adding some time ago, before this enlargement code existed) stresses the urgency of adding the request early for recovery reasons.
            I think it is still much easier to move the list addition past the request extension and widen that window a tiny bit (the race would probably still be very unrealistic to hit) than to add locking around every req->rq_reqmsg access.
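            A minimal sketch of that reordering in plain C follows. It is not the actual Lustre code; all names are hypothetical stand-ins (enlarge_reqbuf() for the sec layer's sptlrpc_cli_enlarge_reqbuf()/null_enlarge_reqbuf() seen in the traces above, add_to_replay_list() for the replay-list insertion), and it only illustrates why enlarging before publishing closes the race.

            /*
             * Sketch only: reallocate the request buffer while the request is
             * still private to the enqueuing thread, and only then publish it
             * where other threads may dereference rq_reqmsg.
             */
            #include <stdlib.h>
            #include <string.h>

            struct request {
                    void *rq_reqmsg;    /* may be reallocated when enlarged */
            };

            /* Stand-in for null_enlarge_reqbuf(): moves the message into a
             * bigger buffer and frees the old one -- exactly the step that
             * invalidates a stale pointer held by another thread. */
            static int enlarge_reqbuf(struct request *req, size_t oldsize,
                                      size_t newsize)
            {
                    void *newmsg = malloc(newsize);

                    if (newmsg == NULL)
                            return -1;
                    memcpy(newmsg, req->rq_reqmsg, oldsize);
                    free(req->rq_reqmsg);
                    req->rq_reqmsg = newmsg;
                    return 0;
            }

            /* Stand-in for adding the open request to the replay list, i.e.
             * the point where other threads (ptlrpc_free_committed() in the
             * traces) start looking at req->rq_reqmsg. */
            static void add_to_replay_list(struct request *req)
            {
                    (void)req;
            }

            /* Racy order (the bug): add_to_replay_list(); enlarge_reqbuf();
             * Proposed order: enlarge first, publish second. */
            int finish_enqueue_sketch(struct request *req, size_t oldsize,
                                      size_t newsize)
            {
                    int rc = enlarge_reqbuf(req, oldsize, newsize);

                    if (rc != 0)
                            return rc;
                    add_to_replay_list(req);
                    return 0;
            }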

            green Oleg Drokin added a comment -

            I still hit this fairly frequently and caught one instance live today.

            I found the request that caused the LBUG, but in the request the rq_reqmsg does NOT match the lustre_msg pointer printed in the assertion.
            The lustre_msg pointer seems to point at some sort of structure that has an mfd handle embedded in it further on (reply buffer content? certainly sounds plausible); how that happened is totally unclear to me.
            The "lustre_msg" magic obviously does not match a proper one, but it also does not match what is in the assertion message: instead of the long number printed there (which varies from recompile to recompile, but is the same across crashes of the same module build), the in-memory value is always 0x2341.

            I wonder if it's an artifact of rq_msg replacement when enlarging.
            Added some debug and going to rerun.


            adilger Andreas Dilger added a comment -

            Separate from fixing the root cause of this issue, it doesn't make sense to LBUG() on anything coming off the network. At worst this should be turned into an error.
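            A minimal sketch of that suggestion, again not the real pack_generic.c code: the struct layout, constant name, and flat opcode field below are simplified assumptions. Only the default case matters -- log an error and return instead of calling LBUG(), leaving the caller to reject the request.

            #include <stdio.h>
            #include <stdint.h>

            /* Simplified stand-in for struct lustre_msg; in the real code the
             * opcode lives in the ptlrpc body inside the message buffer. */
            #define MSG_MAGIC_SKETCH 0x0BD00BD3

            struct lustre_msg_sketch {
                    uint32_t lm_magic;
                    uint32_t lm_opc;
            };

            uint32_t msg_get_opc_sketch(const struct lustre_msg_sketch *msg)
            {
                    switch (msg->lm_magic) {
                    case MSG_MAGIC_SKETCH:
                            return msg->lm_opc;
                    default:
                            /* Was: CERROR(...); LBUG();  A corrupted or stale
                             * message off the wire now only produces an error
                             * and a zero opcode the caller can reject. */
                            fprintf(stderr,
                                    "incorrect message magic: %x (msg: %p)\n",
                                    msg->lm_magic, (const void *)msg);
                            return 0;
                    }
            }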

            People

              Assignee: green Oleg Drokin
              Reporter: green Oleg Drokin
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:
                Resolved: