Lustre / LU-539

Small size for RMF_CONNECT_DATA causes out-of-bounds memory crash

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.1.0
    • Affects Version/s: Lustre 2.1.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 4936

    Description

      For interoperability between 1.8 and 2.x, "RMF_CONNECT_DATA" is defined with the size of the smaller structure "obd_connect_data_v1", as follows:

      ============
      struct req_msg_field RMF_CONNECT_DATA =
              DEFINE_MSGF("cdata",
                          RMF_F_NO_SIZE_CHECK /* we allow extra space for interop */,
      #if LUSTRE_VERSION_CODE > OBD_OCD_VERSION(2, 9, 0, 0)
                          sizeof(struct obd_connect_data),
      #else
      /* For interoperability with 1.8 and 2.0 clients/servers.
       * The RPC verification code allows larger RPC buffers, but not
       * smaller buffers. Until we no longer need to keep compatibility
       * with older servers/clients we can only check that the buffer
       * size is at least as large as obd_connect_data_v1. That is not
       * not in itself harmful, since the chance of just corrupting this
       * field is low. See JIRA LU-16 for details. */
                          sizeof(struct obd_connect_data_v1),
      #endif
                          lustre_swab_connect, NULL);
      ============

      But when the server processes the connection in "target_handle_connect()", it treats the field as the larger structure "obd_connect_data". Assigning to members that lie beyond the end of "obd_connect_data_v1" can write past the end of the buffer, corrupt adjacent memory, and cause a crash such as the following:

      ============
      LustreError: 9439:0:(pack_generic.c:800:lustre_msghdr_get_flags()) ASSERTION(0) failed: incorrect message magic: 00000000
      LustreError: 9439:0:(pack_generic.c:800:lustre_msghdr_get_flags()) LBUG
      Pid: 9439, comm: ll_mgs_00

      Call Trace:
      [<00000000f8bff5c0>] libcfs_debug_dumpstack+0x50/0x70 [libcfs]
      [<00000000f8bffd5d>] lbug_with_loc+0x6d/0xd0 [libcfs]
      [<00000000f9d3bbc0>] reply_in_callback+0x0/0x850 [ptlrpc]
      [<00000000f9d32fe2>] lustre_msghdr_get_flags+0x82/0x90 [ptlrpc]
      [<00000000f9d3bf80>] reply_in_callback+0x3c0/0x850 [ptlrpc]
      [<00000000f9069751>] ldiskfs_mark_iloc_dirty+0x341/0x560 [ldiskfs]
      [<00000000f9d3bbc0>] reply_in_callback+0x0/0x850 [ptlrpc]
      [<00000000f9d3a527>] ptlrpc_master_callback+0x47/0xa0 [ptlrpc]
      [<00000000f8c51a0a>] lnet_enq_event_locked+0x5a/0xb0 [lnet]
      [<00000000f8c51ad8>] lnet_finalize+0x78/0x200 [lnet]
      [<00000000f8c60fef>] lolnd_recv+0x5f/0x100 [lnet]
      [<00000000f8c55e09>] lnet_ni_recv+0xf9/0x260 [lnet]
      [<00000000f8c56059>] lnet_recv_put+0xe9/0x130 [lnet]
      [<00000000f8c5c560>] lnet_parse+0x14e0/0x2620 [lnet]
      [<00000000c048ccd6>] dput+0x72/0xed
      [<00000000f8da1baf>] llog_free_handle+0x9f/0x330 [obdclass]
      [<00000000c04906be>] mntput_no_expire+0x11/0x6a
      [<00000000f8cc34f5>] pop_ctxt+0xe5/0x320 [lvfs]
      [<00000000f8db9870>] __llog_ctxt_put+0x20/0x2e0 [obdclass]
      [<00000000f8da3ce2>] llog_close+0x72/0x440 [obdclass]
      [<00000000f8c610d1>] lolnd_send+0x41/0x90 [lnet]
      [<00000000f8c55c9b>] lnet_ni_send+0x4b/0xc0 [lnet]
      [<00000000f8c5804c>] lnet_send+0x1fc/0xd90 [lnet]
      [<00000000f8db9870>] __llog_ctxt_put+0x20/0x2e0 [obdclass]
      [<00000000f8c5e665>] LNetPut+0x565/0xef0 [lnet]
      [<00000000f9d28924>] ptl_send_buf+0x1f4/0xab0 [ptlrpc]
      [<00000000f9d39e26>] lustre_msg_set_timeout+0x96/0x110 [ptlrpc]
      [<00000000f9d2942c>] ptlrpc_send_reply+0x24c/0x8b0 [ptlrpc]
      [<00000000f9cdb934>] target_send_reply+0x94/0x910 [ptlrpc]
      [<00000000f9d38d2c>] lustre_msg_get_conn_cnt+0xfc/0x1e0 [ptlrpc]
      [<00000000f92e851e>] mgs_handle+0x31e/0x1f10 [mgs]
      [<00000000f9d3234c>] lustre_msg_get_opc+0x10c/0x1f0 [ptlrpc]
      [<00000000f9d4ce07>] ptlrpc_main+0x1217/0x27b0 [ptlrpc]
      [<00000000c044d29c>] audit_syscall_exit+0x2d4/0x2ea
      [<00000000f9d4bbf0>] ptlrpc_main+0x0/0x27b0 [ptlrpc]
      [<00000000c0405c87>] kernel_thread_helper+0x7/0x10
      <IRQ>
      Kernel panic - not syncing: LBUG
      Memory for crash kernel (0x0 to 0x0) notwithin permissible range
      PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved

      ============
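
The hazard described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the structures `demo_connect_data_v1` and `demo_connect_data` are hypothetical, simplified stand-ins (not the real Lustre layouts), and the helpers `demo_field_fits()` and `demo_copy_connect_data()` are invented for this example. It is not necessarily how the issue was fixed in Lustre. The idea is that a buffer validated only against the smaller size must not be accessed through the larger layout unless each access is checked against the actual buffer length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* HYPOTHETICAL, simplified layouts; these are NOT the real Lustre structures. */
struct demo_connect_data_v1 {
        uint64_t flags;
        uint32_t version;
        uint32_t grant;
};

struct demo_connect_data {
        uint64_t flags;
        uint32_t version;
        uint32_t grant;
        uint64_t maxbytes;      /* present only in the larger layout */
};

/* Return nonzero if the field at [offset, offset + size) lies entirely
 * inside a buffer of buf_len bytes. */
static int demo_field_fits(size_t buf_len, size_t offset, size_t size)
{
        return offset <= buf_len && size <= buf_len - offset;
}

/* Copy a received buffer into a full-size local structure.
 * Bytes the sender did not provide are left zeroed, so later code can
 * read or assign any member of *dst without touching memory outside it.
 * Returns 0 on success, -1 if the buffer is shorter than the v1 layout. */
static int demo_copy_connect_data(struct demo_connect_data *dst,
                                  const void *src, size_t src_len)
{
        size_t n = src_len < sizeof(*dst) ? src_len : sizeof(*dst);

        assert(dst != NULL && src != NULL);
        if (src_len < sizeof(struct demo_connect_data_v1))
                return -1;

        memset(dst, 0, sizeof(*dst));
        memcpy(dst, src, n);
        return 0;
}
```

With these hypothetical layouts, a buffer of `sizeof(struct demo_connect_data_v1)` bytes passes a minimum-size check, yet `demo_field_fits()` reports that `maxbytes` lies outside it; writing that member through a `struct demo_connect_data *` pointing at the small buffer would be the out-of-bounds write. Working on a zero-filled, full-size copy avoids that.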

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: yong.fan nasf (Inactive)