
Master to 2.5 server - LBUG: lustre_msg_buf()) ASSERTION( 0 ) failed: incorrect message magic: 0000000d(msg:ffff88020fe3c0f8)

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • None
    • None
    • 3
    • 13689

    Description

      While testing master clients (2.5.57) against 2.5 servers, we encountered this LBUG:

      2014-04-18T19:19:05.626354-05:00 c1-0c0s1n0 LustreError: 5456:0:(pack_generic.c:487:lustre_msg_buf()) ASSERTION( 0 ) failed: incorrect message magic: 0000000d(msg:ffff88020fe3c0f8)
      2014-04-18T19:19:05.626401-05:00 c1-0c0s1n0 LustreError: 5456:0:(pack_generic.c:487:lustre_msg_buf()) LBUG
      2014-04-18T19:19:05.626409-05:00 c1-0c0s1n0 Pid: 5456, comm: IOR
      2014-04-18T19:19:05.626417-05:00 c1-0c0s1n0 Call Trace:
      2014-04-18T19:19:05.626424-05:00 c1-0c0s1n0 [<ffffffff810065b1>] try_stack_unwind+0x161/0x1a0
      2014-04-18T19:19:05.626431-05:00 c1-0c0s1n0 [<ffffffff81004dd9>] dump_trace+0x89/0x440
      2014-04-18T19:19:05.626437-05:00 c1-0c0s1n0 [<ffffffffa0168897>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
      2014-04-18T19:19:05.626444-05:00 c1-0c0s1n0 [<ffffffffa0168de7>] lbug_with_loc+0x47/0xc0 [libcfs]
      2014-04-18T19:19:05.626449-05:00 c1-0c0s1n0 [<ffffffffa03e63c9>] lustre_msg_buf+0x49/0x60 [ptlrpc]
      2014-04-18T19:19:05.626455-05:00 c1-0c0s1n0 [<ffffffffa04128a0>] _sptlrpc_enlarge_msg_inplace+0x60/0x1c0 [ptlrpc]
      2014-04-18T19:19:05.626462-05:00 c1-0c0s1n0 [<ffffffffa0422c81>] null_enlarge_reqbuf+0xd1/0x200 [ptlrpc]
      2014-04-18T19:19:05.626468-05:00 c1-0c0s1n0 [<ffffffffa040fe9c>] sptlrpc_cli_enlarge_reqbuf+0x5c/0x160 [ptlrpc]
      2014-04-18T19:19:05.626479-05:00 c1-0c0s1n0 [<ffffffffa065d790>] mdc_finish_enqueue+0xa60/0x1090 [mdc]
      2014-04-18T19:19:05.626485-05:00 c1-0c0s1n0 [<ffffffffa065f6e6>] mdc_enqueue+0x13d6/0x1ce0 [mdc]
      2014-04-18T19:19:05.626495-05:00 c1-0c0s1n0 [<ffffffffa0660280>] mdc_intent_lock+0x290/0x55f [mdc]
      2014-04-18T19:19:05.626511-05:00 c1-0c0s1n0 [<ffffffffa06090bd>] lmv_intent_open+0x33d/0x9a0 [lmv]
      2014-04-18T19:19:05.626519-05:00 c1-0c0s1n0 [<ffffffffa06099ca>] lmv_intent_lock+0x2aa/0x370 [lmv]
      2014-04-18T19:19:05.626526-05:00 c1-0c0s1n0 [<ffffffffa0754fa2>] ll_lookup_it+0x4b2/0x1840 [lustre]
      2014-04-18T19:19:05.626535-05:00 c1-0c0s1n0 [<ffffffffa07563ba>] ll_lookup_nd+0x8a/0x550 [lustre]
      2014-04-18T19:19:05.626542-05:00 c1-0c0s1n0 [<ffffffff8114647c>] d_alloc_and_lookup+0x4c/0x80
      2014-04-18T19:19:05.626550-05:00 c1-0c0s1n0 [<ffffffff811465ae>] __lookup_hash+0xfe/0x180
      2014-04-18T19:19:05.626581-05:00 c1-0c0s1n0 [<ffffffff81148c6f>] do_last+0x2cf/0x7a0
      2014-04-18T19:19:05.626592-05:00 c1-0c0s1n0 [<ffffffff81149e08>] path_openat+0xc8/0x3c0
      2014-04-18T19:19:05.626619-05:00 c1-0c0s1n0 [<ffffffff8114a228>] do_filp_open+0x48/0xa0
      2014-04-18T19:19:05.626626-05:00 c1-0c0s1n0 [<ffffffff8113b0ee>] do_sys_open+0x16e/0x240
      2014-04-18T19:19:05.626641-05:00 c1-0c0s1n0 [<ffffffff8113b200>] sys_open+0x20/0x30
      2014-04-18T19:19:05.626666-05:00 c1-0c0s1n0 [<ffffffff81362e6b>] system_call_fastpath+0x16/0x1b
      2014-04-18T19:19:05.626684-05:00 c1-0c0s1n0 [<00000000005b8b31>] 0x5b8b31

      This was encountered during an I/O performance test run consisting mostly of IOR runs of various sizes. It occurred on a number of different nodes: at least 10% of the nodes used in the run hit this bug, but by no means all of them. In all cases, the stack trace appeared to be the same.

      The given magic value suggests it was either never set or was overwritten somehow. Here are the contents of the lustre_msg_v2 struct:

      crash> struct lustre_msg_v2 0xffff88060f3c40f8
      struct lustre_msg_v2 {
        lm_bufcount = 4608,
        lm_secflvr = 2,
        lm_magic = 13,
        lm_repsize = 0,
        lm_cksum = 132270,
        lm_flags = 2,
        lm_padding_2 = 19,
        lm_padding_3 = 0,
        lm_buflens = 0xffff88060f3c4118
      }
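
      To make the failing check concrete, here is a minimal user-space sketch of the kind of magic test that trips in lustre_msg_buf(). It is not the ptlrpc code itself. The two magic constants are written from memory of the Lustre wire-protocol headers and should be verified against your tree; struct msg_v2_hdr, classify_magic() and check_dumped_header() are hypothetical names used only for this illustration. The field values are copied from the crash output above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: magic values as recalled from the Lustre wire-protocol
 * headers. Verify against your own source tree before relying on them. */
#define LUSTRE_MSG_MAGIC_V2         0x0BD00BD3u
#define LUSTRE_MSG_MAGIC_V2_SWABBED 0xD30BD00Bu

/* Hypothetical mirror of the fixed header fields shown by crash. */
struct msg_v2_hdr {
    uint32_t lm_bufcount;
    uint32_t lm_secflvr;
    uint32_t lm_magic;
    uint32_t lm_repsize;
    uint32_t lm_cksum;
    uint32_t lm_flags;
    uint32_t lm_padding_2;
    uint32_t lm_padding_3;
};

enum magic_kind { MAGIC_V2, MAGIC_V2_SWABBED, MAGIC_UNKNOWN };

/* Classify a magic value. Anything that is neither the native nor the
 * byte-swapped V2 magic is treated as unknown, which corresponds to the
 * "incorrect message magic" assertion in the log. */
static enum magic_kind classify_magic(uint32_t magic)
{
    switch (magic) {
    case LUSTRE_MSG_MAGIC_V2:
        return MAGIC_V2;
    case LUSTRE_MSG_MAGIC_V2_SWABBED:
        return MAGIC_V2_SWABBED;
    default:
        return MAGIC_UNKNOWN;
    }
}

/* Feed the dumped header through the check. */
static void check_dumped_header(void)
{
    const struct msg_v2_hdr dumped = {
        .lm_bufcount  = 4608,
        .lm_secflvr   = 2,
        .lm_magic     = 13,      /* 0x0000000d, as printed by the LBUG */
        .lm_repsize   = 0,
        .lm_cksum     = 132270,
        .lm_flags     = 2,
        .lm_padding_2 = 19,
        .lm_padding_3 = 0,
    };

    assert(classify_magic(dumped.lm_magic) == MAGIC_UNKNOWN);
    assert(classify_magic(LUSTRE_MSG_MAGIC_V2) == MAGIC_V2);
}
```

      Under these assumptions, lm_magic = 13 matches neither the native nor the byte-swapped magic, so a plain endianness mismatch would not explain the value.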

      I'll be uploading a dump shortly. I do not have a clear reproducer for this currently, but if needed, I could probably manage it with a large test run.

          Activity

            [LU-4949] Master to 2.5 server - LBUG: lustre_msg_buf()) ASSERTION( 0 ) failed: incorrect message magic: 0000000d(msg:ffff88020fe3c0f8)

            paf Patrick Farrell (Inactive) added a comment:

            Dump at: ftp.whamcloud.com
            uploads/LU-4949/140423_compute_LBUG.tar.gz

            People

              wc-triage WC Triage
              paf Patrick Farrell (Inactive)
              Votes: 0
              Watchers: 4
