Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-3587

CR_MAXSIZE is too small

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.5.0, Lustre 2.4.2
    • Lustre 2.4.0
    • None
    • 3
    • 9102

    Description

      One of our Lustre 2.4.0 based clients is hitting an assertion when we run robinhood.

      2013-07-09 14:35:15 vrelic8 login: LustreError: 5467:0:(mdc_request.c:1500:changelog_kuc_hdr()) ASSERTION( len <= cfs_size_round(2*255 + 1 + sizeof(struct changelog_rec)) ) failed:
      2013-07-09 14:41:07 LustreError: 5467:0:(mdc_request.c:1500:changelog_kuc_hdr()) LBUG
      

      Backtrace of the crashed thread:

      PID: 5467   TASK: ffff885c8668eaa0  CPU: 17  COMMAND: "mdc_clg_send_th"
       #0 [ffff885c866779e8] machine_kexec at ffffffff81035bfb
       #1 [ffff885c86677a48] crash_kexec at ffffffff810c0932
       #2 [ffff885c86677b18] panic at ffffffff8150d943
       #3 [ffff885c86677b98] lbug_with_loc at ffffffffa042bf4b [libcfs]
       #4 [ffff885c86677bb8] changelog_kuc_hdr at ffffffffa096bf3e [mdc]
       #5 [ffff885c86677bc8] changelog_kkuc_cb at ffffffffa096d28c [mdc]
       #6 [ffff885c86677bf8] llog_process_thread at ffffffffa05e110b [obdclass]
       #7 [ffff885c86677ca8] llog_process_or_fork at ffffffffa05e2e1d [obdclass]
       #8 [ffff885c86677cf8] llog_cat_process_cb at ffffffffa05e560a [obdclass]
       #9 [ffff885c86677d58] llog_process_thread at ffffffffa05e110b [obdclass]
      #10 [ffff885c86677e08] llog_process_or_fork at ffffffffa05e2e1d [obdclass]
      #11 [ffff885c86677e58] llog_cat_process_or_fork at ffffffffa05e3ef9 [obdclass]
      #12 [ffff885c86677ee8] llog_cat_process at ffffffffa05e41d9 [obdclass]
      #13 [ffff885c86677f08] mdc_changelog_send_thread at ffffffffa097157b [mdc]
      #14 [ffff885c86677f48] kernel_thread at ffffffff8100c0ca
      

      Digging into the crash dump, the values on the stack did not appear to contain either the "len" variable calculated by changelog_kkuc_cb(). But the problem was easily reproducible, so I added a debug message and tried again. This told me:

      2013-07-15 11:27:32 LustreError: 5472:0:(mdc_request.c:1504:changelog_kuc_hdr()) ASSERTION( len <= CR_MAXSIZE ) failed: CR_MAXSIZE=576, len=588, rec=ffff885ce8c9a148, changelog_rec_size=96
      2013-07-15 11:27:32 LustreError: 5472:0:(mdc_request.c:1504:changelog_kuc_hdr()) LBUG
      

      With a pointer to the llog_changelog_rec function, I could print it out in crash:

      crash> print *(struct llog_changelog_rec *)0xffff885ce8c9a148
      $4 = {
        cr_hdr = {
          lrh_len = 608,
          lrh_index = 29516,
          lrh_type = 275120128,
          padding = 0
        },
        cr = {
          cr_namelen = 484,
          cr_flags = 8192,
          cr_type = 8,
          cr_index = 80023354,
          cr_prev = 0,
          cr_time = 1474100508289501189,
          {
            cr_tfid = {
              f_seq = 0,
              f_oid = 0,
              f_ver = 0
            },
            cr_markerflags = 0
          },
          cr_pfid = {
            f_seq = 8589961859,
            f_oid = 28956,
            f_ver = 0
          },
          cr_name = 0xffff885ce8c9a198 "\263j"
        },
        cr_tail = {
          lrt_len = 27315,
          lrt_index = 2
        }
      }
      

      So here is what I understand:

      • cr_type tells us that this is a CL_RENAME
      • cr_flags tells us that cr is a struct changelog_ext_rec instead of the normal struct changelog_rec
      • cr_namelen is large at 484

      Note that cr_name looks corrupt, but is not. It is merely because crash does not know that we have an undeclared union here. If I manually add 32 bytes to the cr_namelen pointer, I am able to see a correct filename:

      crash> print (((struct llog_changelog_rec *)0xffff885ce8c9a148)->cr).cr_name+32

      I won't share the users's file name here, it is sufficient to say that it is 245 characters long, not including the terminating NULL character. If I then add 246 characters, I can see a second filename:

      crash> print (((struct llog_changelog_rec *)0xffff885ce8c9a148)->cr).cr_name+32+246

      This filename is nearly the same as the first one, but the trailing string of "-failed" was removed, resulting in a string 238 characters long, not including the terminating null character. Doing the math:

      245+238+1 = 484

      OK, that matches cr_len! Now I think I see why CR_MAXSIZE starts out with "2*LEN_MAX + 1". But I don't understand it. Why are we only accounting for one of the null characters??

      But the root of the bug appears to be that CR_MAXSIZE only accounts for the size of the smaller "sizeof(struct changelog_rec)", when it should be using "sizeof(struct changelog_ext_rec)". Those extra 32 bytes would have been enough to store this structure, and avoid the bug.

      I can change CR_MAXSIZE to do the right thing, and I will push a patch to do so soon. I have not yet fully checked to see if changing CR_MAXSIZE could break anything else.

      Attachments

        Activity

          People

            bogl Bob Glossman (Inactive)
            morrone Christopher Morrone (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: