Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-5635

(llog.c:310:llog_process_thread()) ASSERTION( index <= last_index + 1 ) failed:

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.7.0
    • Lustre 2.6.0
    • Centos 6.5 with Lustre head of tree (commit f04237)
    • 3
    • 15772

    Description

      Some hsm operations (lfs hsm_*) were running on a client, and I wanted to check the list of actions:

      cat /proc/fs/lustre/mdt/tas01-MDT0000/hsm/actions
      

      This led to an immediate crash of the MDT:

      <0>LustreError: 3773:0:(llog.c:310:llog_process_thread()) ASSERTION( index <= last_index + 1 ) failed: 
      <0>LustreError: 3773:0:(llog.c:310:llog_process_thread()) LBUG
      <4>Pid: 3773, comm: cat
      <4>
      <4>Call Trace:
      <4> [<ffffffffa02f8895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa02f8e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0413fff>] llog_process_thread+0xd7f/0xdd0 [obdclass]
      <4> [<ffffffffa0dd7410>] ? hsm_actions_show_cb+0x0/0x3d0 [mdt]
      <4> [<ffffffffa041585f>] llog_process_or_fork+0x13f/0x540 [obdclass]
      <4> [<ffffffffa0418612>] llog_cat_process_cb+0x1b2/0x5e0 [obdclass]
      <4> [<ffffffffa0413c22>] llog_process_thread+0x9a2/0xdd0 [obdclass]
      <4> [<ffffffff810792c7>] ? current_fs_time+0x27/0x30
      <4> [<ffffffffa0418460>] ? llog_cat_process_cb+0x0/0x5e0 [obdclass]
      <4> [<ffffffffa041585f>] llog_process_or_fork+0x13f/0x540 [obdclass]
      <4> [<ffffffffa0416b9d>] llog_cat_process_or_fork+0x1ad/0x300 [obdclass]
      <4> [<ffffffffa0dd7410>] ? hsm_actions_show_cb+0x0/0x3d0 [mdt]
      <4> [<ffffffffa0416d09>] llog_cat_process+0x19/0x20 [obdclass]
      <4> [<ffffffffa0dd7376>] mdt_hsm_actions_proc_show+0x1a6/0x240 [mdt]
      <4> [<ffffffffa0dd8b03>] ? mdt_hsm_actions_proc_start+0x63/0x380 [mdt]
      <4> [<ffffffff811ade02>] seq_read+0xf2/0x400
      <4> [<ffffffff8100b9ce>] ? common_interrupt+0xe/0x13
      <4> [<ffffffff811f353e>] proc_reg_read+0x7e/0xc0
      <4> [<ffffffff81189695>] vfs_read+0xb5/0x1a0
      <4> [<ffffffff811897d1>] sys_read+0x51/0x90
      <4> [<ffffffff810e1e4e>] ? __audit_syscall_exit+0x25e/0x290
      <4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      <4>
      <0>Kernel panic - not syncing: LBUG
      

      Crash stack is:

      crash> bt
      PID: 3773   TASK: ffff880130d44ae0  CPU: 3   COMMAND: "cat"
       #0 [ffff8801144bf900] machine_kexec at ffffffff81038f3b
       #1 [ffff8801144bf960] crash_kexec at ffffffff810c5d82
       #2 [ffff8801144bfa30] panic at ffffffff8152798a
       #3 [ffff8801144bfab0] lbug_with_loc at ffffffffa02f8eeb [libcfs]
       #4 [ffff8801144bfad0] llog_process_thread at ffffffffa0413fff [obdclass]
       #5 [ffff8801144bfb80] llog_process_or_fork at ffffffffa041585f [obdclass]
       #6 [ffff8801144bfbd0] llog_cat_process_cb at ffffffffa0418612 [obdclass]
       #7 [ffff8801144bfc30] llog_process_thread at ffffffffa0413c22 [obdclass]
       #8 [ffff8801144bfce0] llog_process_or_fork at ffffffffa041585f [obdclass]
       #9 [ffff8801144bfd30] llog_cat_process_or_fork at ffffffffa0416b9d [obdclass]
      #10 [ffff8801144bfdc0] llog_cat_process at ffffffffa0416d09 [obdclass]
      #11 [ffff8801144bfde0] mdt_hsm_actions_proc_show at ffffffffa0dd7376 [mdt]
      #12 [ffff8801144bfe20] seq_read at ffffffff811ade02
      #13 [ffff8801144bfea0] proc_reg_read at ffffffff811f353e
      #14 [ffff8801144bfef0] vfs_read at ffffffff81189695
      #15 [ffff8801144bff30] sys_read at ffffffff811897d1
      #16 [ffff8801144bff80] system_call_fastpath at ffffffff8100b072
          RIP: 00007f6de5e4f730  RSP: 00007fff9aa26d98  RFLAGS: 00000206
          RAX: 0000000000000000  RBX: ffffffff8100b072  RCX: 00007f6de5e4f730
          RDX: 0000000000008000  RSI: 00000000019c7000  RDI: 0000000000000003
          RBP: 00000000019c7000   R8: 00007f6de6103ee8   R9: 0000000000000001
          R10: 00007fff9aa26b20  R11: 0000000000000246  R12: ffffffffffff8000
          R13: 0000000000000003  R14: 0000000000008000  R15: 0000000000000003
          ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b
      

      Attachments

        Issue Links

          Activity

            [LU-5635] (llog.c:310:llog_process_thread()) ASSERTION( index <= last_index + 1 ) failed:

            The llog_cat_set_first_idx() looks guilty. Consider that there is cancel for last index in bitmap == bitmap_size - 1, i.e. 64767, then cat_idx will be assigned to that value first, then it enters cycle to find first set bit. The problem is that cycle counter 'i' is always inside bitmap because of this:
            i = (i + 1) % bitmap_size; but cat_idx is just incremented separately with each iteration, so it will became 64768 and beyond, which is out of range.

            tappro Mikhail Pershin added a comment - The llog_cat_set_first_idx() looks guilty. Consider that there is cancel for last index in bitmap == bitmap_size - 1, i.e. 64767, then cat_idx will be assigned to that value first, then it enters cycle to find first set bit. The problem is that cycle counter 'i' is always inside bitmap because of this: i = (i + 1) % bitmap_size; but cat_idx is just incremented separately with each iteration, so it will became 64768 and beyond, which is out of range.

            Patch landed to master (pre-2.7).

            jamesanunez James Nunez (Inactive) added a comment - Patch landed to master (pre-2.7).

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12161/
            Subject: LU-5635 llog: prevent out-of-bound index
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 28a963f266a57308ceb6903da238acf008e628bf

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12161/ Subject: LU-5635 llog: prevent out-of-bound index Project: fs/lustre-release Branch: master Current Patch Set: Commit: 28a963f266a57308ceb6903da238acf008e628bf

            I added some traces right before the LASSERT:

            index=64769, last_index=64767, cd=ffff88012144dbe8, last_called_index=64768, cd->lpcd_first_idx=64768, cd->lpcd_last_idx=0
            LustreError: 9008:0:(llog.c:316:llog_process_thread()) ASSERTION( index <= last_index + 1 ) failed:
            

            So the function is entered with index > last_index. Proposed patch: http://review.whamcloud.com/12161

            Although the patch does indeed prevent the crash, I'm unsure whether it is the correct place to do the check. May be that should be done earlier in llog_cat_process_cb(), or even in llog_cat_set_first_idx() as that function appears to push index beyond its range.

            fzago Frank Zago (Inactive) added a comment - I added some traces right before the LASSERT: index=64769, last_index=64767, cd=ffff88012144dbe8, last_called_index=64768, cd->lpcd_first_idx=64768, cd->lpcd_last_idx=0 LustreError: 9008:0:(llog.c:316:llog_process_thread()) ASSERTION( index <= last_index + 1 ) failed: So the function is entered with index > last_index. Proposed patch: http://review.whamcloud.com/12161 Although the patch does indeed prevent the crash, I'm unsure whether it is the correct place to do the check. May be that should be done earlier in llog_cat_process_cb(), or even in llog_cat_set_first_idx() as that function appears to push index beyond its range.
            fzago Frank Zago (Inactive) added a comment - - edited

            I can reproduce quite easily by running

            while true ; do cat actions > /dev/null ; done

            and the crash happens within minutes.

            fzago Frank Zago (Inactive) added a comment - - edited I can reproduce quite easily by running while true ; do cat actions > /dev/null ; done and the crash happens within minutes.

            People

              jamesanunez James Nunez (Inactive)
              fzago Frank Zago (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: