[LU-5635] (llog.c:310:llog_process_thread()) ASSERTION( index <= last_index + 1 ) failed: Created: 17/Sep/14  Updated: 01/Mar/17  Resolved: 04/Dec/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Minor
Reporter: Frank Zago (Inactive) Assignee: James Nunez (Inactive)
Resolution: Fixed Votes: 0
Labels: patch
Environment:

Centos 6.5 with Lustre head of tree (commit f04237)


Issue Links:
Related
is related to LU-6714 llog_process_thread() may use wrong o... Resolved
Severity: 3
Rank (Obsolete): 15772

 Description   

Some hsm operations (lfs hsm_*) were running on a client, and I wanted to check the list of actions:

cat /proc/fs/lustre/mdt/tas01-MDT0000/hsm/actions

This led to an immediate crash of the MDT:

<0>LustreError: 3773:0:(llog.c:310:llog_process_thread()) ASSERTION( index <= last_index + 1 ) failed: 
<0>LustreError: 3773:0:(llog.c:310:llog_process_thread()) LBUG
<4>Pid: 3773, comm: cat
<4>
<4>Call Trace:
<4> [<ffffffffa02f8895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa02f8e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0413fff>] llog_process_thread+0xd7f/0xdd0 [obdclass]
<4> [<ffffffffa0dd7410>] ? hsm_actions_show_cb+0x0/0x3d0 [mdt]
<4> [<ffffffffa041585f>] llog_process_or_fork+0x13f/0x540 [obdclass]
<4> [<ffffffffa0418612>] llog_cat_process_cb+0x1b2/0x5e0 [obdclass]
<4> [<ffffffffa0413c22>] llog_process_thread+0x9a2/0xdd0 [obdclass]
<4> [<ffffffff810792c7>] ? current_fs_time+0x27/0x30
<4> [<ffffffffa0418460>] ? llog_cat_process_cb+0x0/0x5e0 [obdclass]
<4> [<ffffffffa041585f>] llog_process_or_fork+0x13f/0x540 [obdclass]
<4> [<ffffffffa0416b9d>] llog_cat_process_or_fork+0x1ad/0x300 [obdclass]
<4> [<ffffffffa0dd7410>] ? hsm_actions_show_cb+0x0/0x3d0 [mdt]
<4> [<ffffffffa0416d09>] llog_cat_process+0x19/0x20 [obdclass]
<4> [<ffffffffa0dd7376>] mdt_hsm_actions_proc_show+0x1a6/0x240 [mdt]
<4> [<ffffffffa0dd8b03>] ? mdt_hsm_actions_proc_start+0x63/0x380 [mdt]
<4> [<ffffffff811ade02>] seq_read+0xf2/0x400
<4> [<ffffffff8100b9ce>] ? common_interrupt+0xe/0x13
<4> [<ffffffff811f353e>] proc_reg_read+0x7e/0xc0
<4> [<ffffffff81189695>] vfs_read+0xb5/0x1a0
<4> [<ffffffff811897d1>] sys_read+0x51/0x90
<4> [<ffffffff810e1e4e>] ? __audit_syscall_exit+0x25e/0x290
<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
<4>
<0>Kernel panic - not syncing: LBUG

Crash stack is:

crash> bt
PID: 3773   TASK: ffff880130d44ae0  CPU: 3   COMMAND: "cat"
 #0 [ffff8801144bf900] machine_kexec at ffffffff81038f3b
 #1 [ffff8801144bf960] crash_kexec at ffffffff810c5d82
 #2 [ffff8801144bfa30] panic at ffffffff8152798a
 #3 [ffff8801144bfab0] lbug_with_loc at ffffffffa02f8eeb [libcfs]
 #4 [ffff8801144bfad0] llog_process_thread at ffffffffa0413fff [obdclass]
 #5 [ffff8801144bfb80] llog_process_or_fork at ffffffffa041585f [obdclass]
 #6 [ffff8801144bfbd0] llog_cat_process_cb at ffffffffa0418612 [obdclass]
 #7 [ffff8801144bfc30] llog_process_thread at ffffffffa0413c22 [obdclass]
 #8 [ffff8801144bfce0] llog_process_or_fork at ffffffffa041585f [obdclass]
 #9 [ffff8801144bfd30] llog_cat_process_or_fork at ffffffffa0416b9d [obdclass]
#10 [ffff8801144bfdc0] llog_cat_process at ffffffffa0416d09 [obdclass]
#11 [ffff8801144bfde0] mdt_hsm_actions_proc_show at ffffffffa0dd7376 [mdt]
#12 [ffff8801144bfe20] seq_read at ffffffff811ade02
#13 [ffff8801144bfea0] proc_reg_read at ffffffff811f353e
#14 [ffff8801144bfef0] vfs_read at ffffffff81189695
#15 [ffff8801144bff30] sys_read at ffffffff811897d1
#16 [ffff8801144bff80] system_call_fastpath at ffffffff8100b072
    RIP: 00007f6de5e4f730  RSP: 00007fff9aa26d98  RFLAGS: 00000206
    RAX: 0000000000000000  RBX: ffffffff8100b072  RCX: 00007f6de5e4f730
    RDX: 0000000000008000  RSI: 00000000019c7000  RDI: 0000000000000003
    RBP: 00000000019c7000   R8: 00007f6de6103ee8   R9: 0000000000000001
    R10: 00007fff9aa26b20  R11: 0000000000000246  R12: ffffffffffff8000
    R13: 0000000000000003  R14: 0000000000008000  R15: 0000000000000003
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b


 Comments   
Comment by Frank Zago (Inactive) [ 01/Oct/14 ]

I can reproduce quite easily by running

while true ; do cat actions > /dev/null ; done

and the crash happens within minutes.

Comment by Frank Zago (Inactive) [ 01/Oct/14 ]

I added some traces right before the LASSERT:

index=64769, last_index=64767, cd=ffff88012144dbe8, last_called_index=64768, cd->lpcd_first_idx=64768, cd->lpcd_last_idx=0
LustreError: 9008:0:(llog.c:316:llog_process_thread()) ASSERTION( index <= last_index + 1 ) failed:

So the function is entered with index > last_index. Proposed patch: http://review.whamcloud.com/12161

Although the patch does indeed prevent the crash, I'm unsure whether it is the correct place to do the check. May be that should be done earlier in llog_cat_process_cb(), or even in llog_cat_set_first_idx() as that function appears to push index beyond its range.

Comment by Gerrit Updater [ 04/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12161/
Subject: LU-5635 llog: prevent out-of-bound index
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 28a963f266a57308ceb6903da238acf008e628bf

Comment by James Nunez (Inactive) [ 04/Dec/14 ]

Patch landed to master (pre-2.7).

Comment by Mikhail Pershin [ 14/Jun/15 ]

The llog_cat_set_first_idx() looks guilty. Consider that there is cancel for last index in bitmap == bitmap_size - 1, i.e. 64767, then cat_idx will be assigned to that value first, then it enters cycle to find first set bit. The problem is that cycle counter 'i' is always inside bitmap because of this:
i = (i + 1) % bitmap_size; but cat_idx is just incremented separately with each iteration, so it will became 64768 and beyond, which is out of range.

Generated at Sat Feb 10 01:53:12 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.