[LU-5635] (llog.c:310:llog_process_thread()) ASSERTION( index <= last_index + 1 ) failed: Created: 17/Sep/14 Updated: 01/Mar/17 Resolved: 04/Dec/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Frank Zago (Inactive) | Assignee: | James Nunez (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Environment: |
Centos 6.5 with Lustre head of tree (commit f04237) |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 15772 | ||||||||
| Description |
|
Some hsm operations (lfs hsm_*) were running on a client, and I wanted to check the list of actions: cat /proc/fs/lustre/mdt/tas01-MDT0000/hsm/actions This led to an immediate crash of the MDT: <0>LustreError: 3773:0:(llog.c:310:llog_process_thread()) ASSERTION( index <= last_index + 1 ) failed: <0>LustreError: 3773:0:(llog.c:310:llog_process_thread()) LBUG <4>Pid: 3773, comm: cat <4> <4>Call Trace: <4> [<ffffffffa02f8895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] <4> [<ffffffffa02f8e97>] lbug_with_loc+0x47/0xb0 [libcfs] <4> [<ffffffffa0413fff>] llog_process_thread+0xd7f/0xdd0 [obdclass] <4> [<ffffffffa0dd7410>] ? hsm_actions_show_cb+0x0/0x3d0 [mdt] <4> [<ffffffffa041585f>] llog_process_or_fork+0x13f/0x540 [obdclass] <4> [<ffffffffa0418612>] llog_cat_process_cb+0x1b2/0x5e0 [obdclass] <4> [<ffffffffa0413c22>] llog_process_thread+0x9a2/0xdd0 [obdclass] <4> [<ffffffff810792c7>] ? current_fs_time+0x27/0x30 <4> [<ffffffffa0418460>] ? llog_cat_process_cb+0x0/0x5e0 [obdclass] <4> [<ffffffffa041585f>] llog_process_or_fork+0x13f/0x540 [obdclass] <4> [<ffffffffa0416b9d>] llog_cat_process_or_fork+0x1ad/0x300 [obdclass] <4> [<ffffffffa0dd7410>] ? hsm_actions_show_cb+0x0/0x3d0 [mdt] <4> [<ffffffffa0416d09>] llog_cat_process+0x19/0x20 [obdclass] <4> [<ffffffffa0dd7376>] mdt_hsm_actions_proc_show+0x1a6/0x240 [mdt] <4> [<ffffffffa0dd8b03>] ? mdt_hsm_actions_proc_start+0x63/0x380 [mdt] <4> [<ffffffff811ade02>] seq_read+0xf2/0x400 <4> [<ffffffff8100b9ce>] ? common_interrupt+0xe/0x13 <4> [<ffffffff811f353e>] proc_reg_read+0x7e/0xc0 <4> [<ffffffff81189695>] vfs_read+0xb5/0x1a0 <4> [<ffffffff811897d1>] sys_read+0x51/0x90 <4> [<ffffffff810e1e4e>] ? __audit_syscall_exit+0x25e/0x290 <4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b <4> <0>Kernel panic - not syncing: LBUG Crash stack is: crash> bt
PID: 3773 TASK: ffff880130d44ae0 CPU: 3 COMMAND: "cat"
#0 [ffff8801144bf900] machine_kexec at ffffffff81038f3b
#1 [ffff8801144bf960] crash_kexec at ffffffff810c5d82
#2 [ffff8801144bfa30] panic at ffffffff8152798a
#3 [ffff8801144bfab0] lbug_with_loc at ffffffffa02f8eeb [libcfs]
#4 [ffff8801144bfad0] llog_process_thread at ffffffffa0413fff [obdclass]
#5 [ffff8801144bfb80] llog_process_or_fork at ffffffffa041585f [obdclass]
#6 [ffff8801144bfbd0] llog_cat_process_cb at ffffffffa0418612 [obdclass]
#7 [ffff8801144bfc30] llog_process_thread at ffffffffa0413c22 [obdclass]
#8 [ffff8801144bfce0] llog_process_or_fork at ffffffffa041585f [obdclass]
#9 [ffff8801144bfd30] llog_cat_process_or_fork at ffffffffa0416b9d [obdclass]
#10 [ffff8801144bfdc0] llog_cat_process at ffffffffa0416d09 [obdclass]
#11 [ffff8801144bfde0] mdt_hsm_actions_proc_show at ffffffffa0dd7376 [mdt]
#12 [ffff8801144bfe20] seq_read at ffffffff811ade02
#13 [ffff8801144bfea0] proc_reg_read at ffffffff811f353e
#14 [ffff8801144bfef0] vfs_read at ffffffff81189695
#15 [ffff8801144bff30] sys_read at ffffffff811897d1
#16 [ffff8801144bff80] system_call_fastpath at ffffffff8100b072
RIP: 00007f6de5e4f730 RSP: 00007fff9aa26d98 RFLAGS: 00000206
RAX: 0000000000000000 RBX: ffffffff8100b072 RCX: 00007f6de5e4f730
RDX: 0000000000008000 RSI: 00000000019c7000 RDI: 0000000000000003
RBP: 00000000019c7000 R8: 00007f6de6103ee8 R9: 0000000000000001
R10: 00007fff9aa26b20 R11: 0000000000000246 R12: ffffffffffff8000
R13: 0000000000000003 R14: 0000000000008000 R15: 0000000000000003
ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b
|
| Comments |
| Comment by Frank Zago (Inactive) [ 01/Oct/14 ] |
|
I can reproduce quite easily by running while true ; do cat actions > /dev/null ; done and the crash happens within minutes. |
| Comment by Frank Zago (Inactive) [ 01/Oct/14 ] |
|
I added some traces right before the LASSERT: index=64769, last_index=64767, cd=ffff88012144dbe8, last_called_index=64768, cd->lpcd_first_idx=64768, cd->lpcd_last_idx=0 LustreError: 9008:0:(llog.c:316:llog_process_thread()) ASSERTION( index <= last_index + 1 ) failed: So the function is entered with index > last_index. Proposed patch: http://review.whamcloud.com/12161 Although the patch does indeed prevent the crash, I'm unsure whether it is the correct place to do the check. May be that should be done earlier in llog_cat_process_cb(), or even in llog_cat_set_first_idx() as that function appears to push index beyond its range. |
| Comment by Gerrit Updater [ 04/Dec/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12161/ |
| Comment by James Nunez (Inactive) [ 04/Dec/14 ] |
|
Patch landed to master (pre-2.7). |
| Comment by Mikhail Pershin [ 14/Jun/15 ] |
|
The llog_cat_set_first_idx() looks guilty. Consider that there is cancel for last index in bitmap == bitmap_size - 1, i.e. 64767, then cat_idx will be assigned to that value first, then it enters cycle to find first set bit. The problem is that cycle counter 'i' is always inside bitmap because of this: |