[LU-4692] (osc_lock.c:1204:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6 Created: 03/Mar/14  Updated: 12/Nov/14  Resolved: 11/Sep/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Li Xi (Inactive) Assignee: Zhenyu Xu
Resolution: Duplicate Votes: 0
Labels: None
Environment:

2.5.54


Issue Links:
Duplicate
is duplicated by LU-4443 Failure on test suite parallel-scale ... Resolved
Related
is related to LU-5492 osc_lock.c:1165:osc_lock_enqueue()) A... Resolved
is related to LU-4591 Related cl_lock failures on master/2.5 Resolved
Severity: 3
Rank (Obsolete): 12901

 Description   

The assertion failed on our system. The system is running Lustre-2.5.54, so the patch from LU-3889 is already merged.

<3>LustreError: 109372:0:(file.c:3087:ll_inode_revalidate_fini()) home2: revalidate FID [0x2000048a2:0x17:0x0] error: rc = -116
<0>LustreError: 237806:0:(osc_lock.c:1204:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6
<0>LustreError: 237806:0:(osc_lock.c:1204:osc_lock_enqueue()) LBUG
<4>Pid: 237806, comm: tail
<4>
<4>Call Trace:
<4> [<ffffffffa054c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa054ce97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0c620f0>] ? osc_lock_enqueue+0x0/0x8c0 [osc]
<4> [<ffffffffa0c62820>] osc_lock_enqueue+0x730/0x8c0 [osc]
<4> [<ffffffffa069a337>] ? cl_lock_state_signal+0x87/0x160 [obdclass]
<4> [<ffffffffa069dcfc>] cl_enqueue_try+0xfc/0x300 [obdclass]
<4> [<ffffffffa0a5c42a>] lov_lock_enqueue+0x22a/0x850 [lov]
<4> [<ffffffffa069dcfc>] cl_enqueue_try+0xfc/0x300 [obdclass]
<4> [<ffffffffa069ef4f>] cl_enqueue_locked+0x6f/0x1f0 [obdclass]
<4> [<ffffffffa069fb9e>] cl_lock_request+0x7e/0x270 [obdclass]
<4> [<ffffffffa06a4b2c>] cl_io_lock+0x3cc/0x560 [obdclass]
<4> [<ffffffffa06a4d62>] cl_io_loop+0xa2/0x1b0 [obdclass]
<4> [<ffffffffa0addca6>] ll_file_io_generic+0x2b6/0x710 [lustre]
<4> [<ffffffffa0694ce9>] ? cl_env_get+0x29/0x350 [obdclass]
<4> [<ffffffffa0ade23f>] ll_file_aio_read+0x13f/0x2c0 [lustre]
<4> [<ffffffffa0ade6fc>] ll_file_read+0x16c/0x2a0 [lustre]
<4> [<ffffffff811816c5>] vfs_read+0xb5/0x1a0
<4> [<ffffffff81181801>] sys_read+0x51/0x90
<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 237806, comm: tail Tainted: G W --------------- 2.6.32-358.el6.x86_64 #1
<4>Call Trace:
<4> [<ffffffff8150cfc8>] ? panic+0xa7/0x16f
<4> [<ffffffffa054ceeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa0c620f0>] ? osc_lock_enqueue+0x0/0x8c0 [osc]
<4> [<ffffffffa0c62820>] ? osc_lock_enqueue+0x730/0x8c0 [osc]
<4> [<ffffffffa069a337>] ? cl_lock_state_signal+0x87/0x160 [obdclass]
<4> [<ffffffffa069dcfc>] ? cl_enqueue_try+0xfc/0x300 [obdclass]
<4> [<ffffffffa0a5c42a>] ? lov_lock_enqueue+0x22a/0x850 [lov]
<4> [<ffffffffa069dcfc>] ? cl_enqueue_try+0xfc/0x300 [obdclass]
<4> [<ffffffffa069ef4f>] ? cl_enqueue_locked+0x6f/0x1f0 [obdclass]
<4> [<ffffffffa069fb9e>] ? cl_lock_request+0x7e/0x270 [obdclass]
<4> [<ffffffffa06a4b2c>] ? cl_io_lock+0x3cc/0x560 [obdclass]
<4> [<ffffffffa06a4d62>] ? cl_io_loop+0xa2/0x1b0 [obdclass]
<4> [<ffffffffa0addca6>] ? ll_file_io_generic+0x2b6/0x710 [lustre]
<4> [<ffffffffa0694ce9>] ? cl_env_get+0x29/0x350 [obdclass]
<4> [<ffffffffa0ade23f>] ? ll_file_aio_read+0x13f/0x2c0 [lustre]
<4> [<ffffffffa0ade6fc>] ? ll_file_read+0x16c/0x2a0 [lustre]
<4> [<ffffffff811816c5>] ? vfs_read+0xb5/0x1a0
<4> [<ffffffff81181801>] ? sys_read+0x51/0x90
<4> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b



 Comments   
Comment by Peter Jones [ 03/Mar/14 ]

Bobijam

Could you please assess this issue reported against master?

Thanks

Peter

Comment by Zhenyu Xu [ 04/Mar/14 ]

When an IO starts, it checks whether any existing lock matches what the IO needs; the call path is

cl_lockset_lock_one()=>cl_lock_request()=>cl_lock_hold_mutex()=>cl_lock_find()=>cl_lock_lookup()

                matched = cl_lock_ext_match(&lock->cll_descr, need) &&
                          lock->cll_state < CLS_FREEING &&
                          lock->cll_error == 0 &&
                          !(lock->cll_flags & CLF_CANCELLED) &&
                          cl_lock_fits_into(env, lock, need, io);

and in osc_lock_fits_into()

        struct osc_lock *ols = cl2osc_lock(slice);
                          
        if (need->cld_enq_flags & CEF_NEVER)
                return 0;

        if (ols->ols_state >= OLS_CANCELLED)
                return 0;

This means that after the osc_lock_fits_into() check passed, the osc_lock was cancelled by another thread: the new IO thinks it has found a fitting lock, but when it tries to enqueue that lock it finds the osc_lock in OLS_CANCELLED instead of OLS_NEW.

Comment by Jinshan Xiong (Inactive) [ 05/Mar/14 ]

The lock will be held in this case so it won't be cancelled.

Comment by Zhenyu Xu [ 05/Mar/14 ]

Let's construct a scenario: thread1 is starting an IO, thread2 holds the fitting lock (both top_lock and sub_lock), and thread2 is handling an osc lock cancel AST requested by the server.

thread1                           thread2               note
cl_lock_request()
+->cl_lock_hold_mutex()
  +-->cl_lock_find()                                 found the fitting top lock, and its sub osc lock is not in OLS_CANCELLED
  |
  |                          osc_ldlm_blocking_ast()
  |                          +->osc_ldlm_blocking_ast0()   get mutex of sub lock 
  |                          |+->osc_lock_blocking()
  |                          | +->cl_lock_cancel(sub_lock)
  |                          |  +->cl_lock_cancel0(sub_lock)   set sub_lock as CLF_CANCELLED, call clo_cancel() on each layer in reverse order
  |                          |  |+->osc_lock_cancel(sub_osc_lock) set osc_lock as OLS_CANCELLED
  |                          |  +->cl_lock_delete(sub_lock)
  |                          +->cl_lock_mutex_put(sub_lock)
  |
  |
  cl_lock_mutex_get(top_lock)

At this point the top_lock does not have CLF_CANCELLED set, while the sub osc_lock is already in the OLS_CANCELLED state.

Comment by Andreas Dilger [ 10/Mar/14 ]

Li Xi, could you please comment about what test was being run to cause this problem?

Comment by Jinshan Xiong (Inactive) [ 10/Mar/14 ]

This is probably a different manifestation of LU-4591. It would be really helpful if someone on our team could help reproduce the problem locally.

Comment by Li Xi (Inactive) [ 12/Mar/14 ]

Hi Andreas,

Unfortunately, I don't know what test was running. I will let you know as soon as I get more information about it.

Comment by Wojciech Turek (Inactive) [ 12/Mar/14 ]

We hit the same problem shortly after upgrading to 2.4.2 clients. We have not seen this on 2.4.1.

crash> bt
PID: 18546 TASK: ffff8809f1aa8040 CPU: 0 COMMAND: "turbostream"
#0 [ffff8809f0cf98a0] machine_kexec at ffffffff81038f3b
#1 [ffff8809f0cf9900] crash_kexec at ffffffff810c5da2
#2 [ffff8809f0cf99d0] panic at ffffffff8152721a
#3 [ffff8809f0cf9a50] lbug_with_loc at ffffffffa0f47eeb [libcfs]
#4 [ffff8809f0cf9a70] lovsub_lock_state at ffffffffa14b7d51 [lov]
#5 [ffff8809f0cf9ac0] cl_lock_state_signal at ffffffffa10aedd8 [obdclass]
#6 [ffff8809f0cf9b00] cl_lock_state_set at ffffffffa10aef25 [obdclass]
#7 [ffff8809f0cf9b30] cl_enqueue_try at ffffffffa10b2c29 [obdclass]
#8 [ffff8809f0cf9b80] lov_lock_enqueue at ffffffffa14b00da [lov]
#9 [ffff8809f0cf9c20] cl_enqueue_try at ffffffffa10b2bdc [obdclass]
#10 [ffff8809f0cf9c70] cl_enqueue_locked at ffffffffa10b3fcf [obdclass]
#11 [ffff8809f0cf9cb0] cl_lock_request at ffffffffa10b4c3e [obdclass]
#12 [ffff8809f0cf9d10] cl_io_lock at ffffffffa10b9edc [obdclass]
#13 [ffff8809f0cf9d70] cl_io_loop at ffffffffa10ba112 [obdclass]
#14 [ffff8809f0cf9da0] ll_file_io_generic at ffffffffa152ca90 [lustre]
#15 [ffff8809f0cf9e20] ll_file_aio_write at ffffffffa152d9d2 [lustre]
#16 [ffff8809f0cf9e80] ll_file_write at ffffffffa152dcbc [lustre]
#17 [ffff8809f0cf9ef0] vfs_write at ffffffff81188f88
#18 [ffff8809f0cf9f30] sys_write at ffffffff81189881
#19 [ffff8809f0cf9f80] system_call_fastpath at ffffffff8100b072
RIP: 00007fdf7eeea6fd RSP: 00007fffad51c8b8 RFLAGS: 00000202
RAX: 0000000000000001 RBX: ffffffff8100b072 RCX: 0000000000000001
RDX: 00000000000009c4 RSI: 0000000008b32154 RDI: 000000000000001b
RBP: 0000000013252a00 R8: 00007fdf8069d6a8 R9: 00000000000009c4
R10: 00007fffad512d3c R11: 0000000000000293 R12: 000000004c00010d
R13: 00000000000009c4 R14: 0000000000e6d964 R15: 000000000000001b
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
crash> sys
SYSTEM MAP: /boot/System.map-2.6.32-431.1.2.el6.hpcs.x86_64
DEBUG KERNEL: /usr/lib/debug/lib/modules/2.6.32-431.1.2.el6.hpcs.x86_64/vmlinux (2.6.32-431.1.2.el6.hpcs.x86_64)
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 12
DATE: Wed Mar 12 16:17:17 2014
UPTIME: 1 days, 23:37:03
LOAD AVERAGE: 1.99, 1.97, 1.91
TASKS: 449
NODENAME: tesla14
RELEASE: 2.6.32-431.1.2.el6.hpcs.x86_64
VERSION: #1 SMP Sun Dec 15 04:51:47 GMT 2013
MACHINE: x86_64 (2599 Mhz)
MEMORY: 64 GB
PANIC: "Kernel panic - not syncing: LBUG"

Comment by Jinshan Xiong (Inactive) [ 12/Mar/14 ]

Hi Turek,

Can you share the core dump with us? You can upload it to ftp.whamcloud.com. Thanks.

Jinshan

Comment by Wojciech Turek (Inactive) [ 14/Mar/14 ]

uploaded to ftp://ftp.whamcloud.com/uploads/LU-4692/

Comment by Zhenyu Xu [ 21/Mar/14 ]

Hi Wojciech,

Can you upload the supporting files (uncompressed vmlinuz, System.map) as well?

Comment by Shuichi Ihara (Inactive) [ 24/Mar/14 ]

Here is the crash dump of the original problem: ftp://ftp.whamcloud.com/uploads/LU-4692-2/
The kernel is the standard RHEL6.4 kernel; you can download vmlinux and System.map from here:
http://debuginfo.centos.org/6/x86_64/kernel-debuginfo-2.6.32-358.el6.x86_64.rpm
http://debuginfo.centos.org/6/x86_64/kernel-debuginfo-common-x86_64-2.6.32-358.el6.x86_64.rpm

Comment by Wojciech Turek (Inactive) [ 25/Mar/14 ]

We use patchless clients, so the kernel is the standard RHEL6.5 kernel:
ftp://ftp.scientificlinux.org/linux/scientific/6.5/archive/debuginfo/kernel-debuginfo-2.6.32-431.1.2.el6.x86_64.rpm
ftp://ftp.scientificlinux.org/linux/scientific/6.5/archive/debuginfo/kernel-debuginfo-common-x86_64-2.6.32-431.1.2.el6.x86_64.rpm

Comment by Jinshan Xiong (Inactive) [ 25/Mar/14 ]

This problem was probably introduced by LU-3027; I'm taking a look.

Comment by Jinshan Xiong (Inactive) [ 25/Mar/14 ]

I think this was introduced by LU-3889. I have pushed http://review.whamcloud.com/9783; please give it a try.

The upload lacks the Lustre module files, so I can't analyze the core dump.

Comment by Aurelien Degremont (Inactive) [ 26/Mar/14 ]

Jinshan,
Wojciech seems to be running 2.4.2, as we do. We both hit this bug, and I'm sure we did not apply LU-3889 on top of it, so it is possible to hit this bug without LU-3889. I therefore do not think LU-3889 is responsible for this issue.

Comment by Jinshan Xiong (Inactive) [ 26/Mar/14 ]

I see, thanks Aurelien.

Hi Wojciech, please upload lustre modules into ftp so I can take a further look at the core dump.

Comment by Jodi Levi (Inactive) [ 02/Jun/14 ]

The first part of this bug was related to LU-4591.

Comment by Zhenyu Xu [ 11/Sep/14 ]

Duplicate of LU-5492.

Generated at Sat Feb 10 01:45:01 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.