
[LU-4692] (osc_lock.c:1204:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • Lustre 2.4.3
    • 2.5.54
    Description

      The assertion failed on our system. The system is running Lustre 2.5.54, so the patch from LU-3889 is already merged.

      <3>LustreError: 109372:0:(file.c:3087:ll_inode_revalidate_fini()) home2: revalidate FID [0x2000048a2:0x17:0x0] error: rc = -116
      <0>LustreError: 237806:0:(osc_lock.c:1204:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6
      <0>LustreError: 237806:0:(osc_lock.c:1204:osc_lock_enqueue()) LBUG
      <4>Pid: 237806, comm: tail
      <4>
      <4>Call Trace:
      <4> [<ffffffffa054c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa054ce97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0c620f0>] ? osc_lock_enqueue+0x0/0x8c0 [osc]
      <4> [<ffffffffa0c62820>] osc_lock_enqueue+0x730/0x8c0 [osc]
      <4> [<ffffffffa069a337>] ? cl_lock_state_signal+0x87/0x160 [obdclass]
      <4> [<ffffffffa069dcfc>] cl_enqueue_try+0xfc/0x300 [obdclass]
      <4> [<ffffffffa0a5c42a>] lov_lock_enqueue+0x22a/0x850 [lov]
      <4> [<ffffffffa069dcfc>] cl_enqueue_try+0xfc/0x300 [obdclass]
      <4> [<ffffffffa069ef4f>] cl_enqueue_locked+0x6f/0x1f0 [obdclass]
      <4> [<ffffffffa069fb9e>] cl_lock_request+0x7e/0x270 [obdclass]
      <4> [<ffffffffa06a4b2c>] cl_io_lock+0x3cc/0x560 [obdclass]
      <4> [<ffffffffa06a4d62>] cl_io_loop+0xa2/0x1b0 [obdclass]
      <4> [<ffffffffa0addca6>] ll_file_io_generic+0x2b6/0x710 [lustre]
      <4> [<ffffffffa0694ce9>] ? cl_env_get+0x29/0x350 [obdclass]
      <4> [<ffffffffa0ade23f>] ll_file_aio_read+0x13f/0x2c0 [lustre]
      <4> [<ffffffffa0ade6fc>] ll_file_read+0x16c/0x2a0 [lustre]
      <4> [<ffffffff811816c5>] vfs_read+0xb5/0x1a0
      <4> [<ffffffff81181801>] sys_read+0x51/0x90
      <4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 237806, comm: tail Tainted: G W --------------- 2.6.32-358.el6.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff8150cfc8>] ? panic+0xa7/0x16f
      <4> [<ffffffffa054ceeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa0c620f0>] ? osc_lock_enqueue+0x0/0x8c0 [osc]
      <4> [<ffffffffa0c62820>] ? osc_lock_enqueue+0x730/0x8c0 [osc]
      <4> [<ffffffffa069a337>] ? cl_lock_state_signal+0x87/0x160 [obdclass]
      <4> [<ffffffffa069dcfc>] ? cl_enqueue_try+0xfc/0x300 [obdclass]
      <4> [<ffffffffa0a5c42a>] ? lov_lock_enqueue+0x22a/0x850 [lov]
      <4> [<ffffffffa069dcfc>] ? cl_enqueue_try+0xfc/0x300 [obdclass]
      <4> [<ffffffffa069ef4f>] ? cl_enqueue_locked+0x6f/0x1f0 [obdclass]
      <4> [<ffffffffa069fb9e>] ? cl_lock_request+0x7e/0x270 [obdclass]
      <4> [<ffffffffa06a4b2c>] ? cl_io_lock+0x3cc/0x560 [obdclass]
      <4> [<ffffffffa06a4d62>] ? cl_io_loop+0xa2/0x1b0 [obdclass]
      <4> [<ffffffffa0addca6>] ? ll_file_io_generic+0x2b6/0x710 [lustre]
      <4> [<ffffffffa0694ce9>] ? cl_env_get+0x29/0x350 [obdclass]
      <4> [<ffffffffa0ade23f>] ? ll_file_aio_read+0x13f/0x2c0 [lustre]
      <4> [<ffffffffa0ade6fc>] ? ll_file_read+0x16c/0x2a0 [lustre]
      <4> [<ffffffff811816c5>] ? vfs_read+0xb5/0x1a0
      <4> [<ffffffff81181801>] ? sys_read+0x51/0x90
      <4> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b


          Activity

            [LU-4692] (osc_lock.c:1204:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6
            wjt27 Wojciech Turek added a comment - uploaded to ftp://ftp.whamcloud.com/uploads/LU-4692/

            jay Jinshan Xiong (Inactive) added a comment -

            Hi Turek,

            Can you share the core dump with us? You can upload it to ftp.whamcloud.com, thanks.

            Jinshan

            wjt27 Wojciech Turek added a comment -

            We hit the same problem shortly after upgrading to 2.4.2 clients. We have not seen this on 2.4.1.

            crash> bt
            PID: 18546 TASK: ffff8809f1aa8040 CPU: 0 COMMAND: "turbostream"
            #0 [ffff8809f0cf98a0] machine_kexec at ffffffff81038f3b
            #1 [ffff8809f0cf9900] crash_kexec at ffffffff810c5da2
            #2 [ffff8809f0cf99d0] panic at ffffffff8152721a
            #3 [ffff8809f0cf9a50] lbug_with_loc at ffffffffa0f47eeb [libcfs]
            #4 [ffff8809f0cf9a70] lovsub_lock_state at ffffffffa14b7d51 [lov]
            #5 [ffff8809f0cf9ac0] cl_lock_state_signal at ffffffffa10aedd8 [obdclass]
            #6 [ffff8809f0cf9b00] cl_lock_state_set at ffffffffa10aef25 [obdclass]
            #7 [ffff8809f0cf9b30] cl_enqueue_try at ffffffffa10b2c29 [obdclass]
            #8 [ffff8809f0cf9b80] lov_lock_enqueue at ffffffffa14b00da [lov]
            #9 [ffff8809f0cf9c20] cl_enqueue_try at ffffffffa10b2bdc [obdclass]
            #10 [ffff8809f0cf9c70] cl_enqueue_locked at ffffffffa10b3fcf [obdclass]
            #11 [ffff8809f0cf9cb0] cl_lock_request at ffffffffa10b4c3e [obdclass]
            #12 [ffff8809f0cf9d10] cl_io_lock at ffffffffa10b9edc [obdclass]
            #13 [ffff8809f0cf9d70] cl_io_loop at ffffffffa10ba112 [obdclass]
            #14 [ffff8809f0cf9da0] ll_file_io_generic at ffffffffa152ca90 [lustre]
            #15 [ffff8809f0cf9e20] ll_file_aio_write at ffffffffa152d9d2 [lustre]
            #16 [ffff8809f0cf9e80] ll_file_write at ffffffffa152dcbc [lustre]
            #17 [ffff8809f0cf9ef0] vfs_write at ffffffff81188f88
            #18 [ffff8809f0cf9f30] sys_write at ffffffff81189881
            #19 [ffff8809f0cf9f80] system_call_fastpath at ffffffff8100b072
            RIP: 00007fdf7eeea6fd RSP: 00007fffad51c8b8 RFLAGS: 00000202
            RAX: 0000000000000001 RBX: ffffffff8100b072 RCX: 0000000000000001
            RDX: 00000000000009c4 RSI: 0000000008b32154 RDI: 000000000000001b
            RBP: 0000000013252a00 R8: 00007fdf8069d6a8 R9: 00000000000009c4
            R10: 00007fffad512d3c R11: 0000000000000293 R12: 000000004c00010d
            R13: 00000000000009c4 R14: 0000000000e6d964 R15: 000000000000001b
            ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
            crash> sys
            SYSTEM MAP: /boot/System.map-2.6.32-431.1.2.el6.hpcs.x86_64
            DEBUG KERNEL: /usr/lib/debug/lib/modules/2.6.32-431.1.2.el6.hpcs.x86_64/vmlinux (2.6.32-431.1.2.el6.hpcs.x86_64)
            DUMPFILE: vmcore [PARTIAL DUMP]
            CPUS: 12
            DATE: Wed Mar 12 16:17:17 2014
            UPTIME: 1 days, 23:37:03
            LOAD AVERAGE: 1.99, 1.97, 1.91
            TASKS: 449
            NODENAME: tesla14
            RELEASE: 2.6.32-431.1.2.el6.hpcs.x86_64
            VERSION: #1 SMP Sun Dec 15 04:51:47 GMT 2013
            MACHINE: x86_64 (2599 Mhz)
            MEMORY: 64 GB
            PANIC: "Kernel panic - not syncing: LBUG"


            lixi Li Xi (Inactive) added a comment -

            Hi Andreas,

            Unfortunately, I don't know what test was running. I will let you know as soon as I get more information about it.

            jay Jinshan Xiong (Inactive) added a comment -

            This is probably a different manifestation of LU-4591. It would be really helpful if someone on our team could help reproduce the problem on our side.

            adilger Andreas Dilger added a comment -

            Li Xi, could you please comment on what test was being run to cause this problem?
            bobijam Zhenyu Xu added a comment - edited

            Consider the following scenario: thread1 is starting an IO, thread2 holds the fitting lock (both top_lock and sub_lock), and thread2 is handling an OSC lock cancel AST requested by the server:

            thread1                           thread2                              note
            cl_lock_request()
            +->cl_lock_hold_mutex()
              +-->cl_lock_find()                                                   found the fitting top lock; its sub osc lock is not yet in OLS_CANCELLED
              |
              |                          osc_ldlm_blocking_ast()
              |                          +->osc_ldlm_blocking_ast0()               takes the mutex of the sub lock
              |                          |+->osc_lock_blocking()
              |                          | +->cl_lock_cancel(sub_lock)
              |                          |  +->cl_lock_cancel0(sub_lock)           sets CLF_CANCELLED on sub_lock, then calls clo_cancel() on each slice in reverse order
              |                          |  |+->osc_lock_cancel(sub_osc_lock)      sets the osc_lock to OLS_CANCELLED
              |                          |  +->cl_lock_delete(sub_lock)
              |                          +->cl_lock_mutex_put(sub_lock)
              |
              |
              cl_lock_mutex_get(top_lock)
            

            At this point, top_lock does not have CLF_CANCELLED set, while the sub osc_lock is already in the OLS_CANCELLED state. The standalone sketch below illustrates this window.
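            To make the window concrete, here is a minimal standalone sketch of the check-then-enqueue race, using plain pthreads rather than the CLIO code; all names (toy_lock, toy_fits, toy_blocking_ast, toy_enqueue) are hypothetical stand-ins for the lookup's fitness check, the blocking AST, and osc_lock_enqueue() with its LASSERT:

            /* Toy model of the race described above; not Lustre code. */
            #include <assert.h>
            #include <pthread.h>

            enum toy_state { TOY_NEW, TOY_GRANTED, TOY_CANCELLED };

            struct toy_lock {
                    pthread_mutex_t tl_mutex;
                    enum toy_state  tl_state;
            };

            static struct toy_lock cached_lock = {
                    .tl_mutex = PTHREAD_MUTEX_INITIALIZER,
                    .tl_state = TOY_NEW,
            };

            /* Stands in for osc_lock_fits_into(): the answer is only valid at
             * the instant it is computed. */
            static int toy_fits(struct toy_lock *lk)
            {
                    int fits;

                    pthread_mutex_lock(&lk->tl_mutex);
                    fits = lk->tl_state < TOY_CANCELLED;
                    pthread_mutex_unlock(&lk->tl_mutex);
                    return fits;    /* may already be stale for the caller */
            }

            /* Stands in for the blocking-AST path ending in osc_lock_cancel(). */
            static void *toy_blocking_ast(void *arg)
            {
                    struct toy_lock *lk = arg;

                    pthread_mutex_lock(&lk->tl_mutex);
                    lk->tl_state = TOY_CANCELLED;
                    pthread_mutex_unlock(&lk->tl_mutex);
                    return NULL;
            }

            /* Stands in for osc_lock_enqueue() and its state assertion. */
            static void toy_enqueue(struct toy_lock *lk)
            {
                    pthread_mutex_lock(&lk->tl_mutex);
                    assert(lk->tl_state == TOY_NEW);  /* fires if cancel won the race */
                    lk->tl_state = TOY_GRANTED;
                    pthread_mutex_unlock(&lk->tl_mutex);
            }

            int main(void)
            {
                    pthread_t ast;

                    /* thread1: the lookup's fitness check succeeds ... */
                    if (toy_fits(&cached_lock)) {
                            /* ... then thread2's cancel AST runs in the window
                             * between the check and the enqueue ... */
                            pthread_create(&ast, NULL, toy_blocking_ast, &cached_lock);
                            pthread_join(ast, NULL);

                            /* ... and thread1 enqueues the "fitting" lock, tripping
                             * the assertion, which mirrors the LBUG in this ticket. */
                            toy_enqueue(&cached_lock);
                    }
                    return 0;
            }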


            jay Jinshan Xiong (Inactive) added a comment -

            The lock will be held in this case, so it won't be cancelled.
            bobijam Zhenyu Xu added a comment -

            When an IO starts, it checks whether an existing lock matches what the IO needs; the calling path is

            cl_lockset_lock_one()=>cl_lock_request()=>cl_lock_hold_mutex()=>cl_lock_find()=>cl_lock_lookup()

                            matched = cl_lock_ext_match(&lock->cll_descr, need) &&
                                      lock->cll_state < CLS_FREEING &&
                                      lock->cll_error == 0 &&
                                      !(lock->cll_flags & CLF_CANCELLED) &&
                                      cl_lock_fits_into(env, lock, need, io);
            

            and in osc_lock_fits_into()

                    struct osc_lock *ols = cl2osc_lock(slice);
                                      
                    if (need->cld_enq_flags & CEF_NEVER)
                            return 0;
            
                    if (ols->ols_state >= OLS_CANCELLED)
                            return 0;
            

            This means that after the osc_lock_fits_into() check passed, the osc_lock was cancelled by another thread, so the new IO thinks it has found a fitting lock; when it then tries to enqueue that lock, it finds the osc_lock in OLS_CANCELLED instead of OLS_NEW (see the illustrative sketch below).
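            As an illustration only (this ticket was resolved as a duplicate, so the real fix lives elsewhere), the generic remedy for such a window is to re-check the state under the lock's mutex at the point of use and return a retry error instead of asserting. The sketch below repeats the toy types from the earlier sketch so it stands alone; toy_enqueue_checked() is a hypothetical name, not a Lustre function:

            #include <errno.h>
            #include <pthread.h>

            enum toy_state { TOY_NEW, TOY_GRANTED, TOY_CANCELLED };

            struct toy_lock {
                    pthread_mutex_t tl_mutex;
                    enum toy_state  tl_state;
            };

            /* Re-validate the state while holding the mutex, so the decision and
             * the state change are atomic with respect to a concurrent cancel. */
            int toy_enqueue_checked(struct toy_lock *lk)
            {
                    int rc;

                    pthread_mutex_lock(&lk->tl_mutex);
                    if (lk->tl_state != TOY_NEW) {
                            /* Cancelled (or otherwise reused) since the lookup:
                             * ask the caller to drop this lock and find or create
                             * a fresh one rather than tripping an assertion. */
                            rc = -EAGAIN;
                    } else {
                            /* Still NEW, and it cannot change while we hold the
                             * mutex, so it is safe to proceed with the enqueue. */
                            lk->tl_state = TOY_GRANTED;
                            rc = 0;
                    }
                    pthread_mutex_unlock(&lk->tl_mutex);
                    return rc;
            }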

            pjones Peter Jones added a comment -

            Bobijam,

            Could you please assess this issue reported against master?

            Thanks

            Peter

            People

              bobijam Zhenyu Xu
              lixi Li Xi (Inactive)