
[LU-4692] (osc_lock.c:1204:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • Lustre 2.4.3
    • 2.5.54
    Description

      The assertion failed on our system. The system is running Lustre 2.5.54, so the patch from LU-3889 is already merged.

      <3>LustreError: 109372:0:(file.c:3087:ll_inode_revalidate_fini()) home2: revalidate FID [0x2000048a2:0x17:0x0] error: rc = -116
      <0>LustreError: 237806:0:(osc_lock.c:1204:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6
      <0>LustreError: 237806:0:(osc_lock.c:1204:osc_lock_enqueue()) LBUG
      <4>Pid: 237806, comm: tail
      <4>
      <4>Call Trace:
      <4> [<ffffffffa054c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa054ce97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0c620f0>] ? osc_lock_enqueue+0x0/0x8c0 [osc]
      <4> [<ffffffffa0c62820>] osc_lock_enqueue+0x730/0x8c0 [osc]
      <4> [<ffffffffa069a337>] ? cl_lock_state_signal+0x87/0x160 [obdclass]
      <4> [<ffffffffa069dcfc>] cl_enqueue_try+0xfc/0x300 [obdclass]
      <4> [<ffffffffa0a5c42a>] lov_lock_enqueue+0x22a/0x850 [lov]
      <4> [<ffffffffa069dcfc>] cl_enqueue_try+0xfc/0x300 [obdclass]
      <4> [<ffffffffa069ef4f>] cl_enqueue_locked+0x6f/0x1f0 [obdclass]
      <4> [<ffffffffa069fb9e>] cl_lock_request+0x7e/0x270 [obdclass]
      <4> [<ffffffffa06a4b2c>] cl_io_lock+0x3cc/0x560 [obdclass]
      <4> [<ffffffffa06a4d62>] cl_io_loop+0xa2/0x1b0 [obdclass]
      <4> [<ffffffffa0addca6>] ll_file_io_generic+0x2b6/0x710 [lustre]
      <4> [<ffffffffa0694ce9>] ? cl_env_get+0x29/0x350 [obdclass]
      <4> [<ffffffffa0ade23f>] ll_file_aio_read+0x13f/0x2c0 [lustre]
      <4> [<ffffffffa0ade6fc>] ll_file_read+0x16c/0x2a0 [lustre]
      <4> [<ffffffff811816c5>] vfs_read+0xb5/0x1a0
      <4> [<ffffffff81181801>] sys_read+0x51/0x90
      <4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 237806, comm: tail Tainted: G W --------------- 2.6.32-358.el6.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff8150cfc8>] ? panic+0xa7/0x16f
      <4> [<ffffffffa054ceeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa0c620f0>] ? osc_lock_enqueue+0x0/0x8c0 [osc]
      <4> [<ffffffffa0c62820>] ? osc_lock_enqueue+0x730/0x8c0 [osc]
      <4> [<ffffffffa069a337>] ? cl_lock_state_signal+0x87/0x160 [obdclass]
      <4> [<ffffffffa069dcfc>] ? cl_enqueue_try+0xfc/0x300 [obdclass]
      <4> [<ffffffffa0a5c42a>] ? lov_lock_enqueue+0x22a/0x850 [lov]
      <4> [<ffffffffa069dcfc>] ? cl_enqueue_try+0xfc/0x300 [obdclass]
      <4> [<ffffffffa069ef4f>] ? cl_enqueue_locked+0x6f/0x1f0 [obdclass]
      <4> [<ffffffffa069fb9e>] ? cl_lock_request+0x7e/0x270 [obdclass]
      <4> [<ffffffffa06a4b2c>] ? cl_io_lock+0x3cc/0x560 [obdclass]
      <4> [<ffffffffa06a4d62>] ? cl_io_loop+0xa2/0x1b0 [obdclass]
      <4> [<ffffffffa0addca6>] ? ll_file_io_generic+0x2b6/0x710 [lustre]
      <4> [<ffffffffa0694ce9>] ? cl_env_get+0x29/0x350 [obdclass]
      <4> [<ffffffffa0ade23f>] ? ll_file_aio_read+0x13f/0x2c0 [lustre]
      <4> [<ffffffffa0ade6fc>] ? ll_file_read+0x16c/0x2a0 [lustre]
      <4> [<ffffffff811816c5>] ? vfs_read+0xb5/0x1a0
      <4> [<ffffffff81181801>] ? sys_read+0x51/0x90
      <4> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b


          Activity

            [LU-4692] (osc_lock.c:1204:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6
            wjt27 Wojciech Turek added a comment - uploaded to ftp://ftp.whamcloud.com/uploads/LU-4692/

            jay Jinshan Xiong (Inactive) added a comment -

            Hi Turek,

            Can you share the core dump with us? You can upload it to ftp.whamcloud.com, thanks.

            Jinshan

            wjt27 Wojciech Turek added a comment -

            We hit the same problem shortly after upgrading to 2.4.2 clients. We have not seen this on 2.4.1.

            crash> bt
            PID: 18546 TASK: ffff8809f1aa8040 CPU: 0 COMMAND: "turbostream"
            #0 [ffff8809f0cf98a0] machine_kexec at ffffffff81038f3b
            #1 [ffff8809f0cf9900] crash_kexec at ffffffff810c5da2
            #2 [ffff8809f0cf99d0] panic at ffffffff8152721a
            #3 [ffff8809f0cf9a50] lbug_with_loc at ffffffffa0f47eeb [libcfs]
            #4 [ffff8809f0cf9a70] lovsub_lock_state at ffffffffa14b7d51 [lov]
            #5 [ffff8809f0cf9ac0] cl_lock_state_signal at ffffffffa10aedd8 [obdclass]
            #6 [ffff8809f0cf9b00] cl_lock_state_set at ffffffffa10aef25 [obdclass]
            #7 [ffff8809f0cf9b30] cl_enqueue_try at ffffffffa10b2c29 [obdclass]
            #8 [ffff8809f0cf9b80] lov_lock_enqueue at ffffffffa14b00da [lov]
            #9 [ffff8809f0cf9c20] cl_enqueue_try at ffffffffa10b2bdc [obdclass]
            #10 [ffff8809f0cf9c70] cl_enqueue_locked at ffffffffa10b3fcf [obdclass]
            #11 [ffff8809f0cf9cb0] cl_lock_request at ffffffffa10b4c3e [obdclass]
            #12 [ffff8809f0cf9d10] cl_io_lock at ffffffffa10b9edc [obdclass]
            #13 [ffff8809f0cf9d70] cl_io_loop at ffffffffa10ba112 [obdclass]
            #14 [ffff8809f0cf9da0] ll_file_io_generic at ffffffffa152ca90 [lustre]
            #15 [ffff8809f0cf9e20] ll_file_aio_write at ffffffffa152d9d2 [lustre]
            #16 [ffff8809f0cf9e80] ll_file_write at ffffffffa152dcbc [lustre]
            #17 [ffff8809f0cf9ef0] vfs_write at ffffffff81188f88
            #18 [ffff8809f0cf9f30] sys_write at ffffffff81189881
            #19 [ffff8809f0cf9f80] system_call_fastpath at ffffffff8100b072
            RIP: 00007fdf7eeea6fd RSP: 00007fffad51c8b8 RFLAGS: 00000202
            RAX: 0000000000000001 RBX: ffffffff8100b072 RCX: 0000000000000001
            RDX: 00000000000009c4 RSI: 0000000008b32154 RDI: 000000000000001b
            RBP: 0000000013252a00 R8: 00007fdf8069d6a8 R9: 00000000000009c4
            R10: 00007fffad512d3c R11: 0000000000000293 R12: 000000004c00010d
            R13: 00000000000009c4 R14: 0000000000e6d964 R15: 000000000000001b
            ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
            crash> sys
            SYSTEM MAP: /boot/System.map-2.6.32-431.1.2.el6.hpcs.x86_64
            DEBUG KERNEL: /usr/lib/debug/lib/modules/2.6.32-431.1.2.el6.hpcs.x86_64/vmlinux (2.6.32-431.1.2.el6.hpcs.x86_64)
            DUMPFILE: vmcore [PARTIAL DUMP]
            CPUS: 12
            DATE: Wed Mar 12 16:17:17 2014
            UPTIME: 1 days, 23:37:03
            LOAD AVERAGE: 1.99, 1.97, 1.91
            TASKS: 449
            NODENAME: tesla14
            RELEASE: 2.6.32-431.1.2.el6.hpcs.x86_64
            VERSION: #1 SMP Sun Dec 15 04:51:47 GMT 2013
            MACHINE: x86_64 (2599 Mhz)
            MEMORY: 64 GB
            PANIC: "Kernel panic - not syncing: LBUG"


            lixi Li Xi (Inactive) added a comment -

            Hi Andreas,

            Unfortunately, I don't know what test was running. I will let you know as soon as I get more information about it.

            jay Jinshan Xiong (Inactive) added a comment -

            This is probably a different manifestation of LU-4591. It would be really helpful if someone on our team could help reproduce the problem on our side.

            adilger Andreas Dilger added a comment -

            Li Xi, could you please comment on what test was being run to cause this problem?
            bobijam Zhenyu Xu added a comment - edited

            Consider the following scenario: thread1 is starting an IO, thread2 holds the fitting lock (both top_lock and sub_lock), and thread2 is handling an OSC lock cancel AST requested by the server:

            thread1                           thread2                              note
            cl_lock_request()
            +->cl_lock_hold_mutex()
              +-->cl_lock_find()                                                   found the fitting top lock; its sub osc lock is not yet in OLS_CANCELLED
              |
              |                          osc_ldlm_blocking_ast()
              |                          +->osc_ldlm_blocking_ast0()               takes the mutex of the sub lock
              |                          |+->osc_lock_blocking()
              |                          | +->cl_lock_cancel(sub_lock)
              |                          |  +->cl_lock_cancel0(sub_lock)           sets CLF_CANCELLED on sub_lock, then calls clo_cancel() on each slice in reverse order
              |                          |  |+->osc_lock_cancel(sub_osc_lock)      sets the osc_lock to OLS_CANCELLED
              |                          |  +->cl_lock_delete(sub_lock)
              |                          +->cl_lock_mutex_put(sub_lock)
              |
              |
              cl_lock_mutex_get(top_lock)
            

            At this point, top_lock does not have CLF_CANCELLED set, while the sub osc_lock is already in the OLS_CANCELLED state. The standalone sketch below illustrates this window.
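            To make the window concrete, here is a minimal standalone sketch of the check-then-enqueue race, using plain pthreads rather than the CLIO code; all names (toy_lock, toy_fits, toy_blocking_ast, toy_enqueue) are hypothetical stand-ins for the lookup's fitness check, the blocking AST, and osc_lock_enqueue() with its LASSERT:

            /* Toy model of the race described above; not Lustre code. */
            #include <assert.h>
            #include <pthread.h>

            enum toy_state { TOY_NEW, TOY_GRANTED, TOY_CANCELLED };

            struct toy_lock {
                    pthread_mutex_t tl_mutex;
                    enum toy_state  tl_state;
            };

            static struct toy_lock cached_lock = {
                    .tl_mutex = PTHREAD_MUTEX_INITIALIZER,
                    .tl_state = TOY_NEW,
            };

            /* Stands in for osc_lock_fits_into(): the answer is only valid at
             * the instant it is computed. */
            static int toy_fits(struct toy_lock *lk)
            {
                    int fits;

                    pthread_mutex_lock(&lk->tl_mutex);
                    fits = lk->tl_state < TOY_CANCELLED;
                    pthread_mutex_unlock(&lk->tl_mutex);
                    return fits;    /* may already be stale for the caller */
            }

            /* Stands in for the blocking-AST path ending in osc_lock_cancel(). */
            static void *toy_blocking_ast(void *arg)
            {
                    struct toy_lock *lk = arg;

                    pthread_mutex_lock(&lk->tl_mutex);
                    lk->tl_state = TOY_CANCELLED;
                    pthread_mutex_unlock(&lk->tl_mutex);
                    return NULL;
            }

            /* Stands in for osc_lock_enqueue() and its state assertion. */
            static void toy_enqueue(struct toy_lock *lk)
            {
                    pthread_mutex_lock(&lk->tl_mutex);
                    assert(lk->tl_state == TOY_NEW);  /* fires if cancel won the race */
                    lk->tl_state = TOY_GRANTED;
                    pthread_mutex_unlock(&lk->tl_mutex);
            }

            int main(void)
            {
                    pthread_t ast;

                    /* thread1: the lookup's fitness check succeeds ... */
                    if (toy_fits(&cached_lock)) {
                            /* ... then thread2's cancel AST runs in the window
                             * between the check and the enqueue ... */
                            pthread_create(&ast, NULL, toy_blocking_ast, &cached_lock);
                            pthread_join(ast, NULL);

                            /* ... and thread1 enqueues the "fitting" lock, tripping
                             * the assertion, which mirrors the LBUG in this ticket. */
                            toy_enqueue(&cached_lock);
                    }
                    return 0;
            }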


            jay Jinshan Xiong (Inactive) added a comment -

            The lock will be held in this case, so it won't be cancelled.
            bobijam Zhenyu Xu added a comment -

            When an IO starts, it checks whether an existing lock matches what the IO needs; the calling path is

            cl_lockset_lock_one()=>cl_lock_request()=>cl_lock_hold_mutex()=>cl_lock_find()=>cl_lock_lookup()

                            matched = cl_lock_ext_match(&lock->cll_descr, need) &&
                                      lock->cll_state < CLS_FREEING &&
                                      lock->cll_error == 0 &&
                                      !(lock->cll_flags & CLF_CANCELLED) &&
                                      cl_lock_fits_into(env, lock, need, io);
            

            and in osc_lock_fits_into()

                    struct osc_lock *ols = cl2osc_lock(slice);
                                      
                    if (need->cld_enq_flags & CEF_NEVER)
                            return 0;
            
                    if (ols->ols_state >= OLS_CANCELLED)
                            return 0;
            

            This means that after the osc_lock_fits_into() check passed, the osc_lock was cancelled by another thread, so the new IO thinks it has found a fitting lock; when it then tries to enqueue that lock, it finds the osc_lock in OLS_CANCELLED instead of OLS_NEW (see the illustrative sketch below).
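            As an illustration only (this ticket was resolved as a duplicate, so the real fix lives elsewhere), the generic remedy for such a window is to re-check the state under the lock's mutex at the point of use and return a retry error instead of asserting. The sketch below repeats the toy types from the earlier sketch so it stands alone; toy_enqueue_checked() is a hypothetical name, not a Lustre function:

            #include <errno.h>
            #include <pthread.h>

            enum toy_state { TOY_NEW, TOY_GRANTED, TOY_CANCELLED };

            struct toy_lock {
                    pthread_mutex_t tl_mutex;
                    enum toy_state  tl_state;
            };

            /* Re-validate the state while holding the mutex, so the decision and
             * the state change are atomic with respect to a concurrent cancel. */
            int toy_enqueue_checked(struct toy_lock *lk)
            {
                    int rc;

                    pthread_mutex_lock(&lk->tl_mutex);
                    if (lk->tl_state != TOY_NEW) {
                            /* Cancelled (or otherwise reused) since the lookup:
                             * ask the caller to drop this lock and find or create
                             * a fresh one rather than tripping an assertion. */
                            rc = -EAGAIN;
                    } else {
                            /* Still NEW, and it cannot change while we hold the
                             * mutex, so it is safe to proceed with the enqueue. */
                            lk->tl_state = TOY_GRANTED;
                            rc = 0;
                    }
                    pthread_mutex_unlock(&lk->tl_mutex);
                    return rc;
            }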

            pjones Peter Jones added a comment -

            Bobijam,

            Could you please assess this issue reported against master?

            Thanks

            Peter

            People

              bobijam Zhenyu Xu
              lixi Li Xi (Inactive)