Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-1217

ASSERTION(cfs_atomic_read(&lock->cll_ref) > 1 || olock->ols_hold == 0) failed

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.3.0, Lustre 2.1.2
    • Lustre 2.1.0
    • None
    • 3
    • 4609

    Description

      system crashed with following assertion

      LustreError: 35225:0:(osc_page.c:293:osc_page_putref_lock()) ASSERTION(cfs_atomic_read(&lock->cll_ref) > 1 || olock->ols_hold == 0) failed
      LustreError: 35225:0:(osc_page.c:293:osc_page_putref_lock()) LBUG
      Pid: 35225, comm: ptlrpcd_6

      Call Trace:
      [<ffffffffa0369855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [<ffffffffa0369e95>] lbug_with_loc+0x75/0xe0 [libcfs]
      [<ffffffffa0375fe6>] libcfs_assertion_failed+0x66/0x70 [libcfs]
      [<ffffffffa064ca06>] osc_page_putref_lock+0xb6/0xc0 [osc]
      [<ffffffffa064ca28>] osc_page_completion_read+0x18/0x20 [osc]
      [<ffffffffa04555af>] cl_page_completion+0x21f/0x550 [obdclass]
      [<ffffffffa045610b>] ? cl_page_put+0x1ab/0x440 [obdclass]
      [<ffffffffa053b604>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
      [<ffffffffa045f1e8>] ? cl_req_page_done+0x38/0x180 [obdclass]
      [<ffffffffa064d145>] osc_completion+0x165/0x380 [osc]
      [<ffffffffa055d734>] ? at_measured+0x114/0x360 [ptlrpc]
      [<ffffffffa054099e>] ? ptlrpc_master_callback+0x3e/0xc0 [ptlrpc]
      [<ffffffffa0634f48>] osc_ap_completion+0x128/0x300 [osc]
      [<ffffffffa0641cb9>] brw_interpret+0x4a9/0x710 [osc]
      [<ffffffffa052c066>] ? after_reply+0x7b6/0xe30 [ptlrpc]
      [<ffffffff810017ac>] ? __switch_to+0x1ac/0x320
      [<ffffffff81041760>] ? fair___dequeue_entity+0x30/0x50
      [<ffffffffa0530c0a>] ptlrpc_check_set+0x48a/0x1ba0 [ptlrpc]
      [<ffffffff810657ec>] ? lock_timer_base+0x3c/0x70
      [<ffffffffa0565147>] ptlrpcd_check+0x547/0x560 [ptlrpc]
      [<ffffffffa056543c>] ptlrpcd+0x2dc/0x460 [ptlrpc]
      [<ffffffff8104c780>] ? default_wake_function+0x0/0x20
      [<ffffffffa0565160>] ? ptlrpcd+0x0/0x460 [ptlrpc]
      [<ffffffff810041aa>] child_rip+0xa/0x20
      [<ffffffffa0565160>] ? ptlrpcd+0x0/0x460 [ptlrpc]
      [<ffffffffa0565160>] ? ptlrpcd+0x0/0x460 [ptlrpc]
      [<ffffffff810041a0>] ? child_rip+0x0/0x20

      Unless I did something really wrong,
      cfs_atomic_read(&lock->cll_ref) = cfs_atomic_read(0xffff881dea5c9e78) = 258

      At time of crash, the system was higly loaded with a lot of interactive commands and did suffer of disconnection from some OST & MDT.

      Attachments

        Issue Links

          Activity

            [LU-1217] ASSERTION(cfs_atomic_read(&lock->cll_ref) > 1 || olock->ols_hold == 0) failed
            pjones Peter Jones added a comment -

            Landed for 2.1.2 and 2.3

            pjones Peter Jones added a comment - Landed for 2.1.2 and 2.3
            jay Jinshan Xiong (Inactive) added a comment - Please apply patch: http://review.whamcloud.com/2604 .

            Ned, thank you for the information.

            I realize this problem is a race between lock use and assertion check as following:
            1. this page held the last refcount of the lock;
            2. in osc_page_putref_lock(), cfs_atomic_read(&lock->cll_ref) == 1 was true, but it was delayed for a while before checking ols_hold due to interrupt...
            3. another proc calls cl_use() to bring this lock into CLS_HELD state, so ols_hold is set;
            4. osc_page_putref_lock() continued to check ols_hold, definitely it was set and of course lock's refcount should be greater 1 for sure at this moment.

            Fix this problem is easy - just to remove the assertion.

            jay Jinshan Xiong (Inactive) added a comment - Ned, thank you for the information. I realize this problem is a race between lock use and assertion check as following: 1. this page held the last refcount of the lock; 2. in osc_page_putref_lock(), cfs_atomic_read(&lock->cll_ref) == 1 was true, but it was delayed for a while before checking ols_hold due to interrupt... 3. another proc calls cl_use() to bring this lock into CLS_HELD state, so ols_hold is set; 4. osc_page_putref_lock() continued to check ols_hold, definitely it was set and of course lock's refcount should be greater 1 for sure at this moment. Fix this problem is easy - just to remove the assertion.
            lock@ffff8803948bc978 [2 3 0 1 1 00000000] R(1):[0, 18446744073709551615]@[0x100cf000:0x3e138b:0x0] {
                lovsub@ffff8804340bd9a0 [0 ffff8804340db9a0: [ 0 ffff8801679d2d60 R(1):[256, 18446744073709551615]@[0x3f5c87bae:0x6a:0x0]]
                osc@ffff88041a6671f8: ffff880429e7eb40 00101001 0xa29f718631a3c48d 3 ffff8803b773f2b8 size: 367001600 mtime: 1335282921 atime: 1335282807 ctime: 1335282921 blocks: 716808
            } lock@ffff8803948bc978
            olock: ffff88041a6671f8, hold: 1.
            
            PID: 14280 COMMAND: "ptlrpcd-brw"
            machine_kexec
            crash_kexec
            panic
            lbug_with_loc
            libcfs_assertion_failed
            osc_page_putref_lock
            osc_page_completion_read
            cl_page_completion
            osc_completion
            osc_ap_completion
            brw_interpret
            ptlrpc_check_set
            ptlrpcd_check
            ptlrpcd
            kernel_thread
            
            nedbass Ned Bass (Inactive) added a comment - lock@ffff8803948bc978 [2 3 0 1 1 00000000] R(1):[0, 18446744073709551615]@[0x100cf000:0x3e138b:0x0] { lovsub@ffff8804340bd9a0 [0 ffff8804340db9a0: [ 0 ffff8801679d2d60 R(1):[256, 18446744073709551615]@[0x3f5c87bae:0x6a:0x0]] osc@ffff88041a6671f8: ffff880429e7eb40 00101001 0xa29f718631a3c48d 3 ffff8803b773f2b8 size: 367001600 mtime: 1335282921 atime: 1335282807 ctime: 1335282921 blocks: 716808 } lock@ffff8803948bc978 olock: ffff88041a6671f8, hold: 1. PID: 14280 COMMAND: "ptlrpcd-brw" machine_kexec crash_kexec panic lbug_with_loc libcfs_assertion_failed osc_page_putref_lock osc_page_completion_read cl_page_completion osc_completion osc_ap_completion brw_interpret ptlrpc_check_set ptlrpcd_check ptlrpcd kernel_thread

            Yes, I knew this from your previous comment either. Let's do it after you guys come back from LUG.

            jay Jinshan Xiong (Inactive) added a comment - Yes, I knew this from your previous comment either. Let's do it after you guys come back from LUG.

            Ah, yes, I see all values that cl_lock_print includes. We'll need Ned to copy those into this ticket when he gets back from LUG.

            morrone Christopher Morrone (Inactive) added a comment - Ah, yes, I see all values that cl_lock_print includes. We'll need Ned to copy those into this ticket when he gets back from LUG.

            I will need the detailed info of the lock to address the problem.

            jay Jinshan Xiong (Inactive) added a comment - I will need the detailed info of the lock to address the problem.

            A sysadmin got a hit on a client on our secure network with your debugging patch applied. He could not cut-and-paste. What I know so far is just that your patch reported that the olock pointer is non-null and hold is 1. If there is something specific that you want from CL_LOCK_DEBUG, let me know and I'll either have the admin transcribe it or a developer can do it when we return from LUG.

            morrone Christopher Morrone (Inactive) added a comment - - edited A sysadmin got a hit on a client on our secure network with your debugging patch applied. He could not cut-and-paste. What I know so far is just that your patch reported that the olock pointer is non-null and hold is 1. If there is something specific that you want from CL_LOCK_DEBUG, let me know and I'll either have the admin transcribe it or a developer can do it when we return from LUG.

            Hi Jinshan, we have applied the patch to our tree and it will by rolled out in the coming weeks.

            nedbass Ned Bass (Inactive) added a comment - Hi Jinshan, we have applied the patch to our tree and it will by rolled out in the coming weeks.

            People

              jay Jinshan Xiong (Inactive)
              louveta Alexandre Louvet (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: