Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-1217

ASSERTION(cfs_atomic_read(&lock->cll_ref) > 1 || olock->ols_hold == 0) failed

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.3.0, Lustre 2.1.2
    • Lustre 2.1.0
    • None
    • 3
    • 4609

    Description

      system crashed with following assertion

      LustreError: 35225:0:(osc_page.c:293:osc_page_putref_lock()) ASSERTION(cfs_atomic_read(&lock->cll_ref) > 1 || olock->ols_hold == 0) failed
      LustreError: 35225:0:(osc_page.c:293:osc_page_putref_lock()) LBUG
      Pid: 35225, comm: ptlrpcd_6

      Call Trace:
      [<ffffffffa0369855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [<ffffffffa0369e95>] lbug_with_loc+0x75/0xe0 [libcfs]
      [<ffffffffa0375fe6>] libcfs_assertion_failed+0x66/0x70 [libcfs]
      [<ffffffffa064ca06>] osc_page_putref_lock+0xb6/0xc0 [osc]
      [<ffffffffa064ca28>] osc_page_completion_read+0x18/0x20 [osc]
      [<ffffffffa04555af>] cl_page_completion+0x21f/0x550 [obdclass]
      [<ffffffffa045610b>] ? cl_page_put+0x1ab/0x440 [obdclass]
      [<ffffffffa053b604>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
      [<ffffffffa045f1e8>] ? cl_req_page_done+0x38/0x180 [obdclass]
      [<ffffffffa064d145>] osc_completion+0x165/0x380 [osc]
      [<ffffffffa055d734>] ? at_measured+0x114/0x360 [ptlrpc]
      [<ffffffffa054099e>] ? ptlrpc_master_callback+0x3e/0xc0 [ptlrpc]
      [<ffffffffa0634f48>] osc_ap_completion+0x128/0x300 [osc]
      [<ffffffffa0641cb9>] brw_interpret+0x4a9/0x710 [osc]
      [<ffffffffa052c066>] ? after_reply+0x7b6/0xe30 [ptlrpc]
      [<ffffffff810017ac>] ? __switch_to+0x1ac/0x320
      [<ffffffff81041760>] ? fair___dequeue_entity+0x30/0x50
      [<ffffffffa0530c0a>] ptlrpc_check_set+0x48a/0x1ba0 [ptlrpc]
      [<ffffffff810657ec>] ? lock_timer_base+0x3c/0x70
      [<ffffffffa0565147>] ptlrpcd_check+0x547/0x560 [ptlrpc]
      [<ffffffffa056543c>] ptlrpcd+0x2dc/0x460 [ptlrpc]
      [<ffffffff8104c780>] ? default_wake_function+0x0/0x20
      [<ffffffffa0565160>] ? ptlrpcd+0x0/0x460 [ptlrpc]
      [<ffffffff810041aa>] child_rip+0xa/0x20
      [<ffffffffa0565160>] ? ptlrpcd+0x0/0x460 [ptlrpc]
      [<ffffffffa0565160>] ? ptlrpcd+0x0/0x460 [ptlrpc]
      [<ffffffff810041a0>] ? child_rip+0x0/0x20

      Unless I did something really wrong,
      cfs_atomic_read(&lock->cll_ref) = cfs_atomic_read(0xffff881dea5c9e78) = 258

      At time of crash, the system was higly loaded with a lot of interactive commands and did suffer of disconnection from some OST & MDT.

      Attachments

        Issue Links

          Activity

            [LU-1217] ASSERTION(cfs_atomic_read(&lock->cll_ref) > 1 || olock->ols_hold == 0) failed
            pjones Peter Jones added a comment -

            Landed for 2.1.2 and 2.3

            pjones Peter Jones added a comment - Landed for 2.1.2 and 2.3
            jay Jinshan Xiong (Inactive) added a comment - Please apply patch: http://review.whamcloud.com/2604 .

            Ned, thank you for the information.

            I realize this problem is a race between lock use and assertion check as following:
            1. this page held the last refcount of the lock;
            2. in osc_page_putref_lock(), cfs_atomic_read(&lock->cll_ref) == 1 was true, but it was delayed for a while before checking ols_hold due to interrupt...
            3. another proc calls cl_use() to bring this lock into CLS_HELD state, so ols_hold is set;
            4. osc_page_putref_lock() continued to check ols_hold, definitely it was set and of course lock's refcount should be greater 1 for sure at this moment.

            Fix this problem is easy - just to remove the assertion.

            jay Jinshan Xiong (Inactive) added a comment - Ned, thank you for the information. I realize this problem is a race between lock use and assertion check as following: 1. this page held the last refcount of the lock; 2. in osc_page_putref_lock(), cfs_atomic_read(&lock->cll_ref) == 1 was true, but it was delayed for a while before checking ols_hold due to interrupt... 3. another proc calls cl_use() to bring this lock into CLS_HELD state, so ols_hold is set; 4. osc_page_putref_lock() continued to check ols_hold, definitely it was set and of course lock's refcount should be greater 1 for sure at this moment. Fix this problem is easy - just to remove the assertion.
            lock@ffff8803948bc978 [2 3 0 1 1 00000000] R(1):[0, 18446744073709551615]@[0x100cf000:0x3e138b:0x0] {
                lovsub@ffff8804340bd9a0 [0 ffff8804340db9a0: [ 0 ffff8801679d2d60 R(1):[256, 18446744073709551615]@[0x3f5c87bae:0x6a:0x0]]
                osc@ffff88041a6671f8: ffff880429e7eb40 00101001 0xa29f718631a3c48d 3 ffff8803b773f2b8 size: 367001600 mtime: 1335282921 atime: 1335282807 ctime: 1335282921 blocks: 716808
            } lock@ffff8803948bc978
            olock: ffff88041a6671f8, hold: 1.
            
            PID: 14280 COMMAND: "ptlrpcd-brw"
            machine_kexec
            crash_kexec
            panic
            lbug_with_loc
            libcfs_assertion_failed
            osc_page_putref_lock
            osc_page_completion_read
            cl_page_completion
            osc_completion
            osc_ap_completion
            brw_interpret
            ptlrpc_check_set
            ptlrpcd_check
            ptlrpcd
            kernel_thread
            
            nedbass Ned Bass (Inactive) added a comment - lock@ffff8803948bc978 [2 3 0 1 1 00000000] R(1):[0, 18446744073709551615]@[0x100cf000:0x3e138b:0x0] { lovsub@ffff8804340bd9a0 [0 ffff8804340db9a0: [ 0 ffff8801679d2d60 R(1):[256, 18446744073709551615]@[0x3f5c87bae:0x6a:0x0]] osc@ffff88041a6671f8: ffff880429e7eb40 00101001 0xa29f718631a3c48d 3 ffff8803b773f2b8 size: 367001600 mtime: 1335282921 atime: 1335282807 ctime: 1335282921 blocks: 716808 } lock@ffff8803948bc978 olock: ffff88041a6671f8, hold: 1. PID: 14280 COMMAND: "ptlrpcd-brw" machine_kexec crash_kexec panic lbug_with_loc libcfs_assertion_failed osc_page_putref_lock osc_page_completion_read cl_page_completion osc_completion osc_ap_completion brw_interpret ptlrpc_check_set ptlrpcd_check ptlrpcd kernel_thread

            Yes, I knew this from your previous comment either. Let's do it after you guys come back from LUG.

            jay Jinshan Xiong (Inactive) added a comment - Yes, I knew this from your previous comment either. Let's do it after you guys come back from LUG.

            Ah, yes, I see all values that cl_lock_print includes. We'll need Ned to copy those into this ticket when he gets back from LUG.

            morrone Christopher Morrone (Inactive) added a comment - Ah, yes, I see all values that cl_lock_print includes. We'll need Ned to copy those into this ticket when he gets back from LUG.

            I will need the detailed info of the lock to address the problem.

            jay Jinshan Xiong (Inactive) added a comment - I will need the detailed info of the lock to address the problem.

            A sysadmin got a hit on a client on our secure network with your debugging patch applied. He could not cut-and-paste. What I know so far is just that your patch reported that the olock pointer is non-null and hold is 1. If there is something specific that you want from CL_LOCK_DEBUG, let me know and I'll either have the admin transcribe it or a developer can do it when we return from LUG.

            morrone Christopher Morrone (Inactive) added a comment - - edited A sysadmin got a hit on a client on our secure network with your debugging patch applied. He could not cut-and-paste. What I know so far is just that your patch reported that the olock pointer is non-null and hold is 1. If there is something specific that you want from CL_LOCK_DEBUG, let me know and I'll either have the admin transcribe it or a developer can do it when we return from LUG.

            People

              jay Jinshan Xiong (Inactive)
              louveta Alexandre Louvet (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: