Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-1217

ASSERTION(cfs_atomic_read(&lock->cll_ref) > 1 || olock->ols_hold == 0) failed

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.3.0, Lustre 2.1.2
    • Lustre 2.1.0
    • None
    • 3
    • 4609

    Description

      system crashed with following assertion

      LustreError: 35225:0:(osc_page.c:293:osc_page_putref_lock()) ASSERTION(cfs_atomic_read(&lock->cll_ref) > 1 || olock->ols_hold == 0) failed
      LustreError: 35225:0:(osc_page.c:293:osc_page_putref_lock()) LBUG
      Pid: 35225, comm: ptlrpcd_6

      Call Trace:
      [<ffffffffa0369855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [<ffffffffa0369e95>] lbug_with_loc+0x75/0xe0 [libcfs]
      [<ffffffffa0375fe6>] libcfs_assertion_failed+0x66/0x70 [libcfs]
      [<ffffffffa064ca06>] osc_page_putref_lock+0xb6/0xc0 [osc]
      [<ffffffffa064ca28>] osc_page_completion_read+0x18/0x20 [osc]
      [<ffffffffa04555af>] cl_page_completion+0x21f/0x550 [obdclass]
      [<ffffffffa045610b>] ? cl_page_put+0x1ab/0x440 [obdclass]
      [<ffffffffa053b604>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
      [<ffffffffa045f1e8>] ? cl_req_page_done+0x38/0x180 [obdclass]
      [<ffffffffa064d145>] osc_completion+0x165/0x380 [osc]
      [<ffffffffa055d734>] ? at_measured+0x114/0x360 [ptlrpc]
      [<ffffffffa054099e>] ? ptlrpc_master_callback+0x3e/0xc0 [ptlrpc]
      [<ffffffffa0634f48>] osc_ap_completion+0x128/0x300 [osc]
      [<ffffffffa0641cb9>] brw_interpret+0x4a9/0x710 [osc]
      [<ffffffffa052c066>] ? after_reply+0x7b6/0xe30 [ptlrpc]
      [<ffffffff810017ac>] ? __switch_to+0x1ac/0x320
      [<ffffffff81041760>] ? fair___dequeue_entity+0x30/0x50
      [<ffffffffa0530c0a>] ptlrpc_check_set+0x48a/0x1ba0 [ptlrpc]
      [<ffffffff810657ec>] ? lock_timer_base+0x3c/0x70
      [<ffffffffa0565147>] ptlrpcd_check+0x547/0x560 [ptlrpc]
      [<ffffffffa056543c>] ptlrpcd+0x2dc/0x460 [ptlrpc]
      [<ffffffff8104c780>] ? default_wake_function+0x0/0x20
      [<ffffffffa0565160>] ? ptlrpcd+0x0/0x460 [ptlrpc]
      [<ffffffff810041aa>] child_rip+0xa/0x20
      [<ffffffffa0565160>] ? ptlrpcd+0x0/0x460 [ptlrpc]
      [<ffffffffa0565160>] ? ptlrpcd+0x0/0x460 [ptlrpc]
      [<ffffffff810041a0>] ? child_rip+0x0/0x20

      Unless I did something really wrong,
      cfs_atomic_read(&lock->cll_ref) = cfs_atomic_read(0xffff881dea5c9e78) = 258

      At time of crash, the system was higly loaded with a lot of interactive commands and did suffer of disconnection from some OST & MDT.

      Attachments

        Issue Links

          Activity

            [LU-1217] ASSERTION(cfs_atomic_read(&lock->cll_ref) > 1 || olock->ols_hold == 0) failed

            OK, but it will be several weeks before we can update this system, and even then the bug may be rare to reproduce.

            nedbass Ned Bass (Inactive) added a comment - OK, but it will be several weeks before we can update this system, and even then the bug may be rare to reproduce.

            let's try to print the lock state precisely with patch: http://review.whamcloud.com/#change,2383

            jay Jinshan Xiong (Inactive) added a comment - let's try to print the lock state precisely with patch: http://review.whamcloud.com/#change,2383

            first, we need to know the troublesome cl_lock's content.

            I've only been able to find one address on the stack that looks like a cl_lock. But, with lock->cll_ref = 99 it shouldn't have hit the assertion. Maybe something touched that memory after the atomic read?

            struct cl_lock {
              cll_ref = { 
                counter = 99
              },  
              cll_layers = { 
                next = 0xffff88...
                prev = 0xffff88...
              },  
              cll_linkage = { 
                next = 0xffff88...
                prev = 0xffff88...
              },  
              cll_descr = { 
                cld_obj = 0xffff88...
                cld_start = 0,
                cld_end = 18446744073709551615,
                cld_gid = 0,
                cld_mode = CLM_READ,
                cld_enq_flags = 0 
              },  
              cll_state = CLS_HELD,
              cll_wq = { 
                lock = { 
                   raw_lock = { 
                      slock = 169675293
                   }   
                },  
                task_list = { 
                  next = 0xffff88...
                  prev = 0xffff88...
                }   
              },  
              cll_guard = { 
                count = { 
                  counter = 1 
                },  
                wait_lock = { 
                   raw_lock = { 
                      slock = 1179666
                   }   
                },  
                wait_list = { 
                  next = 0xffff88...
                  prev = 0xffff88...
                },  
                owner = 0x0 
              },  
              cll_guarder = 0x0,
              cll_depth = 0,
              cll_intransit_owner = 0x0,
              cll_error = 0,
              cll_holds = 1,
              cll_users = 1,
              cll_flags = 0,
              cll_inclosure = { 
                  next = 0xffff88...
                  prev = 0xffff88...
              },  
              cll_conflict = 0x0,
              cll_reference = {<No data fields>},
              cll_holders = {<No data fields>},
              cll_obj_ref = 0x0 
            }
            
            nedbass Ned Bass (Inactive) added a comment - first, we need to know the troublesome cl_lock's content. I've only been able to find one address on the stack that looks like a cl_lock. But, with lock->cll_ref = 99 it shouldn't have hit the assertion. Maybe something touched that memory after the atomic read? struct cl_lock { cll_ref = { counter = 99 }, cll_layers = { next = 0xffff88... prev = 0xffff88... }, cll_linkage = { next = 0xffff88... prev = 0xffff88... }, cll_descr = { cld_obj = 0xffff88... cld_start = 0, cld_end = 18446744073709551615, cld_gid = 0, cld_mode = CLM_READ, cld_enq_flags = 0 }, cll_state = CLS_HELD, cll_wq = { lock = { raw_lock = { slock = 169675293 } }, task_list = { next = 0xffff88... prev = 0xffff88... } }, cll_guard = { count = { counter = 1 }, wait_lock = { raw_lock = { slock = 1179666 } }, wait_list = { next = 0xffff88... prev = 0xffff88... }, owner = 0x0 }, cll_guarder = 0x0, cll_depth = 0, cll_intransit_owner = 0x0, cll_error = 0, cll_holds = 1, cll_users = 1, cll_flags = 0, cll_inclosure = { next = 0xffff88... prev = 0xffff88... }, cll_conflict = 0x0, cll_reference = {<No data fields>}, cll_holders = {<No data fields>}, cll_obj_ref = 0x0 }
            bobijam Zhenyu Xu added a comment -

            first, we need to know the troublesome cl_lock's content.

            bobijam Zhenyu Xu added a comment - first, we need to know the troublesome cl_lock's content.

            We hit this today as well on one of our classified 2.1 clients. We have a crash dump, but can't upload it. I'm happy to do any analysis of the dump at your direction.

            nedbass Ned Bass (Inactive) added a comment - We hit this today as well on one of our classified 2.1 clients. We have a crash dump, but can't upload it. I'm happy to do any analysis of the dump at your direction.
            bobijam Zhenyu Xu added a comment -

            Would you mind uploading crash image/kernel image/symbol files to our ftp site?

            bobijam Zhenyu Xu added a comment - Would you mind uploading crash image/kernel image/symbol files to our ftp site?
            pjones Peter Jones added a comment -

            Alexandre

            You mentioned on yesterday's call that it would be possible to supply a crash dump for this ticket. That would definitely help expedite us understanding this scenario.

            Thanks

            Peter

            pjones Peter Jones added a comment - Alexandre You mentioned on yesterday's call that it would be possible to supply a crash dump for this ticket. That would definitely help expedite us understanding this scenario. Thanks Peter

            it's impossible for this atomic count to be 258 if it hit this assertion. I don't know why it's `cfs_atomic_read(&lock->cll_ref) > 1' instead of `> 0'.

            jay Jinshan Xiong (Inactive) added a comment - it's impossible for this atomic count to be 258 if it hit this assertion. I don't know why it's `cfs_atomic_read(&lock->cll_ref) > 1' instead of `> 0'.
            pjones Peter Jones added a comment -

            Bobi

            Could you please look into this one?

            Thanks

            Peter

            pjones Peter Jones added a comment - Bobi Could you please look into this one? Thanks Peter

            People

              jay Jinshan Xiong (Inactive)
              louveta Alexandre Louvet (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: