[LU-1217] ASSERTION(cfs_atomic_read(&lock->cll_ref) > 1 || olock->ols_hold == 0) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.3.0, Lustre 2.1.2
    • Affects Version/s: Lustre 2.1.0
    • Labels: None
    • Severity: 3
    • 4609

    Description

      The system crashed with the following assertion:

      LustreError: 35225:0:(osc_page.c:293:osc_page_putref_lock()) ASSERTION(cfs_atomic_read(&lock->cll_ref) > 1 || olock->ols_hold == 0) failed
      LustreError: 35225:0:(osc_page.c:293:osc_page_putref_lock()) LBUG
      Pid: 35225, comm: ptlrpcd_6

      Call Trace:
      [<ffffffffa0369855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [<ffffffffa0369e95>] lbug_with_loc+0x75/0xe0 [libcfs]
      [<ffffffffa0375fe6>] libcfs_assertion_failed+0x66/0x70 [libcfs]
      [<ffffffffa064ca06>] osc_page_putref_lock+0xb6/0xc0 [osc]
      [<ffffffffa064ca28>] osc_page_completion_read+0x18/0x20 [osc]
      [<ffffffffa04555af>] cl_page_completion+0x21f/0x550 [obdclass]
      [<ffffffffa045610b>] ? cl_page_put+0x1ab/0x440 [obdclass]
      [<ffffffffa053b604>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
      [<ffffffffa045f1e8>] ? cl_req_page_done+0x38/0x180 [obdclass]
      [<ffffffffa064d145>] osc_completion+0x165/0x380 [osc]
      [<ffffffffa055d734>] ? at_measured+0x114/0x360 [ptlrpc]
      [<ffffffffa054099e>] ? ptlrpc_master_callback+0x3e/0xc0 [ptlrpc]
      [<ffffffffa0634f48>] osc_ap_completion+0x128/0x300 [osc]
      [<ffffffffa0641cb9>] brw_interpret+0x4a9/0x710 [osc]
      [<ffffffffa052c066>] ? after_reply+0x7b6/0xe30 [ptlrpc]
      [<ffffffff810017ac>] ? __switch_to+0x1ac/0x320
      [<ffffffff81041760>] ? fair___dequeue_entity+0x30/0x50
      [<ffffffffa0530c0a>] ptlrpc_check_set+0x48a/0x1ba0 [ptlrpc]
      [<ffffffff810657ec>] ? lock_timer_base+0x3c/0x70
      [<ffffffffa0565147>] ptlrpcd_check+0x547/0x560 [ptlrpc]
      [<ffffffffa056543c>] ptlrpcd+0x2dc/0x460 [ptlrpc]
      [<ffffffff8104c780>] ? default_wake_function+0x0/0x20
      [<ffffffffa0565160>] ? ptlrpcd+0x0/0x460 [ptlrpc]
      [<ffffffff810041aa>] child_rip+0xa/0x20
      [<ffffffffa0565160>] ? ptlrpcd+0x0/0x460 [ptlrpc]
      [<ffffffffa0565160>] ? ptlrpcd+0x0/0x460 [ptlrpc]
      [<ffffffff810041a0>] ? child_rip+0x0/0x20

      Unless I did something really wrong,
      cfs_atomic_read(&lock->cll_ref) = cfs_atomic_read(0xffff881dea5c9e78) = 258

      At the time of the crash, the system was highly loaded with a lot of interactive commands and suffered disconnections from some OSTs and the MDT.
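
      To make the failed check concrete, here is a minimal stand-alone C sketch (plain ints in place of the real cl_lock/osc_lock structures; model_lock, model_olock and putref_check are made-up names) that evaluates the asserted condition with the values quoted above:

      #include <stdio.h>

      /* Stand-ins for the real fields: cfs_atomic_t cll_ref in struct cl_lock
       * and the ols_hold flag in struct osc_lock. */
      struct model_lock  { int cll_ref; };
      struct model_olock { int ols_hold; };

      /* The condition from the LASSERT in osc_page_putref_lock(): the lock must
       * either still be referenced elsewhere (cll_ref > 1) or no longer be held
       * at the OSC layer (ols_hold == 0). */
      static int putref_check(const struct model_lock *lock,
                              const struct model_olock *olock)
      {
              return lock->cll_ref > 1 || olock->ols_hold == 0;
      }

      int main(void)
      {
              struct model_lock  lock  = { .cll_ref  = 258 };  /* value found in memory */
              struct model_olock olock = { .ols_hold = 1 };

              printf("check passes: %d\n", putref_check(&lock, &olock));  /* prints 1 */

              /* The only way the LASSERT can fire is cll_ref <= 1 while
               * ols_hold is still 1 at the instant of the read. */
              lock.cll_ref = 1;
              printf("check passes: %d\n", putref_check(&lock, &olock));  /* prints 0 */
              return 0;
      }

      With the value found in memory (258) the check passes, so for the LBUG to have happened, the value seen by the LASSERT at that instant must have differed from what was later found in the dump.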


          Activity

            [LU-1217] ASSERTION(cfs_atomic_read(&lock->cll_ref) > 1 || olock->ols_hold == 0) failed

            morrone Christopher Morrone (Inactive) added a comment - Ah, yes, I see all values that cl_lock_print includes. We'll need Ned to copy those into this ticket when he gets back from LUG.

            jay Jinshan Xiong (Inactive) added a comment - I will need detailed information about the lock to address the problem.

            morrone Christopher Morrone (Inactive) added a comment (edited) - A sysadmin got a hit on a client on our secure network with your debugging patch applied. He could not cut-and-paste. What I know so far is just that your patch reported that the olock pointer is non-null and hold is 1. If there is something specific that you want from CL_LOCK_DEBUG, let me know and I'll either have the admin transcribe it or a developer can do it when we return from LUG.

            nedbass Ned Bass (Inactive) added a comment - Hi Jinshan, we have applied the patch to our tree and it will be rolled out in the coming weeks.

            jay Jinshan Xiong (Inactive) added a comment - I have no idea what happened. Can you please apply the patch I mentioned so that we can see exactly what the lock state was?

            morrone Christopher Morrone (Inactive) added a comment - No, we do not have multi-threaded ptlrpcd.

            https://github.com/chaos/lustre/tree/2.1.0-llnl
            green Oleg Drokin added a comment - Ned, do you also apply the multi-threaded ptlrpcd patch on top of your 2.1 tree?

            jay Jinshan Xiong (Inactive) added a comment - From the attached crash dump, the lock looks okay; I don't know why the assertion fired.

            crash> p *(struct osc_lock *)0xffff881fc381a140
            $28 = {
              ols_cl = {
                cls_lock = 0xffff881dea5c9e78, 
                cls_obj = 0xffff881f30d9e4e0, 
                cls_ops = 0xffffffffa06537e0, 
                cls_linkage = {
                  next = 0xffff881dea5c9e80, 
                  prev = 0xffff881e8c1e8938
                }
              }, 
              ols_lock = 0xffff881e974df240, 
              ols_lvb = {
                lvb_size = 1443889168, 
                lvb_mtime = 1329614555, 
                lvb_atime = 1329614548, 
                lvb_ctime = 1329614555, 
                lvb_blocks = 2820112
              }, 
              ols_flags = 1052673, 
              ols_handle = {
                cookie = 8399946539139540594
              }, 
              ols_einfo = {
                ei_type = 11, 
                ei_mode = 4, 
                ei_cb_bl = 0xffffffffa064ee10, 
                ei_cb_cp = 0xffffffffa064fcb0, 
                ei_cb_gl = 0xffffffffa064ea50, 
                ei_cb_wg = 0xffffffffa064e7a0, 
                ei_cbdata = 0xffff881fc381a140, 
                ei_async = 0
              }, 
              ols_state = OLS_GRANTED, 
              ols_pageref = {
                counter = 256
              }, 
              ols_hold = 1, 
              ols_has_ref = 1, 
              ols_locklessable = 0, 
              ols_ast_wait = 0, 
              ols_flush = 0, 
              ols_glimpse = 0, 
              ols_owner = 0xffff881aac5471d0
            }
            crash> p *(struct cl_lock *)0xffff881dea5c9e78
            $29 = {
              cll_ref = {
                counter = 258
              }, 
              cll_layers = {
                next = 0xffff881e8c1e8938, 
                prev = 0xffff881fc381a158
              }, 
              cll_linkage = {
                next = 0xffff8819d4c0df38, 
                prev = 0xffff8819d4c0df38
              }, 
              cll_descr = {
                cld_obj = 0xffff8819d4c0df58, 
                cld_start = 0, 
                cld_end = 18446744073709551615, 
                cld_gid = 0, 
                cld_mode = CLM_READ, 
                cld_enq_flags = 0
              }, 
              cll_state = CLS_HELD, 
              cll_wq = {
                lock = {
                  raw_lock = {
                    slock = 76350605
                  }
                }, 
                task_list = {
                  next = 0xffff881dea5c9ed8, 
                  prev = 0xffff881dea5c9ed8
                }
              }, 
              cll_guard = {
                count = {
                  counter = 1
                }, 
                wait_lock = {
                  raw_lock = {
                    slock = 0
                  }
                }, 
                wait_list = {
                  next = 0xffff881dea5c9ef0, 
                  prev = 0xffff881dea5c9ef0
                }, 
                owner = 0x0
              }, 
              cll_guarder = 0x0, 
              cll_depth = 0, 
              cll_intransit_owner = 0x0, 
              cll_error = 0, 
              cll_holds = 1, 
              cll_users = 1, 
              cll_flags = 0, 
              cll_inclosure = {
                next = 0xffff881dea5c9f38, 
                prev = 0xffff881dea5c9f38
              }, 
              cll_conflict = 0x0, 
              cll_reference = {<No data fields>}, 
              cll_holders = {<No data fields>}, 
              cll_obj_ref = 0x0
            }
            

            nedbass Ned Bass (Inactive) added a comment - OK, but it will be several weeks before we can update this system, and even then the bug may be difficult to reproduce.

            jay Jinshan Xiong (Inactive) added a comment - Let's try to print the lock state precisely with this patch: http://review.whamcloud.com/#change,2383
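
            The actual change is in the Gerrit review linked above; purely as an illustration of the general approach (not the real patch, and assuming env and olock are in scope at that point in osc_page_putref_lock()), it amounts to replacing the bare LASSERT with a check that dumps the lock state before LBUGging:

            if (!(cfs_atomic_read(&lock->cll_ref) > 1 || olock->ols_hold == 0)) {
                    /* Log the full cl_lock state so the failing values end up in
                     * the console/Lustre log even when no crash dump is taken. */
                    CL_LOCK_DEBUG(D_ERROR, env, lock,
                                  "cll_ref %d ols_hold %d ols_pageref %d\n",
                                  cfs_atomic_read(&lock->cll_ref),
                                  olock->ols_hold,
                                  cfs_atomic_read(&olock->ols_pageref));
                    LBUG();
            }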

            nedbass Ned Bass (Inactive) added a comment -

            first, we need to know the troublesome cl_lock's content.

            I've only been able to find one address on the stack that looks like a cl_lock. But with lock->cll_ref = 99 it shouldn't have hit the assertion. Maybe something touched that memory after the atomic read? (A stand-alone sketch of this scenario follows the dump below.)

            struct cl_lock {
              cll_ref = { 
                counter = 99
              },  
              cll_layers = { 
                next = 0xffff88...
                prev = 0xffff88...
              },  
              cll_linkage = { 
                next = 0xffff88...
                prev = 0xffff88...
              },  
              cll_descr = { 
                cld_obj = 0xffff88...
                cld_start = 0,
                cld_end = 18446744073709551615,
                cld_gid = 0,
                cld_mode = CLM_READ,
                cld_enq_flags = 0 
              },  
              cll_state = CLS_HELD,
              cll_wq = { 
                lock = { 
                   raw_lock = { 
                      slock = 169675293
                   }   
                },  
                task_list = { 
                  next = 0xffff88...
                  prev = 0xffff88...
                }   
              },  
              cll_guard = { 
                count = { 
                  counter = 1 
                },  
                wait_lock = { 
                   raw_lock = { 
                      slock = 1179666
                   }   
                },  
                wait_list = { 
                  next = 0xffff88...
                  prev = 0xffff88...
                },  
                owner = 0x0 
              },  
              cll_guarder = 0x0,
              cll_depth = 0,
              cll_intransit_owner = 0x0,
              cll_error = 0,
              cll_holds = 1,
              cll_users = 1,
              cll_flags = 0,
              cll_inclosure = { 
                  next = 0xffff88...
                  prev = 0xffff88...
              },  
              cll_conflict = 0x0,
              cll_reference = {<No data fields>},
              cll_holders = {<No data fields>},
              cll_obj_ref = 0x0 
            }
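
            As a stand-alone illustration of that hypothesis (plain C11 atomics, nothing Lustre-specific, and the numbers are made up): the value the LASSERT reads and the value left in memory for the crash dump can legitimately differ, because other contexts keep taking and dropping references between the failed check and the moment the dump freezes memory.

            #include <stdatomic.h>
            #include <stdio.h>

            int main(void)
            {
                    atomic_int cll_ref = 1;           /* refcount at the instant of the check */

                    int seen = atomic_load(&cll_ref); /* what the LASSERT would have seen */

                    /* ...other contexts take more references before the panic path
                     * gets around to freezing memory for the dump... */
                    atomic_fetch_add(&cll_ref, 98);

                    printf("LASSERT saw %d, crash dump shows %d\n",
                           seen, atomic_load(&cll_ref));   /* 1 vs 99 */
                    return 0;
            }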
            

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: louveta Alexandre Louvet (Inactive)