Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4300

ptlrpcd threads deadlocked in cl_lock_mutex_get

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.6.0, Lustre 2.5.2
    • Lustre 2.5.0
    • 3
    • 11787

    Description

      TraveltimeLaunc thread held cl_lock and cause call of cl_lock_cancel. Lock cancel causes rpc OST_WRITE. But OST_WRITE can't be sent because all ptlrpcd are waiting for cl_lock.

      So, OST_WRITE rpc sticks at new requests queue and can't be sent because ptlrpcd_19 is blocked by osc lock up call.

      dump from ptlrpcd threads - each thread has a recovery set, and two "production" new and processed.
      each rpc is dumped in format
      $rpc_ptr : $opc : $rq_status

      thread 19 have queue
      print tread 19 - 0xffff881fe70fae70
      new requests
      rpc 0xffff880faf27e000 : 101 - 0
      rpc 0xffff881feff5d800 : 4 - 0
      request in processing

      All 32 ptlrpcd threads are waiting with this stack trace:

      > 2013-10-15T15:14:18.766544-05:00 c0-0c0s5n3 ptlrpcd_0       D 0000000000000000     0  4183      2 0x00000000
      > 2013-10-15T15:14:18.791825-05:00 c0-0c0s5n3 ffff881fe7085aa0 0000000000000046 ffff881fe7084010 ffff881fea497040
      > 2013-10-15T15:14:18.792017-05:00 c0-0c0s5n3 0000000000010b00 ffff881fe7085fd8 ffff881fe7085fd8 0000000000010b00
      > 2013-10-15T15:14:18.817016-05:00 c0-0c0s5n3 0000000000000013 ffff881fea497040 0000000000000000 ffff881fe9800040
      > 2013-10-15T15:14:18.817082-05:00 c0-0c0s5n3 Call Trace:
      > 2013-10-15T15:14:18.817244-05:00 c0-0c0s5n3 [<ffffffff8138311f>] schedule+0x3f/0x60
      > 2013-10-15T15:14:18.817388-05:00 c0-0c0s5n3 [<ffffffff81383f92>] __mutex_lock_slowpath+0x102/0x180
      > 2013-10-15T15:14:18.842380-05:00 c0-0c0s5n3 [<ffffffff81383993>] mutex_lock+0x23/0x40
      > 2013-10-15T15:14:18.842582-05:00 c0-0c0s5n3 [<ffffffffa0375806>] cl_lock_mutex_get+0x76/0xd0 [obdclass]
      > 2013-10-15T15:14:18.867656-05:00 c0-0c0s5n3 [<ffffffffa07bbc08>] lovsub_parent_lock+0x48/0x120 [lov]
      > 2013-10-15T15:14:18.867709-05:00 c0-0c0s5n3 [<ffffffffa07bc368>] lovsub_lock_state+0x68/0x1a0 [lov]
      > 2013-10-15T15:14:18.867880-05:00 c0-0c0s5n3 [<ffffffffa0373558>] cl_lock_state_signal+0x68/0x160 [obdclass]
      > 2013-10-15T15:14:18.893131-05:00 c0-0c0s5n3 [<ffffffffa0373e7a>] cl_lock_signal+0x5a/0x130 [obdclass]
      > 2013-10-15T15:14:18.893285-05:00 c0-0c0s5n3 [<ffffffffa072279d>] osc_lock_upcall+0x25d/0x610 [osc]
      > 2013-10-15T15:14:18.893419-05:00 c0-0c0s5n3 [<ffffffffa07041bd>] osc_enqueue_fini+0x9d/0x240 [osc]
      > 2013-10-15T15:14:18.918520-05:00 c0-0c0s5n3 [<ffffffffa0706eb7>] osc_enqueue_interpret+0xe7/0x1d0 [osc]
      > 2013-10-15T15:14:18.918702-05:00 c0-0c0s5n3 [<ffffffffa049733c>] ptlrpc_check_set+0x52c/0x1e20 [ptlrpc]
      > 2013-10-15T15:14:18.943647-05:00 c0-0c0s5n3 [<ffffffffa04c4acb>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
      > 2013-10-15T15:14:18.943885-05:00 c0-0c0s5n3 [<ffffffffa04c504b>] ptlrpcd+0x29b/0x3b0 [ptlrpc]
      > 2013-10-15T15:14:18.943979-05:00 c0-0c0s5n3 [<ffffffff8138de94>] kernel_thread_helper+0x4/0x10
      

      The following application process is waiting for an RPC to be sent:

      > 2013-10-15T15:14:31.200904-05:00 c0-0c0s5n3 TraveltimeLaunc S 00000001050e94cc     0 16793  16777 0x00000000
      > 2013-10-15T15:14:31.200927-05:00 c0-0c0s5n3 ffff880fe8979308 0000000000000082 ffff880fe8978010 ffff880faca8e040
      > 2013-10-15T15:14:31.200967-05:00 c0-0c0s5n3 0000000000010b00 ffff880fe8979fd8 ffff880fe8979fd8 0000000000010b00
      > 2013-10-15T15:14:31.226179-05:00 c0-0c0s5n3 0000000000000000 ffff880faca8e040 0000000000000000 ffffffff81667020
      > 2013-10-15T15:14:31.226195-05:00 c0-0c0s5n3 Call Trace:
      > 2013-10-15T15:14:31.226219-05:00 c0-0c0s5n3 [<ffffffff8138311f>] schedule+0x3f/0x60
      > 2013-10-15T15:14:31.251476-05:00 c0-0c0s5n3 [<ffffffffa021d74e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      > 2013-10-15T15:14:31.251533-05:00 c0-0c0s5n3 [<ffffffffa0731e56>] osc_extent_wait+0x576/0x630 [osc]
      > 2013-10-15T15:14:31.251563-05:00 c0-0c0s5n3 [<ffffffffa0732447>] osc_cache_wait_range+0x537/0x820 [osc]
      > 2013-10-15T15:14:31.276843-05:00 c0-0c0s5n3 [<ffffffffa0733199>] osc_cache_writeback_range+0xa69/0x1030 [osc]
      > 2013-10-15T15:14:31.276897-05:00 c0-0c0s5n3 [<ffffffffa071febe>] osc_lock_flush+0x7e/0x260 [osc]
      > 2013-10-15T15:14:31.302146-05:00 c0-0c0s5n3 [<ffffffffa0720181>] osc_lock_cancel+0xe1/0x1c0 [osc]
      > 2013-10-15T15:14:31.302187-05:00 c0-0c0s5n3 [<ffffffffa03730d5>] cl_lock_cancel0+0x75/0x160 [obdclass]
      > 2013-10-15T15:14:31.302229-05:00 c0-0c0s5n3 [<ffffffffa0373e1b>] cl_lock_cancel+0x13b/0x140 [obdclass]
      > 2013-10-15T15:14:31.327498-05:00 c0-0c0s5n3 [<ffffffffa072162c>] osc_ldlm_blocking_ast+0x20c/0x330 [osc]
      > 2013-10-15T15:14:31.327566-05:00 c0-0c0s5n3 [<ffffffffa046baab>] ldlm_cancel_callback+0x6b/0x190 [ptlrpc]
      > 2013-10-15T15:14:31.352863-05:00 c0-0c0s5n3 [<ffffffffa0479d1a>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
      > 2013-10-15T15:14:31.352917-05:00 c0-0c0s5n3 [<ffffffffa047cfec>] ldlm_cli_cancel_list_local+0xec/0x290 [ptlrpc]
      > 2013-10-15T15:14:31.378183-05:00 c0-0c0s5n3 [<ffffffffa047dee5>] ldlm_cancel_lru_local+0x35/0x40 [ptlrpc]
      > 2013-10-15T15:14:31.378216-05:00 c0-0c0s5n3 [<ffffffffa047f36f>] ldlm_prep_elc_req+0x3df/0x490 [ptlrpc]
      > 2013-10-15T15:14:31.403528-05:00 c0-0c0s5n3 [<ffffffffa047f448>] ldlm_prep_enqueue_req+0x28/0x30 [ptlrpc]
      > 2013-10-15T15:14:31.403581-05:00 c0-0c0s5n3 [<ffffffffa0706963>] osc_enqueue_base+0x103/0x570 [osc]
      > 2013-10-15T15:14:31.403617-05:00 c0-0c0s5n3 [<ffffffffa0720e36>] osc_lock_enqueue+0x506/0x900 [osc]
      > 2013-10-15T15:14:31.403617-05:00 c0-0c0s5n3 [<ffffffffa0720e36>] osc_lock_enqueue+0x506/0x900 [osc]
      > 2013-10-15T15:14:31.428880-05:00 c0-0c0s5n3 [<ffffffffa03777bb>] cl_enqueue_try+0xfb/0x320 [obdclass]
      > 2013-10-15T15:14:31.428980-05:00 c0-0c0s5n3 [<ffffffffa07b4a54>] lov_lock_enqueue+0x1f4/0x890 [lov]
      > 2013-10-15T15:14:31.429041-05:00 c0-0c0s5n3 [<ffffffffa03777bb>] cl_enqueue_try+0xfb/0x320 [obdclass]
      > 2013-10-15T15:14:31.454182-05:00 c0-0c0s5n3 [<ffffffffa037869f>] cl_enqueue_locked+0x7f/0x1f0 [obdclass]
      > 2013-10-15T15:14:31.454296-05:00 c0-0c0s5n3 [<ffffffffa03792ce>] cl_lock_request+0x7e/0x270 [obdclass]
      > 2013-10-15T15:14:31.454354-05:00 c0-0c0s5n3 [<ffffffffa0883e9f>] cl_glimpse_lock+0x17f/0x490 [lustre]
      > 2013-10-15T15:14:31.479540-05:00 c0-0c0s5n3 [<ffffffffa088434d>] cl_glimpse_size0+0x19d/0x1c0 [lustre]
      > 2013-10-15T15:14:31.504845-05:00 c0-0c0s5n3 [<ffffffffa0837650>] ll_inode_revalidate_it+0x1b0/0x1d0 [lustre]
      > 2013-10-15T15:14:31.504912-05:00 c0-0c0s5n3 [<ffffffffa08376b6>] ll_getattr_it+0x46/0x180 [lustre]
      > 2013-10-15T15:14:31.504990-05:00 c0-0c0s5n3 [<ffffffffa083782c>] ll_getattr+0x3c/0x40 [lustre]
      > 2013-10-15T15:14:31.505021-05:00 c0-0c0s5n3 [<ffffffff81147355>] vfs_getattr+0x25/0x50
      > 2013-10-15T15:14:31.530127-05:00 c0-0c0s5n3 [<ffffffff81147920>] vfs_fstatat+0x80/0x90
      > 2013-10-15T15:14:31.530209-05:00 c0-0c0s5n3 [<ffffffff81147a5b>] vfs_stat+0x1b/0x20
      > 2013-10-15T15:14:31.530264-05:00 c0-0c0s5n3 [<ffffffff81147a84>] sys_newstat+0x24/0x50
      > 2013-10-15T15:14:31.530321-05:00 c0-0c0s5n3 [<ffffffff8138cdab>] system_call_fastpath+0x16/0x1b
      

      Attachments

        1. ptlrpcd-rpc.info
          233 kB
        2. rpcs.log
          136 kB
        3. quick-fix.diff
          2 kB

        Issue Links

          Activity

            People

              jay Jinshan Xiong (Inactive)
              scherementsev Sergey Cheremencev
              Votes:
              0 Vote for this issue
              Watchers:
              31 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: