
LU-9313: Soft lockup in ldlm_prepare_lru_list when at lock LRU limit


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Affects Version/s: None
    • Fix Version/s: None
    • Severity: 3

    Description

      When we've hit the LDLM lock LRU limit and are going into lock reclaim/cancellation (either because we set an explicit lru_size or because the server is limiting the client lock count), we sometimes see soft lockups on the namespace lock (ns_lock) in ldlm_prepare_lru_list, called from the ELC (early lock cancellation) code.

      For example:
      [995914.635458] [<ffffffffa0c3278a>] ldlm_prepare_lru_list+0x1aa/0x500 [ptlrpc]
      [995914.643442] [<ffffffffa0c367a5>] ldlm_cancel_lru_local+0x15/0x40 [ptlrpc]
      [995914.651232] [<ffffffffa0c369dc>] ldlm_prep_elc_req+0x20c/0x480 [ptlrpc]
      [995914.658828] [<ffffffffa0c36c74>] ldlm_prep_enqueue_req+0x24/0x30 [ptlrpc]
      [995914.666606] [<ffffffffa0f7abe1>] osc_enqueue_base+0x1c1/0x6e0 [osc]
      [995914.673796] [<ffffffffa0f84147>] osc_lock_enqueue+0x357/0xa00 [osc]
      [995914.681002] [<ffffffffa09d8813>] cl_lock_enqueue+0x63/0x120 [obdclass]
      [995914.688511] [<ffffffffa0dd6ecc>] lov_lock_enqueue+0x9c/0x170 [lov]
      [995914.695616] [<ffffffffa09d8813>] cl_lock_enqueue+0x63/0x120 [obdclass]
      [995914.703133] [<ffffffffa09d8d62>] cl_lock_request+0x62/0x1e0 [obdclass]
      [995914.710649] [<ffffffffa0edf587>] cl_glimpse_lock+0x337/0x3d0 [lustre]
      [995914.718057] [<ffffffffa0edf8e7>] cl_glimpse_size0+0x1b7/0x1c0 [lustre]
      [995914.725562] [<ffffffffa0edac65>] ll_agl_trigger+0x115/0x4a0 [lustre]
      [995914.732871] [<ffffffffa0edb14d>] ll_agl_thread+0x15d/0x4b0 [lustre]
      [995914.740075] [<ffffffff81077874>] kthread+0xb4/0xc0
      [995914.745610] [<ffffffff81523498>] ret_from_fork+0x58/0x90
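
      To make the contention pattern concrete, here is a minimal, self-contained user-space sketch (plain C with pthreads; this is NOT the actual Lustre code, and namespace, prepare_lru_list, and enqueue_thread are simplified stand-ins for the kernel-side ldlm structures). Every enqueuing thread re-walks the whole LRU while holding a single per-namespace spinlock, so the threads serialize on the lock and spin:

      #include <pthread.h>
      #include <stdlib.h>

      #define LRU_LEN  1024
      #define NTHREADS 8

      struct lock_entry {
          struct lock_entry *next;
          int cancelable;
      };

      struct namespace {
          pthread_spinlock_t ns_lock; /* stand-in for ns->ns_lock */
          struct lock_entry *lru;     /* LRU list of unused locks */
      };

      /* Stand-in for ldlm_prepare_lru_list(): walk the whole LRU looking
       * for cancellation candidates while holding the namespace lock. */
      static int prepare_lru_list(struct namespace *ns)
      {
          int found = 0;

          pthread_spin_lock(&ns->ns_lock);
          for (struct lock_entry *e = ns->lru; e != NULL; e = e->next)
              if (e->cancelable)
                  found++;
          pthread_spin_unlock(&ns->ns_lock);
          return found;
      }

      /* Each "client" thread enqueues in a loop; with ELC every enqueue
       * re-scans the LRU, so all threads hammer the same spinlock. */
      static void *enqueue_thread(void *arg)
      {
          struct namespace *ns = arg;

          for (int i = 0; i < 100000; i++)
              prepare_lru_list(ns);
          return NULL;
      }

      int main(void)
      {
          struct namespace ns;
          struct lock_entry *entries = calloc(LRU_LEN, sizeof(*entries));
          pthread_t tid[NTHREADS];

          if (entries == NULL)
              return 1;
          pthread_spin_init(&ns.ns_lock, PTHREAD_PROCESS_PRIVATE);
          for (int i = 0; i < LRU_LEN - 1; i++) {
              entries[i].next = &entries[i + 1];
              entries[i].cancelable = i & 1;
          }
          ns.lru = entries;

          for (int i = 0; i < NTHREADS; i++)
              pthread_create(&tid[i], NULL, enqueue_thread, &ns);
          for (int i = 0; i < NTHREADS; i++)
              pthread_join(tid[i], NULL);
          free(entries);
          return 0;
      }

      Run with several threads, nearly all samples in a perf record of this sketch land in pthread_spin_lock, mirroring the ns_lock contention described above.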

      The contention here is easy to reproduce by creating a few directories, each with a large number of small files (~100,000 per directory worked for me), then starting a number of ls processes, e.g. by running the following a few times:
      ls -laR * > /dev/null &

      (It is helpful if all of the files are on the same OST.)
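
      For convenience, here is a small file-population helper equivalent to the setup above (hypothetical, not part of any Lustre tooling; NDIRS, NFILES, and the dirN/fN naming are illustrative). Point it at a directory on the Lustre mount, then start the ls processes as described:

      #include <errno.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/stat.h>
      #include <unistd.h>

      #define NDIRS  4        /* "a few directories" */
      #define NFILES 100000   /* ~100,000 small files each */

      int main(int argc, char **argv)
      {
          char path[4096];

          if (argc != 2) {
              fprintf(stderr, "usage: %s <lustre-dir>\n", argv[0]);
              return 1;
          }
          for (int d = 0; d < NDIRS; d++) {
              snprintf(path, sizeof(path), "%s/dir%d", argv[1], d);
              if (mkdir(path, 0755) != 0 && errno != EEXIST) {
                  perror("mkdir");
                  return 1;
              }
              for (int f = 0; f < NFILES; f++) {
                  snprintf(path, sizeof(path), "%s/dir%d/f%d",
                           argv[1], d, f);
                  int fd = open(path, O_CREAT | O_WRONLY, 0644);
                  if (fd < 0) {
                      perror("open");
                      return 1;
                  }
                  close(fd);
              }
          }
          return 0;
      }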

      When the LRU limit is hit (it is easiest to see by setting the lru_size limit manually), contention on the namespace lock from the ELC code becomes very painful. Even when soft lockups do not occur, a quick perf record shows most of the time being spent on this lock.

      This badly impacts the performance of the ls processes as well.

      My proposed solution is to limit ELC to one process per namespace; see the sketch below. In Cray testing, this solves the problem nicely, but still lets ELC function.
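
      A minimal sketch of that idea (illustrative user-space C, not the actual patch; ns_elc_busy, cancel_lru_local, and elc_try_cancel are hypothetical names): a per-namespace test-and-set flag admits one thread into ELC at a time, and concurrent enqueuers simply skip early cancellation instead of queuing up on ns_lock:

      #include <stdatomic.h>
      #include <stdio.h>

      struct namespace {
          atomic_flag ns_elc_busy; /* set while some thread is in ELC */
          int lru_count;           /* stand-in for the LRU list */
      };

      /* Stand-in for ldlm_cancel_lru_local(): pretend to cancel a lock. */
      static int cancel_lru_local(struct namespace *ns)
      {
          if (ns->lru_count > 0) {
              ns->lru_count--;
              return 1;
          }
          return 0;
      }

      /* Enqueue-path hook: only the first caller per namespace runs ELC;
       * everyone else returns at once and enqueues without early
       * cancellation, rather than piling onto the namespace lock. */
      static int elc_try_cancel(struct namespace *ns)
      {
          if (atomic_flag_test_and_set(&ns->ns_elc_busy))
              return 0; /* another thread is already canceling */
          int canceled = cancel_lru_local(ns);
          atomic_flag_clear(&ns->ns_elc_busy);
          return canceled;
      }

      int main(void)
      {
          struct namespace ns = { .ns_elc_busy = ATOMIC_FLAG_INIT,
                                  .lru_count = 3 };

          printf("canceled %d lock(s)\n", elc_try_cancel(&ns));
          return 0;
      }

      The threads that skip ELC lose one cancellation opportunity, but the single in-flight scanner still keeps the LRU bounded, which is why ELC continues to function under this scheme.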

    Attachments

    Issue Links

    Activity

    People

      Assignee: Patrick Farrell (paf)
      Reporter: Patrick Farrell (paf)
      Votes: 0
      Watchers: 4

    Dates

      Created:
      Updated:
      Resolved: