LU-9313

Soft lockup in ldlm_prepare_lru_list when at lock LRU limit

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major

    Description

      When we've hit the LDLM lock LRU limit and are going into lock reclaim/cancellation (either because we set an explicit lru_size or because the server is limiting the client lock count), we sometimes see soft lockups on the namespace lock (ns_lock) in ldlm_prepare_lru_list, called from the ELC code.

      For example:
      [995914.635458] [<ffffffffa0c3278a>] ldlm_prepare_lru_list+0x1aa/0x500 [ptlrpc]
      [995914.643442] [<ffffffffa0c367a5>] ldlm_cancel_lru_local+0x15/0x40 [ptlrpc]
      [995914.651232] [<ffffffffa0c369dc>] ldlm_prep_elc_req+0x20c/0x480 [ptlrpc]
      [995914.658828] [<ffffffffa0c36c74>] ldlm_prep_enqueue_req+0x24/0x30 [ptlrpc]
      [995914.666606] [<ffffffffa0f7abe1>] osc_enqueue_base+0x1c1/0x6e0 [osc]
      [995914.673796] [<ffffffffa0f84147>] osc_lock_enqueue+0x357/0xa00 [osc]
      [995914.681002] [<ffffffffa09d8813>] cl_lock_enqueue+0x63/0x120 [obdclass]
      [995914.688511] [<ffffffffa0dd6ecc>] lov_lock_enqueue+0x9c/0x170 [lov]
      [995914.695616] [<ffffffffa09d8813>] cl_lock_enqueue+0x63/0x120 [obdclass]
      [995914.703133] [<ffffffffa09d8d62>] cl_lock_request+0x62/0x1e0 [obdclass]
      [995914.710649] [<ffffffffa0edf587>] cl_glimpse_lock+0x337/0x3d0 [lustre]
      [995914.718057] [<ffffffffa0edf8e7>] cl_glimpse_size0+0x1b7/0x1c0 [lustre]
      [995914.725562] [<ffffffffa0edac65>] ll_agl_trigger+0x115/0x4a0 [lustre]
      [995914.732871] [<ffffffffa0edb14d>] ll_agl_thread+0x15d/0x4b0 [lustre]
      [995914.740075] [<ffffffff81077874>] kthread+0xb4/0xc0
      [995914.745610] [<ffffffff81523498>] ret_from_fork+0x58/0x90

      The contention here is easy to reproduce by creating a few directories with a large number of small files (~100,000 per directory worked for me), then starting a number of ls processes. For example, run the following a few times (it is helpful if all files are on the same OST):
      ls -laR * > /dev/null &

      When the LRU limit is hit (it is easiest to see this by setting the lru_size limit manually), contention on the namespace lock from the ELC code becomes very painful. Even if soft lockups do not occur, a quick perf record shows most of the time being spent on this lock.

      This badly impacts the performance of the ls processes as well.

      My proposed solution is to limit ELC to one thread per namespace. In Cray testing, this solves the problem nicely while still letting ELC function.
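      To illustrate the idea (this is a minimal sketch, not the actual patch at https://review.whamcloud.com/26477), the pattern below claims a per-namespace "ELC in progress" flag before scanning the LRU; any thread that finds the flag already set simply skips ELC for this request instead of queueing up on ns_lock. All names in the sketch (my_namespace, elc_pick_lru_locks, elc_try_cancel) are hypothetical stand-ins, not Lustre symbols, and the example is plain userspace C11 rather than kernel code.

      /*
       * Minimal sketch only -- NOT the Lustre patch.  Hypothetical
       * stand-in types/functions to show "one ELC thread per namespace".
       */
      #include <stdatomic.h>
      #include <stdio.h>

      struct my_namespace {
              atomic_flag elc_in_progress;    /* set while one thread runs ELC */
              int lru_count;                  /* locks currently on the LRU */
              int lru_limit;                  /* configured lru_size limit */
      };

      /* Stand-in for the LRU scan that would normally be done under ns_lock. */
      static int elc_pick_lru_locks(struct my_namespace *ns, int want)
      {
              int picked = want < ns->lru_count ? want : ns->lru_count;

              ns->lru_count -= picked;
              return picked;
      }

      /*
       * Enqueue-path helper: returns the number of locks chosen for early
       * cancel, or 0 if ELC was skipped because another thread is already
       * shrinking this namespace's LRU.
       */
      static int elc_try_cancel(struct my_namespace *ns, int want)
      {
              int picked;

              if (ns->lru_count <= ns->lru_limit)
                      return 0;               /* nothing to reclaim */

              if (atomic_flag_test_and_set(&ns->elc_in_progress))
                      return 0;               /* another thread is doing ELC */

              picked = elc_pick_lru_locks(ns, want);
              atomic_flag_clear(&ns->elc_in_progress);
              return picked;
      }

      int main(void)
      {
              struct my_namespace ns = {
                      .elc_in_progress = ATOMIC_FLAG_INIT,
                      .lru_count = 200000,
                      .lru_limit = 100000,
              };

              printf("picked %d locks for early cancel\n",
                     elc_try_cancel(&ns, 64));
              return 0;
      }

      In the kernel, the same test-and-set/skip decision would sit in front of the LRU walk that ldlm_prepare_lru_list does on the ELC path, so at most one enqueueing thread per namespace pays the cost of shrinking the LRU while the others proceed with their enqueue immediately.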

    Attachments

    Issue Links

    Activity


      adilger Andreas Dilger added a comment - Patch from LU-9230 has resolved this issue.

      gerrit Gerrit Updater added a comment - Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/26477
      Subject: LU-9313 ldlm: Limit elc to one thread per namespace
      Project: fs/lustre-release
      Branch: master
      Current Patch Set: 1
      Commit: 32788dd7191935a3315afbc43865e3dfd2403c8e

    People

      Assignee: paf Patrick Farrell (Inactive)
      Reporter: paf Patrick Farrell (Inactive)
      Votes: 0
      Watchers: 4

    Dates

      Created:
      Updated:
      Resolved: