[LU-9313] Soft lockup in ldlm_prepare_lru_list when at lock LRU limit Created: 10/Apr/17  Updated: 22/Nov/18  Resolved: 22/Nov/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Patrick Farrell (Inactive) Assignee: Patrick Farrell (Inactive)
Resolution: Duplicate Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-9230 soft lockup on v2.9 Lustre clients (l... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When we've hit the LDLM lock LRU limit and are going into lock reclaim/cancellation (either because an explicit lru_size was set or because the server is limiting the client lock count), we sometimes see soft lockups on the namespace lock (ns_lock) in ldlm_prepare_lru_list, called from the early lock cancellation (ELC) code.

For example:
[995914.635458] [<ffffffffa0c3278a>] ldlm_prepare_lru_list+0x1aa/0x500 [ptlrpc]
[995914.643442] [<ffffffffa0c367a5>] ldlm_cancel_lru_local+0x15/0x40 [ptlrpc]
[995914.651232] [<ffffffffa0c369dc>] ldlm_prep_elc_req+0x20c/0x480 [ptlrpc]
[995914.658828] [<ffffffffa0c36c74>] ldlm_prep_enqueue_req+0x24/0x30 [ptlrpc]
[995914.666606] [<ffffffffa0f7abe1>] osc_enqueue_base+0x1c1/0x6e0 [osc]
[995914.673796] [<ffffffffa0f84147>] osc_lock_enqueue+0x357/0xa00 [osc]
[995914.681002] [<ffffffffa09d8813>] cl_lock_enqueue+0x63/0x120 [obdclass]
[995914.688511] [<ffffffffa0dd6ecc>] lov_lock_enqueue+0x9c/0x170 [lov]
[995914.695616] [<ffffffffa09d8813>] cl_lock_enqueue+0x63/0x120 [obdclass]
[995914.703133] [<ffffffffa09d8d62>] cl_lock_request+0x62/0x1e0 [obdclass]
[995914.710649] [<ffffffffa0edf587>] cl_glimpse_lock+0x337/0x3d0 [lustre]
[995914.718057] [<ffffffffa0edf8e7>] cl_glimpse_size0+0x1b7/0x1c0 [lustre]
[995914.725562] [<ffffffffa0edac65>] ll_agl_trigger+0x115/0x4a0 [lustre]
[995914.732871] [<ffffffffa0edb14d>] ll_agl_thread+0x15d/0x4b0 [lustre]
[995914.740075] [<ffffffff81077874>] kthread+0xb4/0xc0
[995914.745610] [<ffffffff81523498>] ret_from_fork+0x58/0x90

The contention here is easy to reproduce: create a few directories with a large number of small files (~100,000 per directory worked for me), then start several ls processes against them, for example by running

ls -laR * > /dev/null &

a few times. (It is helpful if all the files are on the same OST.)
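
Putting those steps together, a minimal reproducer sketch (directory names, file counts, the /mnt/lustre mount point, and the lru_size value are all illustrative):

cd /mnt/lustre
# Create a few directories of ~100,000 small files each.
for d in d1 d2 d3; do
    mkdir -p $d
    for i in $(seq 1 100000); do touch $d/f$i; done
done
# Cap the client lock LRU so reclaim/ELC kicks in quickly.
lctl set_param ldlm.namespaces.*osc*.lru_size=400
# Several concurrent tree walks then pile onto ns_lock via ELC.
for n in 1 2 3; do ls -laR * > /dev/null & done
wait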

When the LRU limit is hit (it is easiest to trigger by setting the lru_size limit manually, as above), contention on the namespace lock from the ELC code becomes severe. Even if soft lockups do not occur, a quick perf record shows most of the time being spent on this lock.
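
(Concretely, a standard perf invocation along these lines is enough to show the hot spot; the 10-second window is arbitrary:

perf record -a -g -- sleep 10
perf report

Most samples land under ldlm_prepare_lru_list on ns_lock.)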

This badly impacts the performance of the ls processes as well.

My proposed solution is to limit ELC to one thread per namespace. In Cray testing, this solves the problem nicely while still letting ELC function.
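
As a sketch of the idea only (the actual change is in the patch linked in the comments below): a per-namespace flag can gate entry to the ELC LRU scan. The ns_elc_busy field and helper names here are hypothetical, for illustration:

/* Hypothetical field, added to struct ldlm_namespace for this sketch:
 *     atomic_t ns_elc_busy;   -- nonzero while a thread runs ELC here
 */

/* Try to become the single ELC thread for this namespace.  Returns
 * nonzero on success; on failure the caller simply skips early lock
 * cancel for this enqueue instead of contending on ns_lock. */
static int ldlm_elc_try_enter(struct ldlm_namespace *ns)
{
	return atomic_cmpxchg(&ns->ns_elc_busy, 0, 1) == 0;
}

static void ldlm_elc_exit(struct ldlm_namespace *ns)
{
	atomic_set(&ns->ns_elc_busy, 0);
}

/* In the ldlm_prep_elc_req() path, roughly (arguments elided): */
	if (ldlm_elc_try_enter(ns)) {
		count = ldlm_cancel_lru_local(ns, ...);
		ldlm_elc_exit(ns);
	}
	/* else: send the enqueue without piggybacked cancels; the LRU
	 * is still trimmed by whichever thread holds the flag. */

Callers that lose the race just enqueue without early cancels, which bounds ns_lock contention to one LRU scanner per namespace while keeping the LRU trimmed.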



 Comments   
Comment by Gerrit Updater [ 10/Apr/17 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/26477
Subject: LU-9313 ldlm: Limit elc to one thread per namespace
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 32788dd7191935a3315afbc43865e3dfd2403c8e

Comment by Andreas Dilger [ 22/Nov/18 ]

The patch from LU-9230 has resolved this issue.
