[LU-55] Finish SMP scalability work (public tracking ticket) Created: 25/Jan/11  Updated: 25/Jan/11  Resolved: 25/Jan/11

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.0.0
Fix Version/s: Lustre 2.0.0

Type: Improvement Priority: Major
Reporter: Liang Zhen (Inactive) Assignee: Liang Zhen (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Rank (Obsolete): 10671

 Description   

Because IT-2 is an internal task visible only to Whamcloud engineers, I created this public ticket to track the effort and record discussions.

All patches:

http://git.whamcloud.com/gitweb/?p=fs/lustre-dev.git;a=shortlog;h=refs/heads/liang/b_smp



 Comments   
Comment by Liang Zhen (Inactive) [ 25/Jan/11 ]

On a large-scale SMP system, the SMP scalability of the backend disk filesystem will be the bottleneck for metadata operations under a shared directory, and there is no easy solution to this. I'm trying to work out a pdirops patch to improve SMP performance for shared directories, and at the same time I'm thinking about adding a driver for metadata-modifying operations to improve performance:

The following description is copied from IT-2:

  • This driver includes two parts:
    • an MDD directory fid cache
    • metadata transaction schedulers
  • Each CPU has a local fid cache: an 8-bucket hash table where each bucket is an 8-entry LRU, so by default we can cache 64 fids (of course the number of buckets can be increased); see the sketch after this list.
  • The transaction schedulers are a per-CPU thread pool in the MDD layer.
  • Each time we create or remove a file in a directory, we push the fid into the local CPU's fid cache (LRU).
  • If the fid is already in the cache, we just increase its refcount (# of hits) and move it to the top of the LRU.
  • When we push a new fid into the local cache, the fid at the bottom of the LRU is popped out.
  • If the refcount (# of hits) of a fid in the local CPU cache is >= N (default is 32), we try to look the fid up in the global fid cache, which is a hash table as well (and do nothing if we already hold a reference to the fid in the global cache):
    • if the fid is not in the global cache, we add it and set its refcount to 1 (one CPU is using it)
    • if the fid is already in the global cache, we increase its refcount
  • If the refcount of the fid in the global cache is < M, we just return -1 and continue to run the transaction in the context of the current thread (the MDT service thread).
  • If the refcount of the fid in the global cache is >= M (M is a number of CPUs, tunable to 2, 4, ... as many as still give good shared-directory performance with ldiskfs/ext4), we assign M CPUs to the fid and return one CPU id from that set; if the caller gets a CPU id different from the current CPU id, it launches an MDD-level "transaction request" and wakes up one MDD transaction thread to handle it.
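
Below is a minimal userspace sketch of the per-CPU fid cache described above: an 8-bucket hash table whose buckets are small fixed-size LRU arrays with hit counters. This is a simplified model, not actual Lustre/MDD code; the identifiers (fid_cache, fid_entry, fid_cache_push) are invented for illustration, a plain uint64_t stands in for struct lu_fid, and the fid being pushed is assumed to be that of the directory being modified.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NR_BUCKETS    8   /* buckets per CPU-local cache              */
#define BUCKET_DEPTH  8   /* LRU entries per bucket -> 64 fids total  */
#define HIT_THRESHOLD 32  /* "N": hits before consulting global cache */

struct fid_entry {
	uint64_t fid;   /* stand-in for a struct lu_fid             */
	unsigned hits;  /* # of create/unlink ops seen for this fid */
	int      used;
};

struct fid_cache {
	/* slot 0 of each bucket is the LRU head (most recently used) */
	struct fid_entry bucket[NR_BUCKETS][BUCKET_DEPTH];
};

/* Push a fid into the local cache; return its hit count after the push. */
static unsigned fid_cache_push(struct fid_cache *fc, uint64_t fid)
{
	struct fid_entry *b = fc->bucket[fid % NR_BUCKETS];
	struct fid_entry tmp = { fid, 1, 1 };
	int i;

	/* Hit: bump the counter and move the entry to the top of the LRU. */
	for (i = 0; i < BUCKET_DEPTH; i++) {
		if (b[i].used && b[i].fid == fid) {
			tmp = b[i];
			tmp.hits++;
			memmove(&b[1], &b[0], i * sizeof(*b));
			b[0] = tmp;
			return tmp.hits;
		}
	}

	/* Miss: insert at the top; the entry at the bottom is popped out. */
	memmove(&b[1], &b[0], (BUCKET_DEPTH - 1) * sizeof(*b));
	b[0] = tmp;
	return tmp.hits;
}

int main(void)
{
	struct fid_cache cpu_cache = { 0 };
	uint64_t dir_fid = 0x200000401ULL;
	int op;

	for (op = 0; op < 40; op++) {
		unsigned hits = fid_cache_push(&cpu_cache, dir_fid);

		if (hits >= HIT_THRESHOLD)
			printf("op %d: fid %#llx is hot, consult global cache\n",
			       op, (unsigned long long)dir_fid);
	}
	return 0;
}

Keeping each bucket as a small array ordered most-recently-used first means move-to-front and eviction are just memmove()s, which matches the "push to top / pop from bottom" behaviour described above.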

This way, the overhead on the common code path is very low (almost no contention), and almost nothing changes for operations on unique (non-shared) directories. At the same time, contention on pdirops is reduced because changes to a shared directory are localized to a few CPUs, as sketched below.
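
A hedged sketch of that dispatch decision follows. The function name fid_dispatch_cpu and the way the fid is mapped onto CPUs 0..M-1 are invented for illustration; the real patches may choose the M CPUs for a fid differently.

#include <stdint.h>
#include <stdio.h>

/* Return -1 to run the transaction in the current (MDT service) thread,
 * or the id of the CPU whose MDD transaction thread should handle it. */
static int fid_dispatch_cpu(uint64_t fid, int global_refcount,
			    int m_cpus, int cur_cpu)
{
	int target;

	if (global_refcount < m_cpus)
		return -1;      /* not hot enough: stay in the current thread */

	/* Pretend the M CPUs dedicated to this fid are CPUs 0..M-1 and
	 * spread callers across them; -1 means "current CPU is already ok". */
	target = (int)((fid + (uint64_t)cur_cpu) % (uint64_t)m_cpus);

	return target == cur_cpu ? -1 : target;
}

int main(void)
{
	uint64_t fid = 0x200000401ULL;
	int cpu;

	for (cpu = 0; cpu < 8; cpu++)
		printf("caller on CPU %d -> target %d\n", cpu,
		       fid_dispatch_cpu(fid, 4 /* refcount */, 4 /* M */, cpu));
	return 0;
}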

Also, we could probably sort unlink requests inside the MDD schedulers to reduce disk seeks somewhat, but I'm not sure how much that would help.
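
For what it's worth, a minimal sketch of that idea, assuming each MDD scheduler drains its queue in batches and that fid order roughly correlates with on-disk layout; the unlink_req structure is invented for illustration.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct unlink_req {
	uint64_t fid;   /* fid of the object being unlinked */
	/* ... other request state would live here ... */
};

static int unlink_req_cmp(const void *a, const void *b)
{
	uint64_t fa = ((const struct unlink_req *)a)->fid;
	uint64_t fb = ((const struct unlink_req *)b)->fid;

	return fa < fb ? -1 : fa > fb;
}

int main(void)
{
	struct unlink_req batch[] = { { 0x2005 }, { 0x2001 }, { 0x2003 } };
	size_t i, n = sizeof(batch) / sizeof(batch[0]);

	/* Sort the queued unlinks by fid before replaying them against ldiskfs. */
	qsort(batch, n, sizeof(batch[0]), unlink_req_cmp);

	for (i = 0; i < n; i++)
		printf("%#llx\n", (unsigned long long)batch[i].fid);
	return 0;
}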

Comment by Robert Read (Inactive) [ 25/Jan/11 ]

I moved IT-2 to the Lustre project, so now it is visible to all.
