[LU-9998] Default partition setup is not optimal for best metadata performance Created: 16/Sep/17  Updated: 09/Feb/18  Resolved: 22/Dec/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0, Lustre 2.10.4

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Dmitry Eremin (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

b2_10


Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Here is the MDS's CPU configuration.

[root@mds11 ~]# lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    24
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
Stepping:              4
CPU MHz:               2101.000
BogoMIPS:              4200.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              33792K
NUMA node0 CPU(s):     0-47

[root@mds11 ~]# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 96940 MB
node 0 free: 90229 MB
node distances:
node   0 
  0:  10 

Only a single partition is created by default for this single-socket, single-NUMA-node configuration.

[root@mds11 ~]# cat /proc/sys/lnet/cpu_partition_table 
0	: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
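
For readers who want to inspect the table programmatically, here is a minimal Python sketch that parses text in the layout shown above (partition index, a colon, then a space-separated CPU list). The function name `parse_cpt_table` is hypothetical, and the sample string is rebuilt from the output above rather than read from /proc.

```python
def parse_cpt_table(text):
    """Parse 'index : cpu cpu ...' lines into {partition_index: [cpu, ...]}."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, _, cpus = line.partition(":")
        table[int(index)] = [int(cpu) for cpu in cpus.split()]
    return table


# Sample rebuilt from the default table above: one partition holding CPUs 0-47.
sample = "0\t: " + " ".join(str(cpu) for cpu in range(48))
table = parse_cpt_table(sample)
print(len(table), "partition(s);", len(table[0]), "CPUs in partition 0")
```

On a live server the same function could be fed the contents of /proc/sys/lnet/cpu_partition_table, assuming the file keeps this layout.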

This default partition configuration is not optimal and has a large impact on metadata performance, especially for stat and read operations.
Please see the test results below, comparing the default setting with a manual setting of 6 partitions.

Default partition (npartition=1)

mpirun -np 128 /work/tools/bin/mdtest -n 5000 -v -d /scratch0/dir0 -F -i 3 -p 10 -w 0 -u

SUMMARY: (of 3 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   File creation     :      90269.484      73210.911      83067.818       7212.787
   File stat         :     192519.466     191217.586     191843.135        532.702
   File read         :      84278.190      74407.351      78726.036       4123.061
   File removal      :     152552.089     141405.693     148541.612       5058.776
   Tree creation     :        576.227        129.569        332.039        184.718
   Tree removal      :         28.016         12.466         18.019          7.083
V-1: Entering timestamp...

npartition=6

[root@mds11 ~]# cat /proc/sys/lnet/cpu_partition_table 
0	: 0 1 2 3 24 25 26 27
1	: 4 5 6 7 28 29 30 31
2	: 8 9 10 11 32 33 34 35
3	: 12 13 14 15 36 37 38 39
4	: 16 17 18 19 40 41 42 43
5	: 20 21 22 23 44 45 46 47
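
The layout above groups four consecutive cores with what appear to be their hyperthread siblings (core number + 24). The following Python sketch reproduces that table under this numbering assumption, which is consistent with the lscpu output but not stated in this ticket. It is only an illustration of the observed layout, not the libcfs algorithm, and `split_with_siblings` is a hypothetical name.

```python
def split_with_siblings(ncores, threads_per_core, npartitions):
    """Split cores into equal partitions, keeping sibling threads together.

    Assumption: thread t of core c is numbered c + t * ncores.
    """
    if ncores % npartitions:
        raise ValueError("ncores must be divisible by npartitions")
    per_part = ncores // npartitions
    table = {}
    for part in range(npartitions):
        cores = range(part * per_part, (part + 1) * per_part)
        # List all first threads, then all second threads, as in the table above.
        table[part] = [c + t * ncores
                       for t in range(threads_per_core)
                       for c in cores]
    return table


# 24 cores, 2 threads per core, 6 partitions, matching the MDS in this report.
for part, cpus in split_with_siblings(24, 2, 6).items():
    print(f"{part}\t: {' '.join(map(str, cpus))}")
```
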
mpirun -np 128 /work/tools/bin/mdtest -n 5000 -v -d /scratch0/dir0 -F -i 3 -p 10 -w 0 -u
SUMMARY: (of 3 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   File creation     :     130215.199     112298.894     123903.497       8216.228
   File stat         :     447219.644     422373.391     436421.078      10400.374
   File read         :     224856.656     216383.752     219513.555       3796.625
   File removal      :     142603.040     138102.147     139843.976       1973.252
   Tree creation     :        561.879        170.631        379.767        160.865
   Tree removal      :         41.908         41.042         41.509          0.357
V-1: Entering timestamp...
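
To make the comparison easier to read, this short Python sketch computes the ratio of the mean rates from the two mdtest summaries above (6 partitions divided by default). The numbers are copied from the Mean columns; the variable and function names are illustrative only.

```python
# Mean rates copied from the two mdtest summaries above.
default_means = {
    "File creation": 83067.818,
    "File stat": 191843.135,
    "File read": 78726.036,
    "File removal": 148541.612,
    "Tree creation": 332.039,
    "Tree removal": 18.019,
}
six_partition_means = {
    "File creation": 123903.497,
    "File stat": 436421.078,
    "File read": 219513.555,
    "File removal": 139843.976,
    "Tree creation": 379.767,
    "Tree removal": 41.509,
}


def speedups(before, after):
    """Return after/before for every operation present in 'before'."""
    return {op: after[op] / before[op] for op in before}


ratios = speedups(default_means, six_partition_means)
for op, ratio in ratios.items():
    print(f"{op:<14}: {ratio:.2f}x")
```

With these figures, file stat improves by roughly 2.3x and file read by roughly 2.8x, while file removal is slightly slower with 6 partitions.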


 Comments   
Comment by Joseph Gmitter (Inactive) [ 18/Sep/17 ]

Hi Dmitry,

Can you please investigate and advise?

Thanks.
Joe

Comment by Gerrit Updater [ 17/Oct/17 ]

Dmitry Eremin (dmitry.eremin@intel.com) uploaded a new patch: https://review.whamcloud.com/29645
Subject: LU-9998 libcfs: split single NUMA node into partitions
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4d373f2793ef32b47284a88d672433d574a14c20

Comment by Dmitry Eremin (Inactive) [ 17/Oct/17 ]

I would like to propose a workaround for this. My patch restores the old behavior for machines with a single NUMA node.

Comment by Gerrit Updater [ 22/Dec/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29645/
Subject: LU-9998 libcfs: split single NUMA node into partitions
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c9d14a01263bd0fb7a5fac853b5e2d34ff8cadab

Comment by Peter Jones [ 22/Dec/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 02/Jan/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/30690
Subject: LU-9998 libcfs: split single NUMA node into partitions
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 6b337d0e241ce998b88e3b3475089fe330997d82

Comment by Gerrit Updater [ 09/Feb/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/30690/
Subject: LU-9998 libcfs: split single NUMA node into partitions
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: f8736cdfe48ca70a6d293d55ad184dc6b34af312
