[LU-7750] Slow operation of the file system Created: 05/Feb/16  Updated: 11/Apr/16

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Question/Request Priority: Minor
Reporter: Alex Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Version 2.7.65


Attachments: Text File rpc_stats.txt     Text File testhard.txt    

 Description   

Hello!
Currently I am experiencing a problem with file system performance.
Creating MySQL tables takes an extremely long time. At the moment this is worked around by re-exporting Lustre over NFS (although removing a directory is about 200 times slower over NFS), and performance is still low.
Using the Lustre client directly, creating directories is about 6 times slower, and removing them about 12 times slower, compared with the ZFS and ext4 file systems.
Could it be that, due to the Lustre architecture itself, it cannot be used for MySQL databases and small files?
What can you recommend to improve performance, so that MySQL databases and small files can be used without re-exporting through NFS?
Backing the zpool with a RAM disk does not solve the problem.

On the Lustre client mount:
[root@client mount]# time mkdir {0..10000}
real 0m5.571s
user 0m0.019s
sys 0m1.400s
[root@client mount]# time rm -rf {0..10000}

real 0m10.332s
user 0m0.121s
sys 0m3.577s

When re-exporting through NFS
[root@client mount]# time mkdir {0..10000}
real 0m10.167s
user 0m0.024s
sys 0m0.364s
[root@client mount]# time rm -rf {0..10000}

real 3m19.770s
user 0m0.216s
sys 0m2.472s



 Comments   
Comment by Oleg Drokin [ 08/Feb/16 ]

The problem at hand is that Lustre does synchronous metadata updates (what that really means is that for any metadata modifications we send an RPC right away to the server and wait for it to return before continuing), which makes this much slower than say local filesystems that can do purely local modifications in memory as they know nobody else could be changing anything from any other node.

NFS is faster in some of those workloads because it does not have 100% cache consistency and can assume nobody has touched the files from other nodes for some time, which gives it a boost.

Typically people don't run databases off Lustre or other distributed filesystems because databases have their own propagation mechanisms (replication and the like) to do it more efficiently among other things. Databases are rather sync-heavy which is expensive on network filesystems too.
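
A minimal way to observe that round-trip cost on a client, assuming the standard lctl interface (exact stat names may differ between Lustre versions):

# Per-RPC wait times and operation counts on the metadata client (MDC);
# req_waittime is roughly the round trip every synchronous update pays.
lctl get_param mdc.*.stats

# Clear the counters, rerun the mkdir/rm test, then read the stats again
# to see how many metadata RPCs the workload generated.
lctl set_param mdc.*.stats=clear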

Comment by Alex [ 10/Feb/16 ]

Is it possible to reduce the wait time per request, or to increase the number of requests in flight for faster processing, so that performance becomes comparable to a local file system?
What settings should be changed?

Comment by Alex [ 10/Feb/16 ]

The attached file testhard.txt contains the results of:
iozone -w -M -t 1 -s 300m -r 1m -i 0 -i 1 -F /hard/1/ -R
dd if=/dev/zero of=/hard/1/dd1 bs=1M count=300 oflag=direct conv=fdatasync
bonnie++ -d /hard/1/ -r 256 -u root

The attached file rpc_stats.txt contains the output of:
lctl get_param osc.*.rpc_stats

Comment by Oleg Drokin [ 10/Feb/16 ]

The "waiting time" for a metadata update request consists of RPC roundtrip + server processing time, really. You can make your network faster and use a faster server, but that would still be several orders of magnitude slower than purely local updates you get with local filesystems.
Even though we allow up to 8 modification RPCs to the MDS now, that does not help you much if you create files in the same directory, due to how the kernel takes a semaphore to avoid various races.
In effect if you cause your local filesystem to do fsync after every open or unlink, you'll also get a big performance impact for the same reason.

Unless you really require sync I/O, note that in your dd example oflag=direct means every RPC is synchronous too; the key to making that perform better is to increase your I/O size. For example, "dd if=/dev/zero of=/hard/1/dd1 bs=300M count=1 oflag=direct conv=fdatasync" would be faster, as it would allow more than a single write RPC to be in flight.
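
For reference, a hedged sketch of the client-side RPC concurrency settings related to the above (parameter names as on recent 2.7/2.8 clients; verify with lctl list_param on your build):

# Metadata-modifying RPCs allowed in flight to the MDS (the "up to 8" above);
# only present on builds that include the multiple modify RPCs feature.
lctl get_param mdc.*.max_mod_rpcs_in_flight

# Data-side concurrency per OSC: bulk RPCs in flight and the dirty cache limit.
lctl get_param osc.*.max_rpcs_in_flight osc.*.max_dirty_mb

# Example only: raise bulk RPC concurrency (tune to your network and servers).
lctl set_param osc.*.max_rpcs_in_flight=16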

Comment by Alex [ 10/Feb/16 ]

So the problem can be solved by using more CPU cores and increasing network bandwidth?
Or do I need to add more MDT servers to solve the small-file performance problems? Are additional file system settings needed?

Comment by Oleg Drokin [ 10/Feb/16 ]

In the conditions you outline pure network bandwidth would not help, but lower network latency would.

If you are using TCP now and have access to e.g. InfiniBand hardware, you can try using InfiniBand RDMA to see how much it helps.
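
If InfiniBand is an option, the change is mostly on the LNet side. A minimal sketch, assuming the IB interface is named ib0 and the o2ib LND modules are installed (adjust to your setup):

# /etc/modprobe.d/lustre.conf on servers and clients: use the o2ib LND
# (InfiniBand RDMA) instead of the tcp network.
options lnet networks="o2ib0(ib0)"

# After reloading the Lustre/LNet modules, confirm the NID in use:
lctl list_nids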

Pure local caching is still going to be significantly faster, though, simply because there are far fewer layers involved.

You can add multiple MDT servers to increase possible metadata parallelism, or additional OST servers to increase data parallelism, but that is not going to make any individual operation much faster. So for example in your dd case - because there is no parallelism at all, adding OSTs would not help you much, but a lower-latency network would. Increasing parallelism (by increasing I/O size) would help too, just by getting more data on the wire at the same time, even if no other changes were made in the system otherwise.

Comment by Oleg Drokin [ 10/Feb/16 ]

Also, if you can, try to avoid direct I/O completely; it's really expensive for Lustre, and for other network filesystems too. Do an fsync at the end of your I/O if you must, and you are still going to be better off (assuming the I/O is actually bigger than 1M overall in between syncs, or even if smaller, but consisting of multiple chunks).
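
As a concrete illustration of that suggestion (same file and size as the earlier test, just buffered instead of direct):

# Buffered writes with one flush at the end; conv=fsync makes dd call fsync()
# once after all the data is written instead of syncing on every request.
dd if=/dev/zero of=/hard/1/dd1 bs=1M count=300 conv=fsync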

Comment by Alex [ 10/Feb/16 ]

Will decreasing network latency by using InfiniBand increase file system performance when dealing with small files?
By small I mean files of 1 KB or less.

Using MySQL on network file systems is not the best idea, but in the case of Lustre, is it not possible to use MySQL on this file system at all?

Comment by Oleg Drokin [ 10/Feb/16 ]

Yes, the lower your network latency, the faster small-file performance you get. Also stay clear of TCP if you can, as that in itself is pretty expensive.

I am not aware of anybody running MySQL on Lustre, but in general there's nothing that should prevent it from happening if you are ok with the performance penalty.

Also I wonder what's the reason behind such a setup? Do you plan to access the MySQL files from several nodes at the same time? If not, and you just want to use Lustre as some sort of distributed storage, you can just create a big file on Lustre, format it as e.g. ext4 and mount that using a loopback mount - this will make your small I/O and metadata performance a lot faster, but you won't be able to mount this file on several clients at the same time.
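
A minimal sketch of that loopback approach, assuming the Lustre client mount point is /mnt/lustre and using example paths and sizes:

# Create a large, fully written (non-sparse) file on Lustre, put a local
# ext4 filesystem inside it, and mount it on this one client only.
dd if=/dev/zero of=/mnt/lustre/mysql.img bs=1M count=65536
mkfs.ext4 -F /mnt/lustre/mysql.img
mkdir -p /var/lib/mysql-loop
mount -o loop /mnt/lustre/mysql.img /var/lib/mysql-loop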

Without better understanding your use case it's hard to offer any additional advice.

Comment by Alex [ 10/Feb/16 ]

Creating a .img file, formatting it as ext4 and mounting it via loop had no effect. As soon as directories or files are created inside this file, the number of transactions on the MDT is very high. Simply deleting directories inside this loop (ext4) device also depends directly on the MDT, and the removal rate is likewise very slow. In the near future I can give you the results of testing this solution with the loop device.

Until this problem is solved, MySQL runs over the NFS re-export (async).

With small files the gap compared to a local file system or the NFS export is similar, around 250 times: locally or over NFS about 10,000 file operations per second, while on Lustre only about 40 file operations per second.

Comment by Oleg Drokin [ 11/Feb/16 ]

Your observation is really strange. Once you create the loopback file and format it, any operations inside it should not really touch the MDT (make sure the device is not full of "holes"), as the MDT does not hold file data, only some of the metadata.
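
A quick way to check for that, assuming the image path from the sketch above:

# Compare the apparent size with the space actually allocated on Lustre;
# a large gap means the image is sparse, so writes inside the ext4 image
# still have to allocate new blocks on the OSTs.
du -h --apparent-size /mnt/lustre/mysql.img
du -h /mnt/lustre/mysql.img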

Comment by Alex [ 11/Feb/16 ]

I will check the loop device again and try to write a report on its performance.
Is it possible to use the Lustre file system for HTTP servers, which simply have a lot of small files (excluding video files and images), without using the loop device?

Comment by Oleg Drokin [ 11/Feb/16 ]

Yes, you can use Lustre for such a workload too, and loopback files are not required; they are just a crutch for certain non-parallel workloads where you need a lot of metadata performance from a single node only.

Overall you'll still get slower rates than a local fs if you plan to serve just a lot of small files from a single node. But as you ramp up the number of nodes that serve from the same filesystem (something you cannot do with a local FS), you'd gain some of that back. With large files added back into the mix you should also be able to significantly exceed a single node's bandwidth.
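
A hedged sketch of how that aggregate scaling is usually exercised, with each client working in its own directory so the parent-directory lock is not shared (client hostnames and the /mnt/lustre mount point are placeholders):

# Run on every client node; aggregate rates scale with the number of clients
# even though each single node stays slower than a local filesystem.
mkdir -p /mnt/lustre/$(hostname)
cd /mnt/lustre/$(hostname)
time mkdir {0..10000}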

Comment by Alex [ 11/Feb/16 ]

So Lustre can be used for HTTP servers, and adding more MDT and OST servers will increase Lustre file system performance? What additional settings and server configuration would this require?

Comment by Oleg Drokin [ 11/Feb/16 ]

No, you do not add more Lustre servers; you add more clients (and thus more HTTP servers) to increase parallelism. Once you have enough parallelism to fully load the existing Lustre servers, you might consider adding more.

Comment by Alex [ 11/Apr/16 ]

MDT disk created on an SSD: Samsung SSD 850 EVO 120GB SATA III (6Gb/s)
Manufacturer-rated performance (may vary based on system hardware and configuration):
Sequential Read: up to 540 MB/s
Sequential Write: up to 520 MB/s
Random Read (4 KB, QD32): up to 94,000 IOPS
Random Write (4 KB, QD32): up to 88,000 IOPS
Random Read (4 KB, QD1): up to 10,000 IOPS
Random Write (4 KB, QD1): up to 40,000 IOPS

Please tell us why the performance is so low. Other SSD manufacturers quote similar figures to those above.
Perhaps the zpool iostat statistics below are consistent with the rated performance; please tell me how to estimate the amount of reading and writing on the MDT and OSTs, and how to choose drives for them.

capacity operations bandwidth
pool alloc free read write read write
---------------------------- ----- ----- ----- ----- ----- -----
sdb 12.8M 111G 0 340 0 1.48M
sdb 19.8M 111G 0 358 0 2.13M
sdb 19.9M 111G 0 360 0 1.84M
sdb 19.9M 111G 0 364 0 1.34M
sdb 20.2M 111G 0 360 0 1.53M

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 0.00 0.00 382.00 0.00 2624.00 6.87 0.96 2.52 2.51 95.70

MDT disk created in DDR3 memory:
dd if=/dev/zero of=/dev/shm/mdt-disk0 bs=1M count=1 seek=65536

pool alloc free read write read write
----------------------- ----- ----- ----- ----- ----- -----
/dev/shm/mdt-disk0 15.0M 63.5G 0 4.93K 0 15.0M
/dev/shm/mdt-disk0 19.1M 63.5G 0 12.0K 0 37.4M
/dev/shm/mdt-disk0 19.4M 63.5G 0 13.0K 0 38.8M
/dev/shm/mdt-disk0 19.0M 63.5G 0 12.0K 0 38.2M
/dev/shm/mdt-disk0 19.3M 63.5G 0 13.0K 0 41.0M
/dev/shm/mdt-disk0 20.1M 63.5G 0 12.4K 0 42.4M
/dev/shm/mdt-disk0 20.5M 63.5G 0 12.5K 0 42.6M
/dev/shm/mdt-disk0 20.4M 63.5G 0 13.3K 0 41.5M
/dev/shm/mdt-disk0 24.0M 63.5G 0 13.7K 0 52.8M

MDT disk created on a SATA III HDD mirror

capacity operations bandwidth
pool alloc free read write read write
hard-mdt0 1.10G 927G 0 491 0 1.42M
mirror 1.10G 927G 0 491 0 1.42M
d2 - - 0 141 0 2.65M
d3 - - 0 141 0 2.65M

hard-mdt0 1.12G 927G 0 498 0 1.76M
mirror 1.12G 927G 0 498 0 1.76M
d2 - - 0 160 0 2.84M
d3 - - 0 160 0 2.84M

hard-mdt0 1.12G 927G 0 586 0 1.56M
mirror 1.12G 927G 0 586 0 1.56M
d2 - - 0 125 0 2.90M
d3 - - 0 123 0 2.90M

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sde 0.00 0.00 4.00 163.00 32.00 5240.00 31.57 0.97 6.03 5.40 90.10
sdf 0.00 0.00 10.00 162.00 112.00 5240.00 31.12 1.07 6.38 4.97 85.40

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sde 0.00 0.00 0.00 150.00 0.00 5960.00 39.73 0.94 6.04 6.20 93.00
sdf 0.00 0.00 0.00 147.00 0.00 5952.00 40.49 0.77 5.02 5.18 76.10

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sde 0.00 0.00 0.00 136.00 0.00 4424.00 32.53 0.86 6.33 6.28 85.40
sdf 0.00 0.00 0.00 135.00 0.00 4368.00 32.36 0.94 6.94 6.89 93.00
