[LU-7750] Slow operation of the file system Created: 05/Feb/16 Updated: 11/Apr/16 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | None |
| Type: | Question/Request | Priority: | Minor |
| Reporter: | Alex | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: | Version 2.7.65 |
| Attachments: | |
| Description |
|
Hello!

On the directly mounted Lustre client:
real 0m5.571s
user 0m0.019s
sys 0m1.400s

[root@client mount]# time rm -rf {0..10000}
real 0m10.332s

When re-exporting through NFS:
real 0m10.167s
user 0m0.024s
sys 0m0.364s

[root@client mount]# time rm -rf {0..10000}
real 3m19.770s |
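For context, a minimal sketch of the kind of comparison being described; the mount points and the file-creation step are assumptions, not the reporter's actual script:

# On the directly mounted Lustre client (path is a placeholder)
cd /mnt/lustre/testdir
time for i in $(seq 0 10000); do touch $i; done   # create the small files
time rm -rf {0..10000}                            # remove them again

# On the same directory re-exported through NFS (path is a placeholder)
cd /mnt/nfs-reexport/testdir
time for i in $(seq 0 10000); do touch $i; done
time rm -rf {0..10000}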
| Comments |
| Comment by Oleg Drokin [ 08/Feb/16 ] |
|
The problem at hand is that Lustre does synchronous metadata updates (what that really means is that for any metadata modifications we send an RPC right away to the server and wait for it to return before continuing), which makes this much slower than say local filesystems that can do purely local modifications in memory as they know nobody else could be changing anything from any other node. NFS is faster in some of those workloads because they do not have 100% cache consistency and can assume nobody touched any files from other nodes for some time which gives it a boost. Typically people don't run databases off Lustre or other distributed filesystems because databases have their own propagation mechanisms (replication and the like) to do it more efficiently among other things. Databases are rather sync-heavy which is expensive on network filesystems too. |
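As an illustration of where the time goes (a generic sketch, not something from this ticket; the path is a placeholder), one can summarize syscall time during a removal with strace and see that the unlink calls, each waiting for a server round trip, dominate the wall-clock time:

# Summarize time spent in unlink/unlinkat while removing a directory tree on Lustre.
strace -c -f -e trace=unlink,unlinkat rm -rf /mnt/lustre/testdir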
| Comment by Alex [ 10/Feb/16 ] |
|
Is it possible to reduce the wait time per request, or to increase the number of requests processed in parallel, so that performance becomes comparable to a local filesystem? |
| Comment by Alex [ 10/Feb/16 ] |
|
Attached are testhatd.txt and rpc_stats.txt. |
| Comment by Oleg Drokin [ 10/Feb/16 ] |
|
The "waiting time" for a metadata update request consists of the RPC round trip plus server processing time, really. You can make your network faster and use a faster server, but that would still be several orders of magnitude slower than the purely local updates you get with local filesystems. Even when you do not strictly require synchronous IO, note that in your dd example the oflag=direct means every RPC is synchronous too; the key to making that perform better is to increase your IO size. For example, "dd if=/dev/zero of=/hard/1/dd1 bs=300M count=1 oflag=direct conv=fdatasync" would be faster, as it would allow more than a single write RPC to be in flight. |
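To illustrate the point about IO size (a sketch with placeholder paths, not a benchmark from this ticket), compare many small direct writes against one large submission:

# 300 separate 1M direct writes: each one is its own synchronous RPC round trip.
dd if=/dev/zero of=/mnt/lustre/dd1 bs=1M count=300 oflag=direct conv=fdatasync

# One 300M direct write: the client can keep several write RPCs in flight at once.
dd if=/dev/zero of=/mnt/lustre/dd1 bs=300M count=1 oflag=direct conv=fdatasync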
| Comment by Alex [ 10/Feb/16 ] |
|
So the problem can be solved by using more CPU cores and increasing network bandwidth? |
| Comment by Oleg Drokin [ 10/Feb/16 ] |
|
In the conditions you outline, pure network bandwidth would not help, but lower network latency would. If you are using TCP now and have access to e.g. InfiniBand hardware, you can try using InfiniBand RDMA to see how much it helps. Purely local caching is still going to be significantly faster, though, simply because there are far fewer layers involved. You can add multiple MDT servers to increase possible metadata parallelism, or additional OST servers to increase data parallelism, but that is not going to make any individual operation much faster. So, for example, in your dd case, because there is no parallelism at all, adding OSTs would not help you much, but a lower-latency network would. Increasing parallelism (by increasing IO size) would help too, just by getting more data on the wire at the same time, even if nothing else in the system were changed. |
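If you want to experiment with how many RPCs the client keeps in flight, the relevant client-side tunables look roughly like this (a sketch; parameter availability and sensible values depend on your Lustre version, so treat the names and numbers as assumptions to verify):

# Run on the client: inspect, then raise, the per-target RPC concurrency.
lctl get_param osc.*.max_rpcs_in_flight mdc.*.max_rpcs_in_flight
lctl set_param osc.*.max_rpcs_in_flight=16
lctl set_param mdc.*.max_rpcs_in_flight=16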
| Comment by Oleg Drokin [ 10/Feb/16 ] |
|
Also, if you can, try to avoid direct IO completely; it is really expensive for Lustre, and for other network filesystems too. Do an fsync at the end of your IO if you must, and you will still be better off (assuming the IO between syncs is bigger than 1M overall, or even if it is smaller but consists of multiple chunks). |
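For completeness, a sketch of the buffered-write-plus-sync pattern being suggested (the path is a placeholder):

# Buffered writes let the client aggregate and pipeline RPCs; conv=fsync still
# forces the data to stable storage once, at the end, instead of per write.
dd if=/dev/zero of=/mnt/lustre/dd1 bs=1M count=300 conv=fsync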
| Comment by Alex [ 10/Feb/16 ] |
|
To increase file system performance when dealing with small files, would lowering network latency by using InfiniBand help? Using MySQL on network file systems is not the best idea, but in the case of Lustre, is it impossible to use MySQL on this file system at all? |
| Comment by Oleg Drokin [ 10/Feb/16 ] |
|
Yes, the lower your network latency, the better small-file performance you get. Also stay clear of TCP if you can, as that all in itself is pretty expensive. I am not aware of anybody running MySQL on Lustre, but in general there is nothing that should prevent it from working if you are ok with the performance penalty. Also, I wonder what the reason behind such a setup is. Do you plan to access the MySQL files from several nodes at the same time? If not, and you just want to use Lustre as some sort of distributed storage, you can create a big file on Lustre, format it as e.g. ext4 and mount it using a loopback mount. This will make your small IO and metadata performance a lot faster, but you won't be able to mount this file on several clients at the same time. Without better understanding of your use case it is hard to offer any additional advice. |
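A minimal sketch of the loopback-image approach described above (size, image name and mount point are assumptions):

# Create a 50 GiB backing file on Lustre, format it as ext4, and loop-mount it
# on the single node that will run MySQL.
dd if=/dev/zero of=/mnt/lustre/mysql.img bs=1M count=51200
mkfs.ext4 -F /mnt/lustre/mysql.img
mkdir -p /srv/mysql-local
mount -o loop /mnt/lustre/mysql.img /srv/mysql-local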
| Comment by Alex [ 10/Feb/16 ] |
|
Creating an .img file, formatting it as ext4, and mounting it via loop had no effect. Once I started creating directories or files inside this image, the number of transactions on the MDT was still very high. Simply deleting directories inside that loop (ext4) device also depends directly on the MDT, and the removal rate is likewise very slow. In the near future I can give you the results of testing this solution with the loop device. Until we solve this problem, MySQL is running over an NFS re-export (async). With small files the performance difference compared to a local filesystem or an NFS export is roughly 250 times: locally or over NFS we get 10000 file operations per second, while on Lustre only 40 file operations per second. |
| Comment by Oleg Drokin [ 11/Feb/16 ] |
|
Your observation is really strange. Once you create the loopback file and format it, any operations inside it should not really touch mdt (make sure the device is not full of "holes") as MDT does not really hold file data, only some of the metadata. |
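One way to make sure the backing file is not sparse ("full of holes") is to preallocate it up front, for example (path and size are assumptions, matching the sketch above):

# Preallocate all blocks so later writes inside the loop device do not trigger
# new block allocations on the underlying Lustre file.
fallocate -l 50G /mnt/lustre/mysql.img
# If fallocate is not available, writing the file out fully with dd (rather than
# seek/truncate) achieves the same thing:
dd if=/dev/zero of=/mnt/lustre/mysql.img bs=1M count=51200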
| Comment by Alex [ 11/Feb/16 ] |
|
I will check the loop device again and try to write up a report on its performance. |
| Comment by Oleg Drokin [ 11/Feb/16 ] |
|
Yes, you can use Lustre for such a workload too, and loopback files are not required; they are just a crutch for certain non-parallel workloads where you need a lot of metadata performance from a single node only. Overall you'll still get slower rates than a local FS if you only serve a lot of small files from a single node. But as you ramp up the number of nodes that serve from the same filesystem (something you cannot do with a local FS), you'd gain some of that back. With large files added back into the mix, you should also be able to significantly exceed a single node's bandwidth. |
| Comment by Alex [ 11/Feb/16 ] |
|
So, can it be used for HTTP servers, and if you add more MDT and OST servers, will that increase Lustre file system performance? What additional settings are needed, and what server configuration does it require? |
| Comment by Oleg Drokin [ 11/Feb/16 ] |
|
No, you do not add more Lustre servers, you add more clients (and thus more HTTP servers) to increase parallelism. Once you have enough parallelism to fully load the existing Lustre servers, you might consider adding more. |
| Comment by Alex [ 11/Apr/16 ] |
|
The MDT disk was created on an SSD: Samsung SSD 850 EVO 120GB, SATA III (6Gb/s). Please tell us why the performance is so low? Other SSD manufacturers give similar performance to what is shown below.

MDT disk created in DDR3 memory:
dd if=/dev/zero of=/dev/shm/mdt-disk0 bs=1M count=1 seek=65536

MDT disk created on a SATA III HDD mirror, zpool iostat (capacity alloc/free, operations read/write, bandwidth read/write):
hard-mdt0 1.12G 927G 0 498 0 1.76M
hard-mdt0 1.12G 927G 0 586 0 1.56M

(Per-device iostat statistics were also included here — rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util — but only the column headers are preserved.) |