Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Affects Version/s: Lustre 2.9.0
Description
Hi everyone, I have noticed a curiously bad small-file creation behaviour on Lustre 2.9.55.
I know that Lustre is inefficient when handling large numbers of small files and benefits from the metadata servers running on SSDs – but while exploring just how bad this is, I found something curious.
My use case is simple: create 50,000 40-byte files in a single directory. The "test.py" script below will do just that.
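(A minimal sketch of such a script, reconstructed from the log lines in the output further down – the original test.py attachment is not reproduced here, so the file naming scheme, the subdirectory, and the exact logging setup are assumptions.)

#!/usr/bin/env python
# Sketch of the small-file reproducer: create, read, then delete 50,000
# 40-byte files in one directory and log how long each phase takes.
import logging
import os
import sys
import time

logging.basicConfig(format="%(asctime)s [%(levelname)-5s] %(message)s",
                    level=logging.INFO)
log = logging.getLogger(__name__)

NUM_FILES = 50000
PAYLOAD = b"x" * 40  # 40-byte files

def main(base):
    # Assumption: work in a subdirectory of the mount point passed on argv.
    testdir = os.path.join(base, "smallfile-test")
    os.makedirs(testdir)

    log.info("Creating 50k files in one directory...")
    t0 = time.time()
    for i in range(NUM_FILES):
        with open(os.path.join(testdir, "file_%05d" % i), "wb") as f:
            f.write(PAYLOAD)
    t_create = time.time() - t0

    log.info("Reading 50k files...")
    t0 = time.time()
    for i in range(NUM_FILES):
        with open(os.path.join(testdir, "file_%05d" % i), "rb") as f:
            f.read()
    t_read = time.time() - t0

    log.info("Deleting 50k files...")
    t0 = time.time()
    for i in range(NUM_FILES):
        os.unlink(os.path.join(testdir, "file_%05d" % i))
    os.rmdir(testdir)
    t_delete = time.time() - t0

    log.info("Creation took: %.2f seconds", t_create)
    log.info("Reading took: %.2f seconds", t_read)
    log.info("Deleting took: %.2f seconds", t_delete)

if __name__ == "__main__":
    main(sys.argv[1])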
Since I wanted to find the theoretical speed of Lustre, I used the following setup:
- A single server played the role of MGS, MDT, OST and Client.
- All data storage happens via ldiskfs on a ramdisk
- 16 GB for metadata (MDT)
- 48 GB for object data (OST)
- All network access happens via TCP loopback
The final Lustre FS looks like this:
[-bash-4.3]$ lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
ram-MDT0000_UUID            8.9G       46.1M        8.0G   1% /mnt/ram/client[MDT:0]
ram-OST0000_UUID           46.9G       53.0M       44.4G   0% /mnt/ram/client[OST:0]
filesystem_summary:        46.9G       53.0M       44.4G   0% /mnt/ram/client
Unfortunately, when running the test script (which takes ~5 seconds on a local disk), I instead get these abysmal speeds:
[-bash-4.3]$ ./test.py /mnt/ram/client
2017-06-09 18:49:56,518 [INFO ] Creating 50k files in one directory...
2017-06-09 18:50:50,437 [INFO ] Reading 50k files...
2017-06-09 18:51:09,310 [INFO ] Deleting 50k files...
2017-06-09 18:51:20,604 [INFO ] Creation took: 53.92 seconds
2017-06-09 18:51:20,604 [INFO ] Reading took: 18.87 seconds
2017-06-09 18:51:20,604 [INFO ] Deleting took: 11.29 seconds
This tells me that there is a rather fundamental performance issue within Lustre – and that it has nothing to do with disk or network latency.
That, or my test script is broken – but I do not think it is.
If you're curious, here's how I set up the test scenario:
mkdir -p /mnt/ram/disk
mount -t tmpfs -o size=64G tmpfs /mnt/ram/disk
dd if=/dev/zero of=/mnt/ram/disk/mdt.img bs=1M count=16K
dd if=/dev/zero of=/mnt/ram/disk/odt.img bs=1M count=48K
losetup /dev/loop0 /mnt/ram/disk/mdt.img
losetup /dev/loop1 /mnt/ram/disk/odt.img
mkfs.lustre --mgs --mdt --fsname=ram --backfstype=ldiskfs --index=0 /dev/loop0
mkfs.lustre --ost --fsname=ram --backfstype=ldiskfs --index=0 --mgsnode=127.0.0.1@tcp0 /dev/loop1
mkdir -p /mnt/ram/mdt
mount -t lustre -o defaults,noatime /dev/loop0 /mnt/ram/mdt
mkdir -p /mnt/ram/ost
mount -t lustre -o defaults,noatime /dev/loop1 /mnt/ram/ost
mkdir -p /mnt/ram/client
mount -t lustre 127.0.0.1@tcp0:/ram /mnt/ram/client
chmod 1777 /mnt/ram/client
Thanks!
Hi Andreas.
Thanks for the reply.
Please note that I am indeed using ldiskfs already. Each I/O operation flows through the Lustre client, over LNET via TCP loopback, into ldiskfs on a loop device that is backed by tmpfs.
Since the overhead of the loop mounts and tmpfs is virtually negligible – and the machine has 196 GB of RAM, so it does no swapping – the only remaining bottleneck can be ldiskfs or Lustre itself.
Just for comparison's sake, I created the same loop device setup, but formatted it with an ext4 file system directly – with the same settings as used by Lustre.
Then I ran the performance test again. Compared to "raw" tmpfs, ext4 adds only about one second to the file creation time (2.11 seconds vs. 1.23 seconds).
Therefore, the write amplification from 40 bytes to 4096 bytes and the other ext4 overheads are present, but negligible.
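(A quick back-of-envelope calculation of the per-file create cost, using only the totals quoted above – 1.23 s for tmpfs, 2.11 s for ext4, 53.92 s for Lustre, over 50,000 files:)

# Convert the measured totals into per-file create latency and throughput.
NUM_FILES = 50000

for name, total_seconds in [("tmpfs", 1.23),
                            ("ext4 on loop/tmpfs", 2.11),
                            ("Lustre (ldiskfs/loop/tmpfs)", 53.92)]:
    per_file_us = total_seconds / NUM_FILES * 1e6
    print("%-28s %7.1f us per create (%8.0f creates/s)"
          % (name, per_file_us, NUM_FILES / total_seconds))

# Result: tmpfs ~25 us and ext4 ~42 us per create, while Lustre needs
# over 1 ms per create even with all storage in RAM.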
The drastic, massive slowdown therefore has to come from something inside Lustre: some kind of internal latency that gets added to every single read and write. It could be the LNET network layer, but since the packets never leave the machine, I cannot imagine that this alone leads to a 10-20x slowdown.