
Lock ahead - Request extent locks from userspace

Details

    • Type: New Feature
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.11.0
    • None
    • 17290

    Description

      At the recent developers conference, Jinshan proposed a different method of approaching the performance problems described in LU-6148.

      Instead of introducing a new type of LDLM lock matching, we'd like to make it possible for user space to explicitly request LDLM locks asynchronously from the IO.

      I've implemented a prototype version of the feature and will be uploading it for comments. I'll explain the state of the current version in a comment momentarily.
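
      For context on what "request LDLM locks from userspace" looks like in practice: judging from the attached lockahead_ladvise_mpich_patch (listed below), the interface that eventually landed is built on llapi_ladvise(). The fragment below is only a rough sketch of such a request, not the actual patch; the constant and field names (LU_LADVISE_LOCKAHEAD, MODE_WRITE_USER, LF_ASYNC, lla_lockahead_mode) follow my reading of the 2.11-era headers and should be double-checked against lustre_user.h and lustreapi.h on a lockahead-capable client.

      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      #include <lustre/lustreapi.h>

      /* Sketch: ask the OST for a write extent lock on part of a file before
       * doing any IO to it.  Constant and field names are assumptions based on
       * the 2.11-era lockahead/ladvise work; verify against installed headers. */
      static int request_write_lockahead(int fd, __u64 start, __u64 end)
      {
              struct llapi_lu_ladvise advice;

              memset(&advice, 0, sizeof(advice));
              advice.lla_advice = LU_LADVISE_LOCKAHEAD;    /* request a lock ahead of IO */
              advice.lla_lockahead_mode = MODE_WRITE_USER; /* write (PW) extent lock */
              advice.lla_peradvice_flags = LF_ASYNC;       /* don't block waiting for the grant */
              advice.lla_start = start;                    /* extent start, in bytes */
              advice.lla_end = end;                        /* extent end, in bytes (check header
                                                            * comments for inclusive vs. exclusive) */

              return llapi_ladvise(fd, 0, 1, &advice);     /* one advice record, no global flags */
      }

      int main(int argc, char **argv)
      {
              int fd, rc;

              if (argc != 2) {
                      fprintf(stderr, "usage: %s <file on lustre>\n", argv[0]);
                      return 1;
              }

              fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
              if (fd < 0) {
                      perror("open");
                      return 1;
              }

              rc = request_write_lockahead(fd, 0, 1024 * 1024);
              if (rc < 0)
                      perror("llapi_ladvise");

              close(fd);
              return rc < 0 ? 1 : 0;
      }

      Compiling needs -llustreapi. The attached MPICH patch applies the same kind of call from the MPI-IO collective buffering path, so aggregators can take their extent locks ahead of the writes they are about to issue.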

      Attachments

        1. anl_mpich_build_guide.txt
          18 kB
        2. cug paper.pdf
          714 kB
        3. lockahead_ladvise_mpich_patch
          30 kB
        4. LockAheadResults.docx
          516 kB
        5. LockAhead-TestReport.txt
          1.36 MB
        6. LUSTRE-LockAhead-140417-1056-170.pdf
          64 kB
        7. mmoore cug slides.pdf
          1.15 MB
        8. sle11_build_tools.tar.gz
          2.48 MB

          Activity

            czx0003 Cong Xu (Inactive) added a comment

            Hi Patrick,

            Thanks for the great suggestions! We conducted more tests recently and were able to demonstrate the power of Lock Ahead in test "2.3 Vary Lustre Stripe Size (Independent I/O)".

            In this test, the transfer size of each process is configured to be 1MB and the stripe size grows from 256KB to 16MB. When the stripe size is 16MB, 16 processes write a single stripe simultaneously, leading to lock contention. In this test, the Lock Ahead code performs 21.5% better than the original code.

            paf Patrick Farrell (Inactive) added a comment

            Slides and paper from Cray User Group 2017 attached. They contain real performance #s on real hardware, including from real applications. Just for reference in case anyone is curious.

            paf Patrick Farrell (Inactive) added a comment

            Cong,

            Sorry to take a bit to get back to you.

            Given the #s in section 2.4, you're barely seeing the problem, and lockahead does have some overhead. I wouldn't necessarily expect it to help in that case. It would be much easier to see with faster OSTs, so I'd like to request RAM backed OSTs.

            It's also possible something is wrong with the library. While I think we'll need RAM backed OSTs (or at least, much faster OSTs) to see benefit, we can explore this possibility as well.

            Let's take one of the very simple tests, like a 1 stripe file with 1 process per client on 2 clients. I assume you're creating the file fresh before the test, but if not, please remove it and re-create it right before the test. Then, let's look at lock count before and after running IOR (add the -k option so the file isn't deleted, otherwise the locks will be cleaned up).

            Specifically, on one of the clients, cat the lock count for the OST where the file is, before and after the test:

            cat /sys/fs/lustre/ldlm/namespaces/[OST]/lock_count

            If the file is not deleted and the lock count hasn't gone up, lock ahead didn't work for some reason.

            Again, I think we'll need RAM backed OSTs regardless... But this would be useful even without that.

            czx0003 Cong Xu (Inactive) added a comment

            Hi Jinshan,

            Thanks for the suggestions! Yes, in our second test (Section 2.2 Vary number of processes (Independent I/O)), we started from 1 process per client with 8 clients (total 8 processes) and went up to 64 processes per client with 8 clients (total 512 processes).

            Hi Patrick,

            Yes, we have tried the simple test as you suggested. Please have a look at the results in section 2.4: Simple Test (1 process and 2 processes accessing a single shared file on one OST).

            jay Jinshan Xiong (Inactive) added a comment

            Let's reduce the number of processes per client and see how it goes. For example, let's do 1 process per client with 8 clients, and then 2 processes per client, etc.

            paf Patrick Farrell (Inactive) added a comment

            Cong,

            Yes, that's true, but with that many processes and so few (and relatively slow) OSTs, you may not see any difference. For example, in this case, your OSTs are (best case) capable of 2700 MB/s total. That means each process only needs to provide 42 MB/s of that, and each node only ~340 MB/s. That means per OST, each node only needs to provide ~85 MB/s. That's not much, so I'm not surprised lockahead isn't giving any benefit.

            Lockahead is really for situations where a single OST is faster than one client can write to it. One process on one client can generally write at between 1-2 GB/s, depending on various network and CPU properties. So these OSTs are quite slow for this testing.

            So, this testing is sensitive to scale and latency issues. Are you able to do the small tests I requested? They should shed some light.

            czx0003 Cong Xu (Inactive) added a comment

            Hi Andreas,

            Yes. In my second test, I launch 512 processes on 8 Lustre Clients (64 Processes/Client) to write a single shared file, so there should be lock contention in Lustre.

            adilger Andreas Dilger added a comment

            Cong, the lockahead code will only show a benefit if there is a single shared file with all threads writing to that file. Otherwise, Lustre will grant a single whole-file lock to each client at first write, and there is no lock contention.

            paf Patrick Farrell (Inactive) added a comment

            If you're getting the maximum bandwidth already without lockahead, then it's definitely not going to help. There's no help for it to give.

            I don't completely follow your description of the patterns, but that's OK. Can we try simplifying?

            Let's try 1 stripe, 1 process from one node. What's the bandwidth #?
            Then try 2 processes, one per node (so, 2 nodes) to a single file (again on one OST). What does that show? (Without lockahead)
            (Also, please share your IOR command lines for these, like you did before.)
            Then, if there's a difference in those cases, try lockahead in the second case.

            If we've got everything set up right and the OSTs are fast enough for this to matter (I think they may not be), then the second case should be slower than the first (and lockahead should help). But it looks like each OST is capable of ~600-700 MB/s, which may not be enough to show this, depending on network latency, etc. I would expect to see the effect, but it might not show up. We make use of this primarily on much faster OSTs (3-6 GB/s, for example), so if it doesn't show up, maybe you could try RAM backed OSTs?

            Thanks!
            czx0003 Cong Xu (Inactive) added a comment (edited)

            Hi Patrick,

            Thanks for the comments! We have conducted 3 tests: perfect scenario, varying number of processes, and varying Lustre Stripe Size. “LockAheadResults.docx” documents the details.

            [Test 1] In the perfect scenario, we launch 4 Processes on 4 Lustre Clients (1 Process per Client), accessing 4 Lustre OSTs remotely. Both the Original and Lock Ahead cases deliver 2700MB/s bandwidth. This is the maximum bandwidth of our Lustre file system. (Section 2.1 Perfect Scenario (Independent I/O))

            [Test 2] To construct a test where the Lock Ahead code should deliver better performance than the original code, we launch up to 512 processes to perform independent I/O to our Lustre file system. The bandwidth of both the Original and Lock Ahead cases is 2000MB/s. (Section 2.2 Vary number of processes (Independent I/O))

            [Test 3] We have also investigated the effect of varying the Lustre Stripe Size on I/O performance. We keep the IOR Transfer Size constant (4MB) and increase the Lustre Stripe Size from 1MB to 64MB. Both the Original and Lock Ahead cases deliver 2000MB/s bandwidth. (Section 2.3 Vary Lustre Stripe Size (Independent I/O))

            paf Patrick Farrell (Inactive) added a comment

            Cong,

            I'm looking at your test results, and since the two ways of running gave almost identical results, I think we've got a problem, possibly a bottleneck somewhere else. (There could be a bug on the MPICH or Lustre side as well causing lockahead not to activate, but I did test both, so we'll assume not for the moment.)

            First: What happens if you try just 4 processes and 4 aggregators, no lockahead? What does the result look like? That should avoid lock contention entirely and give better results... But I bet we're still going to see that same 2.6 GB/s final number.

            What does 1 aggregator do with a 1 stripe file? What about 2 aggregators with a 1 stripe file, with and without lockahead?

            And what about what should probably be the maximum performance case, 8-process file-per-process (FPP) without collective I/O?

            People

              Assignee: paf Patrick Farrell (Inactive)
              Reporter: paf Patrick Farrell (Inactive)
              Votes: 0
              Watchers: 23
