
Lock ahead - Request extent locks from userspace

Details

    • Type: New Feature
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.11.0
    • None
    • 17290

    Description

      At the recent developers conference, Jinshan proposed a different method of approaching the performance problems described in LU-6148.

      Instead of introducing a new type of LDLM lock matching, we'd like to make it possible for user space to explicitly request LDLM locks asynchronously from the IO.

      I've implemented a prototype version of the feature and will be uploading it for comments. I'll explain the state of the current version in a comment momentarily.
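
      For context on what "request LDLM locks from userspace" looks like in practice: judging from the attached lockahead_ladvise_mpich_patch (listed below), the interface that eventually landed is built on llapi_ladvise(). The fragment below is only a rough sketch of such a request, not the actual patch; the constant and field names (LU_LADVISE_LOCKAHEAD, MODE_WRITE_USER, LF_ASYNC, lla_lockahead_mode) follow my reading of the 2.11-era headers and should be double-checked against lustre_user.h and lustreapi.h on a lockahead-capable client.

      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      #include <lustre/lustreapi.h>

      /* Sketch: ask the OST for a write extent lock on part of a file before
       * doing any IO to it.  Constant and field names are assumptions based on
       * the 2.11-era lockahead/ladvise work; verify against installed headers. */
      static int request_write_lockahead(int fd, __u64 start, __u64 end)
      {
              struct llapi_lu_ladvise advice;

              memset(&advice, 0, sizeof(advice));
              advice.lla_advice = LU_LADVISE_LOCKAHEAD;    /* request a lock ahead of IO */
              advice.lla_lockahead_mode = MODE_WRITE_USER; /* write (PW) extent lock */
              advice.lla_peradvice_flags = LF_ASYNC;       /* don't block waiting for the grant */
              advice.lla_start = start;                    /* extent start, in bytes */
              advice.lla_end = end;                        /* extent end, in bytes (check header
                                                            * comments for inclusive vs. exclusive) */

              return llapi_ladvise(fd, 0, 1, &advice);     /* one advice record, no global flags */
      }

      int main(int argc, char **argv)
      {
              int fd, rc;

              if (argc != 2) {
                      fprintf(stderr, "usage: %s <file on lustre>\n", argv[0]);
                      return 1;
              }

              fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
              if (fd < 0) {
                      perror("open");
                      return 1;
              }

              rc = request_write_lockahead(fd, 0, 1024 * 1024);
              if (rc < 0)
                      perror("llapi_ladvise");

              close(fd);
              return rc < 0 ? 1 : 0;
      }

      Compiling needs -llustreapi. The attached MPICH patch applies the same kind of call from the MPI-IO collective buffering path, so aggregators can take their extent locks ahead of the writes they are about to issue.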

      Attachments

        1. anl_mpich_build_guide.txt
          18 kB
        2. cug paper.pdf
          714 kB
        3. lockahead_ladvise_mpich_patch
          30 kB
        4. LockAheadResults.docx
          516 kB
        5. LockAhead-TestReport.txt
          1.36 MB
        6. LUSTRE-LockAhead-140417-1056-170.pdf
          64 kB
        7. mmoore cug slides.pdf
          1.15 MB
        8. sle11_build_tools.tar.gz
          2.48 MB

          Activity

            czx0003 Cong Xu (Inactive) added a comment

            Hi Patrick,

            Thanks for the great suggestions! We conducted more tests recently and were able to demonstrate the power of Lock Ahead in test "2.3 Vary Lustre Stripe Size (Independent I/O)".

            In this test, the transfer size of each process is configured to be 1MB and the stripe size grows from 256KB to 16MB. When the stripe size is 16MB, 16 processes write a single stripe simultaneously, leading to lock contention. In this test, the Lock Ahead code performs 21.5% better than the original code.

            paf Patrick Farrell (Inactive) added a comment

            Slides and paper from Cray User Group 2017 attached. They contain real performance #s on real hardware, including from real applications. Just for reference in case anyone is curious.

            paf Patrick Farrell (Inactive) added a comment

            Cong,

            Sorry to take a bit to get back to you.

            Given the #s in section 2.4, you're barely seeing the problem, and lockahead does have some overhead. I wouldn't necessarily expect it to help in that case. It would be much easier to see with faster OSTs, so I'd like to request RAM backed OSTs.

            It's also possible something is wrong with the library. While I think we'll need RAM backed OSTs (or at least, much faster OSTs) to see benefit, we can explore this possibility as well.

            Let's take one of the very simple tests, like a 1 stripe file with 1 process per client on 2 clients. I assume you're creating the file fresh before the test, but if not, please remove it and re-create it right before the test. Then, let's look at lock count before and after running IOR (add the -k option so the file isn't deleted, otherwise the locks will be cleaned up).

            Specifically, on one of the clients, cat the lock count for the OST where the file is, before and after the test:

            cat /sys/fs/lustre/ldlm/namespaces/[OST]/lock_count

            If the file is not deleted and the lock count hasn't gone up, lock ahead didn't work for some reason.

            Again, I think we'll need RAM backed OSTs regardless... But this would be useful even without that.

            czx0003 Cong Xu (Inactive) added a comment

            Hi Jinshan,

            Thanks for the suggestions! Yes, in our second test (Section 2.2 Vary number of processes (Independent I/O)), we started from 1 process per client with 8 clients (total 8 processes) and went up to 64 processes per client with 8 clients (total 512 processes).

            Hi Patrick,

            Yes, we have tried the simple test as you suggested. Please have a look at the results in section 2.4: Simple Test (1 process and 2 processes accessing a single shared file on one OST).

            jay Jinshan Xiong (Inactive) added a comment

            Let's reduce the number of processes per client and see how it goes. For example, let's do 1 process per client with 8 clients, and then 2 processes per client, etc.

            paf Patrick Farrell (Inactive) added a comment

            Cong,

            Yes, that's true, but with that many processes and so few (and relatively slow) OSTs, you may not see any difference. For example, in this case, your OSTs are (best case) capable of 2700 MB/s total. That means each process only needs to provide 42 MB/s of that, and each node only ~340 MB/s. That means per OST, each node only needs to provide ~85 MB/s. That's not much, so I'm not surprised lockahead isn't giving any benefit.

            Lockahead is really for situations where a single OST is faster than one client can write to it. One process on one client can generally write at between 1-2 GB/s, depending on various network and CPU properties. So these OSTs are quite slow for this testing.

            So, this testing is sensitive to scale and latency issues. Are you able to do the small tests I requested? They should shed some light.

            czx0003 Cong Xu (Inactive) added a comment

            Hi Andreas,

            Yes. In my second test, I launch 512 processes on 8 Lustre Clients (64 Processes/Client) to write a single shared file, so there should be lock contention in Lustre.

            adilger Andreas Dilger added a comment

            Cong, the lockahead code will only show a benefit if there is a single shared file with all threads writing to that file. Otherwise, Lustre will grant a single whole-file lock to each client at first write, and there is no lock contention.

            paf Patrick Farrell (Inactive) added a comment

            If you're getting the maximum bandwidth already without lockahead, then it's definitely not going to help. There's no help for it to give.

            I don't completely follow your description of the patterns, but that's OK. Can we try simplifying?

            Let's try 1 stripe, 1 process from one node. What's the bandwidth #?
            Then try 2 processes, one per node (so, 2 nodes) to a single file (again on one OST). What does that show? (Without lockahead)
            (Also, please share your IOR command lines for these, like you did before.)
            Then, if there's a difference in those cases, try lockahead in the second case.

            If we've got everything set up right and the OSTs are fast enough for this to matter (I think they may not be), then the second case should be slower than the first (and lockahead should help). But it looks like each OST is capable of ~600-700 MB/s, which may not be enough to show this, depending on network latency, etc. I would expect to see the effect, but it might not show up. We make use of this primarily on much faster OSTs (3-6 GB/s, for example), so if it doesn't show up, maybe you could try RAM backed OSTs?

            Thanks!
            czx0003 Cong Xu (Inactive) added a comment (edited)

            Hi Patrick,

            Thanks for the comments! We have conducted 3 tests: perfect scenario, varying number of processes, and varying Lustre Stripe Size. “LockAheadResults.docx” documents the details.

            [Test 1] In the perfect scenario, we launch 4 Processes on 4 Lustre Clients (1 Process per Client), accessing 4 Lustre OSTs remotely. Both the Original and Lock Ahead cases deliver 2700MB/s bandwidth. This is the maximum bandwidth of our Lustre file system. (Section 2.1 Perfect Scenario (Independent I/O))

            [Test 2] To construct a test where the Lock Ahead code should deliver better performance than the original code, we launch up to 512 processes to perform independent I/O to our Lustre file system. The bandwidth of both the Original and Lock Ahead cases is 2000MB/s. (Section 2.2 Vary number of processes (Independent I/O))

            [Test 3] We have also investigated the effect of varying the Lustre Stripe Size on I/O performance. We keep the IOR Transfer Size constant (4MB) and increase the Lustre Stripe Size from 1MB to 64MB. Both the Original and Lock Ahead cases deliver 2000MB/s bandwidth. (Section 2.3 Vary Lustre Stripe Size (Independent I/O))

            paf Patrick Farrell (Inactive) added a comment

            Cong,

            I'm looking at your test results, and since the two ways of running gave almost identical results, I think we've got a problem, possibly a bottleneck somewhere else. (There could be a bug on the MPICH or Lustre side as well causing lockahead not to activate, but I did test both, so we'll assume not for the moment.)

            First: What happens if you try just 4 processes and 4 aggregators, no lockahead? What does the result look like? That should avoid lock contention entirely and give better results... But I bet we're still going to see that same 2.6 GB/s final number.

            What does 1 aggregator do with a 1 stripe file? What about 2 aggregators with a 1 stripe file, with and without lockahead?

            And what about what should probably be the maximum performance case, 8-process file-per-process (FPP) without collective I/O?

            People

              Assignee: paf Patrick Farrell (Inactive)
              Reporter: paf Patrick Farrell (Inactive)
              Votes: 0
              Watchers: 23
