
LU-3579: Performance regression after applying fix for LU-1397

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.6
    • Labels: None
    • Severity: 2
    • Rank: 9057

Description

    After applying the patch for LU-1397, Fujitsu has seen performance on their benchmark suite decrease by about 7%. This pushes the system outside the acceptance range, so it is a high priority for us. I am trying to find out exactly what is running more slowly (which applications/file operations), but in the meantime, could someone review the patch to see whether there are any areas that could be improved?

    Thanks.

Attachments

Issue Links

Activity

            pjones Peter Jones added a comment -

            ok thanks Manish


            manish Manish Patel (Inactive) added a comment -

            Hi,

            This ticket can be closed, since we do not have any further information on this case.

            Thank You,
            Manish

            green Oleg Drokin added a comment -

            If the initial open-create succeeded, the file was definitely created and opened. If it is not there later, it may simply have been deleted. Can you monitor delete stats on the MDT to confirm this?

            You can also enable rpctrace, but that will only show you what sorts of RPCs were sent, not file names or the like. On a client you also need to enable vfstrace to see file names. On the MDT you need to enable the "inode" tracer to see those details, but I imagine that would slow things down quite a bit.
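            As a concrete illustration, here is a minimal sketch of the setup Oleg describes, using standard lctl commands. The exact MDT stats path varies by Lustre version, so the mdt.*.md_stats name below is an assumption:

                # On the MDS: watch unlink counts to confirm files are being deleted
                # (stats location varies by version; mdt.*.md_stats is an assumption)
                lctl get_param mdt.*.md_stats | grep unlink

                # On a client: add RPC and VFS tracing so file names appear in the log
                lctl set_param debug=+rpctrace
                lctl set_param debug=+vfstrace

                # On the MDT: add inode tracing (verbose; expect a noticeable slowdown)
                lctl set_param debug=+inode

                # After reproducing the failure, dump the kernel debug buffer
                lctl dk /tmp/lustre-debug.txt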


            kitwestneat Kit Westneat (Inactive) added a comment -

            Going back to the original ENOENT issue: it looks like the files are either not being created at all or, if they are created, are then disappearing completely. Here is the description from the customer:

            It appears data is going missing during a Quantum Espresso job run. An example of the output from a Quantum Espresso failure is attached.

            While the Quantum Espresso application runs, it opens a scratch file for each MPI process. Occasionally, when it tries to reopen one of these files, the file is not there and the application crashes with an error message like:

            Error in routine seqopn (16):
            error opening ./ausurf.igk2

            All input and output files for the run are in:
            /scratch/nick.wilson/parallel_benchmarks/AUSURF112.19274.1

            We believe this to be a Lustre issue: when the run uses local disk for scratch space there are no failures; it only happens when we use Lustre for scratch space.

            Do you have any opinions on which debug settings to run in order to get more information? I was thinking rpctrace might be the best place to start. Any advice on it?

            Thanks.
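            As a starting point for the rpctrace capture discussed above, a minimal sketch using standard lctl commands (the log path is arbitrary):

                # On each client: add rpctrace to the debug mask and clear the ring buffer
                lctl set_param debug=+rpctrace
                lctl clear

                # ... run the Quantum Espresso job until a failure occurs ...

                # Dump the kernel debug buffer for analysis
                lctl dk /tmp/rpctrace-client.log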


            kitwestneat Kit Westneat (Inactive) added a comment -

            Well, the patch we applied was actually part of LU-1234. I'm rereading the bugs, and perhaps I misread them originally. Basically, the issue is that the applications are getting ENOENT errors on 75% of the runs. I thought the patch in LU-1234 fixed the ENOENTs in LU-1397, but on rereading, it sounds like the problem was actually introduced by an earlier version of the LU-1234 patch. It is interesting to note, however, that we did not run into the ENOENT issue after applying it, though that could have been luck.

            I also realize I didn't give the version history in my initial post. They are running 2.1.3 servers and were running 2.1.3 clients. After running into the ENOENT issue, we upgraded them to 2.1.6 clients, and that is when we first saw the regression. We then built a version of 2.1.3 with only:
            http://review.whamcloud.com/#/c/2400/

            The regression was still present in that version as well. We are rerunning the test with 2.1.3 clients just to confirm that nothing else has changed and that the regression is definitely in the LU-1234 patch.

            pjones Peter Jones added a comment -

            Can we take a step back here? What motivated applying the patch for LU-1397 in the first place? Reading the notes on that ticket, the patch was deemed unnecessary and abandoned. Are any other patches applied, or is this otherwise a vanilla 2.1.6 deployment across the board?


People

    Assignee: green Oleg Drokin
    Reporter: manish Manish Patel (Inactive)
    Votes: 0
    Watchers: 8
