
LU-3579: Performance regression after applying fix for LU-1397

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.6
    • Labels: None
    • Severity: 2
    • Rank: 9057

Description

    After applying the patch for LU-1397, Fujitsu has seen performance on their benchmark suite decrease by about 7%. This pushes the system outside the acceptance range, so it is a high priority for us. I am trying to find out exactly what is running more slowly (which applications/file operations), but in the meantime, could someone review the patch to see whether there are any areas that could be improved?

    Thanks.

Attachments

Issue Links

Activity

            pjones Peter Jones added a comment -

            ok thanks Manish


            manish Manish Patel (Inactive) added a comment -

            Hi,

            This ticket can be closed, since we do not have any further information on this case.

            Thank You,
            Manish

            green Oleg Drokin added a comment -

            If the initial open-create succeeded, the file was definitely created and opened. If it is not there later, it may simply have been deleted. Can you monitor delete stats on the MDT to confirm this?

            You can also enable rpctrace, but that will only show you what sorts of RPCs were sent, not file names or the like. On a client you also need to enable vfstrace to see file names. On the MDT you need to enable the "inode" tracer to see those details, but I imagine that would slow things down quite a bit.
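            As a concrete illustration, here is a minimal sketch of the setup Oleg describes, using standard lctl commands. The exact MDT stats path varies by Lustre version, so the mdt.*.md_stats name below is an assumption:

                # On the MDS: watch unlink counts to confirm files are being deleted
                # (stats location varies by version; mdt.*.md_stats is an assumption)
                lctl get_param mdt.*.md_stats | grep unlink

                # On a client: add RPC and VFS tracing so file names appear in the log
                lctl set_param debug=+rpctrace
                lctl set_param debug=+vfstrace

                # On the MDT: add inode tracing (verbose; expect a noticeable slowdown)
                lctl set_param debug=+inode

                # After reproducing the failure, dump the kernel debug buffer
                lctl dk /tmp/lustre-debug.txt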


            kitwestneat Kit Westneat (Inactive) added a comment -

            Going back to the original ENOENT issue: it looks like the files are either not being created at all or, if they are created, are then disappearing completely. Here is the description from the customer:

            It appears data is going missing during a Quantum Espresso job run. An example of the output from a Quantum Espresso failure is attached.

            While the Quantum Espresso application runs, it opens a scratch file for each MPI process. Occasionally, when it tries to reopen one of these files, the file is not there and the application crashes with an error message like:

            Error in routine seqopn (16):
            error opening ./ausurf.igk2

            All input and output files for the run are in:
            /scratch/nick.wilson/parallel_benchmarks/AUSURF112.19274.1

            We believe this to be a Lustre issue: when the run uses local disk for scratch space there are no failures; it only happens when we use Lustre for scratch space.

            Do you have any opinions on which debug settings to run in order to get more information? I was thinking rpctrace might be the best place to start. Any advice on it?

            Thanks.
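            As a starting point for the rpctrace capture discussed above, a minimal sketch using standard lctl commands (the log path is arbitrary):

                # On each client: add rpctrace to the debug mask and clear the ring buffer
                lctl set_param debug=+rpctrace
                lctl clear

                # ... run the Quantum Espresso job until a failure occurs ...

                # Dump the kernel debug buffer for analysis
                lctl dk /tmp/rpctrace-client.log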


            kitwestneat Kit Westneat (Inactive) added a comment -

            Well, the patch we applied was actually part of LU-1234. I'm rereading the bugs, and perhaps I misread them originally. Basically, the issue is that the applications are getting ENOENT errors on 75% of the runs. I thought the patch in LU-1234 fixed the ENOENTs in LU-1397, but on rereading, it sounds like the problem was actually introduced by an earlier version of the LU-1234 patch. It is interesting to note, however, that we did not run into the ENOENT issue after applying it, though that could have been luck.

            I also realize I didn't give the version history in my initial post. They are running 2.1.3 servers and were running 2.1.3 clients. After running into the ENOENT issue, we upgraded them to 2.1.6 clients, and that is when we first saw the regression. We then built a version of 2.1.3 with only:
            http://review.whamcloud.com/#/c/2400/

            The regression was still present in that version as well. We are rerunning the test with 2.1.3 clients just to confirm that nothing else has changed and that the regression is definitely in the LU-1234 patch.

            pjones Peter Jones added a comment -

            Can we take a step back here? What motivated applying the patch for LU-1397 in the first place? Reading the notes on that ticket, the patch was deemed unnecessary and abandoned. Are any other patches applied, or is this otherwise a vanilla 2.1.6 deployment across the board?


People

    Assignee: green Oleg Drokin
    Reporter: manish Manish Patel (Inactive)
    Votes: 0
    Watchers: 8
