[LU-3579] Performance regression after applying fix for LU-1397 Created: 11/Jul/13  Updated: 14/Nov/14  Resolved: 14/Nov/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.6
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Manish Patel (Inactive) Assignee: Oleg Drokin
Resolution: Cannot Reproduce Votes: 0
Labels: None

Issue Links:
Related
is related to LU-1397 ENOENT on open() Resolved
Severity: 2
Rank (Obsolete): 9057

 Description   

After applying the patch for LU-1397, Fujitsu has seen performance for their benchmark suite decrease by about 7%. This pushes the system outside the acceptance range, and it is therefore a high priority for us. I am trying to get information about what exactly is going more slowly (which applications and file operations), but would it be possible for someone to review the patch to see if there are any areas that could be improved?

Thanks.



 Comments   
Comment by Peter Jones [ 11/Jul/13 ]

Can we take a step back here? What motivated applying the patch for LU-1397 in the first place? Reading the notes relating to that, it was deemed unnecessary and abandoned. Are any other patches applied or is this otherwise a vanilla 2.1.6 deployment across the board?

Comment by Kit Westneat (Inactive) [ 11/Jul/13 ]

Well, the patch we applied was actually part of LU-1234. I'm rereading the bugs, and perhaps I misread them originally. Basically, the issue is that the applications are getting ENOENT errors on 75% of the runs. I thought that the patch in LU-1234 fixed the ENOENTs in 1397, but rereading it, it sounds like the problem was actually introduced by an earlier version of the LU-1234 patch? It's interesting to note, however, that we didn't run into the ENOENT issue after applying it, though that could have been luck.

Oh, I just realized I didn't mention the version history in my initial post. They are running 2.1.3 servers and were running 2.1.3 clients. After running into the ENOENT issue, we upgraded them to 2.1.6 clients, and that's when we first saw the regression. We then built a version of 2.1.3 with only:
http://review.whamcloud.com/#/c/2400/

The regression was still present in that version as well. We are rerunning the test with 2.1.3 clients just to confirm that nothing else has changed and that the regression is indeed in the LU-1234 patch.

Comment by Kit Westneat (Inactive) [ 15/Jul/13 ]

Going back to the original ENOENT issue: it looks like files are not being created at all, or, if they are being created, they are then disappearing completely. Here is the description from the customer:

It appears data is going missing during a Quantum Espresso job run:

An example of the output from a Quantum Espresso failure is attached.

Whilst the Quantum Espresso application runs, it opens a scratch file for each MPI process. Occasionally, when it tries to reopen one of these files, it’s not there and it crashes with an error message like:

Error in routine seqopn (16):
error opening ./ausurf.igk2

All input, output files for the run are in:
/scratch/nick.wilson/parallel_benchmarks/AUSURF112.19274.1

We believe this to be a Lustre issue, as there are no failures when it’s run against local disk for scratch space; the failures only occur when we use Lustre for scratch space.

Do you have any opinions on what debug settings to run in order to get more information? I was thinking that rpctrace might be the best to start with. Any advice on it?

Thanks.

Comment by Oleg Drokin [ 15/Jul/13 ]

If the initial open-create succeeded, the file definitely was created and opened.
Now, if it's not there later, it might simply have been deleted. Can you monitor the delete stats on the MDT to confirm this?
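A minimal sketch of how the MDT delete stats could be checked, assuming a 2.1-era `lctl` stats interface (the exact parameter path varies between Lustre versions, so treat the glob below as an assumption to verify on the actual system):

```shell
# On the MDS: snapshot the unlink counter before a failing run.
# On 2.1-era servers the stats typically live under mds.*.stats;
# newer releases expose them as mdt.*.md_stats instead.
lctl get_param mds.*.stats | grep unlink

# Re-run the benchmark, then read the counter again. A growing unlink
# count during the job would confirm the scratch files are being
# deleted, rather than never having been created.
lctl get_param mds.*.stats | grep unlink
```

Comparing the two snapshots against the number of missing files would distinguish "deleted after creation" from "never created" without any debug tracing overhead.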

You can also enable rpctrace, but that will only show you what sort of RPCs were sent around, not the names of files or the like.
On a client you also need to enable vfstrace to see the file names. On the MDT you need to enable the "inode" tracer to see those details, but I imagine that would slow it down quite a bit.
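A sketch of the corresponding `lctl` invocations, assuming the standard debug-mask interface (the buffer size and dump path below are illustrative choices, not values from this ticket):

```shell
# On a client: add rpctrace and vfstrace to the kernel debug mask,
# so RPC traffic and VFS operations (with file names) are logged.
lctl set_param debug=+rpctrace
lctl set_param debug=+vfstrace

# On the MDS: add the inode tracer to see per-inode detail.
# Expect a noticeable slowdown while this is enabled.
lctl set_param debug=+inode

# Optionally enlarge the debug buffer so the trace around a failure
# isn't overwritten, then dump it to a file after a crash occurs.
lctl set_param debug_mb=512
lctl dk /tmp/lustre-debug.log
```

Remember to remove the extra flags afterwards (e.g. `lctl set_param debug=-inode`), since the tracers stay active until cleared.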

Comment by Manish Patel (Inactive) [ 14/Nov/14 ]

Hi,

This ticket can be closed, since we do not have any more information in this case.

Thank You,
Manish

Comment by Peter Jones [ 14/Nov/14 ]

ok thanks Manish

Generated at Sat Feb 10 01:35:07 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.