[LU-8441] Text file busy error after overwriting file - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Duplicate
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.5.5
Labels:
- llnl
- llnlfixready
Environment: lustre-2.5.5-6chaos_2.6.32_573.26.1.1chaos.ch5.4.x86_64.x86_64
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

Here's our reproducer:

sh -c 'cd /p/lscratchd/$USER && (f=toss-3321; rm -f $f; cp /bin/ls $f; od -N1 $f; ./$f; echo > $f; rm -f $f)'

This looks similar to ~~LU-6232~~. It affects emacs, which is impacting our users. The behaviour comes from a difference in how xemacs and vi deal with files that they already have open. vi always writes to a new temporary file, which it then moves over the top of the file being edited. xemacs moves the original file to <file>~ and writes a new file on the first write; after that it overwrites the new file in place. One can see this by running each editor, saving a file, checking the inode number with "ls -i <file>", and then repeating the save and check operations. With xemacs the inode number does not change between saves. With vi, it does.
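
The inode check described above can be sketched without either editor. This is a minimal illustration, not the editors' actual code: the function names and the scratch file are hypothetical, and the two functions only imitate the "write a temporary file and move it over the target" and "overwrite in place" strategies.

```shell
#!/bin/sh
# Sketch: compare the two save strategies on a scratch file.
# All names here are hypothetical; only mktemp, ls, awk, printf and mv are used.

inode_of() {
    # Print the inode number of "$1" (first field of `ls -i`).
    ls -i "$1" | awk '{print $1}'
}

save_by_rename() {
    # vi-style save: write a new temporary file, then move it over the target.
    tmp="$1.tmp.$$"
    printf '%s\n' "$2" > "$tmp" && mv -f "$tmp" "$1"
}

save_in_place() {
    # xemacs-style save after the first write: rewrite the same file.
    printf '%s\n' "$2" > "$1"
}

dir=$(mktemp -d)
f="$dir/demo.txt"
echo "v0" > "$f"

before=$(inode_of "$f")
save_in_place "$f" "v1"
after_in_place=$(inode_of "$f")
save_by_rename "$f" "v2"
after_rename=$(inode_of "$f")

echo "initial inode:       $before"
echo "after in-place save: $after_in_place"
echo "after rename save:   $after_rename"
rm -rf "$dir"
```

The in-place save keeps the inode, while the rename save replaces it, because the temporary file and the original exist at the same time and so cannot share an inode.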

Attachments

- debug.lu8441.tar (11.70 MB, 17/Aug/17 10:12 PM)
- lu8441.logs.tar (53.50 MB, 14/Nov/17 10:42 PM)

Issue Links

duplicates

LU-8019 Openlock breakage

Resolved

is duplicated by

LU-7727 open with FMODE_EXEC fails with ETXTBSY after a failed FMODE_WRITE open attempt on a read only client

Resolved

Activity

Jian Yu added a comment - 22/Nov/17 1:55 AM

Thank you, Olaf. I'm closing this ticket as a duplicate of ~~LU-8019~~.

Olaf Faaland added a comment - 22/Nov/17 12:38 AM - edited

Thanks, Jian and Oleg.

That's all LLNL needs for this. You can close notfix (or whatever your normal process is).

Oleg Drokin added a comment - 22/Nov/17 12:16 AM

This is mostly due to lingering file opens for write that got cached on the client. So when the exec comes it sees the file is opened for write and bails out (server side). We tried to just obtain a necessary ldlm lock before opening, but that proved to be very expensive.

The cached open, on the other hand, is the real problem here. It was originally aimed mostly at files opened over NFS, but at times it ended up enabled for other types of opens, leading to problems like this one.
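
For readers unfamiliar with the underlying rule, the "opened for write, so exec bails out" behaviour can be seen on a local Linux filesystem. This is only an analogy: it exercises the kernel's own ETXTBSY check, not the Lustre server-side check or the client open cache described above. It assumes a Linux system with /bin/true, a temporary directory that is not mounted noexec, and a kernel that enforces the check; the paths and variable names are hypothetical.

```shell
#!/bin/sh
# Local illustration only: hold a write descriptor open on a binary, then
# try to execute it. Paths and variable names are hypothetical.

dir=$(mktemp -d)          # assumes the temp directory is not mounted noexec
f="$dir/true"             # keep the name "true" so multi-call binaries still work
cp /bin/true "$f"

# Hold the file open for writing (append mode, so the binary is not truncated).
exec 3>>"$f"
"$f" 2>/dev/null
rc_while_open=$?

# Close the writer and execute again.
exec 3>&-
"$f"
rc_after_close=$?

echo "exit status with writer open:   $rc_while_open"
echo "exit status after writer close: $rc_after_close"
rm -rf "$dir"
```

On kernels that enforce the check, the first run is expected to fail with "Text file busy" and the second to succeed. In the ticket's case the writer is not a live descriptor but a write open that lingers in the client's cache, which is why the error shows up even after the writing process has finished.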

Jian Yu added a comment - 21/Nov/17 7:21 PM

Hi Olaf,
The earlier comment https://jira.hpdd.intel.com/browse/LU-8441?focusedCommentId=205806&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-205806 has some analysis, based on the debug logs, of where the ETXTBSY is coming from.

Hi Oleg,
About the factors that lead to the issue, could you please give some hints? Thank you.

Olaf Faaland added a comment - 21/Nov/17 6:10 PM

Hi Jian and Oleg,

Thank you for investigating. Given the complexity and risk we can close this notfix and we will do the same in our local ticket.

For my education, can you tell me where the ETXTBSY is coming from in our broken case, and describe some of the factors that lead to this? It need not be a complete and perfect description, just some hints that help me understand the relevant code paths.

Jian Yu added a comment - 21/Nov/17 5:54 PM

Hi Olaf,
The fix is ~~LU-8019~~ and the prior patches. While trying to back-port the patch, I found it had a long dependency chain, including those for ~~LU-3544~~, which contains more patches.

Oleg Drokin added a comment - 21/Nov/17 5:47 PM

It looks like ~~LU-8109~~ and all the preceding patches, including ~~LU-3544~~, are necessary to fix this, but the exact series is quite long and not exactly known at this time. What is known is that multiple problems were introduced and fixed along the way before all the wrinkles were finally ironed out.

Peter Jones added a comment - 20/Nov/17 6:49 PM

Ah yes. I did not read far enough back in the comments. So, yujian, I think the situation now is that this is believed to be a duplicate of ~~LU-4367~~, but that change is not a simple backport to a 2.5.x branch. So the question is: is this an impactful enough issue to warrant the risk of introducing a largish change that has not been proven in production environments elsewhere?

Olaf Faaland added a comment - 20/Nov/17 5:33 PM

"So am I correct in thinking that this means that the original theory that this is a duplicate of ~~LU-7727~~ holds true and that you would no longer expect to see this either when you apply the fix to your 2.5.x distribution"

No, the patch for ~~LU-7727~~ is very specific to a read-only mount. The patch says, in effect, "If the mount is read-only and the file is being opened for writing (FMODE_WRITE), then fail immediately instead of sending a request to the MDT".

Without that patch, a request is sent to the MDT, which then causes side effects on other clients.

The patch makes a similar change on the MDT, for the case where the MDT receives an open request from a client that has mounted read-only.

In our case, no clients are mounted read-only, so neither of those changes would have any effect.

"or else upgrade to your 2.8.x distribution?"

Correct, when all our systems are at 2.8.x we do not expect to see this.

Since the issue does not occur in 2.8.x, it may well be that there is an existing patch that could be backported, but ~~LU-7727~~ is not it.

Peter Jones added a comment - 20/Nov/17 5:08 PM

Olaf

So am I correct in thinking that this means that the original theory that this is a duplicate of ~~LU-7727~~ holds true and that you would no longer expect to see this either when you apply the fix to your 2.5.x distribution or else upgrade to your 2.8.x distribution?

Peter

People

Assignee: Jian Yu
Reporter: Teresa Kamakea (Inactive)
Votes: 0
Watchers: 7

Dates

Created: 26/Jul/16 11:13 PM
Updated: 22/Nov/17 1:55 AM
Resolved: 22/Nov/17 1:55 AM