Text file busy error after overwriting file

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • Lustre 2.8.0, Lustre 2.5.5
    • lustre-2.5.5-6chaos_2.6.32_573.26.1.1chaos.ch5.4.x86_64.x86_64
    • 3
    • 9223372036854775807

    Description

      Here's our reproducer:

      sh -c 'cd /p/lscratchd/$USER && (f=toss-3321; rm -f $f; cp /bin/ls $f; od -N1 $f; ./$f; echo > $f; rm -f $f)'
      

      This looks similar to LU-6232. It affects emacs, which is impacting our users. The behaviour comes from a difference in how xemacs and vi deal with files that they already have open. vi always writes to a new temporary file, which it then moves over the top of the file being edited. xemacs moves the original file to <file>~ and writes a new file on the first save; after that it overwrites the new file in place. One can see this by running each editor, saving a file, checking the inode number with "ls -i <file>", and then repeating the save and check operations. With xemacs the inode number won't change from one save to the next. With vi, it will.
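The two save strategies described above can be imitated without running either editor. The following is a minimal sketch, assuming a POSIX shell and any local filesystem; the file names and the `inode` helper are hypothetical, and the `echo`/`mv` commands only stand in for what the editors do.

```shell
#!/bin/sh
# Sketch only: imitates the two save strategies; no editor is actually run.
set -eu
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
cd "$dir"

# Hypothetical helper: print the inode number of a file.
inode() { ls -i "$1" | awk '{print $1}'; }

# vi-style save: write a new temporary file, then rename it over the original.
echo v1 > renamed.txt
before_rename=$(inode renamed.txt)
echo v2 > renamed.txt.tmp
mv renamed.txt.tmp renamed.txt
after_rename=$(inode renamed.txt)

# xemacs-style save (after the first one): overwrite the existing file in place.
echo v1 > inplace.txt
before_inplace=$(inode inplace.txt)
echo v2 > inplace.txt
after_inplace=$(inode inplace.txt)

echo "rename-over: $before_rename -> $after_rename"
echo "in-place:    $before_inplace -> $after_inplace"
```

The rename-over case ends up with a different inode because the temporary file existed alongside the original. The in-place case keeps its inode, so any state associated with that file on the client stays attached to it.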

      Attachments

        1. debug.lu8441.tar
          11.70 MB
        2. lu8441.logs.tar
          53.50 MB

        Issue Links

          Activity

            [LU-8441] Text file busy error after overwriting file
            yujian Jian Yu added a comment -

            Thank you, Olaf. I'm closing this ticket as a duplicate of LU-8019.

            ofaaland Olaf Faaland added a comment - - edited

            Thanks, Jian and Oleg.

            That's all LLNL needs for this. You can close notfix (or whatever your normal process is).

            green Oleg Drokin added a comment -

            This is mostly due to lingering file opens for write that got cached on the client. So when the exec comes, it sees the file is opened for write and bails out (server side). We tried to just obtain a necessary ldlm lock before opening, but that proved to be very expensive.

            The cached open, on the other hand, is the real problem here. Originally aimed mostly at NFS-opened files, it ended up being enabled for other types of opens at times, leading to such problems.
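For comparison, the general rule that an exec is refused while the file is open for write can be seen on an ordinary local filesystem. This is a minimal sketch, assuming a Linux kernel that enforces ETXTBSY and a temporary directory that allows execution. It does not involve Lustre or its open cache; descriptor 3 is only a stand-in for the lingering open, and the variable names are hypothetical.

```shell
#!/bin/sh
# Local illustration only: ordinary Linux filesystem, no Lustre involved.
dir=$(mktemp -d)
cp /bin/ls "$dir/prog"
chmod +x "$dir/prog"

# Hold a write-mode descriptor on the file, as a stand-in for the cached open.
exec 3>>"$dir/prog"
"$dir/prog" "$dir" >/dev/null 2>&1
busy_status=$?

# Close the writer and try the same exec again.
exec 3>&-
"$dir/prog" "$dir" >/dev/null 2>&1
ok_status=$?

echo "exec with writer open: exit $busy_status"
echo "exec after close:      exit $ok_status"
rm -rf "$dir"
```

In the local case the writer is visibly still open. In the case reported here, the process that wrote the file has already exited, and it is the open cached on the client that makes the file look busy.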

            yujian Jian Yu added a comment -

            Hi Olaf,
            Based on the debug logs, the previous comment https://jira.hpdd.intel.com/browse/LU-8441?focusedCommentId=205806&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-205806 has some analysis of where the ETXTBSY is coming from.

            Hi Oleg,
            Could you please give some hints about the factors that lead to the issue? Thank you.

            ofaaland Olaf Faaland added a comment -

            Hi Jian and Oleg,

            Thank you for investigating. Given the complexity and risk we can close this notfix and we will do the same in our local ticket.

            For my education, can you tell me where the ETXTBSY is coming from in our broken case, and describe some of the factors that lead to this? It need not be a complete and perfect description, just some hints that help understand the relevant code paths.

            yujian Jian Yu added a comment -

            Hi Olaf,
            The fix is LU-8019 and the prior patches. While trying to back-port the patch, I found it had a long dependency chain, including those for LU-3544, which contains more patches.

            green Oleg Drokin added a comment -

            It looks like LU-8109 and all the preceding patches, including LU-3544, are necessary to fix this, but the exact series is quite long and not exactly known at this time. What is known is that multiple problems were introduced and fixed along the way before all the wrinkles here were finally ironed out.

            pjones Peter Jones added a comment -

            Ah yes. I did not read far enough back in the comments. So, yujian, I think the situation now is that this is believed to be a duplicate of LU-4367, but that change is not a simple backport to a 2.5.x branch. So the question is: is this an impactful enough issue to warrant taking on the risk of introducing a largish change that has not been proven in production environments elsewhere?

            ofaaland Olaf Faaland added a comment -

            "So am I correct in thinking that this means that the original theory that this is a duplicate of LU-7727 holds true and that you would no longer expect to see this either when you apply the fix to your 2.5.x distribution"

            No, the patch for LU-7727 is very specific to a read-only mount.  The patch says, in effect, "If the mount is read-only, and the file is being opened with the O_WRITE flag, then fail immediately instead of sending a request to the MDT".

            Without that patch, a request is sent to the MDT, which then causes side effects on other clients.

            It has a similar change on the MDT, for the case where an MDT receives an open request from a client which has mounted read-only.

            In our case, no clients are mounted read-only, so neither of those changes would have any effect.

            "or else upgrade to your 2.8.x distribution?"

            Correct, when all our systems are at 2.8.x we do not expect to see this.

            Since the issue does not occur in 2.8.x, it may well be that there is an existing patch that could be backported, but LU-7727 is not it.

            pjones Peter Jones added a comment -

            Olaf

            So am I correct in thinking that this means that the original theory that this is a duplicate of LU-7727 holds true and that you would no longer expect to see this either when you apply the fix to your 2.5.x distribution or else upgrade to your 2.8.x distribution?

            Peter


            People

              yujian Jian Yu
              kamakea1 Teresa Kamakea (Inactive)