LU-8441 (Lustre): Text file busy error after overwriting file

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Affects Version/s: Lustre 2.8.0, Lustre 2.5.5
    • Environment: lustre-2.5.5-6chaos_2.6.32_573.26.1.1chaos.ch5.4.x86_64.x86_64
    • Severity: 3

    Description

      Here's our reproducer:

      sh -c 'cd /p/lscratchd/$USER && (f=toss-3321; rm -f $f; cp /bin/ls $f; od -N1 $f; ./$f; echo > $f; rm -f $f)'
      

      This looks similar to LU-6232. It affects emacs, which is impacting our users. The behaviour comes from a difference in how xemacs and vi handle files they already have open. vi always writes to a new temporary file, which it then moves over the file being edited. xemacs moves the original file to <file>~ and writes a new file on the first save; after that, it overwrites the new file in place. You can see this by running each editor, saving a file, checking the inode number with "ls -i <file>", and then repeating the save and check. With xemacs, the inode number does not change from one save to the next; with vi, it does.
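
The inode check described above can be tried without either editor. This is a minimal sketch that simulates the two save strategies with plain file operations in a temporary directory and reports whether "ls -i" shows a new inode after a save. The file names (vi.txt, em.txt) are hypothetical, and the script only approximates the editors' behaviour as described in this report, not their actual code.

```shell
#!/bin/sh
# Sketch: simulate the vi-style and xemacs-style save strategies and
# report whether each save leaves the file with a new inode number.
set -eu

dir=$(mktemp -d)
cd "$dir"

# Print the inode number of a file, as shown by "ls -i <file>".
inode() { ls -i "$1" | awk '{print $1}'; }

# vi-style save: write a new temporary file, then move it over the original.
printf 'v1\n' > vi.txt
vi_before=$(inode vi.txt)
printf 'v2\n' > vi.txt.tmp
mv vi.txt.tmp vi.txt
vi_after=$(inode vi.txt)

# xemacs-style save: on the first save, move the original to <file>~ and
# write a new file; on later saves, overwrite that new file in place.
printf 'v1\n' > em.txt
mv em.txt em.txt~
printf 'v2\n' > em.txt
em_first=$(inode em.txt)
printf 'v3\n' > em.txt
em_second=$(inode em.txt)

[ "$vi_before" != "$vi_after" ] && echo "vi: inode changed" || echo "vi: inode unchanged"
[ "$em_first" != "$em_second" ] && echo "xemacs: inode changed" || echo "xemacs: inode unchanged"

cd / && rm -rf "$dir"
```

The script prints one line per strategy: the vi-style save always produces a new inode, because the temporary file exists alongside the original before the rename, while the xemacs-style second save truncates and rewrites the same inode, the same in-place pattern as `echo > $f` in the reproducer.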

      Attachments

        1. debug.lu8441.tar (11.70 MB)
        2. lu8441.logs.tar (53.50 MB)

          Activity

            Jian Yu added a comment:

            Thank you, Olaf. I'm closing this ticket as a duplicate of LU-8019.

            Olaf Faaland added a comment (edited):

            Thanks, Jian and Oleg.

            That's all LLNL needs for this. You can close it as notfix (or whatever your normal process is).

            Oleg Drokin added a comment:

            This is mostly due to lingering opens for write that were cached on the client. When the exec comes in, the server sees that the file is still open for write and bails out. We tried simply obtaining the necessary LDLM lock before opening, but that proved to be very expensive.

            The cached open, on the other hand, is the real problem here. It was originally aimed mostly at files opened through NFS, but at times it ended up enabled for other types of opens as well, leading to problems like this one.

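
The rule the server is enforcing here can be observed locally on Linux: the kernel refuses to execute a file while any descriptor has it open for writing, and reports ETXTBSY ("Text file busy"). This minimal sketch holds a write descriptor open in the shell, standing in (by assumption) for the open cached on the Lustre client, tries to run the file, then closes the descriptor and tries again. The program path is a hypothetical temporary script; the sketch does not reproduce Lustre's client-side open caching itself.

```shell
#!/bin/sh
# Sketch (assumes Linux): exec is refused while the file is open for
# writing. Here the writer is fd 3 held by this shell; in the Lustre
# case it is an open cached on the client after the writer closed it.
set -u

dir=$(mktemp -d)
prog="$dir/prog"            # hypothetical test program
printf '#!/bin/sh\nexit 0\n' > "$prog"
chmod +x "$prog"

# Run the program; print "ok" on success, otherwise its exit status
# (shells typically report an exec failure such as ETXTBSY as 126).
try_exec() {
    if "$prog" 2>/dev/null; then
        echo ok
    else
        echo "failed (status $?)"
    fi
}

exec 3>>"$prog"             # open for write (append, no truncation) and keep it open
with_writer=$(try_exec)
exec 3>&-                   # close the write descriptor
after_close=$(try_exec)

echo "exec while open for write: $with_writer"
echo "exec after close:          $after_close"
rm -rf "$dir"
```

Closing the descriptor is what lets the second attempt through; in the Lustre case the equivalent step is the client releasing its cached open, which is why an open that lingers after cp has finished can make a later exec fail.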
            Jian Yu added a comment:

            Hi Olaf,
            Based on the debug logs, there was some analysis in the earlier comment https://jira.hpdd.intel.com/browse/LU-8441?focusedCommentId=205806&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-205806 of where the ETXTBSY is coming from.

            Hi Oleg,
            Could you please give some hints about the factors that lead to this issue? Thank you.

            Olaf Faaland added a comment:

            Hi Jian and Oleg,

            Thank you for investigating. Given the complexity and risk, we can close this as notfix, and we will do the same with our local ticket.

            For my education, can you tell me where the ETXTBSY is coming from in our broken case, and describe some of the factors that lead to it? It need not be a complete and perfect description, just some hints that help me understand the relevant code paths.

            Jian Yu added a comment:

            Hi Olaf,
            The fix is LU-8019 together with the patches it depends on. While trying to back-port it, I found a long dependency chain, including the patches for LU-3544, which has several more patches of its own.


            People

              Assignee: Jian Yu
              Reporter: Teresa Kamakea (Inactive)
              Votes: 0
              Watchers: 7
