  1. Lustre
  2. LU-2276

Is open() idempotent with regard to being restarted after a signal interrupts it?


Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.1.3
    • None
    • EL6
    • 3
    • 5437

    Description

      On our EL6 Jenkins builder, we do all of the build work on a Lustre 2.1.3 filesystem.

      Occasionally and sporadically we will see the following from a git checkout command:

      error: git checkout-index: unable to create file foo (File exists)

      From some basic grepping through the git source, the core of the error message appears to come from write_entry() in entry.c:

      fd = open_output_fd(path, ce, to_tempfile);
      if (fd < 0) {
              free(new);
              return error("unable to create file %s (%s)", path, strerror(errno));
      }

      open_output_fd() in turn calls create_file(), which does:

      return open(path, O_WRONLY | O_CREAT | O_EXCL, mode);

      I am able to prevent the problem from happening with 100% success by simply giving the git checkout a "-q" argument to prevent it from emitting progress reports. This would seem to indicate that the problem likely revolves around the fact that the progress reporting uses SIGALRM.

      Given that the open() call uses O_CREAT | O_EXCL and that the progress updates fire SIGALRM frequently (with SA_RESTART set), it seems reasonable to suspect the following sequence. First, open() is interrupted by the progress reporting mechanism's SIGALRM after it has created the file but before it has completed. Then, once the signal handler returns, open() is restarted automatically because of SA_RESTART. Finally, the restarted call fails with EEXIST ("File exists") because the file is now there and O_CREAT | O_EXCL refuses to open an existing file.

      Does this seem like a reasonable hypothesis?

      If it does, where does the problem lie? Is it that SA_RESTART should not be used because it is not safe with open() and O_CREAT | O_EXCL (so every system call caller should handle EINTR itself), or should open() be idempotent so that it can safely be restarted automatically under SA_RESTART? If open() is not required to be idempotent and this failure is legal and expected, a citation would be useful in getting the git folks to fix their code.

      If open() is not required to be idempotent, its use with O_CREAT | O_EXCL and SA_RESTART seems fatally flawed, and I'd like to be able to point that out to the git maintainers. As above, though, it would be useful to be able to present some proof that idempotency is not required and that they need to handle EINTR themselves rather than rely on SA_RESTART.

            People

              wc-triage WC Triage
              brian Brian Murrell (Inactive)